parrot-datasets/datasets/the-smack
Jeff Moe a9e5a38ef4 Read the docs theme 2023-11-24 21:37:40 -07:00
doc Read the docs theme 2023-11-24 21:37:40 -07:00
the_stack_licenses.egg-info mv, setup.py, eggs, etc 2023-11-24 20:07:13 -07:00
.gitignore scriptlet build cruft 2023-11-24 21:16:48 -07:00
BUILD.md scriptlet build cruft 2023-11-24 21:16:48 -07:00
Makefile scriptlet build cruft 2023-11-24 21:16:48 -07:00
README.md scriptlet build cruft 2023-11-24 21:16:48 -07:00
__init__.py mv, setup.py, eggs, etc 2023-11-24 20:07:13 -07:00
example-metadata.py example path 2023-11-24 15:34:57 -07:00
requirements.txt Read the docs theme 2023-11-24 21:37:40 -07:00
setup.py mv, setup.py, eggs, etc 2023-11-24 20:07:13 -07:00
test_the_stack_licenses Add test stub for license scriptlet 2023-11-24 18:37:27 -07:00
the_stack_headers.py scriptlet build cruft 2023-11-24 21:16:48 -07:00
the_stack_licenses.py mv, setup.py, eggs, etc 2023-11-24 20:07:13 -07:00

README.md

The Smack Dataset

The Smack Dataset doesn't exist.

Should it happen to exist someday, it will be a libre build of The Stack dataset, assembled without using the dataset directly, so as not to be encumbered by The Stack's non-libre (not "open source") license.

The Stack Metadata

The Stack has a metadata repo with details about The Stack dataset, without containing the dataset itself. One reason for this (as they discussed in an issue/post) is so researchers can learn about the dataset contents without being encumbered by the license. For example, how can you agree to a license without knowing the licenses of the contents? Using the metadata files can help with this issue.

Downloading Metadata

While the metadata is far smaller than the full dataset, it is still relatively large: the git repo is a bit over one terabyte.

To clone the git repository:

git clone https://huggingface.co/datasets/bigcode/the-stack-metadata

Reading Metadata

The Stack metadata is in parquet format, which is swell. The parquet files currently total 562 gigabytes across 2,832 files in 945 directories.
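
The file, directory, and size figures above could be re-checked against a local checkout with something like the sketch below. It uses only the standard library. The checkout path `the-stack-metadata` is an assumption, and counting only directories that directly contain a parquet file is one possible way to count, not necessarily how the figures above were produced.

```python
import os
from pathlib import Path


def summarize_parquet_tree(root):
    """Count parquet files, the directories holding them, and their total bytes."""
    file_count = 0
    total_bytes = 0
    directories = set()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.endswith(".parquet"):
                file_count += 1
                total_bytes += (Path(dirpath) / name).stat().st_size
                # Only directories that directly contain a parquet file are counted.
                directories.add(dirpath)
    return {
        "files": file_count,
        "directories": len(directories),
        "bytes": total_bytes,
    }


if __name__ == "__main__":
    # Assumed checkout location; adjust to wherever the metadata was cloned.
    print(summarize_parquet_tree("the-stack-metadata"))
```

If the large files were not fetched by git LFS, the sizes reported would be those of the small pointer files rather than the real parquet data.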

Selecting Repos

Write a script to select appropriate repos per libre criteria.
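
No such script exists yet. A minimal sketch of the selection step might look like the following. The record fields `name` and `licenses` are hypothetical stand-ins rather than the actual metadata column names, and the allowlist is purely illustrative, not the project's libre criteria.

```python
# Illustrative allowlist of SPDX identifiers; NOT the project's actual criteria.
LIBRE_LICENSES = frozenset({"MIT", "Apache-2.0", "BSD-3-Clause", "GPL-3.0-or-later"})


def is_libre(licenses, allowed=LIBRE_LICENSES):
    """True only if at least one license is listed and every one is allowed."""
    return bool(licenses) and all(lic in allowed for lic in licenses)


def select_repos(records, allowed=LIBRE_LICENSES):
    """Yield the names of repos whose every listed license is on the allowlist.

    Each record is assumed to be a dict with hypothetical keys
    "name" (str) and "licenses" (list of SPDX identifier strings).
    """
    for record in records:
        if is_libre(record.get("licenses", ()), allowed):
            yield record["name"]


if __name__ == "__main__":
    sample = [
        {"name": "alice/project", "licenses": ["MIT"]},
        {"name": "bob/mixed", "licenses": ["MIT", "Proprietary"]},
        {"name": "carol/unknown", "licenses": []},
    ]
    print(list(select_repos(sample)))
```

Repos with no detected license, or with any license outside the allowlist, are rejected. That is the conservative choice for a libre build.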

Cloning Repos

Write a script to go clone the repos.
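
This script is also unwritten. One possible shape for it is sketched below, assuming the selection step produces a list of repository URLs. The `owner/name` directory layout, the shallow `--depth 1` clone, and the dry-run default are all choices made for illustration.

```python
import subprocess
from pathlib import Path


def clone_target(url, dest_root):
    """Map a repo URL to a local directory of the form dest_root/owner/name."""
    trimmed = url.rstrip("/")
    if trimmed.endswith(".git"):
        trimmed = trimmed[: -len(".git")]
    owner, name = trimmed.split("/")[-2:]
    return Path(dest_root) / owner / name


def clone_repos(urls, dest_root, dry_run=True):
    """Build (and optionally run) one shallow `git clone` command per URL.

    Returns the list of commands. Repos whose target directory already
    exists are skipped, so the script can be re-run after an interruption.
    """
    commands = []
    for url in urls:
        target = clone_target(url, dest_root)
        if target.exists():
            continue
        command = ["git", "clone", "--depth", "1", url, str(target)]
        commands.append(command)
        if not dry_run:
            target.parent.mkdir(parents=True, exist_ok=True)
            subprocess.run(command, check=True)
    return commands


if __name__ == "__main__":
    # Placeholder URL for illustration only.
    for cmd in clone_repos(["https://example.org/alice/project.git"], "repos"):
        print(" ".join(cmd))
```

With `dry_run=True` nothing is cloned and the commands are only printed, which makes it easy to review the list before starting the downloads.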

Train

Train, using libre code from BigCode (makers of The Stack).

Scripts

The following scripts are available.

  • the_stack_headers.py --- Reads header names from The Stack parquet files.
  • the_stack_licenses.py --- Reads licenses and records from The Stack license file.
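
As a rough idea of what reading header names from parquet files involves, here is a minimal sketch. It is not the code of the_stack_headers.py. It assumes the third-party `pyarrow` package is installed, and the function names are hypothetical.

```python
def pyarrow_column_names(path):
    """Read only the schema of one parquet file and return its column names."""
    # Third-party dependency, imported lazily so the rest works without it.
    import pyarrow.parquet as pq

    return list(pq.read_schema(path).names)


def collect_headers(paths, read_names=pyarrow_column_names):
    """Map each distinct header name to the number of files it appears in."""
    counts = {}
    for path in paths:
        for name in read_names(path):
            counts[name] = counts.get(name, 0) + 1
    return counts


if __name__ == "__main__":
    import sys

    # Usage: python headers_sketch.py file1.parquet file2.parquet ...
    for header, count in sorted(collect_headers(sys.argv[1:]).items()):
        print(f"{header}\t{count}")
```

Reading the schema alone avoids loading any row data, which matters when the files add up to hundreds of gigabytes.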

Code Assist

The following scripts were written using Parrot code assist. The Phind-CodeLlama-34B-v2_q8.gguf model from TheBloke was used.

  • the_stack_headers.py
  • the_stack_licenses.py