The Smack Dataset
The Smack Dataset doesn't exist.
If it ever does exist, it will be a libre rebuild of The Stack dataset that does not use the dataset directly, so that it is not encumbered by The Stack's non-libre (not "open source") license.
The Stack Metadata
The Stack has a metadata repo with details about The Stack dataset, without containing the dataset itself. One reason for this (as they discussed in an issue/post) is so researchers can learn about the dataset contents without being encumbered by the license. For example, how can you agree to a license without knowing the licenses of the contents? Using the metadata files can help with this issue.
Downloading Metadata
While the metadata is far smaller than the full dataset, it is still large: the git repo is a bit over one terabyte.
Clone the git repository with:
git clone https://huggingface.co/datasets/bigcode/the-stack-metadata
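Since a full clone is over a terabyte, it may help to clone only the repository structure first and fetch large files selectively. The sketch below assumes the parquet files are stored through Git LFS (the usual arrangement for large files on Hugging Face, but not verified here) and that git-lfs is installed. The function names and the destination directory are hypothetical. It only builds the commands; the line that would run them is left commented out.

```python
import os
import subprocess

REPO_URL = "https://huggingface.co/datasets/bigcode/the-stack-metadata"


def build_pointer_clone(dest, url=REPO_URL):
    """Return (command, env) for a clone that skips Git LFS downloads."""
    env = dict(os.environ)
    # GIT_LFS_SKIP_SMUDGE=1 asks git-lfs to leave small pointer files
    # in the working tree instead of downloading the real contents.
    env["GIT_LFS_SKIP_SMUDGE"] = "1"
    return ["git", "clone", url, dest], env


def build_lfs_pull(include_pattern):
    """Return the command that fetches only LFS files matching a pattern."""
    # Meant to be run from inside the cloned repository.
    return ["git", "lfs", "pull", "--include", include_pattern]


if __name__ == "__main__":
    cmd, env = build_pointer_clone("the-stack-metadata")
    print(" ".join(cmd))
    # Uncomment to actually run the clone:
    # subprocess.run(cmd, env=env, check=True)
```

After a pointer-only clone, the command from build_lfs_pull, run inside the clone, would download just the files matching the given pattern.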
Reading Metadata
The Stack metadata is in parquet format, which is swell. The parquet files are currently 562 gigabytes, numbering 2,832 files, in 945 directories.
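Figures like these can be re-checked against a local clone. The following is a minimal sketch using only the standard library; the function name is hypothetical, and it counts only directories that directly contain parquet files, which may differ from how the numbers above were counted.

```python
from pathlib import Path


def summarize_parquet(root):
    """Count parquet files, the directories holding them, and total bytes."""
    files = [p for p in Path(root).rglob("*.parquet") if p.is_file()]
    # Only directories that directly contain at least one parquet file.
    directories = {p.parent for p in files}
    total_bytes = sum(p.stat().st_size for p in files)
    return {
        "files": len(files),
        "directories": len(directories),
        "bytes": total_bytes,
    }


if __name__ == "__main__":
    # "the-stack-metadata" is an assumed local clone directory.
    print(summarize_parquet("the-stack-metadata"))
```

Sizes are those on disk, so a clone that holds only Git LFS pointer files will report far smaller totals than a full checkout.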
Selecting Repos
Write a script to select appropriate repos per libre criteria.
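One possible shape for such a script is sketched below. It assumes each metadata record has already been loaded into a dict with a "repo_name" key and a "licenses" list of SPDX identifiers. Those key names are hypothetical and not taken from the actual parquet schema, and the allowlist is only illustrative, since the real libre criteria are a policy decision.

```python
# Illustrative allowlist of SPDX identifiers (an assumption, not a policy).
LIBRE_LICENSES = frozenset({
    "MIT",
    "Apache-2.0",
    "BSD-2-Clause",
    "BSD-3-Clause",
    "GPL-3.0-or-later",
})


def is_libre(record, allowed=LIBRE_LICENSES):
    """True if the record lists at least one license and all are allowed."""
    licenses = record.get("licenses") or []
    return bool(licenses) and all(lic in allowed for lic in licenses)


def select_repos(records, allowed=LIBRE_LICENSES):
    """Return the sorted, de-duplicated names of repos that pass is_libre."""
    return sorted({r["repo_name"] for r in records if is_libre(r, allowed)})


if __name__ == "__main__":
    sample = [
        {"repo_name": "alice/tool", "licenses": ["MIT"]},
        {"repo_name": "bob/app", "licenses": ["MIT", "Proprietary"]},
        {"repo_name": "carol/lib", "licenses": []},
    ]
    print(select_repos(sample))
```

Requiring every listed license to be on the allowlist, and rejecting repos with no license at all, is the conservative choice; a looser rule would accept a repo if any one of its licenses qualifies.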
Cloning Repos
Write a script to go clone the repos.
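A cloning script could look roughly like the sketch below. It assumes repo names take the form "owner/project" and are hosted on GitHub. The function names, the shallow-clone choice, and the destination layout are all hypothetical. By default it performs a dry run and only returns the commands it would execute.

```python
import subprocess
from pathlib import Path


def clone_command(repo_name, dest_root, base_url="https://github.com"):
    """Build a shallow git clone command for one 'owner/project' repo."""
    # Flatten "owner/project" into a single directory name.
    dest = Path(dest_root) / repo_name.replace("/", "__")
    url = f"{base_url}/{repo_name}.git"
    return ["git", "clone", "--depth", "1", url, str(dest)]


def clone_all(repo_names, dest_root, dry_run=True):
    """Return the clone commands; run them only when dry_run is False."""
    commands = [clone_command(name, dest_root) for name in repo_names]
    if not dry_run:
        for cmd in commands:
            # check=False so one missing or renamed repo does not stop the rest.
            subprocess.run(cmd, check=False)
    return commands


if __name__ == "__main__":
    for cmd in clone_all(["owner/project"], "repos"):
        print(" ".join(cmd))
```

A shallow clone (--depth 1) keeps only the latest snapshot, which saves disk space and bandwidth when history is not needed.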
Train
Train, using libre code from BigCode (makers of The Stack).
Scripts
The following scripts are available.
the_stack_headers.py
--- Reads header names from The Stack parquet files.
the_stack_licenses.py
--- Reads licenses and records from The Stack license file.
Code Assist
The following scripts were written using Parrot code assist, with the Phind-CodeLlama-34B-v2_q8.gguf model from TheBloke:
the_stack_headers.py
the_stack_licenses.py