deepcrayon

Jeff Moe a9e5a38ef4 Read the docs theme		2023-11-24 21:37:40 -07:00
doc	Read the docs theme	2023-11-24 21:37:40 -07:00
the_stack_licenses.egg-info	mv, setup.py, eggs, etc	2023-11-24 20:07:13 -07:00
.gitignore	scriptlet build cruft	2023-11-24 21:16:48 -07:00
BUILD.md	scriptlet build cruft	2023-11-24 21:16:48 -07:00
Makefile	scriptlet build cruft	2023-11-24 21:16:48 -07:00
README.md	scriptlet build cruft	2023-11-24 21:16:48 -07:00
__init__.py	mv, setup.py, eggs, etc	2023-11-24 20:07:13 -07:00
example-metadata.py	example path	2023-11-24 15:34:57 -07:00
requirements.txt	Read the docs theme	2023-11-24 21:37:40 -07:00
setup.py	mv, setup.py, eggs, etc	2023-11-24 20:07:13 -07:00
test_the_stack_licenses	Add test stub for license scriptlet	2023-11-24 18:37:27 -07:00
the_stack_headers.py	scriptlet build cruft	2023-11-24 21:16:48 -07:00
the_stack_licenses.py	mv, setup.py, eggs, etc	2023-11-24 20:07:13 -07:00

README.md

The Smack Dataset

The Smack Dataset doesn't exist.

Should it happen to exist someday, it will be a libre build of The Stack dataset that does not use the dataset directly, so as not to be encumbered by The Stack's non-libre (not "open source") license.

The Stack Metadata

The Stack has a metadata repo with details about The Stack dataset, without containing the dataset itself. One reason for this (as they discussed in an issue/post) is so researchers can learn about the dataset contents without being encumbered by the license. For example, how can you agree to a license without knowing the licenses of the contents? Using the metadata files can help with this issue.

Downloading Metadata

While the metadata is far smaller than the full dataset, it is still relatively large: the git repo is a bit over one terabyte.

Clone the git repository with:

git clone https://huggingface.co/datasets/bigcode/the-stack-metadata
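Since the clone is a bit over one terabyte, it may be worth checking free disk space first. The following is a minimal sketch, not part of this repo: the `has_room` helper and the `REQUIRED_BYTES` figure (roughly 1.1 TB) are assumptions for illustration.

```python
import shutil

# Assumption: "a bit over one terabyte" is taken as roughly 1.1 TB.
REQUIRED_BYTES = int(1.1 * 10**12)


def has_room(path, required_bytes=REQUIRED_BYTES):
    """Return True if the filesystem holding `path` has enough free space."""
    free = shutil.disk_usage(path).free
    return free >= required_bytes
```

For example, `has_room(".")` reports whether the current directory's filesystem could hold the clone.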

Reading Metadata

The Stack metadata is in parquet format, which is swell. The parquet files are currently 562 gigabytes, numbering 2,832 files, in 945 directories.
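To get a feel for the download, a short sketch like the one below can count the parquet files and peek at their column names. It is an illustration, not the code in `the_stack_headers.py`: the function names are hypothetical, `read_headers` assumes the third-party `pyarrow` package is installed, and the directory count covers only directories that contain parquet files, so it may differ from the figure above.

```python
from pathlib import Path


def summarize_parquet_tree(root):
    """Count .parquet files, the directories holding them, and their total size."""
    files = sorted(Path(root).rglob("*.parquet"))
    dirs = {f.parent for f in files}
    total_bytes = sum(f.stat().st_size for f in files)
    return {"files": len(files), "dirs": len(dirs), "bytes": total_bytes}


def read_headers(parquet_path):
    """Return column names from one parquet file without loading its rows."""
    # Assumption: pyarrow is installed; imported lazily so the summary
    # function above works with the standard library alone.
    import pyarrow.parquet as pq

    return pq.read_schema(parquet_path).names
```

Calling `summarize_parquet_tree("the-stack-metadata")` on a finished clone returns the counts as a dict, and `read_headers` can then be pointed at any single file found there.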

Selecting Repos

Write a script to select appropriate repos per libre criteria.
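Such a script does not exist yet. As a rough sketch of the idea, the filter below keeps a repo only when every license listed for it is on an allow-list. The record keys (`repo_name`, `licenses`) and the allow-list itself are assumptions for illustration, not the metadata's actual schema or this project's libre criteria.

```python
# Illustrative allow-list of SPDX identifiers; NOT the project's actual criteria.
LIBRE_LICENSES = {"MIT", "Apache-2.0", "BSD-3-Clause", "GPL-3.0-or-later"}


def select_libre_repos(records, allowed=LIBRE_LICENSES):
    """Return repo names whose every listed license is in the allow-list.

    Each record is assumed to be a dict with hypothetical keys
    "repo_name" (str) and "licenses" (list of str).
    """
    selected = []
    for record in records:
        licenses = record.get("licenses") or []
        # Repos with no license information are skipped, not assumed libre.
        if licenses and all(lic in allowed for lic in licenses):
            selected.append(record["repo_name"])
    return selected
```

Skipping repos with no recorded license is a conservative choice: missing license data is treated as unknown, not as permission.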

Cloning Repos

Write a script to go clone the repos.
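This script is not written yet either. One possible shape is sketched below, assuming repo names look like `owner/name` and are hosted on GitHub. The helper names, the shallow `--depth 1` clone, and the destination layout are illustrative choices, not decisions this project has made.

```python
import subprocess
from pathlib import Path


def clone_command(repo_name, dest_root):
    """Build a shallow `git clone` command for a repo named "owner/name"."""
    # Assumption: repos are fetched from GitHub over HTTPS.
    url = f"https://github.com/{repo_name}.git"
    # Flatten "owner/name" into one directory name under dest_root.
    dest = Path(dest_root) / repo_name.replace("/", "__")
    return ["git", "clone", "--depth", "1", url, str(dest)]


def clone_repos(repo_names, dest_root, dry_run=True):
    """Return the clone commands; run them only when dry_run is False."""
    commands = [clone_command(name, dest_root) for name in repo_names]
    if not dry_run:
        for cmd in commands:
            if Path(cmd[-1]).exists():
                continue  # already cloned; skip so reruns can resume
            # check=False: one failed clone should not stop the batch.
            subprocess.run(cmd, check=False)
    return commands
```

The default `dry_run=True` only builds the commands, which makes it easy to inspect what would be cloned before touching the network.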

Train

Train, using libre code from BigCode (the makers of The Stack).

Scripts

The following scripts are available.

the_stack_headers.py --- Reads header names from The Stack parquet files.
the_stack_licenses.py --- Reads licenses and records from The Stack license file.

Code Assist

The following scripts were written with Parrot code assist, using the Phind-CodeLlama-34B-v2_q8.gguf model from TheBloke.

the_stack_headers.py
the_stack_licenses.py