readme markdown is now in sphinx

main
Jeff Moe 2023-11-25 11:38:40 -07:00
parent 612a323adf
commit 9972f98afa
1 changed files with 0 additions and 59 deletions

@@ -1,59 +0,0 @@
# The Smack Dataset
The Smack Dataset doesn't exist.
Should it happen to exist someday, it will be a libre build of The Stack dataset,
but not using the dataset directly, so as not to be encumbered by The Stack's
non-libre (not "open source") license.
# The Stack Metadata
The Stack has a metadata repo with details about The Stack dataset, without
containing the dataset itself. One reason for this (as they discussed in an
issue/post) is so researchers can learn about the dataset contents without
being encumbered by the license. For example, how can you agree to a license
without knowing the licenses of the contents? Using the metadata files can
help with this issue.
# Downloading Metadata
While the metadata is far smaller than the full dataset, it is still relatively large.
The git repo is a bit over one terabyte.
Clone the git repository with:
```
git clone https://huggingface.co/datasets/bigcode/the-stack-metadata
```
# Reading Metadata
The Stack metadata is in parquet format, which is swell.
The parquet files are currently 562 gigabytes, numbering 2,832 files,
in 945 directories.
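
The file, directory, and size figures above can be re-checked against a local clone. This is a minimal sketch using only the Python standard library; the function name `summarize_parquet` is hypothetical and is not one of the scripts in this repo.

```python
from pathlib import Path


def summarize_parquet(root):
    """Return (file_count, dir_count, total_bytes) for *.parquet files under root."""
    # Walk the tree once and keep only regular files ending in .parquet.
    files = [p for p in Path(root).rglob("*.parquet") if p.is_file()]
    # Count the distinct directories that directly contain a parquet file.
    dirs = {p.parent for p in files}
    total_bytes = sum(p.stat().st_size for p in files)
    return len(files), len(dirs), total_bytes
```

For example, `summarize_parquet("the-stack-metadata")` would report on a clone in the current directory, assuming that is where the repo was cloned.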
# Selecting Repos
Write a script to select appropriate repos per libre criteria.
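
One possible shape for such a selection script is sketched below. The libre criteria are not defined here, so the allowlist is a placeholder, and the record layout (a `name` string plus a `licenses` list) is an assumption, not the actual schema of The Stack metadata.

```python
# Hypothetical allowlist; the real libre criteria are not defined in this README.
LIBRE_LICENSES = {"GPL-3.0-or-later", "AGPL-3.0-or-later", "MIT", "Apache-2.0"}


def select_libre(records, allowed=LIBRE_LICENSES):
    """Return the names of records whose licenses are all in the allowlist.

    Each record is assumed to be a dict with a "name" string and a
    "licenses" list of license identifiers; both field names are assumptions.
    """
    selected = []
    for record in records:
        licenses = record.get("licenses") or []
        # Reject repos with no detected license or with any non-allowed license.
        if licenses and all(lic in allowed for lic in licenses):
            selected.append(record["name"])
    return selected
```

Requiring every detected license to be on the allowlist is the conservative choice: a repo that mixes libre and non-libre licenses is dropped instead of being partially included.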
# Cloning Repos
Write a script to go clone the repos.
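
A cloning script could look roughly like the following sketch. It assumes repos are identified as `owner/repo` names hosted at `https://github.com`, and that shallow clones are enough; the function names are hypothetical.

```python
import subprocess
from pathlib import Path


def clone_command(repo_name, dest_root, host="https://github.com"):
    """Build a shallow `git clone` command for an "owner/repo" name."""
    # The host default is an assumption; pass another base URL if needed.
    url = f"{host}/{repo_name}.git"
    dest = Path(dest_root) / repo_name
    return ["git", "clone", "--depth", "1", url, str(dest)]


def clone_repos(repo_names, dest_root):
    """Clone each repo, skipping ones already present; return names that failed."""
    failed = []
    for name in repo_names:
        if (Path(dest_root) / name).exists():
            continue  # Already cloned on an earlier run.
        result = subprocess.run(clone_command(name, dest_root))
        if result.returncode != 0:
            failed.append(name)
    return failed
```

Returning the failed names instead of raising lets a long run continue past deleted or renamed repos and retry them later.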
# Train
Train, using libre code from BigCode (makers of The Stack).
# Scripts
The following scripts are available.
* `the_stack_headers.py` --- Reads header names from The Stack parquet files.
* `the_stack_licenses.py` --- Reads licenses and records from The Stack license file.
# Code Assist
The following scripts were written using Parrot code assist.
The `Phind-CodeLlama-34B-v2_q8.gguf` model from TheBloke was used.
* `the_stack_headers.py`
* `the_stack_licenses.py`