readme markdown is now in sphinx
parent
612a323adf
commit
9972f98afa
@ -1,59 +0,0 @@
# The Smack Dataset
The Smack Dataset doesn't exist.

Should it happen to exist someday, it will be a libre build of The Stack dataset,
but built without using that dataset directly, so as not to be encumbered by The Stack's
non-libre (not "open source") license.


# The Stack Metadata
The Stack has a metadata repo with details about The Stack dataset that does not
contain the dataset itself. One reason for this (as discussed in an
issue/post) is so that researchers can learn about the dataset's contents without
being encumbered by the license. For example, how can you agree to a license
without knowing the licenses of the contents? The metadata files
help with this problem.


# Downloading Metadata
Although the metadata is far smaller than the full dataset, it is still relatively large.
The git repo is a bit over one terabyte.

The git repository can be cloned with:

```
git clone https://huggingface.co/datasets/bigcode/the-stack-metadata
```
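
Since the clone is a bit over one terabyte, it may be worth checking free disk space before starting. The following is a minimal sketch using only the Python standard library. The helper name `has_room_for_clone` and the 1.2 TB threshold are illustrative assumptions, not part of this project.

```python
import shutil

# Assumed threshold: the README says "a bit over one terabyte",
# so 1.2 TB is an illustrative safety margin, not a measured size.
REQUIRED_BYTES = 1_200_000_000_000


def has_room_for_clone(path=".", required_bytes=REQUIRED_BYTES):
    """Return True if the filesystem holding `path` has enough free space."""
    free_bytes = shutil.disk_usage(path).free
    return free_bytes >= required_bytes


if __name__ == "__main__":
    if has_room_for_clone("."):
        print("Enough free space to clone the metadata repo here.")
    else:
        print("Not enough free space; pick another disk before cloning.")
```

Run it from the directory where the clone would go, or pass another path to check a different disk.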


# Reading Metadata
The Stack metadata is in Parquet format, which is swell.
The Parquet files currently total 562 gigabytes, spread across 2,832 files
in 945 directories.
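
Figures like these can be rechecked against a local clone, and the column (header) names of a file can be read without loading its rows. The sketch below is not the code of `the_stack_headers.py`. It assumes the third-party `pyarrow` package for the schema read, and it counts a directory only if it directly contains at least one `.parquet` file, which may differ from how the numbers above were obtained.

```python
from pathlib import Path


def parquet_inventory(root):
    """Return (file count, directory count, total bytes) for *.parquet files under `root`."""
    files = [p for p in Path(root).rglob("*.parquet") if p.is_file()]
    directories = {p.parent for p in files}  # directories that hold parquet files
    total_bytes = sum(p.stat().st_size for p in files)
    return len(files), len(directories), total_bytes


def parquet_column_names(path):
    """Return the column (header) names of one parquet file.

    Assumes `pyarrow` is installed. Only the schema is read, not the rows.
    """
    import pyarrow.parquet as pq  # imported lazily so the inventory works without it

    return pq.read_schema(path).names
```

For example, `parquet_inventory("the-stack-metadata")` would summarize a clone in a directory of that name (a hypothetical location), and `parquet_column_names` can then be called on any one of the files found.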


# Selecting Repos
Write a script to select appropriate repos per libre criteria.
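
One possible shape for such a script is sketched below. It is only an illustration: the record keys `repo` and `licenses`, the `LIBRE_LICENSES` allowlist, and the rule that every detected license must be on the allowlist are all assumptions, not the project's actual criteria or the real metadata column names.

```python
# Illustrative allowlist of SPDX identifiers. The real libre criteria are
# not specified in this README, so treat this set as a placeholder.
LIBRE_LICENSES = {"MIT", "Apache-2.0", "BSD-3-Clause", "GPL-3.0-or-later"}


def is_libre(record, allowed=LIBRE_LICENSES):
    """Return True if the record lists at least one license and all are allowed."""
    licenses = record.get("licenses") or []
    return bool(licenses) and all(lic in allowed for lic in licenses)


def select_repos(records, allowed=LIBRE_LICENSES):
    """Return the sorted, de-duplicated repo names whose licenses all pass."""
    return sorted({r["repo"] for r in records if is_libre(r, allowed)})


if __name__ == "__main__":
    # Hypothetical sample records, not taken from The Stack metadata.
    sample = [
        {"repo": "alice/tool", "licenses": ["MIT"]},
        {"repo": "bob/app", "licenses": ["MIT", "Proprietary"]},
        {"repo": "carol/lib", "licenses": []},
    ]
    print(select_repos(sample))
```

Rejecting repos with no detected license is a deliberately conservative choice in this sketch: an unlicensed repo is not libre by default.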


# Cloning Repos
Write a script to clone the selected repos.
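
A minimal sketch of such a script follows. It assumes the repos are hosted on GitHub and named as `owner/name`, uses a shallow clone (`--depth 1`) to save disk space, and defaults to a dry run. The function names and the `BASE_URL` constant are hypothetical.

```python
import subprocess
from pathlib import Path

# Assumption: the selected repos are hosted on GitHub and named "owner/name".
BASE_URL = "https://github.com"


def clone_command(repo, dest_root, base_url=BASE_URL):
    """Build the `git clone` argument list for one "owner/name" repo."""
    dest = Path(dest_root) / repo
    return ["git", "clone", "--depth", "1", f"{base_url}/{repo}.git", str(dest)]


def clone_repos(repos, dest_root, dry_run=True):
    """Clone each repo, skipping existing destinations; return the commands used."""
    commands = []
    for repo in repos:
        cmd = clone_command(repo, dest_root)
        if Path(cmd[-1]).exists():
            continue  # already cloned on an earlier run
        commands.append(cmd)
        if not dry_run:
            # check=False so one failed clone does not stop the whole batch
            subprocess.run(cmd, check=False)
    return commands
```

With `dry_run=True` nothing is cloned and the returned commands can be inspected first. Passing `dry_run=False` runs them one by one.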


# Train
Train, using libre code from BigCode (makers of The Stack).


# Scripts
The following scripts are available.

* `the_stack_headers.py` --- Reads header names from The Stack parquet files.
* `the_stack_licenses.py` --- Reads licenses and records from The Stack license file.


# Code Assist
The following scripts were written using Parrot code assist.
The `Phind-CodeLlama-34B-v2_q8.gguf` model from TheBloke was used.

* `the_stack_headers.py`
* `the_stack_licenses.py`