readme markdown is now in sphinx

2023-11-25 11:38:40 -07:00 · 9972f98afa
parent 612a323adf
commit 9972f98afa
1 changed file with 0 additions and 59 deletions
--- a/datasets/the-smack/README.md
+++ b/datasets/the-smack/README.md
@@ -1,59 +0,0 @@
-# The Smack Dataset
-The Smack Dataset doesn't exist.
-
-Should it happen to exist someday, it will be a libre build of The Stack dataset,
-but not using the dataset directly, so as not to be encumbered by The Stack's
-non-libre (not "open source") license.
-
-
-# The Stack Metadata
-The Stack has a metadata repo with details about The Stack dataset, without
-containing the dataset itself. One reason for this (as they discussed in an
-issue/post) is so researchers can learn about the dataset contents without
-being encumbered by the license. For example, how can you agree to a license
-without knowing the licenses of the contents? Using the metadata files can
-help with this issue.
-
-
-# Downloading Metadata
-While metadata is far less than the total dataset, it is still relatively large.
-The git repo is a bit over one terabyte.
-
-Here is a link to the git repository:
-
-```
-git clone https://huggingface.co/datasets/bigcode/the-stack-metadata
-```
-
-
-# Reading Metadata
-The Stack metadata is in parquet format, which is swell.
-The parquet files are currently 562 gigabytes, numbering 2,832 files,
-in 945 directories.
-
-
-# Selecting Repos
-Write a script to select appropriate repos per libre criteria.
-
-
-# Cloning Repos
-Write a script to go clone the repos.
-
-
-# Train
-Train, using libre code from Bigcode (makers of The Stack).
-
-
-# Scripts
-The following scripts are available.
-
-* `the_stack_headers.py` --- Reads header names from The Stack parquet files.
-* `the_stack_licenses.py` --- Reads licenses and records from The Stack license file.
-
-
-# Code Assist
-The following scripts were written using Parrot code assist.
-The `Phind-CodeLlama-34B-v2_q8.gguf` model from TheBloke was used.
-
-* `the_stack_headers.py`
-* `the_stack_licenses.py`