Parrot Datasets main name for docs, rearrange, etc

main
Jeff Moe 2023-11-25 12:30:58 -07:00
parent e053d14155
commit 68cbf51151
3 changed files with 73 additions and 81 deletions

@ -1,29 +1,12 @@
# Configuration file for the Sphinx documentation builder.
#
# For the full list of built-in configuration values, see the documentation:
# https://www.sphinx-doc.org/en/master/usage/configuration.html
# -- Project information -----------------------------------------------------
# https://www.sphinx-doc.org/en/master/usage/configuration.html#project-information
project = "The Smack"
project = "Parrot Datasets"
copyright = "2023, Jeff Moe"
author = "Jeff Moe"
release = "v0.0.9"
# -- General configuration ---------------------------------------------------
# https://www.sphinx-doc.org/en/master/usage/configuration.html#general-configuration
extensions = [
"sphinx.ext.autodoc",
]
templates_path = ["_templates"]
exclude_patterns = ["_build"]
# -- Options for HTML output -------------------------------------------------
# https://www.sphinx-doc.org/en/master/usage/configuration.html#options-for-html-output
html_theme = "sphinx_rtd_theme"
html_static_path = ["_static"]
htmlhelp_basename = "TheStackdoc"
htmlhelp_basename = "ParrotDatasetsdoc"

@ -1,63 +1,6 @@
The Smack Dataset
Parrot Dataset
=================
The Smack Dataset does not exist. If it is ever built, it will be a libre rebuild of The Stack dataset that does not use the original dataset directly, because the original carries non-libre (non-"open source") license encumbrances.
The Stack Metadata
------------------
The Stack has a separate metadata repository that describes the dataset without hosting the dataset itself. This is useful because researchers can inspect what the dataset contains before being bound by its license. For instance, how can one agree to a license without knowing the licenses of the content it covers? Metadata files mitigate this problem.
Clone the Git repository:

.. code-block:: bash

   git clone https://huggingface.co/datasets/bigcode/the-stack-metadata
Downloading Metadata
--------------------
The metadata is considerably smaller than the full dataset, but it is still large: the Git repository is approximately one terabyte in size.
Reading Metadata
----------------
The Stack's metadata is stored in parquet format, a welcome choice. The parquet files total 562 gigabytes and comprise 2,832 individual files across 945 directories.
Selecting Repos
---------------
Write a script to filter appropriate repositories based on libre criteria.
Cloning Repos
-------------
Write a script to clone the selected repositories.
Train
-----
Use libre code from BigCode (creators of The Stack) for model training.
Scripts
-------
The following scripts are available:
* ``the_stack_headers.py`` - Retrieves header names from The Stack's parquet files.
* ``the_stack_licenses.py`` - Extracts licenses and records from The Stack's license file.
Code Assist
-----------
The following scripts were developed using Parrot code assist:
* ``the_stack_headers.py``
* ``the_stack_licenses.py``
These scripts were created with the ``Phind-CodeLlama-34B-v2_q8.gguf`` model from TheBloke.
There is no Parrot Dataset, for now.
.. toctree::
   :maxdepth: 2

@ -1,7 +1,73 @@
The Smack
=========
The Smack Dataset
=================
This is the documentation for The Smack.
The Smack Dataset does not exist. If it is ever built, it will be a libre rebuild of The Stack dataset that does not use the original dataset directly, because the original carries non-libre (non-"open source") license encumbrances.
The Stack Metadata
------------------
The Stack has a separate metadata repository that describes the dataset without hosting the dataset itself. This is useful because researchers can inspect what the dataset contains before being bound by its license. For instance, how can one agree to a license without knowing the licenses of the content it covers? Metadata files mitigate this problem.
Clone the Git repository:

.. code-block:: bash

   git clone https://huggingface.co/datasets/bigcode/the-stack-metadata
Downloading Metadata
--------------------
The metadata is considerably smaller than the full dataset, but it is still large: the Git repository is approximately one terabyte in size.
Reading Metadata
----------------
The Stack's metadata is stored in parquet format, a welcome choice. The parquet files total 562 gigabytes and comprise 2,832 individual files across 945 directories.
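File and directory counts like these can be checked against a local clone. The following is a minimal sketch using only the Python standard library. The function name ``summarize_parquet_tree`` is hypothetical and is not one of the scripts listed in this documentation.

```python
from pathlib import Path


def summarize_parquet_tree(root):
    """Return (file_count, dir_count, total_bytes) for *.parquet files under root."""
    # Walk the tree recursively and keep regular files ending in .parquet.
    files = [p for p in Path(root).rglob("*.parquet") if p.is_file()]
    # Count only directories that directly contain at least one parquet file.
    dirs = {p.parent for p in files}
    total_bytes = sum(p.stat().st_size for p in files)
    return len(files), len(dirs), total_bytes
```

Calling ``summarize_parquet_tree("the-stack-metadata")`` on a clone in the current directory returns the three totals. If the clone holds only Git LFS pointer files rather than the fetched data, the byte total reflects the pointers, not the parquet contents.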
Selecting Repos
---------------
Write a script to filter appropriate repositories based on libre criteria.
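One way such a filter might look, as a sketch only: keep a repository when every license detected for it is on an allowlist. The record layout (``repo_name`` and ``licenses`` keys) and the allowlist contents are assumptions for illustration. They are not taken from The Stack's metadata schema, and deciding which licenses count as libre is left to the project.

```python
# Hypothetical allowlist of SPDX identifiers; adjust to the project's libre criteria.
LIBRE_LICENSES = frozenset({"MIT", "Apache-2.0", "GPL-3.0-or-later", "BSD-3-Clause"})


def is_libre(record, allowed=LIBRE_LICENSES):
    """True when the record lists at least one license and all of them are allowed."""
    licenses = record.get("licenses") or []
    return bool(licenses) and all(lic in allowed for lic in licenses)


def select_repos(records, allowed=LIBRE_LICENSES):
    """Return sorted, de-duplicated names of repositories that pass is_libre."""
    return sorted({r["repo_name"] for r in records if is_libre(r, allowed)})
```

Repositories with no detected license are rejected rather than accepted, on the assumption that an unknown license should not be treated as libre.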
Cloning Repos
-------------
Write a script to clone the selected repositories.
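A cloning script could start from something like the sketch below. It assumes the selected repositories are named in ``owner/name`` form and hosted on GitHub. The function names, the ``base_url`` default, and the ``owner__name`` directory layout are all hypothetical.

```python
import subprocess
from pathlib import Path


def clone_command(repo_name, dest_root, base_url="https://github.com"):
    """Build a shallow `git clone` command for an 'owner/name' repository."""
    # Flatten 'owner/name' into one directory name to avoid nested folders.
    target = Path(dest_root) / repo_name.replace("/", "__")
    return ["git", "clone", "--depth", "1", f"{base_url}/{repo_name}.git", str(target)]


def clone_repos(repo_names, dest_root, dry_run=True):
    """Clone each repository; with dry_run=True only return the commands."""
    commands = [clone_command(name, dest_root) for name in repo_names]
    if not dry_run:
        Path(dest_root).mkdir(parents=True, exist_ok=True)
        for cmd in commands:
            # check=False so one failed clone does not stop the batch.
            subprocess.run(cmd, check=False)
    return commands
```

The dry-run default makes it possible to review the generated commands before any network access. A shallow clone (``--depth 1``) is used here to limit disk use, which is a design choice of this sketch rather than a requirement of the project.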
Train
-----
Use libre code from BigCode (creators of The Stack) for model training.
Scripts
-------
The following scripts are available:
* ``the-stack-headers`` - Retrieves header names from The Stack's parquet files.
* ``the-stack-licenses`` - Extracts licenses and records from The Stack's license file.
Code Assist
-----------
The following scripts were developed using Parrot code assist:
* ``the-stack-headers``
* ``the-stack-licenses``
These scripts were created with the ``Phind-CodeLlama-34B-v2_q8.gguf`` model from TheBloke.
.. toctree::
   :maxdepth: 2
   :caption: Contents:
.. toctree::
   :maxdepth: 1
   :caption: Contents:
* :ref:`genindex`
.. automodule:: the_smack
   :members: