Compare commits


52 Commits
v0.0.4 ... main

Author SHA1 Message Date
Jeff Moe 507d40c45c docs headers 2023-12-01 13:53:18 -07:00
Jeff Moe af5d09a709 formatting 2023-12-01 13:07:24 -07:00
Jeff Moe b2df8ac079 machine translation note 2023-12-01 09:57:24 -07:00
Jeff Moe ea01745dbc format docs 2023-11-30 19:40:49 -07:00
Jeff Moe 7107db27c2 formatting docs 2023-11-30 19:17:24 -07:00
root d04b0b4cb1 Development notice banner 2023-11-29 08:01:55 -07:00
root 3272692be2 rm index, fewer toctree 2023-11-28 14:20:32 -07:00
Jeff Moe a0d35874bc libre datasets.... 2023-11-28 12:46:44 -07:00
Jeff Moe a70dbaec9f Add project README to sphinx docs 2023-11-25 12:43:16 -07:00
Jeff Moe 68cbf51151 Parrot Datasets main name for docs, rearrange, etc 2023-11-25 12:30:58 -07:00
Jeff Moe e053d14155 more rename/re-org 2023-11-25 12:20:48 -07:00
Jeff Moe 873960c927 clean, noted 2023-11-25 12:05:19 -07:00
Jeff Moe f8a2d265a8 rm upstream example 2023-11-25 11:55:55 -07:00
Jeff Moe cbe6149161 v0.0.9 2023-11-25 11:53:00 -07:00
Jeff Moe 924c6a6e0e Version revving, noted, fixed 2023-11-25 11:51:54 -07:00
Jeff Moe b641a66111 Version revving, noted 2023-11-25 11:51:18 -07:00
Jeff Moe d1c44f1f7d v0.0.8 2023-11-25 11:44:16 -07:00
Jeff Moe bf1e9c43be re-arrange directory structure 2023-11-25 11:43:15 -07:00
Jeff Moe 9972f98afa readme markdown is now in sphinx 2023-11-25 11:38:40 -07:00
Jeff Moe 612a323adf v0.0.7 2023-11-25 11:24:02 -07:00
Jeff Moe 40601b03d1 Add the-stack-* scripts to bin/ 2023-11-25 11:22:47 -07:00
Jeff Moe 66b7539ba3 convert readme markdown to rst with parrot + phind 2023-11-25 10:52:24 -07:00
Jeff Moe 0a124d048c Add the stack licenses script to sphinx docs 2023-11-25 10:31:19 -07:00
Jeff Moe 1b3f0e26cd brief build note 2023-11-25 10:22:29 -07:00
Jeff Moe 478a7980b8 build hammer 2023-11-25 10:20:58 -07:00
Jeff Moe 33bcd9f539 docs/ not doc/ 2023-11-25 10:17:31 -07:00
Jeff Moe 1dfbb8d268 rearrange docs, etc. 2023-11-25 10:10:39 -07:00
Jeff Moe 5764d3bf1d pyproject initial file 2023-11-25 10:05:10 -07:00
Jeff Moe 0781ae6621 ignore build files 2023-11-25 10:04:31 -07:00
Jeff Moe 9d6a199d47 mv 2023-11-25 10:02:24 -07:00
Jeff Moe 27f4661c55 mv the_stack test 2023-11-25 10:02:08 -07:00
Jeff Moe 9d6222c5ac mv the_stack scripts 2023-11-25 10:01:56 -07:00
Jeff Moe 5511c02d46 rm egg-info 2023-11-25 10:01:24 -07:00
Jeff Moe ba4c7c0044 mv metadata example out of source 2023-11-25 09:59:55 -07:00
Jeff Moe b6ea86ff93 ignore python version 2023-11-25 09:59:26 -07:00
Jeff Moe 6ecfa2cba0 new build deps 2023-11-25 09:55:57 -07:00
Jeff Moe a898bcaae3 rm setup.py, use pyproject.toml 2023-11-25 09:55:42 -07:00
Jeff Moe 46216ab9ef v0.0.6 2023-11-24 21:38:18 -07:00
Jeff Moe a9e5a38ef4 Read the docs theme 2023-11-24 21:37:40 -07:00
Jeff Moe a5526cfe29 dont need no bats 2023-11-24 21:36:45 -07:00
Jeff Moe c86109f2ca scriptlet build cruft 2023-11-24 21:16:48 -07:00
Jeff Moe 49c63860bc v0.0.5 2023-11-24 20:10:25 -07:00
Jeff Moe 4e15ae4fed mv, setup.py, eggs, etc 2023-11-24 20:07:13 -07:00
Jeff Moe 57e305a75e Add docstrings, sphinx bits 2023-11-24 19:46:52 -07:00
Jeff Moe e3f25d4b7e The Smack doc stub 2023-11-24 19:24:53 -07:00
Jeff Moe 029e40e316 init for sphinx docs 2023-11-24 19:12:38 -07:00
Jeff Moe f3be8d6566 sphinx docs stubs 2023-11-24 18:55:33 -07:00
Jeff Moe 2ce0c9365e sphix dep for docs 2023-11-24 18:51:24 -07:00
Jeff Moe 5f5425d69a Add test stub for license scriptlet 2023-11-24 18:37:27 -07:00
Jeff Moe 852d9d9e61 the_stack_licenses, noted 2023-11-24 18:36:26 -07:00
Jeff Moe 208d4c2c0d underscore muh 2023-11-24 18:32:56 -07:00
Jeff Moe 83310747f5 Add pytest for license scriptlet 2023-11-24 18:21:44 -07:00
19 changed files with 368 additions and 153 deletions

.gitignore

@@ -1,3 +1,12 @@
 .~lock.*.ods#
-*.swp
+.pytest_cache/
+.python-version
+build
+env
+tmp
+venv
+*.swp
+**/dist/
+**/*.egg-info/
+*/target/
 __pycache__

BUILD.md 100644

@@ -0,0 +1,22 @@
# Build
Build, perhaps like this:
```
deactivate ; rm -rf venv env dist ; virtualenv env ; source env/bin/activate ; pip install -U setuptools wheel pip ; pip install -r requirements.txt ; pip install -e . ; cd docs/ ; make clean ; make html ; cd .. ; python -m build
```
Cleanish:
```
source env/bin/activate ; cd docs/ ; make clean ; cd .. ; deactivate ; rm -rf venv env dist src/*-info src/*/__pycache__
```
# Versions
```
vim CHANGELOG.txt docs/source/conf.py pyproject.toml
# git commit CHANGELOG.txt docs/source/conf.py pyproject.toml -m "v0.0.0"
# git tag v0.0.0
# git push ; git push --tags
```

CHANGELOG.txt

@@ -1,3 +1,8 @@
+v0.0.9 Fix versions.
+v0.0.8 Re-arrange project structure.
+v0.0.7 The Smack, use pyproject.toml, update Sphinx docs.
+v0.0.6 Read the Docs Sphinx for The Smack.
+v0.0.5 Sphinx docs for The Smack.
 v0.0.4 The Stack scriptlet to generate license list.
 v0.0.3 The Smack started.
 v0.0.2 Dataset table.

README.md → README.rst

@@ -1,53 +1,67 @@
-# Parrot Datasets
+.. _parrot-datasets:
+
+Parrot Datasets
+===============
 
 Datasets for Parrot Libre AI IDE.
 
 https://parrot.codes
 
-# Libre Datasets
+.. _parrot-libre-datasets:
+
+Libre Datasets
+---------------
 
-A list of libre datasets suitable for training a libre instruct model
-shall be listed.
+A list of libre datasets suitable for training a libre instruct model shall be listed.
 
 Note other well known datasets, and their license suitability.
 
-# Parrot Dataset Licensing
+.. _parrot-dataset-licensing:
+
+Parrot Dataset Licensing
+-------------------------
 
-The model may use data that is under a license that appears on one
-of these three lists as an acceptable free/open license:
+The model may use data that is under a license that appears on one of these three lists as an acceptable free/open license:
 
 * https://www.gnu.org/licenses/license-list.html
 * https://opensource.org/licenses/
 * https://commons.wikimedia.org/wiki/Commons:Licensing
 
-# Unsuitable Licenses
+.. _unsuitable-licenses:
+
+Unsuitable Licenses
+--------------------
 
-Licenses that are not free, libre, open, even if they may claim to
-be "open source".
+Licenses that are not free, libre, open, even if they may claim to be "open source".
 
 These are not "Wikipedia Commons compatible", for example:
 
 * Creative Commons Non-commercial (NC).
 * Proprietary licenses.
 * Any "custom" license that hasn't been reviewed by the general community.
 
-# Datasets Table
+.. _datasets-table:
+
+Datasets Table
+--------------
 
 Table of datasets. See also the spreadsheet `datasets.ods`.
 
-![Table of Datasets](img/datasets-table.png)
+.. image:: img/datasets-table.png
+   :alt: Table of Datasets
 
-# Datasets
+.. _datasets:
+
+Datasets
+--------
 
 Datasets perhaps to be built and used.
 
-## The Smack
-Libre version of The Stack.
-See: `datasets/the-smack`.
+* The Smack
+  Libre version of The Stack. See: `datasets/the-smack`.
 
-# License
+.. _license:
+
+License
+-------
 
 Creative Commons Attribution-ShareAlike 4.0 International
 
-*Copyright © 2023, Jeff Moe.*
+Copyright © 2023, Jeff Moe.


@@ -1,4 +0,0 @@
**/target/
Cargo.lock
venv
env


@@ -1,57 +0,0 @@
# The Smack Dataset
The Smack Dataset doesn't exist.
Should it happen to exist someday, it will be a libre build of The Stack dataset,
but not using the dataset directly, so as not to be encumbered by The Stack's
non-libre (not "open source") license.
# The Stack Metadata
The Stack has a metadata repo with details about The Stack dataset, without
containing the dataset itself. One reason for this (as they discussed in an
issue/post) is so researchers can learn about the dataset contents without
being encumbered by the license. For example, how can you agree to a license
without knowing the licenses of the contents? Using the metadata files can
help with this issue.
# Downloading Metadata
While metadata is far less than the total dataset, it is still relatively large.
The git repo is a bit over one terabyte.
Here is a link to the git repository:
```
git clone https://huggingface.co/datasets/bigcode/the-stack-metadata
```
# Reading Metadata
The Stack metadata is in parquet format, which is swell.
The parquet files are currently 562 gigabytes, numbering 2,832 files,
in 945 directories.
# Selecting Repos
Write a script to select appropriate repos per libre criteria.
# Cloning Repos
Write a script to go clone the repos.
# Train
Train, using libre code from Bigcode (makers of The Stack).
# Scripts
The following scripts are available.
* `the-stack-headers` --- Reads header names from The Stack parquet files.
# Code Assist
The following scripts were written using Parrot code assist.
The `Phind-CodeLlama-34B-v2_q8.gguf` model from TheBloke was used.
* `the-stack-headers`


@@ -1,63 +0,0 @@
import datasets
from pathlib import Path
from tqdm.auto import tqdm
import pandas as pd

# assuming metadata is cloned into /srv/ml/huggingface/datasets/bigcode/the-stack-metadata
# the stack is cloned into the local folder /data/hf_repos/the-stack-v1.1
# destination folder is in /repo_workdir/numpy_restored
the_stack_meta_path = Path('/srv/ml/huggingface/datasets/bigcode/the-stack-metadata')
the_stack_path = Path('/data/hf_repos/the-stack-v1.1')
repo_dst_root = Path('/repo_workdir/numpy_restored')
repo_name = 'numpy/numpy'

# Get bucket with numpy repo info
# meta_bucket_path = None
# for fn in tqdm(list((the_stack_meta_path / 'data').glob('*/ri.parquet'))):
#     df = pd.read_parquet(fn)
#     if any(df['name'] == repo_name):
#         meta_bucket_path = fn
#         break
meta_bucket_path = the_stack_meta_path / 'data/255_944'

# Get repository id from repo name
ri_id = pd.read_parquet(
    meta_bucket_path / 'ri.parquet'
).query(
    f'`name` == "{repo_name}"'
)['id'].to_list()[0]

# Get files information for the repository
files_info = pd.read_parquet(
    meta_bucket_path / 'fi.parquet'
).query(
    f'`ri_id` == {ri_id} and `size` != 0 and `is_deleted` == False'
)

# Convert DF with files information to a dictionary by language and then file hexsha;
# there can be more than one file with the same hexsha in the repo so we gather
# all instances per unique hexsha
files_info_dict = {
    k: v[['hexsha', 'path']].groupby('hexsha').apply(lambda x: list(x['path'])).to_dict()
    for k, v in files_info.groupby('lang_ex')
}

# Load Python part of The Stack
ds = datasets.load_dataset(
    str(the_stack_path / 'data/python'),
    num_proc=10, ignore_verifications=True
)

# Save the content of the Python files in the numpy repository to their appropriate locations
def save_file_content(example, files_info_dict, repo_dst_root):
    if example['hexsha'] in files_info_dict:
        for el in files_info_dict[example['hexsha']]:
            path = repo_dst_root / el
            path.parent.mkdir(parents=True, exist_ok=True)
            path.write_text(example['content'])

ds.map(
    save_file_content,
    fn_kwargs={'files_info_dict': files_info_dict['Python'], 'repo_dst_root': repo_dst_root},
    num_proc=10
)


@@ -1,5 +0,0 @@
datasets
tqdm
pandas
pathlib
termcolor

docs/Makefile 100644

@@ -0,0 +1,20 @@
# Minimal makefile for Sphinx documentation
#
# You can set these variables from the command line, and also
# from the environment for the first two.
SPHINXOPTS ?=
SPHINXBUILD ?= sphinx-build
SOURCEDIR = source
BUILDDIR = build
# Put it first so that "make" without argument is like "make help".
help:
	@$(SPHINXBUILD) -M help "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)

.PHONY: help Makefile

# Catch-all target: route all unknown targets to Sphinx using the new
# "make mode" option. $(O) is meant as a shortcut for $(SPHINXOPTS).
%: Makefile
	@$(SPHINXBUILD) -M $@ "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)

docs/source/conf.py

@@ -0,0 +1,12 @@
project = "Parrot Datasets"
copyright = "2023, Jeff Moe"
author = "Jeff Moe"
release = "v0.0.9"
extensions = [
"sphinx.ext.autodoc",
]
templates_path = ["_templates"]
exclude_patterns = ["_build"]
html_theme = "sphinx_rtd_theme"
html_static_path = ["_static"]
htmlhelp_basename = "ParrotDatasetsdoc"

docs/source/img/datasets-table.png (binary image added, 348 KiB; not shown)


@@ -0,0 +1,69 @@
Dataset
=======
Datasets for Parrot Libre AI IDE.
There is no Parrot Dataset, at present.
.. note:: Parrot is in early development, not ready for end users.
.. _parrot-libre-datasets:
Libre Datasets
---------------
A list of libre datasets suitable for training a libre instruct model shall be listed.
Note other well known datasets, and their license suitability.
.. _parrot-dataset-licensing:
Dataset Licensing
-----------------
The model may use data that is under a license that appears on one of these three lists as an acceptable free/open license:
* https://www.gnu.org/licenses/license-list.html
* https://opensource.org/licenses/
* https://commons.wikimedia.org/wiki/Commons:Licensing
.. _unsuitable-licenses:
Unsuitable Licenses
--------------------
Licenses that are not free, libre, open, even if they may claim to be "open source".
These are not "Wikipedia Commons compatible", for example:
* Creative Commons Non-commercial (NC).
* Proprietary licenses.
* Any "custom" license that hasn't been reviewed by the general community.
.. _datasets-table:
Datasets Table
--------------
Table of datasets. See also the spreadsheet ``datasets.ods``.
.. image:: img/datasets-table.png
   :alt: Table of Datasets
.. _libre_datasets:
Libre Datasets
--------------
Datasets perhaps to be built and used.
The Smack
^^^^^^^^^
| Libre version of The Stack.
| See: :doc:`The Smack <the_smack>`.
.. toctree::
   :maxdepth: 1
   :caption: Contents:

   the_smack
.. note:: Parrot documentation is written in English and uses AI machine translation for other languages.

docs/source/the_smack.rst

@@ -0,0 +1,94 @@
The Smack Dataset
=================
The Smack Dataset does not exist.
In the future,
if it arises,
it will be a libre build of The Stack dataset without using the original dataset directly due to non-libre (non-"open source") license encumbrances.
.. note:: Parrot is in early development, not ready for end users.
The Stack Metadata
------------------
The Stack has a separate metadata repository containing information about the dataset without hosting the dataset itself.
This practice is beneficial as it allows researchers to understand dataset contents without being bound by licenses.
For instance,
how can one agree to a license when they're unaware of the content's licenses?
By using metadata files,
this issue can be mitigated.
Link to the Git Repository:
.. code-block:: bash

   git clone https://huggingface.co/datasets/bigcode/the-stack-metadata
Downloading Metadata
--------------------
The metadata is considerably less than the entire dataset,
but still substantially large.
The Git metadata repository is approximately one terabyte in size.
Reading Metadata
----------------
The Stack's metadata is stored in parquet format.
The parquet files span 562 gigabytes and consist of 2,832 individual files across 945 directories.
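A minimal sketch of peeking at one metadata bucket with pandas (the clone path is an assumption; the ``data/<bucket>/lic.parquet`` layout and the ``255_944`` bucket name follow the upstream metadata example elsewhere in this repository):

.. code-block:: python

   from pathlib import Path

   import pandas as pd

   # Assumed clone location of the-stack-metadata
   meta_root = Path("/data/hf_repos/the-stack-metadata")
   bucket = meta_root / "data" / "255_944"  # one numbered bucket

   # Buckets hold parquet files such as ri.parquet (repositories),
   # fi.parquet (files), and lic.parquet (licenses)
   lic = pd.read_parquet(bucket / "lic.parquet")
   print(lic.head())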
Selecting Repos
---------------
Write a script to filter appropriate repositories based on libre criteria.
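No such script exists yet; a rough sketch of the filtering step, assuming ``lic.parquet`` carries ``ri_id`` and ``license`` columns (the join key is an assumption) and that ``ri.parquet`` maps repository ids to names as in the upstream example:

.. code-block:: python

   from pathlib import Path

   import pandas as pd

   # Illustrative whitelist; the real list would be built from the three
   # acceptable-license lists in the dataset licensing criteria
   LIBRE = {"MIT", "Apache-2.0", "GPL-3.0-or-later"}

   bucket = Path("/data/hf_repos/the-stack-metadata/data/255_944")
   lic = pd.read_parquet(bucket / "lic.parquet")
   ri = pd.read_parquet(bucket / "ri.parquet")

   # Keep repositories whose detected licenses are all whitelisted
   ok = lic.groupby("ri_id")["license"].apply(lambda s: set(s) <= LIBRE)
   print(ri[ri["id"].isin(ok[ok].index)]["name"].tolist())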
Cloning Repos
-------------
Write a script to clone the selected repositories.
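Also still a sketch: repository names follow the ``owner/name`` convention seen in the metadata (e.g. ``numpy/numpy``), and the GitHub base URL is an assumption:

.. code-block:: python

   import subprocess
   from pathlib import Path

   dst_root = Path("/repo_workdir")
   for repo_name in ["numpy/numpy"]:  # output of the selection step
       target = dst_root / repo_name
       target.parent.mkdir(parents=True, exist_ok=True)
       # Shallow clone; full history is not needed for training data
       subprocess.run(
           ["git", "clone", "--depth", "1",
            f"https://github.com/{repo_name}.git", str(target)],
           check=True,
       )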
Train
-----
Utilize libre code from Bigcode (creators of The Stack) for model training.
Scripts
-------
The following scripts are available:
* ``the-stack-headers`` --
  Retrieves header names from The Stack's parquet files.
* ``the-stack-licenses`` --
  Extracts licenses and records from The Stack's license file.
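Example invocations, based on the options documented in the script's docstring (exact arguments may differ):

.. code-block:: bash

   # Print records 1 through 5, colorized
   the-stack-licenses --records 1-5 -c

   # List the unique licenses found in the file
   the-stack-licenses --list_licenses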
Code Assist
-----------
The following scripts were developed using Parrot code assist:
* ``the-stack-headers``
* ``the-stack-licenses``
These scripts were created with TheBloke's
``Phind-CodeLlama-34B-v2_q8.gguf``
model.
.. toctree::
   :maxdepth: 2
   :caption: Contents:
.. automodule:: the_smack
   :members:

.. automodule:: the_smack.the_stack_licenses
   :members:
.. note:: Parrot documentation is written in English and uses AI machine translation for other languages.

pyproject.toml 100644

@@ -0,0 +1,12 @@
[build-system]
requires = ["setuptools", "wheel"]
build-backend = "setuptools.build_meta"
[project]
name = "the_smack"
version = "0.0.9"
[project.scripts]
the-stack-licenses = "the_smack.the_stack_licenses:main"
the-stack-headers = "the_smack.the_stack_headers:main"
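Each `[project.scripts]` entry maps a console command to a module-level `main()` function, so an editable install (see BUILD.md) should put both commands on `PATH`; for example:
```
pip install -e .
the-stack-licenses --help
the-stack-headers --help
```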

requirements.txt 100644

@@ -0,0 +1,10 @@
datasets
tqdm
pandas
pathlib
termcolor
pytest
sphinx
sphinx_rtd_theme
build
toml


@@ -0,0 +1,5 @@
# __init__.py
import the_smack
__all__ = ["the_smack"]

src/the_smack/the_stack_licenses.py

@@ -1,5 +1,17 @@
 #!/usr/bin/env python3
-"""Script to read and print specific records from the lic.parquet file in a numbered directory under the data/ subdirectory."""
+"""
+This script is designed to read and print specific records from the lic.parquet file in a numbered directory under the data/ subdirectory.
+
+Example usage: python3 script.py --records 1-5 -c
+
+Command-line options:
+  -h, --help            show this help message and exit
+  --version             show program's version number and exit
+  -r RANGE, --records=RANGE
+                        record number or range to print (e.g., 1, 5-7)
+  -c, --color           colorize the output
+  -l, --list_licenses   list unique licenses in the file
+"""
 
 import argparse
 import os
@@ -9,6 +21,16 @@ from termcolor import colored
 def get_records(dataframe, args):
+    """
+    Extract records from a DataFrame based on user-specified range.
+
+    Parameters:
+    dataframe (DataFrame): The pandas DataFrame to extract records from.
+    args (Namespace): A namespace object containing parsed command line arguments.
+
+    Returns:
+    DataFrame: The extracted records as a new DataFrame.
+    """
     if "-" in args.records:
         start, end = map(int, args.records.split("-"))
         return dataframe[start - 1 : end]
@@ -18,6 +40,13 @@ def get_records(dataframe, args):
 def print_records(dataframe, color):
+    """
+    Print the records in a DataFrame with optional colorization.
+
+    Parameters:
+    dataframe (DataFrame): The pandas DataFrame to print.
+    color (bool): If True, colorize the output.
+    """
     for index, row in dataframe.iterrows():
         if color:
             for col in row.index:
@@ -31,6 +60,12 @@ def print_records(dataframe, color):
 def print_unique_licenses(dataframe):
+    """
+    Print the unique licenses in a DataFrame, sorted alphabetically.
+
+    Parameters:
+    dataframe (DataFrame): The pandas DataFrame to extract licenses from.
+    """
     licenses = dataframe["license"].unique().tolist()
     licenses.sort(
         key=lambda x: [int(i) if i.isdigit() else i for i in re.split("([0-9]+)", x)]
     )
@@ -40,6 +75,9 @@ def print_unique_licenses(dataframe):
 def main():
+    """
+    Main function to parse command line arguments and run the script.
+    """
     parser = argparse.ArgumentParser(
         description="Specify the directory and record range to use"
     )


@@ -0,0 +1,34 @@
import argparse
import unittest.mock

import pandas as pd

from the_smack.the_stack_licenses import (
    get_records,
    print_records,
    print_unique_licenses,
)


def test_get_records():
    df = pd.DataFrame({"a": [1, 2, 3]})
    # get_records expects parsed arguments with a `records` attribute
    args = argparse.Namespace(records="1-2")
    assert get_records(df, args).equals(pd.DataFrame({"a": [1, 2]}))
    args = argparse.Namespace(records="1")
    assert get_records(df, args).equals(pd.DataFrame({"a": [1]}))


def test_print_records():
    df = pd.DataFrame({"a": [1, 2]})
    # Mock the built-in print function to check that output is produced
    with unittest.mock.patch("builtins.print") as mock_print:
        print_records(df, color=False)
        assert mock_print.called


def test_print_unique_licenses():
    df = pd.DataFrame({"license": ["MIT", "GPL", "Apache"]})
    # Mock the built-in print function to check that output is produced
    with unittest.mock.patch("builtins.print") as mock_print:
        print_unique_licenses(df)
        assert mock_print.called