Compare commits


52 Commits
v0.0.4 ... main

Author SHA1 Message Date
Jeff Moe 507d40c45c docs headers 2023-12-01 13:53:18 -07:00
Jeff Moe af5d09a709 formatting 2023-12-01 13:07:24 -07:00
Jeff Moe b2df8ac079 machine translation note 2023-12-01 09:57:24 -07:00
Jeff Moe ea01745dbc format docs 2023-11-30 19:40:49 -07:00
Jeff Moe 7107db27c2 formatting docs 2023-11-30 19:17:24 -07:00
root d04b0b4cb1 Development notice banner 2023-11-29 08:01:55 -07:00
root 3272692be2 rm index, fewer toctree 2023-11-28 14:20:32 -07:00
Jeff Moe a0d35874bc libre datasets.... 2023-11-28 12:46:44 -07:00
Jeff Moe a70dbaec9f Add project README to sphinx docs 2023-11-25 12:43:16 -07:00
Jeff Moe 68cbf51151 Parrot Datasets main name for docs, rearrange, etc 2023-11-25 12:30:58 -07:00
Jeff Moe e053d14155 more rename/re-org 2023-11-25 12:20:48 -07:00
Jeff Moe 873960c927 clean, noted 2023-11-25 12:05:19 -07:00
Jeff Moe f8a2d265a8 rm upstream example 2023-11-25 11:55:55 -07:00
Jeff Moe cbe6149161 v0.0.9 2023-11-25 11:53:00 -07:00
Jeff Moe 924c6a6e0e Version revving, noted, fixed 2023-11-25 11:51:54 -07:00
Jeff Moe b641a66111 Version revving, noted 2023-11-25 11:51:18 -07:00
Jeff Moe d1c44f1f7d v0.0.8 2023-11-25 11:44:16 -07:00
Jeff Moe bf1e9c43be re-arrange directory structure 2023-11-25 11:43:15 -07:00
Jeff Moe 9972f98afa readme markdown is now in sphinx 2023-11-25 11:38:40 -07:00
Jeff Moe 612a323adf v0.0.7 2023-11-25 11:24:02 -07:00
Jeff Moe 40601b03d1 Add the-stack-* scripts to bin/ 2023-11-25 11:22:47 -07:00
Jeff Moe 66b7539ba3 convert readme markdown to rst with parrot + phind 2023-11-25 10:52:24 -07:00
Jeff Moe 0a124d048c Add the stack licenses script to sphinx docs 2023-11-25 10:31:19 -07:00
Jeff Moe 1b3f0e26cd brief build note 2023-11-25 10:22:29 -07:00
Jeff Moe 478a7980b8 build hammer 2023-11-25 10:20:58 -07:00
Jeff Moe 33bcd9f539 docs/ not doc/ 2023-11-25 10:17:31 -07:00
Jeff Moe 1dfbb8d268 rearrange docs, etc. 2023-11-25 10:10:39 -07:00
Jeff Moe 5764d3bf1d pyproject initial file 2023-11-25 10:05:10 -07:00
Jeff Moe 0781ae6621 ignore build files 2023-11-25 10:04:31 -07:00
Jeff Moe 9d6a199d47 mv 2023-11-25 10:02:24 -07:00
Jeff Moe 27f4661c55 mv the_stack test 2023-11-25 10:02:08 -07:00
Jeff Moe 9d6222c5ac mv the_stack scripts 2023-11-25 10:01:56 -07:00
Jeff Moe 5511c02d46 rm egg-info 2023-11-25 10:01:24 -07:00
Jeff Moe ba4c7c0044 mv metadata example out of source 2023-11-25 09:59:55 -07:00
Jeff Moe b6ea86ff93 ignore python version 2023-11-25 09:59:26 -07:00
Jeff Moe 6ecfa2cba0 new build deps 2023-11-25 09:55:57 -07:00
Jeff Moe a898bcaae3 rm setup.py, use pyproject.toml 2023-11-25 09:55:42 -07:00
Jeff Moe 46216ab9ef v0.0.6 2023-11-24 21:38:18 -07:00
Jeff Moe a9e5a38ef4 Read the docs theme 2023-11-24 21:37:40 -07:00
Jeff Moe a5526cfe29 dont need no bats 2023-11-24 21:36:45 -07:00
Jeff Moe c86109f2ca scriptlet build cruft 2023-11-24 21:16:48 -07:00
Jeff Moe 49c63860bc v0.0.5 2023-11-24 20:10:25 -07:00
Jeff Moe 4e15ae4fed mv, setup.py, eggs, etc 2023-11-24 20:07:13 -07:00
Jeff Moe 57e305a75e Add docstrings, sphinx bits 2023-11-24 19:46:52 -07:00
Jeff Moe e3f25d4b7e The Smack doc stub 2023-11-24 19:24:53 -07:00
Jeff Moe 029e40e316 init for sphinx docs 2023-11-24 19:12:38 -07:00
Jeff Moe f3be8d6566 sphinx docs stubs 2023-11-24 18:55:33 -07:00
Jeff Moe 2ce0c9365e sphix dep for docs 2023-11-24 18:51:24 -07:00
Jeff Moe 5f5425d69a Add test stub for license scriptlet 2023-11-24 18:37:27 -07:00
Jeff Moe 852d9d9e61 the_stack_licenses, noted 2023-11-24 18:36:26 -07:00
Jeff Moe 208d4c2c0d underscore muh 2023-11-24 18:32:56 -07:00
Jeff Moe 83310747f5 Add pytest for license scriptlet 2023-11-24 18:21:44 -07:00
19 changed files with 368 additions and 153 deletions

.gitignore

@@ -1,3 +1,12 @@
 .~lock.*.ods#
-*.swp
+.pytest_cache/
+.python-version
+build
+env
+tmp
+venv
+*.swp
+**/dist/
+**/*.egg-info/
+*/target/
 __pycache__

BUILD.md 100644

@@ -0,0 +1,22 @@
# Build
Build, perhaps like this:
```
deactivate ; rm -rf venv env dist ; virtualenv env ; source env/bin/activate ; pip install -U setuptools wheel pip ; pip install -r requirements.txt ; pip install -e . ; cd docs/ ; make clean ; make html ; cd .. ; python -m build
```
Cleanish:
```
source env/bin/activate ; cd docs/ ; make clean ; cd .. ; deactivate ; rm -rf venv env dist src/*-info src/*/__pycache__
```
# Versions
```
vim CHANGELOG.txt docs/source/conf.py pyproject.toml
# git commit CHANGELOG.txt docs/source/conf.py pyproject.toml -m "v0.0.0"
# git tag v0.0.0
# git push ; git push --tags
```

CHANGELOG.txt

@@ -1,3 +1,8 @@
+v0.0.9 Fix versions.
+v0.0.8 Re-arrange project structure.
+v0.0.7 The Smack, use pyproject.toml, update Sphinx docs.
+v0.0.6 Read the Docs Sphinx for The Smack.
+v0.0.5 Sphinx docs for The Smack.
 v0.0.4 The Stack scriptlet to generate license list.
 v0.0.3 The Smack started.
 v0.0.2 Dataset table.

README.md → README.rst

@@ -1,53 +1,67 @@
-# Parrot Datasets
+.. _parrot-datasets:
+
+Parrot Datasets
+===============
 
 Datasets for Parrot Libre AI IDE.
 
 https://parrot.codes
 
-# Libre Datasets
+.. _parrot-libre-datasets:
+
+Libre Datasets
+---------------
 
-A list of libre datasets suitable for training a libre instruct model
-shall be listed.
+A list of libre datasets suitable for training a libre instruct model shall be listed.
 
 Note other well known datasets, and their license suitability.
 
-# Parrot Dataset Licensing
+.. _parrot-dataset-licensing:
+
+Parrot Dataset Licensing
+-------------------------
 
-The model may use data that is under a license that appears on one
-of these three lists as an acceptable free/open license:
+The model may use data that is under a license that appears on one of these three lists as an acceptable free/open license:
 
 * https://www.gnu.org/licenses/license-list.html
 * https://opensource.org/licenses/
 * https://commons.wikimedia.org/wiki/Commons:Licensing
 
-# Unsuitable Licenses
+.. _unsuitable-licenses:
+
+Unsuitable Licenses
+--------------------
 
-Licenses that are not free, libre, open, even if they may claim to
-be "open source".
+Licenses that are not free, libre, open, even if they may claim to be "open source".
 
 These are not "Wikipedia Commons compatible", for example:
 
 * Creative Commons Non-commercial (NC).
 * Proprietary licenses.
 * Any "custom" license that hasn't been reviewed by the general community.
 
-# Datasets Table
+.. _datasets-table:
+
+Datasets Table
+--------------
 
 Table of datasets. See also the spreadsheet `datasets.ods`.
 
-![Table of Datasets](img/datasets-table.png)
+.. image:: img/datasets-table.png
+   :alt: Table of Datasets
 
-# Datasets
+.. _datasets:
+
+Datasets
+--------
 
 Datasets perhaps to be built and used.
 
-## The Smack
-Libre version of The Stack.
-See: `datasets/the-smack`.
+* The Smack
+  Libre version of The Stack. See: `datasets/the-smack`.
 
-# License
+.. _license:
+
+License
+-------
 
 Creative Commons Attribution-ShareAlike 4.0 International
 
-*Copyright © 2023, Jeff Moe.*
+Copyright © 2023, Jeff Moe.


@@ -1,4 +0,0 @@
**/target/
Cargo.lock
venv
env


@@ -1,57 +0,0 @@
# The Smack Dataset
The Smack Dataset doesn't exist.
Should it happen to exist someday, it will be a libre build of The Stack dataset,
but not using the dataset directly, so as not to be encumbered by The Stack's
non-libre (not "open source") license.
# The Stack Metadata
The Stack has a metadata repo with details about The Stack dataset, without
containing the dataset itself. One reason for this (as they discussed in an
issue/post) is so researchers can learn about the dataset contents without
being encumbered by the license. For example, how can you agree to a license
without knowing the licenses of the contents? Using the metadata files can
help with this issue.
# Downloading Metadata
While metadata is far less than the total dataset, it is still relatively large.
The git repo is a bit over one terabyte.
Here is a link to the git repository:
```
git clone https://huggingface.co/datasets/bigcode/the-stack-metadata
```
# Reading Metadata
The Stack metadata is in parquet format, which is swell.
The parquet files are currently 562 gigabytes, numbering 2,832 files,
in 945 directories.
# Selecting Repos
Write a script to select appropriate repos per libre criteria.
# Cloning Repos
Write a script to go clone the repos.
# Train
Train, using libre code from Bigcode (makers of The Stack).
# Scripts
The following scripts are available.
* `the-stack-headers` --- Reads header names from The Stack parquet files.
# Code Assist
The following scripts were written using Parrot code assist.
The `Phind-CodeLlama-34B-v2_q8.gguf` model from TheBloke was used.
* `the-stack-headers`


@@ -1,63 +0,0 @@
import datasets
from pathlib import Path
from tqdm.auto import tqdm
import pandas as pd

# assuming metadata is cloned into /srv/ml/huggingface/datasets/bigcode/the-stack-metadata
# the stack is cloned into the local folder /data/hf_repos/the-stack-v1.1
# destination folder is in /repo_workdir/numpy_restored
the_stack_meta_path = Path('/srv/ml/huggingface/datasets/bigcode/the-stack-metadata')
the_stack_path = Path('/data/hf_repos/the-stack-v1.1')
repo_dst_root = Path('/repo_workdir/numpy_restored')
repo_name = 'numpy/numpy'

# Get bucket with numpy repo info
# meta_bucket_path = None
# for fn in tqdm(list((the_stack_meta_path / 'data').glob('*/ri.parquet'))):
#     df = pd.read_parquet(fn)
#     if any(df['name'] == repo_name):
#         meta_bucket_path = fn
#         break
meta_bucket_path = the_stack_meta_path / 'data/255_944'

# Get repository id from repo name
ri_id = pd.read_parquet(
    meta_bucket_path / 'ri.parquet'
).query(
    f'`name` == "{repo_name}"'
)['id'].to_list()[0]

# Get files information for the repository
files_info = pd.read_parquet(
    meta_bucket_path / 'fi.parquet'
).query(
    f'`ri_id` == {ri_id} and `size` != 0 and `is_deleted` == False'
)

# Convert DF with files information to a dictionary by language and then file hexsha;
# there can be more than one file with the same hexsha in the repo so we gather
# all instances per unique hexsha
files_info_dict = {
    k: v[['hexsha', 'path']].groupby('hexsha').apply(lambda x: list(x['path'])).to_dict()
    for k, v in files_info.groupby('lang_ex')
}

# Load Python part of The Stack
ds = datasets.load_dataset(
    str(the_stack_path / 'data/python'),
    num_proc=10, ignore_verifications=True
)

# Save the content of the Python files in the numpy repository to their appropriate locations
def save_file_content(example, files_info_dict, repo_dst_root):
    if example['hexsha'] in files_info_dict:
        for el in files_info_dict[example['hexsha']]:
            path = repo_dst_root / el
            path.parent.mkdir(parents=True, exist_ok=True)
            path.write_text(example['content'])

ds.map(
    save_file_content,
    fn_kwargs={'files_info_dict': files_info_dict['Python'], 'repo_dst_root': repo_dst_root},
    num_proc=10
)


@@ -1,5 +0,0 @@
datasets
tqdm
pandas
pathlib
termcolor

docs/Makefile 100644

@@ -0,0 +1,20 @@
# Minimal makefile for Sphinx documentation
#
# You can set these variables from the command line, and also
# from the environment for the first two.
SPHINXOPTS ?=
SPHINXBUILD ?= sphinx-build
SOURCEDIR = source
BUILDDIR = build
# Put it first so that "make" without argument is like "make help".
help:
	@$(SPHINXBUILD) -M help "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)

.PHONY: help Makefile

# Catch-all target: route all unknown targets to Sphinx using the new
# "make mode" option. $(O) is meant as a shortcut for $(SPHINXOPTS).
%: Makefile
	@$(SPHINXBUILD) -M $@ "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)

docs/source/conf.py

@@ -0,0 +1,12 @@
project = "Parrot Datasets"
copyright = "2023, Jeff Moe"
author = "Jeff Moe"
release = "v0.0.9"
extensions = [
"sphinx.ext.autodoc",
]
templates_path = ["_templates"]
exclude_patterns = ["_build"]
html_theme = "sphinx_rtd_theme"
html_static_path = ["_static"]
htmlhelp_basename = "ParrotDatasetsdoc"

docs/source/img/datasets-table.png (binary image added, 348 KiB; not shown)


@@ -0,0 +1,69 @@
Dataset
=======
Datasets for Parrot Libre AI IDE.
There is no Parrot Dataset, at present.
.. note:: Parrot is in early development, not ready for end users.
.. _parrot-libre-datasets:
Libre Datasets
---------------
A list of libre datasets suitable for training a libre instruct model shall be listed.
Note other well known datasets, and their license suitability.
.. _parrot-dataset-licensing:
Dataset Licensing
-----------------
The model may use data that is under a license that appears on one of these three lists as an acceptable free/open license:
* https://www.gnu.org/licenses/license-list.html
* https://opensource.org/licenses/
* https://commons.wikimedia.org/wiki/Commons:Licensing
.. _unsuitable-licenses:
Unsuitable Licenses
--------------------
Licenses that are not free, libre, open, even if they may claim to be "open source".
These are not "Wikipedia Commons compatible", for example:
* Creative Commons Non-commercial (NC).
* Proprietary licenses.
* Any "custom" license that hasn't been reviewed by the general community.
.. _datasets-table:
Datasets Table
--------------
Table of datasets. See also the spreadsheet ``datasets.ods``.
.. image:: img/datasets-table.png
   :alt: Table of Datasets
.. _libre_datasets:
Libre Datasets
--------------
Datasets perhaps to be built and used.
The Smack
^^^^^^^^^
| Libre version of The Stack.
| See: :doc:`The Smack <the_smack>`.
.. toctree::
   :maxdepth: 1
   :caption: Contents:

   the_smack
.. note:: Parrot documentation is written in English and uses AI machine translation for other languages.

docs/source/the_smack.rst

@@ -0,0 +1,94 @@
The Smack Dataset
=================
The Smack Dataset does not exist.
In the future,
if it arises,
it will be a libre build of The Stack dataset without using the original dataset directly due to non-libre (non-"open source") license encumbrances.
.. note:: Parrot is in early development, not ready for end users.
The Stack Metadata
------------------
The Stack has a separate metadata repository containing information about the dataset without hosting the dataset itself.
This practice is beneficial as it allows researchers to understand dataset contents without being bound by licenses.
For instance,
how can one agree to a license when they're unaware of the content's licenses?
By using metadata files,
this issue can be mitigated.
Link to the Git Repository:
.. code-block:: bash

   git clone https://huggingface.co/datasets/bigcode/the-stack-metadata
Downloading Metadata
--------------------
The metadata is considerably less than the entire dataset,
but still substantially large.
The Git metadata repository is approximately one terabyte in size.
Reading Metadata
----------------
The Stack's metadata is stored in parquet format.
The parquet files span 562 gigabytes and consist of 2,832 individual files across 945 directories.
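A minimal sketch of peeking at one metadata bucket with pandas (the clone path is an assumption; the ``data/<bucket>/lic.parquet`` layout and the ``255_944`` bucket name follow the upstream metadata example elsewhere in this repository):

.. code-block:: python

   from pathlib import Path

   import pandas as pd

   # Assumed clone location of the-stack-metadata
   meta_root = Path("/data/hf_repos/the-stack-metadata")
   bucket = meta_root / "data" / "255_944"  # one numbered bucket

   # Buckets hold parquet files such as ri.parquet (repositories),
   # fi.parquet (files), and lic.parquet (licenses)
   lic = pd.read_parquet(bucket / "lic.parquet")
   print(lic.head())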
Selecting Repos
---------------
Write a script to filter appropriate repositories based on libre criteria.
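No such script exists yet; a rough sketch of the filtering step, assuming ``lic.parquet`` carries ``ri_id`` and ``license`` columns (the join key is an assumption) and that ``ri.parquet`` maps repository ids to names as in the upstream example:

.. code-block:: python

   from pathlib import Path

   import pandas as pd

   # Illustrative whitelist; the real list would be built from the three
   # acceptable-license lists in the dataset licensing criteria
   LIBRE = {"MIT", "Apache-2.0", "GPL-3.0-or-later"}

   bucket = Path("/data/hf_repos/the-stack-metadata/data/255_944")
   lic = pd.read_parquet(bucket / "lic.parquet")
   ri = pd.read_parquet(bucket / "ri.parquet")

   # Keep repositories whose detected licenses are all whitelisted
   ok = lic.groupby("ri_id")["license"].apply(lambda s: set(s) <= LIBRE)
   print(ri[ri["id"].isin(ok[ok].index)]["name"].tolist())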
Cloning Repos
-------------
Write a script to clone the selected repositories.
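Also still a sketch: repository names follow the ``owner/name`` convention seen in the metadata (e.g. ``numpy/numpy``), and the GitHub base URL is an assumption:

.. code-block:: python

   import subprocess
   from pathlib import Path

   dst_root = Path("/repo_workdir")
   for repo_name in ["numpy/numpy"]:  # output of the selection step
       target = dst_root / repo_name
       target.parent.mkdir(parents=True, exist_ok=True)
       # Shallow clone; full history is not needed for training data
       subprocess.run(
           ["git", "clone", "--depth", "1",
            f"https://github.com/{repo_name}.git", str(target)],
           check=True,
       )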
Train
-----
Utilize libre code from Bigcode (creators of The Stack) for model training.
Scripts
-------
The following scripts are available:
* ``the-stack-headers`` --
  Retrieves header names from The Stack's parquet files.
* ``the-stack-licenses`` --
  Extracts licenses and records from The Stack's license file.
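Example invocations, based on the options documented in the script's docstring (exact arguments may differ):

.. code-block:: bash

   # Print records 1 through 5, colorized
   the-stack-licenses --records 1-5 -c

   # List the unique licenses found in the file
   the-stack-licenses --list_licenses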
Code Assist
-----------
The following scripts were developed using Parrot code assist:
* ``the-stack-headers``
* ``the-stack-licenses``
These scripts were created with TheBloke's
``Phind-CodeLlama-34B-v2_q8.gguf``
model.
.. toctree::
   :maxdepth: 2
   :caption: Contents:
.. automodule:: the_smack
   :members:

.. automodule:: the_smack.the_stack_licenses
   :members:
.. note:: Parrot documentation is written in English and uses AI machine translation for other languages.

pyproject.toml 100644

@@ -0,0 +1,12 @@
[build-system]
requires = ["setuptools", "wheel"]
build-backend = "setuptools.build_meta"
[project]
name = "the_smack"
version = "0.0.9"
[project.scripts]
the-stack-licenses = "the_smack.the_stack_licenses:main"
the-stack-headers = "the_smack.the_stack_headers:main"
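Each `[project.scripts]` entry maps a console command to a module-level `main()` function, so an editable install (see BUILD.md) should put both commands on `PATH`; for example:
```
pip install -e .
the-stack-licenses --help
the-stack-headers --help
```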

requirements.txt 100644

@@ -0,0 +1,10 @@
datasets
tqdm
pandas
pathlib
termcolor
pytest
sphinx
sphinx_rtd_theme
build
toml


@@ -0,0 +1,5 @@
# __init__.py
import the_smack
__all__ = ["the_smack"]

src/the_smack/the_stack_licenses.py

@@ -1,5 +1,17 @@
 #!/usr/bin/env python3
-"""Script to read and print specific records from the lic.parquet file in a numbered directory under the data/ subdirectory."""
+"""
+This script is designed to read and print specific records from the lic.parquet file in a numbered directory under the data/ subdirectory.
+
+Example usage: python3 script.py --records 1-5 -c
+
+Command-line options:
+  -h, --help            show this help message and exit
+  --version             show program's version number and exit
+  -r RANGE, --records=RANGE
+                        record number or range to print (e.g., 1, 5-7)
+  -c, --color           colorize the output
+  -l, --list_licenses   list unique licenses in the file
+"""
 
 import argparse
 import os
@@ -9,6 +21,16 @@ from termcolor import colored
 def get_records(dataframe, args):
+    """
+    Extract records from a DataFrame based on user-specified range.
+
+    Parameters:
+    dataframe (DataFrame): The pandas DataFrame to extract records from.
+    args (Namespace): A namespace object containing parsed command line arguments.
+
+    Returns:
+    DataFrame: The extracted records as a new DataFrame.
+    """
     if "-" in args.records:
         start, end = map(int, args.records.split("-"))
         return dataframe[start - 1 : end]
@@ -18,6 +40,13 @@ def get_records(dataframe, args):
 def print_records(dataframe, color):
+    """
+    Print the records in a DataFrame with optional colorization.
+
+    Parameters:
+    dataframe (DataFrame): The pandas DataFrame to print.
+    color (bool): If True, colorize the output.
+    """
     for index, row in dataframe.iterrows():
         if color:
             for col in row.index:
@@ -31,6 +60,12 @@ def print_records(dataframe, color):
 def print_unique_licenses(dataframe):
+    """
+    Print the unique licenses in a DataFrame, sorted alphabetically.
+
+    Parameters:
+    dataframe (DataFrame): The pandas DataFrame to extract licenses from.
+    """
     licenses = dataframe["license"].unique().tolist()
     licenses.sort(
         key=lambda x: [int(i) if i.isdigit() else i for i in re.split("([0-9]+)", x)]
     )
@@ -40,6 +75,9 @@ def print_unique_licenses(dataframe):
 def main():
+    """
+    Main function to parse command line arguments and run the script.
+    """
     parser = argparse.ArgumentParser(
         description="Specify the directory and record range to use"
     )


@@ -0,0 +1,34 @@
import argparse
import unittest.mock

import pandas as pd

from the_smack.the_stack_licenses import (
    get_records,
    print_records,
    print_unique_licenses,
)


def test_get_records():
    df = pd.DataFrame({"a": [1, 2, 3]})
    # get_records expects parsed arguments with a `records` attribute
    args = argparse.Namespace(records="1-2")
    assert get_records(df, args).equals(pd.DataFrame({"a": [1, 2]}))
    args = argparse.Namespace(records="1")
    assert get_records(df, args).equals(pd.DataFrame({"a": [1]}))


def test_print_records():
    df = pd.DataFrame({"a": [1, 2]})
    # Mock the built-in print function to check that output is produced
    with unittest.mock.patch("builtins.print") as mock_print:
        print_records(df, color=False)
        assert mock_print.called


def test_print_unique_licenses():
    df = pd.DataFrame({"license": ["MIT", "GPL", "Apache"]})
    # Mock the built-in print function to check that output is produced
    with unittest.mock.patch("builtins.print") as mock_print:
        print_unique_licenses(df)
        assert mock_print.called