Sanitize whole repository #3

Draft: wants to merge 2 commits into base: main
14 changes: 14 additions & 0 deletions .flake8
Original file line number Diff line number Diff line change
@@ -0,0 +1,14 @@
[flake8]
max-line-length=120
extend-ignore =
E203
E501
exclude =
.git
__pycache__
docs/source/conf.py
old
build
dist
temp
max-complexity = 10
13 changes: 13 additions & 0 deletions .github/workflows/code_quality.yml
@@ -0,0 +1,13 @@
name: Code Quality

on: [pull_request]

jobs:
code-quality:
runs-on: ubuntu-20.04
steps:
- uses: actions/checkout@v2
- uses: actions/setup-python@v2
with:
python-version: 3.8
- uses: pre-commit/[email protected]
1 change: 1 addition & 0 deletions .gitignore
@@ -133,3 +133,4 @@ dmypy.json
test_datasets/
=======
>>>>>>> e259a5cda662d5482a6d9115faae09ec18299969
.DS_Store
45 changes: 45 additions & 0 deletions .pre-commit-config.yaml
@@ -0,0 +1,45 @@
# See https://pre-commit.com for more information
# See https://pre-commit.com/hooks.html for more hooks
repos:
- repo: https://github.com/pre-commit/pre-commit-hooks
rev: v4.1.0
hooks:
- id: check-case-conflict
- id: check-json
- id: check-symlinks
- id: check-yaml
- id: destroyed-symlinks
- id: end-of-file-fixer
exclude: docs/CNAME
- id: fix-byte-order-marker
- id: fix-encoding-pragma
args: [--remove]
- id: mixed-line-ending
args: [--fix=lf]
- id: requirements-txt-fixer
- id: trailing-whitespace
- repo: https://github.com/psf/black
rev: 22.10.0
hooks:
- id: black
files: ^(trlx|examples|tests|setup.py)/
- repo: https://github.com/pycqa/isort
rev: 5.11.2
hooks:
- id: isort
name: isort (python)
- repo: https://github.com/pycqa/flake8
rev: 6.0.0
hooks:
- id: flake8
- repo: https://github.com/codespell-project/codespell
rev: v2.2.2
hooks:
- id: codespell
args: [--skip="data/**"]
exclude: >
(?x)^(
.*\.json|.*\.ipynb|.*\.jsonl
)$
additional_dependencies:
- tomli
8 changes: 8 additions & 0 deletions .vscode/settings.json
@@ -0,0 +1,8 @@
{
"cSpell.words": [
"dedup",
"freelaw",
"levelname",
"philpapers"
]
}
2 changes: 1 addition & 1 deletion README.md
@@ -1 +1 @@
# pilev2
# Pile v2
3 changes: 0 additions & 3 deletions pile/__init__.py
@@ -1,6 +1,3 @@
from .templates import Dataset
from .datasets import *

import logging
from pathlib import Path

2 changes: 1 addition & 1 deletion pile/datasets/__init__.py
@@ -2,7 +2,7 @@
from .enron import EnronEmails
from .euro_parl import EuroParl
from .freelaw import FreeLaw
from .grade_school_math import *
from .grade_school_math import GradeSchoolMath, GradeSchoolMathNoCalc, NIHRePORTER
from .philpapers import PhilPapers
from .project_gutenberg import ProjectGutenberg
from .wikipedia import Wikipedia
1 change: 0 additions & 1 deletion pile/datasets/dm_mathematics/__init__.py
@@ -1 +0,0 @@
from .dm_mathematics import DMMathematics
11 changes: 7 additions & 4 deletions pile/datasets/dm_mathematics/dm_mathematics.py
@@ -1,14 +1,15 @@
import logging
from ...templates import Dataset
from ...file_utils import stream_jsonl, stream_jsonl_zst
from pathlib import Path

from ...file_utils import stream_jsonl, stream_jsonl_zst
from ...templates import Dataset

logger = logging.getLogger(__name__)


class DMMathematics(Dataset):
name = "DeepMind Mathematics Dataset"

license = "MIT License"

urls = [""]
@@ -29,7 +30,9 @@ def paths(self):
yield path

def examples(self):
return list(stream_jsonl(Path(__file__).parent / "dm_mathematics_examples.jsonl"))
return list(
stream_jsonl(Path(__file__).parent / "dm_mathematics_examples.jsonl")
)

def size_on_disk(self):
return -1
1 change: 0 additions & 1 deletion pile/datasets/enron/__init__.py
@@ -1 +0,0 @@
from .enron import EnronEmails
7 changes: 4 additions & 3 deletions pile/datasets/enron/enron.py
@@ -1,14 +1,15 @@
import logging
from ...templates import Dataset
from ...file_utils import stream_jsonl, stream_jsonl_zst
from pathlib import Path

from ...file_utils import stream_jsonl, stream_jsonl_zst
from ...templates import Dataset

logger = logging.getLogger(__name__)


class EnronEmails(Dataset):
name = "Enron Emails"

license = "Public domain"

urls = ["http://eaidata.bmk.sh/data/enron_emails.jsonl.zst"]
1 change: 0 additions & 1 deletion pile/datasets/euro_parl/__init__.py
@@ -1 +0,0 @@
from .euro_parl import EuroParl
11 changes: 7 additions & 4 deletions pile/datasets/euro_parl/euro_parl.py
@@ -1,17 +1,20 @@
import logging
from ...templates import Dataset
from ...file_utils import stream_jsonl, stream_jsonl_zst
from pathlib import Path

from ...file_utils import stream_jsonl, stream_jsonl_zst
from ...templates import Dataset

logger = logging.getLogger(__name__)


class EnronEmails(Dataset):
name = "EuroParl"

license = "Except where otherwise indicated, reproduction is authorised, provided that the source is acknowledged"

urls = ["https://the-eye.eu/public/AI/pile_preliminary_components/EuroParliamentProceedings_1996_2011.jsonl.zst"]
urls = [
"https://the-eye.eu/public/AI/pile_preliminary_components/EuroParliamentProceedings_1996_2011.jsonl.zst"
]

checksum = "6111400e7b7f75ce91fed1b5fc0a3630b8263217bd01ce75f7d8701f26ac0e98"

1 change: 0 additions & 1 deletion pile/datasets/free_law/__init__.py
@@ -1 +0,0 @@
from .free_law import FreeLaw
7 changes: 4 additions & 3 deletions pile/datasets/free_law/free_law.py
@@ -1,14 +1,15 @@
import logging
from ...templates import Dataset
from ...file_utils import stream_jsonl, stream_jsonl_zst
from pathlib import Path

from ...file_utils import stream_jsonl, stream_jsonl_zst
from ...templates import Dataset

logger = logging.getLogger(__name__)


class FreeLaw(Dataset):
name = "FreeLaw Project"

license = "BSD 2-Clause License"

urls = ["http://eaidata.bmk.sh/data/FreeLaw_Opinions.jsonl.zst"]
1 change: 0 additions & 1 deletion pile/datasets/grade_school_math/__init__.py
@@ -1 +0,0 @@
from .grade_school_math import GradeSchoolMath, GradeSchoolMathNoCalc
11 changes: 6 additions & 5 deletions pile/datasets/grade_school_math/grade_school_math.py
@@ -1,9 +1,10 @@
import logging
from ...templates import Dataset
from ...file_utils import stream_jsonl, dump_jsonl
from ...utils import download, mark_done, done_path, sha256sum
from pathlib import Path
import re
from pathlib import Path

from ...file_utils import dump_jsonl, stream_jsonl
from ...templates import Dataset
from ...utils import done_path, download, mark_done, sha256sum

logger = logging.getLogger(__name__)

@@ -60,7 +61,7 @@ def replicate(self):
question_answer_to_pile_format(qa) for qa in stream_jsonl(out_path)
]
if self.remove_calculator_strings:
# calculator strings in this dataset are always inbetween << and >>
# calculator strings in this dataset are always in between << and >>
# they are there so that the model can indicate when to outsource to a calculator,
# but we might not always want this behavior
# below removes them
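
Note on the hunk above: the lines that actually strip the calculator strings are collapsed in this view, so the following is only a minimal sketch of what such a removal could look like, not the code in this PR. It assumes annotations have the form `<<48/2=24>>` and never nest; the names `CALC_ANNOTATION` and `strip_calculator_annotations` are hypothetical.

```python
import re

# Assumption: calculator annotations look like "<<48/2=24>>" and never nest.
# The non-greedy ".*?" keeps each match confined to a single << ... >> pair.
CALC_ANNOTATION = re.compile(r"<<.*?>>")


def strip_calculator_annotations(text: str) -> str:
    """Remove every <<...>> span, leaving the surrounding text untouched."""
    return CALC_ANNOTATION.sub("", text)


example = "Natalia sold 48/2 = <<48/2=24>>24 clips in May."
print(strip_calculator_annotations(example))
# prints: Natalia sold 48/2 = 24 clips in May.
```

A non-greedy pattern matters here: a greedy `<<.*>>` would swallow everything between the first `<<` and the last `>>` on a line that holds more than one annotation.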
1 change: 0 additions & 1 deletion pile/datasets/nih_reporter/__init__.py
@@ -1 +0,0 @@
from .nih_reporter import NIHRePORTER
13 changes: 9 additions & 4 deletions pile/datasets/nih_reporter/nih_reporter.py
@@ -1,17 +1,22 @@
import logging
from ...templates import Dataset
from ...file_utils import stream_jsonl, stream_jsonl_zst
from pathlib import Path

from ...file_utils import stream_jsonl, stream_jsonl_zst
from ...templates import Dataset

logger = logging.getLogger(__name__)


class NIHRePORTER(Dataset):
name = "National Institute of Health RePORTER"

license = "Public domain"

urls = ["https://mystic.the-eye.eu/public/AI/pile_v2/data/nih_reporter.jsonl.zst ", "http://eaidata.bmk.sh/data/pile_v2/NIH_ExPORTER_awarded_grant_text.jsonl.zst", "https://drive.google.com/file/d/1Sz9mFTPFa4ePYHy0AOSajUtKgVyk9mTg/view?usp=sharing"]
urls = [
"https://mystic.the-eye.eu/public/AI/pile_v2/data/nih_reporter.jsonl.zst",
"http://eaidata.bmk.sh/data/pile_v2/NIH_ExPORTER_awarded_grant_text.jsonl.zst",
"https://drive.google.com/file/d/1Sz9mFTPFa4ePYHy0AOSajUtKgVyk9mTg/view?usp=sharing",
]

checksum = "0db76318737fda6c2a2484b809bb53e9e42952c284c0bf2b8862e8428e154833"

1 change: 0 additions & 1 deletion pile/datasets/phil_papers/__init__.py
@@ -1 +0,0 @@
from .phil_papers import PhilPapers
7 changes: 4 additions & 3 deletions pile/datasets/phil_papers/phil_papers.py
@@ -1,14 +1,15 @@
import logging
from ...templates import Dataset
from ...file_utils import stream_jsonl, stream_jsonl_zst
from pathlib import Path

from ...file_utils import stream_jsonl, stream_jsonl_zst
from ...templates import Dataset

logger = logging.getLogger(__name__)


class PhilPapers(Dataset):
name = "PhilPapers"

license = "Open Access"

urls = ["http://eaidata.bmk.sh/data/phil_papers.jsonl.zst"]
1 change: 0 additions & 1 deletion pile/datasets/project_gutenberg/__init__.py
@@ -1 +0,0 @@
from .project_gutenberg import ProjectGutenberg
11 changes: 7 additions & 4 deletions pile/datasets/project_gutenberg/project_gutenberg.py
@@ -1,14 +1,15 @@
import logging
from ...templates import Dataset
from ...file_utils import stream_jsonl, stream_jsonl_zst
from pathlib import Path

from ...file_utils import stream_jsonl, stream_jsonl_zst
from ...templates import Dataset

logger = logging.getLogger(__name__)


class ProjectGutenberg(Dataset):
name = "Project Gutenberg"

license = "Public domain"

urls = [""]
@@ -29,7 +30,9 @@ def paths(self):
yield path

def examples(self):
return list(stream_jsonl(Path(__file__).parent / "project_gutenberg_examples.jsonl"))
return list(
stream_jsonl(Path(__file__).parent / "project_gutenberg_examples.jsonl")
)

def size_on_disk(self):
return -1
1 change: 0 additions & 1 deletion pile/datasets/wikipedia/__init__.py
@@ -1 +0,0 @@
from .wikipedia import Wikipedia
7 changes: 4 additions & 3 deletions pile/datasets/wikipedia/wikipedia.py
@@ -1,14 +1,15 @@
import logging
from ...templates import Dataset
from ...file_utils import stream_jsonl, stream_jsonl_zst
from pathlib import Path

from ...file_utils import stream_jsonl, stream_jsonl_zst
from ...templates import Dataset

logger = logging.getLogger(__name__)


class Wikipedia(Dataset):
name = "Wikipedia"

license = ""

urls = [""]
9 changes: 5 additions & 4 deletions pile/file_utils.py
@@ -1,9 +1,10 @@
from typing import Union, List
from pathlib import Path
import logging
import io
import json
import logging
from pathlib import Path
from typing import List, Union

import zstandard as zstd
import io

logger = logging.getLogger(__name__)

Empty file added pile/filtering/__init__.py
Empty file.