Allow training actions to be performed in PRs #4

Open
wants to merge 24 commits into
base: main
Commits
6e9ea18
Adjust ci config for staging repo
bhearsum Mar 30, 2023
494f158
Use bhearsum's taskgraph repo for now
bhearsum Mar 31, 2023
18f7e9f
Get rid of hello kind now that we know that Taskcluster works
bhearsum Mar 30, 2023
9f38ed0
Add worker type for b-linux-large, for more CPU intensive tasks; refo…
bhearsum Apr 6, 2023
7e8a176
Add yamllint config for taskcluster files
bhearsum Apr 14, 2023
039d3db
Add toolchain tasks for things that we depend on to train language mo…
bhearsum Mar 30, 2023
b515a25
Bump decision task image
bhearsum Mar 31, 2023
5eb2007
Add tasks to fetch a few dataset types
bhearsum Mar 31, 2023
571a221
Add configuration for black and ruff for python formatting
bhearsum Apr 14, 2023
835a98d
Add `clean` stage of the training pipeline
bhearsum Apr 13, 2023
851035c
Update pipeline scripts to work with Taskcluster
bhearsum Apr 13, 2023
da180ef
Add treeherder symbol for decision task
bhearsum Apr 14, 2023
08fc29d
Add a `train` action task to support kicking off the training pipeline
bhearsum Apr 13, 2023
119aa6b
Add bicleaner pack fetches
bhearsum Apr 19, 2023
dabd4df
Implement `bicleaner` pipeline stage
bhearsum Apr 19, 2023
ba43205
Raise taskgraph level for pushes, cron, and actions to level 3
bhearsum Apr 26, 2023
49b2733
Re-adjust ci-config.yml for production repository
bhearsum May 1, 2023
c87d3dc
Don't set treeherder routes for pull requests
bhearsum May 1, 2023
06a0ae5
Use standard cache prefixes
bhearsum May 1, 2023
3a7950a
Add CODEOWNERS file to suggest RelEng as a reviewer for taskcluster c…
bhearsum May 1, 2023
d38e247
Bump taskgraph version; re-enable pip hash checking
bhearsum May 2, 2023
7d97d5a
Switch cache attributes to be nested, instead of multiple top level a…
bhearsum May 3, 2023
3553279
Override compression scheme in pipeline steps.
bhearsum May 3, 2023
e768cbd
Allow training actions to be performed in PRs
bhearsum May 4, 2023
5 changes: 5 additions & 0 deletions .github/CODEOWNERS
@@ -0,0 +1,5 @@
# Taskcluster pipeline related files. Changes to these ought to be reviewed by
# RelEng to watch for security issues and best practices. These should also
# be reviewed by people familiar with the pipeline itself.
.taskcluster.yml @mozilla/releng
taskcluster @mozilla/releng
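The two rules above route review requests by path. As a rough illustration of which changed files they would cover, here is a minimal sketch that approximates the patterns with shell `case` globs. It is not GitHub's actual matcher, `needs_releng_review` is a hypothetical helper name, and `taskcluster/ci/config.yml` is only an illustrative path.

```shell
#!/bin/sh
# Approximation of the two CODEOWNERS patterns; GitHub's real matching rules
# are gitignore-like and are not reproduced exactly here.
# needs_releng_review is a hypothetical helper, not part of the PR.

# Exit 0 when the path is .taskcluster.yml or lies under a taskcluster directory.
needs_releng_review() {
  case "$1" in
    .taskcluster.yml|*/.taskcluster.yml) return 0 ;;
    taskcluster|taskcluster/*|*/taskcluster|*/taskcluster/*) return 0 ;;
    *) return 1 ;;
  esac
}

# Classify a few example paths.
for path in .taskcluster.yml taskcluster/ci/config.yml pipeline/bicleaner/bicleaner.sh; do
  if needs_releng_review "$path"; then
    echo "$path: request review from @mozilla/releng"
  else
    echo "$path: no RelEng rule applies"
  fi
done
```

Under this approximation, pipeline scripts such as `pipeline/bicleaner/bicleaner.sh` fall outside both rules, so they would rely on the people familiar with the pipeline that the comment mentions.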
385 changes: 204 additions & 181 deletions .taskcluster.yml

Large diffs are not rendered by default.

pipeline/alignment/generate-alignment-and-shortlist.sh
File mode changed: 100644 → 100755 (no content diff shown)
33 changes: 20 additions & 13 deletions pipeline/bicleaner/bicleaner.sh
100644 → 100755
@@ -23,13 +23,20 @@ type=$4
 threads=$5
 pack_dir=$6

+COMPRESSION_CMD="${COMPRESSION_CMD:-pigz}"
+ARTIFACT_EXT="${ARTIFACT_EXT:-gz}"
+
+if [ "$threads" = "auto" ]; then
+threads=$(nproc)
+fi
+
 output_dir=$(dirname "${output_prefix}")
 mkdir -p "${output_dir}"

 if [ "${bicleaner_threshold}" == "0" ]; then
 echo "Threshold is 0, skipping filtering"
-cp "${corpus_prefix}.${SRC}.gz" "${output_prefix}.${SRC}.gz"
-cp "${corpus_prefix}.${TRG}.gz" "${output_prefix}.${TRG}.gz"
+cp "${corpus_prefix}.${SRC}.${ARTIFACT_EXT}" "${output_prefix}.${SRC}.${ARTIFACT_EXT}"
+cp "${corpus_prefix}.${TRG}.${ARTIFACT_EXT}" "${output_prefix}.${TRG}.${ARTIFACT_EXT}"
 else
 if [ "${type}" == 'bicleaner-ai' ]; then
 echo "### Using bicleaner-ai"
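The defaults introduced in this hunk rely on the shell's `${VAR:-default}` expansion, which falls back when the variable is unset or empty. A minimal sketch of that behaviour follows. The `resolve_*` function names are hypothetical, and `zstdmt`/`zst` are illustrative override values, not something this PR specifies.

```shell
#!/bin/sh
# Sketch of the defaulting pattern; the resolve_* helpers are hypothetical
# names used for illustration and are not part of the PR.

# Print the compression command, falling back to pigz when unset or empty.
resolve_compression_cmd() {
  printf '%s\n' "${COMPRESSION_CMD:-pigz}"
}

# Print the artifact extension, falling back to gz when unset or empty.
resolve_artifact_ext() {
  printf '%s\n' "${ARTIFACT_EXT:-gz}"
}

# Expand "auto" to the processor count; pass any other value through.
resolve_threads() {
  if [ "$1" = "auto" ]; then
    nproc
  else
    printf '%s\n' "$1"
  fi
}

# With nothing set, the defaults apply.
unset COMPRESSION_CMD ARTIFACT_EXT
echo "default:  $(resolve_compression_cmd) / .$(resolve_artifact_ext)"

# A caller can switch the scheme (illustrative values only).
COMPRESSION_CMD=zstdmt
ARTIFACT_EXT=zst
echo "override: $(resolve_compression_cmd) / .$(resolve_artifact_ext)"

echo "threads:  $(resolve_threads 4)"
```

Whatever command is substituted has to accept `-dc` for decompressing to stdout and compress stdin to stdout when given no arguments, since that is how the script invokes it.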
@@ -69,27 +76,27 @@ else
 }
 export -f biclean
 # {%} is a 1-indexed job slot number from GNU parallel. We use that as the 1-indexed offset in CUDA_VISIBLE_ARRAY
-paste <(pigz -dc "${corpus_prefix}.${SRC}.gz") <(pigz -dc "${corpus_prefix}.${TRG}.gz") |
+paste <(${COMPRESSION_CMD} -dc "${corpus_prefix}.${SRC}.${ARTIFACT_EXT}") <(${COMPRESSION_CMD} -dc "${corpus_prefix}.${TRG}.${ARTIFACT_EXT}") |
 parallel -j ${#CUDA_VISIBLE_ARRAY[@]} --pipe -k --block 10M biclean "${pack_dir}"/*.yaml {%} |
-pigz >"${output_prefix}.scored.gz"
+${COMPRESSION_CMD} >"${output_prefix}.scored.${ARTIFACT_EXT}"
 else
-paste <(pigz -dc "${corpus_prefix}.${SRC}.gz") <(pigz -dc "${corpus_prefix}.${TRG}.gz") |
+paste <(${COMPRESSION_CMD} -dc "${corpus_prefix}.${SRC}.${ARTIFACT_EXT}") <(${COMPRESSION_CMD} -dc "${corpus_prefix}.${TRG}.${ARTIFACT_EXT}") |
 ${cmd} --scol ${scol} --tcol ${tcol} --processes "${threads}" - - "${pack_dir}"/*.yaml |
-pigz >"${output_prefix}.scored.gz"
+${COMPRESSION_CMD} >"${output_prefix}.scored.${ARTIFACT_EXT}"
 fi

 echo "### Filtering"
-pigz -dc "${output_prefix}.scored.gz" |
+${COMPRESSION_CMD} -dc "${output_prefix}.scored.${ARTIFACT_EXT}" |
 awk -v threshold=${bicleaner_threshold} -F"\t" '{if ($3>threshold) {print $0}}' |
-pigz >"${output_prefix}.best.gz"
+${COMPRESSION_CMD} >"${output_prefix}.best.${ARTIFACT_EXT}"

-echo "Lines before filtering: $(pigz -dc "${output_prefix}.scored.gz" | wc -l)"
-echo "Lines after filtering: $(pigz -dc "${output_prefix}.best.gz" | wc -l)"
+echo "Lines before filtering: $(${COMPRESSION_CMD} -dc "${output_prefix}.scored.${ARTIFACT_EXT}" | wc -l)"
+echo "Lines after filtering: $(${COMPRESSION_CMD} -dc "${output_prefix}.best.${ARTIFACT_EXT}" | wc -l)"

 echo "### Writing output corpus"
-pigz -dc "${output_prefix}.best.gz" |
-tee >(cut -f1 | pigz >"${output_prefix}.${SRC}.gz") |
-cut -f2 | pigz >"${output_prefix}.${TRG}.gz"
+${COMPRESSION_CMD} -dc "${output_prefix}.best.${ARTIFACT_EXT}" |
+tee >(cut -f1 | ${COMPRESSION_CMD} >"${output_prefix}.${SRC}.${ARTIFACT_EXT}") |
+cut -f2 | ${COMPRESSION_CMD} >"${output_prefix}.${TRG}.${ARTIFACT_EXT}"

 # do not delete intermediate files to inspect them and tune the threshold
 fi
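The "Filtering" and "Writing output corpus" steps in this script keep the scored pairs whose third tab-separated column exceeds the threshold, then split the survivors into one file per language. The sketch below replays that logic on a tiny uncompressed sample so the strict comparison is easy to see. The sentence pairs and scores are made up, `filter_scored` is a hypothetical name, and two `cut` passes stand in for the script's `tee >(...)` process substitution to keep the example POSIX-portable.

```shell
#!/bin/sh
# Sketch of the filter and split steps on uncompressed data.
# filter_scored is a hypothetical helper name, not part of the PR.

# Keep lines whose third tab-separated field (the score) is strictly
# greater than the threshold; same awk program as the pipeline script.
filter_scored() {
  awk -v threshold="$1" -F"\t" '{if ($3>threshold) {print $0}}'
}

workdir=$(mktemp -d)

# Made-up sample in the scored layout: source<TAB>target<TAB>score.
printf 'hello\thallo\t0.91\n'  > "$workdir/scored.tsv"
printf 'cat\tKatze\t0.50\n'   >> "$workdir/scored.tsv"
printf 'buy now\tzzz\t0.07\n' >> "$workdir/scored.tsv"

# With a threshold of 0.5 the 0.50 pair is dropped: the test is ">", not ">=".
filter_scored 0.5 < "$workdir/scored.tsv" > "$workdir/best.tsv"

# Split the surviving pairs into one file per language.
cut -f1 "$workdir/best.tsv" > "$workdir/corpus.src"
cut -f2 "$workdir/best.tsv" > "$workdir/corpus.trg"

echo "Lines before filtering: $(wc -l < "$workdir/scored.tsv")"
echo "Lines after filtering: $(wc -l < "$workdir/best.tsv")"
```

In the real script a threshold of `0` never reaches this code path, because the inputs are copied through unchanged instead.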
pipeline/bicleaner/download-pack.sh
File mode changed: 100644 → 100755 (no content diff shown)
1 change: 1 addition & 0 deletions pipeline/bicleaner/requirements/bicleaner-ai.in
@@ -0,0 +1 @@
bicleaner-ai==2.0
223 changes: 223 additions & 0 deletions pipeline/bicleaner/requirements/bicleaner-ai.txt
@@ -0,0 +1,223 @@
#
# This file is autogenerated by pip-compile with Python 3.10
# by the following command:
#
# pip-compile bicleaner-ai.in
#
absl-py==1.4.0
# via
# tensorboard
# tensorflow
astunparse==1.6.3
# via tensorflow
bicleaner-ai==2.0
# via -r bicleaner-ai.in
bicleaner-ai-glove==0.2.1
# via bicleaner-ai
bicleaner-hardrules==2.7.0
# via bicleaner-ai
cachetools==5.3.0
# via google-auth
certifi==2022.12.7
# via requests
charset-normalizer==3.1.0
# via requests
click==8.1.3
# via sacremoses
exceptiongroup==1.1.1
# via pytest
fastspell==0.5
# via bicleaner-hardrules
fasttext==0.9.2
# via
# bicleaner-hardrules
# fastspell
filelock==3.12.0
# via
# huggingface-hub
# transformers
flatbuffers==23.3.3
# via tensorflow
fuzzywuzzy==0.18.0
# via bicleaner-ai
gast==0.4.0
# via tensorflow
google-auth==2.17.3
# via
# google-auth-oauthlib
# tensorboard
google-auth-oauthlib==0.4.6
# via tensorboard
google-pasta==0.2.0
# via tensorflow
grpcio==1.54.0
# via
# tensorboard
# tensorflow
h5py==3.8.0
# via tensorflow
huggingface-hub==0.11.1
# via
# bicleaner-ai
# transformers
hunspell==0.5.5
# via fastspell
idna==3.4
# via requests
iniconfig==2.0.0
# via pytest
joblib==1.2.0
# via
# bicleaner-ai
# bicleaner-hardrules
# sacremoses
# scikit-learn
keras==2.11.0
# via tensorflow
levenshtein==0.20.9
# via python-levenshtein
libclang==16.0.0
# via tensorflow
markdown==3.4.3
# via tensorboard
markupsafe==2.1.2
# via werkzeug
numpy==1.24.2
# via
# bicleaner-ai
# bicleaner-ai-glove
# fasttext
# h5py
# opt-einsum
# scikit-learn
# scipy
# tensorboard
# tensorflow
# transformers
oauthlib==3.2.2
# via requests-oauthlib
opt-einsum==3.3.0
# via tensorflow
packaging==23.1
# via
# huggingface-hub
# pytest
# tensorflow
# transformers
pluggy==1.0.0
# via pytest
protobuf==3.19.6
# via
# tensorboard
# tensorflow
psutil==5.9.5
# via bicleaner-ai
pyasn1==0.4.8
# via
# pyasn1-modules
# rsa
pyasn1-modules==0.2.8
# via google-auth
pybind11==2.10.4
# via fasttext
pytest==7.3.1
# via
# bicleaner-ai
# bicleaner-hardrules
python-levenshtein==0.20.9
# via bicleaner-ai
pyyaml==6.0
# via
# bicleaner-ai
# bicleaner-hardrules
# fastspell
# huggingface-hub
# transformers
rapidfuzz==2.15.1
# via levenshtein
regex==2023.3.23
# via
# bicleaner-ai
# bicleaner-hardrules
# sacremoses
# transformers
requests==2.28.2
# via
# huggingface-hub
# requests-oauthlib
# tensorboard
# transformers
requests-oauthlib==1.3.1
# via google-auth-oauthlib
rsa==4.9
# via google-auth
sacremoses==0.0.53
# via
# bicleaner-ai
# bicleaner-hardrules
# fastspell
scikit-learn==1.2.2
# via bicleaner-ai
scipy==1.10.1
# via
# bicleaner-ai-glove
# scikit-learn
sentencepiece==0.1.98
# via bicleaner-ai
six==1.16.0
# via
# astunparse
# google-auth
# google-pasta
# sacremoses
# tensorflow
tensorboard==2.11.2
# via tensorflow
tensorboard-data-server==0.6.1
# via tensorboard
tensorboard-plugin-wit==1.8.1
# via tensorboard
tensorflow==2.11.1
# via bicleaner-ai
tensorflow-estimator==2.11.0
# via tensorflow
tensorflow-io-gcs-filesystem==0.32.0
# via tensorflow
termcolor==2.2.0
# via tensorflow
threadpoolctl==3.1.0
# via scikit-learn
tokenizers==0.13.3
# via transformers
tomli==2.0.1
# via pytest
toolwrapper==2.1.0
# via
# bicleaner-ai
# bicleaner-hardrules
tqdm==4.65.0
# via
# huggingface-hub
# sacremoses
# transformers
transformers==4.26
# via bicleaner-ai
typing-extensions==4.5.0
# via
# huggingface-hub
# tensorflow
urllib3==1.26.15
# via
# fastspell
# requests
werkzeug==2.2.3
# via tensorboard
wheel==0.40.0
# via
# astunparse
# tensorboard
wrapt==1.15.0
# via tensorflow

# The following packages are considered to be unsafe in a requirements file:
# setuptools
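According to its header, the compiled file above is produced from `bicleaner-ai.in` by `pip-compile`. One cheap consistency check is to confirm that every top-level pin in the `.in` file still appears verbatim in the compiled `.txt`. The following is a minimal sketch of such a check. `check_pins` is a hypothetical helper, not part of pip-tools or this PR, and the sample files contain only a few lines copied from the listing.

```shell
#!/bin/sh
# check_pins is a hypothetical helper, not part of pip-tools or this PR.
# Usage: check_pins <requirements.in> <requirements.txt>
# Succeeds when every non-blank, non-comment line of the .in file appears
# as an exact line in the .txt file.
check_pins() {
  in_file=$1
  txt_file=$2
  status=0
  while IFS= read -r line; do
    case "$line" in
      ''|'#'*) continue ;;
    esac
    if ! grep -Fxq -- "$line" "$txt_file"; then
      echo "missing pin: $line" >&2
      status=1
    fi
  done < "$in_file"
  return "$status"
}

workdir=$(mktemp -d)

# Small stand-ins for the two files, using lines from the listing above.
printf 'bicleaner-ai==2.0\n' > "$workdir/bicleaner-ai.in"
printf 'absl-py==1.4.0\nbicleaner-ai==2.0\n# via -r bicleaner-ai.in\n' > "$workdir/bicleaner-ai.txt"

check_pins "$workdir/bicleaner-ai.in" "$workdir/bicleaner-ai.txt" && echo "pins are consistent"
```

The check only covers exact `name==version` lines. It does not re-resolve dependencies, so it complements rather than replaces rerunning `pip-compile`.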
1 change: 1 addition & 0 deletions pipeline/bicleaner/requirements/bicleaner.in
@@ -0,0 +1 @@
bicleaner==0.16
86 changes: 86 additions & 0 deletions pipeline/bicleaner/requirements/bicleaner.txt
@@ -0,0 +1,86 @@
#
# This file is autogenerated by pip-compile with Python 3.10
# by the following command:
#
# pip-compile bicleaner.in
#
bicleaner==0.16
# via -r bicleaner.in
bicleaner-hardrules==2.5.1
# via bicleaner
click==8.1.3
# via sacremoses
exceptiongroup==1.1.1
# via pytest
fastspell==0.4
# via bicleaner-hardrules
fasttext==0.9.2
# via
# bicleaner-hardrules
# fastspell
hunspell==0.5.5
# via fastspell
iniconfig==2.0.0
# via pytest
joblib==1.2.0
# via
# bicleaner
# bicleaner-hardrules
# sacremoses
# scikit-learn
numpy==1.24.2
# via
# bicleaner
# fasttext
# scikit-learn
# scipy
packaging==23.1
# via pytest
pluggy==1.0.0
# via pytest
pybind11==2.10.4
# via fasttext
pycld2==0.41
# via bicleaner
pytest==7.3.1
# via
# bicleaner
# bicleaner-hardrules
pyyaml==6.0
# via
# bicleaner
# bicleaner-hardrules
# fastspell
regex==2023.3.23
# via
# bicleaner
# bicleaner-hardrules
# sacremoses
sacremoses==0.0.53
# via
# bicleaner
# bicleaner-hardrules
# fastspell
scikit-learn==1.1.3
# via bicleaner
scipy==1.10.1
# via
# bicleaner
# scikit-learn
six==1.16.0
# via sacremoses
threadpoolctl==3.1.0
# via scikit-learn
tomli==2.0.1
# via pytest
toolwrapper==2.1.0
# via
# bicleaner
# bicleaner-hardrules
tqdm==4.65.0
# via sacremoses
urllib3==1.26.15
# via fastspell

# The following packages are considered to be unsafe in a requirements file:
# setuptools
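The two compiled requirements files in this PR pin some shared dependencies at different versions (for example `bicleaner-hardrules` 2.7.0 versus 2.5.1), so they presumably need to be installed into separate environments. A minimal sketch for listing such differences follows. `diff_pins` is a hypothetical helper, and the sample files hold only a few pins copied from the two listings.

```shell
#!/bin/sh
# diff_pins is a hypothetical helper, not part of the PR. It prints packages
# pinned to different versions in two pip-compile outputs, one per line, as:
#   name version-in-first-file version-in-second-file
diff_pins() {
  awk -F'==' '
    /^[[:space:]]*#/ || !/==/ { next }   # skip comments and unpinned lines
    NR == FNR { first[$1] = $2; next }   # first file: remember each pin
    ($1 in first) && first[$1] != $2 { print $1, first[$1], $2 }
  ' "$1" "$2"
}

workdir=$(mktemp -d)

# Excerpts of the pins shown in bicleaner-ai.txt and bicleaner.txt.
printf 'bicleaner-hardrules==2.7.0\nfastspell==0.5\nfasttext==0.9.2\nscikit-learn==1.2.2\n' \
  > "$workdir/bicleaner-ai.txt"
printf 'bicleaner-hardrules==2.5.1\nfastspell==0.4\nfasttext==0.9.2\nscikit-learn==1.1.3\n' \
  > "$workdir/bicleaner.txt"

diff_pins "$workdir/bicleaner-ai.txt" "$workdir/bicleaner.txt"
```

On these excerpts the helper reports `bicleaner-hardrules`, `fastspell`, and `scikit-learn`, while `fasttext` is omitted because both files pin it to 0.9.2.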
pipeline/cefilter/ce-filter.sh
File mode changed: 100644 → 100755 (no content diff shown)
pipeline/cefilter/score.sh
File mode changed: 100644 → 100755 (no content diff shown)
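Several pipeline scripts in this PR change mode from 100644 to 100755, which sets the execute bit so they can be run directly. A minimal sketch for spotting scripts that still lack the bit is shown below. `list_non_executable` is a hypothetical helper, and the temporary files only mimic the repository layout.

```shell
#!/bin/sh
# list_non_executable is a hypothetical helper, not part of the PR.
# Print every *.sh file under the given directory whose owner-execute
# bit (octal 100) is not set.
list_non_executable() {
  find "$1" -type f -name '*.sh' ! -perm -100
}

workdir=$(mktemp -d)
mkdir -p "$workdir/pipeline/cefilter"

# Two placeholder scripts: one left at 644, one already at 755.
: > "$workdir/pipeline/cefilter/score.sh"
: > "$workdir/pipeline/cefilter/ce-filter.sh"
chmod 644 "$workdir/pipeline/cefilter/score.sh"
chmod 755 "$workdir/pipeline/cefilter/ce-filter.sh"

# Only the 644 file is reported.
list_non_executable "$workdir/pipeline"
```

Running `chmod +x` on a reported file and committing the result is what produces a `100644 → 100755` entry like the ones listed here.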