Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow training actions to be performed in PRs #4

Open
wants to merge 24 commits into
base: main
Choose a base branch
from
Open

Conversation

bhearsum
Copy link
Collaborator

@bhearsum bhearsum commented May 4, 2023

No description provided.

bhearsum added 24 commits April 27, 2023 15:38
It contains a couple of changes that have yet to be upstreamed to taskgraph. This also requires that we disable pip hash verification for the moment.
…dels.

Most of these are straight forward download and compiles, but there's a few callouts:
- The CLI tools (marian, fast-align, etc.) already have build scripts used by the existing pipeline. For the most part, I'm just replacing them with my own version because they're just unpack/make/cmake. The exception is Marian, which has a little bit more going on with cmake definitions. Maybe I should just copy those in here though?
- Some Python modules that don't have binary wheels available, which we ought to build to avoid needing to compile them at the start of training tasks.
- CUDA (a NVIDIA Toolkit) is a huge pain. They don't have any real advertised way to just dump the files you want into a directory (they want you to run an installer). I _think_ I managed to get this work, but it's possible this will need a tweak in the future if a future task has trouble with the current toolchain.

This also necessitated switching Docker images to Ubuntu, because some tools were not reasonably possible to make work on Alpine.
I initially tried to implement these as `fetch` tasks. This failed because the way that we get so many of these is just not compatible with the idea of a static url or having one file per task. Eg: many of these datasets are fetched by running a python script. (In theory this could be reverse engineered, but I just don't think it's worth it...especially if URLs or metadata ends up changing in the future.) Instead, we're making use of the existing pipeline scripts that know how to fetch these.

As you can see, the kind generates tasks named after the provider, dataset, and locale pair. I'm not certain this is what we want to do long term (there's going to be an absurd number of tasks after we finish adding all of the datasets and language pairs)....but I think it's OK for now. We probably ought to revisit this before we start running full training pipelines - if we change it after that we'll end up rebuilding tasks due to having no cached tasks for the new names.

This revision also builds out a couple of transforms that are used here, and will be used elsewhere:
* One that can substitute provider name, dataset (in a few forms), and locale pairs into tasks. This is necessary to avoid needing to repeat things such as commands, treeherder symbols, etc.
* Another one that configures caching, using attributes defined in the kind. Eventually we're going to be using all sorts of action task parameters as part of the cache digest -- so it's important that we can specify these things per-task.
This is largely built on the earlier work done on the `dataset` kind.
These is a few things:
1) Mark all the shell scripts as +x
2) Switch out pigz/gz for zstdmt/zst (in progress)
3) Add support for `auto` where `threads` is an argument, which uses `nproc` to decide how many threads to use (in progress).
4) Use `curl` instead of `wget` (in progress)
We need this for action tasks to be triggerable through Treeherder, and it's also generally nice to have.
This is very rough for now, but it enables us to kick off certain parts of the pipeline. I intend to look into the possibility of using the existing config format (eg: https://github.com/mozilla/firefox-translations-training/blob/main/configs/config.test.yml) as the schema here later, and there's various input checking that needs to be implemented, and other enhancements.
These are an additional dependency for the `bicleaner` stage of the pipeline.
Very similar to the `clean` and `dataset` stages that have already been implemented. The notable differences are:
- The `bicleaner` tool that eventually gets called has a bunch of Python dependencies. Most of these are handled by the requirements file I'm adding, but there's two extra ones that don't have binary wheels available -- so we're grabbing them from our toolchain builds and using those. (In fact, `kenlm` isn't even declared as a dependency by `bicleaner`...so we'd have to install it by hand one way or another...)
- At the moment, this is using a new `split_by_provider` transform that avoids us needing to list out each provider in the kind. This probably needs to go away, because I recently learned that many pipeline steps (such as this one) don't run for all providers.
- An enhancement to the cache transform to allow specifying `parameters` that should contribute to the cache digest
- A similar enchancement to the substitution transform allow substituting parameters
Most of this diff is just indentation changes.
@bhearsum bhearsum force-pushed the main branch 6 times, most recently from c1ceb2d to 86cdd5e Compare May 12, 2023 20:50
@bhearsum bhearsum force-pushed the main branch 6 times, most recently from 0ba83fd to 58cc0f4 Compare May 17, 2023 16:20
@bhearsum bhearsum force-pushed the main branch 17 times, most recently from bb2da96 to 9e8641b Compare November 29, 2024 01:53
@bhearsum bhearsum force-pushed the main branch 3 times, most recently from f44726c to 11251f1 Compare December 23, 2024 16:14
@bhearsum bhearsum force-pushed the main branch 3 times, most recently from da1dd41 to 7d45b3a Compare December 27, 2024 15:40
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant