diff --git a/CALL_FOR_SUBMISSIONS.md b/CALL_FOR_SUBMISSIONS.md new file mode 100644 index 000000000..ecc7840e7 --- /dev/null +++ b/CALL_FOR_SUBMISSIONS.md @@ -0,0 +1,3 @@ +# MLCommons™ AlgoPerf: Call for Submissions + +🚧 **Coming soon!** 🚧 diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md index 38867b369..025cb6d30 100644 --- a/CONTRIBUTING.md +++ b/CONTRIBUTING.md @@ -1,4 +1,28 @@ -# Contributing +# MLCommons™ AlgoPerf: Contributing + +## Table of Contents + +- [Setup](#setup) + - [Setting up a Linux VM on GCP](#setting-up-a-linux-vm-on-gcp) + - [Installing GPU Drivers](#installing-gpu-drivers) + - [Authentication for Google Cloud Container Registry](#authentication-for-google-cloud-container-registry) +- [Installation](#installation) +- [Docker workflows](#docker-workflows) + - [Pre-built Images on Google Cloud Container Registry](#pre-built-images-on-google-cloud-container-registry) + - [Trigger rebuild and push of maintained images](#trigger-rebuild-and-push-of-maintained-images) + - [Trigger build and push of images on other branch](#trigger-build-and-push-of-images-on-other-branch) + - [GCP Data and Experiment Integration](#gcp-data-and-experiment-integration) + - [Downloading Data from GCP](#downloading-data-from-gcp) + - [Saving Experiments to GCP](#saving-experiments-to-gcp) + - [Getting Information from a Container](#getting-information-from-a-container) + - [Mounting Local Repository](#mounting-local-repository) +- [Submitting PRs](#submitting-prs) +- [Testing](#testing) + - [Style Testing](#style-testing) + - [Unit and integration tests](#unit-and-integration-tests) + - [Regression tests](#regression-tests) + +We invite everyone to look through our rules and codebase and submit issues and pull requests, e.g. for rules changes, clarifications, or any bugs you might encounter. 
If you are interested in contributing to the work of the working group and influencing the benchmark's design decisions, please [join the weekly meetings](https://mlcommons.org/en/groups/research-algorithms/) and consider becoming a member of the working group. The best way to contribute to the MLCommons is to get involved with one of our many project communities. You find more information about getting involved with MLCommons [here](https://mlcommons.org/en/get-involved/#getting-started). @@ -8,29 +32,25 @@ To get started contributing code, you or your organization needs to sign the MLC MLCommons project work is tracked with issue trackers and pull requests. Modify the project in your own fork and issue a pull request once you want other developers to take a look at what you have done and discuss the proposed changes. Ensure that cla-bot and other checks pass for your Pull requests. -# Table of Contents -- [Setup](#setup) -- [Installation](#installation) -- [Docker workflows](#docker-workflows) -- [Submitting PRs](#submitting-prs) -- [Testing](#testing) +## Setup +### Setting up a Linux VM on GCP -# Setup -## Setting up a Linux VM on GCP If you want to run containers on GCP VMs or store and retrieve Docker images from the Google Cloud Container Registry, please read ahead. If you'd like to use a Linux VM, you will have to install the correct GPU drivers and the NVIDIA Docker toolkit. We recommend to use the Deep Learning on Linux image. Further instructions are based on that. ### Installing GPU Drivers + You can use the `scripts/cloud-startup.sh` as a startup script for the VM. This will automate the installation of the NVIDIA GPU Drivers and NVIDIA Docker toolkit. ### Authentication for Google Cloud Container Registry + To access the Google Cloud Container Registry, you will have to authenticate to the repository whenever you use Docker.
Use the gcloud credential helper as documented [here](https://cloud.google.com/artifact-registry/docs/docker/pushing-and-pulling#cred-helper). +## Installation -# Installation If you have not installed the package and dependencies yet see [Installation](./README.md#installation). To use the development tools such as `pytest` or `pylint` use the `dev` option: @@ -42,39 +62,34 @@ pre-commit install To get an installation with the requirements for all workloads and development, use the argument `[full_dev]`. +## Docker workflows +We recommend developing in our Docker image to ensure a consistent environment between developing, testing and scoring submissions. -# Docker workflows -We recommend developing in our Docker image to ensure a consistent environment between developing, testing and scoring submissions. +To get started see also: -To get started see: -- [Installation with Docker](./README.md#docker) +- [Installation with Docker](./README.md#docker) - [Running a submission inside a Docker Container](./getting_started.md#run-your-submission-in-a-docker-container) -Other resources: -- [Pre-built Images on Google Cloud Container Registry](#pre-built-images-on-google-cloud-container-registry) -- [GCP Data and Experiment Integration](#gcp-integration) - - [Downloading Data from GCP](#downloading-data-from-gcp) - - [Saving Experiments Results to GCP](#saving-experiments-to-gcp) -- [Getting Information from a Container](#getting-information-from-a-container) -- [Mounting local repository](#mounting-local-repository) +### Pre-built Images on Google Cloud Container Registry - -## Pre-built Images on Google Cloud Container Registry If you want to maintain or use images stored on our Google Cloud Container Registry read this section. 
You will have to use an authentication helper to set up permissions to access the repository: -``` + +```bash ARTIFACT_REGISTRY_URL=us-central1-docker.pkg.dev gcloud auth configure-docker $ARTIFACT_REGISTRY_URL ``` To pull the latest prebuilt image: -``` +```bash docker pull us-central1-docker.pkg.dev/training-algorithms-external/mlcommons-docker-repo/ ``` -The naming convention for `image_name` is `algoperf__`. + +The naming convention for `image_name` is `algoperf__`. Currently maintained images on the repository are: + - `algoperf_jax_main` - `algoperf_pytorch_main` - `algoperf_both_main` @@ -82,32 +97,40 @@ Currently maintained images on the repository are: - `algoperf_pytorch_dev` - `algoperf_both_dev` -To reference the pulled image you will have to use the full `image_path`, e.g. +To reference the pulled image you will have to use the full `image_path`, e.g. `us-central1-docker.pkg.dev/training-algorithms-external/mlcommons-docker-repo/algoperf_jax_main`. ### Trigger rebuild and push of maintained images + To build and push all images (`pytorch`, `jax`, `both`) on maintained branches (`dev`, `main`). -``` + +```bash bash docker/build_docker_images.sh -b ``` #### Trigger build and push of images on other branch -You can also use the above script to build images from a different branch. + +You can also use the above script to build images from a different branch. + 1. Push the branch to `mlcommons/algorithmic-efficiency` repository. 2. Run - ``` + + ```bash bash docker/build_docker_images.sh -b ``` -## GCP Data and Experiment Integration -The Docker entrypoint script can transfer data to and from +### GCP Data and Experiment Integration + +The Docker entrypoint script can transfer data to and from our GCP buckets on our internal GCP project. If -you are an approved contributor you can get access to these resources to automatically download the datasets and upload experiment results. 
+you are an approved contributor you can get access to these resources to automatically download the datasets and upload experiment results. You can use these features by setting the `--internal_contributor` flag to 'true' for the Docker entrypoint script. ### Downloading Data from GCP + To run a docker container that will only download data (if not found on host) -``` + +```bash docker run -t -d \ -v $HOME/data/:/data/ \ -v $HOME/experiment_runs/:/experiment_runs \ @@ -120,15 +143,18 @@ docker run -t -d \ --keep_container_alive \ --internal_contributor true ``` + If `keep_container_alive` is `true` the main process on the container will persist after finishing the data download. -This run command is useful if you are developing or debugging. +This run command is useful if you are developing or debugging. ### Saving Experiments to GCP + If you set the internal collaborator mode to true experiments will also be automatically uploaded to our GCP bucket under `gs://mlcommons-runs/ ``` To enter a bash session in the container -``` + +```bash docker exec -it /bin/bash ``` -## Mounting Local Repository +### Mounting Local Repository + Rebuilding the docker image can become tedious if you are making frequent changes to the code. -To have changes in your local copy of the algorithmic-efficiency repo be reflected inside the container you can mount the local repository with the `-v` flag. -``` +To have changes in your local copy of the algorithmic-efficiency repo be reflected inside the container you can mount the local repository with the `-v` flag. + +```bash docker run -t -d \ -v $HOME/data/:/data/ \ -v $HOME/experiment_runs/:/experiment_runs \ @@ -178,33 +210,40 @@ docker run -t -d \ --keep_container_alive true ``` -# Submitting PRs +## Submitting PRs + New PRs will be merged on the dev branch by default, given that they pass the presubmits. 
-# Testing +## Testing + We run tests with GitHub Actions, configured in the [.github/workflows](https://github.com/mlcommons/algorithmic-efficiency/tree/main/.github/workflows) folder. -## Style Testing +### Style Testing + We run yapf and linting tests on PRs. You can view and fix offending errors with these instructions. To run the below commands, use the versions installed via `pip install -e '.[dev]'`. To automatically fix formatting errors, run the following (*WARNING:* this will edit your code, so it is suggested to make a git commit first!): + ```bash yapf -i -r -vv -p algorithmic_efficiency baselines datasets reference_algorithms tests *.py ``` To sort all import orderings, run the following: + ```bash isort . ``` To just print out all offending import orderings, run the following: + ```bash isort . --check --diff ``` To print out all offending pylint issues, run the following: + ```bash pylint algorithmic_efficiency pylint baselines @@ -218,16 +257,20 @@ pylint tests We run unit tests and integration tests as part of the GitHub Actions as well. You can also use `python tests/reference_algorithm_tests.py` to run a single model update and two model evals for each workload using the reference algorithm in `reference_algorithms/target_setting_algorithms/`. -## Regression tests +### Regression tests + We also have regression tests available in [.github/workflows/regression_tests.yml](https://github.com/mlcommons/algorithmic-efficiency/tree/main/.github/workflows/regression_tests.yml) that can be run semi-automatically. -The regression tests are shorter end-to-end submissions run in a containerized environment across all 8 workloads, in both the jax and pytorch frameworks. +The regression tests are shorter end-to-end submissions run in a containerized environment across all 8 workloads, in both the jax and pytorch frameworks. The regression tests run on self-hosted runners and are triggered for pull requests that target the main branch.
Typically these PRs will be from the `dev` branch so the tests will run containers based on images built from the `dev` branch. To run a regression test: + 1. Build and upload latest Docker images from dev branch. - ``` + + ```bash bash ~/algorithmic-efficiency/docker/build_docker_images.sh -b dev ``` + 2. Turn on the self-hosted runner. 3. Run the self-hosted runner application for the runner to accept jobs. 4. Open a pull request into main to trigger the workflow. diff --git a/GETTING_STARTED.md b/GETTING_STARTED.md index 8cab3959c..1369f5cc7 100644 --- a/GETTING_STARTED.md +++ b/GETTING_STARTED.md @@ -1,4 +1,6 @@ -# Getting Started +# MLCommons™ AlgoPerf: Getting Started + +## Table of Contents - [Set up and installation](#set-up-and-installation) - [Download the data](#download-the-data) diff --git a/README.md b/README.md index 91a56285c..dd0d7fe3e 100644 --- a/README.md +++ b/README.md @@ -176,22 +176,29 @@ To use the Docker container as an interactive virtual environment, you can run a To run a submission end-to-end in a containerized environment see [Getting Started Document](./getting_started.md#run-your-submission-in-a-docker-container). ### Using Singularity/Apptainer instead of Docker + Since many compute clusters don't allow the usage of Docker due to security concerns and instead encourage the use of [Singularity/Apptainer](https://github.com/apptainer/apptainer) (formerly Singularity, now called Apptainer), we also provide instructions on how to build an Apptainer container based on the Dockerfile provided here.
To convert the Dockerfile into an Apptainer definition file, we will use [spython](https://github.com/singularityhub/singularity-cli): + ```bash pip3 install spython cd algorithmic-efficiency/docker spython recipe Dockerfile &> Singularity.def ``` + Now we can build the Apptainer image by running + ```bash singularity build --fakeroot .sif Singularity.def ``` + To start a shell session with GPU support (by using the `--nv` flag), we can run + ```bash singularity shell --nv .sif ``` + Similarly to Docker, Apptainer allows you to bind specific paths on the host system and the container by specifying the `--bind` flag, as explained [here](https://docs.sylabs.io/guides/3.7/user-guide/bind_paths_and_mounts.html). ## Getting Started diff --git a/RULES.md b/RULES.md index 7225a76b0..d74525244 100644 --- a/RULES.md +++ b/RULES.md @@ -1,10 +1,12 @@ # MLCommons™ AlgoPerf: Benchmark Rules -**Version:** 0.0.17 *(Last updated 10 August 2023)* +**Version:** 0.0.18 *(Last updated 03 October 2023)* > **TL;DR** New training algorithms and models can make neural net training faster. > We need a rigorous training time benchmark that measures time to result given a fixed hardware configuration and stimulates algorithmic progress. We propose a [Training Algorithm Track](#training-algorithm-track) and a [Model Track](#model-track) in order to help disentangle optimizer improvements and model architecture improvements. This two-track structure lets us enforce a requirement that new optimizers work well on multiple models and that new models aren't highly specific to particular training hacks.
+## Table of Contents + - [Introduction](#introduction) - [Training Algorithm Track](#training-algorithm-track) - [Submissions](#submissions) @@ -48,7 +50,7 @@ For a description of how to submit a training algorithm to the AlgoPerf: Trainin ### Submissions -A valid submission is a piece of code that defines all of the submission functions and is able to train all benchmark workloads on the [benchmarking hardware](#benchmarking-hardware) (defined in the [Scoring](#scoring) section). Both the validation set and the test set performance will be checked regularly during training (see the [Evaluation during training](#evaluation-during-training) section). Training halts when the workload-specific [target errors](#defining-target-performance) for the validation and test sets have been reached. For each workload, the training time to reach the *test* set target error is used as input to the [scoring process](#scoring) for the submission. Submissions using [external tuning](#external-tuning-ruleset) will be tuned independently for each workload using a single workload-agnostic search space for their specified hyperparameters. The tuning trials are selected based on the time to reach the *validation* target, but only their training times to reach the *test* target will be used for scoring. Submissions under either tuning ruleset may always self-tune while on the clock. +A valid submission is a piece of code that defines all of the submission functions and is able to train all benchmark workloads on the [benchmarking hardware](#benchmarking-hardware) (defined in the [Scoring](#scoring) section). Both the validation set and the test set performance will be checked regularly during training (see the [Evaluation during training](#evaluation-during-training) section); however, only the validation performance is relevant for scoring. Training halts when the workload-specific [target errors](#defining-target-performance) for the validation and test sets have been reached.
For each workload, only the training time to reach the *validation* set target error is used as input to the [scoring process](#scoring) for the submission. Submissions using [external tuning](#external-tuning-ruleset) will be tuned independently for each workload using a single workload-agnostic search space for their specified hyperparameters. The tuning trials are selected based on the time to reach the *validation* target. Submissions under either tuning ruleset may always self-tune while on the clock. #### Specification @@ -354,17 +356,17 @@ Tuning will be substantially different for the [external](#external-tuning-rules For each workload, the hyperparameters are tuned using $O=20$ tuning **trials**. To estimate the variance of the results, this tuning will be repeated for $S=5$ **studies**, for a total of $S\cdot O = 100$ different hyperparameter settings. The submitters will provide a workload-agnostic search space and the working group will then return $100$ hyperparameters settings obtained using [(quasi)random search](https://arxiv.org/abs/1706.03200). The working group will also randomly partition these $100$ trials into $5$ studies of $20$ trials each. In lieu of independent samples from a search space, submissions can instead supply a fixed list of $20$ hyper-parameter points that will be sampled without replacement. -In each trial, the tuning trial with the fastest training time to achieve the *validation target* is determined among the $O=20$ hyperparameter settings. For scoring, however, we use the training time to reach the *test targets* of those $5$ selected runs. The median of these $5$ per-study training times will be the final training time for the submission on this workload and is used in the scoring procedure (see the "[Scoring submissions](#scoring)" section). In other words: We use the *validation performance* for tuning and selecting the best hyperparameter but use the *test performance* when measuring the training speed. 
Runs that do not reach the target performance of the evaluation metric have an infinite time. Submissions are always free to perform additional self-tuning while being timed. +In each study, the tuning trial with the fastest training time to achieve the *validation target* is determined among the $O=20$ hyperparameter settings. For scoring, we use the training time required to reach the *validation target* in each of those $5$ selected runs. The median of these $5$ per-study training times will be the final training time for the submission on this workload and is used in the scoring procedure (see the "[Scoring submissions](#scoring)" section). Runs that do not reach the target performance of the evaluation metric have an infinite time. Submissions are always free to perform additional self-tuning while being timed. #### Self-tuning ruleset Submissions to this ruleset are not allowed to have user-defined hyperparameters. This ruleset allows both submissions that use the same hyperparameters for all workloads, including the randomized ones (e.g. Adam with default parameters), as well as submissions that perform inner-loop tuning during their training run (e.g. SGD with line searches). -Submissions will run on one instance of the [benchmarking hardware](#benchmarking-hardware). As always, submissions are allowed to perform inner-loop tuning (e.g. for their learning rate) but the tuning efforts will be part of their score. A submission will run *S=5* times and its score will be the median time to reach the target evaluation metric value on the test set. To account for the lack of external tuning, submissions have a longer time budget to reach the target performance. Compared to the [external tuning ruleset](#external-tuning-ruleset), the `max_runtime` is tripled. Runs that do not reach the target performance of the evaluation metric within this allotted time budget have an infinite time.
+Submissions will run on one instance of the [benchmarking hardware](#benchmarking-hardware). As always, submissions are allowed to perform inner-loop tuning (e.g. for their learning rate) but the tuning efforts will be part of their score. A submission will run *S=5* times and its score will be the median time to reach the target evaluation metric value on the validation set. To account for the lack of external tuning, submissions have a longer time budget to reach the target performance. Compared to the [external tuning ruleset](#external-tuning-ruleset), the `max_runtime` is tripled. Runs that do not reach the target performance of the evaluation metric within this allotted time budget have an infinite time. ### Workloads -For the purposes of the Training Algorithm Track, we consider a workload the combination of a `dataset`, `model`, `loss_fn`, along with a target that is defined over some evaluation metric. E.g., ResNet50 on ImageNet using the cross-entropy loss until a target error of 34.6% on the test set has been reached, would constitute a workload. The evaluation metric, in this example the misclassification error rate, is directly implied by the dataset/task. +For the purposes of the Training Algorithm Track, we consider a workload the combination of a `dataset`, `model`, `loss_fn`, along with a target that is defined over some evaluation metric. E.g., ResNet50 on ImageNet using the cross-entropy loss until a target error of 22.6% on the validation set has been reached, would constitute a workload. The evaluation metric, in this example the misclassification error rate, is directly implied by the dataset/task. Submissions will be scored based on their performance on the [fixed workload](#fixed-workloads). However, additionally submissions must also perform reasonably well on a set of [held-out workloads](#randomized-workloads) in order for their score on the fixed workload to count (for full details see the [Scoring](#scoring) section).
These held-out workloads will be generated after the submission deadline, but their randomized generating process is publicly available with the call for submissions (see "[Randomized workloads](#randomized-workloads)" section). @@ -407,9 +409,9 @@ For the [external tuning ruleset](#external-tuning-ruleset), we will only use $1 ### Scoring -Submissions will be scored based on their required training time to reach the target performance on the test set of each workload. This target performance metric can be the same as the loss function but might also be a different workload-specific metric such as the error rate or BLEU score. The target performance was defined using four standard training algorithms, see the "[Defining target performance](#defining-target-performance)" section for more details. The training time of a submission includes the compilation times for computation graphs and ops that could happen just-in-time during training; all our benchmarks should be fast enough to compile so as not to dramatically impact overall performance. The overall ranking is then determined by summarizing the performances across all [fixed workloads](#fixed-workloads), using [performance profiles](#benchmark-score-using-performance-profiles), as explained below. +Submissions will be scored based on their required training time to reach the target performance on the validation set of each workload. This target performance metric can be the same as the loss function but might also be a different workload-specific metric such as the error rate or BLEU score. The target performance was defined using four standard training algorithms, see the "[Defining target performance](#defining-target-performance)" section for more details. The training time of a submission includes the compilation times for computation graphs and ops that could happen just-in-time during training; all our benchmarks should be fast enough to compile so as not to dramatically impact overall performance. 
The overall ranking is then determined by summarizing the performances across all [fixed workloads](#fixed-workloads), using [performance profiles](#benchmark-score-using-performance-profiles), as explained below. -While the training time to the *test set* target is used for scoring, we use the training time to the *validation set* target for tuning. This is only relevant for submissions in the [external tuning ruleset](#external-tuning-ruleset) but is also enforced for self-reported results (i.e. submissions in the self-reported ruleset must also reach the validation target in time but only the time to the test target is used for scoring). Submitters must select the hyperparameter setting that reached the *validation* target the fastest, irrespective of its training time to achieve the *test* target. This ensures a fair and practical procedure. +The training time until the target performance on the test set was reached is not used in the scoring procedure but might be used for additional analysis of the competition results. #### Benchmarking hardware @@ -428,7 +430,7 @@ Both [tuning rulesets](#tuning) will use the same target performances. The runti We will aggregate the training times of a submission on all fixed workloads using [Performance Profiles](http://www.argmin.net/2018/03/26/performance-profiles/) (originally from [Dolan and Moré](https://arxiv.org/abs/cs/0102001)). Below we surface several relevant definitions from their work for easier readability, before explaining how we integrate the performance profiles to reach a scalar benchmark score that will be used for ranking submissions. -*Notation:* We have a set $\mathcal{S} = \{s_1, s_2, \dots, s_k\}$ of in total $k$ submissions that we evaluate on a set of $n$ fixed workloads: $\mathcal{W} = \{w_1, w_2, \dots, w_n\}$. For each submission $s$ and each workload $w$ we have a training time score $t_{s,w} \in [0,\infty)$. 
This is the time it took the submission to reach the test target performance on this particular workload. +*Notation:* We have a set $\mathcal{S} = \{s_1, s_2, \dots, s_k\}$ of in total $k$ submissions that we evaluate on a set of $n$ fixed workloads: $\mathcal{W} = \{w_1, w_2, \dots, w_n\}$. For each submission $s$ and each workload $w$ we have a training time score $t_{s,w} \in [0,\infty)$. This is the time it took the submission to reach the validation target performance on this particular workload. ##### Computing performance ratios @@ -464,10 +466,10 @@ The integral is normalized by the total integration area, with higher benchmark For the benchmark score, we compute and integrate the performance profiles using the training times of only the fixed workloads. But we use the submission's performance on the held-out workloads to penalize submissions. Specifically, if a submission is unable to train a held-out workload, we score the submission on the corresponding fixed workload as if that submission did not reach the target. In other words, for a submission to receive a finite training time on a fixed workload, it needs to: -- Reach the validation and test target on the fixed workload within the maximum runtime. -- Reach the validation and test target fixed workload within 4x of the fastest submission. -- Reach the validation and test target on the held-out workload (corresponding to the fixed workload) within the maximum runtime. -- Reach the validation and test target on the held-out workload (corresponding to the fixed workload) within 4x of the fastest submission. To determine the fastest submission on a held-out workload, we only consider submissions that reached the target on the corresponding fixed workload. This protects us against extremely fast submissions that only work on a specific held-out workload and are useless as general algorithms. +- Reach the validation target on the fixed workload within the maximum runtime. 
+- Reach the validation target on the fixed workload within 4x of the fastest submission. +- Reach the validation target on the held-out workload (corresponding to the fixed workload) within the maximum runtime. +- Reach the validation target on the held-out workload (corresponding to the fixed workload) within 4x of the fastest submission. To determine the fastest submission on a held-out workload, we only consider submissions that reached the target on the corresponding fixed workload. This protects us against extremely fast submissions that only work on a specific held-out workload and are useless as general algorithms. Only if all four requirements are met, does the submission get a finite score. Otherwise, a submission will receive a training time of infinity. diff --git a/SUBMISSION_PROCESS_RULES.md b/SUBMISSION_PROCESS_RULES.md index 51aeff043..227d6128b 100644 --- a/SUBMISSION_PROCESS_RULES.md +++ b/SUBMISSION_PROCESS_RULES.md @@ -2,6 +2,8 @@ **Version:** 0.0.3 *(Last updated 10 October 2023)* +## Table of Contents + - [Basics](#basics) - [Schedule](#schedule) - [Dates](#dates)
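
Reviewer note on the RULES.md change above: the revised external tuning rule (per study, pick the trial that reaches the *validation* target fastest, then take the median over the studies, with unreached targets counting as infinite time) can be sketched roughly as follows. This is only an illustration under stated assumptions, not code from the repository: the function names `study_time` and `workload_time` and the input layout (one list of per-trial times per study, with `math.inf` for trials that never reached the target) are hypothetical.

```python
import math
from statistics import median


def study_time(trial_times):
    """Fastest time-to-validation-target among one study's trials.

    Hypothetical helper: `trial_times` holds one float per tuning trial,
    with math.inf for trials that never reached the validation target.
    """
    return min(trial_times, default=math.inf)


def workload_time(studies):
    """Median of the per-study best times for a single workload.

    Hypothetical helper: `studies` is a list of per-study trial-time lists.
    """
    return median(study_time(trials) for trials in studies)


# Five studies with made-up trial times (seconds); inf marks a missed target.
example_studies = [
    [100.0, 80.0, math.inf],
    [90.0, math.inf],
    [math.inf, math.inf],
    [70.0, 120.0],
    [110.0],
]
print(workload_time(example_studies))
```

In this toy input the per-study best times are 80, 90, infinity, 70, and 110, so a single study that never reaches the target does not by itself make the workload time infinite; that only happens once at least half of the studies fail.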