Paper (arXiv) • Installation • Rules • Contributing • License
MLCommons Algorithmic Efficiency is a benchmark and competition measuring neural network training speedups due to algorithmic improvements in both training algorithms and models. This repository holds the competition rules and the benchmark code to run it. For a detailed description of the benchmark design, see our paper.
- Create new environment, e.g. via `conda` or `virtualenv` (Python minimum requirement >= 3.7):

  ```bash
  sudo apt-get install python3-venv
  python3 -m venv env
  source env/bin/activate
  ```
- Clone this repository:

  ```bash
  git clone https://github.com/mlcommons/algorithmic-efficiency.git
  cd algorithmic-efficiency
  ```
- We use pip to install the `algorithmic_efficiency` package.

  *TL;DR to install the JAX version for GPU, run:*

  ```bash
  pip3 install -e '.[pytorch_cpu]'
  pip3 install -e '.[jax_gpu]' -f 'https://storage.googleapis.com/jax-releases/jax_cuda_releases.html'
  pip3 install -e '.[full]'
  ```

  *TL;DR to install the PyTorch version for GPU, run:*

  ```bash
  pip3 install -e '.[jax_cpu]'
  pip3 install -e '.[pytorch_gpu]' -f 'https://download.pytorch.org/whl/torch_stable.html'
  pip3 install -e '.[full]'
  ```
You can also install the requirements for individual workloads, e.g. via `pip3 install -e '.[librispeech]'`, or all workloads at once via `pip3 install -e '.[full]'`.
Depending on the framework you want to use (e.g. `JAX` or `PyTorch`), you need to install it as well. You can do this either manually or by adding the corresponding options:

**JAX (GPU)**

```bash
pip3 install -e '.[jax_gpu]' -f 'https://storage.googleapis.com/jax-releases/jax_cuda_releases.html'
```

**JAX (CPU)**

```bash
pip3 install -e '.[jax_cpu]'
```

**PyTorch (GPU)**

```bash
pip3 install -e '.[pytorch_gpu]' -f 'https://download.pytorch.org/whl/torch_stable.html'
```

**PyTorch (CPU)**

```bash
pip3 install -e '.[pytorch_cpu]'
```
**Development**

To use the development tools such as `pytest` or `pylint`, use the `dev` option:

```bash
pip3 install -e '.[dev]'
pre-commit install
```

To get an installation with the requirements for all workloads and development, use the argument `[full_dev]`.
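After installing, you may want to confirm that your framework actually sees the accelerator. A minimal sanity check, assuming the `[jax_gpu]` install from above:

```python
# Post-install sanity check (assumes the '[jax_gpu]' option was installed).
import jax

# Should print one entry per visible GPU; a CPU-only result means
# JAX did not pick up the GPU.
print(jax.devices())
```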
- Clone this repository:

  ```bash
  git clone https://github.com/mlcommons/algorithmic-efficiency.git
  ```

- Build the Docker image:

  ```bash
  cd algorithmic-efficiency/ && sudo docker build -t algorithmic-efficiency .
  ```

- Run the Docker container:

  ```bash
  sudo docker run --gpus all -it --rm -v $PWD:/home/ubuntu/algorithmic-efficiency --ipc=host algorithmic-efficiency
  ```

Currently, the Docker method installs both PyTorch and JAX.
See the `reference_algorithms/` dir for training various algorithm implementations (note that none of these are valid submissions because they have workload-specific logic, so we refer to them as "algorithms" instead of "submissions"; a sketch of the submission interface follows the example commands below).

For example, to run the MNIST workload with the JAX implementation:

```bash
python3 submission_runner.py \
    --framework=jax \
    --workload=mnist \
    --experiment_dir=/home/znado \
    --experiment_name=baseline \
    --submission_path=reference_algorithms/development_algorithms/mnist/mnist_jax/submission.py \
    --tuning_search_space=reference_algorithms/development_algorithms/mnist/tuning_search_space.json
```

Or with the PyTorch implementation:

```bash
python3 submission_runner.py \
    --framework=pytorch \
    --workload=mnist \
    --experiment_dir=/home/znado \
    --experiment_name=baseline \
    --submission_path=reference_algorithms/development_algorithms/mnist/mnist_pytorch/submission.py \
    --tuning_search_space=reference_algorithms/development_algorithms/mnist/tuning_search_space.json
```
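As noted above, a valid submission must be workload-agnostic: it interacts with every workload only through a small, fixed set of functions. The sketch below shows the rough shape of a submission module; the function names match the submission API in this repo, but the signatures are abbreviated for illustration, and `algorithmic_efficiency/spec.py` holds the authoritative definitions:

```python
# Rough shape of a submission module (signatures abbreviated; see
# algorithmic_efficiency/spec.py for the authoritative definitions).
# A valid submission must not contain workload-specific logic.

def get_batch_size(workload_name):
  """Return the training batch size to use for this workload."""
  return 128  # illustrative value only

def init_optimizer_state(workload, model_params, model_state,
                         hyperparameters, rng):
  """Create the initial optimizer state, e.g. momentum buffers."""
  ...

def update_params(workload, current_param_container, current_params_types,
                  model_state, hyperparameters, batch, loss_type,
                  optimizer_state, eval_results, global_step, rng):
  """Run one training step; return updated optimizer state, params, model state."""
  ...

def data_selection(workload, input_queue, optimizer_state,
                   current_param_container, hyperparameters, global_step, rng):
  """Pick the next batch of training data, typically next(input_queue)."""
  ...
```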
When using multiple GPUs on a single node, it is recommended to use PyTorch's distributed data parallel. To do so, simply replace `python3` with

```bash
torchrun --standalone --nnodes=1 --nproc_per_node=N_GPUS
```

where `N_GPUS` is the number of available GPUs on the node. To only see output from the first process, you can run the following to redirect the output from processes 1-7 to a log file:

```bash
torchrun --redirects 1:0,2:0,3:0,4:0,5:0,6:0,7:0 --standalone --nnodes=1 --nproc_per_node=8
```
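For example, the PyTorch MNIST command from above would, assuming a node with 8 GPUs, become:

```bash
torchrun --standalone --nnodes=1 --nproc_per_node=8 submission_runner.py \
    --framework=pytorch \
    --workload=mnist \
    --experiment_dir=/home/znado \
    --experiment_name=baseline \
    --submission_path=reference_algorithms/development_algorithms/mnist/mnist_pytorch/submission.py \
    --tuning_search_space=reference_algorithms/development_algorithms/mnist/tuning_search_space.json
```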
The rules for the MLCommons Algorithmic Efficiency benchmark can be found in the separate rules document. Suggestions, clarifications, and questions can be raised via pull requests.
If you are interested in contributing to the work of the working group, feel free to join the weekly meetings, open issues, and see the MLCommons contributing guidelines.
We run basic presubmit checks with GitHub Actions, configured in the `.github/workflows` folder.

To run the below commands, use the versions installed via `pip3 install -e '.[dev]'`.
To automatically fix formatting errors, run the following (**WARNING**: this will edit your code, so it is suggested to make a git commit first!):

```bash
yapf -i -r -vv -p algorithmic_efficiency baselines datasets reference_algorithms tests *.py
```
To sort all import orderings, run the following:

```bash
isort .
```

To just print out all offending import orderings, run the following:

```bash
isort . --check --diff
```
To print out all offending pylint issues, run the following:

```bash
pylint algorithmic_efficiency
pylint baselines
pylint datasets
pylint reference_algorithms
pylint submission_runner.py
pylint tests
```
You can also use `python tests/reference_algorithm_tests.py` to run a single model update and two model evals for each workload using the reference algorithm in `reference_algorithms/development_algorithms/`.
The JAX and PyTorch versions of the Criteo, FastMRI, Librispeech, OGBG, and WMT workloads use the same TensorFlow input pipelines. Due to differences in how JAX and PyTorch distribute computations across devices, the PyTorch workloads incur additional overhead for these workloads.
Since we use PyTorch's `DistributedDataParallel` implementation, there is one Python process for each device. Depending on the hardware and the settings of the cluster, running a TensorFlow input pipeline in each Python process can lead to errors, since too many threads are created in each process. See this PR thread for more details. While this issue might not affect all setups, we currently implement a different strategy: we only run the TensorFlow input pipeline in one Python process (with `rank == 0`), and broadcast the batches to all other devices. This introduces an additional communication overhead for each batch. See the implementation for the WMT workload as an example.
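To make the strategy concrete, here is a minimal sketch of the idea, assuming an already-initialized `torch.distributed` process group and an iterator that yields NumPy batches of a known, fixed shape and dtype. All names here are illustrative, not the actual implementation:

```python
import torch
import torch.distributed as dist

def next_batch(tf_iterator, batch_shape, device):
  """Fetch the next batch on rank 0 and broadcast it to all other ranks."""
  if dist.get_rank() == 0:
    # Only rank 0 runs the TensorFlow input pipeline, avoiding one
    # tf.data pipeline (and its thread pool) per process.
    batch = torch.as_tensor(next(tf_iterator), device=device)
  else:
    # All other ranks allocate an empty buffer of the agreed-upon
    # shape and dtype to receive the batch into.
    batch = torch.empty(batch_shape, dtype=torch.float32, device=device)
  # One collective per batch: this is the extra communication
  # overhead mentioned above.
  dist.broadcast(batch, src=0)
  return batch
```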