Add Submission Process Rules #476

Merged: 19 commits, Nov 3, 2023
76 changes: 51 additions & 25 deletions getting_started.md → GETTING_STARTED.md
# Getting Started

Table of Contents:
- [Set up and installation](#set-up-and-installation)
- [Download the data](#download-the-data)
- [Develop your submission](#develop-your-submission)
- [Set up your directory structure (Optional)](#set-up-your-directory-structure-optional)
- [Coding your submission](#coding-your-submission)
- [Run your submission](#run-your-submission)
- [Pytorch DDP](#pytorch-ddp)
- [Run your submission in a Docker container](#run-your-submission-in-a-docker-container)
- [Docker Tips](#docker-tips)
- [Score your submission](#score-your-submission)
- [Good Luck](#good-luck)

## Set up and installation

To get started you will have to make a few decisions and install the repository along with its dependencies. Specifically:

1. Decide whether you would like to develop your submission in PyTorch or JAX.
2. Set up your workstation or VM. We recommend using a setup similar to the [benchmarking hardware](https://github.com/mlcommons/algorithmic-efficiency/blob/main/RULES.md#benchmarking-hardware).
The specs on the benchmarking machines are:
- 8 V100 GPUs
- 240 GB of RAM
- 2 TB of storage (for datasets).

3. Install the algorithmic-efficiency package and its dependencies; see [Installation](./README.md#installation).

## Download the data

The workloads in this benchmark use 6 different datasets across 8 workloads. You may choose to download some or all of the datasets as you are developing your submission, but your submission will be scored across all 8 workloads. For instructions on obtaining and setting up the datasets see [datasets/README](https://github.com/mlcommons/algorithmic-efficiency/blob/main/datasets/README.md#dataset-setup).

## Develop your submission

To develop a submission you will write a Python module containing your optimizer algorithm. Your optimizer must implement a set of predefined API methods for the initialization and update steps.
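
Concretely, a submission module defines a small set of functions that the benchmark harness calls. The sketch below is only a rough, non-authoritative illustration: the exact method names and signatures live in the template module referenced below, and the argument lists and bodies here are simplified assumptions.

```python
# Illustrative sketch only: see submissions/template/submission.py for the
# authoritative API. Argument lists and bodies below are simplified assumptions.


def init_optimizer_state(workload, model_params, model_state, hyperparameters, rng):
  """Creates whatever state the update rule needs (here, just the learning rate)."""
  del workload, model_state, rng  # Unused in this toy sketch.
  return {"learning_rate": hyperparameters.learning_rate}


def update_params(workload, current_param_container, current_params_types,
                  model_state, hyperparameters, batch, loss_type,
                  optimizer_state, eval_results, global_step, rng):
  """Runs one training step; returns (optimizer_state, params, model_state).

  This sketch is a no-op; a real submission computes gradients on `batch`
  and applies its update rule here.
  """
  return optimizer_state, current_param_container, model_state


def get_batch_size(workload_name):
  """Returns the batch size to use for a given workload (a fixed toy value here)."""
  return 128


def data_selection(workload, input_queue, optimizer_state, current_param_container,
                   model_state, hyperparameters, global_step, rng):
  """Selects the next batch of training data, e.g. by drawing from the input queue."""
  return next(input_queue)
```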

### Set up your directory structure (Optional)

Make a submissions subdirectory to store your submission modules, e.g. `algorithmic-efficiency/submissions/my_submissions`.

### Coding your submission

You can find examples of submission modules under `algorithmic-efficiency/baselines` and `algorithmic-efficiency/reference_algorithms`. \
A submission for the external ruleset will consist of a submission module and a tuning search space definition.

1. Copy the template submission module `submissions/template/submission.py` into your submissions directory, e.g. `algorithmic-efficiency/submissions/my_submissions`.
2. Implement at least the methods in the template submission module. Feel free to use helper functions and/or modules as you see fit. Make sure you adhere to the competition rules. Check out the guidelines for [allowed submissions](https://github.com/mlcommons/algorithmic-efficiency/blob/main/RULES.md#allowed-submissions) and [disallowed submissions](https://github.com/mlcommons/algorithmic-efficiency/blob/main/RULES.md#disallowed-submissions), and pay special attention to the [software dependencies rule](https://github.com/mlcommons/algorithmic-efficiency/blob/main/RULES.md#software-dependencies).
3. Add a tuning configuration, e.g. a `tuning_search_space.json` file, to your submission directory. For the tuning search space you can either:
1. Define the set of feasible points by defining a value for `feasible_points` for the hyperparameters:

```JSON
{
  "learning_rate": {
    "feasible_points": [0.999]
  }
}
```

For a complete example see [tuning_search_space.json](https://github.com/mlcommons/algorithmic-efficiency/blob/main/reference_algorithms/target_setting_algorithms/imagenet_resnet/tuning_search_space.json).

2. Define a range of values for quasirandom sampling by specifying `min`, `max`, and `scaling` keys for the hyperparameter:

```JSON
{
  "weight_decay": {
    "min": 5e-3,
    "max": 1.0,
    "scaling": "log"
  }
}
```
For a complete example see [tuning_search_space.json](https://github.com/mlcommons/algorithmic-efficiency/blob/main/baselines/nadamw/tuning_search_space.json); a small illustrative example combining both styles follows below.
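
A single search space file can combine both styles across different hyperparameters. The hyperparameter names and values in this sketch are purely illustrative and not taken from any baseline:

```JSON
{
  "learning_rate": {
    "min": 1e-4,
    "max": 1e-2,
    "scaling": "log"
  },
  "one_minus_beta_1": {
    "feasible_points": [0.1, 0.05]
  }
}
```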

## Run your submission

From your virtual environment or interactively running Docker container run your submission with `submission_runner.py`:

**JAX**: to score your submission on a workload, from the algorithmic-efficiency directory run:

```bash
python3 submission_runner.py \
--framework=jax \
--workload=<workload> \
--experiment_dir=<path_to_experiment_dir> \
--experiment_name=<experiment_name> \
--submission_path=<path_to_submission_module> \
--tuning_search_space=<path_to_tuning_search_space>
```

**Pytorch**: to score your submission on a workload, from the algorithmic-efficiency directory run:

```bash
python3 submission_runner.py \
--framework=pytorch \
--workload=<workload> \
--experiment_dir=<path_to_experiment_dir> \
--experiment_name=<experiment_name> \
--submission_path=<path_to_submission_module> \
--tuning_search_space=<path_to_tuning_search_space>
```

### Pytorch DDP

We recommend using PyTorch's [Distributed Data Parallel (DDP)](https://pytorch.org/tutorials/intermediate/ddp_tutorial.html)
when using multiple GPUs on a single node. You can initialize DDP with `torchrun`.
For example, on a single host with 8 GPUs, simply replace `python3` in the above command with:

```bash
torchrun --redirects 1:0,2:0,3:0,4:0,5:0,6:0,7:0 --standalone --nnodes=1 --nproc_per_node=N_GPUS
```

So the complete command is:

```bash
torchrun --redirects 1:0,2:0,3:0,4:0,5:0,6:0,7:0 \
--standalone \
--nnodes=1 \
--nproc_per_node=N_GPUS \
submission_runner.py \
...
```

### Run your submission in a Docker container

The container entrypoint script provides the following flags:

- `--dataset` dataset: can be 'imagenet', 'fastmri', 'librispeech', 'criteo1tb', 'wmt', or 'ogbg'. Setting this flag will download data if `~/data/<dataset>` does not exist on the host machine. Required for running a submission.
- `--framework` framework: can be either 'pytorch' or 'jax'. If you just want to download data, this flag is required for `-d imagenet` since we have two versions of data for imagenet. This flag is also required for running a submission.
- `--submission_path` submission_path: path to submission file on container filesystem. If this flag is set, the container will run a submission, so it is required for running a submission.
- `--tuning_search_space` tuning_search_space: path to file containing tuning search space on container filesystem. Required for running a submission.
- `--experiment_name` experiment_name: name of experiment. Required for running a submission.
- `--workload` workload: can be 'imagenet_resnet', 'imagenet_jax', 'librispeech_deepspeech', 'librispeech_conformer', 'ogbg', 'wmt', 'fastmri' or 'criteo1tb'. Required for running a submission.
- `--max_global_steps` max_global_steps: maximum number of steps to run the workload for. Optional.
- `--keep_container_alive`: can be true or false. If `true`, the container will not be killed automatically. This is useful for developing or debugging.


To run the Docker container that will run the submission runner, run:

```bash
docker run -t -d \
-v $HOME/data/:/data/ \
...
--workload <workload> \
--keep_container_alive <keep_container_alive>
```

This will print the container ID to the terminal.
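
For instance, a hypothetical invocation that scores a JAX submission on the OGBG workload might look like the following; the image name, mount set, GPU flag, and paths here are illustrative assumptions rather than prescribed values:

```bash
docker run -t -d \
-v $HOME/data/:/data/ \
--gpus all \
<docker_image_name> \
--dataset ogbg \
--framework jax \
--submission_path submissions/my_submissions/submission.py \
--tuning_search_space submissions/my_submissions/tuning_search_space.json \
--experiment_name my_first_experiment \
--workload ogbg \
--keep_container_alive false
```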

#### Docker Tips

To find the container IDs of running containers:

```bash
docker ps
```

To see the output of the entrypoint script:

```bash
docker logs <container_id>
```

To enter a bash session in the container:

```bash
docker exec -it <container_id> /bin/bash
```
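
If you started the container with `--keep_container_alive true`, you will eventually want to stop and remove it yourself. This is plain Docker usage rather than anything specific to this repository:

```bash
docker stop <container_id>
docker rm <container_id>
```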

## Score your submission

To produce a performance profile and performance table:

```bash
python3 scoring/score_submission.py --experiment_path=<path_to_experiment_dir> --output_dir=<output_dir>
```

We provide the scores and performance profiles for the baseline algorithms in the "Baseline Results" section in [Benchmarking Neural Network Training Algorithms](https://arxiv.org/abs/2306.07179).

## Good Luck
78 changes: 58 additions & 20 deletions README.md

[MLCommons Algorithmic Efficiency](https://mlcommons.org/en/groups/research-algorithms/) is a benchmark and competition measuring neural network training speedups due to algorithmic improvements in both training algorithms and models. This repository holds the [competition rules](RULES.md) and the benchmark code to run it. For a detailed description of the benchmark design, see our [paper](https://arxiv.org/abs/2306.07179).

## Table of Contents

- [Table of Contents](#table-of-contents)
- [Installation](#installation)
- [Python virtual environment](#python-virtual-environment)
- [Docker](#docker)
- [Building Docker Image](#building-docker-image)
- [Running Docker Container (Interactive)](#running-docker-container-interactive)
- [Running Docker Container (End-to-end)](#running-docker-container-end-to-end)
- [Using Singularity/Apptainer instead of Docker](#using-singularityapptainer-instead-of-docker)
- [Getting Started](#getting-started)
- [Running a workload](#running-a-workload)
- [JAX](#jax)
- [Pytorch](#pytorch)
- [Rules](#rules)
- [Contributing](#contributing)
- [Disclaimers](#disclaimers)
- [FAQS](#faqs)
- [Citing AlgoPerf Benchmark](#citing-algoperf-benchmark)
- [Shared data pipelines between JAX and PyTorch](#shared-data-pipelines-between-jax-and-pytorch)
- [Setup and Platform](#setup-and-platform)
- [My machine only has one GPU. How can I use this repo?](#my-machine-only-has-one-gpu-how-can-i-use-this-repo)
- [How do I run this on my SLURM cluster?](#how-do-i-run-this-on-my-slurm-cluster)
- [How can I run this on my AWS/GCP/Azure cloud project?](#how-can-i-run-this-on-my-awsgcpazure-cloud-project)
- [Submissions](#submissions)
- [Can submission be structured using multiple files?](#can-submission-be-structured-using-multiple-files)
- [Can I install custom dependencies?](#can-i-install-custom-dependencies)
- [How can I know if my code can be run on benchmarking hardware?](#how-can-i-know-if-my-code-can-be-run-on-benchmarking-hardware)
- [Are we allowed to use our own hardware to self-report the results?](#are-we-allowed-to-use-our-own-hardware-to-self-report-the-results)




## Installation

You can install this package and dependencies in a [Python virtual environment](#python-virtual-environment) or use a [Docker/Singularity/Apptainer container](#docker) (recommended).

*TL;DR to install the Jax version for GPU run:*

...

*TL;DR to install the PyTorch version for GPU run:*

```bash
pip3 install -e '.[pytorch_gpu]' -f 'https://download.pytorch.org/whl/torch_stable.html'
pip3 install -e '.[full]'
```

### Python virtual environment

Note: Python minimum requirement >= 3.8

To set up a virtual environment and install this repository:

1. Create a new environment, e.g. via `conda` or `virtualenv`

   ...

or all workloads at once via

```bash
pip3 install -e '.[full]'
```

</details>

### Docker

We recommend using a Docker container to ensure a similar environment to our scoring and testing environments.
Alternatively, a Singularity/Apptainer container can also be used (see instructions below).

**Prerequisites for NVIDIA GPU setup**: You may have to install the NVIDIA Container Toolkit so that the containers can locate the NVIDIA drivers and GPUs.
See instructions [here](https://github.com/NVIDIA/nvidia-docker).
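
One quick way to sanity-check that Docker can see the GPUs after installing the toolkit is to run `nvidia-smi` inside a CUDA base image; the image tag here is only an example:

```bash
docker run --rm --gpus all nvidia/cuda:11.8.0-base-ubuntu22.04 nvidia-smi
```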

#### Building Docker Image

1. Clone this repository

```bash
cd ~ && git clone https://github.com/mlcommons/algorithmic-efficiency.git
```

2. Build Docker Image

```bash
cd algorithmic-efficiency/docker
docker build -t <docker_image_name> . --build-arg framework=<framework>
```

The `framework` flag can be either `pytorch`, `jax` or `both`. Specifying the framework will install the framework-specific dependencies.
The `docker_image_name` is arbitrary.
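
For example, building a PyTorch-only image with the (arbitrary) tag `algoperf_pytorch`:

```bash
cd algorithmic-efficiency/docker
docker build -t algoperf_pytorch . --build-arg framework=pytorch
```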

#### Running Docker Container (Interactive)

To use the Docker container as an interactive virtual environment, you can run a container mounted to your local data and code directories and execute the `bash` program. This may be useful if you are in the process of developing a submission.

1. Run a detached Docker container. The `container_id` will be printed if the container is running successfully.

```bash
docker run -t -d \
-v $HOME/data/:/data/ \
...
```

2. To enter a bash session in the running container:

```bash
docker exec -it <container_id> /bin/bash
```

#### Running Docker Container (End-to-end)

To run a submission end-to-end in a containerized environment see [Getting Started Document](./getting_started.md#run-your-submission-in-a-docker-container).

### Using Singularity/Apptainer instead of Docker

...

```bash
singularity shell --nv <singularity_image_name>.sif
```

Similarly to Docker, Apptainer allows you to bind specific paths on the host system and the container by specifying the `--bind` flag, as explained [here](https://docs.sylabs.io/guides/3.7/user-guide/bind_paths_and_mounts.html).
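
For example, a shell with the host data and experiment directories bound into the container could be started as follows (the paths are illustrative):

```bash
singularity shell --nv \
  --bind $HOME/data:/data,$HOME/experiment_runs:/experiment_runs \
  <singularity_image_name>.sif
```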

## Getting Started

For instructions on developing and scoring your own algorithm in the benchmark see [Getting Started Document](./getting_started.md).

### Running a workload

To run a submission directly by running a Docker container, see [Getting Started Document](./getting_started.md#run-your-submission-in-a-docker-container).

From your virtual environment or interactively running Docker container, run:

#### JAX

```bash
python3 submission_runner.py \
--framework=jax \
--workload=mnist \
--experiment_dir=$HOME/experiments \
--experiment_name=my_first_experiment \
--submission_path=baselines/adamw/jax/submission.py \
--tuning_search_space=baselines/adamw/tuning_search_space.json
```

#### Pytorch


```bash
python3 submission_runner.py \
--workload=mnist \
--experiment_dir=$HOME/experiments \
--experiment_name=my_first_experiment \
--submission_path=baselines/adamw/jax/submission.py \
--tuning_search_space=baselines/adamw/tuning_search_space.json
```

<details>
<summary>
Using Pytorch DDP (Recommended)
</summary>

...

```bash
torchrun --standalone --nnodes=1 --nproc_per_node=N_GPUS
```

where `N_GPUS` is the number of available GPUs on the node. To only see output from the first process, you can run the following to redirect the output from processes 1-7 to a log file:

```bash
torchrun --redirects 1:0,2:0,3:0,4:0,5:0,6:0,7:0 --standalone --nnodes=1 --nproc_per_node=8
```

So the complete command is, for example:

```bash
torchrun --redirects 1:0,2:0,3:0,4:0,5:0,6:0,7:0 --standalone --nnodes=1 --nproc_per_node=8 \
submission_runner.py \
--framework=pytorch \
--workload=mnist \
--experiment_dir=$HOME/experiments \
--experiment_name=my_first_experiment \
--submission_path=baselines/adamw/jax/submission.py \
--tuning_search_space=baselines/adamw/tuning_search_space.json
```

</details>

## Rules

The rules for the MLCommons Algorithmic Efficiency benchmark can be found in the separate [rules document](RULES.md). Suggestions, clarifications, and questions can be raised via pull requests.

## Contributing

If you are interested in contributing to the work of the working group, feel free to [join the weekly meetings](https://mlcommons.org/en/groups/research-algorithms/) or open issues. See our [CONTRIBUTING.md](CONTRIBUTING.md) for MLCommons contributing guidelines, and for setup and workflow instructions.

