Commit 931f71f

Merge pull request #476 from fsschneider/CfS

Add Submission Process Rules

priyakasimbeg authored Nov 3, 2023
2 parents 3b47594 + b616fd2 commit 931f71f

Showing 4 changed files with 283 additions and 79 deletions.
76 changes: 51 additions & 25 deletions getting_started.md → GETTING_STARTED.md
# Getting Started

Table of Contents:

- [Set up and installation](#set-up-and-installation)
- [Download the data](#download-the-data)
- [Develop your submission](#develop-your-submission)
  - [Set up your directory structure (Optional)](#set-up-your-directory-structure-optional)
  - [Coding your submission](#coding-your-submission)
- [Run your submission](#run-your-submission)
  - [Pytorch DDP](#pytorch-ddp)
  - [Run your submission in a Docker container](#run-your-submission-in-a-docker-container)
    - [Docker Tips](#docker-tips)
- [Score your submission](#score-your-submission)
- [Good Luck](#good-luck)

## Set up and installation

To get started you will have to make a few decisions and install the repository along with its dependencies. Specifically:

1. Decide whether you would like to develop your submission in PyTorch or JAX.
2. Set up your workstation or VM. We recommend using a setup similar to the [benchmarking hardware](https://github.com/mlcommons/algorithmic-efficiency/blob/main/RULES.md#benchmarking-hardware).
   The specs of the benchmarking machines are:
   - 8 V100 GPUs
   - 240 GB of RAM
   - 2 TB of storage (for datasets).

3. Install the `algorithmic-efficiency` package and its dependencies, see [Installation](./README.md#installation). A condensed install sketch is shown below.
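
For orientation, a condensed install for a JAX GPU setup might look like the following sketch; see the README's [Installation](./README.md#installation) section for the exact extras and wheel index URLs for your framework:

```bash
# Condensed sketch: clone the repository and install it with JAX GPU
# support. The extras and the JAX wheel index URL follow the README's
# installation section; adjust them for your framework.
git clone https://github.com/mlcommons/algorithmic-efficiency.git
cd algorithmic-efficiency
pip3 install -e '.[jax_gpu]' -f 'https://storage.googleapis.com/jax-releases/jax_cuda_releases.html'
pip3 install -e '.[full]'
```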

## Download the data

The workloads in this benchmark use 6 different datasets across 8 workloads. You may choose to download some or all of the datasets as you are developing your submission, but your submission will be scored across all 8 workloads. For instructions on obtaining and setting up the datasets see [datasets/README](https://github.com/mlcommons/algorithmic-efficiency/blob/main/datasets/README.md#dataset-setup).
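
As a rough sketch, downloading one of the smaller datasets with the provided setup script might look like the following; the script path and flags are assumptions based on the datasets README, so check it for the exact interface and storage requirements:

```bash
# Hypothetical invocation of the dataset setup script for the OGBG
# workload; --data_dir and --ogbg are assumed flags, see
# datasets/README.md for the authoritative instructions.
python3 datasets/dataset_setup.py \
  --data_dir=$HOME/data \
  --ogbg
```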

## Develop your submission

To develop a submission you will write a Python module containing your optimizer algorithm. Your optimizer must implement a set of predefined API methods for the initialization and update steps.

### Set up your directory structure (Optional)

Make a submissions subdirectory to store your submission modules, e.g. `algorithmic-efficiency/submissions/my_submissions` (see the sketch below).
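
A minimal sketch of creating this directory from the repository root (the directory name is just an example):

```bash
# Create a directory for your submission module(s); "my_submissions"
# is an arbitrary example name.
mkdir -p submissions/my_submissions
```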

### Coding your submission

You can find examples of submission modules under `algorithmic-efficiency/baselines` and `algorithmic-efficiency/reference_algorithms`. \
A submission for the external ruleset will consist of a submission module and a tuning search space definition.

1. Copy the template submission module `submissions/template/submission.py` into your submission directory, e.g. `algorithmic-efficiency/submissions/my_submissions`.
2. Implement at least the methods in the template submission module. Feel free to use helper functions and/or modules as you see fit. Make sure you adhere to the competition rules. Check out the guidelines for [allowed submissions](https://github.com/mlcommons/algorithmic-efficiency/blob/main/RULES.md#allowed-submissions), [disallowed submissions](https://github.com/mlcommons/algorithmic-efficiency/blob/main/RULES.md#disallowed-submissions) and pay special attention to the [software dependencies rule](https://github.com/mlcommons/algorithmic-efficiency/blob/main/RULES.md#software-dependencies).
3. Add a tuning configuration, e.g. a `tuning_search_space.json` file, to your submission directory. For the tuning search space you can either:
   1. Define the set of feasible points by defining a value for `feasible_points` for the hyperparameters:

      ```JSON
      {
        "learning_rate": {
          "feasible_points": [0.999]
        }
      }
      ```

      For a complete example see [tuning_search_space.json](https://github.com/mlcommons/algorithmic-efficiency/blob/main/reference_algorithms/target_setting_algorithms/imagenet_resnet/tuning_search_space.json).

   2. Define a range of values for quasirandom sampling by specifying `min`, `max` and `scaling` keys for the hyperparameter:

      ```JSON
      {
        "weight_decay": {
          "min": 5e-3,
          "max": <max>,
          "scaling": <scaling>
        }
      }
      ```

      For a complete example see [tuning_search_space.json](https://github.com/mlcommons/algorithmic-efficiency/blob/main/baselines/nadamw/tuning_search_space.json).
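
After these steps your submission directory contains both the submission module and the tuning search space definition. A quick sanity check (the directory name is the example used above):

```bash
# You should see submission.py and tuning_search_space.json listed.
ls submissions/my_submissions
```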

## Run your submission

From your virtual environment or an interactively running Docker container, run your submission with `submission_runner.py`:

**JAX**: to score your submission on a workload, from the `algorithmic-efficiency` directory run:

```bash
python3 submission_runner.py \
--framework=jax \
--workload=<workload> \
--experiment_dir=<path_to_experiment_dir> \
--experiment_name=<experiment_name> \
--submission_path=<path_to_submission_module> \
--tuning_search_space=<path_to_tuning_search_space>
```

**Pytorch**: to score your submission on a workload, from the `algorithmic-efficiency` directory run:

```bash
python3 submission_runner.py \
--framework=pytorch \
--workload=<workload> \
--experiment_dir=<path_to_experiment_dir> \
--experiment_name=<experiment_name> \
--submission_path=<path_to_submission_module> \
--tuning_search_space=<path_to_tuning_search_space>
```

### Pytorch DDP

We recommend using PyTorch's [Distributed Data Parallel (DDP)](https://pytorch.org/tutorials/intermediate/ddp_tutorial.html)
when using multiple GPUs on a single node. You can initialize DDP with `torchrun`.
For example, on a single host with 8 GPUs, simply replace `python3` in the above command by:

```bash
torchrun --redirects 1:0,2:0,3:0,4:0,5:0,6:0,7:0 --standalone --nnodes=1 --nproc_per_node=N_GPUS
```

So the complete command is:

```bash
torchrun --redirects 1:0,2:0,3:0,4:0,5:0,6:0,7:0 \
--standalone \
--nnodes=1 \
--nproc_per_node=N_GPUS \
submission_runner.py \
--framework=pytorch \
--workload=<workload> \
--experiment_dir=<path_to_experiment_dir> \
--experiment_name=<experiment_name> \
--submission_path=<path_to_submission_module> \
--tuning_search_space=<path_to_tuning_search_space>
```

### Run your submission in a Docker container

The container entrypoint script provides the following flags:

- `--dataset` dataset: can be 'imagenet', 'fastmri', 'librispeech', 'criteo1tb', 'wmt', or 'ogbg'. Setting this flag will download data if `~/data/<dataset>` does not exist on the host machine. Required for running a submission.
- `--framework` framework: can be either 'pytorch' or 'jax'. If you just want to download data, this flag is required for `-d imagenet` since we have two versions of data for imagenet. This flag is also required for running a submission.
- `--submission_path` submission_path: path to submission file on container filesystem. If this flag is set, the container will run a submission, so it is required for running a submission.
- `--tuning_search_space` tuning_search_space: path to file containing tuning search space on container filesystem. Required for running a submission.
- `--experiment_name` experiment_name: name of experiment. Required for running a submission.
- `--workload` workload: can be 'imagenet_resnet', 'imagenet_jax', 'librispeech_deepspeech', 'librispeech_conformer', 'ogbg', 'wmt', 'fastmri' or 'criteo1tb'. Required for running a submission.
- `--max_global_steps` max_global_steps: maximum number of steps to run the workload for. Optional.
- `--keep_container_alive` : can be true or false. If `true`, the container will not be killed automatically. This is useful for developing or debugging.

To run the Docker container that will run the submission runner, run:

```bash
docker run -t -d \
-v $HOME/data/:/data/ \
-v $HOME/experiment_runs/:/experiment_runs \
-v $HOME/experiment_runs/logs:/logs \
-v $HOME/algorithmic-efficiency:/algorithmic-efficiency \
--gpus all \
--ipc=host \
<docker_image_name> \
--dataset <dataset> \
--framework <framework> \
--submission_path <submission_path> \
--tuning_search_space <tuning_search_space> \
--experiment_name <experiment_name> \
--workload <workload> \
--keep_container_alive <keep_container_alive>
```

This will print the container ID to the terminal.

#### Docker Tips

To find the container IDs of running containers:

```bash
docker ps
```

To see the output of the entrypoint script:

```bash
docker logs <container_id>
```

To enter a bash session in the container:

```bash
docker exec -it <container_id> /bin/bash
```
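
If you started the container with `--keep_container_alive true`, you can stop and remove it once you are done (standard Docker commands):

```bash
# Stop and remove the container when you are finished with it.
docker stop <container_id>
docker rm <container_id>
```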

## Score your submission

To produce a performance profile and performance table:

```bash
python3 scoring/score_submission.py --experiment_path=<path_to_experiment_dir> --output_dir=<output_dir>
```

We provide the scores and performance profiles for the baseline algorithms in the "Baseline Results" section in [Benchmarking Neural Network Training Algorithms](https://arxiv.org/abs/2306.07179).

## Good Luck
78 changes: 58 additions & 20 deletions README.md

[MLCommons Algorithmic Efficiency](https://mlcommons.org/en/groups/research-algorithms/) is a benchmark and competition measuring neural network training speedups due to algorithmic improvements in both training algorithms and models. This repository holds the [competition rules](RULES.md) and the benchmark code to run it. For a detailed description of the benchmark design, see our [paper](https://arxiv.org/abs/2306.07179).

## Table of Contents

- [Table of Contents](#table-of-contents)
- [Installation](#installation)
  - [Python virtual environment](#python-virtual-environment)
  - [Docker](#docker)
    - [Building Docker Image](#building-docker-image)
    - [Running Docker Container (Interactive)](#running-docker-container-interactive)
    - [Running Docker Container (End-to-end)](#running-docker-container-end-to-end)
  - [Using Singularity/Apptainer instead of Docker](#using-singularityapptainer-instead-of-docker)
- [Getting Started](#getting-started)
  - [Running a workload](#running-a-workload)
    - [JAX](#jax)
    - [Pytorch](#pytorch)
- [Rules](#rules)
- [Contributing](#contributing)
- [Disclaimers](#disclaimers)
- [FAQs](#faqs)
- [Citing AlgoPerf Benchmark](#citing-algoperf-benchmark)
- [Shared data pipelines between JAX and PyTorch](#shared-data-pipelines-between-jax-and-pytorch)
- [Setup and Platform](#setup-and-platform)
  - [My machine only has one GPU. How can I use this repo?](#my-machine-only-has-one-gpu-how-can-i-use-this-repo)
  - [How do I run this on my SLURM cluster?](#how-do-i-run-this-on-my-slurm-cluster)
  - [How can I run this on my AWS/GCP/Azure cloud project?](#how-can-i-run-this-on-my-awsgcpazure-cloud-project)
- [Submissions](#submissions)
  - [Can submission be structured using multiple files?](#can-submission-be-structured-using-multiple-files)
  - [Can I install custom dependencies?](#can-i-install-custom-dependencies)
  - [How can I know if my code can be run on benchmarking hardware?](#how-can-i-know-if-my-code-can-be-run-on-benchmarking-hardware)
  - [Are we allowed to use our own hardware to self-report the results?](#are-we-allowed-to-use-our-own-hardware-to-self-report-the-results)




## Installation

You can install this package and its dependencies in a [Python virtual environment](#python-virtual-environment) or use a [Docker/Singularity/Apptainer container](#docker) (recommended).

*TL;DR to install the Jax version for GPU run:*

```bash
pip3 install -e '.[pytorch_cpu]'
pip3 install -e '.[jax_gpu]' -f 'https://storage.googleapis.com/jax-releases/jax_cuda_releases.html'
pip3 install -e '.[full]'
```

*TL;DR to install the PyTorch version for GPU run:*

```bash
pip3 install -e '.[jax_cpu]'
pip3 install -e '.[pytorch_gpu]' -f 'https://download.pytorch.org/whl/torch_stable.html'
pip3 install -e '.[full]'
```

### Python virtual environment

Note: Python minimum requirement >= 3.8
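
You can check which Python version your environment uses with:

```bash
# Verify the Python version (3.8 or newer is required).
python3 --version
```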

To set up a virtual environment and install this repository:

1. Create new environment, e.g. via `conda` or `virtualenv`

```bash
python3 -m venv env
source env/bin/activate
```

2. Clone this repository and install it along with its dependencies (see the installation commands above).

<details>
<summary>
Per-workload installation
</summary>
You can install the dependencies for individual workloads, or all workloads at once via
```bash
pip3 install -e '.[full]'
```

</details>

### Docker

We recommend using a Docker container to ensure a similar environment to our scoring and testing environments.
Alternatively, a Singularity/Apptainer container can also be used (see instructions below).

**Prerequisites for NVIDIA GPU setup**: You may have to install the NVIDIA Container Toolkit so that the containers can locate the NVIDIA drivers and GPUs.
See instructions [here](https://github.com/NVIDIA/nvidia-docker).
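
A quick way to verify that Docker can access the GPUs after installing the toolkit (the CUDA image tag below is only an example; any CUDA image with `nvidia-smi` works):

```bash
# Should print the same GPU table as running nvidia-smi on the host.
docker run --rm --gpus all nvidia/cuda:11.8.0-base-ubuntu22.04 nvidia-smi
```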

#### Building Docker Image

1. Clone this repository

```bash
cd ~ && git clone https://github.com/mlcommons/algorithmic-efficiency.git
```

2. Build Docker Image

```bash
cd algorithmic-efficiency/docker
docker build -t <docker_image_name> . --build-arg framework=<framework>
```

The `framework` flag can be either `pytorch`, `jax` or `both`. Specifying the framework will install the framework-specific dependencies.
The `docker_image_name` is arbitrary.
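
For example, to build an image with the JAX dependencies (the tag `algoperf_jax` is an arbitrary example):

```bash
# Build a JAX image; the tag is a placeholder of your choosing.
cd algorithmic-efficiency/docker
docker build -t algoperf_jax . --build-arg framework=jax
```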

#### Running Docker Container (Interactive)

To use the Docker container as an interactive virtual environment, you can run a container mounted to your local data and code directories and execute the `bash` program. This may be useful if you are in the process of developing a submission.

1. Run a detached Docker container. The `container_id` will be printed if the container runs successfully.

```bash
docker run -t -d \
-v $HOME/data/:/data/ \
-v $HOME/experiment_runs/:/experiment_runs \
-v $HOME/algorithmic-efficiency:/algorithmic-efficiency \
--gpus all \
--ipc=host \
<docker_image_name> \
--keep_container_alive true

# 2. Open a bash terminal in the running container:
docker exec -it <container_id> /bin/bash
```

#### Running Docker Container (End-to-end)

To run a submission end-to-end in a containerized environment, see the [Getting Started Document](./GETTING_STARTED.md#run-your-submission-in-a-docker-container).

### Using Singularity/Apptainer instead of Docker
You can build an Apptainer/Singularity image from the provided Dockerfile (see the `docker/` directory). To start a shell session with GPU support from a built image, run:

```bash
singularity shell --nv <singularity_image_name>.sif
```
Similarly to Docker, Apptainer allows you to bind specific paths of the host system into the container by specifying the `--bind` flag, as explained [here](https://docs.sylabs.io/guides/3.7/user-guide/bind_paths_and_mounts.html).
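
For example, to make the host's data directory available inside the container (illustrative paths):

```bash
# Bind $HOME/data on the host to /data in the container and start a
# GPU-enabled shell.
singularity shell --nv --bind $HOME/data:/data <singularity_image_name>.sif
```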

## Getting Started

For instructions on developing and scoring your own algorithm in the benchmark, see the [Getting Started Document](./GETTING_STARTED.md).

### Running a workload

To run a submission directly by running a Docker container, see the [Getting Started Document](./GETTING_STARTED.md#run-your-submission-in-a-docker-container).

From your virtual environment or an interactively running Docker container, run:

#### JAX

```bash
python3 submission_runner.py \
--workload=mnist \
--experiment_dir=$HOME/experiments \
--experiment_name=my_first_experiment \
--submission_path=baselines/adamw/jax/submission.py \
--tuning_search_space=baselines/adamw/tuning_search_space.json
```

#### Pytorch

```bash
python3 submission_runner.py \
--framework=pytorch \
--workload=mnist \
--experiment_dir=$HOME/experiments \
--experiment_name=my_first_experiment \
--submission_path=baselines/adamw/pytorch/submission.py \
--tuning_search_space=baselines/adamw/tuning_search_space.json
```

<details>
<summary>
Using Pytorch DDP (Recommended)
</summary>

We recommend using PyTorch's [Distributed Data Parallel (DDP)](https://pytorch.org/tutorials/intermediate/ddp_tutorial.html) when using multiple GPUs on a single node. To launch the submission runner with DDP, replace `python3` in the command above with:

```bash
torchrun --standalone --nnodes=1 --nproc_per_node=N_GPUS
```

where `N_GPUS` is the number of available GPUs on the node. To only see output from the first process, you can run the following to redirect the output from processes 1-7 to a log file:

```bash
torchrun --redirects 1:0,2:0,3:0,4:0,5:0,6:0,7:0 --standalone --nnodes=1 --nproc_per_node=8
```

So the complete command is, for example:

```bash
torchrun --redirects 1:0,2:0,3:0,4:0,5:0,6:0,7:0 --standalone --nnodes=1 --nproc_per_node=8 \
submission_runner.py \
--framework=pytorch \
--workload=mnist \
--experiment_dir=$HOME/experiments \
--experiment_name=my_first_experiment \
--submission_path=baselines/adamw/pytorch/submission.py \
--tuning_search_space=baselines/adamw/tuning_search_space.json
```

</details>

## Rules

The rules for the MLCommons Algorithmic Efficiency benchmark can be found in the separate [rules document](RULES.md). Suggestions, clarifications and questions can be raised via pull requests.

## Contributing

If you are interested in contributing to the work of the working group, feel free to [join the weekly meetings](https://mlcommons.org/en/groups/research-algorithms/) or open issues. See our [CONTRIBUTING.md](CONTRIBUTING.md) for MLCommons contributing guidelines and setup and workflow instructions.

