Merge pull request #474 from mlcommons/dev

Dev into main

znado authored Aug 11, 2023
2 parents 5577b32 + fecb64b commit aa0d692
Showing 13 changed files with 339 additions and 356 deletions.
45 changes: 24 additions & 21 deletions CONTRIBUTING.md
@@ -61,7 +61,7 @@ Other resources:


## Pre-built Images on Google Cloud Container Registry
- If you'd like to maintain or use images stored on our Google Cloud Container Registry read this section.
+ If you want to maintain or use images stored on our Google Cloud Container Registry, read this section.
You will have to use an authentication helper to set up permissions to access the repository:
```
ARTIFACT_REGISTRY_URL=us-central1-docker.pkg.dev
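# The remaining setup commands are collapsed in this view. A typical
# continuation (an assumption, not verbatim from the file) registers
# gcloud as a Docker credential helper for the registry:
gcloud auth configure-docker ${ARTIFACT_REGISTRY_URL}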
@@ -82,6 +82,9 @@ Currently maintained images on the repository are:
- `algoperf_pytorch_dev`
- `algoperf_both_dev`

+ To reference the pulled image you will have to use the full `image_path`, e.g.
+ `us-central1-docker.pkg.dev/training-algorithms-external/mlcommons-docker-repo/algoperf_jax_main`.
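For example, to pull the maintained JAX image once authentication is set up:

```bash
docker pull us-central1-docker.pkg.dev/training-algorithms-external/mlcommons-docker-repo/algoperf_jax_main
```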

### Trigger rebuild and push of maintained images
To build and push all images (`pytorch`, `jax`, `both`) on the maintained branches (`dev`, `main`), run:
```
@@ -97,10 +100,10 @@ You can also use the above script to build images from a different branch.
```

## GCP Data and Experiment Integration
- The Docker entrypoint script can communicate with
+ The Docker entrypoint script can transfer data to and from
our GCP buckets on our internal GCP project. If
you are an approved contributor, you can get access to these resources to automatically download the datasets and upload experiment results.
- You can use these features by setting the `-i` flag (for internal collaborator) to 'true' for the Docker entrypoint script.
+ You can use these features by setting the `--internal_contributor` flag to 'true' for the Docker entrypoint script.

### Downloading Data from GCP
To run a docker container that will only download data (if it is not found on the host), run:
@@ -111,14 +114,14 @@ docker run -t -d \
-v $HOME/experiment_runs/logs:/logs \
--gpus all \
--ipc=host \
- <docker_image_name> \
- -d <dataset> \
- -f <framework> \
- -b <debugging_mode> \
- -i true
+ <image_path> \
+ --dataset <dataset> \
+ --framework <framework> \
+ --keep_container_alive <keep_container_alive> \
+ --internal_contributor true
```
- If debugging_mode is `true` the main process on the container will persist after finishing the data download.
- This run command is useful if you manually want to run a sumbission or look around.
+ If `keep_container_alive` is `true`, the main process on the container will persist after finishing the data download.
+ This run command is useful if you are developing or debugging.
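For illustration, a download-only run with the placeholders filled in might look as follows; the dataset and framework values are examples, and the image path is the maintained JAX image from above:

```bash
docker run -t -d \
-v $HOME/data/:/data/ \
-v $HOME/experiment_runs/:/experiment_runs \
-v $HOME/experiment_runs/logs:/logs \
--gpus all \
--ipc=host \
us-central1-docker.pkg.dev/training-algorithms-external/mlcommons-docker-repo/algoperf_jax_main \
--dataset librispeech \
--framework jax \
--keep_container_alive false \
--internal_contributor true
```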

### Saving Experiments to GCP
If you set the internal collaborator mode to true
@@ -132,15 +135,15 @@ docker run -t -d \
-v $HOME/experiment_runs/logs:/logs \
--gpus all \
--ipc=host \
- <docker_image_name> \
- -d <dataset> \
- -f <framework> \
- -s <submission_path> \
- -t <tuning_search_space> \
- -e <experiment_name> \
- -w <workload> \
- -b <debug_mode>
- -i true \
+ <image_path> \
+ --dataset <dataset> \
+ --framework <framework> \
+ --submission_path <submission_path> \
+ --tuning_search_space <tuning_search_space> \
+ --experiment_name <experiment_name> \
+ --workload <workload> \
+ --keep_container_alive <keep_container_alive> \
+ --internal_contributor true
```
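Similarly, a full training run with every placeholder filled in might look like this; all values, including the submission and tuning-file paths, are illustrative:

```bash
docker run -t -d \
-v $HOME/data/:/data/ \
-v $HOME/experiment_runs/:/experiment_runs \
-v $HOME/experiment_runs/logs:/logs \
--gpus all \
--ipc=host \
us-central1-docker.pkg.dev/training-algorithms-external/mlcommons-docker-repo/algoperf_jax_main \
--dataset librispeech \
--framework jax \
--submission_path submissions/my_submission.py \
--tuning_search_space submissions/tuning_search_space.json \
--experiment_name my_experiment \
--workload librispeech_conformer \
--keep_container_alive false \
--internal_contributor true
```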

## Getting Information from a Container
@@ -171,8 +174,8 @@ docker run -t -d \
-v $HOME/algorithmic-efficiency:/algorithmic-efficiency \
--gpus all \
--ipc=host \
- <docker_image_name> \
- -b <debug_mode>
+ <image_path> \
+ --keep_container_alive true
```
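Once the container is up, standard Docker commands can be used to inspect it, for example:

```bash
docker ps                                 # list running containers and find the container_id
docker logs <container_id>                # view entrypoint output so far
docker exec -it <container_id> /bin/bash  # open an interactive shell
```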

# Submitting PRs
22 changes: 11 additions & 11 deletions README.md
@@ -23,9 +23,8 @@
[MLCommons Algorithmic Efficiency](https://mlcommons.org/en/groups/research-algorithms/) is a benchmark and competition measuring neural network training speedups due to algorithmic improvements in both training algorithms and models. This repository holds the [competition rules](RULES.md) and the benchmark code to run it. For a detailed description of the benchmark design, see our [paper](https://arxiv.org/abs/2306.07179).

# Table of Contents
- - [Table of Contents](#table-of-contents)
- [AlgoPerf Benchmark Workloads](#algoperf-benchmark-workloads)
- [Installation](#installation)
- [Python Virtual Environment](#python-virtual-environment)
- [Docker](#docker)
- [Getting Started](#getting-started)
- [Rules](#rules)
@@ -51,7 +50,7 @@ You can install this package and dependencies in a [python virtual environment](#python-virtual-environment)
pip3 install -e '.[pytorch_gpu]' -f 'https://download.pytorch.org/whl/torch_stable.html'
pip3 install -e '.[full]'
```
- ## Virtual environment
+ ## Python virtual environment
Note: Python >= 3.8 is required.

To set up a virtual environment and install this repository
@@ -74,7 +73,7 @@ To set up a virtual environment and install this repository

<details>
<summary>
- Additional Details
+ Per workload installations
</summary>
You can also install the requirements for individual workloads, e.g. via
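The concrete command is collapsed in this view; a sketch of the likely form, where `librispeech_conformer` stands in for an assumed extras name:

```bash
# Install the extra dependencies for a single workload (extras name assumed).
pip3 install -e '.[librispeech_conformer]'
```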

@@ -105,15 +104,16 @@ See instructions [here](https://github.com/NVIDIA/nvidia-docker).

2. Build Docker Image
```bash
- cd `algorithmic-efficiency/docker`
- docker build -t <docker_image_name> . --build-args framework=<framework>
+ cd algorithmic-efficiency/docker
+ docker build -t <docker_image_name> . --build-arg framework=<framework>
```
- The `framework` flag can be either `pytorch`, `jax` or `both`.
+ The `framework` flag can be either `pytorch`, `jax` or `both`. Specifying the framework will install the framework-specific dependencies.
The `docker_image_name` is arbitrary.
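For example, to build the JAX variant and tag it `algoperf_jax_dev` (an arbitrary tag):

```bash
cd algorithmic-efficiency/docker
docker build -t algoperf_jax_dev . --build-arg framework=jax
```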


### Running Docker Container (Interactive)
- 1. Run detached Docker Container
+ To use the Docker container as an interactive virtual environment, you can run a container mounted to your local data and code directories and execute the `bash` program. This may be useful if you are in the process of developing a submission.
+ 1. Run detached Docker Container. The container_id will be printed if the container is run successfully.
```bash
docker run -t -d \
-v $HOME/data/:/data/ \
@@ -123,22 +123,22 @@ See instructions [here](https://github.com/NVIDIA/nvidia-docker).
--gpus all \
--ipc=host \
<docker_image_name> \
+ --keep_container_alive true
```
- This will print out a container id.
2. Open a bash terminal
```bash
docker exec -it <container_id> /bin/bash
```

### Running Docker Container (End-to-end)
- To run a submission end-to-end in a container see [Getting Started Document](./getting_started.md#run-your-submission-in-a-docker-container).
+ To run a submission end-to-end in a containerized environment, see the [Getting Started Document](./getting_started.md#run-your-submission-in-a-docker-container).

# Getting Started
For instructions on developing and scoring your own algorithm in the benchmark, see the [Getting Started Document](./getting_started.md).
## Running a workload
To run a submission directly by running a Docker container, see [Getting Started Document](./getting_started.md#run-your-submission-in-a-docker-container).

- Alternatively from a your virtual environment or interactively running Docker container `submission_runner.py` run:
+ From your virtual environment or an interactively running Docker container, run:

**JAX**
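The JAX command itself is collapsed in this view; as a rough sketch, an invocation typically passes the framework, workload, and submission details as flags (flag names and paths here are assumptions, not verbatim from the file):

```bash
python3 submission_runner.py \
    --framework=jax \
    --workload=mnist \
    --experiment_dir=$HOME/experiments \
    --experiment_name=baseline \
    --submission_path=submissions/my_submission.py \
    --tuning_search_space=submissions/tuning_search_space.json
```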

Expand Up @@ -19,14 +19,14 @@ def has_reached_validation_target(self, eval_result: Dict[str,

@property
def validation_target_value(self) -> float:
- return 0.078477
+ return 0.084952

def has_reached_test_target(self, eval_result: Dict[str, float]) -> bool:
return eval_result['test/wer'] < self.test_target_value

@property
def test_target_value(self) -> float:
- return 0.046973
+ return 0.053000

@property
def loss_type(self) -> spec.LossType:
Expand Down Expand Up @@ -67,13 +67,13 @@ def train_stddev(self):

@property
def max_allowed_runtime_sec(self) -> int:
- return 101_780 # ~28 hours
+ return 61_068 # ~17 hours

@property
def eval_period_time_sec(self) -> int:
- return 40 * 60 # 40m
+ return 24 * 60

@property
def step_hint(self) -> int:
"""Max num steps the baseline algo was given to reach the target."""
- return 133_333
+ return 80_000
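For orientation, a minimal sketch of how a training loop might consume these properties; `workload`, `train_step`, and `evaluate` are hypothetical stand-ins, not code from this commit:

```python
import time

def run_until_target(workload, train_step, evaluate) -> bool:
  """Train until the validation target is hit or the time budget expires."""
  start = time.time()
  while time.time() - start < workload.max_allowed_runtime_sec:
    train_step()
    eval_result = evaluate()  # e.g. {'validation/wer': 0.09, 'test/wer': 0.06}
    if workload.has_reached_validation_target(eval_result):
      return True  # reached the target WER within the runtime budget
  return False
```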
@@ -5,17 +5,17 @@ class BaseDeepspeechLibrispeechWorkload(workload.BaseLibrispeechWorkload):

@property
def validation_target_value(self) -> float:
- return 0.1162
+ return 0.118232

@property
def test_target_value(self) -> float:
- return 0.068093
+ return 0.073397

@property
def step_hint(self) -> int:
"""Max num steps the baseline algo was given to reach the target."""
- return 80_000
+ return 48_000

@property
def max_allowed_runtime_sec(self) -> int:
- return 92_509 # ~26 hours
+ return 55_506 # ~15.4 hours