Merge pull request #474 from mlcommons/dev

Dev into main

znado authored Aug 11, 2023
2 parents 5577b32 + fecb64b commit aa0d692
Showing 13 changed files with 339 additions and 356 deletions.
45 changes: 24 additions & 21 deletions CONTRIBUTING.md
@@ -61,7 +61,7 @@ Other resources:


## Pre-built Images on Google Cloud Container Registry
- If you'd like to maintain or use images stored on our Google Cloud Container Registry read this section.
+ If you want to maintain or use images stored on our Google Cloud Container Registry, read this section.
You will have to use an authentication helper to set up permissions to access the repository:
```
ARTIFACT_REGISTRY_URL=us-central1-docker.pkg.dev
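# The remaining setup commands are collapsed in this view. A typical
# continuation (an assumption, not verbatim from the file) registers
# gcloud as a Docker credential helper for the registry:
gcloud auth configure-docker ${ARTIFACT_REGISTRY_URL}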
@@ -82,6 +82,9 @@ Currently maintained images on the repository are:
- `algoperf_pytorch_dev`
- `algoperf_both_dev`

+ To reference the pulled image you will have to use the full `image_path`, e.g.
+ `us-central1-docker.pkg.dev/training-algorithms-external/mlcommons-docker-repo/algoperf_jax_main`.
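For example, to pull the maintained JAX image once authentication is set up:

```bash
docker pull us-central1-docker.pkg.dev/training-algorithms-external/mlcommons-docker-repo/algoperf_jax_main
```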

### Trigger rebuild and push of maintained images
To build and push all images (`pytorch`, `jax`, `both`) on the maintained branches (`dev`, `main`), run:
```
@@ -97,10 +100,10 @@ You can also use the above script to build images from a different branch.
```

## GCP Data and Experiment Integration
- The Docker entrypoint script can communicate with
+ The Docker entrypoint script can transfer data to and from
our GCP buckets on our internal GCP project. If
you are an approved contributor, you can get access to these resources to automatically download the datasets and upload experiment results.
- You can use these features by setting the `-i` flag (for internal collaborator) to 'true' for the Docker entrypoint script.
+ You can use these features by setting the `--internal_contributor` flag to 'true' for the Docker entrypoint script.

### Downloading Data from GCP
To run a docker container that will only download data (if it is not found on the host), run:
@@ -111,14 +114,14 @@ docker run -t -d \
-v $HOME/experiment_runs/logs:/logs \
--gpus all \
--ipc=host \
- <docker_image_name> \
- -d <dataset> \
- -f <framework> \
- -b <debugging_mode> \
- -i true
+ <image_path> \
+ --dataset <dataset> \
+ --framework <framework> \
+ --keep_container_alive <keep_container_alive> \
+ --internal_contributor true
```
- If debugging_mode is `true` the main process on the container will persist after finishing the data download.
- This run command is useful if you manually want to run a sumbission or look around.
+ If `keep_container_alive` is `true`, the main process on the container will persist after finishing the data download.
+ This run command is useful if you are developing or debugging.
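For illustration, a download-only run with the placeholders filled in might look as follows; the dataset and framework values are examples, and the image path is the maintained JAX image from above:

```bash
docker run -t -d \
-v $HOME/data/:/data/ \
-v $HOME/experiment_runs/:/experiment_runs \
-v $HOME/experiment_runs/logs:/logs \
--gpus all \
--ipc=host \
us-central1-docker.pkg.dev/training-algorithms-external/mlcommons-docker-repo/algoperf_jax_main \
--dataset librispeech \
--framework jax \
--keep_container_alive false \
--internal_contributor true
```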

### Saving Experiments to GCP
If you set the internal collaborator mode to true
@@ -132,15 +135,15 @@ docker run -t -d \
-v $HOME/experiment_runs/logs:/logs \
--gpus all \
--ipc=host \
- <docker_image_name> \
- -d <dataset> \
- -f <framework> \
- -s <submission_path> \
- -t <tuning_search_space> \
- -e <experiment_name> \
- -w <workload> \
- -b <debug_mode>
- -i true \
+ <image_path> \
+ --dataset <dataset> \
+ --framework <framework> \
+ --submission_path <submission_path> \
+ --tuning_search_space <tuning_search_space> \
+ --experiment_name <experiment_name> \
+ --workload <workload> \
+ --keep_container_alive <keep_container_alive> \
+ --internal_contributor true
```
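Similarly, a full training run with every placeholder filled in might look like this; all values, including the submission and tuning-file paths, are illustrative:

```bash
docker run -t -d \
-v $HOME/data/:/data/ \
-v $HOME/experiment_runs/:/experiment_runs \
-v $HOME/experiment_runs/logs:/logs \
--gpus all \
--ipc=host \
us-central1-docker.pkg.dev/training-algorithms-external/mlcommons-docker-repo/algoperf_jax_main \
--dataset librispeech \
--framework jax \
--submission_path submissions/my_submission.py \
--tuning_search_space submissions/tuning_search_space.json \
--experiment_name my_experiment \
--workload librispeech_conformer \
--keep_container_alive false \
--internal_contributor true
```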

## Getting Information from a Container
@@ -171,8 +174,8 @@ docker run -t -d \
-v $HOME/algorithmic-efficiency:/algorithmic-efficiency \
--gpus all \
--ipc=host \
- <docker_image_name> \
- -b <debug_mode>
+ <image_path> \
+ --keep_container_alive true
```
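Once the container is up, standard Docker commands can be used to inspect it, for example:

```bash
docker ps                                 # list running containers and find the container_id
docker logs <container_id>                # view entrypoint output so far
docker exec -it <container_id> /bin/bash  # open an interactive shell
```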

# Submitting PRs
22 changes: 11 additions & 11 deletions README.md
@@ -23,9 +23,8 @@
[MLCommons Algorithmic Efficiency](https://mlcommons.org/en/groups/research-algorithms/) is a benchmark and competition measuring neural network training speedups due to algorithmic improvements in both training algorithms and models. This repository holds the [competition rules](RULES.md) and the benchmark code to run it. For a detailed description of the benchmark design, see our [paper](https://arxiv.org/abs/2306.07179).

# Table of Contents
- - [Table of Contents](#table-of-contents)
- [AlgoPerf Benchmark Workloads](#algoperf-benchmark-workloads)
- [Installation](#installation)
- [Python Virtual Environment](#python-virtual-environment)
- [Docker](#docker)
- [Getting Started](#getting-started)
- [Rules](#rules)
@@ -51,7 +50,7 @@ You can install this package and dependencies in a [python virtual environment](#python-virtual-environment)
pip3 install -e '.[pytorch_gpu]' -f 'https://download.pytorch.org/whl/torch_stable.html'
pip3 install -e '.[full]'
```
- ## Virtual environment
+ ## Python virtual environment
Note: Python >= 3.8 is required.

To set up a virtual environment and install this repository
@@ -74,7 +73,7 @@ To set up a virtual environment and install this repository

<details>
<summary>
- Additional Details
+ Per workload installations
</summary>
You can also install the requirements for individual workloads, e.g. via
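The concrete command is collapsed in this view; a sketch of the likely form, where `librispeech_conformer` stands in for an assumed extras name:

```bash
# Install the extra dependencies for a single workload (extras name assumed).
pip3 install -e '.[librispeech_conformer]'
```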

@@ -105,15 +104,16 @@ See instructions [here](https://github.com/NVIDIA/nvidia-docker).

2. Build Docker Image
```bash
- cd `algorithmic-efficiency/docker`
- docker build -t <docker_image_name> . --build-args framework=<framework>
+ cd algorithmic-efficiency/docker
+ docker build -t <docker_image_name> . --build-arg framework=<framework>
```
- The `framework` flag can be either `pytorch`, `jax` or `both`.
+ The `framework` flag can be either `pytorch`, `jax` or `both`. Specifying the framework will install the framework-specific dependencies.
The `docker_image_name` is arbitrary.
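For example, to build the JAX variant and tag it `algoperf_jax_dev` (an arbitrary tag):

```bash
cd algorithmic-efficiency/docker
docker build -t algoperf_jax_dev . --build-arg framework=jax
```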


### Running Docker Container (Interactive)
- 1. Run detached Docker Container
+ To use the Docker container as an interactive virtual environment, you can run a container mounted to your local data and code directories and execute the `bash` program. This may be useful if you are in the process of developing a submission.
+ 1. Run detached Docker Container. The container_id will be printed if the container is run successfully.
```bash
docker run -t -d \
-v $HOME/data/:/data/ \
@@ -123,22 +123,22 @@ See instructions [here](https://github.com/NVIDIA/nvidia-docker).
--gpus all \
--ipc=host \
<docker_image_name> \
+ --keep_container_alive true
```
- This will print out a container id.
2. Open a bash terminal
```bash
docker exec -it <container_id> /bin/bash
```

### Running Docker Container (End-to-end)
- To run a submission end-to-end in a container see [Getting Started Document](./getting_started.md#run-your-submission-in-a-docker-container).
+ To run a submission end-to-end in a containerized environment, see the [Getting Started Document](./getting_started.md#run-your-submission-in-a-docker-container).

# Getting Started
For instructions on developing and scoring your own algorithm in the benchmark, see the [Getting Started Document](./getting_started.md).
## Running a workload
To run a submission directly by running a Docker container, see [Getting Started Document](./getting_started.md#run-your-submission-in-a-docker-container).

- Alternatively from a your virtual environment or interactively running Docker container `submission_runner.py` run:
+ From your virtual environment or an interactively running Docker container, run:

**JAX**
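The JAX command itself is collapsed in this view; as a rough sketch, an invocation typically passes the framework, workload, and submission details as flags (flag names and paths here are assumptions, not verbatim from the file):

```bash
python3 submission_runner.py \
    --framework=jax \
    --workload=mnist \
    --experiment_dir=$HOME/experiments \
    --experiment_name=baseline \
    --submission_path=submissions/my_submission.py \
    --tuning_search_space=submissions/tuning_search_space.json
```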

Expand Up @@ -19,14 +19,14 @@ def has_reached_validation_target(self, eval_result: Dict[str,

@property
def validation_target_value(self) -> float:
- return 0.078477
+ return 0.084952

def has_reached_test_target(self, eval_result: Dict[str, float]) -> bool:
return eval_result['test/wer'] < self.test_target_value

@property
def test_target_value(self) -> float:
- return 0.046973
+ return 0.053000

@property
def loss_type(self) -> spec.LossType:
Expand Down Expand Up @@ -67,13 +67,13 @@ def train_stddev(self):

@property
def max_allowed_runtime_sec(self) -> int:
- return 101_780 # ~28 hours
+ return 61_068 # ~17 hours

@property
def eval_period_time_sec(self) -> int:
- return 40 * 60 # 40m
+ return 24 * 60

@property
def step_hint(self) -> int:
"""Max num steps the baseline algo was given to reach the target."""
- return 133_333
+ return 80_000
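For orientation, a minimal sketch of how a training loop might consume these properties; `workload`, `train_step`, and `evaluate` are hypothetical stand-ins, not code from this commit:

```python
import time

def run_until_target(workload, train_step, evaluate) -> bool:
  """Train until the validation target is hit or the time budget expires."""
  start = time.time()
  while time.time() - start < workload.max_allowed_runtime_sec:
    train_step()
    eval_result = evaluate()  # e.g. {'validation/wer': 0.09, 'test/wer': 0.06}
    if workload.has_reached_validation_target(eval_result):
      return True  # reached the target WER within the runtime budget
  return False
```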
@@ -5,17 +5,17 @@ class BaseDeepspeechLibrispeechWorkload(workload.BaseLibrispeechWorkload):

@property
def validation_target_value(self) -> float:
- return 0.1162
+ return 0.118232

@property
def test_target_value(self) -> float:
- return 0.068093
+ return 0.073397

@property
def step_hint(self) -> int:
"""Max num steps the baseline algo was given to reach the target."""
- return 80_000
+ return 48_000

@property
def max_allowed_runtime_sec(self) -> int:
- return 92_509 # ~26 hours
+ return 55_506 # ~15.4 hours