diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md index b122372a9..33a14f83c 100644 --- a/CONTRIBUTING.md +++ b/CONTRIBUTING.md @@ -61,7 +61,7 @@ Other resources: ## Pre-built Images on Google Cloud Container Registry -If you'd like to maintain or use images stored on our Google Cloud Container Registry read this section. +If you want to maintain or use images stored on our Google Cloud Container Registry, read this section. You will have to use an authentication helper to set up permissions to access the repository: ``` ARTIFACT_REGISTRY_URL=us-central1-docker.pkg.dev @@ -82,6 +82,9 @@ Currently maintained images on the repository are: - `algoperf_pytorch_dev` - `algoperf_both_dev` +To reference the pulled image you will have to use the full `image_path`, e.g. +`us-central1-docker.pkg.dev/training-algorithms-external/mlcommons-docker-repo/algoperf_jax_main`. + ### Trigger rebuild and push of maintained images To build and push all images (`pytorch`, `jax`, `both`) on maintained branches (`dev`, `main`). ``` @@ -97,10 +100,10 @@ You can also use the above script to build images from a different branch. ``` ## GCP Data and Experiment Integration -The Docker entrypoint script can communicate with +The Docker entrypoint script can transfer data to and from our GCP buckets on our internal GCP project. If you are an approved contributor you can get access to these resources to automatically download the datasets and upload experiment results. -You can use these features by setting the `-i` flag (for internal collaborator) to 'true' for the Docker entrypoint script. +You can use these features by setting the `--internal_contributor` flag to 'true' for the Docker entrypoint script. ### Downloading Data from GCP To run a docker container that will only download data (if not found on host) @@ -111,14 +114,14 @@ docker run -t -d \ -v $HOME/experiment_runs/logs:/logs \ --gpus all \ --ipc=host \ - \ --d \ --f \ --b \ --i true + \ +--dataset \ +--framework \ +--keep_container_alive \ +--internal_contributor true ``` -If debugging_mode is `true` the main process on the container will persist after finishing the data download. -This run command is useful if you manually want to run a sumbission or look around. +If `keep_container_alive` is `true` the main process on the container will persist after finishing the data download. +This run command is useful if you are developing or debugging. ### Saving Experiments to GCP If you set the internal collaborator mode to true @@ -132,15 +135,15 @@ docker run -t -d \ -v $HOME/experiment_runs/logs:/logs \ --gpus all \ --ipc=host \ - \ --d \ --f \ --s \ --t \ --e \ --w \ --b --i true \ + \ +--dataset \ +--framework \ +--submission_path \ +--tuning_search_space \ +--experiment_name \ +--workload \ +--keep_container_alive \ +--internal_contributor true ``` ## Getting Information from a Container @@ -171,8 +174,8 @@ docker run -t -d \ -v $HOME/algorithmic-efficiency:/algorithmic-efficiency \ --gpus all \ --ipc=host \ - \ --b + \ +--keep_container_alive true ``` # Submitting PRs diff --git a/README.md b/README.md index c60efae60..54e274a6c 100644 --- a/README.md +++ b/README.md @@ -23,9 +23,8 @@ [MLCommons Algorithmic Efficiency](https://mlcommons.org/en/groups/research-algorithms/) is a benchmark and competition measuring neural network training speedups due to algorithmic improvements in both training algorithms and models. This repository holds the [competition rules](RULES.md) and the benchmark code to run it.
For a detailed description of the benchmark design, see our [paper](https://arxiv.org/abs/2306.07179). # Table of Contents -- [Table of Contents](#table-of-contents) -- [AlgoPerf Benchmark Workloads](#algoperf-benchmark-workloads) - [Installation](#installation) + - [Python Virtual Environment](#python-virtual-environment) - [Docker](#docker) - [Getting Started](#getting-started) - [Rules](#rules) @@ -51,7 +50,7 @@ You can install this package and dependences in a [python virtual environment](# pip3 install -e '.[pytorch_gpu]' -f 'https://download.pytorch.org/whl/torch_stable.html' pip3 install -e '.[full]' ``` -## Virtual environment +## Python virtual environment Note: Python minimum requirement >= 3.8 To set up a virtual enviornment and install this repository @@ -74,7 +73,7 @@ To set up a virtual enviornment and install this repository
-Additional Details +Per workload installations You can also install the requirements for individual workloads, e.g. via @@ -105,15 +104,16 @@ See instructions [here](https://github.com/NVIDIA/nvidia-docker). 2. Build Docker Image ```bash - cd `algorithmic-efficiency/docker` - docker build -t . --build-args framework= + cd algorithmic-efficiency/docker + docker build -t . --build-arg framework= ``` - The `framework` flag can be either `pytorch`, `jax` or `both`. + The `framework` flag can be either `pytorch`, `jax` or `both`. Specifying the framework will install the framework specific dependencies. The `docker_image_name` is arbitrary. ### Running Docker Container (Interactive) -1. Run detached Docker Container +To use the Docker container as an interactive virtual environment, you can run a container mounted to your local data and code directories and execute the `bash` program. This may be useful if you are in the process of developing a submission. +1. Run detached Docker Container. The container_id will be printed if the container is run successfully. ```bash docker run -t -d \ -v $HOME/data/:/data/ \ @@ -123,22 +123,22 @@ See instructions [here](https://github.com/NVIDIA/nvidia-docker). --gpus all \ --ipc=host \ + -keep_container_alive true ``` - This will print out a container id. 2. Open a bash terminal ```bash docker exec -it /bin/bash ``` ### Running Docker Container (End-to-end) -To run a submission end-to-end in a container see [Getting Started Document](./getting_started.md#run-your-submission-in-a-docker-container). +To run a submission end-to-end in a containerized environment see [Getting Started Document](./getting_started.md#run-your-submission-in-a-docker-container). # Getting Started For instructions on developing and scoring your own algorithm in the benchmark see [Getting Started Document](./getting_started.md). ## Running a workload To run a submission directly by running a Docker container, see [Getting Started Document](./getting_started.md#run-your-submission-in-a-docker-container). 
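The **JAX** command referenced just below sits outside this hunk. For orientation, the direct invocation looks roughly like the following sketch; the flag names and example paths are assumptions based on the entrypoint flags documented in getting_started.md, not values taken from this diff:

```bash
# Hedged sketch of a direct submission_runner.py call (JAX).
# Flag names and the paths below are assumptions / placeholders.
python3 submission_runner.py \
    --framework=jax \
    --workload=mnist \
    --data_dir=$HOME/data \
    --experiment_dir=$HOME/experiment_runs \
    --experiment_name=my_first_experiment \
    --submission_path=path/to/submission.py \
    --tuning_search_space=path/to/tuning_search_space.json
```
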
-Alternatively from a your virtual environment or interactively running Docker container `submission_runner.py` run: +From your virtual environment or interactively running Docker container run: **JAX** diff --git a/algorithmic_efficiency/workloads/librispeech_conformer/workload.py b/algorithmic_efficiency/workloads/librispeech_conformer/workload.py index 985f4b0eb..dc7fb912b 100644 --- a/algorithmic_efficiency/workloads/librispeech_conformer/workload.py +++ b/algorithmic_efficiency/workloads/librispeech_conformer/workload.py @@ -19,14 +19,14 @@ def has_reached_validation_target(self, eval_result: Dict[str, @property def validation_target_value(self) -> float: - return 0.078477 + return 0.084952 def has_reached_test_target(self, eval_result: Dict[str, float]) -> bool: return eval_result['test/wer'] < self.test_target_value @property def test_target_value(self) -> float: - return 0.046973 + return 0.053000 @property def loss_type(self) -> spec.LossType: @@ -67,13 +67,13 @@ def train_stddev(self): @property def max_allowed_runtime_sec(self) -> int: - return 101_780 # ~28 hours + return 61_068 # ~17 hours @property def eval_period_time_sec(self) -> int: - return 40 * 60 # 40m + return 24 * 60 @property def step_hint(self) -> int: """Max num steps the baseline algo was given to reach the target.""" - return 133_333 + return 80_000 diff --git a/algorithmic_efficiency/workloads/librispeech_deepspeech/workload.py b/algorithmic_efficiency/workloads/librispeech_deepspeech/workload.py index 7a836cf94..f9fd30b0d 100644 --- a/algorithmic_efficiency/workloads/librispeech_deepspeech/workload.py +++ b/algorithmic_efficiency/workloads/librispeech_deepspeech/workload.py @@ -5,17 +5,17 @@ class BaseDeepspeechLibrispeechWorkload(workload.BaseLibrispeechWorkload): @property def validation_target_value(self) -> float: - return 0.1162 + return 0.118232 @property def test_target_value(self) -> float: - return 0.068093 + return 0.073397 @property def step_hint(self) -> int: """Max num steps the baseline algo was given to reach the target.""" - return 80_000 + return 48_000 @property def max_allowed_runtime_sec(self) -> int: - return 92_509 # ~26 hours + return 55_506 # ~15.4 hours diff --git a/datasets/dataset_setup.py b/datasets/dataset_setup.py index bc4502a24..0227e728e 100644 --- a/datasets/dataset_setup.py +++ b/datasets/dataset_setup.py @@ -76,7 +76,6 @@ from absl import flags from absl import logging import requests -import tensorflow as tf import tensorflow_datasets as tfds from torchvision.datasets import CIFAR10 import tqdm @@ -84,9 +83,9 @@ IMAGENET_TRAIN_TAR_FILENAME = 'ILSVRC2012_img_train.tar' IMAGENET_VAL_TAR_FILENAME = 'ILSVRC2012_img_val.tar' -FASTMRI_TRAIN_TAR_FILENAME = 'knee_singlecoil_train.tar' -FASTMRI_VAL_TAR_FILENAME = 'knee_singlecoil_val.tar' -FASTMRI_TEST_TAR_FILENAME = 'knee_singlecoil_test.tar' +FASTMRI_TRAIN_TAR_FILENAME = 'knee_singlecoil_train.tar.xz' +FASTMRI_VAL_TAR_FILENAME = 'knee_singlecoil_val.tar.xz' +FASTMRI_TEST_TAR_FILENAME = 'knee_singlecoil_test.tar.xz' from algorithmic_efficiency.workloads.wmt import tokenizer from algorithmic_efficiency.workloads.wmt.input_pipeline import \ @@ -132,11 +131,11 @@ flags.DEFINE_string( 'data_dir', - None, + '~/data', 'The path to the folder where datasets should be downloaded.') flags.DEFINE_string( 'temp_dir', - '/tmp', + '/tmp/mlcommons', 'A local path to a folder where temp files can be downloaded.') flags.DEFINE_string( 'imagenet_train_url', @@ -162,6 +161,12 @@ 'Only necessary if you want this script to `wget` the FastMRI 
validation ' 'split. If not, you can supply the path to --data_dir in ' 'submission_runner.py.') +flags.DEFINE_string( + 'fastmri_knee_singlecoil_test_url', + None, + 'Only necessary if you want this script to `wget` the FastMRI test ' + 'split. If not, you can supply the path to --data_dir in ' + 'submission_runner.py.') flags.DEFINE_integer( 'num_decompression_threads', @@ -169,9 +174,11 @@ 'The number of threads to use in parallel when decompressing.') flags.DEFINE_string('framework', None, 'Can be either jax or pytorch.') -flags.DEFINE_boolean('train_tokenizer', True, 'Train Librispeech tokenizer.') + FLAGS = flags.FLAGS +os.environ["CUDA_VISIBLE_DEVICES"] = "-1" + def _maybe_mkdir(d): if not os.path.exists(d): @@ -193,9 +200,15 @@ def _maybe_prompt_for_deletion(paths, interactive_deletion): logging.info('Skipping deletion.') -def _download_url(url, data_dir): +def _download_url(url, data_dir, name=None): + data_dir = os.path.expanduser(data_dir) - file_path = os.path.join(data_dir, url.split('/')[-1]) + if not name: + file_path = os.path.join(data_dir, url.split('/')[-1]) + else: + file_path = os.path.join(data_dir, name) + print(f"about to download to {file_path}") + response = requests.get(url, stream=True, timeout=600) total_size_in_bytes = int(response.headers.get('Content-length', 0)) total_size_in_mib = total_size_in_bytes / (2**20) @@ -282,61 +295,85 @@ def download_cifar(data_dir, framework): raise ValueError('Invalid value for framework: {}'.format(framework)) +def extract_filename_from_url(url, start_str='knee', end_str='.xz'): + """The URL filenames are sometimes wrapped in a urldefense/AWS access id string. + Querying the Content-Disposition header via requests fails (it is not provided), + so the filename is located by searching within the URL itself. + """ + failure = -1 + start = url.find(start_str) + end = url.find(end_str) + if failure in (start, end): + raise ValueError( + f"Unable to locate filename wrapped in {start_str}--{end_str} in {url}") + end += len(end_str) # make it inclusive + return url[start:end] + + def download_fastmri(data_dir, fastmri_train_url, fastmri_val_url, fastmri_test_url): data_dir = os.path.join(data_dir, 'fastmri') - # Download fastmri train dataset + knee_train_filename = extract_filename_from_url(fastmri_train_url) logging.info( 'Downloading fastmri train dataset from {}'.format(fastmri_train_url)) - _download_url(url=fastmri_train_url, data_dir=data_dir).download() + _download_url( + url=fastmri_train_url, data_dir=data_dir, name=knee_train_filename) # Download fastmri val dataset + knee_val_filename = extract_filename_from_url(fastmri_val_url) logging.info( 'Downloading fastmri val dataset from {}'.format(fastmri_val_url)) - _download_url(url=fastmri_val_url, data_dir=data_dir).download() + _download_url(url=fastmri_val_url, data_dir=data_dir, name=knee_val_filename) # Download fastmri test dataset + knee_test_filename = extract_filename_from_url(fastmri_test_url) + logging.info( 'Downloading fastmri test dataset from {}'.format(fastmri_test_url)) - _download_url(url=fastmri_test_url, data_dir=data_dir).download() + _download_url( + url=fastmri_test_url, data_dir=data_dir, name=knee_test_filename) + return data_dir def extract(source, dest): if not os.path.exists(dest): os.path.makedirs(dest) - + print(f"extracting {source} to {dest}") tar = tarfile.open(source) + print("opened tar") + tar.extractall(dest) tar.close() -def setup_fastmri(data_dir): - train_tar_file_path = os.path.join(data_dir, FASTMRI_TRAIN_TAR_FILENAME) - val_tar_file_path = 
os.path.join(data_dir, FASTMRI_VAL_TAR_FILENAME) - test_tar_file_path = os.path.join(data_dir, FASTMRI_TEST_TAR_FILENAME) +def setup_fastmri(data_dir, src_data_dir): + + train_tar_file_path = os.path.join(src_data_dir, FASTMRI_TRAIN_TAR_FILENAME) + val_tar_file_path = os.path.join(src_data_dir, FASTMRI_VAL_TAR_FILENAME) + test_tar_file_path = os.path.join(src_data_dir, FASTMRI_TEST_TAR_FILENAME) # Make train, val and test subdirectories fastmri_data_dir = os.path.join(data_dir, 'fastmri') train_data_dir = os.path.join(fastmri_data_dir, 'train') - os.makedirs(train_data_dir) + os.makedirs(train_data_dir, exist_ok=True) val_data_dir = os.path.join(fastmri_data_dir, 'val') - os.makedirsval_data_dir() + os.makedirs(val_data_dir, exist_ok=True) test_data_dir = os.path.join(fastmri_data_dir, 'test') - os.makedirs(test_data_dir) + os.makedirs(test_data_dir, exist_ok=True) # Unzip tar file into subdirectories - logging.info('Unzipping {} to {}'.format(train_tar_file_path, - fastmri_data_dir)) + logging.info('Unzipping {} to {}'.format(train_tar_file_path, train_data_dir)) extract(train_tar_file_path, train_data_dir) - logging.info('Unzipping {} to {}'.format(val_tar_file_path, fastmri_data_dir)) + logging.info('Unzipping {} to {}'.format(val_tar_file_path, val_data_dir)) extract(val_tar_file_path, val_data_dir) - logging.info('Unzipping {} to {}'.format(val_tar_file_path, fastmri_data_dir)) + logging.info('Unzipping {} to {}'.format(test_tar_file_path, test_data_dir)) extract(test_tar_file_path, test_data_dir) - logging.info('Set up imagenet dataset for jax framework complete') + logging.info('Set up fastMRI dataset complete') + print(f"extraction completed! ") def download_imagenet(data_dir, imagenet_train_url, imagenet_val_url): @@ -458,17 +495,26 @@ def download_imagenet_v2(data_dir): data_dir=data_dir).download_and_prepare() -def download_librispeech(dataset_dir, tmp_dir, train_tokenizer): +def download_librispeech(dataset_dir, tmp_dir): # After extraction the result is a folder named Librispeech containing audio # files in .flac format along with transcripts containing name of audio file # and corresponding transcription. 
- tmp_librispeech_dir = os.path.join(tmp_dir, 'LibriSpeech') + tmp_librispeech_dir = os.path.join(dataset_dir, 'librispeech') + extracted_data_dir = os.path.join(tmp_librispeech_dir, 'LibriSpeech') + final_data_dir = os.path.join(dataset_dir, 'librispeech_processed') + _maybe_mkdir(tmp_librispeech_dir) for split in ['dev', 'test']: for version in ['clean', 'other']: - wget_cmd = f'wget http://www.openslr.org/resources/12/{split}-{version}.tar.gz -O - | tar xz' # pylint: disable=line-too-long - subprocess.Popen(wget_cmd, shell=True, cwd=tmp_dir).communicate() + wget_cmd = ( + f'wget --directory-prefix={tmp_librispeech_dir} ' + f'http://www.openslr.org/resources/12/{split}-{version}.tar.gz') + subprocess.Popen(wget_cmd, shell=True).communicate() + tar_path = os.path.join(tmp_librispeech_dir, f'{split}-{version}.tar.gz') + subprocess.Popen( + f'tar xzvf {tar_path} --directory {tmp_librispeech_dir}', + shell=True).communicate() tars = [ 'raw-metadata.tar.gz', @@ -477,19 +523,23 @@ def download_librispeech(dataset_dir, tmp_dir, train_tokenizer): 'train-other-500.tar.gz', ] for tar_filename in tars: - wget_cmd = f'wget http://www.openslr.org/resources/12/{tar_filename} -O - | tar xz ' # pylint: disable=line-too-long - subprocess.Popen(wget_cmd, shell=True, cwd=tmp_dir).communicate() + wget_cmd = (f'wget --directory-prefix={tmp_librispeech_dir} ' + f'http://www.openslr.org/resources/12/{tar_filename}') + subprocess.Popen(wget_cmd, shell=True).communicate() + tar_path = os.path.join(tmp_librispeech_dir, tar_filename) + subprocess.Popen( + f'tar xzvf {tar_path} --directory {tmp_librispeech_dir}', + shell=True).communicate() + + tokenizer_vocab_path = os.path.join(extracted_data_dir, 'spm_model.vocab') - if train_tokenizer: - tokenizer_vocab_path = librispeech_tokenizer.run( - train=True, data_dir=tmp_librispeech_dir) + if not os.path.exists(tokenizer_vocab_path): + librispeech_tokenizer.run(train=True, data_dir=extracted_data_dir) - # Preprocess data. - librispeech_dir = os.path.join(dataset_dir, 'librispeech') - librispeech_preprocess.run( - input_dir=tmp_librispeech_dir, - output_dir=librispeech_dir, - tokenizer_vocab_path=tokenizer_vocab_path) + librispeech_preprocess.run( + input_dir=extracted_data_dir, + output_dir=final_data_dir, + tokenizer_vocab_path=tokenizer_vocab_path) def download_mnist(data_dir): @@ -541,21 +591,26 @@ def main(_): download_mnist(data_dir) if FLAGS.all or FLAGS.fastmri: + print(f"starting fastMRI download...\n") logging.info('Downloading FastMRI...') knee_singlecoil_train_url = FLAGS.fastmri_knee_singlecoil_train_url knee_singlecoil_val_url = FLAGS.fastmri_knee_singlecoil_val_url knee_singlecoil_test_url = FLAGS.fastmri_knee_singlecoil_test_url - if (knee_singlecoil_train_url is None or knee_singlecoil_val_url is None or - knee_singlecoil_val_url is None): + if None in (knee_singlecoil_train_url, + knee_singlecoil_val_url, + knee_singlecoil_test_url): raise ValueError( - 'Must provide both --fastmri_knee_singlecoil_{train,val}_url to ' - 'download the FastMRI dataset. Sign up for the URLs at ' + f'Must provide three --fastmri_knee_singlecoil_[train,val,test]_url to ' + 'download the FastMRI dataset.\nSign up for the URLs at ' 'https://fastmri.med.nyu.edu/.') - download_fastmri(data_dir, - tmp_dir, - knee_singlecoil_train_url, - knee_singlecoil_val_url, - knee_singlecoil_test_url) + + updated_data_dir = download_fastmri(data_dir, + knee_singlecoil_train_url, + knee_singlecoil_val_url, + knee_singlecoil_test_url) + + print(f"fastMRI download completed. 
Extracting...") + setup_fastmri(data_dir, updated_data_dir) if FLAGS.all or FLAGS.imagenet: flags.mark_flag_as_required('imagenet_train_url') @@ -577,7 +632,7 @@ def main(_): if FLAGS.all or FLAGS.librispeech: logging.info('Downloading Librispeech...') - download_librispeech(data_dir, tmp_dir, train_tokenizer=True) + download_librispeech(data_dir, tmp_dir) if FLAGS.all or FLAGS.cifar: logging.info('Downloading CIFAR...') diff --git a/datasets/librispeech_preprocess.py b/datasets/librispeech_preprocess.py index 2ce8d79ca..0968f2a00 100644 --- a/datasets/librispeech_preprocess.py +++ b/datasets/librispeech_preprocess.py @@ -9,7 +9,6 @@ import threading import time -from absl import flags from absl import logging import numpy as np import pandas as pd @@ -23,15 +22,6 @@ exists = tf.io.gfile.exists rename = tf.io.gfile.rename -flags.DEFINE_string('raw_input_dir', - '', - 'Path to the raw training data directory.') -flags.DEFINE_string('output_dir', '', 'Dir to write the processed data to.') -flags.DEFINE_string('tokenizer_vocab_path', - '', - 'Path to sentence piece tokenizer vocab file.') -FLAGS = flags.FLAGS - TRANSCRIPTION_MAX_LENGTH = 256 AUDIO_MAX_LENGTH = 320000 @@ -178,11 +168,3 @@ def run(input_dir, output_dir, tokenizer_vocab_path): 'expected count: {} vs expected {}'.format( num_entries, librispeech_example_counts[subset])) example_ids.to_csv(os.path.join(output_dir, f'{subset}.csv')) - - -def main(): - run(FLAGS.input_dir, FLAGS.output_dir, FLAGS.tokenizer_vocab_path) - - -if __name__ == '__main__': - main() diff --git a/datasets/librispeech_tokenizer.py b/datasets/librispeech_tokenizer.py index 71aa719c2..e701d59d4 100644 --- a/datasets/librispeech_tokenizer.py +++ b/datasets/librispeech_tokenizer.py @@ -8,7 +8,6 @@ import tempfile from typing import Dict -from absl import flags from absl import logging import sentencepiece as spm import tensorflow as tf @@ -21,13 +20,6 @@ Features = Dict[str, tf.Tensor] -flags.DEFINE_string('input_dir', '', 'Path to training data directory.') -flags.DEFINE_boolean( - 'train', - False, - 'Whether to train a new tokenizer or load existing one to test.') -FLAGS = flags.FLAGS - def dump_chars_for_training(data_folder, splits, maxchars: int = int(1e7)): char_count = 0 @@ -118,13 +110,15 @@ def load_tokenizer(model_filepath): def run(train, data_dir): logging.info('Data dir: %s', data_dir) + vocab_path = os.path.join(data_dir, 'spm_model.vocab') + logging.info('vocab_path = ', vocab_path) if train: logging.info('Training...') splits = ['train-clean-100'] - return train_tokenizer(data_dir, splits) + train_tokenizer(data_dir, splits, model_path=vocab_path) else: - tokenizer = load_tokenizer(os.path.join(data_dir, 'spm_model.vocab')) + tokenizer = load_tokenizer(vocab_path) test_input = 'OPEN SOURCE ROCKS' tokens = tokenizer.tokenize(test_input) detokenized = tokenizer.detokenize(tokens).numpy().decode('utf-8') @@ -135,11 +129,3 @@ def run(train, data_dir): if detokenized == test_input: logging.info('Tokenizer working correctly!') - - -def main(): - run(FLAGS.train, FLAGS.data_dir) - - -if __name__ == '__main__': - main() diff --git a/docker/Dockerfile b/docker/Dockerfile index d2d946851..d178d6bf1 100644 --- a/docker/Dockerfile +++ b/docker/Dockerfile @@ -6,12 +6,13 @@ # To build Docker image FROM nvidia/cuda:11.8.0-cudnn8-devel-ubuntu20.04 +ARG DEBIAN_FRONTEND=noninteractive # Installing machine packages RUN echo "Setting up machine" RUN apt-get update RUN apt-get install -y curl tar -RUN apt-get install -y git python3 pip wget +RUN apt-get install -y 
git python3 pip wget ffmpeg RUN apt-get install libtcmalloc-minimal4 RUN export LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libtcmalloc_minimal.so.4 diff --git a/docker/README.md b/docker/README.md deleted file mode 100644 index 7fac6df77..000000000 --- a/docker/README.md +++ /dev/null @@ -1,155 +0,0 @@ -## Docker Instructions - -### General - -#### Prerequisites -You may have to install the NVIDIA Container Toolkit so that the containers can locate the NVIDIA drivers and GPUs. - -If you are working with a GCP VM with Container Optimized OS setup, you will have to mount the NVIDIA drivers and devices on -`docker run` command (see below). - -#### Building Image - -From `algorithmic-efficiency/docker/` run: -``` -docker build -t . -``` - -#### Container Entry Point Flags -You can run a container that will download data to the host VM (if not already downloaded), run a submission or both. If you only want to download data you can run the container with just the `-d` and `-f` flags (`-f` is only required if `-d` is 'imagenet'). If you want to run a submission the `-d`, `-f`, `-s`, `-t`, `-e`, `-w` flags are all required to locate the data and run the submission script. - -The container entrypoint script provides the following flags: -- `-d` dataset: can be 'imagenet', 'fastmri', 'librispeech', 'criteo1tb', 'wmt', or 'ogbg'. Setting this flag will download data if `~/data/` does not exist on the host machine. Required for running a submission. -- `-f` framework: can be either 'pytorch' or 'jax'. If you just want to download data, this flag is required for `-d imagenet` since we have two versions of data for imagenet. This flag is also required for running a submission. -- `-s` submission_path: path to submission file on container filesystem. If this flag is set, the container will run a submission, so it is required for running a submission. -- `-t` tuning_search_space: path to file containing tuning search space on container filesystem. Required for running a submission. -- `-e` experiment_name: name of experiment. Required for running a submission. -- `-w` workload: can be 'imagenet_resnet', 'imagenet_jax', 'librispeech_deepspeech', 'librispeech_conformer', 'ogbg', 'wmt', 'fastmri' or 'criteo1tb'. Required for running a submission. -- `-m` max_steps: maximum number of steps to run the workload for. Optional. -- `-b` debugging_mode: can be true or false. If `-b ` (debugging_mode) is `true` the main process on the container will persist. - - -#### Starting container w end-to-end submission runner -To run the docker container that will download data (if not found host) and run a submisison run: -``` -docker run -t -d \ --v $HOME_DIR/data/:/data/ \ --v $HOME_DIR/experiment_runs/:/experiment_runs \ --v $HOME_DIR/experiment_runs/logs:/logs \ ---gpus all \ ---ipc=host \ - \ --d \ --f \ --s \ --t \ --e \ --w \ --b \ -``` -This will print the container ID to the terminal. -If debugging_mode is `true` the main process on the container will persist after finishing the submission runner. - - -#### Starting a container with automated data download -To run a docker container that will only download data (if not found on host): -``` -docker run -t -d \ --v $HOME_DIR/data/:/data/ \ --v $HOME_DIR/experiment_runs/:/experiment_runs \ --v $HOME_DIR/experiment_runs/logs:/logs \ ---gpus all \ ---ipc=host \ - \ --d \ --f \ --b \ -``` -If debugging_mode is `true` the main process on the container will persist after finishing the data download. 
-This run command is useful if you manually want to run a sumbission or look around. - -#### Interacting with the container -To find the container IDs of running containers run: -``` -docker ps -``` - -To see the status of the data download or submission runner run: -``` -docker logs -``` - -To enter a bash session in the container run: -``` -docker exec -it /bin/bash -``` - -## GCP Integration -If you want to run containers on GCP VMs or store and retrieve Docker images from the Google Cloud Container Registry, please read ahead. - -### Google Cloud Container Registry -If you'd like to maintain or use images stored on our Google Cloud Container Registry read this section. -You will have to use an authentication helper to set up permissions to access the repository: -``` -ARTIFACT_REGISTRY_URL=us-central1-docker.pkg.dev -gcloud auth configure-docker $ARTIFACT_REGISTRY_URL -``` - -To push built image to artifact registry on GCP do this : -``` -PROJECT=training-algorithms-external -REPO=mlcommons-docker-repo - -docker tag base_image:latest us-central1-docker.pkg.dev/$PROJECT/$REPO/base_image:latest -docker push us-central1-docker.pkg.dev/$PROJECT/$REPO/base_image:latest -``` - -To pull the latest image to GCP run: -``` -PROJECT=training-algorithms-external -REPO=mlcommons-docker-repo -docker pull us-central1-docker.pkg.dev/$PROJECT/$REPO/base_image:latest -``` - -### Setting up a Linux VM -If you'd like to use a Linux VM, you will have to install the correct GPU drivers and the NVIDIA Docker toolkit. -We recommmend to use the Deep Learning on Linux image. Further instructions are based on that. - -#### Installing GPU Drivers -You can use the `scripts/cloud-startup.sh` as a startup script for the VM. This will automate the installation of the -NVIDIA GPU Drivers and NVIDIA Docker toolkit. - -#### Authentication for Google Cloud Container Registry -To access the Google Cloud Container Registry, you will have to authenticate to the repository whenever you use Docker. -Use the gcloud credential helper as documented [here](https://cloud.google.com/artifact-registry/docs/docker/pushing-and-pulling#cred-helper). - -### Setting up a Container Optimized OS VMs on GCP -You may want use a [Container Optimized OS](https://cloud.google.com/container-optimized-os/docs) to run submissions. -However, the Container Optimized OS does not support CUDA 11.7. If you go down this route, -please adjust the base image in the Dockerfile to CUDA 11.6. -We don't guarantee compatibility of the `algorithmic_efficiency` package with CUDA 11.6 though. - -#### Installing GPU Drivers -To install NVIDIA GPU drivers on container optimized OS you can use the `cos` installer. -Follow instructions [here](https://cloud.google.com/container-optimized-os/docs/how-to/run-gpus) - -#### Authentication for Google Cloud Container Registry -To access the Google Cloud Container Registry, you will have to authenticate to the repository whenever you use Docker. -Use a standalone credential helper as documented [here](https://cloud.google.com/artifact-registry/docs/docker/pushing-and-pulling#cred-helper). - -#### cloud-init script -You can automate installation GPU Drivers and authentication for Cloud Container Registry with a cloud-init script, by passing -the content of the script as `user-data` in the VMs metadata. - - -## Other Tips and tricks - -How to avoid sudo for docker ? - -``` -sudo groupadd docker -sudo usermod -aG docker $USER -newgrp docker -``` - -Recommendation : Use a GCP CPU VM to build mlcommons docker image. 
Do not use cloudshell to build mlcommons docker images as the cloudshell provisioned machine runs out of storage diff --git a/docker/build_docker_images.sh b/docker/build_docker_images.sh index 4a5ae08dc..f3c891c6f 100644 --- a/docker/build_docker_images.sh +++ b/docker/build_docker_images.sh @@ -1,6 +1,6 @@ # Bash script to build and push dev docker images to artifact repo # Usage: -# bash build_docker_images.sh -b +# bash build_docker_images.sh -b while getopts b: flag do diff --git a/docker/scripts/startup.sh b/docker/scripts/startup.sh index c76340397..cdd2c649c 100644 --- a/docker/scripts/startup.sh +++ b/docker/scripts/startup.sh @@ -7,26 +7,107 @@ # our algorithmic-efficiency repo. To do so # set the -i flag to true. +function usage() { + cat <` does not exist on the host machine. Required for running a submission. -- `-f` framework: can be either 'pytorch' or 'jax'. If you just want to download data, this flag is required for `-d imagenet` since we have two versions of data for imagenet. This flag is also required for running a submission. -- `-s` submission_path: path to submission file on container filesystem. If this flag is set, the container will run a submission, so it is required for running a submission. -- `-t` tuning_search_space: path to file containing tuning search space on container filesystem. Required for running a submission. -- `-e` experiment_name: name of experiment. Required for running a submission. -- `-w` workload: can be 'imagenet_resnet', 'imagenet_jax', 'librispeech_deepspeech', 'librispeech_conformer', 'ogbg', 'wmt', 'fastmri' or 'criteo1tb'. Required for running a submission. -- `-m` max_steps: maximum number of steps to run the workload for. Optional. -- `-b` debugging_mode: can be true or false. If `-b ` (debugging_mode) is `true` the main process on the container will persist. +- `--dataset` dataset: can be 'imagenet', 'fastmri', 'librispeech', 'criteo1tb', 'wmt', or 'ogbg'. Setting this flag will download data if `~/data/` does not exist on the host machine. Required for running a submission. +- `--framework` framework: can be either 'pytorch' or 'jax'. If you just want to download data, this flag is required for `--dataset imagenet` since we have two versions of data for imagenet. This flag is also required for running a submission. +- `--submission_path` submission_path: path to submission file on container filesystem. If this flag is set, the container will run a submission, so it is required for running a submission. +- `--tuning_search_space` tuning_search_space: path to file containing tuning search space on container filesystem. Required for running a submission. +- `--experiment_name` experiment_name: name of experiment. Required for running a submission. +- `--workload` workload: can be 'imagenet_resnet', 'imagenet_jax', 'librispeech_deepspeech', 'librispeech_conformer', 'ogbg', 'wmt', 'fastmri' or 'criteo1tb'. Required for running a submission. +- `--max_global_steps` max_global_steps: maximum number of steps to run the workload for. Optional. +- `--keep_container_alive`: can be true or false. If `true` the container will not be killed automatically. This is useful for developing or debugging. 
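Taken together, a minimal data-download / development run of the container using only the long-form flags above might look like the sketch below (the full submission example follows next). This is a sketch only: `<docker_image_name>` and the host mount paths are placeholders, and the flag values are illustrative rather than taken from this diff.

```bash
# Hedged sketch: start a container that only downloads data (if missing) and
# stays alive for interactive use. <docker_image_name> is whatever tag you
# built or pulled; adjust the mounted host paths to your setup.
docker run -t -d \
  -v $HOME/data/:/data/ \
  -v $HOME/experiment_runs/:/experiment_runs \
  -v $HOME/experiment_runs/logs:/logs \
  --gpus all \
  --ipc=host \
  <docker_image_name> \
  --dataset ogbg \
  --framework jax \
  --keep_container_alive true
```
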
To run the docker container that will run the submission runner run: @@ -128,16 +128,15 @@ docker run -t -d \ --gpus all \ --ipc=host \ \ --d \ --f \ --s \ --t \ --e \ --w \ --b +--dataset \ +--framework \ +--submission_path \ +--tuning_search_space \ +--experiment_name \ +--workload \ +--keep_container_alive ``` This will print the container ID to the terminal. -If debugging_mode is `true` the main process on the container will persist after finishing the submission runner. #### Docker Tips #### @@ -162,5 +161,7 @@ To produce performance profile and performance table: python3 scoring/score_submission.py --experiment_path= --output_dir= ``` +We provide the scores and performance profiles for the baseline algorithms in the "Baseline Results" section in [Benchmarking Neural Network Training Algorithms](https://arxiv.org/abs/2306.07179). + ## Good Luck!