Fixes to script that runs submissions for scoring #794

Merged Oct 16, 2024 (62 commits)

Commits
f030996
base workflow
priyakasimbeg May 3, 2024
5eabed8
fix name
priyakasimbeg May 3, 2024
e694c2e
fix
priyakasimbeg May 3, 2024
3cd70c2
fix
priyakasimbeg May 3, 2024
0ae2861
fix
priyakasimbeg May 3, 2024
a965758
modify docker to inlcude submissions
priyakasimbeg May 4, 2024
5610928
copy submissions to docker container
priyakasimbeg May 4, 2024
a8abbc4
fix
priyakasimbeg May 4, 2024
73630b2
add project name
priyakasimbeg May 4, 2024
175f78e
add project name
priyakasimbeg May 4, 2024
fb59910
add dryrun
priyakasimbeg May 7, 2024
6757da2
fix workload metadata path
priyakasimbeg May 7, 2024
546137c
Merge branch 'scoring' into scoring_team11_nadamp
priyakasimbeg May 7, 2024
02966d2
remove dryrun
priyakasimbeg May 7, 2024
6e87429
fix variable names and experiment name
priyakasimbeg May 8, 2024
4deca9a
fix workload metadata path
priyakasimbeg May 8, 2024
befc0c6
fix paths
priyakasimbeg May 8, 2024
4fd4d30
fix heldout workload path
priyakasimbeg May 8, 2024
9933e77
fix seed flag
priyakasimbeg May 8, 2024
0139919
remove local
priyakasimbeg May 8, 2024
cd6fe95
Merge branch 'scoring' into scoring_team11_nadamp
priyakasimbeg May 8, 2024
5abac0a
fix submission paths
priyakasimbeg May 8, 2024
ed80e79
=Merge branch 'scoring' into scoring_team11_nadamp
priyakasimbeg May 8, 2024
eecf143
Merge remote-tracking branch 'origin/main' into scoring
priyakasimbeg May 14, 2024
b525c44
fix paths
priyakasimbeg May 14, 2024
dbfffc7
Merge branch 'scoring' into scoring_team11_nadamp
priyakasimbeg May 14, 2024
6c58062
remove heldout workloads from run workloads
priyakasimbeg May 14, 2024
bb3b30b
increase timeout
priyakasimbeg May 14, 2024
d4c2726
Merge branch 'main' into scoring
priyakasimbeg May 14, 2024
3297023
change command to python3.8
priyakasimbeg May 14, 2024
5c71071
Merge branch 'scoring' into scoring_team11_nadamp
priyakasimbeg May 14, 2024
9174267
add env
priyakasimbeg May 14, 2024
4bc51d6
add env
priyakasimbeg May 14, 2024
cc2f764
remove heldout workload path
priyakasimbeg May 14, 2024
c69092a
remove heldoutworkloads flag
priyakasimbeg May 14, 2024
df7621e
enter env
priyakasimbeg May 14, 2024
64669a4
add check to kill container
priyakasimbeg May 23, 2024
ce3d502
fix
priyakasimbeg May 23, 2024
9b79774
add functionality to install additional requirements
priyakasimbeg May 24, 2024
30852b5
fix
priyakasimbeg May 24, 2024
0932d24
add flag for max steps
priyakasimbeg Jun 13, 2024
760c53e
add max steps flag
priyakasimbeg Jun 13, 2024
6983721
fix
priyakasimbeg Jun 13, 2024
d6b57f9
add workloads flag
priyakasimbeg Jul 4, 2024
58289d8
fix
priyakasimbeg Jul 4, 2024
ca8c52f
fix
priyakasimbeg Jul 4, 2024
01a7b68
debugging
priyakasimbeg Jul 4, 2024
84253df
debugging
priyakasimbeg Jul 4, 2024
f0bc1ee
debugging
priyakasimbeg Jul 4, 2024
8978221
debugging
priyakasimbeg Jul 4, 2024
f30ce4f
remove debugging
priyakasimbeg Jul 4, 2024
11ea68f
add safety flag to enforce explicitly enabling step budgets
priyakasimbeg Sep 19, 2024
078b5fa
fix to enable_step_percentage flag
priyakasimbeg Sep 19, 2024
4956a31
add flag for step budget
priyakasimbeg Sep 19, 2024
a0e4502
fix syntax error
priyakasimbeg Sep 19, 2024
262a9e6
Merge branch 'main' into scoring
priyakasimbeg Sep 27, 2024
ef77fc4
Merge pull request #793 from mlcommons/dev
priyakasimbeg Oct 15, 2024
790c282
Merge branch 'mlcommons:main' into scoring
priyakasimbeg Oct 15, 2024
44d1619
Merge branch 'scoring' of github.com:priyakasimbeg/algorithmic-effici…
priyakasimbeg Oct 16, 2024
3c26723
remove unwanted changes
priyakasimbeg Oct 16, 2024
a43836c
reformat
priyakasimbeg Oct 16, 2024
ce4fc77
remove duplicate run_workloads script
priyakasimbeg Oct 16, 2024
14 changes: 14 additions & 0 deletions docker/scripts/startup.sh
@@ -132,6 +132,10 @@ while [ "$1" != "" ]; do
shift
TEST=$1
;;
--additional_requirements_path)
shift
ADDITIONAL_REQUIREMENTS_PATH=$1
;;
*)
usage
exit 1
@@ -140,6 +144,16 @@ while [ "$1" != "" ]; do
shift
done


# Optionally install additional dependencies
if [[ -n ${ADDITIONAL_REQUIREMENTS_PATH+x} ]]; then
    echo "Installing additional requirements..."
    COMMAND="cd algorithmic-efficiency && pip install -r ${ADDITIONAL_REQUIREMENTS_PATH}"
    echo $COMMAND
    eval $COMMAND
fi


if [[ ${TEST} == "true" ]]; then
cd algorithmic-efficiency
COMMAND="python3 tests/test_traindiffs.py"
97 changes: 80 additions & 17 deletions scoring/run_workloads.py
@@ -9,9 +9,11 @@
--tuning_search_space <path_to_tuning_search_space_json>
"""

import datetime
import json
import os
import struct
import subprocess
import time

from absl import app
@@ -26,9 +28,11 @@
'docker_image_url',
'us-central1-docker.pkg.dev/training-algorithms-external/mlcommons-docker-repo/algoperf_jax_dev',
'URL to docker image')
flags.DEFINE_integer('run_percentage',
100,
'Percentage of max num steps to run for.')
flags.DEFINE_integer(
'run_percentage',
100,
    'Percentage of max num steps to run for. '
'Must set the flag enable_step_budget to True for this to take effect.')
flags.DEFINE_string('experiment_name',
'my_experiment',
'Name of top sub directory in experiment dir.')
@@ -83,10 +87,24 @@
'If your algorithm has a smaller per step time than our baselines '
'you may want to increase the number of steps per workload.')
flags.DEFINE_string(
'workload',
'workloads',
None,
    'String representing a comma-separated list of workload names. '
'If not None, only run this workload, else run all workloads in workload_metadata_path.'
)
flags.DEFINE_string('additional_requirements_path',
None,
'Path to requirements.txt if any.')
flags.DEFINE_integer(
'max_steps',
None,
    'Maximum number of steps to run. Must set flag enable_step_budget. '
'This flag takes precedence over the run_percentage flag.')
flags.DEFINE_bool(
'enable_step_budget',
False,
    'Must be explicitly set to override time budgets with a step budget.'
)

FLAGS = flags.FLAGS

@@ -106,15 +124,40 @@ def container_running():
return True


def kill_containers():
  docker_client = docker.from_env()
  containers = docker_client.containers.list()
  for container in containers:
    container.kill()


def gpu_is_active():
  # Returns True if any GPU reports non-zero utilization.
  output = subprocess.check_output([
      'nvidia-smi',
      '--query-gpu=utilization.gpu',
      '--format=csv,noheader,nounits'
  ])
  return any(int(x) > 0 for x in output.decode().splitlines())


def wait_until_container_not_running(sleep_interval=5 * 60):
  # Track GPU utilization; if the GPUs have been idle for more than
  # 45 minutes, kill the running containers.
  gpu_last_active = datetime.datetime.now().timestamp()

  while container_running():
    if gpu_is_active():
      gpu_last_active = datetime.datetime.now().timestamp()
    if (datetime.datetime.now().timestamp() - gpu_last_active) > 45 * 60:
      # kill_containers() takes no arguments, so log the reason separately.
      print('Killing container: GPUs have been inactive > 45 minutes...')
      kill_containers()
    time.sleep(sleep_interval)
  return
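
Note on the inactivity check: it depends on wall-clock time and on `nvidia-smi`, so it is hard to exercise without a GPU. A minimal sketch, assuming hypothetical helper names (`parse_gpu_utilization`, `IdleTracker`) that are not part of this PR, splits the output parsing and the timeout decision into pieces that can be tested with canned inputs:

```python
def parse_gpu_utilization(raw_output):
  """Parses per-GPU utilization percentages, one integer per line.

  Hypothetical helper: expects text shaped like the output of
  `nvidia-smi --query-gpu=utilization.gpu --format=csv,noheader,nounits`.
  """
  return [int(line) for line in raw_output.splitlines() if line.strip()]


class IdleTracker:
  """Hypothetical helper tracking time since any GPU was last active."""

  def __init__(self, timeout_seconds, now):
    self.timeout_seconds = timeout_seconds
    self.last_active = now

  def update(self, utilizations, now):
    """Records a sample; returns True once the idle timeout is exceeded."""
    if any(u > 0 for u in utilizations):
      self.last_active = now
    return (now - self.last_active) > self.timeout_seconds


# Usage with synthetic timestamps (seconds) instead of the real clock.
tracker = IdleTracker(timeout_seconds=45 * 60, now=0)
should_kill = tracker.update(parse_gpu_utilization('0\n0\n'), now=60)
```

Passing `now` explicitly is a design choice for testability; the loop in the PR would pass `datetime.datetime.now().timestamp()` on each iteration.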


def main(_):
framework = FLAGS.framework
run_fraction = FLAGS.run_percentage / 100.
experiment_name = FLAGS.experiment_name
docker_image_url = FLAGS.docker_image_url
submission_path = FLAGS.submission_path
@@ -132,7 +175,13 @@ def main(_):
study_end_index = FLAGS.study_end_index
else:
study_end_index = num_studies - 1

additional_requirements_path_flag = ''
if FLAGS.additional_requirements_path:
additional_requirements_path_flag = f'--additional_requirements_path {FLAGS.additional_requirements_path} '

submission_id = FLAGS.submission_id

rng_seed = FLAGS.seed

if not rng_seed:
@@ -144,17 +193,22 @@ def main(_):
with open(FLAGS.workload_metadata_path) as f:
workload_metadata = json.load(f)

# Get list of all possible workloads
workloads = [w for w in workload_metadata.keys()]

# Read held-out workloads
# Read heldout workloads
if FLAGS.held_out_workloads_config_path:
held_out_workloads = read_held_out_workloads(
FLAGS.held_out_workloads_config_path)
workloads = workloads + held_out_workloads

# Filter for single workload
if FLAGS.workload and (FLAGS.workload in workloads):
workloads = [FLAGS.workload]
# Filter workloads if explicit workloads specified
if FLAGS.workloads is not None:
workloads = list(
filter(lambda x: x in FLAGS.workloads.split(','), workloads))
if len(workloads) != len(FLAGS.workloads.split(',')):
unmatched_workloads = set(FLAGS.workloads.split(',')) - set(workloads)
raise ValueError(f'Invalid workload name {unmatched_workloads}')
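
Note on the workload filter: the new code splits `FLAGS.workloads` several times and compares lengths to detect unknown names. A minimal sketch of the same idea as a standalone function follows. The name `select_workloads` is hypothetical, and stripping whitespace around names is an added assumption that the PR does not make:

```python
def select_workloads(available, requested_csv):
  """Returns the requested subset of `available`, preserving its order.

  Hypothetical helper, not part of the PR. Raises ValueError if any
  requested name is not in `available`.
  """
  if requested_csv is None:
    return list(available)
  # Assumption: tolerate spaces after commas, e.g. 'a, b'.
  requested = {name.strip() for name in requested_csv.split(',') if name.strip()}
  unknown = sorted(requested - set(available))
  if unknown:
    raise ValueError(f'Invalid workload name(s): {unknown}')
  return [w for w in available if w in requested]


# Usage with placeholder workload names.
selected = select_workloads(['a', 'b', 'c'], 'c, a')
```

Checking the set difference directly avoids relying on list lengths, which would also misfire if the same name were passed twice.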

rng_subkeys = prng.split(rng_key, num_studies)

@@ -174,14 +228,22 @@ def main(_):
"sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches'") # clear caches
print('=' * 100)
dataset = workload_metadata[base_workload_name]['dataset']
max_steps = int(workload_metadata[base_workload_name]['max_steps'] *
run_fraction)
max_steps_flag = ''
if FLAGS.enable_step_budget:
run_fraction = FLAGS.run_percentage / 100.
if FLAGS.max_steps is None:
max_steps = int(workload_metadata[base_workload_name]['max_steps'] *
run_fraction)
else:
max_steps = FLAGS.max_steps
max_steps_flag = f'-m {max_steps}'
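
Note on flag precedence: the step budget is only passed to the container when `enable_step_budget` is set, and an explicit `max_steps` takes precedence over `run_percentage`. A minimal sketch of that decision as a pure function, assuming a hypothetical name `resolve_max_steps_flag` that does not appear in the PR:

```python
def resolve_max_steps_flag(enable_step_budget,
                           run_percentage,
                           max_steps,
                           workload_max_steps):
  """Returns the '-m <steps>' fragment, or '' when step budgets are disabled.

  Hypothetical helper mirroring the precedence described in the flag help:
  an explicit max_steps overrides the percentage of the workload's max steps.
  """
  if not enable_step_budget:
    return ''
  if max_steps is None:
    run_fraction = run_percentage / 100.
    max_steps = int(workload_max_steps * run_fraction)
  return f'-m {max_steps}'


# Usage with illustrative numbers.
flag = resolve_max_steps_flag(True, 50, None, 1000)
```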

mount_repo_flag = ''
if FLAGS.local:
mount_repo_flag = '-v $HOME/algorithmic-efficiency:/algorithmic-efficiency '
command = ('docker run -t -d -v $HOME/data/:/data/ '
'-v $HOME/experiment_runs/:/experiment_runs '
'-v $HOME/experiment_runs/logs:/logs '
mount_repo_flag = '-v /home/kasimbeg/algorithmic-efficiency:/algorithmic-efficiency '
command = ('docker run -t -d -v /home/kasimbeg/data/:/data/ '
'-v /home/kasimbeg/experiment_runs/:/experiment_runs '
'-v /home/kasimbeg/experiment_runs/logs:/logs '
f'{mount_repo_flag}'
'--gpus all --ipc=host '
f'{docker_image_url} '
@@ -190,9 +252,10 @@ def main(_):
f'-s {submission_path} '
f'-w {workload} '
f'-e {study_dir} '
f'-m {max_steps} '
f'{max_steps_flag} '
f'--num_tuning_trials {num_tuning_trials} '
f'--rng_seed {run_seed} '
f'{additional_requirements_path_flag}'
'-c false '
'-o true '
'-i true ')
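
Note on optional fragments: `max_steps_flag` and `additional_requirements_path_flag` are spliced into the command as strings that may be empty. One alternative, sketched here under the assumption of a hypothetical helper `build_startup_args` (not part of the PR), is to build an argument list and append optional flags only when they are set. The flag names `-s`, `-w`, `-m` and `--additional_requirements_path` are taken from the diff above:

```python
import shlex


def build_startup_args(submission_path,
                       workload,
                       max_steps=None,
                       additional_requirements_path=None):
  """Builds startup-script arguments, skipping unset optional flags.

  Hypothetical helper for illustration; covers only a subset of the
  arguments that the real command passes.
  """
  args = ['-s', submission_path, '-w', workload]
  if max_steps is not None:
    args += ['-m', str(max_steps)]
  if additional_requirements_path:
    args += ['--additional_requirements_path', additional_requirements_path]
  return args


# shlex.join (Python 3.8+) quotes each argument for safe use in a shell string.
fragment = shlex.join(
    build_startup_args('submissions/example.py', 'example_workload',
                       max_steps=100))
```

A list also avoids the stray double spaces that appear when an optional fragment is empty, and quoting protects paths that contain spaces.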
@@ -235,4 +298,4 @@ def main(_):

if __name__ == '__main__':
flags.mark_flag_as_required('workload_metadata_path')
app.run(main)
app.run(main)