Sam Foreman
Intro to AI-driven Science on Supercomputers
2024-11-05
- Slides: https://samforeman.me/talks/ai-for-science-2024/slides
- HTML version: https://samforeman.me/talks/ai-for-science-2024
- Submit interactive job:

  ```bash
  qsub -A ALCFAITP -q by-node -l select=1 -l walltime=01:00:00,filesystems=eagle:home -I
  ```
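  Once the job starts, you will be dropped into a shell on a compute node. If it sits in the queue instead, a standard PBS check (not part of the original instructions) is:

  ```bash
  # List your queued and running jobs; the job is ready once its state shows "R"
  qstat -u "${USER}"
  ```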
- On Sophia:

  ```bash
  export HTTP_PROXY="http://proxy.alcf.anl.gov:3128"
  export HTTPS_PROXY="http://proxy.alcf.anl.gov:3128"
  export http_proxy="http://proxy.alcf.anl.gov:3128"
  export https_proxy="http://proxy.alcf.anl.gov:3128"
  export ftp_proxy="http://proxy.alcf.anl.gov:3128"
  ```
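  These proxy variables are what allow the compute node to reach the outside network (e.g. for `git clone` and `pip install`). A quick sanity check that they took effect (an optional extra, not part of the original instructions):

  ```bash
  # Should print an HTTP status line (e.g. "HTTP/2 200" or a redirect) if the proxy works
  curl -sI https://github.com | head -n 1
  ```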
- Clone repos:

  ```bash
  git clone https://github.com/saforem2/wordplay
  cd wordplay
  git clone https://github.com/saforem2/ezpz deps/ezpz
  ```
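  You should now be inside the `wordplay` repository with `ezpz` cloned under `deps/`. As a quick check (optional, not in the original instructions), the helper script sourced in the next step should exist:

  ```bash
  # Confirm the ezpz shell utilities are where the setup step expects them
  ls deps/ezpz/src/ezpz/bin/utils.sh
  ```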
- Setup python:

  ```bash
  export PBS_O_WORKDIR=$(pwd) && source deps/ezpz/src/ezpz/bin/utils.sh
  ezpz_setup_python
  ezpz_setup_job
  ```
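  `ezpz_setup_python` activates a Python environment and `ezpz_setup_job` reads the PBS job info to set launch-related variables (including, as far as I can tell, the `NGPUS` used below). A simple sanity check, not part of the original instructions:

  ```bash
  # python3 should now resolve to the freshly activated environment,
  # and NGPUS should be a positive integer
  which python3
  echo "${NGPUS}"
  ```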
- Install `{ezpz, wordplay}`:

  ```bash
  python3 -m pip install -e deps/ezpz --require-virtualenv
  python3 -m pip install -e . --require-virtualenv
  ```
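  Both packages are installed in editable mode into the active virtual environment. To confirm they landed (optional check, assuming the distributions are named `ezpz` and `wordplay`):

  ```bash
  # Print version and install location for the two editable installs
  python3 -m pip show ezpz wordplay
  ```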
- Setup (or disable) `wandb`:

  ```bash
  # to setup:
  wandb login
  # to disable:
  export WANDB_DISABLED=1
  ```
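  If you prefer to keep logging but avoid network calls during the run, `wandb` also has an offline mode (not mentioned in the original instructions); offline runs can be uploaded later with `wandb sync`:

  ```bash
  # Log locally only; metrics are written under ./wandb/ and can be synced afterwards
  export WANDB_MODE=offline
  ```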
- Test Distributed Setup:

  ```bash
  mpirun -n "${NGPUS}" python3 -m ezpz.test_dist
  ```

  See: `ezpz/test_dist.py`
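  `NGPUS` should already be exported by `ezpz_setup_job` above. If it happens to be empty, one fallback for a single Sophia node (my own workaround, not from the original instructions) is to count the GPUs directly:

  ```bash
  # Use the number of visible NVIDIA GPUs on this node as the MPI world size
  export NGPUS=$(nvidia-smi -L | wc -l)
  ```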
- Prepare Data:

  ```bash
  python3 data/shakespeare_char/prepare.py
  ```
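  This downloads and encodes the tiny Shakespeare dataset into binary files used by the training loop. A quick look at what was produced (the exact filenames are an assumption based on the nanoGPT-style layout wordplay follows):

  ```bash
  # Expect something like train.bin, val.bin, and a metadata file
  ls data/shakespeare_char/
  ```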
- Launch Training:

  ```bash
  mpirun -n "${NGPUS}" python3 -m wordplay \
      train.backend=DDP \
      train.eval_interval=100 \
      data=shakespeare \
      train.dtype=bf16 \
      model.batch_size=64 \
      model.block_size=1024 \
      train.max_iters=1000 \
      train.log_interval=10 \
      train.compile=false
  ```
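  To make the proof below easy to produce, it can help to duplicate the run's output into a logfile as well as the terminal (an optional variant, not part of the original instructions), e.g.:

  ```bash
  # Same launch command as above, with output also written to a timestamped logfile
  LOGFILE="train-$(date +%Y%m%d-%H%M%S).log"
  mpirun -n "${NGPUS}" python3 -m wordplay \
      train.backend=DDP \
      train.eval_interval=100 \
      data=shakespeare \
      train.dtype=bf16 \
      model.batch_size=64 \
      model.block_size=1024 \
      train.max_iters=1000 \
      train.log_interval=10 \
      train.compile=false \
      | tee "${LOGFILE}"
  ```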
Submit proof that you were able to successfully follow the above instructions and launch a distributed data parallel training run, where proof can be any of:

- The contents printed out to your terminal during the run
- A path to a logfile containing the output from a run on the ALCF filesystems
- A screenshot of:
  - the text printed out from the run
  - a graph from the W&B Run
  - anything that clearly shows you were able to run the example
- A URL to a W&B Run or W&B Report
- etc.