Direct Model Programming
This sets up the environment:
conda env create
conda activate dmp
conda develop .
If you're using the job queue:
pip install --editable=git+https://github.nrel.gov/mlunacek/jobqueue-pg@master#egg=jobqueue
Be sure you do not have keras installed in this environment, as it will cause runners to fail due to a namespace conflict when instantiating the DNN optimizer.
Run the tests:
pytest -s -v tests
python -u -m dmp.aspect_test "{'datasets': ['nursery'],'budgets':[500], 'topologies' : [ 'wide_first' ], 'depths' : [4], 'run_config' : { 'epochs': 10}, 'test_split': 0.1, 'reps': 1, 'mode':'direct' }"
You can add checkpointing and automatic resuming of model runs by including the "checkpoint_epochs" parameter in the run config. Set it to an integer number of epochs; the model will be checkpointed each time that many epochs have completed. By default, checkpoints are saved to the directory "./checkpoints". This can be overridden by setting the environment variable $DMP_CHECKPOINT_DIR, which can itself be overridden by the "checkpoint_dir" parameter in the config.
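For example, to keep checkpoints on scratch rather than the default ./checkpoints (the path here is only illustrative):
export DMP_CHECKPOINT_DIR=/scratch/$USER/dmp_checkpoints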
If not otherwise specified, the checkpoint is named using "run_name". When run through the job queue, the name is set to the UUID of the job being run. You can also name the checkpoint file manually by passing "jq_uuid" in the configuration.
Note:
- This feature relies on the keras-buoy package from PyPI.
- This feature is not compatible with the test_split configuration due to the way Keras stores historical losses in callback objects.
- This feature does not restore the random state, so a result from a session that has been checkpointed and resumed may not be reproducible.
python -u -m dmp.aspect_test "{'datasets': ['nursery'],'budgets':[500], 'topologies' : [ 'wide_first' ], 'depths' : [4], 'run_config' : { 'epochs': 10}, 'reps': 1, 'mode':'direct', 'checkpoint_epochs':1, 'jq_uuid':'nursery_500_widefirst_4' }"
You can enable tensorboard logging with the 'tensorboard' configuration option. Set it to the tensorboard log directory, for example:
'tensorboard':'./log/tensorboard'
To view the tensorboard logs, use the following command:
tensorboard --logdir ./log/tensorboard/
You can save a plot of each model by setting the 'plot_model' configuration option to an output directory. Note: You must install Graphviz and Pydot to use this feature.
'plot_model':'./log/plot'
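For example, model plotting can be added to the direct-mode run shown earlier (same options as above, plus 'plot_model'; this combination is a sketch, not a tested recipe):
python -u -m dmp.aspect_test "{'datasets': ['nursery'],'budgets':[500], 'topologies' : [ 'wide_first' ], 'depths' : [4], 'run_config' : { 'epochs': 10}, 'test_split': 0.1, 'reps': 1, 'mode':'direct', 'plot_model':'./log/plot' }"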
DMP supports training residual networks via the 'residual_mode' configuration option. Set it to "full" to enable residual connections; this mode is supported only for the wide_first and rectangle topologies.
'residual_mode':'full'
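For example, a residual version of the direct-mode run shown earlier (same options as above, plus 'residual_mode'; this combination is a sketch, not a tested recipe):
python -u -m dmp.aspect_test "{'datasets': ['nursery'],'budgets':[500], 'topologies' : [ 'wide_first' ], 'depths' : [4], 'run_config' : { 'epochs': 10}, 'test_split': 0.1, 'reps': 1, 'mode':'direct', 'residual_mode':'full' }"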
Tensorflow 2.6 from the conda channel now has the most important Eagle CPU optimizations compiled in, but you must set:
export TF_ENABLE_ONEDNN_OPTS=1
to activate them. Intel calls these optimizations "oneDNN", and you will see a message logged when they are activated on worker startup. I would set it either in your .bashrc or in your environment script (~/admp) as discussed below.
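For example, to set it persistently in your .bashrc:
echo 'export TF_ENABLE_ONEDNN_OPTS=1' >> ~/.bashrc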
You can check if all is well with the tensorflow configuration by running:
python -m dmp.util.check_tensorflow
This will print out diagnostic tensorflow information, including whether oneDNN optimizations are enabled.
Previously, we used this optimized tensorflow build for Eagle CPUs and GPUs:
pip install --upgrade --force-reinstall /nopt/nrel/apps/wheels/tensorflow-2.4.0-cp38-cp38-linux_x86_64.whl
Here is an example script that sets up the right modules, environment variables, and directories for the optimized tensorflow build:
#!/bin/bash
source ~/.bashrc
conda deactivate
module purge
module use /nopt/nrel/apps/modules/centos74/modulefiles/
module load gcc/7.4.0
module load cuda/10.0.130
module load cudnn/7.4.2/cuda-10.0
module load conda
conda deactivate
export TEST_TMPDIR=/scratch/ctripp/tmp
export TMPDIR=/scratch/ctripp/tmp
export PYTHONPATH=/projects/dmpapps/ctripp/src
cd /projects/dmpapps/ctripp/src
conda activate /projects/dmpapps/ctripp/env/dmp
export TF_ENABLE_ONEDNN_OPTS=1
Create a ~/admp file that contains an environment setup script like the one above.
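You can then source it before starting work (assuming the script is written to be sourced, as in the example above):
source ~/admp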
Submit a run with sbatch, for example:
sbatch -n1 -t18:00:00 --gres=gpu:1 ./srundmp.sh dmp.aspect_test.py "{'dataset': 'nursery','budgets':[262144, 524288, 1048576, 2097152, 4194304, 8388608], 'topologies' : [ 'wide_first' ] }"
Create the .jobqueue.json file in your home directory
Pipe the output of aspect_test in 'list' mode into the enqueue module (dmp.jobqueue_interface.enqueue), with the queue name and a tag name as arguments:
python -u -m dmp.aspect_test "{'mode':'list', 'dataset': 'nursery','budgets':[500, 1000], 'topologies' : [ 'wide_first' ], 'depths' : [4], 'run_config' : { 'epochs': 2}, 'test_split': 0.1, 'reps': 1, 'log':'postgres'}" \
| python -u -m dmp.jobqueue_interface.enqueue dmp test_tag
Note: this also allows you to first save the output of aspect_test to a local file for reproducibility.
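For example, the same enqueue as above in two steps (the file name is only illustrative):
python -u -m dmp.aspect_test "{'mode':'list', 'dataset': 'nursery','budgets':[500, 1000], 'topologies' : [ 'wide_first' ], 'depths' : [4], 'run_config' : { 'epochs': 2}, 'test_split': 0.1, 'reps': 1, 'log':'postgres'}" > log/test_tag.jobs
cat log/test_tag.jobs | python -u -m dmp.jobqueue_interface.enqueue dmp test_tag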
Start a jq_runner process pointing at the correct queue and tag:
python -u -m dmp.jq_runner dmp test_tag
Python Module: dmp.aspect_test
Full Config:
{ 'mode':'list', 'datasets': ['201_pol', '529_pollen', '537_houses', 'adult', 'connect_4', 'mnist', 'nursery', 'sleep', 'wine_quality_white'], 'topologies': ['rectangle', 'trapezoid', 'exponential', 'wide_first'], 'budgets': [1024,2048,4096,8192,16384,32768,65536,131072,262144,524288,1048576,2097152,4194304,8388608,16777216,33554432], 'depths': [2,3,4,5,7,8,9,10,12,14,16,18,20], 'log':'postgres' }
Residuals:
{ 'mode':'list', 'datasets': ['201_pol', '529_pollen', '537_houses', 'adult', 'connect_4', 'mnist', 'nursery', 'sleep', 'wine_quality_white'], 'topologies': ['rectangle', 'wide_first'], 'residual_modes': ['full'], 'budgets': [1024,2048,4096,8192,16384,32768,65536,131072,262144,524288,1048576,2097152,4194304,8388608,16777216,33554432], 'depths': [2,3,4,5,7,8,9,10,12,14,16,18,20], 'log':'postgres' }
Save a small test config to a log file:
python -u -m dmp.aspect_test "{ 'mode':'list', 'datasets': ['201_pol', '529_pollen', '537_houses', 'adult', 'connect_4', 'mnist', 'nursery', 'sleep', 'wine_quality_white'], 'topologies': ['rectangle', 'trapezoid', 'exponential', 'wide_first'], 'budgets': [1024,2048,4096,8192,16384,32768,65536,131072,262144,524288,1048576,2097152,4194304,8388608,16777216,33554432], 'depths': [2,3,4,5,7,8,9,10,12,14,16,18,20], 'log':'postgres' }" > log/jordan_eagle_1.jobs
Push this log file to the queue
cat log/jordan_eagle_1.jobs | python -u -m dmp.jobqueue_interface.enqueue dmp jordan_eagle_1
Queue runner in sbatch
sbatch sbatchqueuerunner.sh dmp jordan_eagle_1
Slurm nanny with Python (use a login node as the nanny)
screen
conda activate dmp
python dmp/jq_slurm.py dmp jordan_eagle_1
Monitor job progress using the database
SELECT * FROM jobqueue WHERE groupname='jordan_eagle_1' AND START_TIME IS NOT NULL;
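For example, to count how many jobs in the group have started (same table and columns as the query above):
SELECT COUNT(*) FROM jobqueue WHERE groupname='jordan_eagle_1' AND start_time IS NOT NULL;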
Monitor SLURM jobs
squeue -u username
scancel jobid
DMP has been successfully run on vermillion.hpc.nrel.gov. These notes should help you get started with Vermillion.
Add the dmpapps module directory
module use /projects/dmpapps/module/
Initialize Anaconda
module load conda/mini_py39_4.9.2
conda init
source ~/.bashrc
Load cuda module
module load cuda
Install DMP
cd $DMP_HOME
git clone [email protected]:ctripp/DMP.git
cd DMP
conda env create
pip install -e .
Install Jobqueue
cd $DMP_HOME
git clone [email protected]:mlunacek/jobqueue-pg.git
cd jobqueue-pg
pip install -e .
Copy Job Queue Config File from Eagle into your Vermillion home directory
scp eagle:.jobqueue.json ~/.jobqueue.json
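At this point you can sanity-check the TensorFlow setup with the same diagnostic used on Eagle:
python -m dmp.util.check_tensorflow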
Start an interactive session:
srun --time=30 --account=dmpapps --ntasks=36 --pty $SHELL
Then run the node manager, pointing it at the queue and tag to work on:
python -m dmp.jq.jq_node_manager dmp fixed_3k_1 "[[0,3,0,0,0], [3,6,0,0,0], [6,9,0,0,0], [9,12,0,0,0], [12,15,0,0,0], [15,18,0,0,0], [18,21,0,0,0], [21,24,0,0,0], [24,27,0,0,0], [27,30,0,0,0],[30,33,0,0,0],[33,36,0,0,0]]"