Hyper-parameter tuning was conducted with Weights & Biases, an experiment-tracking tool that can also orchestrate hyper-parameter sweeps and store model artifacts such as weights and logs. You can view the tuning results in this project. The visualization below highlights the hyper-parameters associated with the best-performing models on the validation set:
The purple lines correspond to the best-performing models on the validation set. An interactive version of this visualization is available in the Weights & Biases project. A list of the hyper-parameters, with a definition of each, is available in the docstring of the model object.
- lm_tune.py: contains the model definition, with entry points that allow us to change the hyper-parameters we want to tune. The Fire library is used to turn this script into a CLI.
- sweep.yaml: defines the hyper-parameter sweep for a random grid search. It is used with the sweeps feature in Weights & Biases.
- sweep_bayes.yaml: defines a hyper-parameter sweep that uses Bayesian search. It is used with the sweeps feature in Weights & Biases. We found that the bayes method worked very well, and ran it in parallel with the random grid search.
- hp_runner.sh: bash script that runs one agent per GPU in order to parallelize the hyper-parameter sweep as much as possible. Note that the sweep_id has been hardcoded, so you must change it if you wish to perform tuning. As illustrated by this script, we ran the sweep on machines that had 8 GPUs.
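To make the two sweep files above more concrete, here is a minimal sketch of what such a sweep definition might look like, written as a Python dict with the same layout that Weights & Biases reads from YAML. The metric name, the search ranges and the helper name make_sweep_config are all assumptions for illustration; they are not the contents of sweep.yaml or sweep_bayes.yaml.

```python
# Hypothetical sweep definitions. The keys mirror the Weights & Biases sweep
# layout; every range and the metric name below are illustrative assumptions,
# not the values actually used in sweep.yaml or sweep_bayes.yaml.

def make_sweep_config(method):
    """Return a sweep definition for `method` ("random" or "bayes")."""
    if method not in ("random", "bayes"):
        raise ValueError(f"unsupported search method: {method}")
    return {
        "program": "lm_tune.py",
        "method": method,
        # Bayesian search needs a metric to optimize.
        "metric": {"name": "val_loss", "goal": "minimize"},  # assumed name
        "parameters": {
            "lr": {"min": 0.0001, "max": 0.01},   # sampled from a range
            "bs": {"values": [64, 96, 128]},      # sampled from a fixed set
            "bptt": {"values": [63, 70]},
            "emb_sz": {"values": [400, 800]},
            "n_hid": {"values": [1200, 2400]},
            "n_layers": {"values": [2, 3, 4]},
        },
    }


random_sweep = make_sweep_config("random")
bayes_sweep = make_sweep_config("bayes")
```

The two definitions differ only in the method field, which is why both searches can share one training script. A dict like this can be registered with wandb.sweep(), which returns the sweep ID that the agents then consume.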
The following set of hyper-parameters performed best on the validation set. You can see the model parameters and logs here. According to the logs for this run, these are the hyper-parameters associated with the best model:
lm_tune.py --bptt=63 --bs=96 --emb_sz=800 --lr=0.0013 --n_hid=2400 --n_layers=4 --one_cycle=True --cycle_len 2
Note that you cannot simply run this command as-is: the data must first be prepared in the right directory. Please refer to the notebooks folder of this repo for a walk-through on how to train the model.
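For readers unfamiliar with Fire, the following sketch shows roughly how flags such as --n_hid=2400 can reach a training function. The function name train_lm, its default values and its body are hypothetical; the real entry point in lm_tune.py may be organized differently.

```python
def train_lm(bptt=70, bs=64, emb_sz=400, lr=1e-3, n_hid=1200,
             n_layers=3, one_cycle=True, cycle_len=1):
    """Hypothetical entry point: each keyword argument becomes a CLI flag.

    The defaults are placeholders, not the defaults used in lm_tune.py.
    """
    config = dict(bptt=bptt, bs=bs, emb_sz=emb_sz, lr=lr, n_hid=n_hid,
                  n_layers=n_layers, one_cycle=one_cycle, cycle_len=cycle_len)
    # A real script would build and train the language model here.
    return config


def main():
    # Imported lazily so train_lm stays importable without Fire installed.
    import fire
    # Fire exposes the function's parameters as --name=value flags and
    # parses each value as a Python literal (int, float, bool, ...).
    fire.Fire(train_lm)
```

Calling main() from the script's entry point would make both --cycle_len=2 and --cycle_len 2 set cycle_len to the integer 2, since Fire accepts either spelling.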
We let the tuning run for approximately one week on 24 GPUs in total (3 p3.16xlarge instances with 8 GPUs each). We ran the sweep on 20% of the training data in order to search the hyper-parameter space more quickly. The best hyper-parameters found during tuning were then used to choose the parameters for training on the full data. Please see the main README of this repo for how to locate the final model.
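The one-agent-per-GPU layout used for this run can be sketched in Python as follows. This is not a translation of hp_runner.sh: pinning each agent with CUDA_VISIBLE_DEVICES is an assumption about how the script isolates GPUs, and the function names and the sweep ID placeholder are hypothetical.

```python
import os
import subprocess


def agent_jobs(sweep_id, n_gpus=8):
    """Return one (env, argv) pair per GPU for the `wandb agent` command."""
    jobs = []
    for gpu in range(n_gpus):
        # Assumption: each agent sees exactly one GPU through this variable.
        env = dict(os.environ, CUDA_VISIBLE_DEVICES=str(gpu))
        jobs.append((env, ["wandb", "agent", sweep_id]))
    return jobs


def launch(sweep_id, n_gpus=8):
    """Start all agents on this machine and wait for them to finish."""
    procs = [subprocess.Popen(argv, env=env)
             for env, argv in agent_jobs(sweep_id, n_gpus)]
    for proc in procs:
        proc.wait()
```

Running launch() with the same sweep ID on each of the three machines would yield 3 x 8 = 24 agents pulling trials from a single sweep.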