From a82887b5df69296ffeaec440d1e8d880e80dccec Mon Sep 17 00:00:00 2001
From: Olivier Delalleau <507137+odelalleau@users.noreply.github.com>
Date: Thu, 31 Aug 2023 19:57:24 -0400
Subject: [PATCH] Update documentation of NeMo-RLHF (#110)
* Update documentation of NeMo-RLHF
* Minor doc update on RLHF IP / Port settings
---
README.md | 276 +++++++++++++++++++++++++++++++++++++-----------------
1 file changed, 191 insertions(+), 85 deletions(-)
diff --git a/README.md b/README.md
index 3df4288d54..ebd5429a00 100755
--- a/README.md
+++ b/README.md
@@ -196,15 +196,16 @@ The most recent version of the README can be found at [https://ngc.nvidia.com/co
* [5.16. Reinforcement Learning from Human Feedback](#516-reinforcement-learning-from-human-feedback)
+ [5.16.1. Reward Model Training](#5161-reward-model-training)
- [5.16.1.1 Data preprocessing](#51611-data-preprocessing)
- - [5.16.1.2 Reward Model Training](#51612-reward-model-training)
+ - [5.16.1.2 Training a Reward Model](#51612-training-a-reward-model)
- [5.16.1.3 Reward Model Evaluation](#51613-reward-model-evaluation)
+ [5.16.2. PPO Training](#5162-ppo-training)
- [5.16.2.1 Launching the Reward Model Inference Server](#51621-launching-the-reward-model-inference-server)
- [5.16.2.2 Launching the Initial Policy Inference Server](#51622-launching-the-initial-policy-inference-server)
- [5.16.2.3 Launching the PPO Critic Training and Inference Server](#51623-launching-the-ppo-critic-training-and-inference-server)
- [5.16.2.4 Launching the PPO Actor Training](#51624-launching-the-ppo-actor-training)
- - [5.16.2.5 Launching every job at once with SLURM](#51625-launching-every-job-at-once-with-slurm)
- - [5.16.2.6 PPO Hyper-parameters](#51626-ppo-hyper-parameters)
+ - [5.16.2.5 Launching all jobs at once with SLURM](#51625-launching-all-jobs-at-once-with-slurm)
+ - [5.16.2.6 Ensuring consistency between jobs](#51626-ensuring-consistency-between-jobs)
+ - [5.16.2.7 PPO Hyper-parameters](#51627-ppo-hyper-parameters)
+ [5.16.3. Future Work](#5163-future-work)
* [5.17 Curating pretraining datasets with the NeMo Data Curator](#517-curating-pretraining-datasets-with-the-nemo-data-curator)
- [6. Deploying the NeMo Megatron Model](#6-deploying-the-nemo-megatron-model)
@@ -5046,9 +5047,9 @@ For finetuning dialogue dataset, we just need to add one extra configuration lin
NeMo-RLHF is a library to fine-tune LLMs using Reinforcement Learning from Human Feedback (RLHF) in a scalable and fully distributed manner.
-NeMo-RLHF supports only GPT models and implements the Proximal Policy Optimization (PPO) algorithm. Support for other models and RL algorithms will be added in future releases. Furthermore, NeMo-RLHF is not currently integrated into NeMo-Megatron-Launcher, so the RLHF jobs must be launched directly from the NeMo-RLHF repository in `/opt/nemo-rlhf`, which should be copied to the local file system in the login node.
+NeMo-RLHF supports only GPT models and implements the Proximal Policy Optimization (PPO) algorithm. Support for other models and RL algorithms will be added in future releases. Furthermore, NeMo-RLHF is not currently integrated into NeMo-Megatron-Launcher, so the RLHF jobs must be launched directly from the NeMo-RLHF repository in `/opt/nemo-rlhf`, which should be copied to the local file system on the login node.
-We provide configurations to try RLHF on the newly released 2B GPT model with 4096 sequence length [available on HuggingFace](https://huggingface.co/nvidia/GPT-2B-001). We recommend users use the Anthropic HH-RLHF or the Stack Exchange Preferences datasets to get started.
+We provide configurations to try RLHF on the newly released 2B GPT model with 4096 sequence length [available on HuggingFace](https://huggingface.co/nvidia/GPT-2B-001). We recommend using the [Anthropic HH-RLHF](https://huggingface.co/datasets/Anthropic/hh-rlhf) or the [Stack Exchange Preferences](https://huggingface.co/datasets/HuggingFaceH4/stack-exchange-preferences) datasets to get started.
#### 5.16.1. Reward Model Training
@@ -5058,19 +5059,39 @@ NeMo-RLHF can be used to train your own reward model. The reward model is traine
##### 5.16.1.1 Data preprocessing
-With your own or publicly available data, start by processing them into a jsonl format. This is where prefixes should be inserted. Then use the `preprocess_data_for_megatron.py` script to convert this jsonl format into the NeMo format. Format your pairwise comparison dataset with the following structure:
+With your own or publicly available data, start by processing them into a jsonl format.
+This is where you should format the prompt based on your specific needs and model. For instance, if your original data looks like
+```
+Human: Give me a tasty apple pie recipe
+AI: Sure! Here's how my grandma used to cook an awesome apple pie: (...)
+```
+then you may, for instance, turn it into
+```
+Setting:
+You are a helpful assistant that responds concisely.
+
+User:
+Give me a tasty apple pie recipe
+
+Assistant:
+Sure! Here's how my grandma used to cook an awesome apple pie: (...)
+```
+
+Format your pairwise comparison dataset with the following structure:
```
-{“text”: prompt1+good_response_1}
-{“text”: prompt1+bad_response_1}
-{“text”: prompt2+good_response_2}
-{“text”: prompt2+bad_response_2}
+{"text": prompt1+good_response_1}
+{"text": prompt1+bad_response_1}
+{"text": prompt2+good_response_2}
+{"text": prompt2+bad_response_2}
...
```
-where 1 and 2 are different prompts. Note that for the same prompt, prompt+good_response must come before prompt+bad_response in the dataset.
+where 1 and 2 are different prompts. Note that for the same prompt, prompt+good_response must come before prompt+bad_response in the dataset you generate.
+If you have prompts with more than two responses, you currently need to convert them into pairwise preferences (i.e., generate multiple pairs sharing the same prompt).
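+
+For illustration, here is a minimal Python sketch that writes such a pairwise .jsonl file. The input layout (records with "prompt", "chosen" and "rejected" fields) and the output file name are made-up examples; adapt them to your own data:
+
+```python
+# Minimal sketch: write the pairwise comparison .jsonl expected for reward model training.
+# The field names ("prompt", "chosen", "rejected") and file name are illustrative only.
+import json
+
+def write_pairwise_jsonl(records, output_path):
+    with open(output_path, "w") as f:
+        for rec in records:
+            prompt = rec["prompt"]
+            # For a given prompt, the good response must come right before the bad one.
+            f.write(json.dumps({"text": prompt + rec["chosen"]}) + "\n")
+            f.write(json.dumps({"text": prompt + rec["rejected"]}) + "\n")
+            # If a prompt has more than two ranked responses, emit one such
+            # (good, bad) pair per preferred/rejected combination instead.
+
+if __name__ == "__main__":
+    records = [{
+        "prompt": "User:\nGive me a tasty apple pie recipe\n\nAssistant:\n",
+        "chosen": "Sure! Here's how my grandma used to cook an awesome apple pie: (...)",
+        "rejected": "I don't know.",
+    }]
+    write_pairwise_jsonl(records, "rm_train.jsonl")
+```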
-For reference we used the following command for preprocessing the dataset using the SentencePiece tokenizer.
+Then use the `preprocess_data_for_megatron.py` script to convert this jsonl format into the NeMo format.
+For reference, we used the following command to preprocess the dataset with the SentencePiece tokenizer:
```bash
python3 /opt/NeMo/scripts/nlp_language_modeling/preprocess_data_for_megatron.py \
@@ -5084,12 +5105,12 @@ python3 /opt/NeMo/scripts/nlp_language_modeling/preprocess_data_for_megatron.py
--chunk_size=100 \
--append-eod
```
-Which will generate files with `output_document.bin` and `output_document.idx` to use for reward model training.
+This generates files `output_text_document.bin` and `output_text_document.idx` to use for reward model training, described below.
-##### 5.16.1.2 Reward Model Training
-
+##### 5.16.1.2 Training a Reward Model
+
-To launch reward model training we first need to start with a pre-trained or fine-tuned nemo checkpoint. Our `training_rm.yaml` file has default configurations for the 2B model but feel free to use any model you like. An example command to begin training is:
+To launch reward model training, we first need a pre-trained or fine-tuned NeMo checkpoint. Our `training_rm.yaml` file has default settings for the 2B model, but feel free to use any other model (adjusting the config accordingly). An example command to begin training is:
```bash
cd /opt/nemo-rlhf \
@@ -5097,33 +5118,45 @@ cd /opt/nemo-rlhf \
&& python -u examples/nlp/gpt/train_reward_model.py \
--config-path=examples/nlp/gpt/conf/ \
--config-name=training_rm \
- model.pretrained_checkpoint.restore_from_path='model.nemo' \
- "model.data.data_prefix={train: [${train_output_document}], validation: [${val_output_document}], test: [${test_output_document}]}"
+ exp_manager.explicit_log_dir=/path/to/rm_output_dir \
+ model.pretrained_checkpoint.restore_from_path=/path/to/init_model.nemo \
+ "model.data.data_prefix={train: [/path/to/rm_train], validation: [/path/to/rm_val], test: [/path/to/rm_test]}"
+```
+
+The data files should point to the names of datasets generated as described in the previous section, but without the ".bin" or ".idx" suffix.
+Note that if you are using the command above with your own pre-trained model, you will need to modify `training_rm.yaml` (or the command line) to provide correct values for `tokenizer.model` and `tokenizer.tokenizer_model`.
+You can use `tar tvf /path/to/init_model.nemo` to inspect the checkpoint and obtain the names of its tokenizer files: typically, both tokenizer files are identical, so you can use the same name for both options, e.g. with
+```bash
+model.tokenizer.model=nemo:2b164b2c1dd74bd691ff90a0db3d39b8_xyz_256k.model \
+model.tokenizer.tokenizer_model=nemo:2b164b2c1dd74bd691ff90a0db3d39b8_xyz_256k.model \
```
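+
+If you prefer to locate these files programmatically, a small helper along the following lines can be used (this assumes, as is typical, that the `.nemo` checkpoint is a tar archive and that the SentencePiece tokenizer files end in `.model`):
+
+```python
+# Sketch: list candidate tokenizer files bundled inside a .nemo checkpoint (a tar archive).
+import sys
+import tarfile
+
+def list_tokenizer_files(nemo_path):
+    with tarfile.open(nemo_path, "r:*") as archive:  # auto-detects compression, if any
+        return [m.name for m in archive.getmembers() if m.name.endswith(".model")]
+
+if __name__ == "__main__":
+    for name in list_tokenizer_files(sys.argv[1]):
+        # Pass this name (with the "nemo:" prefix) as tokenizer.model / tokenizer.tokenizer_model.
+        print(name)
+```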
+_Remark: currently, the example training script does not automatically run evaluation on the provided test set. This may change in a future release._
+
##### 5.16.1.3 Reward Model Evaluation
-To learn how to serve the reward model for evaluation, see the section "Launching the Reward Model inference server" below.
+Once trained, a reward model may be served for evaluation purposes, as described in the section "Launching the Reward Model Inference Server" below.
+This can also be useful to compute the mean / std of reward predictions before doing PPO training, so that they can be normalized: documentation and scripts to perform such normalization will be provided soon.
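+In the meantime, the normalization itself is straightforward: for example, once you have collected reward predictions for a held-out set by querying the inference server (say, one score per line in a text file, a made-up layout for this sketch), something like the following computes the statistics:
+
+```python
+# Sketch: compute the mean / std of reward predictions and normalize them.
+# Assumes "rewards.txt" contains one reward score per line, collected beforehand
+# by scoring a held-out dataset with the reward model inference server.
+import statistics
+
+with open("rewards.txt") as f:
+    rewards = [float(line) for line in f if line.strip()]
+
+mean = statistics.fmean(rewards)
+std = statistics.pstdev(rewards)
+normalized = [(r - mean) / std for r in rewards]
+print(f"mean={mean:.4f}  std={std:.4f}")
+```
+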
#### 5.16.2. PPO Training
-After fine-tuning a GPT model using Supervised Finetuning(SFT) and training a Reward Model as explained in the previous sections, NeMo-RLHF can be used to launch PPO jobs to fine-tune the SFT model using RLHF. During PPO training, four different models will be interacting with each other:
+After fine-tuning a GPT model using Supervised Finetuning (SFT) and training a Reward Model as explained in the previous sections, NeMo-RLHF can be used to launch PPO jobs to fine-tune the SFT model using RLHF. During PPO training, four different models will be interacting with each other:
1. The PPO Actor Network (also known as the Policy Network): This is the model we are training, and it should start from an SFT model trained as explained in the SFT section.
-2. The Reward Model (RM) Network (also known as a Preference Model (PM)): This model will take a prompt and a response as inputs, and it will provide a single scalar value as output. This scalar value will be the reward, which the PPO algorithm will try to maximize. The RM should be a model trained as described in the RM Training section.
-3. The PPO Critic Network (also known as the Value Network): Since PPO is an actor-critic algorithm, we need a critic to help our actor learn more effectively. The critic will provide Value estimates to each token in the responses provided by the actor. These values can be seen as an estimate of the amount of reward the actor will receive after generating all the remaining tokens. The critic is loaded from the same RM we trained as described in the RM training section. Note: The RM generates a single reward for the entire sequence, whereas the Critic generates a value for each token.
-4. The Initial Policy Network (also known as the Reference Model): We use this model to compute a KL Divergence penalty term that ensures that the PPO Actor does not diverge too much from the Initial Policy. This way, we prevent the PPO Actor from overfitting to the reward models given by the RM, and ensure it does not forget the knowledge it acquired during pretraining and SFT. This model should be the same model as the PPO Actor Network.
+2. The Reward Model (RM) Network (also known as a Preference Model (PM)): This model takes a prompt concatenated with a response as input, and outputs a single scalar value: the reward, which the PPO algorithm will try to maximize. The RM should be a model trained as described in the RM Training section.
+3. The PPO Critic Network (also known as the Value Network): Since PPO is an Actor-Critic algorithm, we need a Critic to guide the Actor during training. The Critic will provide value estimates for each token in the responses provided by the Actor. These values can be seen as an estimate of the total reward the Actor will receive after generating all the remaining tokens. The Critic should be initialized from the RM so as to provide useful feedback in the early stages of training. Note: The RM generates a single reward for the entire sequence, whereas the Critic generates a value for each token.
+4. The Initial Policy Network (also known as the Reference Model): We use this model to compute a KL Divergence penalty term that ensures that the PPO Actor does not diverge too much from the Initial Policy. This way, we prevent the PPO Actor from overfitting to the rewards given by the RM, and ensure it does not forget the knowledge it acquired during pretraining and SFT. This model should be the one used to initialize the PPO Actor Network.
-To launch a full PPO training job, we need to launch the RM and the Initial Policy as inference servers. These two models are not trained, so they only need to perform inference and share their result with the PPO Actor. However, the PPO Actor and PPO Critic need to be trained.
+To launch a full PPO training job, we need to launch the RM and the Initial Policy as inference servers. These two models are not trained, so they only need to perform inference and share their results with the PPO Actor. However, both the PPO Actor and Critic need to be trained.
-Our architecture is designed to launch all four models completely separately. Therefore, we will launch two inference servers (one for the RM and one for the initial policy), one server that can do inference and training (the PPO Critic), and one master job to do training (the PPO Actor). Next we will look at how to launch each of those four jobs.
+Our architecture is designed to launch all four models completely separately. Therefore, we will launch two inference servers (one for the RM and one for the initial policy), one server that can do inference and training (the PPO Critic), and one master job to control the training (the PPO Actor). Next we will look at how to launch each of those four jobs.
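+
+As a rough schematic of how these models interact during training (illustrative only, not the exact NeMo-RLHF implementation), the per-token reward optimized by PPO is typically the RM score, granted on the last token of the response, minus a KL penalty w.r.t. the Initial Policy:
+
+```python
+# Schematic: combining the RM score and the Initial Policy KL penalty into per-token rewards.
+# kl_weight plays the role of rlhf.ppo.inital_pollicy_kl_penalty in the Actor config (see below).
+def shaped_rewards(rm_score, actor_logprobs, init_policy_logprobs, kl_weight):
+    rewards = []
+    for t, (lp, lp0) in enumerate(zip(actor_logprobs, init_policy_logprobs)):
+        kl = lp - lp0  # per-token KL estimate (abs(kl) when rlhf.ppo.use_absolute_kl is enabled)
+        reward = -kl_weight * kl
+        if t == len(actor_logprobs) - 1:
+            reward += rm_score  # the scalar RM reward is assigned to the final token only
+        rewards.append(reward)
+    return rewards
+```
+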
##### 5.16.2.1 Launching the Reward Model Inference Server
-To launch the Reward Model inference server in a Linux system, this command can be run inside the container:
+To launch the Reward Model inference server, this command can be run inside the container:
```bash
cd /opt/nemo-rlhf \
@@ -5132,7 +5165,7 @@ cd /opt/nemo-rlhf \
&& python -u examples/nlp/gpt/serve_reward_model.py \
--config-path=examples/nlp/gpt/conf/ \
--config-name=inference_rm \
- gpt_rm_model_file=/path/to/model.nemo \
+ gpt_rm_model_file=/path/to/trained_rm.nemo \
port=5555
```
@@ -5141,7 +5174,7 @@ This command will launch the RM inference server on the local computer, using po
##### 5.16.2.2 Launching the Initial Policy Inference Server
-To launch the Initial Policy inference server in a Linux system, this command can be run inside the container:
+To launch the Initial Policy inference server, this command can be run inside the container:
```bash
cd /opt/nemo-rlhf \
@@ -5150,7 +5183,7 @@ cd /opt/nemo-rlhf \
&& python -u examples/nlp/gpt/serve_initial_policy.py \
--config-path=examples/nlp/gpt/conf/ \
--config-name=inference_initial_policy \
- gpt_model_file=/path/to/model.nemo \
+ gpt_model_file=/path/to/sft_model.nemo \
port=5556
```
@@ -5159,7 +5192,8 @@ This command will launch the Initial Policy inference server on the local comput
##### 5.16.2.3 Launching the PPO Critic Training and Inference Server
-The PPO Critic has to perform both training and inference. We designed the Critic to have both capabilities. To launch the PPO Critic server in a Linux system, this command can be run inside the container:
+The PPO Critic has to perform both inference *and* training.
+To launch the PPO Critic server, which provides both functionalities, this command can be run inside the container:
```bash
cd /opt/nemo-rlhf \
@@ -5168,15 +5202,17 @@ cd /opt/nemo-rlhf \
&& python -u examples/nlp/gpt/serve_ppo_critic.py \
--config-path=examples/nlp/gpt/conf/ \
--config-name=gpt_ppo_critic \
+ exp_manager.explicit_log_dir=/path/to/critic_output_dir \
model.pretrained_checkpoint.restore_from_path=/path/to/trained_rm.nemo \
- port=5557
+ inference.port=5557
```
-This command will launch the PPO Critic server on the local computer, using port 5557. All the configuration parameters can be modified in the `gpt_ppo_critic.yaml` file, or by overriding them through the CLI command. Ensure `inference.server=True` is set in the configuration of this job to correctly launch the server.
+This command will launch the PPO Critic server on the local computer, using port 5557. All the configuration parameters can be modified in the `gpt_ppo_critic.yaml` file, or by overriding them through the CLI command: in particular, the Critic's model config should match the one used to train the RM, and you may need to provide the correct name of the tokenizer files as described in the RM training section above.
+Ensure `inference.server=True` is set in the configuration of this job to correctly launch the server.
##### 5.16.2.4 Launching the PPO Actor Training
-The PPO Actor training job contains the master HTTP controller that makes the HTTP calls to all three servers when needed. To launch the PPO Actor server in a Linux system, this command can be run inside the container:
+The PPO Actor training job contains the master controller that makes the HTTP calls to all three servers when needed. To launch the PPO Actor server, this command can be run inside the container:
```bash
cd /opt/nemo-rlhf \
@@ -5185,14 +5221,27 @@ cd /opt/nemo-rlhf \
&& python -u examples/nlp/gpt/train_gpt_ppo_actor.py \
--config-path=examples/nlp/gpt/conf/ \
--config-name=gpt_ppo_actor \
- "model.data.data_prefix={train: [/path/to/train_data], validation: [/path/to/val_data], test: [/path/to/test_data]}" \
- model.pretrained_checkpoint.restore_from_path=/path/to/model.nemo
+ exp_manager.explicit_log_dir=/path/to/actor_output_dir \
+ "model.data.data_prefix={train: [/path/to/actor_train], validation: [/path/to/actor_val], test: [/path/to/actor_test]}" \
+ model.pretrained_checkpoint.restore_from_path=/path/to/sft_model.nemo
```
-This command will launch the PPO Actor job on the local computer. All the configuration parameters can be modified in the `gpt_ppo_actor.yaml` file, or by overriding them through the CLI command.
-##### 5.16.2.5 Launching every job at once with SLURM
-
-Heterogeneous jobs can be used to launch all four jobs simultaneously in different nodes, using a script like the one shown next:
+This command will launch the PPO Actor job on the local computer. All the configuration parameters can be modified in the `gpt_ppo_actor.yaml` file, or by overriding them through the CLI command: in particular, the Actor's model config should match the one used to train the SFT model, and you may need to provide the correct name of the tokenizer files as described in the RM training section above.
+
+The data files should point to the names of datasets (without the ".bin" or ".idx" suffix) generated in a manner similar to what is described in the RM training section, but with an important difference: they should only contain prompts.
+This means that the raw .jsonl file from which these datasets are built should have the following format:
+```
+{"text": prompt1}
+{"text": prompt2}
+{"text": prompt3}
+...
+```
+
+_Remark: currently, the example training script does not automatically run evaluation on the provided test set. This may change in a future release._
+
+##### 5.16.2.5 Launching all jobs at once with SLURM
+
+Heterogeneous jobs can be used to launch all four jobs simultaneously on different nodes, using a script like:
```bash
#!/bin/bash
@@ -5204,17 +5253,26 @@ Heterogeneous jobs can be used to launch all four jobs simultaneously in differe
#SBATCH hetjob
#SBATCH -N 8 --ntasks-per-node 8 -t 4:00:00 --exclusive
-RM_MODEL=/path/to/reward_model.nemo
+RM_MODEL=/path/to/trained_rm.nemo
ACTOR_MODEL=/path/to/sft_model.nemo
+OUTPUT_DIR=/path/to/output_dir
+TRAIN_DATA_PATH=/path/to/train_actor
+VALID_DATA_PATH=/path/to/val_actor
+TEST_DATA_PATH=/path/to/test_actor
-DIR=/opt/nemo-rlhf
+NEMO_RLHF_DIR=/opt/nemo-rlhf
CONTAINER="nvcr.io/ea-bignlp/nemofw-training:23.07-py3"
+mkdir -p $OUTPUT_DIR
+
# START HETEROGENEOUS JOB 0
+mkdir -p ${OUTPUT_DIR}/rm
+RM_OUT=${OUTPUT_DIR}/rm/rm-%j.log
+RM_ERR=${OUTPUT_DIR}/rm/rm-%j.err
+read -r -d '' cmd_rm_inference <<EOF
+cd ${NEMO_RLHF_DIR} \
+&& python -u examples/nlp/gpt/serve_reward_model.py \
+   --config-path=examples/nlp/gpt/conf/ \
+   --config-name=inference_rm \
+   gpt_rm_model_file=${RM_MODEL} \
+   port=5555
+EOF
+
+srun --het-group=0 -o $RM_OUT -e $RM_ERR --container-image="${CONTAINER}" bash -c "${cmd_rm_inference}" &
+pids[0]=$!
+# END HETEROGENEOUS JOB 0
+
+# HETEROGENEOUS JOBS 1, 2 and 3 (the Initial Policy, PPO Critic and PPO Actor) are defined and
+# launched in the same way, each with its own command, log files and --het-group index, and
+# their PIDs are stored in pids[1], pids[2] and pids[3].
+
+# Monitor all four jobs: as soon as one of them exits, kill the remaining ones and exit.
+while true; do
+  all_done=true
+  for pid in "${pids[@]}"; do
+    if ps -p "$pid" > /dev/null; then
+      # Process is still running.
+      all_done=false
+    else
+      # Process is no longer running => check its exit status.
+      wait "$pid"
+      exit_code=$?
+      echo "Process $pid exited with code $exit_code at $(date '+%Y-%m-%d %H:%M:%S')"
+      # Wait a bit (to get a clean stack trace in case there is one being generated), then kill the
+      # remaining processes if needed.
+      sleep 60
+      for other_pid in "${pids[@]}"; do
+        if ps -p "$other_pid" > /dev/null; then
+          echo "Killing process $other_pid"
+          kill -9 "$other_pid"
+        fi
+      done
+      exit $exit_code
+    fi
+  done
+
+  # Sleep for a while before checking again.
+  sleep 60
+done
+```
+
+It is important to launch all jobs with `&` after the `srun` command, to ensure they do not block each other.
+
+Note: all four scripts support data parallelism. Therefore, the SLURM `--ntasks-per-node` value may be set to the number of GPUs on each node, and `trainer.devices` should also be set to that same value.
+
+##### 5.16.2.6 Ensuring consistency between jobs
+
+
+Since there are four independent jobs, each with its own config, one must be careful to ensure that the various configs are compatible with each other by following the guidelines below (a short script to sanity-check the derived values is sketched after the list):
+
+- `critic.exp_manager.checkpoint_callback_params.every_n_train_steps` should be set to `actor.trainer.val_check_interval * actor.model.global_batch_size / critic.model.global_batch_size` so that the Critic is saved at the same frequency as the Actor.
+- `critic.inference.micro_batch_size` should be set to `actor.model.rlhf.ppo.rollout_micro_batch_size` divided by the Critic's data parallel size, rounded up. The Critic's data parallel size is the total number of GPUs it runs on (`trainer.devices * trainer.num_nodes`) divided by `model.tensor_model_parallel_size * model.pipeline_model_parallel_size`.
+This ensures that the Critic can process the Actor's requests as efficiently as possible.
+- Similarly, `rm.inference_micro_batch_size` and `init_policy.inference_micro_batch_size` should be set to `actor.model.rlhf.ppo.rollout_micro_batch_size` divided by the RM and Initial Policy's data parallel size, rounded up.
+- `critic.model.ppo_epochs` should be equal to `actor.model.rlhf.ppo.epochs` so that the Critic performs the same number of updates as the Actor on the rollout buffer data.
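+
+For convenience, these derived values can be sanity-checked with a short script such as the following (the numbers are made-up examples; plug in your own config values):
+
+```python
+# Sketch: compute the derived settings described in the guidelines above.
+import math
+
+# Actor settings (example values)
+actor_val_check_interval = 10   # actor.trainer.val_check_interval
+actor_global_batch_size = 64    # actor.model.global_batch_size
+rollout_micro_batch_size = 8    # actor.model.rlhf.ppo.rollout_micro_batch_size
+
+# Critic settings (example values)
+critic_global_batch_size = 64             # critic.model.global_batch_size
+critic_devices, critic_num_nodes = 8, 1   # critic.trainer.devices / critic.trainer.num_nodes
+critic_tp, critic_pp = 2, 1               # critic.model.{tensor,pipeline}_model_parallel_size
+
+critic_dp_size = (critic_devices * critic_num_nodes) // (critic_tp * critic_pp)
+
+# critic.exp_manager.checkpoint_callback_params.every_n_train_steps
+every_n_train_steps = actor_val_check_interval * actor_global_batch_size // critic_global_batch_size
+
+# critic.inference.micro_batch_size (rounded up); the same formula applies to the RM and
+# Initial Policy inference micro batch sizes, using their own data parallel sizes.
+critic_inference_micro_batch_size = math.ceil(rollout_micro_batch_size / critic_dp_size)
+
+print(every_n_train_steps, critic_inference_micro_batch_size)
+```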
+
+##### 5.16.2.7 PPO Hyper-parameters
-All the model related parameters can be controlled the same way as in other NeMo training jobs. However, we also provide full control of the behavior of PPO during training, with a section in the config yaml files inside `model.rlhf`. These are the descriptions of the available hyper-parameters:
+All the model parameters can be controlled the same way as in other NeMo training jobs. However, we also provide full control of the behavior of PPO during training, with a section in the Actor config yaml file inside `model.rlhf`. These are the available hyper-parameters:
-- `rlhf.reward_model`: Provide the ip address and the port where the Reward Model will be running, to enable communication with it.
-- `rlhf.critic`: Provide the ip address and the port where the PPO Critic will be running, to enable communication with it.
-- `rlhf.initial_policy`: Provide the ip address and the port where the Initial Policy will be running, to enable communication with it.
-- `rlhf.ppo.entropy_penalty`: Control the effect of the entropy term in PPO.
-- `rlhf.ppo.inital_pollicy_kl_penalty`: Control the effect of the initial policy KL Divergence term in PPO.
-- `rlhf.ppo.use_absolute_kl`: Whether to use the absolute value of the initial policy KL Divergence or not.
-- `rlhf.ppo.epochs`: Number of epochs the actor and critic will perform on the data stored in the rollout buffer each time.
+- `rlhf.{reward_model,critic,initial_policy}.{ip,port}`: Provide the ip address and the port where the Reward Model, PPO Critic and Initial Policy will be running, to enable communication with them.
+- `rlhf.ppo.entropy_bonus`: Weight of the entropy term in the PPO loss.
+- `rlhf.ppo.inital_pollicy_kl_penalty`: Weight of the KL Divergence w.r.t. the Initial Policy in the PPO loss.
+- `rlhf.ppo.use_absolute_kl`: Whether or not to use the absolute value of the KL Divergence w.r.t. the Initial Policy.
+- `rlhf.ppo.epochs`: Number of training epochs the Actor will perform on the samples stored in the rollout buffer before generating new samples.
- `rlhf.ppo.num_rollout_samples`: Number of samples that will be generated during the rollout stage before moving to the training stage.
-- `rlhf.ppo.rollout_micro_batch_size`: Micro batch size for the rollout phase. Each GPU will load this many prompts and generate responses for them.
-- `rlhf.ppo.ratio_eps`: epsilon value for clipping the PPO ratio during training.
-- `rlhf.ppo.discount`: discount factor for calculating the returns and advantages.
-- `rlhf.ppo.gae_lambda`: lambda value for the Generalized Advantage Estimation (GAE) calculation.
-- `rlhf.ppo.normalize_advantage`: whether to normalize the advantages to have a mean of zero and standard deviation of one.
+- `rlhf.ppo.rollout_micro_batch_size`: Micro batch size for the rollout phase. Each GPU will load this many prompts at once and generate responses for them.
+- `rlhf.ppo.ratio_eps`: Epsilon value for clipping the PPO ratio during training.
+- `rlhf.ppo.discount`: Discount factor for calculating the returns and advantages.
+- `rlhf.ppo.gae_lambda`: Lambda value for the Generalized Advantage Estimation (GAE) calculation.
+- `rlhf.ppo.normalize_advantage`: Whether or not to normalize the advantages to have a mean of zero and standard deviation of one within each global batch.
+
+Note that although the sampling parameters during the rollout phase can also be modified (through `model.sampling_params.*`), it is not recommended to do so because the implementation currently does not account for these changes when computing the log probabilities of the generated responses.
-During the rollout phase, the sampling parameters for the model can also be modified, by using the parameters in `model.sampling_params`.
+The Critic's config also contains a `model.rlhf` section with the following hyper-parameter:
+- `rlhf.ppo.critic_loss_clip_value`: Used in the Critic loss term that clamps the difference between the current Critic value predictions and those that were predicted during rollout generation (disabled when set to zero).
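+
+To make the role of these hyper-parameters more concrete, the following schematic (standard PPO / GAE equations written in plain Python, not the actual NeMo-RLHF code) shows where `discount`, `gae_lambda`, `ratio_eps`, `entropy_bonus` and `critic_loss_clip_value` enter the computation:
+
+```python
+# Schematic of the standard PPO quantities controlled by the hyper-parameters above.
+import math
+
+def gae(rewards, values, discount, gae_lambda):
+    """Per-token advantages and returns via Generalized Advantage Estimation."""
+    advantages, last_gae = [0.0] * len(rewards), 0.0
+    for t in reversed(range(len(rewards))):
+        next_value = values[t + 1] if t + 1 < len(values) else 0.0
+        delta = rewards[t] + discount * next_value - values[t]
+        last_gae = delta + discount * gae_lambda * last_gae
+        advantages[t] = last_gae
+    returns = [a + v for a, v in zip(advantages, values)]
+    return advantages, returns
+
+def actor_token_loss(logprob, old_logprob, advantage, entropy, ratio_eps, entropy_bonus):
+    """Clipped surrogate loss for one token, minus the entropy bonus."""
+    ratio = math.exp(logprob - old_logprob)
+    clipped_ratio = min(max(ratio, 1.0 - ratio_eps), 1.0 + ratio_eps)
+    return -min(ratio * advantage, clipped_ratio * advantage) - entropy_bonus * entropy
+
+def critic_token_loss(value, rollout_value, ret, critic_loss_clip_value):
+    """Squared error, with the value update clamped around the rollout-time prediction."""
+    if critic_loss_clip_value > 0.0:
+        value = min(max(value, rollout_value - critic_loss_clip_value),
+                    rollout_value + critic_loss_clip_value)
+    return (value - ret) ** 2
+```
+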
#### 5.16.3. Future Work
-- The reward model training only supports datasets with two responses per prompt. We will add support for training with datasets that have more than 2 responses per prompt in future releases.
-- The throughput of PPO will be greatly increased in future releases.
-- The stability of the PPO learning process is not good enough. We will continue working to improve the PPO learning for our models.
+- The throughput of PPO will be increased in future releases.
+- We will continue improving the stability of the PPO learning phase.
+- We will add more learning algorithms beyond PPO.
### 5.17 Curating pretraining datasets with the NeMo Data Curator