[fp8] Merge feature/fp8_comm to main branch of Colossalai (#6016)

* add SimPO * fix dataloader * remove debug code * add orpo * fix style * fix colossalai, transformers version * fix colossalai, transformers version * fix colossalai, transformers version * fix torch colossalai version * update transformers version * [shardformer] DeepseekMoE support (#5871) * [Feature] deepseek moe expert parallel implement * [misc] fix typo, remove redundant file (#5867) * [misc] fix typo * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> * [Feature] deepseek support & unit test * [misc] remove debug code & useless print * [misc] fix typos (#5872) * [Feature] remove modeling file, use auto config. (#5884) * [misc] fix typos * [Feature] deepseek support via auto model, remove modeling file * [misc] delete useless file * [misc] fix typos * [Deepseek] remove redundant code (#5888) * [misc] fix typos * [Feature] deepseek support via auto model, remove modeling file * [misc] delete useless file * [misc] fix typos * [misc] remove redundant code * [Feature/deepseek] resolve comment. 
(#5889) * [misc] fix typos * [Feature] deepseek support via auto model, remove modeling file * [misc] delete useless file * [misc] fix typos * [misc] remove redundant code * [misc] mv module replacement into if branch * [misc] add some warning message and modify some code in unit test * [misc] fix typos --------- Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> * [Hoxfix] Fix CUDA_DEVICE_MAX_CONNECTIONS for comm overlap Co-authored-by: Edenzzzz <[email protected]> * [Feat] Diffusion Model(PixArtAlpha/StableDiffusion3) Support (#5838) * Diffusion Model Inference support * Stable Diffusion 3 Support * pixartalpha support * [HotFix] CI,import,requirements-test for #5838 (#5892) * [Hot Fix] CI,import,requirements-test --------- Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> * [Feature] Enable PP + SP for llama (#5868) * fix cross-PP-stage position id length diff bug * fix typo * fix typo * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * use a one cross entropy func for all shardformer models --------- Co-authored-by: Edenzzzz <[email protected]> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> * [ShardFormer] Add Ulysses Sequence Parallelism support for Command-R, Qwen2 and ChatGLM (#5897) * add benchmark for sft, dpo, simpo, orpo. Add benchmarking result. 
Support lora with gradient checkpoint * fix style * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix eval * hotfix citation * [zero] support all-gather overlap (#5898) * [zero] support all-gather overlap * [zero] add overlap all-gather flag * [misc] fix typo * [zero] update api * fix orpo cross entropy loss * [Auto Parallel]: Speed up intra-op plan generation by 44% (#5446) * Remove unnecessary calls to deepcopy * Build DimSpec's difference dict only once This change considerably speeds up construction speed of DimSpec objects. The difference_dict is the same for each DimSpec object, so a single copy of it is enough. * Fix documentation of DimSpec's difference method * [ShardFormer] fix qwen2 sp (#5903) * [compatibility] support torch 2.2 (#5875) * Support Pytorch 2.2.2 * keep build_on_pr file and update .compatibility * fix object_to_tensor usage when torch>=2.3.0 (#5820) * [misc] support torch2.3 (#5893) * [misc] support torch2.3 * [devops] update compatibility ci * [devops] update compatibility ci * [devops] add debug * [devops] add debug * [devops] add debug * [devops] add debug * [devops] remove debug * [devops] remove debug * [release] update version (#5912) * [plugin] support all-gather overlap for hybrid parallel (#5919) * [plugin] fixed all-gather overlap support for hybrid parallel * add kto * fix style, add kto data sample * [Examples] Add lazy init to OPT and GPT examples (#5924) Co-authored-by: Edenzzzz <[email protected]> * [ColossalChat] Hotfix for ColossalChat (#5910) * add ignore and tiny llama * fix path issue * run style * fix issue * update bash * add ignore and tiny llama * fix path issue * run style * fix issue * update bash * fix ddp issue * add Qwen 1.5 32B * refactor tokenization * [FIX BUG] UnboundLocalError: cannot access local variable 'default_conversation' where it is not associated with a value (#5931) * cannot access local variable 'default_conversation' where it is not 
associated with a value set default value for 'default_conversation' * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> * fix test data * refactor evaluation * remove real data path * remove real data path * Add n_fused as an input from native_module (#5894) * [FIX BUG] convert env param to int in (#5934) * [Hotfix] Fix ZeRO typo #5936 Co-authored-by: Edenzzzz <[email protected]> * [Feature] Add a switch to control whether the model checkpoint needs to be saved after each epoch ends (#5941) * Add a switch to control whether the model checkpoint needs to be saved after each epoch ends * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> * fix style * fix style * fix style * [shardformer] hotfix attn mask (#5945) * [shardformer] hotfix attn mask (#5947) * [Feat] Distrifusion Acceleration Support for Diffusion Inference (#5895) * Distrifusion Support source * comp comm overlap optimization * sd3 benchmark * pixart distrifusion bug fix * sd3 bug fix and benchmark * generation bug fix * naming fix * add docstring, fix counter and shape error * add reference * readme and requirement * [zero] hotfix update master params (#5951) * [release] update version (#5952) * [Chat] Fix lora (#5946) * fix merging * remove filepath * fix style * Update README.md (#5958) * [hotfix] Remove unused plan section (#5957) * remove readme * fix readme * update * [test] add mixtral for sequence classification * [test] add mixtral transformer test * [moe] fix plugin * [test] mixtra pp shard test * [chore] handle non member group * [zero] solve hang * [test] pass mixtral shardformer test * [moe] implement transit between non moe tp and ep * [zero] solve hang * [misc] solve booster hang by rename the 
variable * solve hang when parallel mode = pp + dp * [moe] implement submesh initialization * [moe] add mixtral dp grad scaling when not all experts are activated * [chore] manually revert unintended commit * [chore] trivial fix * [chore] arg pass & remove drop token * [test] add mixtral modelling test * [moe] implement tp * [moe] test deepseek * [moe] clean legacy code * [Feature] MoE Ulysses Support (#5918) * moe sp support * moe sp bug solve * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> * [chore] minor fix * [moe] init moe plugin comm setting with sp * moe sp + ep bug fix * [moe] finalize test (no pp) * [moe] full test for deepseek and mixtral (pp + sp to fix) * [chore] minor fix after rebase * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * [chore] solve moe ckpt test failure and some other arg pass failure * [moe] remove ops * [test] fix test: test_zero1_2 * [bug] fix: somehow logger hangs the program * [moe] deepseek moe sp support * [test] add check * [deepseek] replace attn (a workaround for bug in transformers) * [misc] skip redunant test * [misc] remove debug/print code * [moe] refactor mesh assignment * Revert "[moe] implement submesh initialization" This reverts commit 2f9bce6. 
* [chore] change moe_pg_mesh to private * [misc] remove incompatible test config * [misc] fix ci failure: change default value to false in moe plugin * [misc] remove useless condition * [chore] docstring * [moe] remove force_overlap_comm flag and add warning instead * [doc] add MoeHybridParallelPlugin docstring * [moe] solve dp axis issue * [chore] remove redundant test case, print string & reduce test tokens * [feat] Dist Loader for Eval (#5950) * support auto distributed data loader * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * support auto distributed data loader * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix tp error * remove unused parameters * remove unused * update inference * update docs * update inference --------- Co-authored-by: Michelle <[email protected]> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> * [lora] lora support hybrid parallel plugin (#5956) * lora support hybrid plugin * fix * fix * fix * fix * Support overall loss, update KTO logging * [Docs] clarify launch port Co-authored-by: Edenzzzz <[email protected]> * [Hotfix] README link (#5966) * update ignore * update readme * run style * update readme * [Hotfix] Avoid fused RMSnorm import error without apex (#5985) Co-authored-by: Edenzzzz <[email protected]> * [Chat] fix readme (#5989) * fix readme * fix readme, tokenization fully tested * fix readme, tokenization fully tested * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Co-authored-by: root <root@notebook-8f919155-6035-47b4-9c6f-1be133b9e2c9-0.notebook-8f919155-6035-47b4-9c6f-1be133b9e2c9.colossal-ai.svc.cluster.local> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> * fix sync condition (#6000) * [plugin] add cast inputs option for zero (#6003) * [pre-commit.ci] pre-commit 
autoupdate (#5995) updates: - [github.com/psf/black-pre-commit-mirror: 24.4.2 → 24.8.0](psf/black-pre-commit-mirror@24.4.2...24.8.0) Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> * [misc] Bypass the huggingface bug to solve the mask mismatch problem (#5991) * [Feature] Zigzag Ring attention (#5905) * halfway * fix cross-PP-stage position id length diff bug * fix typo * fix typo * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * unified cross entropy func for all shardformer models * remove redundant lines * add basic ring attn; debug cross entropy * fwd bwd logic complete * fwd bwd logic complete; add experimental triton rescale * precision tests passed * precision tests passed * fix typos and remove misc files * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * add sp_mode to benchmark; fix varlen interface * update softmax_lse shape by new interface * change tester name * remove buffer clone; support packed seq layout * add varlen tests * fix typo * all tests passed * add dkv_group; fix mask * remove debug statements --------- Co-authored-by: Edenzzzz <[email protected]> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> * [misc] update compatibility (#6008) * [misc] update compatibility * [misc] update requirements * [devops] disable requirements cache * [test] fix torch ddp test * [test] fix rerun on address in use * [test] fix lazy init * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix the merge * fix the merge * overlap kv comm with output rescale (#6017) Co-authored-by: Edenzzzz <[email protected]> * fix the merge * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix the merge * fix * fix * fix the merge * fix * [misc] Use dist logger in plugins (#6011) * use 
dist logger in plugins * remove trash * print on rank 0 --------- Co-authored-by: Edenzzzz <[email protected]> * fix * fix * fix * fix * fix the merge * fix * fix * fix * fix --------- Co-authored-by: YeAnbang <[email protected]> Co-authored-by: Haze188 <[email protected]> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Edenzzzz <[email protected]> Co-authored-by: Edenzzzz <[email protected]> Co-authored-by: Runyu Lu <[email protected]> Co-authored-by: Guangyao Zhang <[email protected]> Co-authored-by: YeAnbang <[email protected]> Co-authored-by: Hongxin Liu <[email protected]> Co-authored-by: Stephan Kö <[email protected]> Co-authored-by: アマデウス <[email protected]> Co-authored-by: Tong Li <[email protected]> Co-authored-by: zhurunhua <[email protected]> Co-authored-by: Insu Jang <[email protected]> Co-authored-by: Gao, Ruiyuan <[email protected]> Co-authored-by: hxwang <[email protected]> Co-authored-by: Michelle <[email protected]> Co-authored-by: root <root@notebook-8f919155-6035-47b4-9c6f-1be133b9e2c9-0.notebook-8f919155-6035-47b4-9c6f-1be133b9e2c9.colossal-ai.svc.cluster.local>
hpcaitech · Aug 22, 2024 · eea37da
1 parent 0a51319
commit eea37da
Showing 92 changed files with 2,222 additions and 463 deletions.
diff --git a/.compatibility b/.compatibility
@@ -1,3 +1,4 @@
 2.1.0-12.1.0
 2.2.2-12.1.0
 2.3.0-12.1.0
+2.4.0-12.4.1
diff --git a/.cuda_ext.json b/.cuda_ext.json
@@ -5,8 +5,8 @@
  "cuda_image": "hpcaitech/cuda-conda:12.1"
  },
  {
- "torch_command": "pip install torch==2.1.0 torchvision==0.16.0 torchaudio==2.1.0 --index-url https://download.pytorch.org/whl/cu118",
- "cuda_image": "hpcaitech/cuda-conda:11.8"
+ "torch_command": "pip install torch==2.4.0 torchvision==0.19.0 torchaudio==2.4.0 --index-url https://download.pytorch.org/whl/cu124",
+ "cuda_image": "hpcaitech/cuda-conda:12.4"
  }
  ]
 }
diff --git a/.github/workflows/build_on_pr.yml b/.github/workflows/build_on_pr.yml
@@ -141,7 +141,7 @@ jobs:
  - name: Install Colossal-AI
  run: |
  BUILD_EXT=1 pip install -v -e .
- pip install -r requirements/requirements-test.txt
+ pip install --no-cache-dir -r requirements/requirements-test.txt
 
  - name: Store Colossal-AI Cache
  run: |

diff --git a/.github/workflows/build_on_schedule.yml b/.github/workflows/build_on_schedule.yml
@@ -57,7 +57,7 @@ jobs:
  [ ! -z "$(ls -A /github/home/cuda_ext_cache/)" ] && cp -r /github/home/cuda_ext_cache/* /__w/ColossalAI/ColossalAI/
  BUILD_EXT=1 pip install -v -e .
  cp -r /__w/ColossalAI/ColossalAI/build /github/home/cuda_ext_cache/
- pip install -r requirements/requirements-test.txt
+ pip install --no-cache-dir -r requirements/requirements-test.txt
 
  - name: Unit Testing
  if: steps.check-avai.outputs.avai == 'true'

diff --git a/.pre-commit-config.yaml b/.pre-commit-config.yaml
@@ -12,9 +12,10 @@ repos:
  hooks:
  - id: isort
  name: sort all imports (python)
+ args: ["--profile", "black"] # avoid conflict with black
 
  - repo: https://github.com/psf/black-pre-commit-mirror
- rev: 24.4.2
+ rev: 24.8.0
  hooks:
  - id: black
  name: black formatter

diff --git a/applications/ColossalChat/.gitignore b/applications/ColossalChat/.gitignore
@@ -151,6 +151,7 @@ examples/training_scripts/wandb
 examples/training_scripts/output
 
 examples/awesome-chatgpt-prompts/
+examples/inference/round.txt
 temp/
 
 # ColossalChat

diff --git a/applications/ColossalChat/README.md b/applications/ColossalChat/README.md
@@ -121,7 +121,7 @@ cd $COLOSSAL_AI_ROOT
 BUILD_EXT=1 pip install .
 
 # Install ColossalChat
-cd $COLOSSAL_AI_ROOT/applications/Chat
+cd $COLOSSAL_AI_ROOT/applications/ColossalChat
 pip install .
 ```
 

diff --git a/applications/ColossalChat/coati/dataset/tokenization_utils.py b/applications/ColossalChat/coati/dataset/tokenization_utils.py
@@ -49,6 +49,10 @@ def tokenize_sft(
 
  messages = data_point["messages"]
  template = deepcopy(conversation_template)
+
+ if messages[0]["from"] == "system":
+ template.system_message = str(messages[0]["content"])
+ messages.pop(0)
  template.messages = []
  for idx, mess in enumerate(messages):
  if mess["from"] != template.roles[idx % 2]:
@@ -148,11 +152,14 @@ def tokenize_prompt(
  template = deepcopy(conversation_template)
  template.messages = []
 
+ if messages[0]["from"] == "system":
+ template.system_message = str(messages[0]["content"])
+ messages.pop(0)
+
  for idx, mess in enumerate(messages):
  if mess["from"] != template.roles[idx % 2]:
  raise ValueError(
- f"Message should iterate between user and assistant and starts with a \
- line from the user. Got the following data:\n{messages}"
+ f"Message should iterate between user and assistant and starts with a line from the user. Got the following data:\n{messages}"
  )
  template.append_message(mess["from"], mess["content"])
 
@@ -162,7 +169,7 @@ def tokenize_prompt(
  template.messages = template.messages[:-1]
 
  # Prepare data
- prompt = template.get_prompt(length=len(template.messages) - 1, add_generation_prompt=True)
+ prompt = template.get_prompt(length=len(template.messages), add_generation_prompt=True)
  tokenized = tokenizer([prompt], add_special_tokens=False)["input_ids"][0]
 
  if tokenizer.bos_token_id is not None:
@@ -225,6 +232,10 @@ def tokenize_rlhf(
  template = deepcopy(conversation_template)
  template.clear()
 
+ if context[0]["from"] == "system":
+ template.system_message = str(context[0]["content"])
+ context.pop(0)
+
  for idx, mess in enumerate(context):
  if mess["from"] != template.roles[idx % 2]:
  raise ValueError(
@@ -345,6 +356,10 @@ def tokenize_kto(
  template = deepcopy(conversation_template)
  template.clear()
 
+ if prompt[0]["from"] == "system":
+ template.system_message = str(prompt[0]["content"])
+ prompt.pop(0)
+
  if prompt[0].get("from", None) != "user":
  raise ValueError("conversation should start with user")
  if completion.get("from", None) != "assistant":
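The four tokenization hunks above share one pattern: if the first turn comes from `system`, its content is stored in `template.system_message` and the turn is removed before the user/assistant alternation check runs. A minimal sketch of that pattern follows. `ToyTemplate` and `split_system_message` are hypothetical names used only for illustration, not the coati API, and unlike the diff (which calls `pop(0)` on the incoming list) the sketch works on a copy and tolerates an empty conversation.

```python
from copy import deepcopy
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class ToyTemplate:
    """Hypothetical stand-in for a conversation template (not the coati class)."""

    roles: Tuple[str, str] = ("user", "assistant")
    system_message: str = ""
    messages: List[Dict[str, str]] = field(default_factory=list)


def split_system_message(messages: List[Dict[str, str]], template: ToyTemplate) -> ToyTemplate:
    """Move a leading system turn into the template, then check role alternation."""
    template = deepcopy(template)
    turns = list(messages)  # copy, so the caller's list is left untouched
    if turns and turns[0]["from"] == "system":
        template.system_message = str(turns[0]["content"])
        turns = turns[1:]
    template.messages = []
    for idx, mess in enumerate(turns):
        expected = template.roles[idx % 2]
        if mess["from"] != expected:
            raise ValueError(f"Expected role {expected!r} at turn {idx}, got {mess['from']!r}")
        template.messages.append(mess)
    return template


conversation = [
    {"from": "system", "content": "Be brief."},
    {"from": "user", "content": "Hi"},
    {"from": "assistant", "content": "Hello"},
]
result = split_system_message(conversation, ToyTemplate())
```

Working on a copy is a deliberate difference from the diff: popping from the original list changes the data point itself, which matters if the same record is tokenized more than once.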

diff --git a/applications/ColossalChat/coati/models/loss.py b/applications/ColossalChat/coati/models/loss.py
@@ -46,7 +46,10 @@ def forward(
  action_mask: Optional[torch.Tensor] = None,
  ) -> torch.Tensor:
  skip = False
- ratio_ = ((log_probs - old_log_probs) * action_mask).exp()
+ if action_mask is None:
+ ratio_ = (log_probs - old_log_probs).exp()
+ else:
+ ratio_ = ((log_probs - old_log_probs) * action_mask).exp()
 
  # note that if dropout is disabled (recommended), ratio will always be 1.
  if ratio_.mean() > self.skip_threshold:
@@ -56,7 +59,10 @@ def forward(
  surr1 = ratio * advantages
  surr2 = ratio.clamp(1 - self.clip_eps, 1 + self.clip_eps) * advantages
  loss = -torch.min(surr1, surr2)
- loss = masked_mean(loss, action_mask)
+ if action_mask is not None:
+ loss = masked_mean(loss, action_mask)
+ else:
+ loss = loss.mean(dim=1)
  loss = loss.mean()
  return loss, skip, ratio_.max()
 
@@ -81,8 +87,10 @@ def forward(
  values_clipped = old_values + (values - old_values).clamp(-self.clip_eps, self.clip_eps)
  surr1 = (values_clipped - returns) ** 2
  surr2 = (values - returns) ** 2
- loss = torch.max(surr1, surr2) / torch.sum(action_mask)
- loss = torch.sum(loss * action_mask)
+ if action_mask is not None:
+ loss = torch.sum(torch.max(surr1, surr2) / torch.sum(action_mask) * action_mask)
+ else:
+ loss = torch.mean(torch.max(surr1, surr2))
  return 0.5 * loss
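The policy-loss hunk above makes `action_mask` optional: with a mask, each sequence is averaged over its unmasked tokens; without one, it falls back to a plain per-sequence mean before averaging over the batch. The sketch below reproduces that reduction in plain Python so it runs without PyTorch. `row_mean` and `clipped_policy_loss` are hypothetical helper names, the behaviour of `masked_mean` is assumed rather than taken from the source, and every masked row is assumed to keep at least one token.

```python
import math
from typing import List, Optional

Row = List[float]


def row_mean(values: Row, mask: Optional[Row]) -> float:
    """Mean of one row; with a mask, average only the positions where mask is 1."""
    if mask is None:
        return sum(values) / len(values)
    return sum(v * m for v, m in zip(values, mask)) / sum(mask)


def clipped_policy_loss(
    log_probs: List[Row],
    old_log_probs: List[Row],
    advantages: List[Row],
    action_mask: Optional[List[Row]] = None,
    clip_eps: float = 0.2,
) -> float:
    """Clipped surrogate loss, reduced per sequence and then over the batch."""
    row_losses = []
    for i, (lp_row, old_row, adv_row) in enumerate(zip(log_probs, old_log_probs, advantages)):
        token_losses = []
        for lp, old, adv in zip(lp_row, old_row, adv_row):
            ratio = math.exp(lp - old)
            clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
            token_losses.append(-min(ratio * adv, clipped * adv))
        mask = None if action_mask is None else action_mask[i]
        row_losses.append(row_mean(token_losses, mask))
    return sum(row_losses) / len(row_losses)


zeros = [[0.0, 0.0], [0.0, 0.0]]  # identical old/new log-probs, so every ratio is 1
adv = [[1.0, 3.0], [2.0, 2.0]]
unmasked = clipped_policy_loss(zeros, zeros, adv)
masked = clipped_policy_loss(zeros, zeros, adv, action_mask=[[1.0, 0.0], [1.0, 1.0]])
all_ones = clipped_policy_loss(zeros, zeros, adv, action_mask=[[1.0, 1.0], [1.0, 1.0]])
```

An all-ones mask and a missing mask give the same value under this reduction. That is why a trainer can disable masking either by filling the mask with ones or by passing `None`.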
 
 

diff --git a/applications/ColossalChat/coati/models/utils.py b/applications/ColossalChat/coati/models/utils.py
@@ -138,6 +138,7 @@ def disable_dropout(model: torch.nn.Module):
  Returns:
  None
  """
- for module in model.modules():
- if isinstance(module, torch.nn.Dropout):
- module.p = 0.0
+ if model is not None:
+ for module in model.modules():
+ if isinstance(module, torch.nn.Dropout):
+ module.p = 0.0
diff --git a/applications/ColossalChat/coati/trainer/dpo.py b/applications/ColossalChat/coati/trainer/dpo.py
@@ -56,6 +56,7 @@ def __init__(
  beta: float = 0.1,
  gamma: float = 0.0,
  length_normalization: bool = False,
+ apply_loss_mask: bool = True,
  accumulation_steps: int = 1,
  start_epoch: int = 0,
  save_interval: int = 0,
@@ -67,6 +68,7 @@ def __init__(
  self.actor_scheduler = actor_lr_scheduler
  self.tokenizer = tokenizer
  self.actor_loss_fn = DpoLoss(beta, gamma)
+ self.apply_loss_mask = apply_loss_mask
  self.save_interval = save_interval
  self.coordinator = coordinator
  self.save_dir = save_dir
@@ -135,6 +137,10 @@ def _train(self, epoch: int):
  batch["reject_attention_mask"],
  batch["reject_loss_mask"],
  )
+ if not self.apply_loss_mask:
+ chosen_loss_mask = chosen_loss_mask.fill_(1.0)
+ reject_loss_mask = reject_loss_mask.fill_(1.0)
+
  batch_size = chosen_input_ids.size()[0]
 
  actor_all_logits = self.model(
@@ -284,6 +290,9 @@ def _eval(self, epoch: int):
  batch["reject_attention_mask"],
  batch["reject_loss_mask"],
  )
+ if not self.apply_loss_mask:
+ chosen_loss_mask = chosen_loss_mask.fill_(1.0)
+ reject_loss_mask = reject_loss_mask.fill_(1.0)
 
  batch_size = chosen_input_ids.size()[0]
 

diff --git a/applications/ColossalChat/coati/trainer/kto.py b/applications/ColossalChat/coati/trainer/kto.py
@@ -6,7 +6,7 @@
 from typing import Any, Optional
 
 import torch
-import torch.distributed
+import torch.distributed as dist
 from coati.models.loss import KTOLoss
 from coati.models.utils import calc_masked_log_probs
 from coati.trainer.utils import all_reduce_mean
@@ -59,6 +59,7 @@ def __init__(
  beta: float = 0.1,
  desirable_weight: float = 1.0,
  undesirable_weight: float = 1.0,
+ apply_loss_mask: bool = True,
  accumulation_steps: int = 1,
  start_epoch: int = 0,
  save_interval: int = 0,
@@ -70,6 +71,7 @@ def __init__(
  self.actor_scheduler = actor_lr_scheduler
  self.tokenizer = tokenizer
  self.kto_loss = KTOLoss(beta=beta, desirable_weight=desirable_weight, undesirable_weight=undesirable_weight)
+ self.apply_loss_mask = apply_loss_mask
  self.save_interval = save_interval
  self.coordinator = coordinator
  self.save_dir = save_dir
@@ -134,6 +136,10 @@ def _train(self, epoch: int):
  batch["kl_attention_mask"],
  batch["kl_loss_mask"],
  )
+ if not self.apply_loss_mask:
+ loss_mask = loss_mask.fill_(1.0)
+ kl_loss_mask = kl_loss_mask.fill_(1.0)
+
  batch_size = input_ids.size()[0]
 
  # actor logits
@@ -182,8 +188,28 @@ def _train(self, epoch: int):
 
  # sync
  loss_mean = all_reduce_mean(tensor=loss)
- chosen_rewards_mean = all_reduce_mean(tensor=chosen_rewards.mean())
- rejected_rewards_mean = all_reduce_mean(tensor=rejected_rewards.mean())
+ chosen_reward_mean = chosen_rewards.mean()
+ chosen_rewards_list = [
+ torch.tensor(0, dtype=loss.dtype, device=loss.device) for _ in range(dist.get_world_size())
+ ]
+ dist.all_gather(chosen_rewards_list, chosen_reward_mean)
+ rejected_reward_mean = rejected_rewards.mean()
+ rejected_rewards_list = [
+ torch.tensor(0, dtype=loss.dtype, device=loss.device) for _ in range(dist.get_world_size())
+ ]
+ dist.all_gather(rejected_rewards_list, rejected_reward_mean)
+ chosen_rewards_list = [i for i in chosen_rewards_list if not i.isnan()]
+ rejected_rewards_list = [i for i in rejected_rewards_list if not i.isnan()]
+ chosen_rewards_mean = (
+ torch.stack(chosen_rewards_list).mean()
+ if len(chosen_rewards_list) > 0
+ else torch.tensor(torch.nan, dtype=loss.dtype, device=loss.device)
+ )
+ rejected_rewards_mean = (
+ torch.stack(rejected_rewards_list).mean()
+ if len(rejected_rewards_list) > 0
+ else torch.tensor(torch.nan, dtype=loss.dtype, device=loss.device)
+ )
  self.accumulative_meter.add("chosen_rewards", chosen_rewards_mean.to(torch.float16).mean().item())
  self.accumulative_meter.add("rejected_rewards", rejected_rewards_mean.to(torch.float16).mean().item())
  self.accumulative_meter.add("loss", loss_mean.to(torch.float16).detach().item())
@@ -256,6 +282,11 @@ def _eval(self, epoch: int):
  batch["kl_attention_mask"],
  batch["kl_loss_mask"],
  )
+
+ if not self.apply_loss_mask:
+ loss_mask = loss_mask.fill_(1.0)
+ kl_loss_mask = kl_loss_mask.fill_(1.0)
+
  batch_size = input_ids.size()[0]
 
  # actor logits
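The KTO logging hunk above replaces `all_reduce_mean` with an `all_gather` of per-rank means followed by a NaN filter. A plausible reason, stated here as an assumption, is that a rank whose local batch has no chosen (or no rejected) samples yields a NaN mean, which would otherwise contaminate the reduced value. The filtering step can be sketched in plain Python; `nan_free_mean` is a hypothetical name and the list stands in for the gathered tensors.

```python
import math
from typing import List


def nan_free_mean(per_rank_means: List[float]) -> float:
    """Average the per-rank means, skipping ranks that reported NaN."""
    valid = [v for v in per_rank_means if not math.isnan(v)]
    if not valid:
        return math.nan  # no rank had samples of this kind
    return sum(valid) / len(valid)


mixed = nan_free_mean([1.0, math.nan, 3.0])
empty = nan_free_mean([math.nan, math.nan])
```

This is a mean of per-rank means, so it is not weighted by how many samples each rank contributed. That is acceptable for logging but worth keeping in mind when comparing runs with very uneven batches.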

diff --git a/applications/ColossalChat/coati/trainer/orpo.py b/applications/ColossalChat/coati/trainer/orpo.py
@@ -52,6 +52,7 @@ def __init__(
  tokenizer: PreTrainedTokenizerBase,
  max_epochs: int = 1,
  lam: float = 0.1,
+ apply_loss_mask: bool = True,
  accumulation_steps: int = 1,
  start_epoch: int = 0,
  save_interval: int = 0,
@@ -67,6 +68,7 @@ def __init__(
  self.save_dir = save_dir
  self.num_train_step = 0
  self.lam = lam
+ self.apply_loss_mask = apply_loss_mask
  self.accumulation_steps = accumulation_steps
  self.device = get_current_device()
  self.accumulative_meter = AccumulativeMeanMeter()
@@ -130,6 +132,11 @@ def _train(self, epoch: int):
  batch["reject_attention_mask"],
  batch["reject_loss_mask"],
  )
+
+ if not self.apply_loss_mask:
+ chosen_loss_mask = chosen_loss_mask.fill_(1.0)
+ reject_loss_mask = reject_loss_mask.fill_(1.0)
+
  batch_size = chosen_input_ids.size()[0]
  actor_out = self.model(
  input_ids=torch.cat([chosen_input_ids, reject_input_ids]),
@@ -263,6 +270,11 @@ def _eval(self, epoch: int):
  batch["reject_attention_mask"],
  batch["reject_loss_mask"],
  )
+
+ if not self.apply_loss_mask:
+ chosen_loss_mask = chosen_loss_mask.fill_(1.0)
+ reject_loss_mask = reject_loss_mask.fill_(1.0)
+
  batch_size = chosen_input_ids.size()[0]
  actor_out = self.model(
  input_ids=torch.cat([chosen_input_ids, reject_input_ids]),

diff --git a/applications/ColossalChat/coati/trainer/ppo.py b/applications/ColossalChat/coati/trainer/ppo.py
@@ -102,6 +102,7 @@ def __init__(
  sample_buffer: bool = False,
  dataloader_pin_memory: bool = True,
  offload_inference_models: bool = True,
+ apply_loss_mask: bool = True,
  accumulation_steps: int = 1,
  save_interval: int = 0,
  save_dir: str = None,
@@ -140,6 +141,7 @@ def __init__(
  self.actor_optim = actor_optim
  self.critic_optim = critic_optim
  self.save_interval = save_interval
+ self.apply_loss_mask = apply_loss_mask
  self.coordinator = coordinator
  self.actor_save_dir = os.path.join(save_dir, "actor")
  self.critic_save_dir = os.path.join(save_dir, "critic")
@@ -229,7 +231,10 @@ def _training_step(self, experience: Experience):
  action_log_probs = calc_action_log_probs(actor_logits, experience.sequences, num_actions)
 
  actor_loss, to_skip, max_ratio = self.actor_loss_fn(
- action_log_probs, experience.action_log_probs, experience.advantages, action_mask=experience.action_mask
+ action_log_probs,
+ experience.action_log_probs,
+ experience.advantages,
+ action_mask=experience.action_mask if self.apply_loss_mask else None,
  )
  actor_loss = (1 - self.ptx_coef) * actor_loss
  if not to_skip:
@@ -249,7 +254,10 @@ def _training_step(self, experience: Experience):
  input_ids=experience.sequences, attention_mask=experience.attention_mask
  ) # [batch size, prompt_length + response_length]
  critic_loss = self.critic_loss_fn(
- values[:, -num_actions:], experience.values, experience.advantages, action_mask=experience.action_mask
+ values[:, -num_actions:],
+ experience.values,
+ experience.advantages,
+ action_mask=experience.action_mask if self.apply_loss_mask else None,
  )
  critic_loss = critic_loss * self.vf_coef
  self.critic_booster.backward(loss=critic_loss, optimizer=self.critic_optim)