Policy Log Probs and Reference Log Probs differ at 1st iteration of DPO/RPO #227

shengyangs · 2024-07-03T20:10:35Z

Describe the bug

In DPO and all its variants, the policy is initialized at the reference policy. Therefore, in the first iteration, the log probs from the policy and the log probs from the reference policy should be exactly the same.
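One consequence that is easy to check: if the policy and reference log probs are identical, the implicit reward margin is zero and the standard DPO loss is log(2) ≈ 0.693 at iteration 0, whatever beta is. Below is a minimal sketch of that sanity check in plain Python. The function name `dpo_loss`, the default `beta=0.1`, and the numeric log probs are illustrative assumptions, not taken from the training code.

```python
import math

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard DPO loss for one preference pair, given sequence-level log probs.

    Hypothetical helper for illustration; beta=0.1 is an assumed default.
    """
    # Implicit reward margin: difference of the policy/reference log-ratios.
    margin = (pi_chosen - ref_chosen) - (pi_rejected - ref_rejected)
    x = beta * margin
    # -log(sigmoid(x)) = log(1 + exp(-x)), written in a numerically stable form.
    if x >= 0:
        return math.log1p(math.exp(-x))
    return -x + math.log1p(math.exp(x))

# Made-up log probs; the policy equals the reference, as at iteration 0.
loss_at_init = dpo_loss(-12.5, -20.0, -12.5, -20.0)
print(loss_at_init)  # log(2), about 0.6931
```

A first-iteration loss that is visibly different from 0.693 would be another symptom of the same mismatch.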

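However, I found that the log probs differ at the 1st iteration for some parallelism settings, as shown in the figures below.

[Figure: TP4 DP1. The policy and reference log probs differ.]

[Figure: TP2 DP1. The policy and reference log probs are exactly the same.]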

Steps/Code to reproduce bug

1. Pick a model
2. Set TP=4
3. Print out the pi_logprobs and ref_logprobs at iteration=0
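For the last step, comparing the values programmatically is less error-prone than eyeballing the printout. A minimal sketch in plain Python, assuming the two sets of log probs have been flattened into lists of floats (in the real training loop they would be tensors). `compare_logprobs` is a hypothetical helper, not part of the codebase, and the numbers are made up.

```python
def compare_logprobs(pi_logprobs, ref_logprobs):
    """Return (exact_match, max_abs_diff, index_of_max_diff).

    Hypothetical helper: both inputs are flat sequences of floats.
    index_of_max_diff is None when the sequences match exactly.
    """
    if len(pi_logprobs) != len(ref_logprobs):
        raise ValueError("pi_logprobs and ref_logprobs must have the same length")
    max_diff, argmax = 0.0, None
    for i, (p, r) in enumerate(zip(pi_logprobs, ref_logprobs)):
        d = abs(p - r)
        if d > max_diff:
            max_diff, argmax = d, i
    return max_diff == 0.0, max_diff, argmax

# Illustrative values only, not taken from the runs in this issue.
print(compare_logprobs([-1.0, -2.5], [-1.0, -2.5]))  # (True, 0.0, None)
print(compare_logprobs([-1.0, -2.5], [-1.0, -2.0]))  # (False, 0.5, 1)
```

The check is for exact equality rather than a tolerance because the expectation here is bitwise-identical values at iteration 0.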

Expected behavior

Regardless of the GBS, MBS, TP, PP, DP, or Forward-MBS settings, the policy log probs and the reference log probs should be exactly the same at the first iteration.

shengyangs added the bug Something isn't working label Jul 3, 2024

shengyangs assigned trias702, gshennvm and shengyangs Jul 3, 2024

shengyangs mentioned this issue Jul 3, 2024

fix log probs mismatch #228

Merged
