Describe the bug
In DPO and all its variants, the policy is initialized from the reference policy. Therefore, at the first iteration, the log probs from the policy and the log probs from the reference policy should be exactly the same.
However, I found that the log probs differ at the first iteration, as shown in the figures below.
TP=4, DP=1: the log probs differ.
TP=2, DP=1: the log probs are exactly the same.
Steps/Code to reproduce bug
1. Pick a model.
2. Set TP=4.
3. Print out `pi_logprobs` and `ref_logprobs` at iteration=0.
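For the last step, a small helper makes the comparison easier to read than two raw printouts. This is only a sketch: `compare_logprobs` is a hypothetical name, not part of any library, and it assumes the two sets of log probs have already been gathered into flat Python lists of floats (for example by calling `.tolist()` on the tensors).

```python
def compare_logprobs(pi_logprobs, ref_logprobs):
    """Return (exactly_equal, max_abs_diff) for two equal-length float sequences.

    Hypothetical helper for illustration; inputs are assumed to be flat lists
    of per-sequence log probs collected at iteration 0.
    """
    if len(pi_logprobs) != len(ref_logprobs):
        raise ValueError("sequences must have the same length")
    # Element-wise absolute difference; 0.0 everywhere means a bit-exact match.
    diffs = [abs(p - r) for p, r in zip(pi_logprobs, ref_logprobs)]
    max_abs_diff = max(diffs, default=0.0)
    return max_abs_diff == 0.0, max_abs_diff
```

Reporting the maximum absolute difference alongside the exact-match flag helps distinguish a tiny numerical discrepancy from a large one, such as a mismatch in the loaded weights.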
Expected behavior
Regardless of the GBS, MBS, TP, PP, DP, or Forward-MBS settings, the policy log probs and the reference log probs should be exactly the same at iteration 0.
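One way to turn this expectation into a quick sanity check, assuming the standard DPO objective (variants may use a different loss): when the policy and reference log probs are identical, the implicit reward margin is zero, so the loss at iteration 0 should equal log 2 (about 0.6931) for any `beta`. The function name and the default `beta` below are illustrative assumptions, not taken from the training code.

```python
import math

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard DPO loss for one preference pair, given sequence log probs.

    Illustrative sketch; beta=0.1 is an assumed default, not a value from
    the issue.
    """
    # Implicit reward margin between the chosen and rejected responses.
    margin = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    # -log(sigmoid(margin)) written as log(1 + exp(-margin)).
    return math.log1p(math.exp(-margin))
```

If the loss logged at iteration 0 is not log 2, the policy and reference log probs do not match.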