Feature request

The proposed feature extends the `torch_call` function of `DataCollatorForCompletionOnlyLM` so that user messages are masked correctly even when user and assistant messages do not strictly alternate.
The current implementation requires that every assistant message follow a user message and every user message follow an assistant message.
Two adjacent messages with the same role therefore produce incorrect masking: the current code does not account for a large `start` index being paired with a small `end` index when the two roles do not take turns one by one:
```python
for idx, (start, end) in enumerate(zip(human_token_ids_idxs, response_token_ids_idxs)):
    # Make pytorch loss function ignore all non response tokens
    if idx != 0:
        batch["labels"][i, start:end] = self.ignore_index
    else:
        batch["labels"][i, :end] = self.ignore_index

if len(response_token_ids_idxs) < len(human_token_ids_idxs):
    batch["labels"][i, human_token_ids_idxs[-1] :] = self.ignore_index
```
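To make the failure concrete, here is a minimal, self-contained sketch (the index lists and sequence length are hypothetical stand-ins for positions the collator would find) that replays the pairing logic above on a conversation where the user speaks twice in a row:

```python
# Hypothetical token positions (not from a real tokenizer): the user speaks
# twice in a row (starts at 5 and 7), so human starts outnumber response starts.
human_token_ids_idxs = [0, 5, 7]      # user-turn start positions
response_token_ids_idxs = [3, 10]     # assistant-turn start positions
IGNORE_INDEX = -100

labels = list(range(14))              # stand-in for one row of batch["labels"]

# Reproduction of the current pairing logic on plain lists
for idx, (start, end) in enumerate(zip(human_token_ids_idxs, response_token_ids_idxs)):
    if idx != 0:
        labels[start:end] = [IGNORE_INDEX] * (end - start)
    else:
        labels[:end] = [IGNORE_INDEX] * end
if len(response_token_ids_idxs) < len(human_token_ids_idxs):
    labels[human_token_ids_idxs[-1]:] = [IGNORE_INDEX] * (len(labels) - human_token_ids_idxs[-1])

# The assistant reply starting at position 10 is wrongly masked out:
print(labels[10:])  # → [-100, -100, -100, -100]
```

Because `zip` silently pairs the i-th human start with the i-th response start, and the trailing-human branch masks from the last human start onward, the final assistant reply here is erased from the loss; in the mirrored case (two consecutive assistant messages) a user segment would survive unmasked instead.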
A two-pointer scan solves the issue; below is an example solution:
```python
# build test cases for response_token_ids_idxs and human_token_ids_idxs
response_token_ids_idxs = [1, 4, 6, 7, 8, 9, 15, 36, 57, 88, 89, 200]
human_token_ids_idxs = [2, 5, 12, 13, 56, 66, 90, 199, 201, 202]

pointer_human = 0
pointer_response = 0
mask_start = -1
mask_end = -1
while pointer_response <= len(response_token_ids_idxs) - 1 and pointer_human <= len(human_token_ids_idxs) - 1:
    if mask_start == -1:
        mask_start = 0 if response_token_ids_idxs[0] != 0 else human_token_ids_idxs[pointer_human]
    if mask_end == -1:
        mask_end = response_token_ids_idxs[0]
    if response_token_ids_idxs[pointer_response] > human_token_ids_idxs[pointer_human]:
        if mask_end < mask_start:
            mask_end = response_token_ids_idxs[pointer_response]
        pointer_human += 1
    elif response_token_ids_idxs[pointer_response] < human_token_ids_idxs[pointer_human]:
        if mask_start < mask_end:
            # will substitute this line with batch["labels"][i, mask_start:mask_end] = self.ignore_index when pulling a request
            print(mask_start, "~", mask_end)
            mask_start = human_token_ids_idxs[pointer_human]
        pointer_response += 1
    else:
        raise Exception("response_token_id and human_token_id could not be the same. Please check your response and human template ids")

if pointer_human < len(human_token_ids_idxs) - 1:
    while human_token_ids_idxs[pointer_human] < mask_end:
        pointer_human += 1
if pointer_human <= len(human_token_ids_idxs) - 1:
    # will substitute this line with batch["labels"][i, human_token_ids_idxs[pointer_human]:] = self.ignore_index when pulling a request
    print(human_token_ids_idxs[pointer_human], "~", "end")
```
Running this code prints:

```
0 ~ 1
2 ~ 4
5 ~ 6
12 ~ 15
56 ~ 57
66 ~ 88
90 ~ 200
201 ~ end
```
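For testing and reuse, the two-pointer loop can also be wrapped in a small helper. The function name `mask_intervals` and the `None`-as-end convention are my own, not part of TRL; the body is otherwise a transcription of the sketch above, with one added bounds guard on the trailing `while`:

```python
def mask_intervals(response_idxs, human_idxs):
    """Return (start, end) spans to ignore in the loss.

    An end of None means "mask to the end of the sequence". Raises
    ValueError if a response start collides with a human start.
    """
    spans = []
    pointer_human = 0
    pointer_response = 0
    mask_start = -1
    mask_end = -1
    while pointer_response < len(response_idxs) and pointer_human < len(human_idxs):
        if mask_start == -1:
            mask_start = 0 if response_idxs[0] != 0 else human_idxs[pointer_human]
        if mask_end == -1:
            mask_end = response_idxs[0]
        if response_idxs[pointer_response] > human_idxs[pointer_human]:
            if mask_end < mask_start:
                mask_end = response_idxs[pointer_response]
            pointer_human += 1
        elif response_idxs[pointer_response] < human_idxs[pointer_human]:
            if mask_start < mask_end:
                spans.append((mask_start, mask_end))
                mask_start = human_idxs[pointer_human]
            pointer_response += 1
        else:
            raise ValueError("response and human template ids must not coincide")
    # Trailing human turns after the last response: mask to the end of the row.
    if pointer_human < len(human_idxs) - 1:
        # bounds guard added relative to the sketch above
        while pointer_human < len(human_idxs) and human_idxs[pointer_human] < mask_end:
            pointer_human += 1
    if pointer_human <= len(human_idxs) - 1:
        spans.append((human_idxs[pointer_human], None))
    return spans


print(mask_intervals(
    [1, 4, 6, 7, 8, 9, 15, 36, 57, 88, 89, 200],
    [2, 5, 12, 13, 56, 66, 90, 199, 201, 202],
))
# → [(0, 1), (2, 4), (5, 6), (12, 15), (56, 57), (66, 88), (90, 200), (201, None)]
```

Inside `torch_call`, each returned span would then become `batch["labels"][i, start:end] = self.ignore_index`, with `None` meaning "to the end of the row".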
Motivation
Support flexible and correct masking strategies for `DataCollatorForCompletionOnlyLM`, in particular correct masking of consecutive messages from the same role.
> The current implementation requires that every assistant message follow a user message and every user message follow an assistant message.
I'm not sure why we would want a dataset in which the roles are not interleaved. Moreover, some chat templates explicitly assume that messages are an interleaving of user and assistant messages.
Do you have an example?
We encountered this problem because we wanted to fine-tune models on real human conversations. In natural conversation it is common for one utterance to be followed by another from the same speaker, for example in counseling conversations.
Not all researchers aim to build an LLM as an AI assistant; if that were the only goal, I agree that interleaved roles would suffice.
Thank you very much for the clarification. We are currently working on a new dataset format that could be related (though with a different motivation). See #2148
Your contribution
I have submitted a PR: #2000