LLaMA3_1-8B-Instruct LoRA fine-tuning: data formatting question #275

Evilxya · 2024-11-01T12:03:04Z

I noticed that <|eot_id|> is appended to the response, but [tokenizer.pad_token_id] is also appended to input_ids. Isn't this adding the end token twice?

    def process_func(example):
        # The Llama tokenizer splits a single Chinese character into several tokens,
        # so the maximum length is relaxed to keep each sample complete.
        MAX_LENGTH = 384
        input_ids, attention_mask, labels = [], [], []
        # add_special_tokens=False: do not prepend special tokens at the start
        instruction = tokenizer(f"<|start_header_id|>user<|end_header_id|>\n\n{example['instruction'] + example['input']}<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n", add_special_tokens=False)
        response = tokenizer(f"{example['output']}<|eot_id|>", add_special_tokens=False)
        input_ids = instruction["input_ids"] + response["input_ids"] + [tokenizer.pad_token_id]
        # The eos token should be attended to as well, so its mask entry is 1
        attention_mask = instruction["attention_mask"] + response["attention_mask"] + [1]
        labels = [-100] * len(instruction["input_ids"]) + response["input_ids"] + [tokenizer.pad_token_id]
        if len(input_ids) > MAX_LENGTH:  # truncate
            input_ids = input_ids[:MAX_LENGTH]
            attention_mask = attention_mask[:MAX_LENGTH]
            labels = labels[:MAX_LENGTH]
        return {
            "input_ids": input_ids,
            "attention_mask": attention_mask,
            "labels": labels
        }
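
To make the question concrete, here is a minimal sketch of the same concatenation on plain lists of ids, without loading a tokenizer. The ids are placeholders chosen for illustration, and the sketch assumes that pad_token_id has been aliased to the id of <|eot_id|> (a common setup, but not confirmed for this script). The helper name concat_as_posted is hypothetical.

```python
# Placeholder token ids, for illustration only (assumptions, not real vocabulary ids).
EOT_ID = 9  # stands in for the id of <|eot_id|>
PAD_ID = 9  # stands in for tokenizer.pad_token_id, assumed here to alias <|eot_id|>


def concat_as_posted(instruction_ids, response_ids, pad_id):
    """Mirror the concatenation in process_func on plain lists of ids."""
    input_ids = instruction_ids + response_ids + [pad_id]
    labels = [-100] * len(instruction_ids) + response_ids + [pad_id]
    return input_ids, labels


instruction_ids = [1, 2, 3]
response_ids = [4, 5, EOT_ID]  # the response text already ends with <|eot_id|>

input_ids, labels = concat_as_posted(instruction_ids, response_ids, PAD_ID)
print(input_ids[-2:])  # the last two positions hold the end token and the appended pad token
```

Under that assumption the sequence ends with the same id twice, and both positions are supervised in labels. If pad_token_id is a different id, the model is instead trained to emit a pad token after <|eot_id|>.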

GithubX-F · 2024-11-14T01:29:12Z

   
    input_ids = instruction["input_ids"] + response["input_ids"]
    attention_mask = instruction["attention_mask"] + response["attention_mask"]
    
    labels = [-100] * len(instruction["input_ids"]) + response["input_ids"]
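
The suggestion above drops the extra [tokenizer.pad_token_id], since the response string already ends with <|eot_id|>. A minimal sketch of the resulting logic, including the truncation step from the original function, might look like the following. It works on already-tokenized id lists so it runs without a model; build_example and the sample ids are hypothetical, and the attention mask is assumed to be all ones because no padding is applied at this stage.

```python
def build_example(instruction_ids, response_ids, max_length=384):
    """Concatenate prompt and response ids without appending an extra end token.

    Prompt positions are masked out of the loss with -100; response positions
    (including the trailing <|eot_id|> already present in response_ids) are kept.
    """
    input_ids = instruction_ids + response_ids
    attention_mask = [1] * len(input_ids)  # no padding here, so every token is attended to
    labels = [-100] * len(instruction_ids) + response_ids

    # Truncate all three lists to the same length so they stay aligned.
    return {
        "input_ids": input_ids[:max_length],
        "attention_mask": attention_mask[:max_length],
        "labels": labels[:max_length],
    }


# Placeholder ids for illustration; 9 stands in for <|eot_id|>.
example = build_example([1, 2, 3], [4, 5, 9])
truncated = build_example([1, 2, 3], [4, 5, 9], max_length=4)
```

Note that truncation can cut off the trailing <|eot_id|> for long samples, which is also true of the original function.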
