I get an assertion error when loading the dataset. How can I fix it? I'm using the glm3 model, and the model has been loaded successfully. I've traced the failure to the statement dataset = preprocess_dataset(dataset, tokenizer, data_args, training_args, ",sft") and cannot debug any further. #256
Comments
I have the same problem.
@tomorrow-zy dbgpt_hub/llm_base/load_tokenizer.py, line 179
After training with this change, inference sometimes produces inf values; I'm not sure whether that is related.
---- Original message ----
@tomorrow-zy In dbgpt_hub/llm_base/load_tokenizer.py, line 179, change "right" to "left".
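For anyone hitting the same assertion, here is a minimal sketch of that suggested change; the "THUDM/chatglm3-6b" path and the AutoTokenizer loading style are assumptions for illustration, not the repo's actual load_tokenizer.py code:

```python
# Minimal sketch of the suggested fix (assumed loading code, not the repo's).
# ChatGLM3 ships a custom tokenizer whose _pad() asserts padding_side == "left",
# so the tokenizer must be configured for left padding.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "THUDM/chatglm3-6b",     # assumed checkpoint; substitute your local glm3 path
    trust_remote_code=True,  # loads ChatGLM's custom tokenization_chatglm.py
)
tokenizer.padding_side = "left"  # the suggested change from "right" to "left"
```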
I checked; the author's comment there reads # training with left-padded tensors in fp16 precision may cause overflow
OK, thanks.
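A side note on that overflow warning: one common mitigation (a suggestion here, not something confirmed in this thread) is to run mixed-precision training in bf16 rather than fp16 where the hardware supports it, since bf16 keeps fp32's exponent range and is far less prone to overflow with left-padded tensors:

```python
# Hedged sketch: prefer bf16 over fp16 mixed precision to avoid the overflow
# the author's comment warns about. Requires a GPU with bf16 support.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="outputs",  # placeholder path
    fp16=False,
    bf16=True,  # bf16 shares fp32's exponent range, so large values don't overflow
)
```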
04/12/2024 10:26:38 - INFO - dbgpt_hub.llm_base.adapter - Fine-tuning method: LoRA
04/12/2024 10:26:39 - INFO - dbgpt_hub.llm_base.load_tokenizer - trainable params: 15597568 || all params: 6259181568 || trainable%: 0.2492
Running tokenizer on dataset: 0%| | 0/8659 [00:00<?, ? examples/s]
Traceback (most recent call last):
File "D:\text2sql\DB-GPT-Hub-main\DB-GPT-Hub-main\run_sft.py", line 79, in
start_sft(train_args)
File "D:\text2sql\DB-GPT-Hub-main\DB-GPT-Hub-main\dbgpt_hub\train\sft_train_api.py", line 43, in start_sft
sft_train.train(args)
File "D:\text2sql\DB-GPT-Hub-main\DB-GPT-Hub-main\dbgpt_hub\train\sft_train.py", line 144, in train
run_sft(
File "D:\text2sql\DB-GPT-Hub-main\DB-GPT-Hub-main\dbgpt_hub\train\sft_train.py", line 53, in run_sft
dataset = preprocess_dataset(dataset, tokenizer, data_args, training_args, "sft")
File "D:\text2sql\DB-GPT-Hub-main\DB-GPT-Hub-main\dbgpt_hub\data_process\data_utils.py", line 810, in preprocess_dataset
dataset = dataset.map(
File "D:\Anaconda3\envs\chatsql2\lib\site-packages\datasets\arrow_dataset.py", line 593, in wrapper
out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
File "D:\Anaconda3\envs\chatsql2\lib\site-packages\datasets\arrow_dataset.py", line 558, in wrapper
out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
File "D:\Anaconda3\envs\chatsql2\lib\site-packages\datasets\arrow_dataset.py", line 3105, in map
for rank, done, content in Dataset._map_single(**dataset_kwargs):
File "D:\Anaconda3\envs\chatsql2\lib\site-packages\datasets\arrow_dataset.py", line 3482, in _map_single
batch = apply_function_on_filtered_inputs(
File "D:\Anaconda3\envs\chatsql2\lib\site-packages\datasets\arrow_dataset.py", line 3361, in apply_function_on_filtered_inputs
processed_inputs = function(*fn_args, *additional_args, **fn_kwargs)
File "D:\text2sql\DB-GPT-Hub-main\DB-GPT-Hub-main\dbgpt_hub\data_process\data_utils.py", line 664, in preprocess_supervised_dataset
for source_ids, target_ids in template.encode_multiturn(
File "D:\text2sql\DB-GPT-Hub-main\DB-GPT-Hub-main\dbgpt_hub\configs\data_args.py", line 270, in encode_multiturn
encoded_pairs = self._encode(tokenizer, system, history)
File "D:\text2sql\DB-GPT-Hub-main\DB-GPT-Hub-main\dbgpt_hub\configs\data_args.py", line 321, in _encode
prefix_ids = self._convert_inputs_to_ids(
File "D:\text2sql\DB-GPT-Hub-main\DB-GPT-Hub-main\dbgpt_hub\configs\data_args.py", line 368, in _convert_inputs_to_ids
token_ids = token_ids + tokenizer.encode(elem, **kwargs)
File "D:\Anaconda3\envs\chatsql2\lib\site-packages\transformers\tokenization_utils_base.py", line 2600, in encode
encoded_inputs = self.encode_plus(
File "D:\Anaconda3\envs\chatsql2\lib\site-packages\transformers\tokenization_utils_base.py", line 3008, in encode_plus
return self._encode_plus(
File "D:\Anaconda3\envs\chatsql2\lib\site-packages\transformers\tokenization_utils.py", line 722, in _encode_plus
return self.prepare_for_model(
File "D:\Anaconda3\envs\chatsql2\lib\site-packages\transformers\tokenization_utils_base.py", line 3487, in prepare_for_model
encoded_inputs = self.pad(
File "D:\Anaconda3\envs\chatsql2\lib\site-packages\transformers\tokenization_utils_base.py", line 3292, in pad
encoded_inputs = self._pad(
File "C:\Users\PC.cache\huggingface\modules\transformers_modules\glm3_Parameter\tokenization_chatglm.py", line 271, in _pad
assert self.padding_side == "left"
AssertionError
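Reading the traceback: encode() eventually reaches the model's custom _pad() in tokenization_chatglm.py, which hard-asserts left padding. A hypothetical standalone repro (the checkpoint path is an assumption) to confirm the diagnosis and the fix:

```python
# Hedged repro sketch: with padding requested, a right-padded ChatGLM3
# tokenizer should trip the same AssertionError as the log above.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "THUDM/chatglm3-6b", trust_remote_code=True  # assumed checkpoint path
)

tokenizer.padding_side = "right"
try:
    tokenizer.encode("SELECT 1;", padding="max_length", max_length=16)
except AssertionError:
    print("AssertionError from tokenization_chatglm._pad, matching the log")

tokenizer.padding_side = "left"  # the fix suggested earlier in the thread
print(tokenizer.encode("SELECT 1;", padding="max_length", max_length=16))
```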