
NER_CRF task raises a data shape mismatch error #491

Open

Thove opened this issue Aug 19, 2022 · 4 comments

Thove commented Aug 19, 2022

When asking a question, please provide the following information where possible:

Basic information

  • Operating system: Windows 10
  • Python version: 3.6
  • TensorFlow version: 1.14.0
  • Keras version: 2.3.1
  • bert4keras version: 0.11.4
  • Pure Keras or tf.keras: pure Keras
  • Pretrained model: Chinese BERT chinese_L-12_H-768_A-12

Core code

# Please paste your core code here.
# Keep only the key parts; don't paste all of it mindlessly.

The train_data fed into this function has the shape [text, labels].

class data_generator(DataGenerator):
    """Data generator."""
    def __iter__(self, random=False):
        batch_token_ids, batch_segment_ids, batch_labels = [], [], []
        for is_end, d in self.sample(random):
            # d[0] is the raw text, d[1] is its label sequence
            tokens = tokenizer.tokenize(d[0], maxlen=maxlen)
            token_ids = tokenizer.tokens_to_ids(tokens)
            # Pad the labels with 0 for [CLS] and [SEP]. Note: the original
            # code used d[1].insert(0, 0) / d[1].insert(-1, 0), which both
            # mutates the training data in place and puts the second 0 before
            # the last label instead of after it.
            labels = np.array([0] + list(d[1]) + [0])
            # This assumes len(d[1]) == len(tokens) - 2, which the tokenizer
            # does not guarantee (truncation at maxlen, multi-char tokens).
            segment_ids = [0] * len(token_ids)
            batch_token_ids.append(token_ids)
            batch_segment_ids.append(segment_ids)
            batch_labels.append(labels)
            if len(batch_token_ids) == self.batch_size or is_end:
                batch_token_ids = sequence_padding(batch_token_ids)
                batch_segment_ids = sequence_padding(batch_segment_ids)
                batch_labels = sequence_padding(batch_labels)
                yield [batch_token_ids, batch_segment_ids], batch_labels
                batch_token_ids, batch_segment_ids, batch_labels = [], [], []

Output

# Please paste your debugging output here.
Epoch 1/10

   1/4358 [..............................] - ETA: 20:45:01 - loss: 191.2775 - sparse_accuracy: 0.0365
   2/4358 [..............................] - ETA: 12:53:31 - loss: 150.3369 - sparse_accuracy: 0.1306
   3/4358 [..............................] - ETA: 10:30:37 - loss: 122.2665 - sparse_accuracy: 0.2740
Traceback (most recent call last):
  File "C:/Users/cypress/Desktop/nlp-master/nlp_induction_training/task4/preprosessing.py", line 256, in <module>
    epochs=epochs,
  File "D:\anaconda\envs\tensorflow\lib\site-packages\keras\engine\training.py", line 1147, in fit
    initial_epoch=initial_epoch)
  File "D:\anaconda\envs\tensorflow\lib\site-packages\keras\legacy\interfaces.py", line 91, in wrapper
    return func(*args, **kwargs)
  File "D:\anaconda\envs\tensorflow\lib\site-packages\keras\engine\training.py", line 1732, in fit_generator
    initial_epoch=initial_epoch)
  File "D:\anaconda\envs\tensorflow\lib\site-packages\keras\engine\training_generator.py", line 220, in fit_generator
    reset_metrics=False)
  File "D:\anaconda\envs\tensorflow\lib\site-packages\keras\engine\training.py", line 1514, in train_on_batch
    outputs = self.train_function(ins)
  File "D:\anaconda\envs\tensorflow\lib\site-packages\tensorflow\python\keras\backend.py", line 3292, in __call__
    run_metadata=self.run_metadata)
  File "D:\anaconda\envs\tensorflow\lib\site-packages\tensorflow\python\client\session.py", line 1458, in __call__
    run_metadata_ptr)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Input to reshape is a tensor with 850 values, but the requested shape has 840
	 [[{{node loss/conditional_random_field_1_loss/sparse_loss/Reshape}}]]

My own attempts

Whatever the problem, please first try to solve it yourself, and only ask here after "every effort" has failed. Please describe your efforts below.

I barely changed the original code, so my first thought was that my preprocessing was at fault. I tried rewriting the data generator several times; I also reduced batch_size from 32 to 10, trimmed the input data to fit the maximum length of 512, switched the Keras import to the keras from bert4keras.backend, and changed the size of the Dense layer, but all of this failed.
What puzzles me most is why training runs fine for a few batches and only then crashes. I also tried lowering the learning rate to 2e-6, which didn't help either.
After debugging, the three arrays produced by my data generator on each step are of exactly the same size.

i4never (Contributor) commented Sep 1, 2022

It looks like d[0] is the input text and d[1] is the labels. Can you make sure that after tokenizing d[0] and converting it to ids, its length differs from the length of d[1] by only the 2 head/tail tokens?
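One way to verify this is to check the alignment per sample before batching. The sketch below is a minimal standalone illustration: `toy_tokenize` is a hypothetical character-level stand-in for bert4keras's `Tokenizer` (which likewise adds [CLS]/[SEP] and truncates at `maxlen`), not the real API.

```python
# Toy tokenizer: one token per character, plus [CLS]/[SEP], truncated at maxlen,
# mimicking the behaviour that matters for the alignment check.
def toy_tokenize(text, maxlen=None):
    tokens = ["[CLS]"] + list(text) + ["[SEP]"]
    if maxlen is not None and len(tokens) > maxlen:
        tokens = tokens[:maxlen - 1] + ["[SEP]"]  # truncation drops labels' chars
    return tokens

def check_sample(text, labels, maxlen=None):
    tokens = toy_tokenize(text, maxlen)
    # The generator assumes exactly 2 extra tokens ([CLS]/[SEP]); if truncation
    # (or multi-character tokens) breaks that, the CRF loss reshape fails.
    return len(tokens) == len(labels) + 2

print(check_sample("北京欢迎你", [0, 1, 2, 0, 0]))            # True: aligned
print(check_sample("北京欢迎你", [0, 1, 2, 0, 0], maxlen=4))  # False: truncated
```

A failed check on even one sample is enough to trigger the `Input to reshape is a tensor with 850 values, but the requested shape has 840` error several batches into training, which matches the symptom described above.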

Thove (Author) commented Sep 2, 2022

> It looks like d[0] is the input text and d[1] is the labels. Can you make sure that after tokenizing d[0] and converting it to ids, its length differs from the length of d[1] by only the 2 head/tail tokens?

Thank you very much for your thoughtful answer.

Thove (Author) commented Sep 2, 2022

That was exactly the problem.

bojone (Owner) commented Sep 19, 2022

Your logic is wrong from the start: first create the tokenizer, then tokenize the input, and then build the labels from the tokenization result. Are you expecting the tokenizer to align itself to whatever labels you supply?
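The order of operations described here can be sketched as follows. This is a simplified illustration, not bert4keras API: `align_labels` is a hypothetical helper that derives one label per token from character-level labels, so the label sequence always matches whatever the tokenizer actually produced (including truncation).

```python
# Sketch: tokenize FIRST, then derive token-level labels from the
# character-level labels, so len(labels) == len(tokens) by construction.
def align_labels(tokens, char_labels):
    """tokens: tokenizer output including [CLS]/[SEP];
    char_labels: one label id per character of the original text.
    Returns one label per token (0 for special tokens)."""
    labels = []
    pos = 0  # cursor into char_labels
    for tok in tokens:
        if tok in ("[CLS]", "[SEP]"):
            labels.append(0)
        else:
            # a multi-character token takes the label of its first character
            labels.append(char_labels[pos])
            pos += len(tok.lstrip("#"))  # strip WordPiece "##" continuation marker
    return labels

tokens = ["[CLS]", "北", "京", "欢", "迎", "你", "[SEP]"]
char_labels = [1, 2, 0, 0, 0]  # e.g. B-LOC, I-LOC, O, O, O
print(align_labels(tokens, char_labels))  # [0, 1, 2, 0, 0, 0, 0]
```

Because the labels are built from `tokens` rather than in parallel with them, a truncated or re-split tokenization can never produce the length mismatch that caused the reshape error above.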
