
NER_CRF task raises a data shape mismatch error #491

Open

Thove opened this issue Aug 19, 2022 · 4 comments

Thove commented Aug 19, 2022

When asking a question, please provide the following information where possible:

Basic information

  • Operating system: Windows 10
  • Python version: 3.6
  • TensorFlow version: 1.14.0
  • Keras version: 2.3.1
  • bert4keras version: 0.11.4
  • Pure Keras or tf.keras: pure Keras
  • Pretrained model: Chinese BERT chinese_L-12_H-768_A-12

Core code

# Please paste your core code here.
# Keep only the key parts; don't paste all of it mindlessly.

The train_data fed into this function has the shape [text, labels].

class data_generator(DataGenerator):
    """Data generator."""
    def __iter__(self, random=False):
        batch_token_ids, batch_segment_ids, batch_labels = [], [], []
        for is_end, d in self.sample(random):
            # d[0] is the raw text, d[1] is its label sequence
            tokens = tokenizer.tokenize(d[0], maxlen=maxlen)
            token_ids = tokenizer.tokens_to_ids(tokens)
            # Pad the labels with 0 for [CLS] and [SEP]. Note: the original
            # code used d[1].insert(0, 0) / d[1].insert(-1, 0), which both
            # mutates the training data in place and puts the second 0 before
            # the last label instead of after it.
            labels = np.array([0] + list(d[1]) + [0])
            # This assumes len(d[1]) == len(tokens) - 2, which the tokenizer
            # does not guarantee (truncation at maxlen, multi-char tokens).
            segment_ids = [0] * len(token_ids)
            batch_token_ids.append(token_ids)
            batch_segment_ids.append(segment_ids)
            batch_labels.append(labels)
            if len(batch_token_ids) == self.batch_size or is_end:
                batch_token_ids = sequence_padding(batch_token_ids)
                batch_segment_ids = sequence_padding(batch_segment_ids)
                batch_labels = sequence_padding(batch_labels)
                yield [batch_token_ids, batch_segment_ids], batch_labels
                batch_token_ids, batch_segment_ids, batch_labels = [], [], []

Output

# Please paste your debugging output here.
Epoch 1/10

   1/4358 [..............................] - ETA: 20:45:01 - loss: 191.2775 - sparse_accuracy: 0.0365
   2/4358 [..............................] - ETA: 12:53:31 - loss: 150.3369 - sparse_accuracy: 0.1306
   3/4358 [..............................] - ETA: 10:30:37 - loss: 122.2665 - sparse_accuracy: 0.2740
Traceback (most recent call last):
  File "C:/Users/cypress/Desktop/nlp-master/nlp_induction_training/task4/preprosessing.py", line 256, in <module>
    epochs=epochs,
  File "D:\anaconda\envs\tensorflow\lib\site-packages\keras\engine\training.py", line 1147, in fit
    initial_epoch=initial_epoch)
  File "D:\anaconda\envs\tensorflow\lib\site-packages\keras\legacy\interfaces.py", line 91, in wrapper
    return func(*args, **kwargs)
  File "D:\anaconda\envs\tensorflow\lib\site-packages\keras\engine\training.py", line 1732, in fit_generator
    initial_epoch=initial_epoch)
  File "D:\anaconda\envs\tensorflow\lib\site-packages\keras\engine\training_generator.py", line 220, in fit_generator
    reset_metrics=False)
  File "D:\anaconda\envs\tensorflow\lib\site-packages\keras\engine\training.py", line 1514, in train_on_batch
    outputs = self.train_function(ins)
  File "D:\anaconda\envs\tensorflow\lib\site-packages\tensorflow\python\keras\backend.py", line 3292, in __call__
    run_metadata=self.run_metadata)
  File "D:\anaconda\envs\tensorflow\lib\site-packages\tensorflow\python\client\session.py", line 1458, in __call__
    run_metadata_ptr)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Input to reshape is a tensor with 850 values, but the requested shape has 840
	 [[{{node loss/conditional_random_field_1_loss/sparse_loss/Reshape}}]]

My own attempts

Whatever the problem, please first try to solve it yourself, and only ask here after "every effort" has failed. Please describe your efforts below.

I barely changed the original code, so my first thought was that my preprocessing was at fault. I tried rewriting the data generator several times; I also reduced batch_size from 32 to 10, trimmed the input data to fit the maximum length of 512, switched the Keras import to the keras from bert4keras.backend, and changed the size of the Dense layer, but all of this failed.
What puzzles me most is why training runs fine for a few batches and only then crashes. I also tried lowering the learning rate to 2e-6, which didn't help either.
After debugging, the three arrays produced by my data generator on each step are of exactly the same size.

i4never (Contributor) commented Sep 1, 2022

It looks like d[0] is the input text and d[1] is the labels. Can you make sure that after tokenizing d[0] and converting it to ids, its length differs from the length of d[1] by only the 2 head/tail tokens?
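One way to verify this is to check the alignment per sample before batching. The sketch below is a minimal standalone illustration: `toy_tokenize` is a hypothetical character-level stand-in for bert4keras's `Tokenizer` (which likewise adds [CLS]/[SEP] and truncates at `maxlen`), not the real API.

```python
# Toy tokenizer: one token per character, plus [CLS]/[SEP], truncated at maxlen,
# mimicking the behaviour that matters for the alignment check.
def toy_tokenize(text, maxlen=None):
    tokens = ["[CLS]"] + list(text) + ["[SEP]"]
    if maxlen is not None and len(tokens) > maxlen:
        tokens = tokens[:maxlen - 1] + ["[SEP]"]  # truncation drops labels' chars
    return tokens

def check_sample(text, labels, maxlen=None):
    tokens = toy_tokenize(text, maxlen)
    # The generator assumes exactly 2 extra tokens ([CLS]/[SEP]); if truncation
    # (or multi-character tokens) breaks that, the CRF loss reshape fails.
    return len(tokens) == len(labels) + 2

print(check_sample("北京欢迎你", [0, 1, 2, 0, 0]))            # True: aligned
print(check_sample("北京欢迎你", [0, 1, 2, 0, 0], maxlen=4))  # False: truncated
```

A failed check on even one sample is enough to trigger the `Input to reshape is a tensor with 850 values, but the requested shape has 840` error several batches into training, which matches the symptom described above.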

Thove (Author) commented Sep 2, 2022

> It looks like d[0] is the input text and d[1] is the labels. Can you make sure that after tokenizing d[0] and converting it to ids, its length differs from the length of d[1] by only the 2 head/tail tokens?

Thank you very much for your thoughtful answer.

Thove (Author) commented Sep 2, 2022

That was exactly the problem.

bojone (Owner) commented Sep 19, 2022

Your logic is wrong from the start: first create the tokenizer, then tokenize the input, and then build the labels from the tokenization result. Are you expecting the tokenizer to align itself to whatever labels you supply?
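The order of operations described here can be sketched as follows. This is a simplified illustration, not bert4keras API: `align_labels` is a hypothetical helper that derives one label per token from character-level labels, so the label sequence always matches whatever the tokenizer actually produced (including truncation).

```python
# Sketch: tokenize FIRST, then derive token-level labels from the
# character-level labels, so len(labels) == len(tokens) by construction.
def align_labels(tokens, char_labels):
    """tokens: tokenizer output including [CLS]/[SEP];
    char_labels: one label id per character of the original text.
    Returns one label per token (0 for special tokens)."""
    labels = []
    pos = 0  # cursor into char_labels
    for tok in tokens:
        if tok in ("[CLS]", "[SEP]"):
            labels.append(0)
        else:
            # a multi-character token takes the label of its first character
            labels.append(char_labels[pos])
            pos += len(tok.lstrip("#"))  # strip WordPiece "##" continuation marker
    return labels

tokens = ["[CLS]", "北", "京", "欢", "迎", "你", "[SEP]"]
char_labels = [1, 2, 0, 0, 0]  # e.g. B-LOC, I-LOC, O, O, O
print(align_labels(tokens, char_labels))  # [0, 1, 2, 0, 0, 0, 0]
```

Because the labels are built from `tokens` rather than in parallel with them, a truncated or re-split tokenization can never produce the length mismatch that caused the reshape error above.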
