I have a task to build a NER pipeline. It consists of a transformer component and an ner component; a minimal sketch of that layout follows. I used to use the standard tokenizer for Chinese and have since replaced it with a custom tokenizer that wraps a Hugging Face tokenizer (shown after the sketch).
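For reference, here is roughly how the pipeline is assembled. This is only a sketch, not my actual setup: the component settings are the spacy-transformers defaults, and in a real training config the ner component would need a listener to consume the transformer's output.

    import spacy

    # Minimal pipeline layout; real settings would come from a training config.
    nlp = spacy.blank("zh")
    nlp.add_pipe("transformer")  # factory provided by spacy-transformers
    nlp.add_pipe("ner")
    print(nlp.pipe_names)  # ['transformer', 'ner']

The custom tokenizer itself: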
import spacy
from spacy.tokens import Doc
from transformers import AutoTokenizer


class ChineseTokenizer:
    def __init__(self, vocab, tokenizer_name: str):
        self.vocab = vocab
        # The offset mapping below requires a "fast" (Rust-backed) tokenizer.
        self._tokenizer = AutoTokenizer.from_pretrained(tokenizer_name, use_fast=True)

    def __call__(self, text):
        encoding = self._tokenizer(
            text, return_offsets_mapping=True, add_special_tokens=False
        )
        words = []
        for start, end in encoding["offset_mapping"]:
            word = text[start:end]
            if word:  # Doc rejects empty strings, so skip zero-width offsets
                words.append(word)
        return Doc(self.vocab, words=words, spaces=[False] * len(words))


@spacy.registry.tokenizers("CHINESE-AUTOTOKENIZER")
def zh_auto_tokenizer(tokenizer_name: str):
    def create_tokenizer(nlp):
        # Pass the shared vocab along with the checkpoint name.
        return ChineseTokenizer(nlp.vocab, tokenizer_name)

    return create_tokenizer
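Registered this way, the tokenizer can be dropped into a pipeline directly. A minimal usage sketch, where "bert-base-chinese" stands in for whichever checkpoint is actually used:

    import spacy

    # Assumes ChineseTokenizer and zh_auto_tokenizer from above are in scope.
    nlp = spacy.blank("zh")
    nlp.tokenizer = zh_auto_tokenizer("bert-base-chinese")(nlp)
    doc = nlp("我爱北京天安门")
    print([t.text for t in doc])

In a training config, the same wiring is expressed by pointing [nlp.tokenizer] at @tokenizers = "CHINESE-AUTOTOKENIZER".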
After that, I began to wonder whether I was doing the right thing by feeding the transformer model Docs whose tokens come from a different, non-native tokenizer.
Please tell me how the model's internal training process works. Is it important to write a custom tokenizer, or do the tokens become understandable to the transformer anyway during training?