I have a task to build a NER pipeline. It consists of a transformer component and an ner component; a minimal sketch of that layout follows. I used to use the standard tokenizer for Chinese and have since replaced it with a custom tokenizer that wraps a Hugging Face tokenizer (shown after the sketch).
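For reference, here is roughly how the pipeline is assembled. This is only a sketch, not my actual setup: the component settings are the spacy-transformers defaults, and in a real training config the ner component would need a listener to consume the transformer's output.

    import spacy

    # Minimal pipeline layout; real settings would come from a training config.
    nlp = spacy.blank("zh")
    nlp.add_pipe("transformer")  # factory provided by spacy-transformers
    nlp.add_pipe("ner")
    print(nlp.pipe_names)  # ['transformer', 'ner']

The custom tokenizer itself: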
import spacy
from spacy.tokens import Doc
from transformers import AutoTokenizer


class ChineseTokenizer:
    def __init__(self, vocab, tokenizer_name: str):
        self.vocab = vocab
        # The offset mapping below requires a "fast" (Rust-backed) tokenizer.
        self._tokenizer = AutoTokenizer.from_pretrained(tokenizer_name, use_fast=True)

    def __call__(self, text):
        encoding = self._tokenizer(
            text, return_offsets_mapping=True, add_special_tokens=False
        )
        words = []
        for start, end in encoding["offset_mapping"]:
            word = text[start:end]
            if word:  # Doc rejects empty strings, so skip zero-width offsets
                words.append(word)
        return Doc(self.vocab, words=words, spaces=[False] * len(words))


@spacy.registry.tokenizers("CHINESE-AUTOTOKENIZER")
def zh_auto_tokenizer(tokenizer_name: str):
    def create_tokenizer(nlp):
        # Pass the shared vocab along with the checkpoint name.
        return ChineseTokenizer(nlp.vocab, tokenizer_name)

    return create_tokenizer
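Registered this way, the tokenizer can be dropped into a pipeline directly. A minimal usage sketch, where "bert-base-chinese" stands in for whichever checkpoint is actually used:

    import spacy

    # Assumes ChineseTokenizer and zh_auto_tokenizer from above are in scope.
    nlp = spacy.blank("zh")
    nlp.tokenizer = zh_auto_tokenizer("bert-base-chinese")(nlp)
    doc = nlp("我爱北京天安门")
    print([t.text for t in doc])

In a training config, the same wiring is expressed by pointing [nlp.tokenizer] at @tokenizers = "CHINESE-AUTOTOKENIZER".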
After that, I began to wonder whether I was doing the right thing by feeding the transformer model Docs whose tokens come from a different, non-native tokenizer.
Please tell me how the model's internal training process works. Is it important to write a custom tokenizer, or do the tokens become understandable to the transformer anyway during training?