Character-level features and ngrams for the TextCategorizer #3678
heyalistair started this conversation in New Features & Project Ideas (1 comment)
Well, we do use some character-level features in the neural network, at least: the word vectors are calculated by concatenating vectors for the word form, the prefix, the suffix, and the word shape. The prefix and suffix are currently defined as 1 character and 3 characters long, respectively. If your documents are 1 to 3 words long, you usually have some extra metadata that's worth using as features, so you might want to come up with a custom model.
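To make the reply concrete, here is a minimal sketch (not spaCy's actual implementation) of the per-token features described above: word form, 1-character prefix, 3-character suffix, and word shape. The shape rule is a simplified assumption here: uppercase letters map to `X`, lowercase to `x`, digits to `d` (spaCy's real shape feature also collapses repeated characters).

```python
def token_features(word, prefix_len=1, suffix_len=3):
    """Sketch of the word-level features the reply describes.
    prefix_len/suffix_len mirror the 1-char prefix and 3-char suffix
    mentioned above; the shape rule is a deliberate simplification."""
    shape = "".join(
        "X" if c.isupper() else "x" if c.islower() else "d" if c.isdigit() else c
        for c in word
    )
    return {
        "form": word.lower(),
        "prefix": word[:prefix_len],
        "suffix": word[-suffix_len:],
        "shape": shape,
    }

print(token_features("Apple"))
# {'form': 'apple', 'prefix': 'A', 'suffix': 'ple', 'shape': 'Xxxxx'}
```

In the real model each of these strings is hashed and embedded, and the embeddings are concatenated to form the token vector.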
Feature description
The TextCategorizer currently comes with three different architecture options (ensemble, simple-cnn, bow). Though these three already cover a range of use cases, they are all based on word-level features and are not ideal for extremely short documents (1 to 3 words). This feature request is for another architecture option that uses character-level ngrams (or something similar) as features for the TextCategorizer.
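As a rough illustration of what is being requested, the sketch below extracts character-level ngrams as a bag-of-ngrams, which could then feed a linear or bow-style classifier. The function name, parameters, and the `#` boundary marker are illustrative assumptions, not spaCy API.

```python
from collections import Counter

def char_ngrams(text, n_min=2, n_max=4, boundary="#"):
    """Count character ngrams of length n_min..n_max, with boundary
    markers so word edges contribute distinct ngrams. Hypothetical
    helper for the requested architecture, not part of spaCy."""
    padded = boundary + text.lower() + boundary
    counts = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(padded) - n + 1):
            counts[padded[i:i + n]] += 1
    return counts

print(char_ngrams("hi", n_min=2, n_max=2))
# Counter({'#h': 1, 'hi': 1, 'i#': 1})
```

For 1-to-3-word documents these ngram counts carry far more signal per document than a handful of word-level tokens would.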
Could the feature be a custom component or spaCy plugin?
Possibly, but it would integrate nicely as a general architecture option. Then again, maybe extremely short documents are outside the scope of what spaCy is trying to cover, since they are borderline "natural language"?