Character-level features and ngrams for the TextCategorizer #3678
heyalistair started this conversation in New Features & Project Ideas (1 comment)
Well, we do use some character-level features in the neural network, at least: the word vectors are calculated by concatenating vectors for the word form, the prefix, the suffix, and the word shape. The prefix and suffix are currently defined as 1 character and 3 characters long, respectively. If your documents are 1 to 3 words long, you usually have some extra metadata that's worth using as features, so you might want to come up with a custom model.
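To make the reply concrete, here is a minimal sketch (not spaCy's actual implementation) of the per-token features described above: word form, 1-character prefix, 3-character suffix, and word shape. The shape rule is a simplified assumption here: uppercase letters map to `X`, lowercase to `x`, digits to `d` (spaCy's real shape feature also collapses repeated characters).

```python
def token_features(word, prefix_len=1, suffix_len=3):
    """Sketch of the word-level features the reply describes.
    prefix_len/suffix_len mirror the 1-char prefix and 3-char suffix
    mentioned above; the shape rule is a deliberate simplification."""
    shape = "".join(
        "X" if c.isupper() else "x" if c.islower() else "d" if c.isdigit() else c
        for c in word
    )
    return {
        "form": word.lower(),
        "prefix": word[:prefix_len],
        "suffix": word[-suffix_len:],
        "shape": shape,
    }

print(token_features("Apple"))
# {'form': 'apple', 'prefix': 'A', 'suffix': 'ple', 'shape': 'Xxxxx'}
```

In the real model each of these strings is hashed and embedded, and the embeddings are concatenated to form the token vector.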
Feature description
The TextCategorizer currently comes with three different architecture options (ensemble, simple-cnn, bow). Though these three already cover a range of use cases, they are all based on word-level features and are not ideal for extremely short documents (1 to 3 words). This feature request is for another architecture option that uses character-level ngrams (or something similar) as features for the TextCategorizer.
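As a rough illustration of what is being requested, the sketch below extracts character-level ngrams as a bag-of-ngrams, which could then feed a linear or bow-style classifier. The function name, parameters, and the `#` boundary marker are illustrative assumptions, not spaCy API.

```python
from collections import Counter

def char_ngrams(text, n_min=2, n_max=4, boundary="#"):
    """Count character ngrams of length n_min..n_max, with boundary
    markers so word edges contribute distinct ngrams. Hypothetical
    helper for the requested architecture, not part of spaCy."""
    padded = boundary + text.lower() + boundary
    counts = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(padded) - n + 1):
            counts[padded[i:i + n]] += 1
    return counts

print(char_ngrams("hi", n_min=2, n_max=2))
# Counter({'#h': 1, 'hi': 1, 'i#': 1})
```

For 1-to-3-word documents these ngram counts carry far more signal per document than a handful of word-level tokens would.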
Could the feature be a custom component or spaCy plugin?
Possibly, but it would integrate nicely as a general architecture option. Then again, maybe extremely short documents are outside the scope of what spaCy is trying to cover, since they are borderline "natural language"?