
How to make the tokenizer handle two-word country names? #2793


If you want "South Korea" to be one token, the best approach is to let the tokenizer split the text as usual, then find the country names and merge their tokens afterwards. You can do this by adding a custom component to your pipeline.

This example shows a pretty similar use case: https://github.com/explosion/spaCy/blob/master/examples/pipeline/custom_component_countries_api.py

Given a list of countries, it uses the PhraseMatcher to find them in the Doc and merges them into one token. Optionally, you can also set entity labels or custom attributes on the merged spans.
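
As a rough illustration of that approach, here is a minimal sketch that assumes spaCy v3 (older versions register pipeline components differently). The `COUNTRIES` list, the component name `country_merger`, and the match key `COUNTRY` are hypothetical names chosen for this example and are not taken from the linked script. It uses a blank English pipeline, so no trained model is needed.

```python
import spacy
from spacy.language import Language
from spacy.matcher import PhraseMatcher
from spacy.util import filter_spans

# Hypothetical list for illustration; in practice this could come from a file or API.
COUNTRIES = ["South Korea", "North Korea", "New Zealand"]


@Language.factory("country_merger")  # component name is an assumption
def create_country_merger(nlp, name):
    # Build the matcher once, using the pipeline's own vocab and tokenizer.
    matcher = PhraseMatcher(nlp.vocab)
    matcher.add("COUNTRY", [nlp.make_doc(text) for text in COUNTRIES])

    def merge_countries(doc):
        # Collect matched spans, then drop overlaps before merging.
        spans = [doc[start:end] for _, start, end in matcher(doc)]
        with doc.retokenize() as retokenizer:
            for span in filter_spans(spans):
                retokenizer.merge(span)
        return doc

    return merge_countries


nlp = spacy.blank("en")  # tokenizer only, no trained components
nlp.add_pipe("country_merger")

doc = nlp("I visited South Korea and New Zealand last year.")
tokens = [token.text for token in doc]
print(tokens)
```

With this in place, each listed country name should come out as a single token while the rest of the sentence is tokenized as before. The matcher here is case-sensitive by default, so lowercase mentions would need separate handling.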

Answer selected by ines
Labels
feat / tokenizer Feature: Tokenizer

This discussion was converted from issue #2793 on December 10, 2020 13:30.