Accuracy improvement for Indian location names extraction using NER #9832

prakhar-s · 2021-12-08T10:03:10Z


Hi,
I am trying to extract Indian location names from text data using NER in spaCy, but the results are not very satisfactory. This might be due to the small number of Indian location names in the training dataset used by spaCy. Since this is for a large industrial project, we really need it to perform better on Indian location names. Can anyone suggest possible solutions? Any help would be greatly appreciated.

Thanks

Answered by polm

polm · 2021-12-10T04:48:58Z


Sorry you're having trouble with this. You're correct that there probably aren't enough Indian location names in our training data.

I would recommend that you make a list of Indian location names, use an EntityRuler to label your data, and see how much coverage that gets. If the coverage is reasonable, you can use that labelled data as training data for a new NER component. You can put that component in the pipeline alongside the existing NER component and see how that works. I suspect that putting it after the default NER with overwrite enabled is the best thing to do, but you should try different combinations: before or after the default NER, with and without overwrite. See the double NER example project for notes on how that works.
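As a rough sketch of the first step (labelling data with an EntityRuler and saving it as training data), something like the following might work. The location list, the sample texts, and the helper names `build_ruler_pipeline` and `label_texts` are made up for illustration, and the sketch assumes spaCy v3 with a blank English pipeline, so no pretrained model is involved. Case-insensitive matching via `phrase_matcher_attr` is also an assumption about what you want.

```python
import spacy
from spacy.tokens import DocBin

# Hypothetical gazetteer and texts; substitute your own data.
LOCATIONS = ["Mumbai", "Pune", "Navi Mumbai", "Tiruchirappalli"]
TEXTS = [
    "The shipment left Pune on Monday and reached Navi Mumbai by night.",
    "Our Tiruchirappalli office is hiring.",
    "No place is mentioned here.",
]


def build_ruler_pipeline(locations):
    """Return a blank English pipeline containing only an EntityRuler."""
    nlp = spacy.blank("en")
    # phrase_matcher_attr="LOWER" makes the string patterns case-insensitive.
    ruler = nlp.add_pipe("entity_ruler", config={"phrase_matcher_attr": "LOWER"})
    ruler.add_patterns([{"label": "GPE", "pattern": name} for name in locations])
    return nlp


def label_texts(nlp, texts):
    """Run the ruler over the texts and collect the labelled docs in a DocBin."""
    doc_bin = DocBin()
    for doc in nlp.pipe(texts):
        doc_bin.add(doc)
    return doc_bin


nlp = build_ruler_pipeline(LOCATIONS)
doc_bin = label_texts(nlp, TEXTS)

# Read the docs back to inspect what the ruler labelled.
labelled = [
    [(ent.text, ent.label_) for ent in doc.ents]
    for doc in doc_bin.get_docs(nlp.vocab)
]
for ents in labelled:
    print(ents)

# To keep the data for training, write it to disk (path is a placeholder):
# doc_bin.to_disk("./train.spacy")
```

Data produced this way is only as good as the list behind it, so it is probably worth spot-checking a sample by hand before training on it.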

You can also try combining annotations from the default NER and your EntityRuler to train one NER component to replace the default NER component. That's simpler in some ways and has fewer computational requirements, but is more likely to run into accuracy issues, so I would definitely try the above approach first.
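To make "combining annotations" a bit more concrete, here is a minimal plain-Python sketch of one possible merge policy: keep every EntityRuler span, and keep a default-NER span only if it does not overlap a span already kept. The `(start_char, end_char, label)` format, the function name `merge_annotations`, and the "ruler wins" rule are all assumptions for illustration, not something spaCy prescribes. (spaCy's own `spacy.util.filter_spans` resolves overlaps differently, by preferring the longest span.)

```python
from typing import List, Tuple

# A span is (start_char, end_char, label); end_char is exclusive.
Span = Tuple[int, int, str]


def merge_annotations(ner_spans: List[Span], ruler_spans: List[Span]) -> List[Span]:
    """Merge two span lists into one non-overlapping list.

    Ruler spans are considered first, so they win any conflict with NER spans.
    """
    merged: List[Span] = []
    for start, end, label in sorted(ruler_spans) + sorted(ner_spans):
        overlaps = any(
            start < kept_end and kept_start < end
            for kept_start, kept_end, _ in merged
        )
        if not overlaps:
            merged.append((start, end, label))
    return sorted(merged)


text = "Ravi flew from Navi Mumbai to Delhi."
# Hypothetical default-NER output that mislabels "Navi" as a person.
ner_spans = [(0, 4, "PERSON"), (15, 19, "PERSON"), (30, 35, "GPE")]
# Hypothetical EntityRuler output from a location list.
ruler_spans = [(15, 26, "GPE"), (30, 35, "GPE")]

merged = merge_annotations(ner_spans, ruler_spans)
for start, end, label in merged:
    print(text[start:end], label)
```

The merged spans keep the non-location labels from the default NER, which matters if the single replacement component still needs to predict them.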

There are two assumptions I'm making here:

Using the EntityRuler you can get decent coverage, say at least 50%
Your documents are prose sentences

If either of those is not true, a different approach would be necessary, and we'd need more info about your data.
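One way to check the first assumption is to hand-label a small sample and compute what fraction of those locations the ruler finds. The sketch below treats coverage as exact-match recall, which is an assumption: the answer above does not define how coverage should be measured. The span tuples and the function name `ruler_coverage` are illustrative only.

```python
from typing import Iterable, Tuple

# A span is (doc_id, start_char, end_char), so spans from different docs stay distinct.
Span = Tuple[int, int, int]


def ruler_coverage(gold_spans: Iterable[Span], ruler_spans: Iterable[Span]) -> float:
    """Return the fraction of hand-labelled spans the ruler matched exactly."""
    gold = set(gold_spans)
    if not gold:
        return 0.0
    return len(gold & set(ruler_spans)) / len(gold)


# Hypothetical sample: four hand-labelled locations across two documents.
gold_spans = [(0, 18, 22), (0, 45, 56), (1, 4, 19), (1, 30, 36)]
# The ruler found two of them, plus one span that was not hand-labelled.
ruler_spans = [(0, 18, 22), (0, 45, 56), (1, 50, 55)]

coverage = ruler_coverage(gold_spans, ruler_spans)
print(f"Coverage: {coverage:.0%}")
```

Spans the ruler finds that are not in the hand-labelled set do not affect this number, so it is worth looking at those separately as possible false positives.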


AMAN1620 · 2024-05-29T05:18:59Z


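import spacy

nlp = spacy.load("en_core_web_lg")

# Placeholder names; replace with real location names.
data = ["location 1", "location 2", "location 3", "location 4"]

# "entity_ruler" is the registered factory; "custom_entity_ruler" is only the component name.
# phrase_matcher_attr="LOWER" makes matching case-insensitive, so "Location 1" matches "location 1".
ruler = nlp.add_pipe(
    "entity_ruler",
    name="custom_entity_ruler",
    before="ner",
    config={"phrase_matcher_attr": "LOWER"},
)
patterns = [{"label": "GPE", "pattern": location} for location in data]
ruler.add_patterns(patterns)

doc = nlp("Location 1")
for ent in doc.ents:
    print(ent.text, ent.label_)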

You can try this code. You can also add as many locations as you want.

