Tagging format #1
Comments
Hi blester125, I can understand why this notation has not been reproduced in the lexicon, which is only a glossary. Mapping from the lexicon entries to the B- and I- notation is quite easy: for each entry, split the term using a space as the separator, then prefix the first token with B- and the following ones with I-. Best regards
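For anyone who wants to try the mapping described above, here is a minimal sketch. The function name `lexicon_entry_to_bio`, the label `UNIT`, and the example term are made up for illustration and are not taken from the lexicon or the repository. It splits on any run of whitespace rather than on a single space, so stray double spaces do not produce empty tokens.

```python
def lexicon_entry_to_bio(term: str, label: str) -> list[tuple[str, str]]:
    """Tag the first token of a lexicon term with B-<label>, the rest with I-<label>."""
    tokens = term.split()  # split on whitespace; assumes tokens contain no spaces
    return [
        (token, f"{'B' if index == 0 else 'I'}-{label}")
        for index, token in enumerate(tokens)
    ]


# Hypothetical entry and label, for illustration only.
for token, tag in lexicon_entry_to_bio("Lower Greensand Formation", "UNIT"):
    print(f"{token}\t{tag}")
```

This only converts the glossary entries themselves. Applying the same prefixes to the annotated sentences would still require knowing where one mention ends and the next begins.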
Thank you, I missed this discussion. The annotation format is the one CoreNLP suggests here: https://stanfordnlp.github.io/CoreNLP/ner.html#training-or-retraining-new-models There must be an assumption that tagged tokens, if not separated by an O-tagged token, are part of a contiguous entity. That's the assumption made by the brat JavaScript renderer on the CoreNLP server's visual output. It won't always hold semantically, will it? I've never looked into the underlying LSTM. I would be glad to hear of alternative, more sophisticated approaches!
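To make that assumption concrete, here is a rough sketch of how a reader of prefix-free tags might group tokens into entities. It is not the brat or CoreNLP code; the function name `group_contiguous`, the label `UNIT`, and the sample sentence are invented for illustration. Every run of consecutive tokens sharing the same non-O tag is treated as one entity.

```python
from itertools import groupby


def group_contiguous(tagged: list[tuple[str, str]], outside: str = "O") -> list[tuple[str, str]]:
    """Merge consecutive tokens with the same non-outside tag into one entity span."""
    spans = []
    for label, run in groupby(tagged, key=lambda pair: pair[1]):
        if label != outside:
            spans.append((" ".join(token for token, _ in run), label))
    return spans


# Invented sample: two mentions of the same type that touch.
sample = [
    ("the", "O"),
    ("Chalk", "UNIT"), ("Group", "UNIT"),
    ("Gault", "UNIT"), ("Formation", "UNIT"),
    ("crop", "O"), ("out", "O"),
]
print(group_contiguous(sample))
```

With no B- prefix to mark where the second mention starts, the two adjacent mentions in the sample come back as a single four-token span, which is the case the question in this issue is asking about.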
As for the source references from which the annotated sentences were extracted (during an unrelated project in the early 2000s), many but not all of them are available as JP2 scans under an Open Government Licence. The list of sources is here: https://github.com/BritishGeologicalSurvey/geo-ner-model/blob/main/REFERENCES.md
In most NER datasets there is some sort of span labeling scheme where prefixes like B- or I- are used to separate mentions of the same type that are adjacent. In the data it looks like there isn't a span labeling scheme used.
Are there no mentions in the dataset that touch, or am I missing some strategy that delimits them?