
Tagging format #1

Open
blester125 opened this issue Nov 2, 2019 · 3 comments
Comments
@blester125

In most NER datasets there is some sort of span labeling scheme, where prefixes like B- or I- are used to separate adjacent mentions of the same type.

In this data it looks like no span labeling scheme is used:

the	O
Mearns	LEXICON
Glacigenic	LEXICON
Subgroup	LEXICON
of	O

Are there no mentions in the datasets that touch, or am I missing some strategy that delimits them?

@jeromemassot

Hi blester125,
In fact the B- and I- notations are there for multi-token entities, i.e. when a particular entity is made of several tokens. If two distinct entities with the same label follow each other, each of them should be tagged with the B- prefix.

So I can understand why this notation was not reproduced in the lexicon, which is only a glossary.

Mapping from the lexicon entries to the B-/I- notation is quite easy: for each entry, split the term on spaces, then prefix the first token with B- and the following ones with I-.
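This procedure could be sketched in a few lines of Python (`to_bio` is an illustrative helper name, not part of any existing codebase, and `LEXICON` stands in for whatever label the entry carries):

```python
def to_bio(term, label="LEXICON"):
    """Split a lexicon term on spaces and emit (token, BIO-tag) pairs:
    the first token gets the B- prefix, the rest get I-."""
    tokens = term.split(" ")
    return [(tok, ("B-" if i == 0 else "I-") + label)
            for i, tok in enumerate(tokens)]

print(to_bio("Mearns Glacigenic Subgroup"))
# [('Mearns', 'B-LEXICON'), ('Glacigenic', 'I-LEXICON'), ('Subgroup', 'I-LEXICON')]
```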

Best regards
Jerome

@metazool
Contributor

metazool commented Oct 27, 2020

Thank you, I missed this discussion. The annotation format is the one CoreNLP suggests here:

https://stanfordnlp.github.io/CoreNLP/ner.html#training-or-retraining-new-models

There must be an assumption that tagged tokens, if not separated by an O-tagged token, are part of a contiguous entity. That's the assumption made by the brat JavaScript renderer on the CoreNLP server's visual output. It won't always hold semantically, will it? I've never looked into the underlying LSTM.

I would be glad to hear of alternative, more sophisticated approaches!
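The contiguity assumption described above can be sketched as follows (`group_spans` is a hypothetical helper, not part of CoreNLP or brat): any run of identically-labelled, non-O tokens is collapsed into a single entity span, which is exactly where two genuinely distinct adjacent mentions would be merged incorrectly.

```python
def group_spans(tagged):
    """Collapse runs of identically-tagged, non-O tokens into single
    (text, label) spans, per the contiguity assumption."""
    spans = []
    current_tokens, current_label = [], None
    for token, label in tagged:
        if label != "O" and label == current_label:
            current_tokens.append(token)  # extend the current run
        else:
            if current_label is not None and current_label != "O":
                spans.append((" ".join(current_tokens), current_label))
            current_tokens, current_label = [token], label
    if current_label is not None and current_label != "O":
        spans.append((" ".join(current_tokens), current_label))
    return spans

sample = [("the", "O"), ("Mearns", "LEXICON"), ("Glacigenic", "LEXICON"),
          ("Subgroup", "LEXICON"), ("of", "O")]
print(group_spans(sample))
# [('Mearns Glacigenic Subgroup', 'LEXICON')]
```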

@metazool
Contributor

As for the source references from which the annotated sentences were extracted (during an unrelated project in the early 2000s), many but not all of them are available as JP2 scans under an Open Government Licence. The list of sources is here: https://github.com/BritishGeologicalSurvey/geo-ner-model/blob/main/REFERENCES.md
