This program filters entity linking candidates and reorders them based on heuristics and on data coming from Wikidata and DBpedia, including the DBpedia Chapters. The filter tries to remove improbable candidates, such as disambiguation pages or people born after the publication date of the document (when dates are available). It can improve the overall performance of an Entity Linking system.
We present below an example of how the filter can remove certain candidates:
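To make the idea concrete, here is a minimal, hypothetical sketch of such a date-based heuristic (this is not the project's actual code; the candidate fields `is_disambiguation` and `birth_year` are assumptions made for the example):

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Candidate:
    qid: str                      # Wikidata identifier, e.g. "Q1370"
    is_disambiguation: bool       # assumed flag: candidate is a disambiguation page
    birth_year: Optional[int]     # assumed field: birth year for person candidates

def filter_candidates(candidates: List[Candidate],
                      publication_year: Optional[int]) -> List[Candidate]:
    """Drop improbable candidates, e.g. disambiguation pages and people
    born after the document's publication year (if that year is known)."""
    kept = []
    for cand in candidates:
        if cand.is_disambiguation:
            continue
        if (publication_year is not None
                and cand.birth_year is not None
                and cand.birth_year > publication_year):
            continue
        kept.append(cand)
    return kept

# Example: a document published in 1790 cannot mention a person born in 1950.
print(filter_candidates(
    [Candidate("Q1370", False, None), Candidate("Q99999999", False, 1950)],
    publication_year=1790,
))
```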
This filter has been used in various publications:
- MELHISSA: a multilingual entity linking architecture for historical press articles
- Robust Named Entity Recognition and Linking on Historical Multilingual Documents
- Exploratory Analysis of News Sentiment Using Subgroup Discovery
- Entity Linking for Historical Documents: Challenges and Solutions
The current version is the one used in our latest publication, MELHISSA: a multilingual entity linking architecture for historical press articles. For previous versions, please visit the old_versions branch, or you can download the files directly from here.
The code uses a column-based format that must contain data regarding the tokens, the named entities, and the entity linking. It has been created for the format used in CLEF-HIPE-2020:
TOKEN NE-COARSE-LIT NE-COARSE-METO NE-FINE-LIT NE-FINE-METO NE-FINE-COMP NE-NESTED NEL-LIT NEL-METO
# language = en
# newspaper = sn83030483
# date = 1790-01-02
# document_id = sn83030483-1790-01-02-a-i0004
FROM O O O O O _ _ _
A O O O O O _ _ _
VIRGINIA B-loc O B-loc O O _ Q1370|Q1070529|NIL|Q16155633|Q4112016 _
PAPER O O O O O _ _ _
. O O O O O _ _ _
We use the comment `date =` to extract the publication date.
Although the latest version (ICADL) makes it possible to indicate the columns in which these data are available, as well as the separators and comment markers, we have not tested it with other formats.
Furthermore, the filter uses the data provided by the NER tags to process the candidates. Currently, it only supports NER tags encoded in the IOB format.
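As an informal sketch of how such a file can be read (this is not the project's parser; it assumes tab-separated columns in the order of the CLEF-HIPE-2020 example above), one could extract the publication date and the NEL candidates like this:

```python
import re
from typing import Dict, List, Optional, Tuple

def read_hipe_document(path: str) -> Tuple[Optional[str], List[Dict]]:
    """Read a CLEF-HIPE-2020 style TSV file: '#'-prefixed comment lines carry
    metadata (e.g. '# date = 1790-01-02'); other lines are tab-separated columns
    TOKEN, NE-COARSE-LIT, ..., NEL-LIT, NEL-METO."""
    date = None
    tokens = []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.rstrip("\n")
            if not line:
                continue
            if line.startswith("#"):
                match = re.match(r"#\s*date\s*=\s*(\S+)", line)
                if match:
                    date = match.group(1)
                continue
            cols = line.split("\t")
            if len(cols) < 9:
                continue
            token, ne_coarse_lit, nel_lit = cols[0], cols[1], cols[7]
            # NEL candidates are pipe-separated QIDs (or NIL); '_' means no link.
            candidates = nel_lit.split("|") if nel_lit != "_" else []
            tokens.append({
                "token": token,
                "ne_tag": ne_coarse_lit,   # IOB-encoded, e.g. B-loc / I-loc / O
                "candidates": candidates,
            })
    return date, tokens
```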
Please use this publication for citing this work:
@Article{LinharesPontes2021,
author={Linhares Pontes, Elvys
and Cabrera-Diego, Luis Adrián
and Moreno, Jose G.
and Boros, Emanuela
and Hamdi, Ahmed
and Doucet, Antoine
and Sidere, Nicolas
and Coustaty, Mickaël},
title={MELHISSA: a multilingual entity linking architecture for historical press articles},
journal={International Journal on Digital Libraries},
year={2021},
month={Nov},
day={29},
issn={1432-1300},
doi={10.1007/s00799-021-00319-6},
url={https://doi.org/10.1007/s00799-021-00319-6}
}
If you use the Weighted-Levenshtein with the weights provided in the code, please also cite:
@Inproceedings{8791206,
author={Nguyen, Thi-Tuyet-Hai and Jatowt, Adam and Coustaty, Mickael and Nguyen, Nhu-Van and Doucet, Antoine},
booktitle={2019 ACM/IEEE Joint Conference on Digital Libraries (JCDL)},
title={Deep Statistical Analysis of OCR Errors for Effective Post-OCR Processing},
year={2019},
volume={},
number={},
pages={29-38},
doi={10.1109/JCDL.2019.00015}}
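For context, a weighted Levenshtein distance differs from the standard one only in that each edit operation can carry its own cost, so that frequent OCR confusions (e.g. 'l' read as '1') are penalized less. The following is a minimal sketch of the idea, not the project's implementation, and the cost values are invented for the example:

```python
from typing import Dict, Tuple

def weighted_levenshtein(a: str, b: str,
                         sub_costs: Dict[Tuple[str, str], float],
                         default_sub: float = 1.0,
                         indel: float = 1.0) -> float:
    """Dynamic-programming edit distance where substitution costs can be
    customized per character pair (insertion/deletion cost is uniform here)."""
    rows, cols = len(a) + 1, len(b) + 1
    dist = [[0.0] * cols for _ in range(rows)]
    for i in range(1, rows):
        dist[i][0] = i * indel
    for j in range(1, cols):
        dist[0][j] = j * indel
    for i in range(1, rows):
        for j in range(1, cols):
            if a[i - 1] == b[j - 1]:
                sub = 0.0
            else:
                sub = sub_costs.get((a[i - 1], b[j - 1]), default_sub)
            dist[i][j] = min(dist[i - 1][j] + indel,      # deletion
                             dist[i][j - 1] + indel,      # insertion
                             dist[i - 1][j - 1] + sub)    # substitution
    return dist[rows - 1][cols - 1]

# Invented example weight: 'l' -> '1' is a common OCR confusion, so it costs less.
print(weighted_levenshtein("london", "1ondon", {("l", "1"): 0.2}))  # 0.2
```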
Some of the DBpedia Chapters went offline during 2020-2021, and we do not know whether they will come back online. Thus, there might be some issues with specific configurations. This version should be more robust when a chapter goes offline.
We provide the cached data that was used for the latest publication. The use of a cache decreases the number of queries sent to DBpedia and Wikidata, and therefore increases the processing speed.
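As an illustration of the idea only (a minimal sketch, not the project's cache format), the principle is to store each SPARQL response on disk keyed by the query, so that repeated lookups never hit the endpoint again; the cache directory name is hypothetical:

```python
import hashlib
import json
import os
import urllib.parse
import urllib.request

CACHE_DIR = "sparql_cache"  # hypothetical cache location

def cached_sparql(endpoint: str, query: str) -> dict:
    """Return the JSON result of a SPARQL query, reusing a local cache file
    when the same query was already sent to the same endpoint."""
    os.makedirs(CACHE_DIR, exist_ok=True)
    key = hashlib.sha256((endpoint + "\n" + query).encode("utf-8")).hexdigest()
    cache_file = os.path.join(CACHE_DIR, key + ".json")
    if os.path.exists(cache_file):
        with open(cache_file, encoding="utf-8") as handle:
            return json.load(handle)
    url = endpoint + "?" + urllib.parse.urlencode({"query": query, "format": "json"})
    request = urllib.request.Request(url, headers={"User-Agent": "el-filter-example"})
    with urllib.request.urlopen(request) as response:
        result = json.loads(response.read().decode("utf-8"))
    with open(cache_file, "w", encoding="utf-8") as handle:
        json.dump(result, handle)
    return result

# Example: fetch the English label of an entity from the Wikidata endpoint.
result = cached_sparql(
    "https://query.wikidata.org/sparql",
    "SELECT ?label WHERE { wd:Q1370 rdfs:label ?label . FILTER(lang(?label) = 'en') }",
)
```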
This project has been tested with Python 3.8. The requirements can be found in requirements.txt and can be installed using pip (e.g. `pip install -r requirements.txt`).
This work is a result of the European Union H2020 projects Embeddia and NewsEye. Embeddia is a project that creates NLP tools focused on under-represented European languages, with the objective of making these tools more accessible to the general public and to media enterprises. Visit Embeddia's GitHub to discover more NLP tools and models created within this project. NewsEye is a project that develops methods and tools for digital humanities that can enhance access to historical newspapers for a wide range of users. Visit NewsEye's GitHub to discover the range of tools developed for the digital humanities.