This package acts as an Entity Recogniser and Linker using DBpedia Spotlight, annotating spaCy Spans and adding them to the entity annotations.
It can be added to an existing spaCy Language object, or create a new one from an empty pipeline.
The results are put in doc.ents, overwriting existing entities in case of conflict depending on the overwrite_ents parameter.
The spans produced have the following properties:
- span.label_ = 'DBPEDIA_ENT'
- span.kb_id_ containing the URI of the linked entity
- span._.dbpedia_raw_result containing the raw JSON for the entity from DBpedia Spotlight (@URI, @support, @types, @surfaceForm, @offset, @similarityScore, @percentageOfSecondRank)
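In the output examples of this README the similarity scores appear as strings. A minimal sketch of turning one raw result into typed values, assuming the other numeric fields are strings as well. The URI, surface form and similarity score are taken from the example output in this README; the remaining values are made-up placeholders, and parse_raw_result is a hypothetical helper, not part of this package.

```python
def parse_raw_result(raw):
    """Convert the string fields of one raw Spotlight result into typed values."""
    # '@types' is assumed to be a comma-separated string; empty means no types
    types = [t.strip() for t in raw.get('@types', '').split(',') if t.strip()]
    return {
        'uri': raw['@URI'],
        'surface_form': raw['@surfaceForm'],
        'offset': int(raw['@offset']),
        'support': int(raw['@support']),
        'similarity': float(raw['@similarityScore']),
        'types': types,
    }


example = {
    '@URI': 'http://dbpedia.org/resource/Google',
    '@support': '1000',                                 # placeholder value
    '@types': 'DBpedia:Organisation,DBpedia:Company',   # placeholder value
    '@surfaceForm': 'Google LLC',
    '@offset': '0',
    '@similarityScore': '0.9999999999999005',
    '@percentageOfSecondRank': '0.0',                   # placeholder value
}

parsed = parse_raw_result(example)
print(parsed['surface_form'], parsed['similarity'], parsed['types'])
```

With real documents the same helper could be applied to ent._.dbpedia_raw_result for each entity in doc.ents.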
This package works with:
- python 3.7/3.8/3.9/3.10/3.11
- spaCy>=3.0.0,<4.0.0, last tested on version 3.5
With pip: pip install spacy-dbpedia-spotlight
From GitHub (after clone): pip install .
With a blank new language
import spacy_dbpedia_spotlight
# a new blank model will be created, with the language code provided in the parameter
nlp = spacy_dbpedia_spotlight.create('en')
# in this case, the pipeline will only contain the EntityLinker
print(nlp.pipe_names)
# ['dbpedia_spotlight']
On top of an existing nlp object (added as last pipeline stage by default)
import spacy
# this is any existing model
nlp = spacy.load('en_core_web_lg')
# add the pipeline stage
nlp.add_pipe('dbpedia_spotlight')
# see the pipeline, the added stage is at the end
print(nlp.pipe_names)
# ['tok2vec', 'tagger', 'parser', 'ner', 'attribute_ruler', 'lemmatizer', 'dbpedia_spotlight']
The pipeline stage can be added at any point of an existing pipeline (using the arguments before, after, first or last).
A specific positioning can be useful if you are using the output of one stage as input to another stage.
import spacy
# this is any existing model
nlp = spacy.load('en_core_web_lg')
# add the pipeline stage
nlp.add_pipe('dbpedia_spotlight', first=True)
# see the pipeline, the added stage is at the beginning
print(nlp.pipe_names)
# ['dbpedia_spotlight', 'tok2vec', 'tagger', 'parser', 'ner', 'attribute_ruler', 'lemmatizer']
After instantiating the component, you can use the spaCy API as usual, and you will get the DBpedia Spotlight entities:
import spacy
nlp = spacy.blank('en')
nlp.add_pipe('dbpedia_spotlight')
doc = nlp('Google LLC is an American multinational technology company.')
print([(ent.text, ent.kb_id_, ent._.dbpedia_raw_result['@similarityScore']) for ent in doc.ents])
Output example:
[('Google LLC', 'http://dbpedia.org/resource/Google', '0.9999999999999005'), ('American', 'http://dbpedia.org/resource/United_States', '0.9861264878996763')]
This component can be used with several parameters, which control the usage of the DBpedia Spotlight API and the behaviour of this bridge library.
All the configuration options described in detail below can be passed when instantiating the pipeline component with the optional config parameter.
import spacy
nlp = spacy.load('en_core_web_lg')
# instantiate Italian EntityLinker on the English model
nlp.add_pipe('dbpedia_spotlight', config={'language_code': 'it'})
Alternatively, the values can be changed after the pipeline stage has been created, by modifying them directly on the pipeline stage object:
import spacy
text = 'And the boy said "voglio andare negli Stati Uniti"'
nlp = spacy.blank('en')
# at the beginning we want to use the default parameters (in this case the English API endpoint is used)
nlp.add_pipe('dbpedia_spotlight')
doc = nlp(text)
# no entities found
print([(ent.text, ent.kb_id_, ent._.dbpedia_raw_result['@similarityScore']) for ent in doc.ents])
# we want to change the `language_code`
nlp.get_pipe('dbpedia_spotlight').language_code = 'it'
# you need to re-create the document, because the entities are computed at document creation
doc = nlp(text)
# now we have one entity
print([(ent.text, ent.kb_id_, ent._.dbpedia_raw_result['@similarityScore']) for ent in doc.ents])
language_code controls the language of DBpedia Spotlight. The API is located at https://api.dbpedia-spotlight.org/{language_code}.
By default the language to be used is derived from nlp.meta['lang']. So if you are using a French pipeline, the default is fr.
When you pass a value in the configuration, this will override the default value. If you are using a pipeline in a language not supported by DBPedia Spotlight, you will be required to set this configuration option.
To support a language, it needs to be supported both by spaCy and by DBpedia Spotlight. While for spaCy there is only one column (supported / not supported), for DBpedia there are two columns:
- DBpedia REST endpoint available (remote API): the REST endpoint is available directly from dbpedia-spotlight (https://api.dbpedia-spotlight.org/{LANGUAGE}/). You don't need a local deployment of the API, but keep in mind that if you send too many requests it is better to deploy the model locally
- DBpedia model available (local deployment): the API model can be downloaded and executed locally (https://databus.dbpedia.org/dbpedia/spotlight/spotlight-model/). See the "Deploying a local model" section below
language | code | spaCy supported | DBpedia REST endpoint available (remote API) | DBpedia model available (local API) |
---|---|---|---|---|
Catalan | ca | ✅ | ✅ | ✅ |
Chinese | zh | ✅ | ❌ | ❌ |
Croatian | hr | ✅ | ❌ | ❌ |
Danish | da | ✅ | ✅ | ✅ |
Dutch | nl | ✅ | ✅ | ✅ |
English | en | ✅ | ✅ | ✅ |
Finnish | fi | ✅ | ✅ | ✅ |
French | fr | ✅ | ✅ | ✅ |
German | de | ✅ | ✅ | ✅ |
Greek | el | ✅ | ❌ | ❌ |
Hungarian | hu | ✅ | ✅ | ✅ |
Italian | it | ✅ | ✅ | ✅ |
Japanese | ja | ✅ | ❌ | ❌ |
Korean | ko | ✅ | ❌ | ❌ |
Lithuanian | lt | ✅ | ❌ | ✅ |
Macedonian | mk | ✅ | ❌ | ❌ |
Norwegian Bokmål | nb | ✅ | ❌ | ✅ (no) |
Polish | pl | ✅ | ❌ | ❌ |
Portuguese | pt | ✅ | ✅ | ✅ |
Romanian | ro | ✅ | ✅ | ✅ |
Russian | ru | ✅ | ✅ | ✅ |
Spanish | es | ✅ | ✅ | ✅ |
Swedish | sv | ✅ | ✅ | ✅ |
Turkish | tr | ✅ | ✅ | ✅ |
Ukrainian | uk | ✅ | ❌ | ❌ |
Multi-language | xx | ✅ | ❌ | ❌ |
Example:
import spacy
# Greek not supported by spotlight
nlp = spacy.blank('el')
# so let's try to use the English endpoint on the greek language
nlp.add_pipe('dbpedia_spotlight', config={'language_code': 'en'})
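The support table above can also be written down as data, which is handy for deciding up front whether a pipeline language needs an explicit language_code or a local deployment. A minimal sketch: the two sets are copied from the table as listed in this README (so they may be out of date), and spotlight_availability is a hypothetical helper, not part of this package.

```python
# spaCy language codes with a public REST endpoint, per the table above
REMOTE_API = {'ca', 'da', 'nl', 'en', 'fi', 'fr', 'de', 'hu',
              'it', 'pt', 'ro', 'ru', 'es', 'sv', 'tr'}
# spaCy language codes with a downloadable model but no public endpoint
LOCAL_ONLY = {'lt', 'nb'}


def spotlight_availability(lang):
    """Return 'remote', 'local' or 'unsupported' for a spaCy language code."""
    if lang in REMOTE_API:
        return 'remote'
    if lang in LOCAL_ONLY:
        return 'local'
    return 'unsupported'


for code in ('fr', 'lt', 'el'):
    print(code, spotlight_availability(code))
```

For an 'unsupported' language such as el, the only option is the one shown in the example above: set language_code to a supported language.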
If you don't want to use api.dbpedia-spotlight.org as server (for example because you have your local DBpedia Spotlight deployed), you can use the dbpedia_rest_endpoint parameter to point to a custom server.
The default value is http://api.dbpedia-spotlight.org/{language_code}.
By setting this parameter, the language_code parameter will be ignored. You are providing the URL of the endpoint to be used (excluding the last part, which is /annotate, /spot or /candidates).
Example:
import spacy
nlp = spacy.blank('en')
# Use your endpoint: don't put any trailing slashes, and don't include the /annotate path
nlp.add_pipe('dbpedia_spotlight', config={'dbpedia_rest_endpoint': 'http://localhost:2222/rest'})
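To make the relation between language_code and dbpedia_rest_endpoint concrete, here is a rough sketch of how a final request URL could be put together from the two options. This only illustrates the rules described above and is not the package's actual code: resolve_endpoint is a hypothetical helper, the base URL is the one given in the language_code description, and stripping a trailing slash is a choice made for this sketch.

```python
DEFAULT_BASE = 'https://api.dbpedia-spotlight.org'


def resolve_endpoint(language_code, dbpedia_rest_endpoint=None, process='annotate'):
    """Build the URL to call: a custom endpoint wins over language_code."""
    if dbpedia_rest_endpoint is not None:
        # custom server: language_code is ignored
        base = dbpedia_rest_endpoint.rstrip('/')
    else:
        base = f'{DEFAULT_BASE}/{language_code}'
    # the last path segment is the process: annotate, spot or candidates
    return f'{base}/{process}'


print(resolve_endpoint('it'))
print(resolve_endpoint('it', dbpedia_rest_endpoint='http://localhost:2222/rest'))
```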
Especially if you need to use dbpedia-spotlight intensively, you may need to deploy a local copy. There are several advantages:
- faster response time
- more languages available (lt and no)
- less overload for the public API. Yes, it's publicly shared, but bombarding it with thousands of requests is not very polite
You can choose to deploy a local model with Docker or without it.
The full and updated list of models is available here: https://databus.dbpedia.org/dbpedia/spotlight/spotlight-model/
If you already have some knowledge of Docker, this is the easiest and fastest option.
# pull the official image
docker pull dbpedia/dbpedia-spotlight
# create a volume for persistently saving the language models
docker volume create spotlight-models
# start the container (here assuming we want the en model, but any other supported language code can be used)
docker run -ti \
--restart unless-stopped \
--name dbpedia-spotlight.en \
--mount source=spotlight-models,target=/opt/spotlight \
-p 2222:80 \
dbpedia/dbpedia-spotlight \
spotlight.sh en
Without Docker, download the jar and the model, then run the server directly:
# download main jar
wget https://repo1.maven.org/maven2/org/dbpedia/spotlight/rest/1.1/rest-1.1-jar-with-dependencies.jar
# download latest model (last checked on 10/10/2022) (assuming en model)
wget -O en.tar.gz http://downloads.dbpedia.org/repo/dbpedia/spotlight/spotlight-model/2022.03.01/spotlight-model_lang=en.tar.gz
# extract model
tar xzf en.tar.gz
# run server
java -Xmx8G -jar rest-1.1-jar-with-dependencies.jar en http://localhost:2222/rest
First of all, make sure that the local server is working.
curl http://localhost:2222/rest/annotate \
--data-urlencode "text=President Obama called Wednesday on Congress to extend a tax break for students included in last year's economic stimulus package, arguing that the policy provides more generous assistance." \
--data "confidence=0.35" \
-H "Accept: text/turtle"
Then in Python you can configure the endpoint in the following way
import spacy
nlp = spacy.load('en_core_web_lg')
# Use your endpoint: don't put any trailing slashes, and don't include the /annotate path
nlp.add_pipe('dbpedia_spotlight', config={'dbpedia_rest_endpoint': 'http://localhost:2222/rest'})
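The curl check above can also be prepared from Python with the standard library only. A minimal sketch, assuming the server answers JSON when asked with the Accept: application/json header (the curl example asks for text/turtle instead); build_annotate_request is a hypothetical helper. The request is only built here, and the commented lines show how it could be sent once the local server is running.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_annotate_request(endpoint, text, confidence=0.35):
    """Prepare a POST request for the /annotate path of a Spotlight endpoint."""
    body = urlencode({'text': text, 'confidence': confidence}).encode('utf-8')
    return Request(
        f'{endpoint}/annotate',
        data=body,  # passing data makes this a POST request
        headers={'Accept': 'application/json'},
    )


req = build_annotate_request(
    'http://localhost:2222/rest',
    'President Obama called Wednesday on Congress to extend a tax break.',
)
print(req.get_method(), req.full_url)

# To actually send it (requires the local server to be up):
# import json
# from urllib.request import urlopen
# with urlopen(req) as response:
#     print(json.load(response))
```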
The parameter process controls which specific type of processing is done. The possible values are:
- annotate: a 4 (four) step process - Spotting, Candidate Mapping, Disambiguation and Linking / Stats - for linking unstructured information sources
- spot: a 1 (one) step process - Spotting - for linking unstructured information sources
- candidates: a 2 (two) step process - Spotting, Candidate Mapping - for linking unstructured information sources
The default value is annotate. This parameter works both for the default DBpedia endpoint and for custom ones.
Example:
import spacy
nlp = spacy.blank('en')
# run the candidates process
nlp.add_pipe('dbpedia_spotlight', config={'process': 'candidates'})
doc = nlp('Google LLC is an American multinational technology company.')
print([(ent.text, ent.kb_id_, ent._.dbpedia_raw_result['resource']['@contextualScore']) for ent in doc.ents])
As can be seen in the documentation of the DBpedia REST API, there are 5 parameters (confidence, support, types, sparql and policy) which can be used to filter the results. You can use them through the config object:
- confidence: confidence score for disambiguation / linking
- support: how prominent this entity is in the Lucene model, i.e. the number of inlinks in Wikipedia
- types: types filter (e.g. DBpedia:Place)
- sparql: SPARQL filtering
- policy: (whitelist) select all entities that have the same type; (blacklist) select all entities that do not have the same type
Example:
import spacy
nlp = spacy.blank('en')
text = 'Google LLC is an American multinational technology company.'
# get only the places (DBpedia:Place) with confidence above 0.75
nlp.add_pipe('dbpedia_spotlight', config={'types': 'DBpedia:Place', 'confidence': 0.75})
doc = nlp(text)
# this will output [('American', 'http://dbpedia.org/resource/United_States', '0.9861264878996763')]
print([(ent.text, ent.kb_id_, ent._.dbpedia_raw_result['@similarityScore']) for ent in doc.ents])
# now only get the organisations
nlp.get_pipe('dbpedia_spotlight').types = 'DBpedia:Organisation'
# re-create the document
doc = nlp(text)
# this will output [('Google LLC', 'http://dbpedia.org/resource/Google', '0.9999999999999005')]
print([(ent.text, ent.kb_id_, ent._.dbpedia_raw_result['@similarityScore']) for ent in doc.ents])
# now get all together
nlp.get_pipe('dbpedia_spotlight').types = None
# re-create the document
doc = nlp(text)
# this will output both Google and American
print([(ent.text, ent.kb_id_, ent._.dbpedia_raw_result['@similarityScore']) for ent in doc.ents])
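The five filters travel to the REST API as request parameters next to the text. A rough sketch of assembling such a parameter set, assuming that filters left at None are simply not sent. build_params is a hypothetical helper for illustration and does not claim to mirror this package's internals.

```python
from urllib.parse import urlencode


def build_params(text, confidence=None, support=None, types=None,
                 sparql=None, policy=None):
    """Collect the text and the filters that are actually set."""
    candidates = {
        'text': text,
        'confidence': confidence,
        'support': support,
        'types': types,
        'sparql': sparql,
        'policy': policy,
    }
    # drop every filter that was left unset
    return {key: value for key, value in candidates.items() if value is not None}


params = build_params(
    'Google LLC is an American multinational technology company.',
    confidence=0.75,
    types='DBpedia:Place',
)
print(urlencode(params))
```

Setting types back to None, as in the last step of the example above, corresponds to the parameter disappearing from the request.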
This pipeline stage can be loaded on existing language models which already have Entity recognition/linking and can also be loaded on models that don't have it. For this reason you may want to control the behaviour of writing to doc.ents and decide where the results of DBpedia Spotlight are saved.
By default, this pipeline stage writes to a dedicated span group which can be accessed with doc.spans['dbpedia_spotlight'], where the name of the span group is dbpedia_spotlight. You can change the name by using the span_group parameter.
By default, the doc.ents are overwritten with the new results. The parameter overwrite_ents can be used to control how the overwriting of doc.ents is performed, because other components may have already written there (e.g., the en_core_web_lg model has a ner pipeline component which already sets some entities). The component tries to add the new ones from DBpedia, which can be successful if the entities do not overlap in terms of tokens. The cases are the following:
- no tokens overlap between the pre-existing doc.ents and the new entities: in this case doc.ents will contain both the previous entities and the new entities
- some tokens overlap and overwrite_ents=True: the previous value of doc.ents is saved in doc.spans['ents_original'] and only the DBpedia entities will be saved in doc.ents
- some tokens overlap and overwrite_ents=False: the previous value of doc.ents is left untouched, and the DBpedia entities can be found in doc.spans['dbpedia_spotlight']
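The three cases can be sketched with plain token ranges instead of spaCy objects. This is only an illustration of the behaviour described above, not the component's code: entities are modelled as (start, end) token index pairs with an exclusive end, and overlaps and resolve_ents are hypothetical helpers.

```python
def overlaps(a, b):
    """True if two (start, end) token ranges share at least one token."""
    return a[0] < b[1] and b[0] < a[1]


def resolve_ents(existing, new, overwrite_ents=True):
    """Return (ents, spans) according to the three cases listed above."""
    # the DBpedia entities always end up in the dedicated span group
    spans = {'dbpedia_spotlight': list(new)}
    conflict = any(overlaps(e, n) for e in existing for n in new)
    if not conflict:
        # case 1: keep both the previous and the new entities
        ents = sorted(list(existing) + list(new))
    elif overwrite_ents:
        # case 2: back up the previous entities, keep only the new ones
        spans['ents_original'] = list(existing)
        ents = list(new)
    else:
        # case 3: leave the previous entities untouched
        ents = list(existing)
    return ents, spans


print(resolve_ents([(0, 2)], [(4, 5)]))
print(resolve_ents([(0, 2)], [(1, 3)], overwrite_ents=True))
print(resolve_ents([(0, 2)], [(1, 3)], overwrite_ents=False))
```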
In case there is an HTTPError from the REST API, you can use the parameter raise_http_errors to select which behaviour to have:
- False: the errors will be ignored (they will be logged and visible on STDOUT).
- True: the exception will be rethrown and will stop your processing. This is the default.
import spacy
nlp = spacy.blank('en')
nlp.add_pipe('dbpedia_spotlight')
# this time you will get an HTTPError: 400 Client Error
doc = nlp('')
# now change it to False
nlp.get_pipe('dbpedia_spotlight').raise_http_errors = False
# this will generate a warning, but will not break your processing (e.g. in a loop)
doc = nlp('')
In case you need to disable SSL verification (e.g. you are getting SSLCertVerificationError and you are certain that you know what you are doing), you can use the parameter verify_ssl to do it:
- True: HTTPS requests are sent with SSL certificate verification. This is the default.
- False: HTTPS requests are sent without certificate verification and trigger an InsecureRequestWarning. Use carefully.
import spacy
nlp = spacy.blank('en')
# during the pipeline instantiation (e.g. custom dbpedia_rest_endpoint with HTTPS but self-signed certificate)
nlp.add_pipe('dbpedia_spotlight', config={'verify_ssl': False})
# or afterwards
nlp.get_pipe('dbpedia_spotlight').verify_ssl = False
# this will generate a warning, but will not break your processing (e.g. in a loop)
doc = nlp('Google LLC is an American multinational technology company.')
print(doc.ents)
# you can suppress warnings with this
import requests
from urllib3.exceptions import InsecureRequestWarning
requests.packages.urllib3.disable_warnings(category=InsecureRequestWarning)
# and now no warnings
doc = nlp('Google LLC is an American multinational technology company.')
print(doc.ents)
If you are training a pipeline and you want to include the component in it, you can add it to your config.cfg:
[nlp]
lang = "en"
pipeline = ["dbpedia"]
[components]
[components.dbpedia]
factory = "dbpedia_spotlight"
overwrite_ents = false
debug = false
After a few requests to DBpedia Spotlight, the public web service may start replying with HTTP error codes.
The solution is to use a local DBpedia Spotlight instance. See the instructions above on deploying a local model.
pip install -r requirements.txt
# test
coverage run --source=spacy_dbpedia_spotlight -m pytest
coverage xml
# build the archive
python setup.py sdist
# upload to pypi
twine upload dist/spacy_dbpedia_spotlight-0.2.6.tar.gz