SemTab matching (joint entity linking) #59

Open
VladimirAlexiev opened this issue Nov 24, 2020 · 12 comments

VladimirAlexiev commented Nov 24, 2020

The SemTab challenge (http://www.cs.ox.ac.uk/isg/challenges/sem-tab/) describes a task called

  • Tabular Data to Knowledge Graph Matching, or
  • Entity Linking for Tabular Data

The key difference from reconciliation is this:

  • Reconciliation matches one column (with possible assistance from other columns), so if you need to match multiple columns, you have to do it sequentially.
  • Entity matching can match several columns at once, thus doing joint disambiguation (aka joint WSD).

The latter is very useful for non-stratifiable scenarios like "company and CEO", "sportsman and team", etc., so it works for a wider variety of data tasks.
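
To make the contrast concrete, here is a rough sketch of the two query shapes. The first follows the existing reconciliation query format; the second is a purely hypothetical joint-query shape invented here for illustration (nothing like "row" or "relations" exists in the spec today).

```python
import json

# Current Reconciliation API: one query per cell of the main column; other
# columns can only be attached as property hints to that single query.
recon_query = {
    "q0": {
        "query": "Apple",                     # label from the "company" column
        "type": "Q4830453",                   # business (Wikidata)
        "properties": [
            {"pid": "P169", "v": "Tim Cook"}  # CEO column used only as a hint
        ],
    }
}

# Hypothetical joint query (NOT part of the spec): the whole row is submitted
# at once, so "Apple" and "Tim Cook" are disambiguated together.
joint_query = {
    "row": [
        {"column": "company", "query": "Apple"},
        {"column": "ceo", "query": "Tim Cook"},
    ],
    "relations": "infer",   # also ask the service to infer the linking properties
}

print(json.dumps(recon_query, indent=2))
print(json.dumps(joint_query, indent=2))
```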

They have a bunch of test cases called "tough tables". See https://tinyurl.com/iswc2020-resources-2T-dataset

You can watch the 15-minute presentation of this year's winner for a quick intro to the concepts:
https://drive.google.com/file/d/1vz-6nkc9t6MQZYzgg-PZNLs-9TT86wRD/view

  • @wetneb Is this in scope for an extended protocol?
  • @thadguidry Is it feasible for OpenRefine to implement this some day?
wetneb commented Nov 25, 2020

Hi @VladimirAlexiev, yes I would love the two communities to work together, and I have started discussions in the OAEI workshop to see how that could happen.

If you already have ideas about how the protocol could evolve to get closer to the SemTab challenges, don't hesitate!

thadguidry commented Nov 25, 2020

@VladimirAlexiev Yes, OpenRefine actually worked that way before: the Freebase Recon Service used extra columns as hints (with special, highly weighted scoring) for disambiguating properties in Freebase. These are somewhat equivalent to Wikidata's P1963 set of properties for any particular type, but were constrained to only the most important properties for telling similarly-named topics apart, which we called "disambiguating properties". Here's some archived info about them from the old wiki: https://web.archive.org/web/20151002083332/http://wiki.freebase.com/wiki/Disambiguation
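
For illustration only, here is a toy sketch of that kind of weighting: a fuzzy label score plus a heavy boost for each matching disambiguating-property value. The formula and data are made up and are not the actual Freebase scoring.

```python
from difflib import SequenceMatcher

def label_score(query: str, label: str) -> float:
    """Fuzzy similarity between the query string and a candidate label (0..1)."""
    return SequenceMatcher(None, query.lower(), label.lower()).ratio()

def score_candidate(query: str, hints: dict, candidate: dict,
                    hint_weight: float = 2.0) -> float:
    """Label score plus a heavily weighted boost per matching disambiguating property.

    `hints` maps property ids to values taken from the extra columns;
    `candidate` has a 'label' and a 'claims' dict (pid -> list of values).
    All of this is illustrative, not any real service's scoring.
    """
    score = label_score(query, candidate["label"])
    for pid, value in hints.items():
        values = candidate.get("claims", {}).get(pid, [])
        if value.lower() in (v.lower() for v in values):
            score += hint_weight
    return score

# Toy example: two topics named "Mercury", told apart by an extra-column hint.
candidates = [
    {"label": "Mercury", "claims": {"P397": ["Sun"]}},              # the planet
    {"label": "Mercury", "claims": {"P31": ["chemical element"]}},  # the element
]
hints = {"P397": "Sun"}   # extra column: parent astronomical body
for cand in candidates:
    print(cand["claims"], score_candidate("Mercury", hints, cand))
```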

shigapov commented Dec 3, 2020

Hi @VladimirAlexiev,
Tabular Data to Knowledge Graph Matching, or Semantic Table Interpretation, comprises three tasks: cell entity annotation (CEA), column type annotation (CTA, via P31 or P279 in Wikidata) and column property annotation (CPA).
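
For readers new to the terminology, here is a tiny hand-made illustration (not taken from the challenge data) of what the three tasks produce for a two-column table:

```python
# A tiny table: mountain name and height in metres.
table = [
    ["Mount Everest", "8849"],
    ["K2", "8611"],
]

# CEA (cell entity annotation): each cell of the entity column -> a Wikidata item.
cea = {(0, 0): "Q513", (1, 0): "Q43512"}   # Mount Everest, K2

# CTA (column type annotation): the entity column -> a class (via P31/P279).
cta = {0: "Q8502"}                          # mountain

# CPA (column property annotation): the relation between column 0 and column 1.
cpa = {(0, 1): "P2044"}                     # elevation above sea level
```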

I participated in SemTab2020 with bbw-team (3rd place) and our code (https://github.com/UB-Mannheim/bbw) is open source. We used contextual matching (both vertical and horizontal) and meta-lookup for spell checking.

The mentioned 'tough tables' were used only in round 4 of SemTab2020, and they were only a part of the whole dataset, though the most challenging part. The majority of the tables in SemTab2020 were synthetically generated (https://doi.org/10.5281/zenodo.4282879).

Hi @wetneb,
A possible way to account for it is to add an 8th section, "Table annotation service", with specifications of those three tasks (CEA, CTA and CPA). Maybe some definitions of the tasks could already be added to the 2nd section, "Core concepts".

Hi @thadguidry,
Is there documentation somewhere on how candidate retrieval and scoring are currently implemented in OpenRefine? I have read Section 4.4, "A Note on Candidate Retrieval and Scoring". :)

wetneb commented Dec 3, 2020

> A possible way to account for it is to add an 8th section, "Table annotation service", with specifications of those three tasks (CEA, CTA and CPA). Maybe some definitions of the tasks could already be added to the 2nd section, "Core concepts".

I am keen to go in this direction! But I think there is a big overlap between CEA and the reconciliation queries we have, so I have been thinking about generalizing reconciliation queries so that CEA tasks could be formulated as reconciliation queries.

> Is there documentation somewhere on how candidate retrieval and scoring are currently implemented in OpenRefine? I have read Section 4.4, "A Note on Candidate Retrieval and Scoring". :)

Candidate retrieval and scoring are not done in OpenRefine; they are done in the reconciliation services themselves. You can get an overview of how services generally do it here:
https://arxiv.org/pdf/1906.08092v2.pdf (in particular the table at the end)

thadguidry commented:

@shigapov In addition to Antonin's (@wetneb) excellent paper, I would say it helps just to know and keep learning about web search technologies, text analysis strategies, and lexicographic or linguistic research in general. If you want a quick primer on building better reconciliation services for particular domains, you might start with a tool like Elasticsearch, which many use as a foundation and then build their custom Reconciliation APIs on top of.
I'm not sure if the community already has some scripting or plugins for Elasticsearch that might help you build things out even quicker. You might ask around on the W3C Reconciliation mailing list, or poke around on GitHub using its advanced search to see if anyone already has some good starting scripts, code, or plugins.

In any case, Elasticsearch's wealth of documentation is extremely helpful for anyone starting to build something robust without much pain.
Starting here, perhaps in this order, should get you quickly acquainted and thinking in the right direction for your needs (a small retrieval sketch follows after the links):
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-your-data.html
https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis.html
https://www.elastic.co/guide/en/elasticsearch/reference/current/modules-scripting.html
https://www.elastic.co/guide/en/elasticsearch/reference/current/modules-scripting-engine.html
https://www.elastic.co/guide/en/elasticsearch/reference/current/rest-apis.html
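
To make this concrete, here is a minimal candidate-retrieval sketch against Elasticsearch's REST _search endpoint. The URL, index name and field names ("label", "aliases") are assumptions for illustration; a real reconciliation service would add its own scoring and filtering on top of Elasticsearch's relevance score.

```python
import requests

ES_URL = "http://localhost:9200"   # assumed local Elasticsearch instance
INDEX = "entities"                 # assumed index holding entity labels/aliases

def retrieve_candidates(query: str, size: int = 10) -> list:
    """Fetch candidate entities whose label or aliases match the query string."""
    body = {
        "size": size,
        "query": {
            "multi_match": {
                "query": query,
                "fields": ["label^2", "aliases"],  # boost the main label field
                "fuzziness": "AUTO",               # tolerate small typos
            }
        },
    }
    resp = requests.post(f"{ES_URL}/{INDEX}/_search", json=body, timeout=10)
    resp.raise_for_status()
    return [
        {"id": hit["_id"], "name": hit["_source"]["label"], "score": hit["_score"]}
        for hit in resp.json()["hits"]["hits"]
    ]

# These hits could then be wrapped into a Reconciliation API "result" array.
print(retrieve_candidates("Mount Everest"))
```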

shigapov commented Dec 4, 2020

@thadguidry, thank you for all the links!

@wetneb, thank you for the paper!
As far as I know, SemTab2021 will require participants to provide a publicly available API for their semantic annotator. The specifications for table annotation would be very relevant there.

shigapov commented Dec 4, 2020

In the current specification, "query" corresponds to a label or an alias of an entity in Wikidata, right? As additional context we can specify "type" and "properties".

What if, in addition to a label of an entity in Wikidata ("query"), we could specify a label of an object in the statements about that entity? Specifying both "query_subject" and "query_object" is yet another way to include context. Matching could then return entities, properties and types. This would already be very close to what we are doing in SemTab.
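
If I read the proposal right, a request would look something like this (the "query_subject"/"query_object" field names are hypothetical, not in the current spec):

```python
# Hypothetical extension: both ends of a statement are given as labels, and the
# service returns the subject entity, the object entity and the linking property
# (roughly CEA plus CPA in one request).
joint_statement_query = {
    "q0": {
        "query_subject": "Germany",   # label of the entity to reconcile
        "query_object": "Berlin",     # label of an object in one of its statements
    }
}
# A service might answer with: subject Q183, object Q64, property P36 (capital).
```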

wetneb commented Dec 4, 2020

We can already specify "a label of an object in the statements corresponding to the entity": that is something you can do in the "properties" section. But at the moment you are required to say which property it is a value of. If I understand correctly, in SemTab you do not specify the linking property (the CPA challenge is about inferring it), right? So one way to relax this would be to make the property id optional there.
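
Concretely, the relaxation would look something like this (the pid-less form is only a sketch of the idea, not part of the spec):

```python
# Today: a property hint in a reconciliation query must say which property
# the value belongs to.
current_query = {
    "query": "Mount Everest",
    "properties": [{"pid": "P17", "v": "Nepal"}],   # P17 = country
}

# Proposed relaxation: omit "pid" and let the service infer the linking
# property, returning the inferred pid alongside the match (the CPA case).
relaxed_query = {
    "query": "Mount Everest",
    "properties": [{"v": "Nepal"}],
}
```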

VladimirAlexiev commented:

@shigapov thanks for mentioning the other two SemTab tasks!
Adding CEA, CPA and CTA as extra sections in the Recon spec won't work; they need to be integrated somehow.

A quick overview and elaboration of what was already mentioned:

  • CEA is the main OR task: it provides a main column (always the label) and some extra columns (properties). Unlike CEA, though, OR targets one entity per row.
  • CPA: the closest that OR has is "suggest property", which is auto-complete on property name/definition, i.e. manual selection. Antonin suggested allowing this to be done automatically, but I'm not sure that would work. CPA is done after CEA: e.g. if you have a table of peaks and ask about a column of numbers, you must already know that these are peaks and which peaks they are before you can guess that the column contains the peak height.
  • CTA: OR has two related features: "type guessing" based on the most prevalent type of the first 10-20 rows (unfortunately matched only on the basis of the label), and "type suggestion" based on auto-completion.
    • IMHO an important subtask is "type generalization". If the 20 rows suggest 5 different types, I don't want to pick just one of them: I want the server to compute the most specific supertype and suggest that (a sketch follows below).
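
A rough sketch of the type-generalization idea over a hand-written toy subclass graph (a real service would walk P279 in its own KB instead):

```python
def ancestors(cls, parents):
    """All superclasses of `cls` (including itself) in a toy subclass-of graph."""
    seen, stack = {cls}, [cls]
    while stack:
        for parent in parents.get(stack.pop(), set()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

def common_supertypes(types, parents):
    """Classes that generalize every suggested type."""
    common = ancestors(types[0], parents)
    for t in types[1:]:
        common &= ancestors(t, parents)
    return common

# Toy hierarchy standing in for P279 edges.
parents = {
    "volcano": {"mountain"},
    "mountain": {"landform"},
    "hill": {"landform"},
    "landform": set(),
}

suggested = ["volcano", "mountain"]       # types guessed from the first rows
common = common_supertypes(suggested, parents)
# The most specific common supertype is the one with the most ancestors of its own.
print(max(common, key=lambda c: len(ancestors(c, parents))))   # -> mountain
```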

@shigapov Do you have contact info and timing of SemTab 2021?

wetneb commented Dec 20, 2020

> CPA is done after CEA

Not necessarily - you can try to infer relations even without reconciling first. This is especially useful when some (or all!) of the entities involved do not exist in the target KB, but the ontology does have a property to represent their relations. If I give you a table where the first column looks like people's names, the second looks like city names and the third looks like dates in the second half of the 20th century, you can already suggest some likely relations between the column of names and the column of cities: placeOfBirth is one that comes to mind, for instance.

Your example of peaks and altitudes is another great one: looking at the table, one should be able to guess the property even if we do not know any of the peaks involved (because their names look like peak names and the numbers look like typical altitudes of peaks).
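
A crude sketch of that kind of reconciliation-free guessing, with hand-picked surface patterns and candidate properties (all of it illustrative):

```python
import re

def column_kind(values):
    """Very rough guess at what a column contains, from surface patterns alone."""
    if all(re.fullmatch(r"\d{4}-\d{2}-\d{2}", v) for v in values):
        return "date"
    if all(re.fullmatch(r"\d+(\.\d+)?", v) for v in values):
        return "number"
    return "name"

# Hand-picked candidate properties per (subject kind, object kind) pair.
CANDIDATE_PROPERTIES = {
    ("name", "name"):   ["P19 place of birth", "P551 residence"],
    ("name", "date"):   ["P569 date of birth", "P570 date of death"],
    ("name", "number"): ["P2044 elevation above sea level", "P2043 length"],
}

people = ["Marie Curie", "Alan Turing"]
cities = ["Warsaw", "London"]
print(CANDIDATE_PROPERTIES[(column_kind(people), column_kind(cities))])
```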

shigapov commented:

@VladimirAlexiev, there is a discussion group for SemTab (https://groups.google.com/g/sem-tab-challenge). You could also contact one of the organizers, Ernesto Jimenez-Ruiz (https://www.city.ac.uk/people/academics/ernesto-jimenez-ruiz). I do not know the timing of SemTab2021, but I expect it will start in April-May and end in October.

shigapov commented:

Please take a look at the tool MTab for tabular data annotation in the SemTab setting. It now has an API which returns the results of Cell Entity Annotation, Column Property Annotation and Column Type Annotation. This is something to discuss with respect to the specs, isn't it?
