
Task 2

This page describes Task 2 of the Semantic Publishing Challenge 2016.

Motivation

Much of the information about papers published on CEUR-WS.org is hidden within the PDFs themselves. Our goal is to extract this data and make it available as LOD (Linked Open Data).

That information should describe the organization of the paper's content and provide a deeper understanding of the context in which it was written. In particular, the extracted information is expected to answer queries about the internal organization of sections, tables and figures, and about the authors' affiliations, research institutions and funding sources.

The queries participants are required to answer are shown below.

The task requires techniques for extracting data from PDF, complemented by techniques for Named Entity Recognition and Natural Language Processing.

Data Source

The input dataset consists of a set of PDF papers taken from some of the workshops analysed in Task 1. The papers use different formats and different conventions for bibliographic references, headers, affiliations and acknowledgements.

Datasets can be downloaded here:

Training Dataset TD2

PDF papers available on CEUR-WS.org. Individual descriptions are given here; a list of URLs for convenient one-time download is below.

List of URLs for one-time download (a download sketch follows the list):

http://ceur-ws.org/Vol-1518/paper1.pdf
http://ceur-ws.org/Vol-1518/paper2.pdf
http://ceur-ws.org/Vol-1518/paper3.pdf
http://ceur-ws.org/Vol-1518/paper4.pdf
http://ceur-ws.org/Vol-1518/paper5.pdf
http://ceur-ws.org/Vol-1518/paper6.pdf
http://ceur-ws.org/Vol-1518/paper7.pdf
http://ceur-ws.org/Vol-1518/paper8.pdf
http://ceur-ws.org/Vol-1518/paper9.pdf
http://ceur-ws.org/Vol-1521/paper1.pdf
http://ceur-ws.org/Vol-1521/paper2.pdf
http://ceur-ws.org/Vol-1521/paper3.pdf
http://ceur-ws.org/Vol-1521/paper4.pdf
http://ceur-ws.org/Vol-1521/paper5.pdf
http://ceur-ws.org/Vol-1521/paper6.pdf
http://ceur-ws.org/Vol-1521/paper7.pdf
http://ceur-ws.org/Vol-1500/paper1.pdf
http://ceur-ws.org/Vol-1500/paper2.pdf
http://ceur-ws.org/Vol-1500/paper3.pdf
http://ceur-ws.org/Vol-1500/paper4.pdf
http://ceur-ws.org/Vol-1500/paper6.pdf
http://ceur-ws.org/Vol-1319/morse14_paper_07.pdf
http://ceur-ws.org/Vol-1317/om2014_Tpaper1.pdf
http://ceur-ws.org/Vol-1514/paper1.pdf
http://ceur-ws.org/Vol-1514/paper2.pdf
http://ceur-ws.org/Vol-1514/paper3.pdf
http://ceur-ws.org/Vol-1514/paper4.pdf
http://ceur-ws.org/Vol-1514/paper5.pdf
http://ceur-ws.org/Vol-1514/paper6.pdf
http://ceur-ws.org/Vol-1405/paper-06.pdf
http://ceur-ws.org/Vol-1303/paper_3.pdf
http://ceur-ws.org/Vol-1001/paper2.pdf
http://ceur-ws.org/Vol-1006/paper5.pdf
http://ceur-ws.org/Vol-1504/uai2015aci_paper3.pdf
http://ceur-ws.org/Vol-1504/uai2015aci_abstract1.pdf
http://ceur-ws.org/Vol-1531/paper1.pdf
http://ceur-ws.org/Vol-1309/paper2.pdf
http://ceur-ws.org/Vol-1309/paper3.pdf
http://ceur-ws.org/Vol-1313/paper_8.pdf
http://ceur-ws.org/Vol-1315/paper9.pdf
http://ceur-ws.org/Vol-1315/paper13.pdf
http://ceur-ws.org/Vol-1320/paper_7.pdf
http://ceur-ws.org/Vol-1320/paper_12.pdf
http://ceur-ws.org/Vol-1320/paper_22.pdf
http://ceur-ws.org/Vol-1320/paper_31.pdf
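
For convenience, here is a minimal download sketch in Python (not part of the official challenge tooling). It assumes the URLs above have been saved, one per line, to a hypothetical file urls-td2.txt.

```python
# Minimal download sketch: fetch every paper listed in urls-td2.txt
# (a hypothetical file holding the URLs above, one per line).
import os
import urllib.request

os.makedirs("td2-papers", exist_ok=True)
with open("urls-td2.txt") as f:
    for url in f:
        url = url.strip()
        if not url:
            continue
        # Keep the volume in the file name: plain names like paper1.pdf
        # recur across volumes and would otherwise collide.
        name = "-".join(url.split("/")[-2:])  # e.g. Vol-1518-paper1.pdf
        urllib.request.urlretrieve(url, os.path.join("td2-papers", name))
        print("downloaded", name)
```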

Expected output on TD2

The following ZIP file contains the expected output of all queries on all papers in the training dataset: sempub16-T2.zip

The archive contains the full list of queries (in QUERIES-LIST.csv) and the output of each of them in a separate .csv file.

For each query there is an entry in QUERIES-LIST.csv indicating the identifier of the query and the natural language description. The output of that query is contained in the corresponding .csv file, as shown below:

QueryID | Natural language description | CSV output file
Q1.1 | Identify the affiliations of the authors of the paper http://ceur-ws.org/Vol-1518/paper1.pdf | Q1.1.csv
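
As an illustration, the following Python sketch walks the unpacked archive, pairing each entry of QUERIES-LIST.csv with its per-query output file. The directory name is hypothetical, and the column order (identifier first, then description) is assumed from the table above.

```python
# Sketch: for each entry in QUERIES-LIST.csv, load the corresponding
# per-query output file, e.g. Q1.1 -> Q1.1.csv.
import csv
import os

archive_dir = "sempub16-T2"  # hypothetical path to the unpacked ZIP

with open(os.path.join(archive_dir, "QUERIES-LIST.csv"), newline="") as f:
    for row in csv.reader(f):
        query_id, description = row[0], row[1]  # assumed column order
        with open(os.path.join(archive_dir, query_id + ".csv"), newline="") as out:
            expected = list(csv.reader(out))
        print(query_id, "->", len(expected), "expected rows")
```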

Evaluation dataset ED2

The final evaluation will be performed using SemPubEvaluator on the set of 40 papers described below.

We will use the following list of queries:

https://github.com/angelobo/SemPubEvaluator/blob/master/data/SemPub2016/queries/Task2_queries_ED.csv

This configuration file is available in the GitHub repository of the evaluation tool.

IMPORTANT: participants are required to submit the set of .csv files that will be evaluated against the gold standard. Please follow the same numbering scheme used in the queries configuration file.

For instance:

QueryID | Natural language description | Name of the submitted CSV file
Q1.1 | Identify the affiliations of the authors of the paper http://ceur-ws.org/Vol-1006/paper2.pdf | Q1.1.csv
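
As a quick self-check before submitting, a sketch like the following verifies that a submission directory contains one .csv file per query. The directory name is hypothetical, and the sketch assumes the query identifier sits in the first column of the downloaded configuration file.

```python
# Pre-submission check: is there a .csv file for every query listed in
# the (downloaded) queries configuration file?
import csv
import os

submission_dir = "submission"  # hypothetical directory of your .csv files

with open("Task2_queries_ED.csv", newline="") as f:
    for row in csv.reader(f):
        query_id = row[0]  # assumed: identifier in the first column
        if not os.path.exists(os.path.join(submission_dir, query_id + ".csv")):
            print("missing:", query_id + ".csv")
```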

For your convenience, the same data are provided in a format that can be used to automatically generate the expected .csv output for each query:

https://github.com/angelobo/SemPubEvaluator/blob/master/data/SemPub2016/queries/Task2_queries_ED_parameters.csv

This file can be downloaded from the GitHub repository of the evaluation tool as well.

List of papers

Individual links to the papers in ED2 are given here; a list of URLs for convenient one-time download is below.

List of URLs for one-time download:

http://ceur-ws.org/Vol-1006/paper2.pdf
http://ceur-ws.org/Vol-1044/paper-01.pdf
http://ceur-ws.org/Vol-1116/paper1.pdf
http://ceur-ws.org/Vol-1116/paper6.pdf
http://ceur-ws.org/Vol-1184/ldow2014_paper_02.pdf
http://ceur-ws.org/Vol-1215/paper-05.pdf
http://ceur-ws.org/Vol-1303/paper_4.pdf
http://ceur-ws.org/Vol-1313/paper_11.pdf
http://ceur-ws.org/Vol-1313/paper_13.pdf
http://ceur-ws.org/Vol-1313/paper_14.pdf
http://ceur-ws.org/Vol-1313/paper_4.pdf
http://ceur-ws.org/Vol-1315/paper15.pdf
http://ceur-ws.org/Vol-1315/paper3.pdf
http://ceur-ws.org/Vol-1315/paper8.pdf
http://ceur-ws.org/Vol-1317/om2014_Tpaper5.pdf
http://ceur-ws.org/Vol-1317/om2014_poster3.pdf
http://ceur-ws.org/Vol-1317/om2014_poster4.pdf
http://ceur-ws.org/Vol-1317/om2014_poster8.pdf
http://ceur-ws.org/Vol-1319/morse14_paper_04.pdf
http://ceur-ws.org/Vol-1320/paper_25.pdf
http://ceur-ws.org/Vol-1405/paper-02.pdf
http://ceur-ws.org/Vol-1405/paper-03.pdf
http://ceur-ws.org/Vol-1405/paper-04.pdf
http://ceur-ws.org/Vol-1405/paper-07.pdf
http://ceur-ws.org/Vol-1504/uai2015aci_abstract2.pdf
http://ceur-ws.org/Vol-1531/paper2.pdf
http://ceur-ws.org/Vol-1531/paper4.pdf
http://ceur-ws.org/Vol-1531/paper8.pdf
http://ceur-ws.org/Vol-1554/PD_MoDELS_2015_paper_3.pdf
http://ceur-ws.org/Vol-1554/PD_MoDELS_2015_paper_5.pdf
http://ceur-ws.org/Vol-1554/PD_MoDELS_2015_paper_6.pdf
http://ceur-ws.org/Vol-1558/paper5.pdf
http://ceur-ws.org/Vol-1558/paper9.pdf
http://ceur-ws.org/Vol-1559/paper04.pdf
http://ceur-ws.org/Vol-1559/paper05.pdf
http://ceur-ws.org/Vol-1559/paper07.pdf
http://ceur-ws.org/Vol-1560/paper2.pdf
http://ceur-ws.org/Vol-1565/bmaw2015_paper4.pdf
http://ceur-ws.org/Vol-1565/bmaw2015_paper8.pdf
http://ceur-ws.org/Vol-1567/paper2.pdf

Queries

Participants are required to produce a dataset for answering the following queries.

  • Q2.1 (Affiliations in a paper): Identify the affiliations of the authors of the paper X.
  • Q2.2 (Countries in affiliations): Identify the countries of the affiliations of the authors in the paper X.
  • Q2.3 (Supplementary material): Identify the supplementary material(s) for the paper X.
  • Q2.4 (Sections): Identify the titles of the first-level sections of the paper X.
  • Q2.5 (Tables): Identify the captions of the tables in the paper X.
  • Q2.6 (Figures): Identify the captions of the figures in the paper X.
  • Q2.7 (Funding agencies): Identify the funding agencies that funded the research presented in the paper X (or part of it).
  • Q2.8 (EU projects): Identify the EU project(s) that supported the research presented in the paper X (or part of it).

These queries have to be translated into SPARQL according to the challenge's general rules and have to produce an output according to the detailed rules. An illustrative sketch follows.
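
By way of illustration, here is a minimal sketch of how a Q2.1-style query could be run with rdflib over an extracted dataset. Both the dataset file name and the ex: vocabulary are assumptions made for the example; the challenge does not prescribe an ontology, so substitute the terms of the vocabulary you actually produce.

```python
# Sketch: run a Q2.1-style SPARQL query with rdflib over a hypothetical
# extracted dataset. The ex: vocabulary is an assumption, not a
# prescribed ontology.
from rdflib import Graph

g = Graph()
g.parse("task2-output.ttl", format="turtle")  # your extracted dataset

Q2_1 = """
PREFIX ex: <http://example.org/vocab#>
SELECT ?affiliation WHERE {
    <http://ceur-ws.org/Vol-1518/paper1.pdf> ex:author ?author .
    ?author ex:affiliation ?affiliation .
}
"""
for row in g.query(Q2_1):
    print(row.affiliation)
```

The exact shape of each query will depend on how your ontology models sections, captions, affiliations, funding agencies and EU projects; what matters for evaluation is that the resulting .csv output follows the detailed rules.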