Link EC Projects (Awards vocabulary) to EuroSciVoc subjects and participating organizations with data from CORDIS #382

Open
1 task
ptamarit opened this issue Aug 9, 2024 · 0 comments · Fixed by inveniosoftware/invenio-app-rdm#2791 · May be fixed by #399

ptamarit commented Aug 9, 2024

Tasks

CORDIS Data sources

  1. As an API (aka Datalab)
  2. As a dataset

Determining CORDIS data modification dates

  1. The web UI contains a global "Updated" date as well as individual "Updated" dates for each file download.
  2. The "Linked data" download available from the web UI also contains matching "modified" fields, but the download URLs do not seem to be guessable.
  3. The individual XML files contain a contentCreationDate, contentUpdateDate, sourceUpdateDate, and lastUpdateDate tag for each project file. However, none of them is as recent as the "2024-08-08" date published on the web UI as of this writing (2024-08-12).
  4. The CSV and JSON project files only contain contentUpdateDate, which matches the contentUpdateDate in the individual XML files.
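
A minimal sketch of how the four date tags from item 3 could be compared for a single project. Only the four tag names come from the files described above; the XML fragment, its `project` root element, the sample values, and the timestamp format are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET
from datetime import datetime

# Tag names as listed in item 3 above.
DATE_TAGS = (
    "contentCreationDate",
    "contentUpdateDate",
    "sourceUpdateDate",
    "lastUpdateDate",
)

# Hypothetical fragment: real CORDIS files may use namespaces, a different
# layout, and a different timestamp format.
SAMPLE_XML = """
<project>
  <contentCreationDate>2022-09-01 10:00:00</contentCreationDate>
  <contentUpdateDate>2024-07-15 08:30:00</contentUpdateDate>
  <sourceUpdateDate>2024-07-10 12:00:00</sourceUpdateDate>
  <lastUpdateDate>2024-07-16 09:00:00</lastUpdateDate>
</project>
"""


def local_name(tag):
    """Drop a '{namespace}' prefix from an element tag, if present."""
    return tag.rsplit("}", 1)[-1]


def latest_date(xml_text, fmt="%Y-%m-%d %H:%M:%S"):
    """Return (tag name, datetime) for the most recent date tag, or None."""
    root = ET.fromstring(xml_text.strip())
    dates = {}
    for element in root.iter():
        name = local_name(element.tag)
        if name in DATE_TAGS and element.text:
            dates[name] = datetime.strptime(element.text.strip(), fmt)
    if not dates:
        return None
    return max(dates.items(), key=lambda item: item[1])


print(latest_date(SAMPLE_XML))
```

Taking the maximum over all project files would give one candidate for a dataset-level modification date, although item 3 suggests it may still lag behind the date shown on the web UI.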

Participant Identification Code (PIC) to ROR

  1. paulmillar/PIC-to-ROR
    • The version hosted on paulmillar.github.io contains 2474 entries
    • Re-running the process generates a file containing 2641 entries.
      • git clone https://github.com/paulmillar/PIC-to-ROR.git
      • cd PIC-to-ROR
      • Edit bin/acquire-data: in the metadata_url variable, remove the / after records.
      • bin/acquire-data
      • python -m venv .venv
      • source .venv/bin/activate
      • pip install gitpython sparqlwrapper geopy
      • git tag 0.1.0
      • git tag -a 0.1.0 -m "release: 0.1.0"
      • $ ./process.py data/organization.csv
        Loaded 110212 organisations from ROR data dump
        Loaded 41836 organisations from CORDIS data
        Wikidata has 2629 organisations with PIC and ROR information
        Wikidata has 2913 organisations with EU VAT and ROR information
        MEDIAN DISTANCE: 1.5681415233811342 km
        Skipping 905144443 --> https://ror.org/053194h78: No location information for PIC 905144443 in CORDIS data
        Summary:
            2641 Total mapped
        Mapped organisations written as "pic-to-ror.json".
    • Re-running the process for Horizon Europe instead of H2020 generates a file containing 1797 entries.
      • Edit bin/acquire-data, replacing both occurrences of h2020 with HORIZON.
      • bin/acquire-data
      • $ ./process.py data/organization.csv
        Loaded 110212 organisations from ROR data dump
        Loaded 24049 organisations from CORDIS data
        Wikidata has 2629 organisations with PIC and ROR information
        Wikidata has 2913 organisations with EU VAT and ROR information
        MEDIAN DISTANCE: 1.4825546435263037 km
        Skipping 909860874 --> https://ror.org/057g20z61: No location information for PIC 909860874 in CORDIS data
        Skipping 938773179 --> https://ror.org/027ynra39: No location information for PIC 938773179 in CORDIS data
        Skipping 987022046 --> https://ror.org/04jt5e503: No location information for PIC 987022046 in CORDIS data
        Summary:
            1797 Total mapped
        Mapped organisations written as "pic-to-ror.json".
  2. OpenAIRE OpenOrgs
  3. Some organizations in the full OpenAIRE Graph Dataset organization.tar file contain both ROR and PIC identifiers
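
For option 3, a rough sketch of how PIC to ROR pairs might be pulled from organization records that carry both identifiers. The record layout is an assumption, not taken from the OpenAIRE dump itself: one JSON object per line, with a `pid` list whose entries have `scheme` and `value` keys. The sample identifiers are placeholders.

```python
import json

# Hypothetical records: the "pid", "scheme" and "value" keys are assumed,
# and every identifier below is a placeholder.
SAMPLE_LINES = [
    '{"legalname": "Org A", "pid": ['
    '{"scheme": "ROR", "value": "https://ror.org/example-a"}, '
    '{"scheme": "PIC", "value": "900000001"}]}',
    '{"legalname": "Org B", "pid": ['
    '{"scheme": "ROR", "value": "https://ror.org/example-b"}]}',
    '{"legalname": "Org C", "pid": ['
    '{"scheme": "ROR", "value": "https://ror.org/example-c1"}, '
    '{"scheme": "ROR", "value": "https://ror.org/example-c2"}, '
    '{"scheme": "PIC", "value": "900000003"}]}',
]


def pic_to_ror(lines):
    """Map each PIC to a ROR ID for records with exactly one ROR ID."""
    mapping = {}
    for line in lines:
        record = json.loads(line)
        ids = {}
        for pid in record.get("pid", []):
            ids.setdefault(pid["scheme"].upper(), []).append(pid["value"])
        rors = ids.get("ROR", [])
        if len(rors) != 1:
            # No ROR ID, or an ambiguous record with several: skip it.
            continue
        for pic in ids.get("PIC", []):
            mapping[pic] = rors[0]
    return mapping


print(pic_to_ror(SAMPLE_LINES))
```

Skipping records with more than one ROR ID is a conservative choice for this sketch; the resulting mapping could then be compared against the pic-to-ror.json produced by PIC-to-ROR.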

Open questions

  • Should we use the API or the dataset as a data source? -> Let's use the dataset as a first step, which is consistent with other HTTP readers such as RORHTTPReader and OpenAIREProjectHTTPReader.
  • How do we determine the last modification date?
  • Should we process only Horizon Europe data, or also Horizon 2020 and FP7 data? -> Let's start with Horizon Europe.
  • Should this belong in awards (along with OpenAIRE Project) or in a new vocabulary? -> Let's try to "augment" the existing awards vocabulary, adding organizations and subjects data to existing projects imported from OpenAIRE. We should check what to do if the project exists in CORDIS but not in OpenAIRE (should we skip it or create a project?).
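
The "augment" idea from the last question could look roughly like the sketch below. Everything in it is hypothetical: the dictionary shapes, the `subjects` and `organizations` keys, and the sample values are placeholders, not the actual awards vocabulary schema or CORDIS file layout. Projects missing from the existing awards are collected rather than created, so the skip-or-create decision stays open.

```python
def augment_awards(awards, cordis_projects):
    """Add CORDIS subjects and organizations to awards that already exist.

    `awards` maps a project ID to an award dict (assumed shape).
    Returns the IDs of CORDIS projects that have no matching award.
    """
    unmatched = []
    for project in cordis_projects:
        award = awards.get(project["id"])
        if award is None:
            # In CORDIS but not in the existing awards: record the ID and
            # leave the decision (skip or create) to the caller.
            unmatched.append(project["id"])
            continue
        award["subjects"] = project.get("subjects", [])
        award["organizations"] = project.get("organizations", [])
    return unmatched


# Placeholder data for illustration only.
awards = {"P1": {"id": "P1", "title": "Existing award"}}
cordis_projects = [
    {"id": "P1", "subjects": ["subject-a"], "organizations": ["org-a"]},
    {"id": "P2", "subjects": ["subject-b"], "organizations": []},
]

unmatched = augment_awards(awards, cordis_projects)
print(awards["P1"], unmatched)
```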

ptamarit changed the title from "Integrate EuroSci vocabulary as datastreams" to "Integrate CORDIS vocabulary as datastreams" on Aug 9, 2024
ptamarit self-assigned this on Aug 9, 2024
ptamarit changed the title from "Integrate CORDIS vocabulary as datastreams" to "Integrate CORDIS vocabulary as datastream" on Aug 9, 2024
ptamarit changed the title from "Integrate CORDIS vocabulary as datastream" to "Link EC Projects (Awards vocabulary) to EuroSciVoc subjects and participating organizations with data from CORDIS" on Aug 16, 2024
ptamarit reopened this on Aug 20, 2024