Using Graph Database #12

ZNBai · 2023-03-23T12:06:11Z

At the moment we are using SQLite databases. That is fine for the current amount of data, but the amount of data will become much larger in the future, so I should try a graph database such as Neo4j, which is expected to perform better on large, highly connected data.

ZNBai · 2023-03-28T08:19:44Z

Progress: I have completed functions that store the data from Scopus CSV files as Neo4j nodes and edges, and I am currently storing all papers from 2022 that have Dutch researchers involved.

Why Scopus? Scopus has very comprehensive paper data. In particular, its metadata contains details of the authors' affiliations and countries as well as paper keywords (which are not available on other paper search websites).

How? The number of papers involving Dutch researchers is 50,000+ in a single year, and the Scopus API is not suited to retrieving that much data. Therefore, I use the Scopus Document Search website (which requires an academic IP address, e.g. via the UvA VPN). The query string is as follows:
PUBYEAR > 2015 AND PUBYEAR < 2024 AND ( LIMIT-TO ( OA , "all" ) ) AND ( LIMIT-TO ( AFFILCOUNTRY , "Netherlands" ) ) AND ( LIMIT-TO ( PUBSTAGE , "final" ) ) AND ( LIMIT-TO ( PUBYEAR , 2022 ) ) AND ( LIMIT-TO ( LANGUAGE , "English" ) )
Using this query we get the papers from 2022 (open access, final publication stage, in English) that have at least one author working at a Dutch institution. So the authors in the data we obtain are mostly Dutch researchers, plus researchers from other countries who have collaborated with them.
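
To repeat the export for other years, the query can be generated instead of edited by hand. This is only a minimal sketch: the template is the query string above with the year in the LIMIT-TO clause made variable, and the names `QUERY_TEMPLATE` and `build_scopus_query` are made up for illustration.

```python
# Template copied from the query above; only the year inside the
# LIMIT-TO ( PUBYEAR , ... ) clause is made variable.
QUERY_TEMPLATE = (
    'PUBYEAR > 2015 AND PUBYEAR < 2024 '
    'AND ( LIMIT-TO ( OA , "all" ) ) '
    'AND ( LIMIT-TO ( AFFILCOUNTRY , "Netherlands" ) ) '
    'AND ( LIMIT-TO ( PUBSTAGE , "final" ) ) '
    'AND ( LIMIT-TO ( PUBYEAR , {year} ) ) '
    'AND ( LIMIT-TO ( LANGUAGE , "English" ) )'
)


def build_scopus_query(year: int) -> str:
    """Return the search string for one publication year (hypothetical helper)."""
    # The template keeps the original range filter, so only years that
    # satisfy 2015 < year < 2024 can return results.
    if not 2015 < year < 2024:
        raise ValueError(f"year {year} is outside the range filter of the query")
    return QUERY_TEMPLATE.format(year=year)


if __name__ == "__main__":
    print(build_scopus_query(2022))
```

The generated string still has to be pasted into the Scopus Document Search page by hand.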

Neo4j database structure:

Two types of nodes: Person and Publication;
One type of relationship: IS_AUTHOR_OF

Person Node properties:
scopus_id (string), name (string), affiliation (string), country (string), keywords (list), year of published papers (list), subject (list)

Publication Node properties:
doi (string), title (string), cited_by (num), year (string), keywords (list), subject (list)

IS_AUTHOR_OF relationship properties:
author_name (string), title (string), year (string)
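
As an illustration of this structure, one author/paper pair could be written with a single Cypher statement like the one below. This is a sketch and not the actual storing functions: the dictionary keys, `MERGE_AUTHORSHIP`, `authorship_params` and `store_authorship` are assumed names, `session` is assumed to be a session from the official `neo4j` Python driver (only its `run(query, parameters)` method is used), and the list properties on Person are left out because they have to be accumulated over several papers.

```python
from typing import Any, Dict

# MERGE creates the node or relationship only if it does not exist yet,
# so storing the same author or paper twice does not create duplicates.
MERGE_AUTHORSHIP = """
MERGE (p:Person {scopus_id: $scopus_id})
  SET p.name = $author_name, p.affiliation = $affiliation, p.country = $country
MERGE (pub:Publication {doi: $doi})
  SET pub.title = $title, pub.year = $year, pub.cited_by = $cited_by,
      pub.keywords = $keywords, pub.subject = $subject
MERGE (p)-[r:IS_AUTHOR_OF]->(pub)
  SET r.author_name = $author_name, r.title = $title, r.year = $year
"""


def authorship_params(paper: Dict[str, Any], author: Dict[str, Any]) -> Dict[str, Any]:
    """Map one paper and one of its authors to the query parameters."""
    return {
        "scopus_id": author["scopus_id"],
        "author_name": author["name"],
        "affiliation": author["affiliation"],
        "country": author["country"],
        "doi": paper["doi"],
        "title": paper["title"],
        "year": paper["year"],          # stored as a string, as in the schema
        "cited_by": paper["cited_by"],
        "keywords": paper["keywords"],
        "subject": paper["subject"],
    }


def store_authorship(session, paper: Dict[str, Any], author: Dict[str, Any]) -> None:
    """Write one Person, one Publication and the IS_AUTHOR_OF edge between them."""
    session.run(MERGE_AUTHORSHIP, authorship_params(paper, author))
```

Calling `store_authorship` once per author of each CSV row would produce the two node types and the single relationship type described above.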

ZNBai · 2023-03-28T08:23:51Z

Storing the data takes a long time (the metadata of 20,000 papers takes 6h+, and there are 50,000+ papers every year), so I can let it run while I work on the next tasks (such as finding visualization tools that can connect to Neo4j).

Solved: After creating a CONSTRAINT for the Person and Publication nodes, it only takes 20 min to store a year's paper data.
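
For reference, such constraints could look like the sketch below. The comment above does not say which constraints were created, so this assumes uniqueness constraints on `scopus_id` and `doi`, written in the Neo4j 5 syntax (`FOR ... REQUIRE`); the constraint names and the `create_constraints` helper are made up, and `session` is again assumed to be a `neo4j` driver session.

```python
# Assumed constraints: one uniqueness constraint per node label.
# A uniqueness constraint is backed by an index, so looking up a node by
# this property no longer has to scan every node with the same label.
CONSTRAINTS = [
    "CREATE CONSTRAINT person_scopus_id IF NOT EXISTS "
    "FOR (p:Person) REQUIRE p.scopus_id IS UNIQUE",
    "CREATE CONSTRAINT publication_doi IF NOT EXISTS "
    "FOR (pub:Publication) REQUIRE pub.doi IS UNIQUE",
]


def create_constraints(session) -> None:
    """Run each constraint statement once, before any data is loaded."""
    for statement in CONSTRAINTS:
        session.run(statement)
```

If the import looks nodes up by these properties (for example with MERGE), the backing index would be a plausible reason for the drop from hours to minutes.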
