Using Graph Database #12

Open
ZNBai opened this issue Mar 23, 2023 · 2 comments

Comments

@ZNBai
Collaborator

ZNBai commented Mar 23, 2023

At the moment we are using SQLite, which is fine for the current amount of data. In the future the amount of data will become much larger, so I should try a graph database such as Neo4j, which should perform better and run faster on this kind of data.

@ZNBai
Collaborator Author

ZNBai commented Mar 28, 2023

Progress: I have completed the functions that store data from Scopus CSV files as Neo4j nodes and edges, and I am currently storing all papers from 2022 that involve Dutch researchers.

Why Scopus? Scopus has very comprehensive paper data. In particular, its metadata contains details of authors' affiliations and countries as well as paper keywords (which are not available on other paper search websites).

How? The number of papers involving Dutch researchers in just one year is 50,000+, and the Scopus API cannot handle such a large amount of data. Therefore, I use the Scopus Document Search website (which requires an academic IP address, for example via the UvA VPN). The query string is as follows:
PUBYEAR > 2015 AND PUBYEAR < 2024 AND ( LIMIT-TO ( OA , "all" ) ) AND ( LIMIT-TO ( AFFILCOUNTRY , "Netherlands" ) ) AND ( LIMIT-TO ( PUBSTAGE , "final" ) ) AND ( LIMIT-TO ( PUBYEAR , 2022 ) ) AND ( LIMIT-TO ( LANGUAGE , "English" ) )
This query returns the open-access, final-stage, English-language papers from 2022 that have at least one author working at a Dutch institution. The authors in the resulting data are therefore mostly researchers at Dutch institutions, plus researchers from other countries who have collaborated with them.
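Since the same search has to be repeated for every year, here is a minimal sketch that rebuilds the query string above for a given year. The helper names `limit_to` and `build_scopus_query` are hypothetical; the sketch only formats text to paste into the Document Search page and does not call any Scopus API.

```python
def limit_to(field: str, value) -> str:
    """Format one refinement clause the way the query above writes it."""
    # Numbers (the year) are written bare, everything else is double-quoted.
    rendered = str(value) if isinstance(value, int) else f'"{value}"'
    return f"( LIMIT-TO ( {field} , {rendered} ) )"


def build_scopus_query(year: int, country: str = "Netherlands",
                       language: str = "English") -> str:
    """Rebuild the search string for one publication year."""
    clauses = [
        "PUBYEAR > 2015",
        "PUBYEAR < 2024",
        limit_to("OA", "all"),
        limit_to("AFFILCOUNTRY", country),
        limit_to("PUBSTAGE", "final"),
        limit_to("PUBYEAR", year),
        limit_to("LANGUAGE", language),
    ]
    return " AND ".join(clauses)


query_2022 = build_scopus_query(2022)
print(query_2022)
```

Calling `build_scopus_query(2021)` would give the string for the previous year, which stays inside the `PUBYEAR > 2015 AND PUBYEAR < 2024` window of the original query.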

Neo4j database structure:

  • Two types of nodes: Person and Publication;
  • One type of relationship: IS_AUTHOR_OF

Person Node properties:
scopus_id (string), name (string), affiliation (string), country (string), keywords (list), year of published papers (list), subject (list)

Publication Node properties:
doi (string), title (string), cited_by (num), year (string), keywords (list), subject (list)

IS_AUTHOR_OF relationship properties:
author_name (string), title (string), year (string)
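To illustrate the structure above, here is a minimal sketch that turns one CSV row into parameters for a Publication node, its Person nodes, and the IS_AUTHOR_OF edges. It is not the actual import code from this project. The column names (`Authors`, `Author(s) ID`, `Title`, `Year`, `Cited by`, `DOI`, `Author Keywords`), the semicolon separators, the function names, and the use of `MERGE` are all assumptions, and the sample row is made up. Affiliation, country, and subject are left out because the issue does not say how they are derived.

```python
import csv
import io

# Hypothetical Cypher for one authorship; parameter names match the edge dicts below.
MERGE_AUTHORSHIP = """
MERGE (p:Person {scopus_id: $scopus_id})
  ON CREATE SET p.name = $author_name
MERGE (pub:Publication {doi: $doi})
  ON CREATE SET pub.title = $title, pub.year = $year
MERGE (p)-[r:IS_AUTHOR_OF]->(pub)
SET r.author_name = $author_name, r.title = $title, r.year = $year
"""


def split_field(value: str) -> list[str]:
    """Split a semicolon-separated cell into trimmed, non-empty parts."""
    return [part.strip() for part in (value or "").split(";") if part.strip()]


def row_to_graph(row: dict) -> tuple[dict, list[dict], list[dict]]:
    """Map one CSV row to (publication, persons, edges) parameter dicts."""
    publication = {
        "doi": row["DOI"],
        "title": row["Title"],
        "cited_by": int(row["Cited by"] or 0),  # empty cell counts as 0
        "year": row["Year"],                    # kept as a string, as in the schema
        "keywords": split_field(row["Author Keywords"]),
    }
    persons, edges = [], []
    # Assumes author names and author IDs are listed in the same order.
    for scopus_id, name in zip(split_field(row["Author(s) ID"]),
                               split_field(row["Authors"])):
        persons.append({"scopus_id": scopus_id, "name": name})
        edges.append({
            "scopus_id": scopus_id,
            "doi": publication["doi"],
            "author_name": name,
            "title": publication["title"],
            "year": publication["year"],
        })
    return publication, persons, edges


# Made-up sample row, only to show the shape of the output.
SAMPLE_CSV = (
    "Authors,Author(s) ID,Title,Year,Cited by,DOI,Author Keywords\n"
    '"Doe J.; Roe R.","111; 222","An example title",2022,3,'
    '10.0000/example,"graphs; databases"\n'
)

sample_row = next(csv.DictReader(io.StringIO(SAMPLE_CSV)))
publication, persons, edges = row_to_graph(sample_row)
print(publication)
print(persons)
print(edges)
```

Each dict in `edges` could then be passed as the parameters of one `MERGE_AUTHORSHIP` run, so a paper with two authors yields two Person nodes, one Publication node, and two relationships.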

@ZNBai
Collaborator Author

ZNBai commented Mar 28, 2023

Storing the data takes a long time: the metadata of 20,000 papers takes 6h+, and there are 50,000+ papers every year. I can therefore let the import run while I work on the next tasks (such as finding visualization tools that can connect to Neo4j).

Solved: After creating a CONSTRAINT for the Person and Publication nodes, it only takes 20 min to store a year's paper data.
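For reference, a minimal sketch of what such constraints might look like. The issue does not say which properties were constrained, so `scopus_id` for Person and `doi` for Publication are assumptions, as are the constraint names and the helper `create_constraints`. The statements use the `FOR ... REQUIRE` syntax of Neo4j 4.4 and later; older 4.x versions use `ON ... ASSERT` instead. A uniqueness constraint is backed by an index, which is the likely reason node lookups during the import became so much faster.

```python
# Assumed constraint definitions; property choices are not confirmed by the issue.
CONSTRAINT_STATEMENTS = [
    "CREATE CONSTRAINT person_scopus_id IF NOT EXISTS "
    "FOR (p:Person) REQUIRE p.scopus_id IS UNIQUE",
    "CREATE CONSTRAINT publication_doi IF NOT EXISTS "
    "FOR (pub:Publication) REQUIRE pub.doi IS UNIQUE",
]


def create_constraints(session) -> int:
    """Run every constraint statement on an object that has a run() method."""
    for statement in CONSTRAINT_STATEMENTS:
        session.run(statement)
    return len(CONSTRAINT_STATEMENTS)


class RecordingSession:
    """Stand-in for a database session so the sketch runs without a server."""

    def __init__(self):
        self.statements = []

    def run(self, statement, **parameters):
        self.statements.append(statement)


recorder = RecordingSession()
created = create_constraints(recorder)
for statement in recorder.statements:
    print(statement)
```

With the official Neo4j Python driver, a session obtained from `driver.session()` could be passed to `create_constraints` in place of the recorder. The constraints should be created once, before the bulk import starts.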
