Commit f1fd3de

Merge branch 'master' into report

Giovanni1085 authored Apr 17, 2020
2 parents e48793c + 9d5733d commit f1fd3de
Showing 8 changed files with 35,494 additions and 1,354 deletions.
2,322 changes: 1,137 additions & 1,185 deletions Notebook_1_SQL_database.ipynb

Large diffs are not rendered by default.

357 changes: 210 additions & 147 deletions Notebook_2_API_queries.ipynb

Large diffs are not rendered by default.

61 changes: 44 additions & 17 deletions Notebook_3_metadata_overview.ipynb
@@ -9,7 +9,7 @@
},
{
"cell_type": "code",
"execution_count": 96,
"execution_count": 22,
"metadata": {},
"outputs": [
{
@@ -47,21 +47,19 @@
},
{
"cell_type": "code",
"execution_count": 97,
"execution_count": 23,
"metadata": {},
"outputs": [],
"source": [
"# load metadata\n",
"\n",
"df_meta = pd.read_csv(\"datasets_output/df_pub.csv\",compression=\"gzip\")\n",
"df_cord = pd.read_csv(\"datasets_output/sql_tables/cord19_metadata.csv\",sep=\"\\t\",header=None,names=[ 'cord19_metadata_id', 'source', 'license', 'full_text_file', 'ms_academic_id',\n",
" 'who_covidence', 'sha', 'full_text', 'pub_id'])\n",
"df_meta.drop(columns=\"Unnamed: 0\",inplace=True)"
]
},
{
"cell_type": "code",
"execution_count": 98,
"execution_count": 24,
"metadata": {},
"outputs": [
{
@@ -222,7 +220,7 @@
"4 2020-03-28 08:46:55.291546 "
]
},
"execution_count": 98,
"execution_count": 24,
"metadata": {},
"output_type": "execute_result"
}
@@ -233,7 +231,7 @@
},
{
"cell_type": "code",
"execution_count": 99,
"execution_count": 25,
"metadata": {},
"outputs": [
{
@@ -245,7 +243,7 @@
" dtype='object')"
]
},
"execution_count": 99,
"execution_count": 25,
"metadata": {},
"output_type": "execute_result"
}
@@ -263,7 +261,7 @@
},
{
"cell_type": "code",
"execution_count": 100,
"execution_count": 6,
"metadata": {},
"outputs": [],
"source": [
@@ -283,7 +281,7 @@
},
{
"cell_type": "code",
"execution_count": 101,
"execution_count": 7,
"metadata": {},
"outputs": [
{
@@ -300,7 +298,7 @@
"Name: publication_year, dtype: float64"
]
},
"execution_count": 101,
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
@@ -311,16 +309,16 @@
},
{
"cell_type": "code",
"execution_count": 102,
"execution_count": 8,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<matplotlib.axes._subplots.AxesSubplot at 0x1a6fc3d6d0>"
"<matplotlib.axes._subplots.AxesSubplot at 0x1a232c0d50>"
]
},
"execution_count": 102,
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
},
@@ -341,16 +339,16 @@
},
{
"cell_type": "code",
"execution_count": 103,
"execution_count": 9,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<matplotlib.axes._subplots.AxesSubplot at 0x1a6fdae150>"
"<matplotlib.axes._subplots.AxesSubplot at 0x1a1ca2c990>"
]
},
"execution_count": 103,
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
},
@@ -369,6 +367,35 @@
"sns.distplot(df_meta[(pd.notnull(df_meta.publication_year)) & (df_meta.publication_year > 2000)].publication_year.tolist(), bins=20, hist=True, kde=False)"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [],
"source": [
"df_meta[\"abstract_length\"] = df_meta.abstract.str.len()"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(39154, 14)"
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_meta[df_meta.abstract_length>0].shape"
]
},
{
"cell_type": "markdown",
"metadata": {},
10 changes: 5 additions & 5 deletions README.md
@@ -16,9 +16,9 @@ This workflow can be illustrated as follows:

For the moment, we consider publications from the following sources:

* [CORD19](https://pages.semanticscholar.org/coronavirus-research) (last updated March 28, 2020):
* [Dimensions](https://docs.google.com/spreadsheets/d/1-kTZJZ1GAhJ2m4GAIhw1ZdlgO46JpvX0ZQa232VWRmw/edit#gid=2034285255) (last updated March 28, 2020):
* [WHO](https://www.who.int/emergencies/diseases/novel-coronavirus-2019/global-research-on-novel-coronavirus-2019-ncov) (last updated March 28, 2020)
* [CORD19](https://pages.semanticscholar.org/coronavirus-research) (last updated April 10, 2020):
* [Dimensions](https://docs.google.com/spreadsheets/d/1-kTZJZ1GAhJ2m4GAIhw1ZdlgO46JpvX0ZQa232VWRmw/edit#gid=2034285255) (last updated April 10, 2020):
* [WHO](https://www.who.int/emergencies/diseases/novel-coronavirus-2019/global-research-on-novel-coronavirus-2019-ncov) (last updated April 10, 2020)

You will need to download these datasets and add them to a local folder in order to process them. We assume that you will have a local copy of the whole CORD19 dataset, and a `csv` file with publication metadata for Dimensions and WHO. Previous releases of the Dimensions and WHO lists can be found in the [datasets_input](datasets_input) folder. Please also see the notebooks below for more details.

Expand All @@ -39,14 +39,14 @@ You can use the [Notebook_1_SQL_database](Notebook_1_SQL_database.ipynb) noteboo
* The `pub` table contains publications from all data sources. If you would like to work with publications coming exclusively from one data source, join it with the `datasource` table via the `pub_datasource` table.
* The primary keys of all tables (`pub_id`, `cord19_metadata_id`, `who_metadata_id`, `dimensions_metadata_id`, `datasource_id`) are not stable and are only internally consistent: if you create different versions of the database, they will likely differ.
* In order to work with Dimensions and Altmetrics data, *publication identifiers* should be used. Please give preference to DOIs, then to PMIDs, then to PMCIDs (listed in order of coverage).
* We removed a few (<1000) publications which had no known identifier among these three options. These are usually pre-prints, which are likely to be equipped with an identifier in future releases.
* We removed a few (~1200) publications which had no known identifier among these three options. These are usually pre-prints, which are likely to be equipped with an identifier in future releases.
* The `metadata` tables contain fields which are specific to a datasource and which we considered potentially useful. They are only available for publications coming from that datasource.
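The identifier preference above (DOIs first, then PMIDs, then PMCIDs) can be sketched as a small helper. This is a minimal illustration, not code from the notebooks; the column names `doi`, `pmid`, `pmcid` and the sample values are assumptions:

```python
import pandas as pd

def best_identifier(row):
    """Return (type, value) for the preferred identifier of a publication:
    DOI first, then PMID, then PMCID (listed in order of coverage)."""
    for col in ("doi", "pmid", "pmcid"):
        value = row.get(col)
        if pd.notnull(value) and str(value).strip():
            return col, str(value).strip()
    # No known identifier among the three options: such rows were removed
    return None, None

# Hypothetical rows mimicking publication metadata (placeholder identifiers)
df = pd.DataFrame([
    {"doi": "10.1000/example.1", "pmid": None, "pmcid": None},
    {"doi": None, "pmid": "12345", "pmcid": "PMC12345"},
    {"doi": None, "pmid": None, "pmcid": None},
])
df[["id_type", "id_value"]] = df.apply(best_identifier, axis=1, result_type="expand")
```

Rows whose `id_type` comes out as `None` correspond to the dropped publications with no usable identifier.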

### Query Dimensions and Altmetrics

You can then query the [Dimensions](https://docs.dimensions.ai/dsl) and [Altmetrics](https://api.altmetric.com) APIs with your own keys, using the [Notebook_2_API_queries](Notebook_2_API_queries.ipynb) notebook. You can request access as a researcher here: https://www.dimensions.ai/scientometric-research.
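As a rough sketch of what such a lookup can look like, the snippet below builds a request against Altmetric's public `v1/doi` endpoint. This is an assumption-laden illustration, not the notebook's actual query code; the DOI shown is a placeholder, and the Dimensions API additionally requires exchanging your key for an auth token first:

```python
import json
from urllib.request import urlopen

ALTMETRIC_BASE = "https://api.altmetric.com/v1/doi/"

def altmetric_url(doi: str) -> str:
    """Build the Altmetric details URL for a single DOI."""
    return ALTMETRIC_BASE + doi.strip()

def fetch_altmetric(doi: str) -> dict:
    """Fetch and parse the Altmetric record for one DOI (network call)."""
    with urlopen(altmetric_url(doi)) as resp:
        return json.load(resp)

if __name__ == "__main__":
    # "10.1000/example.1" is a placeholder DOI, not a real record
    print(altmetric_url("10.1000/example.1"))
```

In the notebooks, responses like these are collected per identifier and merged back onto the publication tables.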

### Data overview
### Data analysis

Using the [Notebook_3_metadata_overview](Notebook_3_metadata_overview.ipynb) and [Notebook_4_API_data_overview](Notebook_4_API_data_overview.ipynb) notebooks, you can get an overview of some of the resulting metadata and data.

