Commit f1fd3de

Merge branch 'master' into report

Giovanni1085 authored Apr 17, 2020
2 parents e48793c + 9d5733d commit f1fd3de
Showing 8 changed files with 35,494 additions and 1,354 deletions.
2,322 changes: 1,137 additions & 1,185 deletions Notebook_1_SQL_database.ipynb

Large diffs are not rendered by default.

357 changes: 210 additions & 147 deletions Notebook_2_API_queries.ipynb

Large diffs are not rendered by default.

61 changes: 44 additions & 17 deletions Notebook_3_metadata_overview.ipynb
@@ -9,7 +9,7 @@
},
{
"cell_type": "code",
"execution_count": 96,
"execution_count": 22,
"metadata": {},
"outputs": [
{
@@ -47,21 +47,19 @@
},
{
"cell_type": "code",
"execution_count": 97,
"execution_count": 23,
"metadata": {},
"outputs": [],
"source": [
"# load metadata\n",
"\n",
"df_meta = pd.read_csv(\"datasets_output/df_pub.csv\",compression=\"gzip\")\n",
"df_cord = pd.read_csv(\"datasets_output/sql_tables/cord19_metadata.csv\",sep=\"\\t\",header=None,names=[ 'cord19_metadata_id', 'source', 'license', 'full_text_file', 'ms_academic_id',\n",
" 'who_covidence', 'sha', 'full_text', 'pub_id'])\n",
"df_meta.drop(columns=\"Unnamed: 0\",inplace=True)"
]
},
{
"cell_type": "code",
"execution_count": 98,
"execution_count": 24,
"metadata": {},
"outputs": [
{
@@ -222,7 +220,7 @@
"4 2020-03-28 08:46:55.291546 "
]
},
"execution_count": 98,
"execution_count": 24,
"metadata": {},
"output_type": "execute_result"
}
@@ -233,7 +231,7 @@
},
{
"cell_type": "code",
"execution_count": 99,
"execution_count": 25,
"metadata": {},
"outputs": [
{
@@ -245,7 +243,7 @@
" dtype='object')"
]
},
"execution_count": 99,
"execution_count": 25,
"metadata": {},
"output_type": "execute_result"
}
@@ -263,7 +261,7 @@
},
{
"cell_type": "code",
"execution_count": 100,
"execution_count": 6,
"metadata": {},
"outputs": [],
"source": [
@@ -283,7 +281,7 @@
},
{
"cell_type": "code",
"execution_count": 101,
"execution_count": 7,
"metadata": {},
"outputs": [
{
@@ -300,7 +298,7 @@
"Name: publication_year, dtype: float64"
]
},
"execution_count": 101,
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
@@ -311,16 +309,16 @@
},
{
"cell_type": "code",
"execution_count": 102,
"execution_count": 8,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<matplotlib.axes._subplots.AxesSubplot at 0x1a6fc3d6d0>"
"<matplotlib.axes._subplots.AxesSubplot at 0x1a232c0d50>"
]
},
"execution_count": 102,
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
},
@@ -341,16 +339,16 @@
},
{
"cell_type": "code",
"execution_count": 103,
"execution_count": 9,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<matplotlib.axes._subplots.AxesSubplot at 0x1a6fdae150>"
"<matplotlib.axes._subplots.AxesSubplot at 0x1a1ca2c990>"
]
},
"execution_count": 103,
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
},
@@ -369,6 +367,35 @@
"sns.distplot(df_meta[(pd.notnull(df_meta.publication_year)) & (df_meta.publication_year > 2000)].publication_year.tolist(), bins=20, hist=True, kde=False)"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [],
"source": [
"df_meta[\"abstract_length\"] = df_meta.abstract.str.len()"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(39154, 14)"
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_meta[df_meta.abstract_length>0].shape"
]
},
{
"cell_type": "markdown",
"metadata": {},
10 changes: 5 additions & 5 deletions README.md
@@ -16,9 +16,9 @@ This workflow can be illustrated as follows:

For the moment, we consider publications from the following sources:

* [CORD19](https://pages.semanticscholar.org/coronavirus-research) (last updated March 28, 2020):
* [Dimensions](https://docs.google.com/spreadsheets/d/1-kTZJZ1GAhJ2m4GAIhw1ZdlgO46JpvX0ZQa232VWRmw/edit#gid=2034285255) (last updated March 28, 2020):
* [WHO](https://www.who.int/emergencies/diseases/novel-coronavirus-2019/global-research-on-novel-coronavirus-2019-ncov) (last updated March 28, 2020)
* [CORD19](https://pages.semanticscholar.org/coronavirus-research) (last updated April 10, 2020):
* [Dimensions](https://docs.google.com/spreadsheets/d/1-kTZJZ1GAhJ2m4GAIhw1ZdlgO46JpvX0ZQa232VWRmw/edit#gid=2034285255) (last updated April 10, 2020):
* [WHO](https://www.who.int/emergencies/diseases/novel-coronavirus-2019/global-research-on-novel-coronavirus-2019-ncov) (last updated April 10, 2020)

You will need to download these datasets and add them to a local folder in order to process them. We assume that you will have a local copy of the whole CORD19 dataset, and a `csv` file with publication metadata for Dimensions and WHO. Previous releases of the Dimensions and WHO lists can be found in the [datasets_input](datasets_input) folder. Please also see the notebooks below for more details.

Expand All @@ -39,14 +39,14 @@ You can use the [Notebook_1_SQL_database](Notebook_1_SQL_database.ipynb) noteboo
* The `pub` table contains publications from all data sources. If you would like to work with publications coming exclusively from one data source, join it with the `datasource` table via the `pub_datasource` table.
* The primary keys of all tables (`pub_id`, `cord19_metadata_id`, `who_metadata_id`, `dimensions_metadata_id`, `datasource_id`) are not stable and are only internally consistent: if you create different versions of the database, they will likely differ.
* In order to work with Dimensions and Altmetrics data, *publication identifiers* should be used. Please give preference to DOIs, then to PMIDs, then to PMCIDs (listed in order of coverage).
* We removed a few (<1000) publications which had no known identifier among these three options. These are usually pre-prints, which are likely to be equipped with an identifier in future releases.
* We removed a few (~1200) publications which had no known identifier among these three options. These are usually pre-prints, which are likely to be equipped with an identifier in future releases.
* The `metadata` tables contain fields which are specific to a datasource and which we considered potentially useful. They are only available for publications coming from that datasource.
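The identifier preference above (DOIs first, then PMIDs, then PMCIDs) can be sketched as a small helper. This is a minimal illustration, not code from the notebooks; the column names `doi`, `pmid`, `pmcid` and the sample values are assumptions:

```python
import pandas as pd

def best_identifier(row):
    """Return (type, value) for the preferred identifier of a publication:
    DOI first, then PMID, then PMCID (listed in order of coverage)."""
    for col in ("doi", "pmid", "pmcid"):
        value = row.get(col)
        if pd.notnull(value) and str(value).strip():
            return col, str(value).strip()
    # No known identifier among the three options: such rows were removed
    return None, None

# Hypothetical rows mimicking publication metadata (placeholder identifiers)
df = pd.DataFrame([
    {"doi": "10.1000/example.1", "pmid": None, "pmcid": None},
    {"doi": None, "pmid": "12345", "pmcid": "PMC12345"},
    {"doi": None, "pmid": None, "pmcid": None},
])
df[["id_type", "id_value"]] = df.apply(best_identifier, axis=1, result_type="expand")
```

Rows whose `id_type` comes out as `None` correspond to the dropped publications with no usable identifier.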

### Query Dimensions and Altmetrics

You can then query the [Dimensions](https://docs.dimensions.ai/dsl) and [Altmetrics](https://api.altmetric.com) APIs with your own keys, using the [Notebook_2_API_queries](Notebook_2_API_queries.ipynb) notebook. You can request access as a researcher here: https://www.dimensions.ai/scientometric-research.
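As a rough sketch of what such a lookup can look like, the snippet below builds a request against Altmetric's public `v1/doi` endpoint. This is an assumption-laden illustration, not the notebook's actual query code; the DOI shown is a placeholder, and the Dimensions API additionally requires exchanging your key for an auth token first:

```python
import json
from urllib.request import urlopen

ALTMETRIC_BASE = "https://api.altmetric.com/v1/doi/"

def altmetric_url(doi: str) -> str:
    """Build the Altmetric details URL for a single DOI."""
    return ALTMETRIC_BASE + doi.strip()

def fetch_altmetric(doi: str) -> dict:
    """Fetch and parse the Altmetric record for one DOI (network call)."""
    with urlopen(altmetric_url(doi)) as resp:
        return json.load(resp)

if __name__ == "__main__":
    # "10.1000/example.1" is a placeholder DOI, not a real record
    print(altmetric_url("10.1000/example.1"))
```

In the notebooks, responses like these are collected per identifier and merged back onto the publication tables.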

### Data overview
### Data analysis

Using the [Notebook_3_metadata_overview](Notebook_3_metadata_overview.ipynb) and [Notebook_4_API_data_overview](Notebook_4_API_data_overview.ipynb) notebooks, you can get an overview of some of the resulting metadata and data.

