Citation overrepresentation tool

Added a new tool which takes a bibliography and prints which labs, journals, & institutions are overrepresented in your citations
emilyasterjones · Aug 22, 2020 · 22e2078 · 22e2078
1 parent d0f7fb0
commit 22e2078
Show file tree

Hide file tree

Showing 3 changed files with 233 additions and 7 deletions.
diff --git a/README.md b/README.md
@@ -1,22 +1,34 @@
 # BioRxiv Speaker Finder
-This iPython notebook extracts first and last authors who have published bioRxiv preprints relevant to an inputted subject area. You can use it to find researchers outside of your network to invite them as a speaker or cite their work.
-
-**Launch:** [![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/emilyasterjones/bioRxiv_speaker_finder/master?filepath=bioRxiv_speaker_finder.ipynb)
+This iPython notebook extracts first and last authors who have published bioRxiv preprints or Pubmed manuscripts relevant to an inputted subject area. You can use it to find researchers outside of your network to invite them as a speaker or cite their work.
 
 **Inputs:** keyword (in title or abstract) & whether you are looking for trainees (first authors) or PIs (last authors)
 
-**Outputs:** Printed data frame & CSV with a list of authors who have published preprints containing all keywords, with the following attributes:
+**Outputs:** Printed data frame & CSV with a list of authors who have published manuscripts containing all keywords, with the following attributes:
 * institution
 * email
 * ORCID link
-* \# preprints with provided keywords
+* \# manuscripts with provided keywords
+* source (bioRxiv or Pubmed)
 * \# total preprints
-* \# total downloads.
+* \# total downloads
+* \# total works (from ORCID).
 
-Lists are sorted in order of # of keyword preprints so the most relevant authors will be at the top.
+Lists are sorted in order of # of keyword manuscripts so the most relevant authors will be at the top.
 
 **Optional:** code cells at the end use APIs to predict the gender & ethnicity of the authors. Predicted female & minority authors are printed and the predictions are appended as columns to the CSV.  
 If you use these cells, please read the important caveats at the top of the notebook and treat these predictions with caution.
 
 If you use this tool, please cite the original paper which created the Rxivist API:  
 *Abdill RJ, Blekhman R. "Tracking the popularity and outcomes of all bioRxiv preprints." eLife (2019). doi: 10.7554/eLife.45133.*
+
+# Citation Overrepresentation Tool
+This second iPython notebook identifies overrepresented authors, journals, & institutions from a citation list.
+
+**Inputs:** .bib file extracted from a paper
+
+**Outputs:** prints sorted list most-cited last authors, journals, & institutions (as extracted from ORCID)
+
+**Launch:** [![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/emilyasterjones/bioRxiv_speaker_finder/master)
+
+
+
diff --git a/bioRxiv_speaker_finder.ipynb b/bioRxiv_speaker_finder.ipynb
@@ -324,6 +324,7 @@
    "outputs": [],
    "source": [
     "# get list of missing ORCIDs\n",
+    "#NB this section calls the API once for each author, so it takes a while\n",
     "for index, row in auth_df.iterrows():\n",
     "    if len(row['ORCID'])==0:\n",
     "        given = re.sub(' ','%20',row['First Name'])\n",

diff --git a/citation_overrepresentation_tool.ipynb b/citation_overrepresentation_tool.ipynb
@@ -0,0 +1,213 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Citation Overrepresentation Tool\n",
+    "This tool extracts last authors & journals of papers you have cited and creates frequency tables so you can see who you cite the most often. It then attempts to map authors to instutions via the ORCID database. You can use it to find which journals, labs, and universities receive most of your attention.\n",
+    "\n",
+    "1. Extract your citations as a .bib file. Either extract all refs from a single folder in your citation manager or extract them straight from a Word doc if you use Mendeley or Zotero using [this tool](https://rintze.zelle.me/ref-extractor/).\n",
+    "2. Upload your .bib file to the binder.\n",
+    "3. Run each cell (Shift+Enter or Play button).\n",
+    "4. Optional: get an ORCID API key (instructions below) to extract institutions."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "#imports\n",
+    "!pip install pybtex\n",
+    "\n",
+    "from pybtex.database.input import bibtex\n",
+    "import glob\n",
+    "import pandas as pd\n",
+    "import requests\n",
+    "import re"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "#read .bib file\n",
+    "ID = glob.glob('*bib')\n",
+    "parser = bibtex.Parser()\n",
+    "try:\n",
+    "    bib_data = parser.parse_file(ID[0])\n",
+    "except:\n",
+    "    raise ValueError(\"Your .bib file has non-UTF8 characters it in (like smart quotes). Please remove them & try again.\")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "#extract author & journal names from each citation\n",
+    "authors = list()\n",
+    "for key in bib_data.entries:\n",
+    "    author = bib_data.entries[key].persons['author']\n",
+    "    first_name = author[-1].rich_first_names\n",
+    "    last_name = author[-1].rich_last_names\n",
+    "    first_name = str(first_name)[7:-3]\n",
+    "    last_name = str(last_name)[7:-3]\n",
+    "\n",
+    "    try:\n",
+    "        journal = bib_data.entries[key].fields['journal']\n",
+    "    except:\n",
+    "        journal = 'Book'\n",
+    "    authors.append([first_name, last_name, journal])"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "#build data frame & print\n",
+    "auth_df = pd.DataFrame(authors, columns=['First Name','Last Name', 'Journal'])\n",
+    "print('Overcited Authors')\n",
+    "print(auth_df.groupby(['First Name','Last Name']).size().sort_values(ascending=False).head(10))\n",
+    "print('\\nOvercited Journals')\n",
+    "print(auth_df.groupby(['Journal']).size().sort_values(ascending=False).head(10))"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Optional: Get institutions from ORCID\n",
+    "Get an [ORCID API client ID & secret](https://support.orcid.org/hc/en-us/articles/360006897174)\n",
+    "\n",
+    "You can learn more about how to [search for an ORCID](https://members.orcid.org/api/tutorial/search-orcid-registry) and [find info about an author given their ORCID](https://members.orcid.org/api/tutorial/read-orcid-records) in the API documentation."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "#input your client ID & key\n",
+    "ORCIDAPI_ID = 'YOUR ACCOUNT ID HERE'\n",
+    "ORCIDAPI_key = 'YOUR ACCOUNT KEY HERE'"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# build a request for a token\n",
+    "payload = {'client_id': ORCIDAPI_ID,\n",
+    "                   'client_secret': ORCIDAPI_key,\n",
+    "                   'scope': '/read-public',\n",
+    "                   'grant_type': 'client_credentials'\n",
+    "                   }\n",
+    "url = 'https://orcid.org/oauth/token'\n",
+    "headers = {'Accept': 'application/json'}\n",
+    "response = requests.post(url, data=payload, headers=headers, timeout=None)\n",
+    "response.raise_for_status()\n",
+    "token = response.json()['access_token']\n",
+    "\n",
+    "# set up headers for searches\n",
+    "headers = {'Accept': 'application/vnd.orcid+json',\n",
+    "           'Authorization type': 'Bearer',\n",
+    "           'Access token': token}"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "#find ORCIDs\n",
+    "#NB this section calls the API once for each author, so it takes a while\n",
+    "orcid_list = list()\n",
+    "for index, row in auth_df.iterrows():\n",
+    "    given = re.sub(' ','%20',row['First Name'])\n",
+    "    family = re.sub(' ','%20',row['Last Name'])\n",
+    "\n",
+    "    #build search\n",
+    "    url = \"https://pub.orcid.org/v3.0/search/?q=\" \\\n",
+    "        + \"family-name:\" + family + \"+AND+given-names:\" + given \\\n",
+    "        + \"&rows=1\"\n",
+    "    auth_id = requests.get(url, headers=headers, timeout=None)\n",
+    "\n",
+    "    #get first returned ORCID\n",
+    "    if auth_id.json()['result'] is not None:\n",
+    "        orcid_list.append(auth_id.json()['result'][0]['orcid-identifier']['path'])\n",
+    "    else:\n",
+    "        orcid_list.append('')\n",
+    "\n",
+    "auth_df['ORCID'] = orcid_list"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# get institution from ORCID\n",
+    "#NB this section calls the API once for each author, so it takes a while\n",
+    "inst_list = list()\n",
+    "for index, row in auth_df.iterrows():\n",
+    "    if len(row['ORCID'])>0:\n",
+    "        url = \"https://pub.orcid.org/v2.1/\" + row['ORCID'] + \"/record\"\n",
+    "        orcid_request = requests.get(url, headers=headers, timeout=None)\n",
+    "        affil = orcid_request.json()['activities-summary']['employments']['employment-summary']\n",
+    "        if len(affil)>0:\n",
+    "#         print(json.dumps(var, indent=2, separators=(',', ':')))\n",
+    "            inst_list.append(affil[0]['organization']['name'])\n",
+    "        else:\n",
+    "            inst_list.append('Undetermined')\n",
+    "    else:\n",
+    "        inst_list.append('Undetermined')"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "#add to df & print\n",
+    "auth_df['Institution'] = inst_list\n",
+    "print('\\nOvercited Institutions')\n",
+    "print(auth_df.groupby(['Institution']).size().sort_values(ascending=False).head(10))"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.8.3"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 4
+}