Citation overrepresentation tool
Added a new tool which takes a bibliography and prints which labs, journals, & institutions are overrepresented in your citations
emilyasterjones committed Aug 22, 2020
1 parent d0f7fb0 commit 22e2078
Showing 3 changed files with 233 additions and 7 deletions.
26 changes: 19 additions & 7 deletions README.md
@@ -1,22 +1,34 @@
# BioRxiv Speaker Finder
This iPython notebook extracts first and last authors who have published bioRxiv preprints relevant to an inputted subject area. You can use it to find researchers outside of your network to invite them as a speaker or cite their work.

**Launch:** [![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/emilyasterjones/bioRxiv_speaker_finder/master?filepath=bioRxiv_speaker_finder.ipynb)
This iPython notebook extracts first and last authors who have published bioRxiv preprints or Pubmed manuscripts relevant to an inputted subject area. You can use it to find researchers outside of your network to invite them as a speaker or cite their work.

**Inputs:** keyword (in title or abstract) & whether you are looking for trainees (first authors) or PIs (last authors)

**Outputs:** Printed data frame & CSV with a list of authors who have published preprints containing all keywords, with the following attributes:
**Outputs:** Printed data frame & CSV with a list of authors who have published manuscripts containing all keywords, with the following attributes:
* institution
* email
* ORCID link
* \# preprints with provided keywords
* \# manuscripts with provided keywords
* source (bioRxiv or Pubmed)
* \# total preprints
* \# total downloads.
* \# total downloads
* \# total works (from ORCID).

Lists are sorted in order of # of keyword preprints so the most relevant authors will be at the top.
Lists are sorted in order of # of keyword manuscripts so the most relevant authors will be at the top.

**Optional:** code cells at the end use APIs to predict the gender & ethnicity of the authors. Predicted female & minority authors are printed and the predictions are appended as columns to the CSV.
If you use these cells, please read the important caveats at the top of the notebook and treat these predictions with caution.

If you use this tool, please cite the original paper which created the Rxivist API:
*Abdill RJ, Blekhman R. "Tracking the popularity and outcomes of all bioRxiv preprints." eLife (2019). doi: 10.7554/eLife.45133.*

# Citation Overrepresentation Tool
This second iPython notebook identifies overrepresented authors, journals, & institutions from a citation list.

**Inputs:** .bib file extracted from a paper

**Outputs:** prints sorted lists of the most-cited last authors, journals, & institutions (as extracted from ORCID)

**Launch:** [![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/emilyasterjones/bioRxiv_speaker_finder/master)
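
As a rough illustration of the frequency tables this notebook prints, the sketch below counts last authors and journals from a small hand-made list of citations. The sample records and the `top_counts` helper are hypothetical and stand in for entries parsed from a real .bib file; the notebook itself does this step with pandas.

```python
from collections import Counter

# Hypothetical (last author, journal) pairs standing in for parsed .bib entries.
citations = [
    ("A. Lovelace", "Journal A"),
    ("G. Hopper", "Journal B"),
    ("A. Lovelace", "Journal A"),
    ("A. Lovelace", "Journal C"),
    ("G. Hopper", "Journal A"),
]


def top_counts(values, n=10):
    """Return the n most frequent values as (value, count) pairs, most frequent first."""
    return Counter(values).most_common(n)


top_authors = top_counts(author for author, _ in citations)
top_journals = top_counts(journal for _, journal in citations)

print("Most-cited last authors:", top_authors)
print("Most-cited journals:", top_journals)
```

A name or journal that dominates the top of either list is a candidate for overrepresentation in the bibliography.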



1 change: 1 addition & 0 deletions bioRxiv_speaker_finder.ipynb
@@ -324,6 +324,7 @@
"outputs": [],
"source": [
"# get list of missing ORCIDs\n",
"#NB this section calls the API once for each author, so it takes a while\n",
"for index, row in auth_df.iterrows():\n",
"    if len(row['ORCID'])==0:\n",
"        given = re.sub(' ','%20',row['First Name'])\n",
213 changes: 213 additions & 0 deletions citation_overrepresentation_tool.ipynb
@@ -0,0 +1,213 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Citation Overrepresentation Tool\n",
"This tool extracts last authors & journals of papers you have cited and creates frequency tables so you can see who you cite the most often. It then attempts to map authors to institutions via the ORCID database. You can use it to find which journals, labs, and universities receive most of your attention.\n",
"\n",
"1. Extract your citations as a .bib file. Either export all references from a single folder in your citation manager or, if you use Mendeley or Zotero, extract them straight from a Word doc using [this tool](https://rintze.zelle.me/ref-extractor/).\n",
"2. Upload your .bib file to the binder.\n",
"3. Run each cell (Shift+Enter or Play button).\n",
"4. Optional: get an ORCID API key (instructions below) to extract institutions."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#imports\n",
"!pip install pybtex\n",
"\n",
"from pybtex.database.input import bibtex\n",
"import glob\n",
"import pandas as pd\n",
"import requests\n",
"import re"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#read .bib file\n",
"ID = glob.glob('*bib')\n",
"parser = bibtex.Parser()\n",
"try:\n",
"    bib_data = parser.parse_file(ID[0])\n",
"except Exception as err:\n",
"    raise ValueError(\"Your .bib file has non-UTF8 characters in it (like smart quotes). Please remove them & try again.\") from err"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#extract last author & journal names from each citation\n",
"authors = list()\n",
"for key in bib_data.entries:\n",
"    author = bib_data.entries[key].persons['author']\n",
"    # join the name parts so multi-part names survive intact\n",
"    first_name = ' '.join(author[-1].first_names)\n",
"    last_name = ' '.join(author[-1].last_names)\n",
"\n",
"    # entries without a journal field are treated as books\n",
"    try:\n",
"        journal = bib_data.entries[key].fields['journal']\n",
"    except KeyError:\n",
"        journal = 'Book'\n",
"    authors.append([first_name, last_name, journal])"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#build data frame & print\n",
"auth_df = pd.DataFrame(authors, columns=['First Name','Last Name', 'Journal'])\n",
"print('Overcited Authors')\n",
"print(auth_df.groupby(['First Name','Last Name']).size().sort_values(ascending=False).head(10))\n",
"print('\\nOvercited Journals')\n",
"print(auth_df.groupby(['Journal']).size().sort_values(ascending=False).head(10))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Optional: Get institutions from ORCID\n",
"Get an [ORCID API client ID & secret](https://support.orcid.org/hc/en-us/articles/360006897174)\n",
"\n",
"You can learn more about how to [search for an ORCID](https://members.orcid.org/api/tutorial/search-orcid-registry) and [find info about an author given their ORCID](https://members.orcid.org/api/tutorial/read-orcid-records) in the API documentation."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#input your client ID & key\n",
"ORCIDAPI_ID = 'YOUR ACCOUNT ID HERE'\n",
"ORCIDAPI_key = 'YOUR ACCOUNT KEY HERE'"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# build a request for a token\n",
"payload = {'client_id': ORCIDAPI_ID,\n",
"           'client_secret': ORCIDAPI_key,\n",
"           'scope': '/read-public',\n",
"           'grant_type': 'client_credentials'\n",
"          }\n",
"url = 'https://orcid.org/oauth/token'\n",
"headers = {'Accept': 'application/json'}\n",
"response = requests.post(url, data=payload, headers=headers, timeout=None)\n",
"response.raise_for_status()\n",
"token = response.json()['access_token']\n",
"\n",
"# set up headers for searches; the token goes in a standard Bearer header\n",
"headers = {'Accept': 'application/vnd.orcid+json',\n",
"           'Authorization': 'Bearer ' + token}"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#find ORCIDs\n",
"#NB this section calls the API once for each author, so it takes a while\n",
"orcid_list = list()\n",
"for index, row in auth_df.iterrows():\n",
"    given = re.sub(' ','%20',row['First Name'])\n",
"    family = re.sub(' ','%20',row['Last Name'])\n",
"\n",
"    #build search\n",
"    url = \"https://pub.orcid.org/v3.0/search/?q=\" \\\n",
"        + \"family-name:\" + family + \"+AND+given-names:\" + given \\\n",
"        + \"&rows=1\"\n",
"    auth_id = requests.get(url, headers=headers, timeout=None)\n",
"\n",
"    #get first returned ORCID\n",
"    if auth_id.json()['result'] is not None:\n",
"        orcid_list.append(auth_id.json()['result'][0]['orcid-identifier']['path'])\n",
"    else:\n",
"        orcid_list.append('')\n",
"\n",
"auth_df['ORCID'] = orcid_list"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# get institution from ORCID\n",
"#NB this section calls the API once for each author, so it takes a while\n",
"inst_list = list()\n",
"for index, row in auth_df.iterrows():\n",
"    if len(row['ORCID'])>0:\n",
"        url = \"https://pub.orcid.org/v2.1/\" + row['ORCID'] + \"/record\"\n",
"        orcid_request = requests.get(url, headers=headers, timeout=None)\n",
"        affil = orcid_request.json()['activities-summary']['employments']['employment-summary']\n",
"        if len(affil)>0:\n",
"#             print(json.dumps(var, indent=2, separators=(',', ':')))\n",
"            inst_list.append(affil[0]['organization']['name'])\n",
"        else:\n",
"            inst_list.append('Undetermined')\n",
"    else:\n",
"        inst_list.append('Undetermined')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#add to df & print\n",
"auth_df['Institution'] = inst_list\n",
"print('\\nOvercited Institutions')\n",
"print(auth_df.groupby(['Institution']).size().sort_values(ascending=False).head(10))"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.3"
}
},
"nbformat": 4,
"nbformat_minor": 4
}
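
The ORCID lookup cells in the notebook build one search URL per author and send the access token with each request. A minimal sketch of those two steps follows, assuming the same `pub.orcid.org/v3.0/search/` endpoint the notebook uses. `build_search_url` and `build_headers` are hypothetical helper names, and `urllib.parse.quote` is swapped in for the notebook's manual `%20` substitution so that other reserved characters in names are escaped as well. No request is sent here.

```python
from urllib.parse import quote


def build_search_url(given, family):
    """Build an ORCID search URL for one author (endpoint taken from the notebook)."""
    query = "family-name:" + quote(family) + "+AND+given-names:" + quote(given)
    return "https://pub.orcid.org/v3.0/search/?q=" + query + "&rows=1"


def build_headers(token):
    """Put the OAuth token in a standard 'Authorization: Bearer' header."""
    return {
        "Accept": "application/vnd.orcid+json",
        "Authorization": "Bearer " + token,
    }


url = build_search_url("Mary Ann", "Smith")
headers = build_headers("example-token")  # placeholder, not a real token
print(url)
print(headers["Authorization"])
```

Keeping URL and header construction in small functions makes it possible to check them without calling the API once per author.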
