PMID to PMC API from Medline cannot convert all provided PMID #37
Comments
I uploaded the PMID-PMC pairs (91 MB, not bad), which can be downloaded as follows:
With this file, you can convert PMID to PMC on your own. From here, we can modify parse_citation_web:

    import requests
    from lxml import html

    # extract_citations and extract_pmc are helpers assumed to be defined
    # alongside this function.

    def parse_citation_web(pmc):
        """
        Parse citations from given PMC

        Parameters
        ----------
        pmc: str, PMC of the document e.g. 'PMC1217341'

        Returns
        -------
        dict_out: dict, contains following keys
            pmc: Pubmed Central ID
            n_citations: number of citations for given articles
            pmc_cited: list of PMCs that cite the given PMC
        """
        link = "http://www.ncbi.nlm.nih.gov/pmc/articles/%s/citedby/" % str(pmc)
        page = requests.get(link)
        tree = html.fromstring(page.content)
        n_citations = extract_citations(tree)
        n_pages = int(n_citations / 30) + 1  # assumes 30 results per page
        pmc_cited_all = list()  # all PMCs that cite the given PMC
        citations = tree.xpath('//div[@class="rprt"]/div[@class="title"]/a/@href')[1::]
        pmc_cited = list(map(extract_pmc, citations))
        pmc_cited_all.extend(pmc_cited)
        if n_pages >= 2:
            for i in range(2, n_pages + 1):
                link = "http://www.ncbi.nlm.nih.gov/pmc/articles/%s/citedby/?page=%s" % (pmc, str(i))
                page = requests.get(link)
                tree = html.fromstring(page.content)
                citations = tree.xpath('//div[@class="rprt"]/div[@class="title"]/a/@href')[1::]
                pmc_cited = list(map(extract_pmc, citations))
                pmc_cited_all.extend(pmc_cited)
        # compare by value: "is not" tests identity and would not reliably filter equal strings
        pmc_cited_all = [p for p in pmc_cited_all if p != pmc]
        dict_out = {'n_citations': n_citations,
                    'pmc': pmc,
                    'pmc_cited': pmc_cited_all}
        return dict_out
We also want to add a copyright notice to the scraping function so that users don't scrape too much and get blocked: https://www.ncbi.nlm.nih.gov/pmc/about/copyright/#copy-PMC
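Besides the notice, one way to keep users from hammering the site would be to space out requests. Here is a minimal sketch of that idea; `throttled` and the one-second interval are hypothetical names and values for illustration, not part of the library or anything NCBI prescribes.

```python
import time


def throttled(func, min_interval=1.0):
    """Wrap func so that successive calls start at least min_interval seconds apart.

    min_interval=1.0 is an assumed, illustrative default.
    """
    last_call = [None]  # mutable cell holding the start time of the previous call

    def wrapper(*args, **kwargs):
        if last_call[0] is not None:
            wait = min_interval - (time.monotonic() - last_call[0])
            if wait > 0:
                time.sleep(wait)
        last_call[0] = time.monotonic()
        return func(*args, **kwargs)

    return wrapper


# Hypothetical usage with the scraping function from this thread:
# polite_parse = throttled(parse_citation_web, min_interval=1.0)
# results = [polite_parse(pmc) for pmc in list_of_pmcs]
```

Wrapping the function rather than editing it keeps the scraping logic unchanged and lets users pick their own interval.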
What about ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/oa_file_list.csv ?
@nick-hahner, nice! It contains ~1.8M rows of PMID/PMC pairs from the Open Access Subset. I'm still thinking about how to update the list regularly without hurting the repository. I could upload the PMID-PMC pairs from MEDLINE somewhere, as I mentioned. Do you have any preference or suggestions on how to make them available in the repository?
Actually this file is probably better with 4,892,265 rows: First
How's that sound?
@nick-hahner Yes! I used the ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/PMC-ids.csv.gz file for my local conversion.
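For anyone who wants to do the same local conversion, a rough sketch might look like the following. It assumes the CSV has header columns named `PMID` and `PMCID` (check the header of your downloaded copy), and the function names are hypothetical, not part of this repository.

```python
import csv
import gzip
import io


def load_pmid_to_pmc(fileobj):
    """Build a {PMID: PMCID} dict from an open CSV file object.

    Assumes the header row contains 'PMID' and 'PMCID' columns.
    Rows missing either value are skipped.
    """
    mapping = {}
    for row in csv.DictReader(fileobj):
        pmid = (row.get("PMID") or "").strip()
        pmcid = (row.get("PMCID") or "").strip()
        if pmid and pmcid:
            mapping[pmid] = pmcid
    return mapping


def load_pmid_to_pmc_from_gz(path):
    """Same as load_pmid_to_pmc, reading from a gzip-compressed CSV on disk."""
    with gzip.open(path, "rt", encoding="utf-8", newline="") as f:
        return load_pmid_to_pmc(f)


# Small in-memory example with made-up IDs:
sample = io.StringIO("Journal Title,PMCID,PMID\nJ A,PMC1,111\nJ B,PMC2,\nJ C,PMC3,333\n")
pmid_to_pmc = load_pmid_to_pmc(sample)
# PMIDs absent from the file simply have no entry:
converted = [pmid_to_pmc.get(p) for p in ["111", "999"]]
```

Keeping the IDs as strings avoids issues with empty PMID cells, and `dict.get` makes it easy to see which input PMIDs have no PMC counterpart.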
The API here cannot convert every input PMID. I was trying to parse citations for a given set of PMIDs, but it returns results for only a subset of the PMIDs I provided. One possibility is to host the PMID/PMC pairs somewhere in the cloud and provide a similar API, or a source file that users can use to convert PMID to PMC.