Home
Titipat Achakulvisut edited this page Jan 16, 2020 · 18 revisions
We include PySpark snippets showing how to parse the Pubmed Open-Access and MEDLINE datasets on the following wiki pages:
- Setup Spark 2.1
- Download and preprocess MEDLINE dataset
- Download and preprocess Pubmed Open-Access Subset
Here are links for downloading the Pubmed OA and MEDLINE data:
- The Pubmed Open-Access (OA) dataset is available at http://www.ncbi.nlm.nih.gov/pmc/tools/ftp/. Here is the FTP link for downloading the dataset in bulk.
- The MEDLINE XML files are available at ftp://ftp.nlm.nih.gov/nlmdata/.medleasebaseline/gz/
- Weekly updates of the MEDLINE XML files are available at ftp://ftp.nlm.nih.gov/nlmdata/.medlease/gz/
- The MEDLINE Document Type Definition (DTD) files are available at this link. They can be used to see which tags are available in the MEDLINE XML.
- Please see the copyright notice here before scraping data from the website.
- MEDLINE Kung-Fu, which uses medic to parse MEDLINE into a database
- MEDLINEXMLToJSON, implemented in JavaScript
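
The list above mentions using the DTD to see which tags are available in the MEDLINE XML. As a complement, the tags that actually occur in a downloaded file can be counted directly. The following is a minimal sketch using only the Python standard library, not the parser provided by this project. The `SAMPLE_XML` content and the `count_tags` helper are hypothetical and exist only for illustration; a real baseline file contains many more tags.

```python
import gzip
import io
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical stand-in for a MEDLINE file: a tiny hand-written record,
# NOT real data. Real baseline files are much larger and richer.
SAMPLE_XML = b"""<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <PMID>0</PMID>
      <Article>
        <ArticleTitle>Placeholder title</ArticleTitle>
      </Article>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>"""


def count_tags(xml_gz_file):
    """Count how often each tag occurs in a gzip-compressed XML file.

    `xml_gz_file` may be a path or a binary file-like object.
    """
    counts = Counter()
    with gzip.open(xml_gz_file, "rb") as handle:
        # iterparse reads the file incrementally instead of loading it whole.
        for _event, element in ET.iterparse(handle, events=("end",)):
            counts[element.tag] += 1
            # Drop the element's text and children once it has been counted.
            element.clear()
    return counts


# Compress the sample in memory so the sketch runs without any download.
compressed = io.BytesIO(gzip.compress(SAMPLE_XML))
tag_counts = count_tags(compressed)
for tag, n in sorted(tag_counts.items()):
    print(tag, n)
```

To try this on a downloaded file, pass its path to `count_tags` in place of the in-memory sample.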