The exact pipeline used is available on GitHub under GITHUB_LINK. As some of the data sets are subject to restrictions, they cannot be made available.
Thus, this repo contains all the code used to merge the data sets, and below (and in the paper) we describe how each data set was obtained. We can provide meta-data, however, as it is either part of publications or publicly available.
- obtain all the genetic data files listed below, and copy them in the correct folders (see below)
- obtain genotype chip info files
- install Snakemake; the workflow was run using version 3.5.2 and does not work on some more recent versions
- make sure plink is in the path
- install numpy, pandas, scipy for python
- install dplyr, yaml, rworldmap, data.table for R
- run
snakemake pgs/gvar3.indiv_meta
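Before launching the workflow, it can help to confirm that the tools from the list above are reachable. The following is a minimal sketch, not part of the pipeline: the function name `missing_prerequisites` is hypothetical, the R packages are not checked, and the Snakemake version is not verified.

```python
import importlib.util
import shutil

# Python modules named in the setup list.
PYTHON_MODULES = ["numpy", "pandas", "scipy"]
# Command-line tools named in the setup list.
EXECUTABLES = ["plink", "snakemake"]


def missing_prerequisites(modules=PYTHON_MODULES, executables=EXECUTABLES):
    """Return the names of modules and executables that cannot be found."""
    missing = [m for m in modules if importlib.util.find_spec(m) is None]
    missing += [e for e in executables if shutil.which(e) is None]
    return missing


if __name__ == "__main__":
    missing = missing_prerequisites()
    if missing:
        print("missing:", ", ".join(missing))
    else:
        print("all prerequisites found")
```

An empty return value only means the names were found, not that the installed versions match the ones used for the paper.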
The full data set was obtained from David Reich with permission for demographic
analyses. Sampling location information was obtained from Table S9.4 of
Lazaridis et al. 2014. We used the population information in the vdata subset
of all ascertainment panels, except for the analysis where we assess
ascertainment bias. The utility convertf from admixtools (Patterson et al.
2012) was used to convert the data into plink format.
The data generated by the Estonian Biocentre (REFS) was kindly provided in plink format by Mait Metspalu on 10/30/15, along with location information where it was available. This data set contained 1282568 SNPs. Of those, 6770 SNPs had non-unique ids and were therefore removed.
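The removal of non-unique ids can be sketched as follows. This is an illustration rather than the pipeline's own code: it assumes that every copy of a repeated id is dropped (the text does not say whether one copy was kept), and both function names are hypothetical.

```python
from collections import Counter


def drop_non_unique_ids(snp_ids):
    """Return the ids that occur exactly once, in their original order."""
    counts = Counter(snp_ids)
    return [snp for snp in snp_ids if counts[snp] == 1]


def read_bim_ids(path):
    """Read the variant id (second column) of each line of a plink .bim file."""
    with open(path) as handle:
        return [line.split()[1] for line in handle if line.strip()]


# Toy example: rs2 appears twice, so both copies are dropped.
unique_ids = drop_non_unique_ids(["rs1", "rs2", "rs2", "rs3"])
```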
The data was downloaded on 6/24/15 from www.biotec.or.th/PASNP. Location metadata was obtained on the same day from the map on the same website, and individuals were matched to populations using the individual identifiers. All individuals with the same tag were assigned the median of all locations from that tag. The data was first lifted onto hg19 (with 5 out of 54794 SNPs being removed), and then reformatted into binary plink format.
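Assigning each tag the median of its locations might look like the sketch below. It assumes that the median is taken separately for latitude and longitude, which the text does not specify; the function name and the toy coordinates are hypothetical.

```python
from collections import defaultdict
from statistics import median


def median_location_by_tag(records):
    """Map each tag to the (median latitude, median longitude) of its records.

    `records` is an iterable of (tag, latitude, longitude) tuples.
    """
    lats, lons = defaultdict(list), defaultdict(list)
    for tag, lat, lon in records:
        lats[tag].append(lat)
        lons[tag].append(lon)
    return {tag: (median(lats[tag]), median(lons[tag])) for tag in lats}


# Toy example with made-up coordinates.
locations = median_location_by_tag([
    ("A", 10.0, 100.0),
    ("A", 12.0, 104.0),
    ("A", 20.0, 101.0),
    ("B", 5.0, 50.0),
])
```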
This data set was downloaded on 6/20/15 from http://jorde-lab.genetics.utah.edu/pub/affy6_xing2010/. Sampling locations were kindly provided by Jinchuan Xing. We used version 32 of the annotation file obtained from affymetrix.com to map SNPs onto hg19, to remove strand-ambiguous SNPs, and to flip SNPs that were on the minus strand.
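The strand handling described here can be illustrated with a short sketch. A SNP is strand-ambiguous when its two alleles are complementary (A/T or C/G), because reading it from the other strand gives the same pair of letters. The tuple layout and the function names below are assumptions for illustration; the real annotation files have their own column format.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def is_strand_ambiguous(allele1, allele2):
    """True for A/T and C/G SNPs, whose alleles are each other's complement."""
    return COMPLEMENT.get(allele1) == allele2


def flip_strand(allele1, allele2):
    """Return both alleles as read from the opposite strand."""
    return COMPLEMENT[allele1], COMPLEMENT[allele2]


def harmonize(snps):
    """Drop strand-ambiguous SNPs and flip those annotated on the minus strand.

    `snps` is an iterable of (snp_id, allele1, allele2, strand) tuples with
    strand given as "+" or "-".
    """
    kept = []
    for snp_id, a1, a2, strand in snps:
        if is_strand_ambiguous(a1, a2):
            continue
        if strand == "-":
            a1, a2 = flip_strand(a1, a2)
        kept.append((snp_id, a1, a2))
    return kept


# Toy example: rs1 (A/T) and rs4 (G/C) are ambiguous, rs2 is flipped.
harmonized = harmonize([
    ("rs1", "A", "T", "+"),
    ("rs2", "A", "G", "-"),
    ("rs3", "C", "T", "+"),
    ("rs4", "G", "C", "-"),
])
```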
POPRES data was obtained under dbGaP accession XXXX to John Novembre, and we used the data as processed in Novembre et al. (2008). We used version 32 of the annotation file obtained from www.affymetrix.com ("Mapping250K_Nsp.na32.annot.csv" and "Mapping250K_Sty.na32.annot.csv") to filter SNPs that did not map onto hg19, and we removed strand-ambiguous AT and GC polymorphisms. Following Novembre et al., we only retained individuals for which all grandparents were from the same country, and split up the Swiss sample according to language groups.
The data were obtained on 7/14/15 from Mark Stoneking with permission for demographic analyses. After merging the three different source files, SNPs not mapping to hg19 using the annotation file "GenomeWideSNP_6.na32.annot.csv" were removed, as were AT and GC SNPs. Sampling locations were extracted from Figure 1 of Reich et al. (2011).
Data was obtained on 8/13/15 in binary plink format from http://drineas.org/Maritime_Route/RAW_DATA/PLINK_FILES/MARITIME_ROUTE.zip. Sampling location information was obtained from Supplementary Table 3 in Paschou et al. (2013). SNPs not mapping to hg19 using the annotation file "GenomeWideSNP_6.na32.annot.csv" were removed, as were AT and GC SNPs.
This data was obtained from Choongwon Jeong and Anna Di Rienzo. We used the same filtering as in the Jeong et al. (2017) study, but only added the samples originating from these three studies with permissions from the respective authors.
All sources with the exception of the Estonian Biocentre data provided (approximate) sampling coordinates. However, the level of accuracy varied between sources, with some providing specific ethnicities, some (such as POPRES) only providing country information and others just providing city- or state-level information. For POPRES-derived data, and most countries, we followed Novembre et al. (2008) and assigned individuals to the country's center point, with the exception of Sweden and Finland, which were assigned to their capitals.
For the Estonian Biocentre data, sampling location data was highly heterogeneous. Samples that could not be confidently assigned to a region with an approx. 100 km radius were excluded. For populations with samples from multiple studies, the most accurate source location was used. For locations covered with different accuracy, only the most accurate samples were retained. (For example, we excluded all Spanish individuals from POPRES (only country-level data), as Human Origins provided samples from eleven different regions in Spain.)
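The rule of keeping only the most accurate samples per location can be sketched as below. This is a simplified illustration, not the procedure used for the paper: the numeric accuracy rank, the single `location` key and the sample ids are all hypothetical, and the actual assignment involved manual judgment.

```python
def keep_most_accurate(samples):
    """Keep, for each location, only samples at the finest accuracy level.

    `samples` is a list of (sample_id, location, accuracy_rank) tuples, where a
    smaller rank means a more precise sampling location.
    """
    best = {}
    for _, location, rank in samples:
        best[location] = min(rank, best.get(location, rank))
    return [s for s in samples if s[2] == best[s[1]]]


# Toy example: rank 1 = regional coordinates, rank 2 = country level only.
retained = keep_most_accurate([
    ("popres_1", "Spain", 2),
    ("ho_1", "Spain", 1),
    ("ho_2", "Spain", 1),
    ("popres_2", "Portugal", 2),
])
```

In this toy input the country-level Spanish sample is dropped because regional samples exist, while the Portuguese sample is kept because nothing more precise is available.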
All genetic data was merged using plink. We excluded all sites that were not
biallelic or where alleles were ambiguously labeled in different source files.
This resulted in a file with 1.9M SNPs in a total of 8698 individuals, but with
only 19.8% average genotype availability, with no SNP genotyped in all
individuals. To remove closely related individuals, we first created an
LD-pruned set of SNPs using the --indep-pairwise 1000 1000 .1 flag in plink.
Then, we calculated a relationship matrix using the --make-grm-bin flag, and
removed individuals with a relationship larger than 0.6, which reduced the
number of individuals to 8062.
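The relatedness filter can be illustrated on an already loaded relationship matrix. The text does not state which member of a related pair was dropped, so the greedy rule below (drop the individual that appears later) is an assumption, as are the function name and the toy matrix; reading the binary GRM written by plink is left out.

```python
def remove_related(ids, grm, cutoff=0.6):
    """Greedily drop individuals until no retained pair exceeds `cutoff`.

    `grm` is a symmetric matrix (list of rows) of pairwise relationship values
    in the same order as `ids`. For each offending pair, the individual that
    appears later in `ids` is dropped.
    """
    removed = set()
    for i in range(len(ids)):
        if i in removed:
            continue
        for j in range(i + 1, len(ids)):
            if j not in removed and grm[i][j] > cutoff:
                removed.add(j)
    return [ind for k, ind in enumerate(ids) if k not in removed]


# Toy example: a/b and c/d are closely related pairs.
toy_grm = [
    [1.0, 0.9, 0.0, 0.0],
    [0.9, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.7],
    [0.0, 0.0, 0.7, 1.0],
]
unrelated = remove_related(["a", "b", "c", "d"], toy_grm)
```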
These are the files that are required to start the pipeline:
- raw/paschou.zip #downloaded archive
- raw/MARITIME_ROUTE.bed #extracted
- raw/POPRES_Genotypes_QC1_v2.bed #popres data from John Novembre
- raw/reich2011/Australia.bed #stoneking/reich SEAsia data
- raw/reich2011/Denisova-SEAsia-Oceania.bed #stoneking/reich SEAsia data
- raw/reich2011/Stoneking.Data.tar #stoneking/reich SEAsia data
- raw/reich2011/STONEKING.malaysia.ped #stoneking/reich SEAsia data
- raw/verdu2014/allAutosomes_82-nativeAmericans_illuminaHuman610_unphased_passedQC_SNPs_dbGaP.ped #verdu data from dbgap, not used in paper
- raw/hugo/Genotypes_All.txt #downloaded HUGO genotypes
- raw/affy6_344_raw_genotype_xing #downloaded xing et al data
- raw/xing_sample_pop.txt #individual/pop data for xing et al
- raw/EuropeAllData/vdata.ind/snp/pop #reich format Lazaridis et al. data
- raw/Data_for_Ben.bed #estonian biocentre data from mait
- qatari/NWAfrica_HM3_Qat.bed #African data
- qatari/qatari.bed #qatari data
- qatari/hg37.bed #lifted african data
- tib/HGDP_Tibetan_Merged_160509.bed #obtained from Choongwon Jeong
- sources/POPRES_Phenotypes.txt : obtained from John Novembre through data from 2008 paper
These files are required to annotate SNPs correctly; they were obtained from the manufacturer's website and are also required for the automated processing
- chip/GenomeWideSNP_6.na32.annot.csv
- chip/Mapping250K_Nsp.na32.annot.csv
- chip/Mapping250K_Sty.na32.annot.csv
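Since the workflow depends on these inputs being in place, a quick existence check can save a failed run. The sketch below lists only a subset of the paths named above and uses a hypothetical helper `missing_files`; it is not part of the pipeline.

```python
from pathlib import Path

# A subset of the paths listed above; extend with the remaining entries.
REQUIRED_FILES = [
    "raw/MARITIME_ROUTE.bed",
    "raw/POPRES_Genotypes_QC1_v2.bed",
    "raw/Data_for_Ben.bed",
    "chip/GenomeWideSNP_6.na32.annot.csv",
    "chip/Mapping250K_Nsp.na32.annot.csv",
    "chip/Mapping250K_Sty.na32.annot.csv",
]


def missing_files(paths=REQUIRED_FILES, root="."):
    """Return the listed paths that do not exist below `root`."""
    return [p for p in paths if not (Path(root) / p).is_file()]
```

Calling `missing_files()` from the repository root returns the paths that still need to be obtained.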
Intermediate data files after basic cleaning, in plink format (bim and fam files are named similarly); they are generated automatically by the pipeline
- data/Data_for_Ben.bed #estonian biocentre data from Mait Metspalu
- data/hugo.bed #hugo data
- data/MARITIME_ROUTE.bed #Paschou et al data
- data/POPRES_Genotypes_QC1_v2.bed #popres data
- data/reich2011.bed
- data/vdata.bed #Lazaridis full data
- data/verdu.bed #verdu et al 2014 data (not used in paper)
- data/xing.bed #xing et al 2010 data
All temporary merging files and the merged genotype data
- merged/*bed
- merged/*bim
- merged/*fam
- supplementary/lifted.xbed
- supplementary/unlifted.xbed
List of duplicated labels across studies, used to merge and exclude samples
- duplicate_dict.txt
- sources/Data_for_Ben_Meta.xlsx: obtained from Mait Metspalu in November 2015 (email)
- sources/Stoneking.pops.txt : From Stoneking.Data.tar, obtained from Mark Stoneking
- sources/HGDP_SampleInformation.txt: obtained from wget -O HGDP_SampleInformation.txt http://web.stanford.edu/group/rosenberglab/data/rosenberg2006ahg/SampleInformation.txt
- sources/human_origins.csv : Table S9.4, Email from David Reich through John Novembre
- sources/POPRES_TS3.csv:table S3 from paper
- sources/PASNP_Map.htm : from the website http://www4a.biotec.or.th/PASNP/PASNP_Map
- sources/hugo_meta.csv : processed version
- sources/Pop_Positions_Xing_2010.csv: from Jinchuan Xing by email
- sources/botigue2013.pdf: paper for Botigue2013 data
- sources/1000g_loc.csv: from http://www.1000genomes.org/category/frequently-asked-questions/population
- sources/journal.pgen.*png: Table 1 of the Verdu et al. paper, as image
- sources/paschou_locations.csv: Table S3 from paper
regions/estonian_bibtex.csv
regions/estonian_studies.csv
regions/location2.csv
regions/location_coords2.csv
regions/location_coords.csv
regions/location_full.csv
regions/location_hugo.csv
regions/locations_deduplicated.csv
regions/location_simplified.csv
regions/Stoneking.pops.csv
tib/tib.plink
tib/tibetan.indiv_* #used here
tib/tibetan.pop_* #used here
tib/tib_tibetan.csv
tib/HGDP_Tibetan_Merged_160509_tibetan.indiv* #all tibetan (for Jeong et al 2017)
tib/HGDP_Tibetan_Merged_160509_tibetan.pop* #all tibetan (for Jeong et al 2017)
tib/HGDP_Tibetan_Merged_160509.indiv* #all data from Jeong et al 2017
tib/HGDP_Tibetan_Merged_160509.pop* #all data from Jeong et al 2017
qatari/codes.txt
qatari/flip.txt
qatari/keep_snp.txt
pgs/gvar3.names
pgs/update_pos.csv
pgs/merge.csv