
This repository contains scripts for running and plotting results with the TT method, a method for estimating population divergence times with a sample of 2 haploid genomes, or a single diploid genome, from each of two populations.


anubhabkhan/TT-method

 
 


README : The TT method

This repository contains guidelines and scripts for implementing the TT method for estimating divergence times between populations. This quick and transparent method requires two haploid genomes (or a single diploid genome) from each of two populations. Please see the published paper for details.

The directory 'Pipeline' contains all the scripts necessary to run both the TT method (estimating divergence times without using an outgroup) and the TTO method (estimating population divergence using an outgroup). Users should download all scripts and empty directories to a suitable location.

Both the TT and TTO methods require the ancestral state at every position in the genome. An example of such files, created for the human genome from the consensus among three species of apes, can be found at the Zenodo DOI below. The TT and TTO methods consider a position informative only when the ancestral state has consensus support among all three of Gorilla, Chimpanzee and Orangutan. To estimate divergence times for human populations, the files in 'Ancestral_states.zip' can be downloaded and used without decompressing. To estimate divergence times for other species, users will need to create their own ancestral state files in a similar format (one '.txt' file per chromosome, one line per site, for all sites in the genome). DOI
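The consensus rule described above can be sketched in a few lines of Python. This is only an illustration of the rule, not code from this repository: the function name and the idea of passing one allele per outgroup species are assumptions, and it does not read or write the ancestral state file format.

```python
VALID_BASES = {"A", "C", "G", "T"}


def consensus_ancestral_state(gorilla, chimpanzee, orangutan):
    """Return the ancestral base if all three outgroups agree, else None.

    Hypothetical helper: each argument is assumed to be the single base
    observed in that species at one aligned site.
    """
    alleles = {gorilla.upper(), chimpanzee.upper(), orangutan.upper()}
    if len(alleles) != 1:
        # The outgroups disagree, so the site is not informative.
        return None
    allele = alleles.pop()
    # Missing or ambiguous calls (for example "N") are also uninformative.
    return allele if allele in VALID_BASES else None
```

A site where the three species carry "A", "A" and "G" would return None here and so would be skipped.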


Brief description of included scripts:

'get_file_name.py' - this links keywords to full vcf file names and paths for ease of implementation. Users should edit it to include the vcf file paths and a relevant keyword for each individual's vcfs.

'count_sample_confs_per_ind_TT.py' and 'count_sample_confs_per_ind_TTO.py'.
These scripts take all-sites vcfs as input and return counts of sample configurations in 5MB blocks of the genome. Users should edit each script to include file paths for the ancestral state and vcf files. Resulting counts are written to 'DIR_counts_per_5cm_TT/' and 'DIR_counts_per_5cm_TTO/' respectively.
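To give a feel for what counting sample configurations per block involves, here is a minimal Python sketch. It is not the code in the scripts above and rests on several assumptions: a configuration is taken to be the pair of derived allele counts in the two diploid individuals, blocks are assumed to be 5,000,000 base pairs wide, and the input is a hypothetical iterable of already parsed sites rather than a vcf.

```python
from collections import Counter, defaultdict

BLOCK_SIZE = 5_000_000  # assumption: block width in base pairs


def count_configurations(sites, block_size=BLOCK_SIZE):
    """Count (derived_1, derived_2) configurations per genomic block.

    `sites` is a hypothetical iterable of tuples
    (position, ancestral, genotype_1, genotype_2), where each genotype is a
    pair of alleles and `ancestral` is None when no consensus state exists.
    """
    counts = defaultdict(Counter)
    for position, ancestral, genotype_1, genotype_2 in sites:
        if ancestral is None:
            continue  # site is not informative without an ancestral state
        derived_1 = sum(allele != ancestral for allele in genotype_1)
        derived_2 = sum(allele != ancestral for allele in genotype_2)
        block = (position - 1) // block_size  # 1-based positions
        counts[block][(derived_1, derived_2)] += 1
    return counts
```

Keeping one Counter per block means the per-block counts can later be summed, or left out one block at a time, without re-reading the vcfs.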

'get_counts_TT.sh' and 'get_counts_TTO.sh'.
These are example SLURM submission scripts that can be used to implement the above scripts for the 22 autosomes of the human genome, and for as many pairwise individual comparisons as desired. Users should edit to include relevant vcf keywords and SLURM commands.

'get_estimates_TT.py'.
This script uses the sample configuration counts previously obtained to estimate parameters including divergence times. Results are written to 'DIR_estimates_TT/'.

'get_estimates_TTO.py'.
This script uses the sample configuration counts previously obtained to estimate parameters including divergence times. Results are written to 'DIR_estimates_TTO/'.

'plot_TT.R'.
This R script will create plots of divergence time estimates present in 'DIR_estimates_TT/', and output plots to 'DIR_plots/'.

'plot_TTO.R'.
This R script will create plots of divergence time estimates present in 'DIR_estimates_TTO/', and output plots to 'DIR_plots/'.

'wbj.py'.
This script contains functions used by 'get_estimates_TT.py' and 'get_estimates_TTO.py' to perform weighted block jackknife estimation of parameters.
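For readers unfamiliar with the weighted block jackknife, the sketch below shows the general idea in Python. It is not the code in 'wbj.py': the function name, its arguments and the example estimator are hypothetical. It uses the standard delete-one-block formulas, with each block weighted by its size.

```python
import math


def weighted_block_jackknife(block_counts, block_sizes, estimator):
    """Return (jackknife estimate, standard error).

    block_counts: one list of counts per block (all of equal length).
    block_sizes:  one weight per block, for example the number of sites.
    estimator:    hypothetical function mapping summed counts to a float.
    """
    g = len(block_counts)
    if g < 2:
        raise ValueError("at least two blocks are needed")
    n = float(sum(block_sizes))
    totals = [sum(column) for column in zip(*block_counts)]
    theta = estimator(totals)

    # Re-estimate with each block left out in turn.
    leave_one_out = [
        estimator([t - c for t, c in zip(totals, counts)])
        for counts in block_counts
    ]

    # Size-weighted jackknife estimate.
    theta_jack = g * theta - sum(
        (1.0 - m / n) * t for m, t in zip(block_sizes, leave_one_out)
    )

    # Variance from the pseudo-values, weighted by h = n / m.
    variance = 0.0
    for m, t in zip(block_sizes, leave_one_out):
        h = n / m
        pseudo_value = h * theta - (h - 1.0) * t
        variance += (pseudo_value - theta_jack) ** 2 / (h - 1.0)
    variance /= g
    return theta_jack, math.sqrt(variance)
```

With equal block sizes and an estimator that is a simple mean, this reduces to the mean and its usual standard error, which is a convenient sanity check.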


Implementation:

Users should create the following directories in the same location as the scripts:
DIR_counts_per_5cm_TT
DIR_counts_per_5cm_TTO_$OUTGROUP
DIR_error_TT
DIR_error_TTO
DIR_estimates_TT
DIR_estimates_TTO/$OUTGROUP_res/
DIR_plots/TTO_$OUTGROUP
(where $OUTGROUP is the keyword of an outgroup individual from 'get_file_name.py')
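The directories listed above can also be created programmatically. The following Python sketch is one way to do so; the function name is hypothetical and the outgroup keyword used in the usage comment is only a placeholder.

```python
import os


def make_output_dirs(base_dir, outgroup):
    """Create the output directories under base_dir and return their paths.

    `outgroup` is the outgroup keyword used for the TTO method.
    """
    relative_paths = [
        "DIR_counts_per_5cm_TT",
        "DIR_counts_per_5cm_TTO_" + outgroup,
        "DIR_error_TT",
        "DIR_error_TTO",
        "DIR_estimates_TT",
        os.path.join("DIR_estimates_TTO", outgroup + "_res"),
        os.path.join("DIR_plots", "TTO_" + outgroup),
    ]
    created = []
    for relative_path in relative_paths:
        path = os.path.join(base_dir, relative_path)
        os.makedirs(path, exist_ok=True)  # safe to call more than once
        created.append(path)
    return created


# Example (placeholder keyword): make_output_dirs(".", "Chimp")
```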

Once the necessary scripts have been edited to include vcf locations, the TT and TTO methods can be run with:
bash get_counts_TT.sh
python get_estimates_TT.py
Rscript TT_plot.R

bash get_counts_TTO.sh
python get_estimates_TTO.py $OUTGROUP
Rscript TTO_plot.R $OUTGROUP
(where $OUTGROUP is the keyword of an outgroup individual from 'get_file_name.py')


vcf file names:

Both the TT and TTO methods are set up to take compressed all-sites vcfs as input files, with one vcf per chromosome (22 vcfs per individual). To be usable, vcf file names should follow the naming convention "...chr1.vcf.gz" for chromosome 1, "...chr2.vcf.gz" for chromosome 2, and so on. For example, in the 'get_file_name.py' script provided, the keyword 'Neanderthal' points to a general vcf file name of 'AltaiNea.hg19_1000g.dq.bqual.RG.realn-snpAD_chr.vcf.gz'. The actual vcf file names should be 'AltaiNea.hg19_1000g.dq.bqual.RG.realn-snpAD_chr1.vcf.gz' for chromosome 1, and so on.
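The naming convention above can be sketched as follows. This is an illustration only and not the contents of the provided script: the dictionary and function names are hypothetical, and the sketch assumes the general file name ends in "chr.vcf.gz" so that the chromosome number can be inserted directly after "chr".

```python
VCF_SUFFIX = ".vcf.gz"

# Hypothetical keyword-to-file mapping, using the example from this README.
VCF_NAMES = {
    "Neanderthal": "AltaiNea.hg19_1000g.dq.bqual.RG.realn-snpAD_chr.vcf.gz",
}


def chromosome_vcf_name(general_name, chromosome):
    """Insert the chromosome number between 'chr' and '.vcf.gz'."""
    if not general_name.endswith("chr" + VCF_SUFFIX):
        raise ValueError("expected a name ending in 'chr.vcf.gz'")
    stem = general_name[: -len(VCF_SUFFIX)]
    return stem + str(chromosome) + VCF_SUFFIX


def autosome_vcf_names(keyword):
    """Return the 22 per-chromosome vcf names for one individual."""
    general_name = VCF_NAMES[keyword]
    return [chromosome_vcf_name(general_name, c) for c in range(1, 23)]
```

Checking that all 22 expected files exist before submitting jobs can save a failed run on the cluster.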


For reference:
Estimating divergence times from DNA sequences.
Per Sjödin, James McKenna, Mattias Jakobsson.
https://doi.org/10.1093/genetics/iyab008
