forked from lh3/psmc
-
Notifications
You must be signed in to change notification settings - Fork 0
muhammadsohailraza/psmc
Folders and files
Name | Name | Last commit message | Last commit date | |
---|---|---|---|---|
Repository files navigation
This software package infers population size history from a diploid sequence using the Pairwise Sequentially Markovian Coalescent (PSMC) model. The detailed model is described in file `psmc.tex'. To compile the binaries, you may run make; (cd utils; make) After that, you may try utils/fq2psmcfa -q20 diploid.fq.gz > diploid.psmcfa psmc -N25 -t15 -r5 -p "4+25*2+4+6" -o diploid.psmc diploid.psmcfa utils/psmc2history.pl diploid.psmc | utils/history2ms.pl > ms-cmd.sh utils/psmc_plot.pl diploid diploid.psmc where `diploid.fq.gz' is typically the whole-genome diploid consensus sequence of one human individual, which can be generated by, for example: samtools mpileup -C50 -uf ref.fa aln.bam | bcftools view -c - \ | vcfutils.pl vcf2fq -d 10 -D 100 | gzip > diploid.fq.gz Here option -d sets and minimum read depth and -D sets the maximum. It is recommended to set -d to a third of the average depth and -D to twice. Program `fq2psmcfa' transforms the consensus sequence into a fasta-like format where the i-th character in the output sequence indicates whether there is at least one heterozygote in the bin [100i, 100i+100). Program `psmc' infers the population size history. In particular, the `-p' option specifies that there are 64 atomic time intervals and 28 (=1+25+1+1) free interval parameters. The first parameter spans the first 4 atomic time intervals, each of the next 25 parameters spans 2 intervals, the 27th spans 4 intervals and the last parameter spans the last 6 time intervals. The `-p' and `-t' options are manually chosen such that after 20 rounds of iterations, at least ~10 recombinations are inferred to occur in the intervals each parameter spans. Impropriate settings may lead to overfitting. The command line in the example above has been shown to be suitable for modern humans. The `psmc' program infers the scaled mutation rate, the recombination rate and the free population size parameters. All these parameters are scaled to 2N0. You may run `psmc2history.pl' combined with `history2ms.pl' to generate the ms command line that simulates the history inferred by PSMC, or visualize the result with `psmc_plot.pl'. To perform bootstrapping, one has to run splitfa first to split long chromosome sequences to shorter segments. When the `-b' option is applied, psmc will then randomly sample with replacement from these segments. As an example, the following command lines perform 100 rounds of bootstrapping: utils/fq2psmcfa -q20 diploid.fq.gz > diploid.psmcfa utils/splitfa diploid.psmcfa > split.psmcfa psmc -N25 -t15 -r5 -p "4+25*2+4+6" -o diploid.psmc diploid.psmcfa seq 100 | xargs -i echo psmc -N25 -t15 -r5 -b -p "4+25*2+4+6" \ -o round-{}.psmc split.fa | sh cat diploid.psmc round-*.psmc > combined.psmc utils/psmc_plot.pl -pY50000 combined combined.psmc One probably wants to modify the "xargs" command-line to parallelize PSMC. If you have questions about PSMC, please ask at <http://hengli.uservoice.com/>. You do not need to register unless you also want to modify your own questions. You may also post comments at github (if you have a github account). I want to make the question and the answer public such that others can see them and I do not need to answer the same question multiple times. Thank you for using PSMC. APPENDIX I: Scaling the PSMC output =================================== The PSMC output is scaled to the 2N_0. There are two ways of rescaling the time and the popuation size more meaningfully. Firstly, suppose we know the per-site per-generation mutation rate \mu, we can compute N_0 as: N_0 = \theta_0 / (4\mu) / s where \theta_0 is given at the 2nd column of "TR" lines, and s is the bin size we use for generating the PSMC input. Knowing N_0, we can scale time to generations and relative population size to effective size by T_k = 2N_0 * t_k N_k = N_0 * \lambda_k where t_k and \lambda_k are given at the 3rd and 4th columns of "RS" lines, respectively. A problem with the above strategy is that we do not know a definite answer of \mu and in fact it various with regions and mutation types. An alternative way is to use per-site pairwise sequence divergence to represent time: d_k = 2\mu * T_k = t_k * \theta_0 / s and use scaled mutation rate to represent population size: \theta_k = 4N_k * \mu = \lambda_k * \theta_0 / s where, again, t_k and \lambda_k are given at the "RS" line, \theta_0 at the "TR" line and s is the bin size, which defaults to 100 in fq2psmcfa.
About
No description, website, or topics provided.
Resources
Stars
Watchers
Forks
Packages 0
No packages published