msprime generates low (< theta) nucleotide diversity #2097
-
I ran an msprime simulation using the demographic model given below and a mutation rate of 5.21e-7, and I'm consistently getting nucleotide diversity estimates that are about 10% too low. For Ne=2e4 (the size population A), theta = 0.04168, I get values closer to 0.038 with the following demographic model. The only explanation I can think of is that an ancestral Ne other than 2e4 is being assigned to the coalescent "prehistory", despite the fact that A, the ancestral population, has N = 2e4. Is there some other subtle error that I may be missing that would account for lower than expected nucleotide diversity for population A in mts? Additionally, Fst between the two subpopulations is lower than what I would expect, presumably due to the same error as the lower pi:
|
Beta Was this translation helpful? Give feedback.
Replies: 2 comments 8 replies
-
Hi @mshpak76, I believe this is an issue with the chosen mutation model, it seems. If instead of using |
Beta Was this translation helpful? Give feedback.
-
Following your reasoning, isn't the observed nucleotide diversity
approximately pi(1-pi), so that the ratio should be 1-pi rather than 1-2*pi?
…On Fri, Jul 29, 2022 at 5:48 PM Peter Ralph ***@***.***> wrote:
I agree also. Note that since pi = proportion of sites that are mutated,
the proportion of already-mutated sites that get *two* mutations is of
order pi; and that's what @mshpak76 <https://github.com/mshpak76> is
reporting (since 0.038/0.041= 0.92, which is about 1 - 2 * 0.041). And, I
can confirm (with the following script) that (a) the 'branch' diversity
(which is the expectation under infinite-sites) is 0.041 + noise, and (b)
setting discrete_genome=False gets diversity of 0.041 + noise.
import msprime
samp_size = 100
mut_rate = 5.21e-7
demography = msprime.Demography()
demography.add_population(name = "A", initial_size = 2e+4, initially_active=True)
demography.add_population(name = "B", initial_size = 1e+4, initially_active=True)
demography.add_population_split(time=1e+3, derived=["B"], ancestral="A")
ancestral_ts = msprime.sim_ancestry(samples={"A":samp_size, "B":samp_size}, demography = demography, sequence_length=5000, recombination_rate = 1.09e-6, gene_conversion_rate = 6.25e-6, gene_conversion_tract_length = 100)
mutated_ts = msprime.sim_mutations(ancestral_ts, rate=mut_rate, model=msprime.BinaryMutationModel()) #, discrete_genome=False)
pi = mutated_ts.diversity(mutated_ts.samples(population=0))
bpi = mutated_ts.diversity(mutated_ts.samples(population=0), mode='branch') * mut_rate
theta = 4 * 2e4 * mut_rate
print(f"pi = {pi}, theta = {theta}, branch pi = {bpi}, 1 - pi/theta = {1 - pi/theta}")
(We could even do the math for the expected value under the binary
mutation model; it's easy in this case.)
—
Reply to this email directly, view it on GitHub
<#2097 (reply in thread)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AUZMZCP4YVPS64SUZH3IGJLVWRNUVANCNFSM545ZRKJQ>
.
You are receiving this because you were mentioned.Message ID:
***@***.***>
--
=======================
Max Shpak, Ph.D.
Department of Genetics
University of Wisconsin
Madison, WI 53706
|
Beta Was this translation helpful? Give feedback.
Hi @mshpak76, I believe this is an issue with the chosen mutation model, it seems. If instead of using
model=msprime.BinaryMutationModel()
, you usediscrete_genome=False
(i.e., simulate under an infinite sites assumption), you should get the diversity to match expectations equal to theta in population A.