msprime generates low (< theta) nucleotide diversity #2097

mshpak76 · 2022-07-28T15:57:14Z

mshpak76
Jul 28, 2022

I ran an msprime simulation using the demographic model given below and a mutation rate of 5.21e-7, and I'm consistently getting nucleotide diversity estimates that are about 10% too low. For Ne=2e4 (the size population A), theta = 0.04168, I get values closer to 0.038 with the following demographic model. The only explanation I can think of is that an ancestral Ne other than 2e4 is being assigned to the coalescent "prehistory", despite the fact that A, the ancestral population, has N = 2e4.

Is there some other subtle error that I may be missing that would account for lower than expected nucleotide diversity for population A in mts? Additionally, Fst between the two subpopulations is lower than what I would expect, presumably due to the same error as the lower pi:

demography = msprime.Demography()
demography.add_population(name = "A", initial_size = 2e+4, initially_active=True)
demography.add_population(name = "B", initial_size = 1e+4, initially_active=True)
demography.add_population_split(time=1e+3, derived=["B"], ancestral="A")

def run_sim(num_samples, ancestry_seed, mutation_seed):
ancestral_ts = msprime.sim_ancestry(samples={"A":samp_size, "B":samp_size}, demography = demography, sequence_length=5000, recombination_rate = 1.09e-6, gene_conversion_rate = 6.25e-6, gene_conversion_tract_length = 100, random_seed= ancestry_seed)    
mutated_ts = msprime.sim_mutations(ancestral_ts, rate=5.21e-7, keep=True, random_seed=mutation_seed, model=msprime.BinaryMutationModel())
return mutated_ts

mts = run_sim(num_replicates, *seeds[rep_index])

Answered by apragsdale

Jul 28, 2022

Hi @mshpak76, I believe this is an issue with the chosen mutation model, it seems. If instead of using model=msprime.BinaryMutationModel(), you use discrete_genome=False (i.e., simulate under an infinite sites assumption), you should get the diversity to match expectations equal to theta in population A.

View full answer

apragsdale · 2022-07-28T16:18:27Z

apragsdale
Jul 28, 2022

Hi @mshpak76, I believe this is an issue with the chosen mutation model, it seems. If instead of using model=msprime.BinaryMutationModel(), you use discrete_genome=False (i.e., simulate under an infinite sites assumption), you should get the diversity to match expectations equal to theta in population A.

8 replies

apragsdale Jul 28, 2022

Hmm. When I scale population sizes up and rates down, to reduce the chance of multiple mutations while maintaining theta and rho, I still get lower observed diversity than expected.

Edit: nvm, you expect to get the same number of hits at the same site in that case. So I agree it's due to recurrent mutations.

grahamgower Jul 28, 2022
Collaborator

I see gene conversion here. Could that be having an effect?

andrewkern Jul 28, 2022
Collaborator

i agree with @apragsdale. the binary mutation model will result in nucleotide diversity less than expected under an infinite sites model because of back mutations from 1 to 0. you can confirm that you are getting the expected number of mutations out by asking the mutated treeseq how many mutations it has directly, rather than look at genotype matrices, e.g.,

ts = msprime.sim_ancestry(10)
mts = msprime.sim_mutations(ts,1e-5, model=msprime.BinaryMutationModel())
print(mts.num_mutations)

andrewkern Jul 28, 2022
Collaborator

also see this documentation on "silent" mutations generated by the MatrixMutationModel class https://tskit.dev/msprime/docs/latest/mutations.html#silent-mutations

petrelharp Jul 29, 2022
Maintainer

I agree also. Note that since pi = proportion of sites that are mutated, the proportion of already-mutated sites that get two mutations is of order pi; and that's what @mshpak76 is reporting (since 0.038/0.041= 0.92, which is about 1 - 2 * 0.041). And, I can confirm (with the following script) that (a) the 'branch' diversity (which is the expectation under infinite-sites) is 0.041 + noise, and (b) setting discrete_genome=False gets diversity of 0.041 + noise.

import msprime

samp_size = 100
mut_rate = 5.21e-7

demography = msprime.Demography()
demography.add_population(name = "A", initial_size = 2e+4, initially_active=True)
demography.add_population(name = "B", initial_size = 1e+4, initially_active=True)
demography.add_population_split(time=1e+3, derived=["B"], ancestral="A")

ancestral_ts = msprime.sim_ancestry(samples={"A":samp_size, "B":samp_size}, demography = demography, sequence_length=5000, recombination_rate = 1.09e-6, gene_conversion_rate = 6.25e-6, gene_conversion_tract_length = 100)
mutated_ts = msprime.sim_mutations(ancestral_ts, rate=mut_rate, model=msprime.BinaryMutationModel()) #, discrete_genome=False)

pi = mutated_ts.diversity(mutated_ts.samples(population=0))
bpi = mutated_ts.diversity(mutated_ts.samples(population=0), mode='branch') * mut_rate
theta = 4 * 2e4 * mut_rate

print(f"pi = {pi}, theta = {theta}, branch pi = {bpi}, 1 - pi/theta = {1 - pi/theta}")

(We could even do the math for the expected value under the binary mutation model; it's easy in this case.)

mshpak76 · 2022-10-11T09:04:44Z

mshpak76
Oct 11, 2022
Author

Following your reasoning, isn't the observed nucleotide diversity approximately pi(1-pi), so that the ratio should be 1-pi rather than 1-2*pi?

…

On Fri, Jul 29, 2022 at 5:48 PM Peter Ralph ***@***.***> wrote: I agree also. Note that since pi = proportion of sites that are mutated, the proportion of already-mutated sites that get *two* mutations is of order pi; and that's what @mshpak76 <https://github.com/mshpak76> is reporting (since 0.038/0.041= 0.92, which is about 1 - 2 * 0.041). And, I can confirm (with the following script) that (a) the 'branch' diversity (which is the expectation under infinite-sites) is 0.041 + noise, and (b) setting discrete_genome=False gets diversity of 0.041 + noise. import msprime samp_size = 100 mut_rate = 5.21e-7 demography = msprime.Demography() demography.add_population(name = "A", initial_size = 2e+4, initially_active=True) demography.add_population(name = "B", initial_size = 1e+4, initially_active=True) demography.add_population_split(time=1e+3, derived=["B"], ancestral="A") ancestral_ts = msprime.sim_ancestry(samples={"A":samp_size, "B":samp_size}, demography = demography, sequence_length=5000, recombination_rate = 1.09e-6, gene_conversion_rate = 6.25e-6, gene_conversion_tract_length = 100) mutated_ts = msprime.sim_mutations(ancestral_ts, rate=mut_rate, model=msprime.BinaryMutationModel()) #, discrete_genome=False) pi = mutated_ts.diversity(mutated_ts.samples(population=0)) bpi = mutated_ts.diversity(mutated_ts.samples(population=0), mode='branch') * mut_rate theta = 4 * 2e4 * mut_rate print(f"pi = {pi}, theta = {theta}, branch pi = {bpi}, 1 - pi/theta = {1 - pi/theta}") (We could even do the math for the expected value under the binary mutation model; it's easy in this case.) — Reply to this email directly, view it on GitHub <#2097 (reply in thread)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AUZMZCP4YVPS64SUZH3IGJLVWRNUVANCNFSM545ZRKJQ> . You are receiving this because you were mentioned.Message ID: ***@***.***>

-- ======================= Max Shpak, Ph.D. Department of Genetics University of Wisconsin Madison, WI 53706

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

msprime generates low (< theta) nucleotide diversity #2097

{{title}}

Replies: 2 comments 8 replies

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

Select a reply

msprime generates low (< theta) nucleotide diversity #2097

mshpak76 Jul 28, 2022

Replies: 2 comments · 8 replies

apragsdale Jul 28, 2022

apragsdale Jul 28, 2022

grahamgower Jul 28, 2022 Collaborator

andrewkern Jul 28, 2022 Collaborator

andrewkern Jul 28, 2022 Collaborator

petrelharp Jul 29, 2022 Maintainer

mshpak76 Oct 11, 2022 Author

mshpak76
Jul 28, 2022

Replies: 2 comments 8 replies

apragsdale
Jul 28, 2022

grahamgower Jul 28, 2022
Collaborator

andrewkern Jul 28, 2022
Collaborator

andrewkern Jul 28, 2022
Collaborator

petrelharp Jul 29, 2022
Maintainer

mshpak76
Oct 11, 2022
Author