Empirical Homoresidue shows unexpected extreme dN/dS values. #1

cswarth · 2016-09-13T17:30:10Z

The Empirical Homoresidue model (scatter column furthest right) for high- and middle-fitness values exhibits some unexpectedly extreme dN/dS values. Check out the sampled sequences from those points to understand why these occur.

cswarth · 2016-09-13T17:36:45Z

These are some entries from the results showing extreme values of omega for the homoresidue model at 20,000 generations.

     19:build/empiricalvalues_homoresidue/fit_0.01025/rep_0/gen_20000/results.txt,homoresidue,0.01025,20000,0,29.64818
     22:build/empiricalvalues_homoresidue/fit_0.01025/rep_1/gen_20000/results.txt,homoresidue,0.01025,20000,1,7.09823
     25:build/empiricalvalues_homoresidue/fit_0.01025/rep_2/gen_20000/results.txt,homoresidue,0.01025,20000,2,24.39555
     28:build/empiricalvalues_homoresidue/fit_0.01025/rep_3/gen_20000/results.txt,homoresidue,0.01025,20000,3,56.50374
     31:build/empiricalvalues_homoresidue/fit_0.01025/rep_4/gen_20000/results.txt,homoresidue,0.01025,20000,4,57.65932
     34:build/empiricalvalues_homoresidue/fit_0.0105/rep_0/gen_20000/results.txt,homoresidue,0.0105,20000,0,0.4906
     37:build/empiricalvalues_homoresidue/fit_0.0105/rep_1/gen_20000/results.txt,homoresidue,0.0105,20000,1,39.71024
     40:build/empiricalvalues_homoresidue/fit_0.0105/rep_2/gen_20000/results.txt,homoresidue,0.0105,20000,2,8.75561
     43:build/empiricalvalues_homoresidue/fit_0.0105/rep_3/gen_20000/results.txt,homoresidue,0.0105,20000,3,10.69716
     46:build/empiricalvalues_homoresidue/fit_0.0105/rep_4/gen_20000/results.txt,homoresidue,0.0105,20000,4,28.23718

cswarth · 2016-09-13T18:40:38Z

Show below is a fragment from an alignment of deduplicated sequences sampled from fit=0.01025 repetition=3 generation=20000 that has an omega value of 56.5. As expected from the homoresidue nature of the starting population, almost all AA positions are Lysine (Lys/K). non-synonymous substitutions are infrequently scattered across the genome.

What's really strange is that all the nucleotide substitutions are A->C
What would cause that?

cswarth · 2016-09-13T22:43:39Z

The problem is SANTA is always generating A -> C substitutions and no other. I think this is a bug.

Background

Nucleotide mutators in SANTA can be configured with a rate bias matrix to influence the selection of substitution mutations based on the existing nucleotide, e.g.

        <mutator>
          <nucleotideMutator>
            <mutationRate>2.5E-5</mutationRate>
            <rateBias>
                  0.42 2.49 0.29
                  1.73 0.23 4.73
                  6.99 0.20 0.60
                  1.02 2.56 0.88
                </rateBias>
          </nucleotideMutator>
        </mutator>

However for the configuration used for this dN/dS project, the nucleotide mutator is configured without a <rateBias> element, e.g.

  <mutator>
    <nucleotideMutator>
      <mutationRate>$mutationrate</mutationRate>
    </nucleotideMutator>
  </mutator>

When this form of configuration is used without a rate bias matrix, the probability of a transversion, stored in NucleotideMutator.tv, is calculated to be 1.0, so transversions are in fact the only substitutions generated.

That is why we only see Adenine to Cytosine substitutions.

cswarth · 2016-10-06T23:25:06Z

After modifying the config file to generate a full spectrum of mutations (and modifying the default behavior in SANTA), the analysis of dN/dS rations in a simulated population still shows anomalies. The first is the extreme running time of codeml which in some cases runs for more than 11 hours, and the second is the extreme dN/dS ratios produced by some samples - sometimes exceeding 900!

Running time Anomalies

The most extreme runtime was created from a single sample, build/empiricalvalues_homoresidue/fit_0.0105/rep_2/gen_500 that took over 11 hours to complete.

Time used: 11:18:15
scons: done building targets.

real    726m29.257s
user    3m40.778s
sys     0m18.470s

This sample also produced an extreme dN/dS ration of 363.47116, although this is not the most extreme value produced across all samples.

This sample appears to have a normal number of unique sequences, about the same as other samples.

$ seqmagick info build/empiricalvalues_homoresidue/fit_0.0105/rep_2/gen_500/sample_dedup.fa 
name                                                                       alignment    min_len   max_len   avg_len  num_seqs
build/empiricalvalues_homoresidue/fit_0.0105/rep_2/gen_500/sample_dedup.fa TRUE             468       468    468.00        31

The codon usage within the sample also looks consistent with other samples.

Extreme dN/dS values

Be very suspicious of dn/dS values of 999.000

$ find build/empiricalvalues_homoresidue/fit_0.0105 -name results.txt | xargs grep -i omega
build/empiricalvalues_homoresidue/fit_0.0105/rep_0/gen_5000/results.txt:omega (dN/dS) = 999.00000
build/empiricalvalues_homoresidue/fit_0.0105/rep_0/gen_20000/results.txt:omega (dN/dS) = 999.00000
build/empiricalvalues_homoresidue/fit_0.0105/rep_0/gen_500/results.txt:omega (dN/dS) = 999.00000
build/empiricalvalues_homoresidue/fit_0.0105/rep_4/gen_5000/results.txt:omega (dN/dS) = 999.00000
build/empiricalvalues_homoresidue/fit_0.0105/rep_4/gen_500/results.txt:omega (dN/dS) = 379.30494
build/empiricalvalues_homoresidue/fit_0.0105/rep_4/gen_20000/results.txt:omega (dN/dS) = 999.00000
build/empiricalvalues_homoresidue/fit_0.0105/rep_1/gen_5000/results.txt:omega (dN/dS) = 999.00000
build/empiricalvalues_homoresidue/fit_0.0105/rep_1/gen_20000/results.txt:omega (dN/dS) = 999.00000
build/empiricalvalues_homoresidue/fit_0.0105/rep_1/gen_500/results.txt:omega (dN/dS) = 196.77209
build/empiricalvalues_homoresidue/fit_0.0105/rep_2/gen_5000/results.txt:omega (dN/dS) = 999.00000
build/empiricalvalues_homoresidue/fit_0.0105/rep_2/gen_500/results.txt:omega (dN/dS) = 363.47116
build/empiricalvalues_homoresidue/fit_0.0105/rep_2/gen_20000/results.txt:omega (dN/dS) = 999.00000
build/empiricalvalues_homoresidue/fit_0.0105/rep_3/gen_5000/results.txt:omega (dN/dS) = 999.00000
build/empiricalvalues_homoresidue/fit_0.0105/rep_3/gen_20000/results.txt:omega (dN/dS) = 999.00000
build/empiricalvalues_homoresidue/fit_0.0105/rep_3/gen_500/results.txt:omega (dN/dS) = 999.00000

Next Step

Rerun the simulation with an amino acid that has more than 2 possible codons.
Arginine, Serine, and Leucine each have six codons, so might as well use on of those.

cswarth · 2016-10-07T20:28:23Z

Reconfigured the simulation to start with all Arginine and to favor Arginine at a ~~single position~~ all positions.
Arginine is coded for by six codons, CGT, CGC, CGA, CGG, AGA, AGG. Four of those are only a single mutation away from one another.

To encourage retention of synonymous mutations, we start off the population with genomes containing only one of the 4-way synonymous codons for Arginine, CGT. The fitness function applied across all 156 sites on the genome favors any of the six codons that code for Arginine, while all the others are considered uniformly less favored.

      <sequences>
    >C2V3C3L_0_|0|0|0
    CGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGT
      </sequences>

    <fitnessFunction>
      <empiricalFitness>
    <feature>C2C3</feature>
    <sites>1-156</sites>
    <!-- assign fitness to each amino acid at this site -->
    <!-- Order of Amino acids is A-C-D-E-F-G-H-I-K-L-M-N-P-Q-R-S-T-V-W-Y -->
    <values>
      0.010 0.010 0.010 0.010 0.010
      0.010 0.010 0.010 0.010 0.010
      0.010 0.010 0.010 0.010 $fitness
      0.010 0.010 0.010 0.010 0.010
    </values>
      </empiricalFitness>
    </fitnessFunction>

Results of dN/dS calculation

Codeml did not exhibit the extreme long running times seen with the previous configuration, but a majority of dN/dS values are still extreme/failing.

|-------------+---------+-----------+-------------|
|  fit_0.0105 | rep_4   | gen_500   |  63.73299   |
|  fit_0.0105 | rep_2   | gen_500   |  60.72560   |
|  fit_0.0105 | rep_1   | gen_500   |  342.44668  |
|  fit_0.0105 | rep_0   | gen_500   |  225.30016  |
|  fit_0.0105 | rep_3   | gen_500   |  19.42574   |
|  fit_0.0105 | rep_0   | gen_5000  |  999.00000  |
|  fit_0.0105 | rep_1   | gen_5000  |  999.00000  |
|  fit_0.0105 | rep_2   | gen_5000  |  999.00000  |
|  fit_0.0105 | rep_3   | gen_5000  |  999.00000  |
|  fit_0.0105 | rep_4   | gen_5000  |  999.00000  |
|  fit_0.0105 | rep_1   | gen_20000 |  96.22066   |
|  fit_0.0105 | rep_0   | gen_20000 |  39.26704   |
|  fit_0.0105 | rep_4   | gen_20000 |   3.71141   |
|  fit_0.0105 | rep_2   | gen_20000 |  31.44166   |
|  fit_0.0105 | rep_3   | gen_20000 |  28.22028   |
|-------------+---------+-----------+-------------|

It seems highly unusual that all the samples collected at 5000 generations exhibit meaningles dN/dS ratios but none of the samples collected at other time points do.

Run more fitness values (aside from just fit_0.0105) and make some plots of these values for comparison to the previous configuration.

matsen · 2016-10-07T20:57:36Z

Does anything jump out just starting at the extreme-value alignments?

cswarth · 2016-10-11T17:57:26Z

Adding link to plot of codon usage vs. ML dN/dS ratio as calculated by codeml for homoresidue empirical fitness constraint .
https://cswarth.github.io/santa-dnds/codon_usage.html

Re-running simulation with a 100x higher mutation rate to see how that affects statistics.

Simulation is running - this will take several hours to complete.

(py27) cwarth@stoat ~/src/matsen/santa-dnds (master *) 
$ scons -j 15

Still running as of 1:45pm. Looks like it is working on some of the high mutation rate samples, which is good because this are the ones I want to see.

cswarth · 2016-10-11T23:37:32Z

Bumping the mutation rate had the effect of increasing the number of unique sequences in the samples taken at 5000, and 20000 generations. Now there are ~100 unique sequences, and codeml never finishes. It's just too many sequences for PAML/codeml to deal with.

I'm going to put a filter step in to cut down the number of unique sequences sent to codeml.

Re-running now Tue Oct 11 16:53:44 PDT 2016

fixed a bug in the non-selective santa sim; running to completion now

cswarth · 2016-10-13T21:21:02Z

This is a new interactive plot of dnds that has some graphical issues and I realized it is based on bogus simulations. ( didn't have the homoresidue sequence initialzed correctly. ) This is just for reference of what an interactive dnds plot would look like. Mostly this was done to get familiar with plotly.
https://cswarth.github.io/santa-dnds/dnds-bogus1.html

New simulations are running now...

cswarth · 2016-10-13T21:54:51Z

Updated plot of dnds under two different mutation rates can be found at,
https://cswarth.github.io/santa-dnds/multi-rates.html

cswarth added the in progress label Oct 11, 2016

cswarth self-assigned this Oct 11, 2016

cswarth added ready in progress and removed in progress ready labels Oct 11, 2016

cswarth added ready and removed in progress labels Oct 27, 2016

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Empirical Homoresidue shows unexpected extreme dN/dS values. #1

Empirical Homoresidue shows unexpected extreme dN/dS values. #1

cswarth commented Sep 13, 2016 •

edited

Loading

cswarth commented Sep 13, 2016

cswarth commented Sep 13, 2016

cswarth commented Sep 13, 2016

cswarth commented Oct 6, 2016 •

edited

Loading

cswarth commented Oct 7, 2016 •

edited

Loading

matsen commented Oct 7, 2016

cswarth commented Oct 11, 2016 •

edited

Loading

cswarth commented Oct 11, 2016 •

edited

Loading

cswarth commented Oct 13, 2016

cswarth commented Oct 13, 2016

Empirical Homoresidue shows unexpected extreme dN/dS values. #1

Empirical Homoresidue shows unexpected extreme dN/dS values. #1

Comments

cswarth commented Sep 13, 2016 • edited Loading

cswarth commented Sep 13, 2016

cswarth commented Sep 13, 2016

cswarth commented Sep 13, 2016

Background

cswarth commented Oct 6, 2016 • edited Loading

Running time Anomalies

Extreme dN/dS values

Next Step

cswarth commented Oct 7, 2016 • edited Loading

Results of dN/dS calculation

matsen commented Oct 7, 2016

cswarth commented Oct 11, 2016 • edited Loading

cswarth commented Oct 11, 2016 • edited Loading

cswarth commented Oct 13, 2016

cswarth commented Oct 13, 2016

cswarth commented Sep 13, 2016 •

edited

Loading

cswarth commented Oct 6, 2016 •

edited

Loading

cswarth commented Oct 7, 2016 •

edited

Loading

cswarth commented Oct 11, 2016 •

edited

Loading

cswarth commented Oct 11, 2016 •

edited

Loading