Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Empirical Homoresidue shows unexpected extreme dN/dS values. #1

Open
cswarth opened this issue Sep 13, 2016 · 10 comments
Open

Empirical Homoresidue shows unexpected extreme dN/dS values. #1

cswarth opened this issue Sep 13, 2016 · 10 comments
Assignees
Labels

Comments

@cswarth
Copy link
Owner

cswarth commented Sep 13, 2016

The Empirical Homoresidue model (scatter column furthest right) for high- and middle-fitness values exhibits some unexpectedly extreme dN/dS values. Check out the sampled sequences from those points to understand why these occur.

image

@cswarth
Copy link
Owner Author

cswarth commented Sep 13, 2016

These are some entries from the results showing extreme values of omega for the homoresidue model at 20,000 generations.

     19:build/empiricalvalues_homoresidue/fit_0.01025/rep_0/gen_20000/results.txt,homoresidue,0.01025,20000,0,29.64818
     22:build/empiricalvalues_homoresidue/fit_0.01025/rep_1/gen_20000/results.txt,homoresidue,0.01025,20000,1,7.09823
     25:build/empiricalvalues_homoresidue/fit_0.01025/rep_2/gen_20000/results.txt,homoresidue,0.01025,20000,2,24.39555
     28:build/empiricalvalues_homoresidue/fit_0.01025/rep_3/gen_20000/results.txt,homoresidue,0.01025,20000,3,56.50374
     31:build/empiricalvalues_homoresidue/fit_0.01025/rep_4/gen_20000/results.txt,homoresidue,0.01025,20000,4,57.65932
     34:build/empiricalvalues_homoresidue/fit_0.0105/rep_0/gen_20000/results.txt,homoresidue,0.0105,20000,0,0.4906
     37:build/empiricalvalues_homoresidue/fit_0.0105/rep_1/gen_20000/results.txt,homoresidue,0.0105,20000,1,39.71024
     40:build/empiricalvalues_homoresidue/fit_0.0105/rep_2/gen_20000/results.txt,homoresidue,0.0105,20000,2,8.75561
     43:build/empiricalvalues_homoresidue/fit_0.0105/rep_3/gen_20000/results.txt,homoresidue,0.0105,20000,3,10.69716
     46:build/empiricalvalues_homoresidue/fit_0.0105/rep_4/gen_20000/results.txt,homoresidue,0.0105,20000,4,28.23718

@cswarth
Copy link
Owner Author

cswarth commented Sep 13, 2016

Show below is a fragment from an alignment of deduplicated sequences sampled from fit=0.01025 repetition=3 generation=20000 that has an omega value of 56.5. As expected from the homoresidue nature of the starting population, almost all AA positions are Lysine (Lys/K). non-synonymous substitutions are infrequently scattered across the genome.

screen shot 2016-09-13 at 11 20 27 am

What's really strange is that all the nucleotide substitutions are A->C
What would cause that?

screen shot 2016-09-13 at 11 38 29 am

@cswarth
Copy link
Owner Author

cswarth commented Sep 13, 2016

The problem is SANTA is always generating A -> C substitutions and no other. I think this is a bug.

Background

Nucleotide mutators in SANTA can be configured with a rate bias matrix to influence the selection of substitution mutations based on the existing nucleotide, e.g.

        <mutator>
          <nucleotideMutator>
            <mutationRate>2.5E-5</mutationRate>
            <rateBias>
                  0.42 2.49 0.29
                  1.73 0.23 4.73
                  6.99 0.20 0.60
                  1.02 2.56 0.88
                </rateBias>
          </nucleotideMutator>
        </mutator>

However for the configuration used for this dN/dS project, the nucleotide mutator is configured without a <rateBias> element, e.g.

  <mutator>
    <nucleotideMutator>
      <mutationRate>$mutationrate</mutationRate>
    </nucleotideMutator>
  </mutator>

When this form of configuration is used without a rate bias matrix, the probability of a transversion, stored in NucleotideMutator.tv, is calculated to be 1.0, so transversions are in fact the only substitutions generated.

That is why we only see Adenine to Cytosine substitutions.

@cswarth
Copy link
Owner Author

cswarth commented Oct 6, 2016

After modifying the config file to generate a full spectrum of mutations (and modifying the default behavior in SANTA), the analysis of dN/dS rations in a simulated population still shows anomalies. The first is the extreme running time of codeml which in some cases runs for more than 11 hours, and the second is the extreme dN/dS ratios produced by some samples - sometimes exceeding 900!

Running time Anomalies

The most extreme runtime was created from a single sample, build/empiricalvalues_homoresidue/fit_0.0105/rep_2/gen_500 that took over 11 hours to complete.

Time used: 11:18:15
scons: done building targets.

real    726m29.257s
user    3m40.778s
sys     0m18.470s

This sample also produced an extreme dN/dS ration of 363.47116, although this is not the most extreme value produced across all samples.

This sample appears to have a normal number of unique sequences, about the same as other samples.

$ seqmagick info build/empiricalvalues_homoresidue/fit_0.0105/rep_2/gen_500/sample_dedup.fa 
name                                                                       alignment    min_len   max_len   avg_len  num_seqs
build/empiricalvalues_homoresidue/fit_0.0105/rep_2/gen_500/sample_dedup.fa TRUE             468       468    468.00        31

The codon usage within the sample also looks consistent with other samples.

Extreme dN/dS values

Be very suspicious of dn/dS values of 999.000

$ find build/empiricalvalues_homoresidue/fit_0.0105 -name results.txt | xargs grep -i omega
build/empiricalvalues_homoresidue/fit_0.0105/rep_0/gen_5000/results.txt:omega (dN/dS) = 999.00000
build/empiricalvalues_homoresidue/fit_0.0105/rep_0/gen_20000/results.txt:omega (dN/dS) = 999.00000
build/empiricalvalues_homoresidue/fit_0.0105/rep_0/gen_500/results.txt:omega (dN/dS) = 999.00000
build/empiricalvalues_homoresidue/fit_0.0105/rep_4/gen_5000/results.txt:omega (dN/dS) = 999.00000
build/empiricalvalues_homoresidue/fit_0.0105/rep_4/gen_500/results.txt:omega (dN/dS) = 379.30494
build/empiricalvalues_homoresidue/fit_0.0105/rep_4/gen_20000/results.txt:omega (dN/dS) = 999.00000
build/empiricalvalues_homoresidue/fit_0.0105/rep_1/gen_5000/results.txt:omega (dN/dS) = 999.00000
build/empiricalvalues_homoresidue/fit_0.0105/rep_1/gen_20000/results.txt:omega (dN/dS) = 999.00000
build/empiricalvalues_homoresidue/fit_0.0105/rep_1/gen_500/results.txt:omega (dN/dS) = 196.77209
build/empiricalvalues_homoresidue/fit_0.0105/rep_2/gen_5000/results.txt:omega (dN/dS) = 999.00000
build/empiricalvalues_homoresidue/fit_0.0105/rep_2/gen_500/results.txt:omega (dN/dS) = 363.47116
build/empiricalvalues_homoresidue/fit_0.0105/rep_2/gen_20000/results.txt:omega (dN/dS) = 999.00000
build/empiricalvalues_homoresidue/fit_0.0105/rep_3/gen_5000/results.txt:omega (dN/dS) = 999.00000
build/empiricalvalues_homoresidue/fit_0.0105/rep_3/gen_20000/results.txt:omega (dN/dS) = 999.00000
build/empiricalvalues_homoresidue/fit_0.0105/rep_3/gen_500/results.txt:omega (dN/dS) = 999.00000

Next Step

Rerun the simulation with an amino acid that has more than 2 possible codons.
Arginine, Serine, and Leucine each have six codons, so might as well use on of those.

@cswarth
Copy link
Owner Author

cswarth commented Oct 7, 2016

Reconfigured the simulation to start with all Arginine and to favor Arginine at a single position all positions.
Arginine is coded for by six codons, CGT, CGC, CGA, CGG, AGA, AGG. Four of those are only a single mutation away from one another.

To encourage retention of synonymous mutations, we start off the population with genomes containing only one of the 4-way synonymous codons for Arginine, CGT. The fitness function applied across all 156 sites on the genome favors any of the six codons that code for Arginine, while all the others are considered uniformly less favored.

      <sequences>
    >C2V3C3L_0_|0|0|0
    CGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGTCGT
      </sequences>

    <fitnessFunction>
      <empiricalFitness>
    <feature>C2C3</feature>
    <sites>1-156</sites>
    <!-- assign fitness to each amino acid at this site -->
    <!-- Order of Amino acids is A-C-D-E-F-G-H-I-K-L-M-N-P-Q-R-S-T-V-W-Y -->
    <values>
      0.010 0.010 0.010 0.010 0.010
      0.010 0.010 0.010 0.010 0.010
      0.010 0.010 0.010 0.010 $fitness
      0.010 0.010 0.010 0.010 0.010
    </values>
      </empiricalFitness>
    </fitnessFunction>

Results of dN/dS calculation

Codeml did not exhibit the extreme long running times seen with the previous configuration, but a majority of dN/dS values are still extreme/failing.

|-------------+---------+-----------+-------------|
|  fit_0.0105 | rep_4   | gen_500   |  63.73299   |
|  fit_0.0105 | rep_2   | gen_500   |  60.72560   |
|  fit_0.0105 | rep_1   | gen_500   |  342.44668  |
|  fit_0.0105 | rep_0   | gen_500   |  225.30016  |
|  fit_0.0105 | rep_3   | gen_500   |  19.42574   |
|  fit_0.0105 | rep_0   | gen_5000  |  999.00000  |
|  fit_0.0105 | rep_1   | gen_5000  |  999.00000  |
|  fit_0.0105 | rep_2   | gen_5000  |  999.00000  |
|  fit_0.0105 | rep_3   | gen_5000  |  999.00000  |
|  fit_0.0105 | rep_4   | gen_5000  |  999.00000  |
|  fit_0.0105 | rep_1   | gen_20000 |  96.22066   |
|  fit_0.0105 | rep_0   | gen_20000 |  39.26704   |
|  fit_0.0105 | rep_4   | gen_20000 |   3.71141   |
|  fit_0.0105 | rep_2   | gen_20000 |  31.44166   |
|  fit_0.0105 | rep_3   | gen_20000 |  28.22028   |
|-------------+---------+-----------+-------------|

It seems highly unusual that all the samples collected at 5000 generations exhibit meaningles dN/dS ratios but none of the samples collected at other time points do.

Run more fitness values (aside from just fit_0.0105) and make some plots of these values for comparison to the previous configuration.

@matsen
Copy link

matsen commented Oct 7, 2016

Does anything jump out just starting at the extreme-value alignments?

@cswarth
Copy link
Owner Author

cswarth commented Oct 11, 2016

Adding link to plot of codon usage vs. ML dN/dS ratio as calculated by codeml for homoresidue empirical fitness constraint .
https://cswarth.github.io/santa-dnds/codon_usage.html

Re-running simulation with a 100x higher mutation rate to see how that affects statistics.

Simulation is running - this will take several hours to complete.

(py27) cwarth@stoat ~/src/matsen/santa-dnds (master *) 
$ scons -j 15

Still running as of 1:45pm. Looks like it is working on some of the high mutation rate samples, which is good because this are the ones I want to see.

@cswarth cswarth self-assigned this Oct 11, 2016
@cswarth
Copy link
Owner Author

cswarth commented Oct 11, 2016

Bumping the mutation rate had the effect of increasing the number of unique sequences in the samples taken at 5000, and 20000 generations. Now there are ~100 unique sequences, and codeml never finishes. It's just too many sequences for PAML/codeml to deal with.

I'm going to put a filter step in to cut down the number of unique sequences sent to codeml.


Re-running now Tue Oct 11 16:53:44 PDT 2016


fixed a bug in the non-selective santa sim; running to completion now

@cswarth
Copy link
Owner Author

cswarth commented Oct 13, 2016

This is a new interactive plot of dnds that has some graphical issues and I realized it is based on bogus simulations. ( didn't have the homoresidue sequence initialzed correctly. ) This is just for reference of what an interactive dnds plot would look like. Mostly this was done to get familiar with plotly.
https://cswarth.github.io/santa-dnds/dnds-bogus1.html

New simulations are running now...

@cswarth
Copy link
Owner Author

cswarth commented Oct 13, 2016

Updated plot of dnds under two different mutation rates can be found at,
https://cswarth.github.io/santa-dnds/multi-rates.html

@cswarth cswarth added ready and removed in progress labels Oct 27, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants