Unexpected value in dist.dna on sequences with gaps #126

samcwiley · 2024-07-23T13:19:15Z

Hi, I really love using ape and have been using lots of the functions for calculating distance between sequences; thank you for this.
I've noticed the nucleotide substitution models that include nucleotide base frequencies (namely Felsenstein-81 and Tamura-Nei 93) are giving me unexpected values for sequences with gaps when pairwise deletion is turned on.
In testing, both dist.dna(dna_sequences, model = "F81", pairwise.deletion=FALSE) and dist.dna(dna_sequences, model = "F81", pairwise.deletion=TRUE) return the same value for pairs of sequences with gaps, and I believe this is due to the base frequency tabulations are still counting bases corresponding to a base with a gap (they are not getting deleted after all.)

The text was updated successfully, but these errors were encountered:

emmanuelparadis · 2024-07-25T15:59:34Z

Hi,
Thanks for the appreciation!

and I believe this is due to the base frequency tabulations are still counting bases corresponding to a base with a gap (they are not getting deleted after all.)

You're correct: the base frequencies are calculated with all the data, whatever the value of pairwise.deletion. The logic behind this is that in these models base frequencies are assumed to be constant over all sequences, so it's better to use all possible observed bases to estimate them.

You can see what are the possible impacts of this by computing the distances dropping all sequences but a pair, something like this:

n <- nrow(dna_sequences)
D <- numeric(n * (n - 1) / 2)
k <- 1
for (i in 1:(n - 1)) {
   for (j in (i + 1):n) {
      D[k] <- dist.dna(dna_sequences[c(i, j), ], "F81", ......>)
      k <- k + 1
    }
}

(This could be a bit slow if you have a very big data set.)
D can be compared directly with the output of D0 <- dist.dna(dna_sequences, "F81", ......) for instance:

plot(D, D0); abline(0, 1, lty = 3)

Cheers,
Emmanuel

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Unexpected value in dist.dna on sequences with gaps #126

Unexpected value in dist.dna on sequences with gaps #126

samcwiley commented Jul 23, 2024

emmanuelparadis commented Jul 25, 2024

Unexpected value in dist.dna on sequences with gaps #126

Unexpected value in dist.dna on sequences with gaps #126

Comments

samcwiley commented Jul 23, 2024

emmanuelparadis commented Jul 25, 2024