Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unexpected value in dist.dna on sequences with gaps #126

Open
samcwiley opened this issue Jul 23, 2024 · 1 comment
Open

Unexpected value in dist.dna on sequences with gaps #126

samcwiley opened this issue Jul 23, 2024 · 1 comment

Comments

@samcwiley
Copy link

Hi, I really love using ape and have been using lots of the functions for calculating distance between sequences; thank you for this.
I've noticed the nucleotide substitution models that include nucleotide base frequencies (namely Felsenstein-81 and Tamura-Nei 93) are giving me unexpected values for sequences with gaps when pairwise deletion is turned on.
In testing, both dist.dna(dna_sequences, model = "F81", pairwise.deletion=FALSE) and dist.dna(dna_sequences, model = "F81", pairwise.deletion=TRUE) return the same value for pairs of sequences with gaps, and I believe this is due to the base frequency tabulations are still counting bases corresponding to a base with a gap (they are not getting deleted after all.)

@emmanuelparadis
Copy link
Owner

Hi,
Thanks for the appreciation!

and I believe this is due to the base frequency tabulations are still counting bases corresponding to a base with a gap (they are not getting deleted after all.)

You're correct: the base frequencies are calculated with all the data, whatever the value of pairwise.deletion. The logic behind this is that in these models base frequencies are assumed to be constant over all sequences, so it's better to use all possible observed bases to estimate them.

You can see what are the possible impacts of this by computing the distances dropping all sequences but a pair, something like this:

n <- nrow(dna_sequences)
D <- numeric(n * (n - 1) / 2)
k <- 1
for (i in 1:(n - 1)) {
   for (j in (i + 1):n) {
      D[k] <- dist.dna(dna_sequences[c(i, j), ], "F81", ......>)
      k <- k + 1
    }
}

(This could be a bit slow if you have a very big data set.)
D can be compared directly with the output of D0 <- dist.dna(dna_sequences, "F81", ......) for instance:

plot(D, D0); abline(0, 1, lty = 3)

Cheers,
Emmanuel

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants