Unintuitive results with N's in pairwiseAlignment #6
Comments
I don't think that
If you want N to be treated as an ambiguity letter that stands for A, C, G, or T, you need to set the costs of the A / N, C / N, G / N, and T / N substitutions to the same as for the A / A, C / C, G / G, and T / T substitutions. For example by doing something like this:
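A minimal sketch of that idea, not the snippet originally posted here. The match/mismatch values and the name `submat` are assumptions chosen for illustration, and the sequences are borrowed from the `TTTTT` / `NNNNN` example elsewhere in this thread:

```r
# Assumes the Biostrings version discussed in this thread (2.46.0); on newer
# Bioconductor releases these functions are, to my knowledge, provided by pwalign.
library(Biostrings)

# Full IUPAC substitution matrix; match = 1 and mismatch = -3 are arbitrary choices.
submat <- nucleotideSubstitutionMatrix(match = 1, mismatch = -3, baseOnly = FALSE)

# Score any base paired with N (in either orientation) exactly like a match.
bases <- c("A", "C", "G", "T")
submat[bases, "N"] <- submat["A", "A"]
submat["N", bases] <- submat["A", "A"]

# Passing substitutionMatrix replaces the default quality-based scoring.
pairwiseAlignment(DNAString("TTTTT"), DNAString("NNNNN"),
                  substitutionMatrix = submat)
```

With a matrix like this, a T aligned to an N should contribute the same score as a T aligned to a T.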
Does that make sense?
Yes, indeed, if To be clear, my beef is not with the use of the quality-based alignment algorithm itself. In "real world" data, any I can think of two alternatives to the current defaults:
Hi Aaron,
Okay, fair enough, thanks.
I eventually figured this out - the alignment scheme is working correctly, but

    pairwiseAlignment(DNAString("TTTTT"), DNAString("NNNNN"),
                      fuzzyMatrix=nucleotideSubstitutionMatrix())
    # gives a score of zero.

This seems like something that should be default when
I think PR Bioconductor/Biostrings#77 addresses this. An asymmetric substitution matrix avoids penalizing alignments in the case of ambiguous subject sequences.
The default behaviour of `pairwiseAlignment` is to perform a quality-weighted alignment, even in the absence of any quality scores in the two input sequences. This is normally fine, as sequences without qualities are given a constant quality of `22L` across all bases, and presumably this is just as reasonable as using arbitrary match/mismatch scores in `nucleotideSubstitutionMatrix`.

However, the use of a high constant quality has odd effects when ambiguous N's are present. Consider:

... which gives a score of -29.5 (currently testing on version 2.46.0). This is an unusually low score given that I would consider there to be no mismatches at all - a score near zero would be more appropriate. Indeed, using a `nucleotideSubstitutionMatrix` gives me something a lot more sensible:

In short: is the default quality score choice of `22L` appropriate for N's? Hacking around to assign low qualities to N's also gives something more reasonable than the default:

... though obviously this would require more work when only a few bases are N.
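For reference, a rough base-R sketch of where these numbers seem to come from. It uses the quality-based scoring formula as I recall it from the Biostrings documentation for `qualitySubstitutionMatrices`, so treat the formula as an assumption rather than a statement of the implementation. The helper names `phred_to_error` and `quality_score` are made up for this illustration, and the factor of 5 assumes five aligned base/N positions, as in a `TTTTT` vs `NNNNN` comparison:

```r
# Phred quality -> probability that a base call is wrong.
phred_to_error <- function(q) 10^(-q / 10)

# Quality-based substitution score (in bits) for one aligned position.
# gamma is the probability that the two letters denote the same base:
#   1   for T vs T,
#   0   for T vs N when N is treated as an ordinary, different letter,
#   1/4 for T vs N when N stands for any of A, C, G, T.
quality_score <- function(q1, q2, gamma, n = 4) {
  e1 <- phred_to_error(q1)
  e2 <- phred_to_error(q2)
  e  <- e1 + e2 - (n / (n - 1)) * e1 * e2  # combined error probability
  log2(gamma * (1 - e) * n + (1 - gamma) * e * n / (n - 1))
}

# Five positions, both sequences at the default quality of 22.
5 * quality_score(22, 22, gamma = 0)     # about -29.5: N penalised as a mismatch
5 * quality_score(22, 22, gamma = 0.25)  # 0, up to floating-point rounding
```

If the formula is right, the -29.5 is just five full-confidence mismatches, and a fuzzy match probability of 1/4 for N is what brings the score back to zero.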