Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unintuitive results with N's in pairwiseAlignment #6

Open
LTLA opened this issue Mar 17, 2018 · 6 comments
Open

Unintuitive results with N's in pairwiseAlignment #6

LTLA opened this issue Mar 17, 2018 · 6 comments
Assignees

Comments

@LTLA
Copy link

LTLA commented Mar 17, 2018

The default behaviour of pairwiseAlignment is to perform a quality-weighted alignment, even in the absence of any quality scores in the two input sequences. This is normally fine, as sequences without qualities are given a constant quality of 22L across all bases, and presumably this is just as reasonable as using arbitrary match/mismatch scores in nucleotideSubstitutionMatrix.

However, the use of a high constant quality has odd effects when ambiguous N's are present. Consider:

library(Biostrings)
pairwiseAlignment(DNAString("TTTTT"), DNAString("NNNNN"))

... which gives a score of -29.5 (currently testing on version 2.46.0). This is an unusually low score given that I would consider there to be no mismatches at all - a score near zero would be more appropriate. Indeed, using a nucleotideSubstitutionMatrix gives me something a lot more sensible:

pairwiseAlignment(DNAString("TTTTT"), DNAString("NNNNN"),
     substitutionMatrix=nucleotideSubstitutionMatrix())
# score of 1.25

In short; is the default quality score choice of 22L appropriate for N's? Hacking around to assign low qualities to N's also gives something more reasonable than the default:

pairwiseAlignment(DNAString("TTTTT"), DNAString("NNNNN"), 
   subjectQuality=PhredQuality(0L))
# score of 2.06

... though obviously this would require more work when only a few bases are N.

@hpages
Copy link
Contributor

hpages commented Mar 21, 2018

I don't think that pairwiseAlignment() has built-in capabilities for handling IUPAC ambiguities out-of-the-box. My understanding is that, if no special arrangements are made, N is simply treated as any other letter. This means that the T / N substitution has the same cost as the T / A substitution:

> library(Biostrings)
> pairwiseAlignment(DNAString("T"), DNAString("N"))
Global PairwiseAlignmentsSingleSubject (1 of 1)
pattern: T
subject: N
score: -5.899286 
> pairwiseAlignment(DNAString("T"), DNAString("A"))
Global PairwiseAlignmentsSingleSubject (1 of 1)
pattern: T
subject: A
score: -5.899286 

If you want N to be treated as an ambiguity letter that stands for A, C, G, or T, you need to set the costs of the A / N, C / N, G / N, and T / N substitutions to the same as for the A / A, C / C, G / G, and T / T substitutions. For example by doing something like this:

> mat <- diag(15L, nrow=5, ncol=5) - 5L
> rownames(mat) <- colnames(mat) <- c("A", "C", "G", "T", "N")
> mat["N", ] <- mat[ , "N"] <- mat["A", "A"]
> mat
   A  C  G  T  N
A 10 -5 -5 -5 10
C -5 10 -5 -5 10
G -5 -5 10 -5 10
T -5 -5 -5 10 10
N 10 10 10 10 10
> pairwiseAlignment(DNAString("AAATTTTTCCC"), DNAString("TTNTT"), substitutionMatrix=mat)
Global PairwiseAlignmentsSingleSubject (1 of 1)
pattern: AAATTTTTCCC
subject: ---TTNTT---
score: 6 
> pairwiseAlignment(DNAString("AAATTTTTCCC"), DNAString("TTTTT"), substitutionMatrix=mat)
Global PairwiseAlignmentsSingleSubject (1 of 1)
pattern: AAATTTTTCCC
subject: ---TTTTT---
score: 6 
> pairwiseAlignment(DNAString("AAATTTTTCCC"), DNAString("TTGTT"), substitutionMatrix=mat)
Global PairwiseAlignmentsSingleSubject (1 of 1)
pattern: AAATTTTTCCC
subject: ---TTGTT---
score: -9 

Does that make sense?
H.

@LTLA
Copy link
Author

LTLA commented Mar 21, 2018

Yes, indeed, if substitutionMatrix is set to the output of nucleotideSubstitutionMatrix(), everything works as expected, as I mentioned above. My problem is with the default settings, where nucleotideSubstitutionMatrix() is not used. Rather, each input sequence is assigned a quality of PhredQuality(22L) (if none was specified), and quality-based alignment is performed. This causes some problems when the input sequence contains ambiguous bases, in particular N.

To be clear, my beef is not with the use of the quality-based alignment algorithm itself. In "real world" data, any Ns should have very low qualities by definition (as the base caller could not figure out its identity), and thus should contribute little to the alignment score. The problem here is that, when input qualities are not specified, Ns and other ambiguous bases are automatically and silently assigned qualities of 22L, resulting in strong mismatches to other bases. This is very unintuitive; it took me days to figure out why my adaptor sequences containing a stretch of Ns were not aligning to my data.

I can think of two alternatives to the current defaults:

  1. Use a nucleotideSubstitutionMatrix() when two DNAString[Set] objects are supplied without qualities. Throw an error (or warning, if you use suggestion 2 below) if only one of the objects have qualities. This is probably the clearest and most intuitive/expected for a typical user.
  2. Set the default qualities so that any ambiguous bases have the lowest possible quality, thus avoiding strong mismatches. The quality-aware alignment does, in fact, handle IUPAC codes correctly when the quality scores are low (see the \gamma parameter in ?pairwiseAlignment).

@hpages
Copy link
Contributor

hpages commented Apr 14, 2018

Hi Aaron,
The proposed changes sound reasonable but there is a lot of code around that uses pairwiseAlignment() and I want to take the time to investigate/evaluate how it would be impacted by these changes. We're probably too close from the next BioC release for that (I would need some time to fix the packages I break by making these changes and I don't have that time at the moment) so I'm postponing this to after the release.
Thanks,
H.

@LTLA
Copy link
Author

LTLA commented Apr 14, 2018

Okay, fair enough, thanks.

@LTLA
Copy link
Author

LTLA commented Oct 24, 2018

I eventually figured this out - the alignment scheme is working correctly, but fuzzyMatrix needs to be specified when dealing with quality-scaled DNA strings:

pairwiseAlignment(DNAString("TTTTT"), DNAString("NNNNN"), 
     fuzzyMatrix=nucleotideSubstitutionMatrix())
# gives a score of zero.

This seems like something that should be default when pairwiseAlignment sees DNA string inputs.

@acvill
Copy link

acvill commented May 24, 2023

This is an unusually low score given that I would consider there to be no mismatches at all - a score near zero would be more appropriate.

I think PR Bioconductor/Biostrings#77 addresses this. An asymmetric substitution matrix avoids penalizing alignments in the case of ambiguous subject sequences.

@hpages hpages transferred this issue from Bioconductor/Biostrings Mar 23, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants