Unintuitive results with N's in pairwiseAlignment #6

LTLA · 2018-03-17T00:06:05Z

The default behaviour of pairwiseAlignment is to perform a quality-weighted alignment, even in the absence of any quality scores in the two input sequences. This is normally fine, as sequences without qualities are given a constant quality of 22L across all bases, and presumably this is just as reasonable as using arbitrary match/mismatch scores in nucleotideSubstitutionMatrix.

However, the use of a high constant quality has odd effects when ambiguous N's are present. Consider:

library(Biostrings)
pairwiseAlignment(DNAString("TTTTT"), DNAString("NNNNN"))

... which gives a score of -29.5 (currently testing on version 2.46.0). This is an unusually low score given that I would consider there to be no mismatches at all - a score near zero would be more appropriate. Indeed, using a nucleotideSubstitutionMatrix gives me something a lot more sensible:

pairwiseAlignment(DNAString("TTTTT"), DNAString("NNNNN"),
     substitutionMatrix=nucleotideSubstitutionMatrix())
# score of 1.25

In short; is the default quality score choice of 22L appropriate for N's? Hacking around to assign low qualities to N's also gives something more reasonable than the default:

pairwiseAlignment(DNAString("TTTTT"), DNAString("NNNNN"), 
   subjectQuality=PhredQuality(0L))
# score of 2.06

... though obviously this would require more work when only a few bases are N.

The text was updated successfully, but these errors were encountered:

hpages · 2018-03-21T00:11:46Z

I don't think that pairwiseAlignment() has built-in capabilities for handling IUPAC ambiguities out-of-the-box. My understanding is that, if no special arrangements are made, N is simply treated as any other letter. This means that the T / N substitution has the same cost as the T / A substitution:

> library(Biostrings)
> pairwiseAlignment(DNAString("T"), DNAString("N"))
Global PairwiseAlignmentsSingleSubject (1 of 1)
pattern: T
subject: N
score: -5.899286 
> pairwiseAlignment(DNAString("T"), DNAString("A"))
Global PairwiseAlignmentsSingleSubject (1 of 1)
pattern: T
subject: A
score: -5.899286

If you want N to be treated as an ambiguity letter that stands for A, C, G, or T, you need to set the costs of the A / N, C / N, G / N, and T / N substitutions to the same as for the A / A, C / C, G / G, and T / T substitutions. For example by doing something like this:

> mat <- diag(15L, nrow=5, ncol=5) - 5L
> rownames(mat) <- colnames(mat) <- c("A", "C", "G", "T", "N")
> mat["N", ] <- mat[ , "N"] <- mat["A", "A"]
> mat
   A  C  G  T  N
A 10 -5 -5 -5 10
C -5 10 -5 -5 10
G -5 -5 10 -5 10
T -5 -5 -5 10 10
N 10 10 10 10 10
> pairwiseAlignment(DNAString("AAATTTTTCCC"), DNAString("TTNTT"), substitutionMatrix=mat)
Global PairwiseAlignmentsSingleSubject (1 of 1)
pattern: AAATTTTTCCC
subject: ---TTNTT---
score: 6 
> pairwiseAlignment(DNAString("AAATTTTTCCC"), DNAString("TTTTT"), substitutionMatrix=mat)
Global PairwiseAlignmentsSingleSubject (1 of 1)
pattern: AAATTTTTCCC
subject: ---TTTTT---
score: 6 
> pairwiseAlignment(DNAString("AAATTTTTCCC"), DNAString("TTGTT"), substitutionMatrix=mat)
Global PairwiseAlignmentsSingleSubject (1 of 1)
pattern: AAATTTTTCCC
subject: ---TTGTT---
score: -9

Does that make sense?
H.

LTLA · 2018-03-21T09:50:43Z

Yes, indeed, if substitutionMatrix is set to the output of nucleotideSubstitutionMatrix(), everything works as expected, as I mentioned above. My problem is with the default settings, where nucleotideSubstitutionMatrix() is not used. Rather, each input sequence is assigned a quality of PhredQuality(22L) (if none was specified), and quality-based alignment is performed. This causes some problems when the input sequence contains ambiguous bases, in particular N.

To be clear, my beef is not with the use of the quality-based alignment algorithm itself. In "real world" data, any Ns should have very low qualities by definition (as the base caller could not figure out its identity), and thus should contribute little to the alignment score. The problem here is that, when input qualities are not specified, Ns and other ambiguous bases are automatically and silently assigned qualities of 22L, resulting in strong mismatches to other bases. This is very unintuitive; it took me days to figure out why my adaptor sequences containing a stretch of Ns were not aligning to my data.

I can think of two alternatives to the current defaults:

Use a nucleotideSubstitutionMatrix() when two DNAString[Set] objects are supplied without qualities. Throw an error (or warning, if you use suggestion 2 below) if only one of the objects have qualities. This is probably the clearest and most intuitive/expected for a typical user.
Set the default qualities so that any ambiguous bases have the lowest possible quality, thus avoiding strong mismatches. The quality-aware alignment does, in fact, handle IUPAC codes correctly when the quality scores are low (see the \gamma parameter in ?pairwiseAlignment).

hpages · 2018-04-14T05:05:45Z

Hi Aaron,
The proposed changes sound reasonable but there is a lot of code around that uses pairwiseAlignment() and I want to take the time to investigate/evaluate how it would be impacted by these changes. We're probably too close from the next BioC release for that (I would need some time to fix the packages I break by making these changes and I don't have that time at the moment) so I'm postponing this to after the release.
Thanks,
H.

LTLA · 2018-04-14T10:57:01Z

Okay, fair enough, thanks.

LTLA · 2018-10-24T17:21:11Z

I eventually figured this out - the alignment scheme is working correctly, but fuzzyMatrix needs to be specified when dealing with quality-scaled DNA strings:

pairwiseAlignment(DNAString("TTTTT"), DNAString("NNNNN"), 
     fuzzyMatrix=nucleotideSubstitutionMatrix())
# gives a score of zero.

This seems like something that should be default when pairwiseAlignment sees DNA string inputs.

acvill · 2023-05-24T21:04:01Z

This is an unusually low score given that I would consider there to be no mismatches at all - a score near zero would be more appropriate.

I think PR Bioconductor/Biostrings#77 addresses this. An asymmetric substitution matrix avoids penalizing alignments in the case of ambiguous subject sequences.

mtmorgan assigned hpages Mar 23, 2018

hpages transferred this issue from Bioconductor/Biostrings Mar 23, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Unintuitive results with N's in pairwiseAlignment #6

Unintuitive results with N's in pairwiseAlignment #6

LTLA commented Mar 17, 2018 •

edited

Loading

hpages commented Mar 21, 2018 •

edited

Loading

LTLA commented Mar 21, 2018 •

edited

Loading

hpages commented Apr 14, 2018

LTLA commented Apr 14, 2018

LTLA commented Oct 24, 2018

acvill commented May 24, 2023

Unintuitive results with N's in pairwiseAlignment #6

Unintuitive results with N's in pairwiseAlignment #6

Comments

LTLA commented Mar 17, 2018 • edited Loading

hpages commented Mar 21, 2018 • edited Loading

LTLA commented Mar 21, 2018 • edited Loading

hpages commented Apr 14, 2018

LTLA commented Apr 14, 2018

LTLA commented Oct 24, 2018

acvill commented May 24, 2023

LTLA commented Mar 17, 2018 •

edited

Loading

hpages commented Mar 21, 2018 •

edited

Loading

LTLA commented Mar 21, 2018 •

edited

Loading