mpileup's off-by-one bug: when cannot parse contig name with a comma, subsequent output is wrongly labeled #2345

azaznaev · 2024-12-19T21:26:21Z

When bcftools encounters a contig name that it cannot parse (for instance, one containing a comma), it prints an error message:
[E::bcf_hdr_parse_line] Could not parse the header line: "##contig=<ID=BVAB1.CP049781.1_Clostridiales_genomosp._BVAB1_isolate_UAB071_chromosome,_complete_genome,length=1649642>"

The output BCF, however, is still produced. Apparently, after the parsing error, some internal contig pointer gets out of step with the BAM header, and the resulting BCF reports the variants not under the correct contig but under the next contig in the BAM header.

My example

BAM header contains a lot of contig names, but all reads in BAM are mapped to one contig called right.contig.
FASTA file contains only the contig right.contig.
Resulting BCF file contains variants from contig wrong.contig. Why?
Sample of the BAM header with the correct and incorrect contigs being next to each other:

... somewhere above is the contig with the forbidden comma ...
@SQ           SN:right.contig           LN:876514
@SQ           SN:wrong.contig           LN:100745
...

Apparently, the pointer shifts by one when encountering the parsing error.
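To make the suspected mechanism concrete, here is a toy model of how dropping one unparseable header entry could shift every later label by one. This is only an illustration of the hypothesis above, not bcftools or htslib code; the function name and the contig name bad,contig are made up for the example.

```python
def build_name_table(sq_names):
    """Toy header builder: silently drops names it cannot parse.

    Assumption for illustration only: "cannot parse" means "contains a comma".
    """
    return [name for name in sq_names if "," not in name]


# Contig order as it appears in the (hypothetical) BAM header.
bam_contigs = ["bad,contig", "right.contig", "wrong.contig"]

# Reads carry a numeric index into the BAM header, not a name.
bam_tid = bam_contigs.index("right.contig")

# The output header has lost one entry, so the same index now
# points at the following contig.
output_table = build_name_table(bam_contigs)
label = output_table[bam_tid]
print(label)  # wrong.contig
```

If something like this is happening, every contig listed after the unparseable one would be mislabeled, not just right.contig.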

Command used: bcftools mpileup -q 30 -f {fasta} {bam} -o {bcf}
When the bad contig name is changed to remove the forbidden comma, everything works as expected (the variants in the BCF file are annotated as right.contig).
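As a stopgap, the header can be checked for commas before running mpileup. The sketch below assumes the header text has been dumped beforehand (for example with samtools view -H) and that fields are tab-separated as in a real SAM header; the function name and the sample header are made up for the example, and it checks only for commas, not for every character the SAM specification forbids.

```python
def contigs_with_commas(header_text):
    """Return the SN values of @SQ lines whose name contains a comma."""
    bad = []
    for line in header_text.splitlines():
        if not line.startswith("@SQ"):
            continue
        # Skip the "@SQ" record type, then look for the SN: field.
        for field in line.split("\t")[1:]:
            if field.startswith("SN:") and "," in field[3:]:
                bad.append(field[3:])
    return bad


# Hypothetical header text, tab-separated as in a real SAM header.
example_header = (
    "@HD\tVN:1.6\tSO:coordinate\n"
    "@SQ\tSN:bad,contig\tLN:1649642\n"
    "@SQ\tSN:right.contig\tLN:876514\n"
    "@SQ\tSN:wrong.contig\tLN:100745\n"
)
offending = contigs_with_commas(example_header)
print(offending)  # ['bad,contig']
```

A non-empty result means the BAM header should be fixed before calling variants.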

Suggestion

When a parsing issue happens, either don't generate the BCF (because it may be erroneously annotated) or fix the pointer issue that results in the wrong contig being written to the BCF. Thanks!
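Until the tool itself refuses such input, a wrapper script could treat any error-level message on stderr as fatal and discard the BCF. A minimal sketch, assuming error messages keep the "[E::" prefix seen in the message quoted above; the function name is made up, and the second stderr line is a made-up placeholder, not real bcftools output.

```python
def htslib_error_lines(stderr_text):
    """Return the stderr lines that start with the "[E::" error prefix."""
    return [line for line in stderr_text.splitlines() if line.startswith("[E::")]


# First line: shortened form of the message from this report.
# Second line: a made-up placeholder for non-error output.
sample_stderr = (
    "[E::bcf_hdr_parse_line] Could not parse the header line: "
    '"##contig=<ID=bad,contig,length=1649642>"\n'
    "some informational line\n"
)
errors = htslib_error_lines(sample_stderr)
should_discard_output = bool(errors)
```

In a pipeline, should_discard_output being true would be the signal to delete the BCF and stop, rather than pass a possibly mislabeled file downstream.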


jkbonfield · 2025-01-06T13:48:39Z

Note that commas in contig IDs are forbidden by both SAM (section 1.2.1) and VCF (section 1.4.7). While I agree we shouldn't get strange behaviour from bcftools, I'd argue it should simply reject the file.

You may wish to give feedback to whoever is producing a reference sequence containing commas, explaining that it is incompatible with many standard tool pipelines.

azaznaev · 2025-01-06T14:15:18Z

@jkbonfield I agree that if the comma is not allowed, the file should be rejected with an error message. So far, the message seems more like a warning (at least visually, plus non-empty bcf output is produced). Thanks!
