Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

mpileup's off-by-one bug: when cannot parse contig name with a comma, subsequent output is wrongly labeled #2345

Open
azaznaev opened this issue Dec 19, 2024 · 2 comments

Comments

@azaznaev
Copy link

When bcftools encounters a contig name that it cannot parse (for instance, the one containing a comma), it produced an error message:
[E::bcf_hdr_parse_line] Could not parse the header line: "##contig=<ID=BVAB1.CP049781.1_Clostridiales_genomosp._BVAB1_isolate_UAB071_chromosome,_complete_genome,length=1649642>"

The output BCF, however, is still produced. Apparently, with a parsing error, some internal pointer for contigs in the BAM header gets messed up and the resulting BCF contains the variants not from the correct contig, but from the next contig in the BAM header.

My example

  • BAM header contains a lot of contig names, but all reads in BAM are mapped to one contig called right.contig.

  • FASTA file contains only the contig right.contig.

  • Resulting BCF file contains variants from contig wrong.contig. Why?

  • Sample of the BAM header with the correct and incorrect contigs being next to each other:

... somewhere above is the contig with the forbidden comma ...
@SQ           SN:right.contig           LN:876514
@SQ           SN:wrong.contig           LN:100745
...

Apparently, the pointer shifts by one when encountering the parsing error.

  • Command used: bcftools mpileup -q 30 -f {fasta} {bam} -o {bcf}

  • When the bad contig name is changed to remove the forbidden comma, all works as expected (BCF file's variants are annotated as right.contig)

Suggestion

When parsing issue happens, either don't generate the BCF (because it may be erroneously annotated) or fix the pointer issue that results in wrong contig being outputted to BCF. Thanks!

@jkbonfield
Copy link
Contributor

Note commas in contig IDs is forbidden by both SAM (section 1.2.1) and VCF (section 1.4.7). While I agree we shouldn't get strange behaviour from BCF tools, I'd argue it should simply be rejecting the file.

You may wish to give feedback to whoeever is producing a reference sequence containing commas explaining that it is incompatible with many standard tool pipelines.

@azaznaev
Copy link
Author

azaznaev commented Jan 6, 2025

@jkbonfield I agree that if the comma is not allowed, the file should be rejected with an error message. So far, the message seems more like a warning (at least visually, plus non-empty bcf output is produced). Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants