You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
When bcftools encounters a contig name that it cannot parse (for instance, the one containing a comma), it produced an error message: [E::bcf_hdr_parse_line] Could not parse the header line: "##contig=<ID=BVAB1.CP049781.1_Clostridiales_genomosp._BVAB1_isolate_UAB071_chromosome,_complete_genome,length=1649642>"
The output BCF, however, is still produced. Apparently, with a parsing error, some internal pointer for contigs in the BAM header gets messed up and the resulting BCF contains the variants not from the correct contig, but from the next contig in the BAM header.
My example
BAM header contains a lot of contig names, but all reads in BAM are mapped to one contig called right.contig.
FASTA file contains only the contig right.contig.
Resulting BCF file contains variants from contig wrong.contig. Why?
Sample of the BAM header with the correct and incorrect contigs being next to each other:
... somewhere above is the contig with the forbidden comma ...
@SQ SN:right.contig LN:876514
@SQ SN:wrong.contig LN:100745
...
Apparently, the pointer shifts by one when encountering the parsing error.
When the bad contig name is changed to remove the forbidden comma, all works as expected (BCF file's variants are annotated as right.contig)
Suggestion
When parsing issue happens, either don't generate the BCF (because it may be erroneously annotated) or fix the pointer issue that results in wrong contig being outputted to BCF. Thanks!
The text was updated successfully, but these errors were encountered:
Note commas in contig IDs is forbidden by both SAM (section 1.2.1) and VCF (section 1.4.7). While I agree we shouldn't get strange behaviour from BCF tools, I'd argue it should simply be rejecting the file.
You may wish to give feedback to whoeever is producing a reference sequence containing commas explaining that it is incompatible with many standard tool pipelines.
@jkbonfield I agree that if the comma is not allowed, the file should be rejected with an error message. So far, the message seems more like a warning (at least visually, plus non-empty bcf output is produced). Thanks!
When bcftools encounters a contig name that it cannot parse (for instance, the one containing a comma), it produced an error message:
[E::bcf_hdr_parse_line] Could not parse the header line: "##contig=<ID=BVAB1.CP049781.1_Clostridiales_genomosp._BVAB1_isolate_UAB071_chromosome,_complete_genome,length=1649642>"
The output BCF, however, is still produced. Apparently, with a parsing error, some internal pointer for contigs in the BAM header gets messed up and the resulting BCF contains the variants not from the correct contig, but from the next contig in the BAM header.
My example
BAM header contains a lot of contig names, but all reads in BAM are mapped to one contig called
right.contig
.FASTA file contains only the contig
right.contig
.Resulting BCF file contains variants from contig
wrong.contig
. Why?Sample of the BAM header with the correct and incorrect contigs being next to each other:
Apparently, the pointer shifts by one when encountering the parsing error.
Command used:
bcftools mpileup -q 30 -f {fasta} {bam} -o {bcf}
When the bad contig name is changed to remove the forbidden comma, all works as expected (BCF file's variants are annotated as
right.contig
)Suggestion
When parsing issue happens, either don't generate the BCF (because it may be erroneously annotated) or fix the pointer issue that results in wrong contig being outputted to BCF. Thanks!
The text was updated successfully, but these errors were encountered: