Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Handling low bases in AD calculation #9014

Open
pogodina-nadezda opened this issue Oct 21, 2024 · 1 comment
Open

Handling low bases in AD calculation #9014

pogodina-nadezda opened this issue Oct 21, 2024 · 1 comment
Labels

Comments

@pogodina-nadezda
Copy link

Hi,

I found that even after adding --min-base-quality-score 20 to the HaplotypeCaller command line, low-quality bases are still being counted in the AD field in GVCF.

gatk version: 4.6.0.0 (GATK4 docker image)
command: gatk HaplotypeCaller -R hg38.chr17.fna -I chr17.bqsr.hg38.bam -O chr17.g.vcf.gz --dbsnp dbSNP.hg38.vcf.gz -ploidy 2 --max-alternate-alleles 2 --dont-use-soft-clipped-bases --min-base-quality-score 20 --base-quality-score-threshold 20 --minimum-mapping-quality 10 -ERC GVCF

Bases quality in one position (samtools mpileup result):
BQSR bam file (input):
chr17 3648932 N 47 GaGGcGgcGGcGcGGGcGcccaaaaGGGGGcaGGcGcgcGtcccacg ?5???!'??9+?+999!9!!+55559?999+!?9!!!E+9!!!!5!S
BAMOUT:
chr17 3648932 N 50 GGGGGGGGGGGGGGGGGGGGcctcaacccccaacggcacccaccagaccg !!?!!!?!!?!?!??5!!!?!!!+55!!!+!55?''+5!?!!++5BBB!I

Line in GVCF:
chr17 3648932 rs1555561049 G A,C,<NON_REF> 90.64 . BaseQRankSum=-0.419;DB;DP=53;ExcessHet=0.0000;MLEAC=0,1,0;MLEAF=0.00,0.500,0.00;MQRankSum=0.000;RAW_MQandDP=190800,53;ReadPosRankSum=0.352 GT:AD:DP:GQ:PL:SB 0/2:23,7,16,0:46:24:98,24,381,0,241,336,161,402,378,539:20,3,0,23

I expected the DP to be 28 or 20 according to the bamout, and the AD for Cytosine to not exceed 3. Is it expected behaviour that all bases are counted in the AD, regardless of quality?

@lbergelson
Copy link
Member

@pogodina-nadezda This is expected behavior but it's unintuitive and not necessarily ideal.
The AD is calculated based on the best match of each read to each possible haplotype during assembly. It's not based on the pileup at the site. So a read that has no quality at a site might still be counted towards that site if it matches the relavent assembled haplotype better than any alternative one during the read scoring phase.

@jamesemery This is the second time in 2 days that we've had questions about this though so it's clearly confusing and maybe not the best solution.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants