Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

How are AFs computed? #80

Open
romagnolid opened this issue May 2, 2019 · 9 comments · May be fixed by #126
Open

How are AFs computed? #80

romagnolid opened this issue May 2, 2019 · 9 comments · May be fixed by #126

Comments

@romagnolid
Copy link

I've read in other closed issues that you suggest not to pre-filter reads because that will distort the signal and that's reasonable but I noticed a strange situation.
Here's the raw counts of a variant obtained from IGV (reference is T)

chr10:123,241,496
Total count: 137
A : 0
C : 137 (100%, 59+, 78- )
G : 0
T : 0
N : 0

So, I would expect even with strict filters to see an AF of 100%.
Instead using -q 30 -Q 30 -m 30 as quality filters this is what I get
chr10 123241496 . T C 2874 PASS DP=123;AF=0.788618;SB=0;DP4=0,0,52,71
Can you help understand why that is happening?

@romagnolid romagnolid changed the title How are AF computed? How are AFs computed? May 2, 2019
@andreas-wilm
Copy link
Contributor

Sorry. I've ignored issues report after leaving academia. I have absolutely no idea what is going on here and would love to have a look. Any chance you could share a BAM slice?

@romagnolid
Copy link
Author

romagnolid commented Dec 19, 2019 via email

@andreas-wilm
Copy link
Contributor

Hi Dario, to which email address did you send the file? Could you either upload it and share the link here or share with my private email address: [email protected]. Thanks

@andreas-wilm
Copy link
Contributor

Thanks Dario. I can' t reproduce the problem. Which version are you using? With 2.1.4 (which should in this respect be identical to 2.1.3) gives the following:

$ lofreq call test_file_chr10.bam -r 10:123241496-123241496 -f human_g1k_v37.fasta | tail -n 1
10      123241496       .       T       C       3114    PASS    DP=123;AF=1.000000;SB=0;DP4=0,0,52,71

@romagnolid
Copy link
Author

romagnolid commented Jan 15, 2020

Sorry Andreas but you need the quality parameters -Q 30 -q 30 -m 30. Anyway I'm using version 2.1.3
My point is, irrespective of quality filters the AF shouldn't change when you have only reads covering the reference allele.
I am also confused by the fact that the raw depth (DP) is 123 (while it's 137 according to IGV) even without any filter. Am I missing something?

@andreas-wilm
Copy link
Contributor

Got it. This is a bug. The culprit is -q 30. This quality filter was meant to only affect variant quality computation, and not AF or DP etc, however it does (note to self: through the later used alt_counts). Fixing this is going to be hard and make the already convoluted logic and code even more convoluted. You won't like this, but I will leave this bug unfixed and this ticket open for two reasons:

  1. You are actually not meant to filter on base or mapping quality excessively in LoFreq! Filtering introduces calling biases (and -q has the biggest effect, apart from the bug) in LoFreq's core which was built to deal with and is robust against quality variations (even though you might lose your favourite variant every now and then :))
  2. I have started LoFreq3 (even though as a hobby, because I left academia; a static binary for call will appear within weeks) already, which is not affected by design (because it separates pileup/filtering from calling by design)

Sorry. Hope this helps anyway
Andreas

@romagnolid
Copy link
Author

It certainly gives a lot of insights, so thank you for taking time to inspect the issue.

andreas-wilm pushed a commit that referenced this issue Jun 12, 2020
@rahil19
Copy link

rahil19 commented Aug 2, 2021

@andreas-wilm I have the same issue

NC_045512.2     23709   .       C       T       49314   PASS    DP=9307;AF=0.751262;SB=23;DP4=45,18,4892,4304

When I run other variant callers (BCFTools) and look at IGV, they all report AF > 0.99. If I actually calculate from DP4 i.e. (4892+4304)/(45+18+4892+4304) allele frequency > 0.99. Therefore, I couldn't figure out where is 0.75 AF calculation is coming from
However, you mentioned that quality filters while running lofreq can cause biases in calculations. Please see my run below:

lofreq call --call-indels --min-cov 50 -q 30 -Q 30 -m 20 --verbose -f path_to_reference/genome_ref.fasta -o sample_bwa_mq20pp_mark_dup_lofreq_realign_indelq_alnq_call.vcf sample_bwa_mq20pp_mark_dup_lofreq_realign_indelq_alnq.bam

If we don't include mapping and base quality filters at the variant calling run, then how do we filter on base and mapping qualities after variants have been called using lofreq call?

Thanks, Rahil

@rahil19
Copy link

rahil19 commented Aug 2, 2021

Also, I really need to filter my variants on allele frequency to get high frequency variants (> 90%). Could you please suggest if I calculating from DP4 would be fine?

Thanks, Rahil

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants