Unexpected Explained Reads Calculation #139

Open
ddemelfi opened this issue Nov 19, 2024 · 0 comments

This is less of an issue and more of a question. Looking at the HLA genotypes called by arcasHLA, I've noticed that the output files do not include a specific confidence score. Instead, they report the percentage of explained reads and the abundance of each HLA allele. As far as I can tell, the explained-reads percentage is the closest thing to a confidence score, but I was wondering how it is calculated.

I have a few samples that have several HLA-A alleles. The abundances vary, but for the sake of example, the top alleles are:
| allele | abundance |
| --- | --- |
| A*02:844 | 50% |
| A*01:11N | 25% |
| A*01:37:01 | 13% |
| A*01:370 | 10% |

Meanwhile, for this allele, the top 10 are all variants of HLA-A2. However, the most likely genotype is a combination of the top allele and the third most abundant one, [A*02:844, A*01:37:01], which explains about 97% of the reads. Why is this the case? Taken at face value, would the top two alleles by abundance not explain most of the reads, or am I misunderstanding what the abundance represents? Also, with the top ten alleles all being HLA-A2, why is it not the case that HLA-A2 is the only allele present? In other words, how can we be confident that there are two different alleles? Finally, why would it return the top four alleles by abundance when each should only have two? I can understand a classification error, but again, the second-highest abundance isn't taken into account in the top-scoring allele pair.
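To make the question concrete, here is a toy Python sketch of one way the two numbers could diverge. It assumes, purely for illustration, that abundance is a per-allele estimate that splits ambiguous reads between alleles, while "explained reads" counts reads compatible with at least one allele of a pair. I do not know whether arcasHLA computes either quantity this way. The function names, the placeholder alleles X, Y, Z, and the read counts are all invented and do not come from my samples.

```python
from itertools import combinations

# Hypothetical toy data, invented for illustration only. Each key is the set
# of alleles a group of reads is compatible with; each value is the number of
# reads in that group. X, Y, Z are placeholder allele names.
READ_CLASSES = {
    frozenset({"X"}): 15,
    frozenset({"X", "Y"}): 60,  # reads that cannot distinguish X from Y
    frozenset({"Y"}): 10,
    frozenset({"Z"}): 15,
}


def em_abundance(read_classes, n_iter=200):
    """Estimate per-allele abundance with a basic EM over ambiguous reads."""
    alleles = sorted(set().union(*read_classes))
    total = sum(read_classes.values())
    theta = {a: 1.0 / len(alleles) for a in alleles}
    for _ in range(n_iter):
        counts = dict.fromkeys(alleles, 0.0)
        for compatible, n in read_classes.items():
            denom = sum(theta[a] for a in compatible)
            for a in compatible:
                # Split each ambiguous read in proportion to current estimates.
                counts[a] += n * theta[a] / denom
        theta = {a: counts[a] / total for a in alleles}
    return theta


def explained_fraction(pair, read_classes):
    """Fraction of reads compatible with at least one allele in the pair."""
    total = sum(read_classes.values())
    hit = sum(n for compatible, n in read_classes.items() if compatible & set(pair))
    return hit / total


def best_pair(read_classes):
    """Return the allele pair that explains the largest fraction of reads."""
    alleles = sorted(set().union(*read_classes))
    return max(combinations(alleles, 2),
               key=lambda p: explained_fraction(p, read_classes))


if __name__ == "__main__":
    abundance = em_abundance(READ_CLASSES)
    for allele, value in sorted(abundance.items(), key=lambda kv: -kv[1]):
        print(f"{allele}: abundance {value:.2f}")
    for pair in combinations(sorted(abundance), 2):
        print(pair, f"explains {explained_fraction(pair, READ_CLASSES):.0%}")
    print("best pair:", best_pair(READ_CLASSES))
```

With these invented counts, the abundances converge to roughly 0.51 for X, 0.34 for Y, and 0.15 for Z, yet the pair (X, Y) explains 85% of the reads while (X, Z) explains 90%. Most of Y's abundance comes from reads that X already accounts for, so adding Y to X gains little. If something like this is going on in my samples, it would account for the second most abundant allele being skipped, but I would like to confirm what the tool actually does.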

One last note: the percentages for the two alleles are all very close, differing by at most 1%. These may be naive questions, but I am having trouble understanding these percentages and the output overall, and the previous publications are vague, specifically about the explained-reads percentage and the ambiguity of the reported abundances.
