Systematic decisions to ignore regions of the PRG #96
Proposal 1
Any chunk that shares too many kmers with lots of other chunks is a repeat where we will never be able to place reads reliably, so mask it out. Any small set of chunks that share "enough" kmers should be co-analysed.
**What's good about proposal 1**
**What's bad about the proposal**
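To make Proposal 1 concrete, here is a minimal sketch. It rests on several assumptions that are not part of the proposal itself: each chunk is reduced to a single linear string (a real PRG chunk is a graph), "shares kmers" is measured as the fraction of the smaller chunk's kmer set found in the other chunk, and the names `classify_chunks`, `share_threshold` and `max_partners` are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def kmers(seq, k):
    """Return the set of all kmers of length k in a linear sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def classify_chunks(chunks, k=5, share_threshold=0.5, max_partners=2):
    """Split chunks into masked repeats and pairs to co-analyse.

    chunks: dict mapping chunk name -> linear sequence (an assumption;
    a real chunk would be a graph with many paths).
    Returns (masked, coanalyse): a set of chunk names to mask out, and a
    set of frozenset pairs of unmasked chunks that share enough kmers.
    """
    kmer_sets = {name: kmers(seq, k) for name, seq in chunks.items()}
    partners = defaultdict(set)
    for a, b in combinations(sorted(chunks), 2):
        smaller = min(len(kmer_sets[a]), len(kmer_sets[b]))
        if smaller == 0:
            continue  # chunk shorter than k: nothing to compare
        shared = len(kmer_sets[a] & kmer_sets[b]) / smaller
        if shared >= share_threshold:
            partners[a].add(b)
            partners[b].add(a)
    # A chunk similar to "lots" of other chunks is treated as a repeat.
    masked = {c for c, p in partners.items() if len(p) > max_partners}
    coanalyse = {
        frozenset((a, b))
        for a in partners
        for b in partners[a]
        if a not in masked and b not in masked
    }
    return masked, coanalyse
```

Both thresholds are placeholders; choosing them sensibly is exactly the open question in this issue.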
**Proposal 2** As proposal 1, except that when comparing chunk i and chunk j (assume diploid, but what I say is easy to modify for other ploidies), sample 100 (or some other number) of possible pairs of paths in chunk i, and also in chunk j, calculate the statistic for each pair of these, and then take the average.
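A rough sketch of the sampling step in Proposal 2 follows. The chunk representation (a list of segments, each a list of alternative allele strings), the uniform sampling of paths, and the use of Jaccard similarity as "the statistic" are all assumptions made for illustration; the proposal does not fix any of them, and `mean_sharing` is a hypothetical name.

```python
import random


def kmers(seq, k):
    """Return the set of all kmers of length k in a linear sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def sample_path(chunk, rng):
    """Pick one allele per segment uniformly at random and concatenate."""
    return "".join(rng.choice(alleles) for alleles in chunk)


def diploid_kmers(chunk, k, rng):
    """Kmers of a sampled diploid genotype, i.e. a pair of paths."""
    return kmers(sample_path(chunk, rng), k) | kmers(sample_path(chunk, rng), k)


def mean_sharing(chunk_i, chunk_j, k=5, n_samples=100, seed=0):
    """Average kmer-sharing statistic over sampled pairs of paths.

    The statistic here is Jaccard similarity of the two kmer sets,
    which is a stand-in for whatever statistic is eventually chosen.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        a = diploid_kmers(chunk_i, k, rng)
        b = diploid_kmers(chunk_j, k, rng)
        union = a | b
        total += len(a & b) / len(union) if union else 0.0
    return total / n_samples
```

For example, a chunk could be written as `[["ACGTAC"], ["A", "T"], ["GGCATT"]]`: two invariant segments around one biallelic site.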
Impact on Plasmodium falciparum (key use case for us): this will immediately remove the crazy repeat regions where we should not waste time trying to quasimap or variant call. Less wasted time, lower RAM use.

Impact on MHC (also key): this is trickier. There are places (of key importance) where there are, say, 2 sites a long way apart, each with, say, 5000 alternate alleles. Now, subsets of these alleles are very similar within each site, and a smaller subset are also very similar between sites. To be concrete:
Site 1 has:
Site 2 has:
Now, what should we do in terms of deciding whether site 1 and site 2 should be co-analysed?
BTW, above I said something like
Also, since this issue was raised, Robyn and I had a conversation which resulted in Proposal 3
...where "mask out" could mean modifying the original VCF/whatever and regenerating a better PRG
There's a lot going on in this issue. I think that it would be beneficial to spin out certain problems into separate issues. For instance, I think that chunking the PRG as a memory scaling enhancement can be dealt with separately. From what I understand, the regions that are beneficial to ignore are repeat regions. Surely there are already existing solutions which can identify repeat regions. @iqbal-lab will any of those solutions work for us?
I think we need more clarity on what this is meant to achieve. I think the onus is on me to do that, so taking ownership
Given a (minimum and) maximum read length, and a PRG, we should be able to:
- decide that there are some places we will never be able to draw inference on, so we might as well ignore them (which means do not put them in the kmer-index, nor store allele counts for them)
- decide that there are some places that should always be analysed jointly (so that if we chunk the genome, those chunks should be analysed concurrently, as one chunk)
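One possible reading of the second goal is that "analysed jointly" is transitive: if chunk A must go with B, and B with C, then all three form one unit. Under that assumption (which the issue does not state), the joint groups are the connected components of a "shares enough kmers" graph. The sketch below uses a small union-find; `joint_groups` and its inputs are hypothetical names.

```python
def joint_groups(chunk_names, linked_pairs):
    """Group chunks so that linked chunks end up in the same group.

    chunk_names: iterable of chunk identifiers.
    linked_pairs: iterable of (a, b) pairs judged to need joint analysis.
    Returns a sorted list of sorted groups (lists of chunk names).
    """
    chunk_names = list(chunk_names)
    parent = {c: c for c in chunk_names}

    def find(x):
        # Follow parent links to the root, halving the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in linked_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = {}
    for c in chunk_names:
        groups.setdefault(find(c), []).append(c)
    return sorted(sorted(g) for g in groups.values())
```

Each returned group would then be loaded and processed as a single chunk, while singleton groups can be processed independently.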