Deal with hash tables being too big #36
Good point. I think that this dovetails with your desire for alternate …

On Fri, Mar 22, 2013 at 2:59 AM, C. Titus Brown [email protected] wrote: …

-- Eric McDonald
Referencing issue #27 from here, since I think that there is a tie-in.
With our new … thoughts?
I like the second option.
@betatim asks: …
No, not in the general case.
But we can establish a lower bound :)
I think lower bound is just fancy speak for lower bound ;) Should we add a script that will peek at your file and make a recommendation for your table size and number of tables?
@betatim it is not possible to do this usefully for most data sets, unfortunately. Hence the idea for the first option: "I see you've got ~X k-mers and this is going to cause problems with your current settings, how about...?" We already have the code in the false-positive-rate evaluation at the end of digital normalization and load-into-counting.
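As an illustration only, here is a minimal sketch of what such an advisory check could look like, using the standard occupancy-based collision model for a Count-Min-style counting table. The function names and the 2-billion-k-mer figure are made up for the example; this is not the actual khmer fp-rate code.

```python
import math

def estimated_fp_rate(n_unique_kmers, table_size, n_tables):
    """Estimate the false positive rate of a Count-Min-style counting table.

    With `n_tables` tables of `table_size` slots each, a novel k-mer is a
    false positive only if it collides in every table.  Per-table occupancy
    is approximated as 1 - exp(-N/T) for N distinct k-mers hashed into T slots.
    """
    occupancy = 1.0 - math.exp(-n_unique_kmers / table_size)
    return occupancy ** n_tables

def recommend_table_size(n_unique_kmers, n_tables=4, target_fp=0.1):
    """Smallest per-table size that keeps the estimated fp rate under target_fp."""
    # Invert the model: per-table occupancy must stay below target_fp ** (1/Z).
    max_occupancy = target_fp ** (1.0 / n_tables)
    return int(math.ceil(-n_unique_kmers / math.log(1.0 - max_occupancy)))

if __name__ == "__main__":
    n = 2_000_000_000  # suppose the file holds ~2 billion distinct k-mers
    size = recommend_table_size(n)
    print("try 4 tables of ~{:,} slots each (estimated fp rate {:.3f})".format(
        size, estimated_fp_rate(n, size, 4)))
```

Given an estimate of the number of distinct k-mers, the same model can be run in either direction: report the expected fp rate for the user's current settings, or suggest a table size that meets a target.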
Why not? Is there really such a strong ordering to reads in a file that can't be fixed by reading a few thousand here and a few thousand there from a file? |
On Wed, Jan 18, 2017 at 12:54:54PM -0800, Tim Head wrote:

> Why not? Is there really such a strong ordering to reads in a file that can't be fixed by reading a few thousand here and a few thousand there from a file?

Hmm... let's see. You're sampling from a (frequently uneven) distribution. There may be 20 reads out of 1 bn that are important and we want to keep. They may be the only low-abundance reads in there. How would we detect them without sampling 50 million (1/20th of 1 bn) reads?

A bit of a straw man, but the true numbers aren't necessarily that far off; e.g. we may have 10,000 reads at a coverage of 20 in 5 billion reads, for a soil sample.

The banding stuff that @standage is doing could be used for sampling an Nth of the data uniformly, from which we could infer the above distribution(s). But then we'd be computing across all the data. Hmm. Maybe worth it?
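To put a rough number on that point: the chance that a uniform subsample catches even one of a handful of rare reads can be estimated directly. A quick back-of-the-envelope sketch (the helper name is made up; the figures just mirror the 20-reads-in-1-bn example above):

```python
def p_hit_at_least_one(n_total, n_rare, n_sampled):
    """Chance that a uniform sample of n_sampled reads contains at least one
    of n_rare reads of interest, out of n_total reads.  Uses the
    approximation 1 - (1 - R/N)**n, which is fine when R << N."""
    return 1.0 - (1.0 - n_rare / n_total) ** n_sampled

# 20 important reads hidden among 1 billion:
print(p_hit_at_least_one(1_000_000_000, 20, 50_000_000))   # ~0.63
print(p_hit_at_least_one(1_000_000_000, 20, 150_000_000))  # ~0.95
```

Even a 1-in-20 subsample (50 million reads) only finds one of those 20 reads about 63% of the time, which is the crux of the objection to peeking at a small slice of the file.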
Jaron pointed out that for k=20 you don't need hash tables much larger than 500 GB total (the exact number to calculate needs to include palindromes for k=20), and, in fact, you don't need more than one hash table because the false positive rate is 0. We should figure out how to deal with this properly -- options are:
I think I like the last one the best.
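For reference, the ~500 GB figure from the issue body can be checked with a quick calculation: with one counter slot per canonical k-mer there are no collisions at all, so a single table suffices. A minimal sketch, assuming 1-byte counters (the helper name is made up):

```python
def exact_table_slots(k):
    """Number of canonical (strand-collapsed) DNA k-mers.

    For even k, palindromic k-mers equal their own reverse complement and
    are counted once, giving (4**k + 4**(k//2)) // 2; for odd k there are
    no palindromes and it is simply 4**k // 2.
    """
    if k % 2 == 0:
        return (4**k + 4**(k // 2)) // 2
    return 4**k // 2

slots = exact_table_slots(20)
print(slots)                                         # 549756338176 slots
print(slots / 2**30, "GiB at one byte per counter")  # ~512 GiB
```

So a single exact table for k=20 comes to roughly 512 GiB (~550 GB) at one byte per counter, consistent with the figure quoted above.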