Deal with hash tables being too big #36
Good point. I think that this dovetails with your desire for alternate …

On Fri, Mar 22, 2013 at 2:59 AM, C. Titus Brown [email protected] wrote: …

-- Eric McDonald
Referencing issue #27 from here, since I think that there is a tie-in.
With our new … thoughts?
I like the second option.
@betatim asks: …
No, not in the general case.
But we can establish a lower bound :)
I think lower bound is just fancy speak for lower bound ;) Should we add a script that will peek at your file and make a recommendation for your table size and number of tables?
@betatim it is not possible to do this usefully for most data sets, unfortunately. Hence the idea for the first option: "I see you've got ~X k-mers and this is going to cause problems with your current settings, how about...?" We already have the code in the false-positive-rate evaluation at the end of digital normalization and load-into-counting.
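As an illustration only, here is a minimal sketch of what such an advisory check could look like, using the standard occupancy-based collision model for a Count-Min-style counting table. The function names and the 2-billion-k-mer figure are made up for the example; this is not the actual khmer fp-rate code.

```python
import math

def estimated_fp_rate(n_unique_kmers, table_size, n_tables):
    """Estimate the false positive rate of a Count-Min-style counting table.

    With `n_tables` tables of `table_size` slots each, a novel k-mer is a
    false positive only if it collides in every table.  Per-table occupancy
    is approximated as 1 - exp(-N/T) for N distinct k-mers hashed into T slots.
    """
    occupancy = 1.0 - math.exp(-n_unique_kmers / table_size)
    return occupancy ** n_tables

def recommend_table_size(n_unique_kmers, n_tables=4, target_fp=0.1):
    """Smallest per-table size that keeps the estimated fp rate under target_fp."""
    # Invert the model: per-table occupancy must stay below target_fp ** (1/Z).
    max_occupancy = target_fp ** (1.0 / n_tables)
    return int(math.ceil(-n_unique_kmers / math.log(1.0 - max_occupancy)))

if __name__ == "__main__":
    n = 2_000_000_000  # suppose the file holds ~2 billion distinct k-mers
    size = recommend_table_size(n)
    print("try 4 tables of ~{:,} slots each (estimated fp rate {:.3f})".format(
        size, estimated_fp_rate(n, size, 4)))
```

Given an estimate of the number of distinct k-mers, the same model can be run in either direction: report the expected fp rate for the user's current settings, or suggest a table size that meets a target.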
Why not? Is there really such a strong ordering to reads in a file that can't be fixed by reading a few thousand here and a few thousand there from a file? |
On Wed, Jan 18, 2017 at 12:54:54PM -0800, Tim Head wrote:

> Why not? Is there really such a strong ordering to reads in a file that can't be fixed by reading a few thousand here and a few thousand there from a file?

Hmm... let's see. You're sampling from a (frequently uneven) distribution. There may be 20 reads out of 1 bn that are important and we want to keep. They may be the only low-abundance reads in there. How would we detect them without sampling 50 million (1/20th of 1 bn) reads?

A bit of a straw man, but the true numbers aren't necessarily that far off; e.g. we may have 10,000 reads at a coverage of 20 in 5 billion reads, for a soil sample.

The banding stuff that @standage is doing could be used for sampling an Nth of the data uniformly, from which we could infer the above distribution(s). But then we'd be computing across all the data. Hmm. Maybe worth it?
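To put a rough number on that point: the chance that a uniform subsample catches even one of a handful of rare reads can be estimated directly. A quick back-of-the-envelope sketch (the helper name is made up; the figures just mirror the 20-reads-in-1-bn example above):

```python
def p_hit_at_least_one(n_total, n_rare, n_sampled):
    """Chance that a uniform sample of n_sampled reads contains at least one
    of n_rare reads of interest, out of n_total reads.  Uses the
    approximation 1 - (1 - R/N)**n, which is fine when R << N."""
    return 1.0 - (1.0 - n_rare / n_total) ** n_sampled

# 20 important reads hidden among 1 billion:
print(p_hit_at_least_one(1_000_000_000, 20, 50_000_000))   # ~0.63
print(p_hit_at_least_one(1_000_000_000, 20, 150_000_000))  # ~0.95
```

Even a 1-in-20 subsample (50 million reads) only finds one of those 20 reads about 63% of the time, which is the crux of the objection to peeking at a small slice of the file.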
Jaron pointed out that for k=20 you don't need hash tables much larger than 500 GB total (the exact number to calculate needs to include palindromes for k=20), and, in fact, you don't need more than one hash table because the false positive rate is 0. We should figure out how to deal with this properly -- options are:
I think I like the last one the best.
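For reference, the ~500 GB figure from the issue body can be checked with a quick calculation: with one counter slot per canonical k-mer there are no collisions at all, so a single table suffices. A minimal sketch, assuming 1-byte counters (the helper name is made up):

```python
def exact_table_slots(k):
    """Number of canonical (strand-collapsed) DNA k-mers.

    For even k, palindromic k-mers equal their own reverse complement and
    are counted once, giving (4**k + 4**(k//2)) // 2; for odd k there are
    no palindromes and it is simply 4**k // 2.
    """
    if k % 2 == 0:
        return (4**k + 4**(k // 2)) // 2
    return 4**k // 2

slots = exact_table_slots(20)
print(slots)                                         # 549756338176 slots
print(slots / 2**30, "GiB at one byte per counter")  # ~512 GiB
```

So a single exact table for k=20 comes to roughly 512 GiB (~550 GB) at one byte per counter, consistent with the figure quoted above.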