
Deal with hash tables being too big #36

Open
ctb opened this issue Mar 22, 2013 · 11 comments

Comments

@ctb
Member

ctb commented Mar 22, 2013

Jaron pointed out that for k=20, you don't need hash tables much larger than 500 GB total (the exact figure needs to account for palindromes at k=20), and, in fact, you don't need more than one hash table because the false positive rate is zero. We should figure out how to deal with this properly -- options are

  • notify user and keep going
  • notify user and die
  • resize hash table down/modify parameters accordingly, notify user

I think I like the last the best.
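For reference, a rough sketch of the arithmetic behind the ~500 GB figure above; the one-byte-per-counter layout is an assumption for illustration, not necessarily how every table is configured:

```python
# Rough arithmetic behind the "~500 GB" figure (one byte per counter is
# an assumption for illustration).
def max_canonical_kmers(k):
    """Distinct canonical DNA k-mers: each k-mer is paired with its reverse
    complement, except the 4**(k//2) reverse-complement palindromes that
    exist when k is even."""
    palindromes = 4 ** (k // 2) if k % 2 == 0 else 0
    return (4 ** k + palindromes) // 2

k = 20
n = max_canonical_kmers(k)
print(f"k={k}: {n:,} canonical k-mers")                            # 549,756,338,176
print(f"exact table: ~{n / 2**30:.0f} GiB at 1 byte per counter")  # ~512 GiB
```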

@emcd
Contributor

emcd commented Mar 22, 2013

Good point. I think that this dovetails with your desire for alternate hashers in some ways, because we can squeeze out some additional efficiency by performing an exact count and not even using the [worthless] modulus in this case. (We would have to ensure that the output file format from exact counting was the same, though.)
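A minimal sketch of what exact counting without the modulus could look like, assuming a 2-bit k-mer encoding and one-byte counters; names here are illustrative, not khmer's API:

```python
# Minimal sketch of "exact counting, no modulus": once the table has at
# least 4**k slots, the 2-bit encoding of a canonical k-mer can index
# the counter array directly, so there is no hashing collision and the
# count is exact.
import numpy as np

ENCODE = {"A": 0, "C": 1, "G": 2, "T": 3}
RC = str.maketrans("ACGT", "TGCA")

def canonical(kmer):
    """Lexicographically smaller of a k-mer and its reverse complement."""
    return min(kmer, kmer.translate(RC)[::-1])

def kmer_index(kmer):
    """2-bit encode the canonical k-mer; the result is already < 4**k."""
    code = 0
    for base in canonical(kmer):
        code = (code << 2) | ENCODE[base]
    return code

k = 8                                        # small k so the array fits easily
counts = np.zeros(4 ** k, dtype=np.uint8)    # one byte per counter, exact

def add(kmer):
    idx = kmer_index(kmer)
    if counts[idx] < 255:                    # saturate like a 1-byte counter
        counts[idx] += 1

add("ACGTACGT")
add("ACGTACGT")
print(counts[kmer_index("ACGTACGT")])        # 2
```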

@emcd
Contributor

emcd commented Apr 5, 2013

Referencing issue #27 from here, since I think that there is a tie-in.

@mr-c
Contributor

mr-c commented Oct 1, 2014

This would be a great enhancement on top of #390 / #347.

@ctb
Member Author

ctb commented Dec 21, 2016

With our new -M and -U options, a lot of the bigger issues are dealt with; the two remaining items that pop up in my head are these:

  • we could recommend new -M settings based on the # of k-mers actually seen when loading in a data set, so that if people wanted to optimize memory usage they could;
  • we could be calculating the max # of possible k-mers for a given k and downsizing the table memory sizes appropriately;

thoughts?
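A sketch of what the second bullet above could look like; the one-byte-per-counter figure and the function name are assumptions for illustration, not an existing khmer interface:

```python
# Sketch of the second bullet: clamp the requested table memory to what
# the k-mer space for this k can actually need.
def clamp_table_memory(requested_bytes, k, bytes_per_counter=1):
    max_kmers = (4 ** k + (4 ** (k // 2) if k % 2 == 0 else 0)) // 2
    needed = max_kmers * bytes_per_counter
    if requested_bytes > needed:
        print(f"note: k={k} has at most {max_kmers:,} canonical k-mers; "
              f"shrinking tables from {requested_bytes:,} to {needed:,} bytes")
        return needed
    return requested_bytes

# e.g. a user asks for 1 TB of tables with k=20:
clamp_table_memory(10 ** 12, k=20)
```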

@betatim
Member

betatim commented Dec 22, 2016

I like the second option.

@ctb
Member Author

ctb commented Dec 22, 2016

@betatim asks:

is there a way to look at a subset of the input and guesstimate a likely number of unique kmers?

no, not in the general case.

@ctb
Member Author

ctb commented Dec 22, 2016

but we can establish a lower bound :)

@betatim
Member

betatim commented Jan 9, 2017

I think lower bound is just fancy speak for lower bound ;)

Should we add a script that will peek at your file and make a recommendation for your table size and number of tables?
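For what it's worth, a sketch of such a peek script; the sample size is arbitrary, screed's record.sequence interface is assumed, and the count it produces is only a lower bound on the whole file:

```python
# Sketch of a "peek" script: exactly count the distinct canonical k-mers
# in the first few thousand reads of a file.
import itertools
import screed

RC = str.maketrans("ACGT", "TGCA")

def canonical(kmer):
    return min(kmer, kmer.translate(RC)[::-1])

def peek_unique_kmers(filename, k=20, n_reads=10_000):
    seen = set()
    for record in itertools.islice(screed.open(filename), n_reads):
        seq = record.sequence.upper()
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if set(kmer) <= set("ACGT"):      # skip k-mers containing Ns etc.
                seen.add(canonical(kmer))
    return len(seen)

# lower_bound = peek_unique_kmers("reads.fastq.gz", k=20)
# print(f"at least {lower_bound:,} distinct k-mers in this data set")
```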

@ctb
Member Author

ctb commented Jan 15, 2017

@betatim it is not possible to do this usefully for most data sets, unfortunately. Hence the idea for the first option - "I see you've got ~X k-mers and this is going to cause problems with your current settings, how about...?" We already have the code, in the fp-rate evaluation at the end of digital normalization and load-into-counting.
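A sketch of that warning, using a standard Bloom-filter-style estimate of the false positive rate; the exact formula the existing fp-rate report uses may differ, and the 0.2 threshold and message wording here are assumptions:

```python
# Sketch of the "I see you've got ~X k-mers..." check.  Standard
# Bloom-filter-style estimate: with n_tables tables of roughly
# table_size slots each and n_unique distinct k-mers,
#     fp ~ (1 - exp(-n_unique / table_size)) ** n_tables
import math

def recommend_settings(n_unique, table_size, n_tables, max_fp=0.2):
    fp = (1.0 - math.exp(-n_unique / table_size)) ** n_tables
    if fp <= max_fp:
        return f"estimated fp rate {fp:.3f}; current settings look fine"
    # smallest per-table size keeping fp under max_fp with this many tables
    needed = math.ceil(-n_unique / math.log(1.0 - max_fp ** (1.0 / n_tables)))
    return (f"estimated fp rate {fp:.3f} is high for ~{n_unique:,} k-mers; "
            f"consider tables of at least {needed:,} slots each")

print(recommend_settings(n_unique=5_000_000_000,
                         table_size=1_000_000_000,
                         n_tables=4))
```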

@betatim
Member

betatim commented Jan 18, 2017

Why not? Is there really such a strong ordering to the reads in a file that it can't be fixed by reading a few thousand here and a few thousand there?

@ctb
Member Author

ctb commented Jan 18, 2017 via email
