Faster kmer hashing #71
Duplicate of #47.
The Bob Jenkins hash works fine in Cortex. BooPHF and related tools use a minimal perfect hash for much bigger sets and are fast.
Copied over the post from the duplicate:
Closing duplicate #47.
The hash function we currently use is here: […]
Hello, just restarting this discussion, since there is a new interesting related paper, and it is also a nice way to start looking around the gramtools code & issues.

I guess https://github.com/rob-p/BooM is a nice option: it makes BBHash an associative index. The problem is that it stores all the keys in a […].

There is also now the option of using BLight (https://github.com/Malfoy/Blight, https://www.biorxiv.org/content/10.1101/546309v2), which wraps BBHash in a more lightweight associative index, but still increases memory usage to ~28 bits per k-mer. Good to keep this in mind, but I guess it might be better to use the FM-index as backup. I guess it could also translate into faster queries than the current implementation, which uses […].

So, I'd vote for a custom BBHash wrapper, similar to https://github.com/rob-p/BooM, but backed up by the (already constructed) gramtools FM-index to filter out false positives. Do you see any issue or better options than this one?
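To make that concrete, here is a minimal C++ sketch of such a wrapper. It assumes a BBHash-like `Mphf` type exposing a `lookup(key)` method and an `FmIndex` type exposing a `contains(kmer)` membership query; both interfaces and the class itself are illustrative, not the actual gramtools or BBHash APIs.

```cpp
#include <cstdint>
#include <utility>
#include <vector>

// Illustrative sketch (not gramtools code): keep only an MPHF plus the
// per-k-mer payloads, and use the already constructed FM-index to reject
// the false positives that a minimal perfect hash produces for keys that
// were never inserted.
template <typename Mphf, typename FmIndex, typename Value>
class KmerMap {
public:
    KmerMap(Mphf mphf, const FmIndex& fm, std::vector<Value> values)
        : mphf_(std::move(mphf)), fm_(fm), values_(std::move(values)) {}

    // Returns nullptr when the k-mer is not in the indexed set.
    const Value* find(uint64_t kmer) const {
        // An MPHF maps foreign keys to arbitrary slots, so membership has
        // to be confirmed (here via the FM-index) before trusting the slot.
        if (!fm_.contains(kmer))
            return nullptr;
        auto slot = mphf_.lookup(kmer);  // assumed BBHash-style lookup
        return &values_[slot];
    }

private:
    Mphf mphf_;
    const FmIndex& fm_;
    std::vector<Value> values_;
};
```

The appeal is that only the MPHF (~3 bits/k-mer) and the value array are added on top of structures gramtools already builds; the FM-index is queried only to absorb false positives.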
A pair perfect hash function, linking two perfect hashes of the same data built with different base hash functions, might provide a way to get a set with membership queries in a fixed number of bits per entry regardless of key length. It would have a theoretically low false positive rate assuming random lookups. The construction begins by building two minimal perfect hash tables, each over a different hash of the keys. Then we link the tables together by recording, for each entry of one of them, which entry it maps to in the other. The total structure should take […].
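A rough C++ sketch of that construction, assuming generic `Mphf1`/`Mphf2` types (for example two BBHash builds seeded differently) that expose a `lookup(key)` method; the interface and the class are hypothetical.

```cpp
#include <cstdint>
#include <utility>
#include <vector>

// Illustrative paired-MPHF set: two minimal perfect hash functions over the
// same keys, linked by recording, for each slot of the first, the slot the
// same key occupies in the second.
template <typename Mphf1, typename Mphf2>
class PairedMphfSet {
public:
    template <typename KeyRange>
    PairedMphfSet(Mphf1 f, Mphf2 g, const KeyRange& keys)
        : f_(std::move(f)), g_(std::move(g)), link_(keys.size()) {
        // ~log2(N) bits per entry if bit-packed; plain uint64_t here for clarity.
        for (const auto& k : keys)
            link_[f_.lookup(k)] = g_.lookup(k);
    }

    // Keys in the set always match; a random foreign key matches with
    // probability about 1/N, which is the false positive rate mentioned above.
    bool maybe_contains(uint64_t key) const {
        auto i = f_.lookup(key);
        if (i >= link_.size())
            return false;                // some MPHFs already flag foreign keys
        return link_[i] == g_.lookup(key);
    }

private:
    Mphf1 f_;
    Mphf2 g_;
    std::vector<uint64_t> link_;
};
```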
Also, definitely test out https://github.com/skarupke/flat_hash_map and https://github.com/greg7mdp/sparsepp. I'd love to hear about any other alternatives. In my application I am mostly limited by lookup speed, and can handle a larger construction cost. That said, I found perfect hashing to be a little slower than I'd expected. If you find or do any benchmarks, I'd be interested.
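For what it's worth, a small harness along those lines, assuming the header names and namespaces documented in the two repositories' READMEs (`ska::flat_hash_map` from skarupke/flat_hash_map, `spp::sparse_hash_map` from greg7mdp/sparsepp); both are meant as drop-in replacements for `std::unordered_map`.

```cpp
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <iostream>
#include <random>
#include <unordered_map>

#include "flat_hash_map.hpp"  // skarupke/flat_hash_map (assumed header name)
#include <sparsepp/spp.h>     // greg7mdp/sparsepp (assumed header name)

// Time n random 64-bit-key inserts into any unordered_map-like container.
template <typename Map>
double insert_seconds(std::size_t n) {
    std::mt19937_64 rng(42);
    Map map;
    auto start = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < n; ++i)
        map[rng()] = static_cast<uint32_t>(i);
    std::chrono::duration<double> elapsed = std::chrono::steady_clock::now() - start;
    return elapsed.count();
}

int main() {
    const std::size_t n = 60000000;  // roughly the human-PRG k-mer count
    std::cout << "std::unordered_map:   "
              << insert_seconds<std::unordered_map<uint64_t, uint32_t>>(n) << " s\n"
              << "ska::flat_hash_map:   "
              << insert_seconds<ska::flat_hash_map<uint64_t, uint32_t>>(n) << " s\n"
              << "spp::sparse_hash_map: "
              << insert_seconds<spp::sparse_hash_map<uint64_t, uint32_t>>(n) << " s\n";
}
```

Lookup-heavy workloads would need a separate loop over `find()`, since insert and lookup performance can rank the tables differently.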
Indeed something interesting to test. I would guess queries would be very fast with this. I think that in real datasets N should be large, so a false positive rate of 1/N is fine. However, in the example discussed above:
this would take 251 MB (using […]).
Thanks for the suggestions! I shall take a look at these. I will keep you posted in case of any benchmarks. Please feel free to share more about hashing! Thanks!
Update: I had in mind that https://github.com/rob-p/BooM was not an associative index wrapper for BBHash, but it is... edited #71 (comment) accordingly.
The pairPMHF wouldn't be too useful when storing 13-mers, because the keys are smaller than the mapping that is needed between the PMHFs. But if the keys are very big then the savings would become significant; I don't know what k-mer size that would have to be. Maybe all the 71-mers in a read set would be an interesting thing to store in a structure like this. The n log n factor could be tuned to a given false positive rate: for instance, if you could accept an FP rate of 1/256 then you'd need only 8 bits per entry.
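The arithmetic behind that last figure, assuming uniformly distributed foreign keys:

```latex
b = \lceil \log_2(1/\varepsilon) \rceil
\qquad\text{e.g.}\qquad
\varepsilon = 1/256 \;\Rightarrow\; b = \log_2 256 = 8 \text{ bits per entry},
\qquad
\text{total} \approx n \cdot b \text{ bits, plus a few bits per key for the MPHFs.}
```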
Ah, so the quasi dictionary is sufficient. What a silly idea I had. A checksum or other hash into any small number of bits would provide the same guarantees. This checksum could be made from another PMHF, which is interesting but maybe overly complicated.
Hello! Sorry for the delay in answering! It is nice to have the pairMPHF in mind, and a checksum from another MPHF could yield some interesting results; I guess it can be useful in some situations.

I thought about using a classical Bloom filter (BF) together with an MPHF to get membership queries with low FPs, but the Quasi Dictionary (QD) uses less memory than a classical Bloom filter with the same FP rate, and it is probably faster, since only one hash is needed.

I guess the lightest structure would still be an MPHF backed up by the FM-index (which is already constructed). This would take only ~3 additional bits/k-mer. The problem is that filtering out FPs requires […].

The QD is probably faster than the previous approach, at the cost of more space (~(3 + f) bits/k-mer for an FP rate of 1/2^f; sketched below). Other structures to check (probably just a benchmark is going to give us concrete answers): […]
I shall keep searching for additional alternatives...
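As referenced above, here is a minimal sketch of that MPHF-plus-fingerprint layout, assuming a generic `Mphf` with a `lookup(key)` method; the fingerprint mix and the class are illustrative, not the actual Quasi Dictionary implementation.

```cpp
#include <cstdint>
#include <utility>
#include <vector>

// Illustrative quasi-dictionary-style filter: a minimal perfect hash
// (~3 bits/k-mer for BBHash-like constructions) plus an f-bit fingerprint
// per slot, i.e. roughly (3 + f) bits/k-mer with a false positive rate of
// about 1/2^f.
template <typename Mphf>
class FingerprintedSet {
public:
    template <typename KeyRange>
    FingerprintedSet(Mphf mphf, const KeyRange& keys, int f)
        : mphf_(std::move(mphf)), f_(f), fp_(keys.size()) {
        for (const auto& k : keys)
            fp_[mphf_.lookup(k)] = fingerprint(k);  // f bits stored per slot
    }

    bool maybe_contains(uint64_t key) const {
        auto slot = mphf_.lookup(key);
        if (slot >= fp_.size())
            return false;                         // flagged as foreign by the MPHF
        return fp_[slot] == fingerprint(key);     // wrong with probability ~1/2^f
    }

private:
    uint64_t fingerprint(uint64_t key) const {
        // Any hash independent of the MPHF, truncated to f bits, works as
        // the checksum; a multiplicative mix is used here as a placeholder.
        return (key * 0x9E3779B97F4A7C15ULL) >> (64 - f_);
    }

    Mphf mphf_;
    int f_;
    std::vector<uint64_t> fp_;  // would be bit-packed to f bits in practice
};
```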
I look forward to chatting about this when I get back next week.
This was first mentioned on Slack by Sorina: https://iqbalgroup.slack.com/archives/C5MANUB17/p1504131606000222
This link shows benchmarking of different hash functions for random and sequential integer inserts:
http://incise.org/hash-table-benchmarks.html
Gramtools currently uses the Boost hash function. The benchmarking results show that this is the slowest algorithm once more than 20 million integers are inserted. When handling the human PRG, gramtools attempts to index 60 million k-mers.
Several other libraries provide hash functions that are apparently faster than our current implementation.
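As one illustration of what a swap could look like (a sketch, not the gramtools code): a 2-bit-packed k-mer fits in a 64-bit integer for k ≤ 32, so the table's hash can be a cheap integer mixer supplied through the standard `Hash` template parameter of whichever map is chosen.

```cpp
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

// Hash a 2-bit-packed k-mer with a splitmix64-style finalizer; cheaper than
// a byte-wise hash over the k-mer string and usable by any container that
// follows the std::unordered_map Hash requirements.
struct KmerHash {
    std::size_t operator()(uint64_t kmer) const noexcept {
        uint64_t z = kmer + 0x9E3779B97F4A7C15ULL;
        z = (z ^ (z >> 30)) * 0xBF58476D1CE4E5B9ULL;
        z = (z ^ (z >> 27)) * 0x94D049BB133111EBULL;
        return static_cast<std::size_t>(z ^ (z >> 31));
    }
};

// Drop-in use; a faster open-addressing table (see the benchmarks linked above)
// could replace std::unordered_map with the same template arguments.
using KmerTable = std::unordered_map<uint64_t, std::vector<uint32_t>, KmerHash>;
```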