Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Other Hash Functions #26

Open
bovee opened this issue Sep 6, 2018 · 9 comments
Open

Other Hash Functions #26

bovee opened this issue Sep 6, 2018 · 9 comments

Comments

@bovee
Copy link
Contributor

bovee commented Sep 6, 2018

Idea here.

Murmurhash is fast, but it would potentially be faster to use a hashing function (like ntHash ) that doesn't require a full recomputation on each new k-mer.

Unfortunately, to use ntHash itself we'd either need C++ bindings or a translation into Rust.

@bovee
Copy link
Contributor Author

bovee commented Sep 6, 2018

RepHash is another rolling hashing function worth taking a look at.

@bovee
Copy link
Contributor Author

bovee commented Sep 12, 2018

See bcgsc/ntHash#7 for a discussion of a Rust implemention of ntHash.

@luizirber
Copy link
Contributor

Now that I figure out what version of ntHash I'm comparing to, I'll finish the implementation (with a proper Iterator trait). I can't really build a Hasher because it is more for traditional checksum hash functions (where you accumulate changes with write and then finish the hash)

@mohamadi
Copy link

@bovee Have you seen ntHashIterator class in ntHash lib? This is a C++ wrapper over ntHash to iterate on a sequence.

@mohamadi
Copy link

@luizirber and @bovee please let me what a C++ binding for ntHash should look like. Do you mean C++ version of ntHash instead of C version? or something like ntHashIterator would work for you? I can also help with Rust translation.

@luizirber
Copy link
Contributor

I think this is ready for testing: https://github.com/luizirber/nthash

@mohamadi
Copy link

mohamadi commented Sep 13, 2018

@luizirber nice work! just a quick note here in the Rust implementation. Could it be possible to replace both match with a lookup array. I think match is similar to C/C++ switch which is about 10-20% slower than a lookup array according to our past experience in ABySS and other tools. It could be 256 entry-table similar to what we have as seedTab in nthash.hpp, or even we can use a compressed version of 16-entry seedTab with few extra operations.

@mohamadi
Copy link

@bovee @luizirber How do you handle non-ACGT? Just wanted to mention there are specific functions in nthash.hpp for this purpose.

@bovee
Copy link
Contributor Author

bovee commented Sep 19, 2018

@mohamadi The current behavior is to panic on non-ACGTN, but in the (very preliminary) pull request I put together (luizirber/nthash#2) using a lookup array, I treat every non-ACGT as an N (i.e. zero). I think panicing is probably a better default for most use cases though?

(Also, it'd be appreciated if we could move the conversation over to @luizirber 's ntHash repo)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants