Locality sensitive false positives #16

ekg · 2019-07-25T09:02:25Z

I would like to encourage false positives to occur for inputs that are near (in some comparative sense) members of the set that was used to build the PMHF.

Would the construction algorithm already encourage this to happen for some cases? I haven't tested but perhaps you have thought about this.

If not, what adjustment should I consider making to encourage locality sensitive false positives?

ekg · 2019-07-25T09:25:15Z

Would it make sense to just change the hash functions used in the construction of the pmhf? If these tend to be locality sensitive then this would tend to drive false positives along the same path through the function.

rchikhi · 2019-07-25T14:45:58Z

Yes, I believe so that changing the hash function to a LSH one could be fruitful. More importantly, having a benchmark that determines if such a change worked effectively seems like an important component too..

ekg · 2019-07-25T15:35:35Z

A metric to optimize would be the edit distance between a query and the kmer it is hashed to "flasely".

We want to be more likely to hash to something when the edit distance is lower. That's another metric to check.

To drive this, we could use an actual set of kmers and all edits of them up to some number of substitutions and indels.

rchikhi · 2019-07-25T15:41:35Z

Oh if you're hashing kmers then yes, that sounds all good. The LSH could then just be the binary value of the canonical l-minimizer of the kmer, where l is picked such that [0;4^l] is the hash range, e.g. for a 32 bits hash range, l=17. I'm sure there are better LSH's though.

ekg · 2019-07-26T08:21:06Z

Could I just put this kind of l-minimizer hash in as the custom hash function? What else might I need to adjust?

rchikhi · 2019-07-26T08:23:33Z

That's probably it. You can have a look at https://github.com/rizkg/BBHash/blob/master/example_custom_hash.cpp which shows how to use a custom hash

rchikhi · 2019-10-01T08:28:39Z

So I took a stab at implementing this. I generated some random 31-mers. The hash function inside BBHash is now the hash of their l-mer minimizer. Then for each inserted element X, I counted the number of times BBHash returns the same index for X as for an element at hamming distance 1 of X (dubbed a "desired collision").

Out of 930000 queries	Number of desired collisions
With regular BBhash hashing (no minimizer)	82
minimizer-based hash, l=6	4615
l=10	426300
l=15	364356
l=20	248890

Code: https://gist.github.com/rchikhi/7acb1c5f23c4ad7f3641775f69d28cf8

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Locality sensitive false positives #16

Locality sensitive false positives #16

ekg commented Jul 25, 2019

ekg commented Jul 25, 2019

rchikhi commented Jul 25, 2019

ekg commented Jul 25, 2019

rchikhi commented Jul 25, 2019 •

edited

Loading

ekg commented Jul 26, 2019

rchikhi commented Jul 26, 2019

rchikhi commented Oct 1, 2019 •

edited

Loading

Locality sensitive false positives #16

Locality sensitive false positives #16

Comments

ekg commented Jul 25, 2019

ekg commented Jul 25, 2019

rchikhi commented Jul 25, 2019

ekg commented Jul 25, 2019

rchikhi commented Jul 25, 2019 • edited Loading

ekg commented Jul 26, 2019

rchikhi commented Jul 26, 2019

rchikhi commented Oct 1, 2019 • edited Loading

rchikhi commented Jul 25, 2019 •

edited

Loading

rchikhi commented Oct 1, 2019 •

edited

Loading