
Request: Bloom Filter for ssdeep #21

Open
rjzak opened this issue Jul 16, 2019 · 3 comments

Comments

rjzak commented Jul 16, 2019

When processing many files for similarity, it would be helpful to have something like a Bloom filter, but for ssdeep: an algorithm that could answer "have I seen a file like this before?" without having to store and check every prior ssdeep hash.
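One way the request could be realized is a minimal sketch, assuming the standard ssdeep hash format `blocksize:chunk:double_chunk` and the (known) property that similar inputs tend to share 7-character substrings of the base64 chunk parts: insert every 7-gram of each seen hash into a Bloom filter, then report a *possible* prior match when any 7-gram of a new hash is present. The class name, bit-array size, and hash count below are illustrative choices, not anything that exists in ssdeep:

```python
import hashlib

class SsdeepBloom:
    """Sketch: Bloom filter over 7-grams of ssdeep chunk hashes.

    A hit means "a hash sharing this 7-gram was *probably* seen before"
    (Bloom filters allow false positives, never false negatives for
    exact 7-gram overlap). Sizes are illustrative, not tuned."""

    def __init__(self, size_bits=1 << 20, num_hashes=4):
        self.size = size_bits
        self.k = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, ngram):
        # Derive k bit positions from salted BLAKE2b digests.
        for i in range(self.k):
            d = hashlib.blake2b(ngram.encode(),
                                salt=i.to_bytes(8, "little")).digest()
            yield int.from_bytes(d[:8], "little") % self.size

    @staticmethod
    def _ngrams(ssdeep_hash, n=7):
        # ssdeep format: "blocksize:chunk:double_chunk"
        _, chunk, double_chunk = ssdeep_hash.split(":")
        for part in (chunk, double_chunk):
            for i in range(max(0, len(part) - n + 1)):
                yield part[i:i + n]

    def add(self, ssdeep_hash):
        for g in self._ngrams(ssdeep_hash):
            for p in self._positions(g):
                self.bits[p // 8] |= 1 << (p % 8)

    def maybe_similar(self, ssdeep_hash):
        # True if any 7-gram of this hash was (probably) seen before.
        for g in self._ngrams(ssdeep_hash):
            if all(self.bits[p // 8] >> (p % 8) & 1
                   for p in self._positions(g)):
                return True
        return False
```

Memory stays fixed (here 128 KiB) regardless of how many hashes are added, which is the trade the request asks for: a positive answer only says a full comparison is worth running, not that the files actually match.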

a4lg (Collaborator) commented Jul 17, 2019

Hi, thanks for your feedback!

I am interested in a Bloom filter-like feature for ssdeep hashes in general, but not so much in implementing it inside the ssdeep software itself (the requirements vary too much with scale).

To my mind, it seems reasonable to:

  1. implement helper functions in the libfuzzy library for better indexing, and
  2. document some of ssdeep's internals.

As a huge ssdeep fan, I once made the ssdeep-compatible library ffuzzypp and the parallel clustering tool fast-ssdeep-clus (now open sourced at https://github.com/a4lg/fast-ssdeep-clus) before I became an ssdeep maintainer. It is fairly simple (it does not use any Bloom filters) but can build a full cluster list from 30M raw ssdeep hashes in only a few days (on a 2x10-core server), and adding further hashes to an existing cluster list takes much less time than that.

So I think adding features that help advanced users build efficient indexes is reasonable (and practical).
Could you let me know your thoughts on this option?

rjzak (Author) commented Jul 17, 2019

I was thinking of a solution that wouldn't necessarily use Bloom filters but would replicate some of their functionality: a low-memory algorithm that could indicate, with some degree of error, whether a given ssdeep hash is similar to anything it has seen before, without storing all the prior examples to make that determination. It seems such a solution could process the 30M examples faster than a few days (just a guess). The proposed clustering idea still appears to require storing all previously observed hashes, so it doesn't fit my use case. I'm also working with malware, but would like to de-duplicate the collection without an O(N^2) pairwise runtime.

tohihuty commented Oct 7, 2019

At least it is possible to drastically reduce the number of hash comparisons by pre-filtering, as described at https://www.virusbulletin.com/virusbulletin/2015/11/optimizing-ssdeep-use-scale/.
My C++ implementation of a tailored hash table and comparison function processes 250k hashes in 2 seconds (single-threaded). If you only need to compare some new hashes against a big database, it is even faster (200 ms) and less memory-consuming, because the hash table is filled only with the n-grams of the new hashes.
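The pre-filtering step described above can be sketched as follows. The actual implementation mentioned is C++; this is an illustrative Python version under the assumption (from the linked article) that two ssdeep hashes can only score nonzero if their chunk parts share at least one 7-character substring. Every hash is indexed by its 7-grams, and the expensive full comparison would then run only on pairs sharing a bucket:

```python
from collections import defaultdict

def ngrams(ssdeep_hash, n=7):
    # ssdeep format: "blocksize:chunk:double_chunk".
    # Hashes that ssdeep scores > 0 share at least one 7-gram.
    _, chunk, double_chunk = ssdeep_hash.split(":")
    for part in (chunk, double_chunk):
        for i in range(len(part) - n + 1):
            yield part[i:i + n]

def candidate_pairs(hashes, n=7):
    """Index every hash by its 7-grams, then emit only the pairs of
    indices that share at least one 7-gram. The full ssdeep comparison
    (e.g. via a libfuzzy binding) would then run on these candidates
    instead of on all N*(N-1)/2 pairs."""
    index = defaultdict(set)
    for idx, h in enumerate(hashes):
        for g in ngrams(h, n):
            index[g].add(idx)
    pairs = set()
    for bucket in index.values():
        ordered = sorted(bucket)
        for a in range(len(ordered)):
            for b in range(a + 1, len(ordered)):
                pairs.add((ordered[a], ordered[b]))
    return pairs
```

For comparing a small set of new hashes against a large database, the same idea applies with the table built only from the new hashes' n-grams, which is what keeps the 200 ms / low-memory case described above cheap.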
