
Request: Bloom Filter for ssdeep #21

Open
rjzak opened this issue Jul 16, 2019 · 3 comments

Comments

rjzak commented Jul 16, 2019

When processing many files for similarity, it would be helpful to have something like a Bloom filter, but for ssdeep: an algorithm that could answer "have I seen a file like this before?" without having to store and check every prior ssdeep hash.
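One way the request could be realized is a minimal sketch, assuming the standard ssdeep hash format `blocksize:chunk:double_chunk` and the (known) property that similar inputs tend to share 7-character substrings of the base64 chunk parts: insert every 7-gram of each seen hash into a Bloom filter, then report a *possible* prior match when any 7-gram of a new hash is present. The class name, bit-array size, and hash count below are illustrative choices, not anything that exists in ssdeep:

```python
import hashlib

class SsdeepBloom:
    """Sketch: Bloom filter over 7-grams of ssdeep chunk hashes.

    A hit means "a hash sharing this 7-gram was *probably* seen before"
    (Bloom filters allow false positives, never false negatives for
    exact 7-gram overlap). Sizes are illustrative, not tuned."""

    def __init__(self, size_bits=1 << 20, num_hashes=4):
        self.size = size_bits
        self.k = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, ngram):
        # Derive k bit positions from salted BLAKE2b digests.
        for i in range(self.k):
            d = hashlib.blake2b(ngram.encode(),
                                salt=i.to_bytes(8, "little")).digest()
            yield int.from_bytes(d[:8], "little") % self.size

    @staticmethod
    def _ngrams(ssdeep_hash, n=7):
        # ssdeep format: "blocksize:chunk:double_chunk"
        _, chunk, double_chunk = ssdeep_hash.split(":")
        for part in (chunk, double_chunk):
            for i in range(max(0, len(part) - n + 1)):
                yield part[i:i + n]

    def add(self, ssdeep_hash):
        for g in self._ngrams(ssdeep_hash):
            for p in self._positions(g):
                self.bits[p // 8] |= 1 << (p % 8)

    def maybe_similar(self, ssdeep_hash):
        # True if any 7-gram of this hash was (probably) seen before.
        for g in self._ngrams(ssdeep_hash):
            if all(self.bits[p // 8] >> (p % 8) & 1
                   for p in self._positions(g)):
                return True
        return False
```

Memory stays fixed (here 128 KiB) regardless of how many hashes are added, which is the trade the request asks for: a positive answer only says a full comparison is worth running, not that the files actually match.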

a4lg (Collaborator) commented Jul 17, 2019

Hi, thanks for your feedback!

I am interested in a Bloom filter-like feature for ssdeep hashes in general, but not so much in implementing it inside the ssdeep software itself (the requirements vary too much with scale).

To my mind, it seems reasonable to:

  1. implement helper functions in the libfuzzy library for better indexing, and
  2. document some of ssdeep's internals.

As a huge ssdeep fan, I once made the ssdeep-compatible library ffuzzypp and the parallel clustering tool fast-ssdeep-clus (now open sourced at https://github.com/a4lg/fast-ssdeep-clus) before I became an ssdeep maintainer. It is fairly simple (it does not use any Bloom filters) but can build a full cluster list from 30M raw ssdeep hashes in only a few days (on a 2x10-core server), and adding further hashes to an existing cluster list takes much less time than that.

So I think adding features that help advanced users build efficient indexes is reasonable (and practical).
Could you let me know your thoughts on this option?

rjzak (Author) commented Jul 17, 2019

I was thinking of a solution that wouldn't necessarily use Bloom filters but would replicate some of their functionality: a low-memory algorithm that could indicate, with some degree of error, whether a given ssdeep hash is similar to anything it has seen before, without storing all the prior examples to make that determination. It seems such a solution could process the 30M examples faster than a few days (just a guess). The proposed clustering idea still appears to require storing all previously observed hashes, so it doesn't fit my use case. I'm also working with malware, but would like to de-duplicate the collection without an O(N^2) pairwise runtime.

tohihuty commented Oct 7, 2019

At least it is possible to drastically reduce the number of hash comparisons by pre-filtering, as described at https://www.virusbulletin.com/virusbulletin/2015/11/optimizing-ssdeep-use-scale/.
My C++ implementation of a tailored hash table and comparison function processes 250k hashes in 2 seconds (single-threaded). If you only need to compare some new hashes against a big database, it is even faster (200 ms) and less memory-consuming, because the hash table is filled only with the n-grams of the new hashes.
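The pre-filtering step described above can be sketched as follows. The actual implementation mentioned is C++; this is an illustrative Python version under the assumption (from the linked article) that two ssdeep hashes can only score nonzero if their chunk parts share at least one 7-character substring. Every hash is indexed by its 7-grams, and the expensive full comparison would then run only on pairs sharing a bucket:

```python
from collections import defaultdict

def ngrams(ssdeep_hash, n=7):
    # ssdeep format: "blocksize:chunk:double_chunk".
    # Hashes that ssdeep scores > 0 share at least one 7-gram.
    _, chunk, double_chunk = ssdeep_hash.split(":")
    for part in (chunk, double_chunk):
        for i in range(len(part) - n + 1):
            yield part[i:i + n]

def candidate_pairs(hashes, n=7):
    """Index every hash by its 7-grams, then emit only the pairs of
    indices that share at least one 7-gram. The full ssdeep comparison
    (e.g. via a libfuzzy binding) would then run on these candidates
    instead of on all N*(N-1)/2 pairs."""
    index = defaultdict(set)
    for idx, h in enumerate(hashes):
        for g in ngrams(h, n):
            index[g].add(idx)
    pairs = set()
    for bucket in index.values():
        ordered = sorted(bucket)
        for a in range(len(ordered)):
            for b in range(a + 1, len(ordered)):
                pairs.add((ordered[a], ordered[b]))
    return pairs
```

For comparing a small set of new hashes against a large database, the same idea applies with the table built only from the new hashes' n-grams, which is what keeps the 200 ms / low-memory case described above cheap.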
