-
Notifications
You must be signed in to change notification settings - Fork 117
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Request: Bloom Filter for ssdeep #21
Comments
Hi, thanks for your feedback! I am interested in bloom filter-like feature for ssdeep hashes in general but not quite interested in implementing it in the ssdeep software (due to various demands that depend on the scale). To my mind, it seems that...
is reasonable. I, as a huge ssdeep fan, once made a ssdeep-compatible library ffuzzypp and parallel clustering tool fast-ssdeep-clus (now open sourced as https://github.com/a4lg/fast-ssdeep-clus) before I become a ssdeep maintainer. This is fairly simple (does not use any bloom filters) but can make full cluster list from 30M raw ssdeep hashes in only a few days (on a 2x10-core server) and adding some hashes to the cluster list would not take that time. So, I think making features to help advanced users make indexing efficient is reasonable (and plausible). |
I was thinking of a solution which wouldn't necessarily use bloom filters, but would replicate some of their functionality. Basically, having a low memory algorithm which could indicate with some degree of error if the provided ssdeep hash was similar to anything it's seen before, but without having to store all the prior examples to make this determination. To me, it seems such a solution would be able to process the 30M examples quicker than a few days (just a guess). The proposed clustering idea seems to still require storing all the previously observed hashes, and doesn't fit my use case. I'm also working with malware, but would like to de-duplicate the collection without having a runtime of O(N^N). |
At least its is possible to decrease the huge number of hash comparisons by pre-filtering as depicted at https://www.virusbulletin.com/virusbulletin/2015/11/optimizing-ssdeep-use-scale/. |
For times when processing several files for similarity, it would be helpful to have a bloom filter, but for SSDeep. Something where the algorithm could answer "have I seen a file like this before", but without having to store and check every single prior ssdeep hash.
The text was updated successfully, but these errors were encountered: