Performance: pg search is slow #165

Open
ghost opened this issue Sep 25, 2023 · 4 comments
Labels
performance Runtime Performance Improvements

Comments

@ghost

ghost commented Sep 25, 2023

Hi,

MalwareDB is great. However, when we tested file search with up to 10M files, a TLSH search took 10s.
I found that the TLSH authors have already published an indexing algorithm (https://tlsh.org/papers.html).

Is there a milestone for a better search index? Thanks!

@rjzak rjzak added this to MalwareDB Sep 25, 2023
@rjzak rjzak moved this to Backlog in MalwareDB Sep 25, 2023
@rjzak
Member

rjzak commented Sep 25, 2023

Thanks for the feedback! I've mostly been focused on getting features in place and working on usability issues. But performance is definitely on my mind, and it's not yet as fast as I'd like it to be.

Are you using SQLite or Postgres? Postgres should be faster since it uses a C extension, whereas SQLite has to load all the data into memory and then search it.

Edit: I missed the pg part in the title when I first looked. I'll investigate.

@rjzak
Member

rjzak commented Sep 26, 2023

@maxmeng-oss How was the performance with 10M files with the other Postgres extensions (lzjd, ssdeep)?

@ghost
Author

ghost commented Sep 26, 2023

I haven't tested LZJD or SSDEEP yet.
Since B-tree, Hash, and SP-GiST indexes all show the same linear growth with TLSH, my educated guess is that it doesn't matter which hash algorithm you choose: the search will still be an O(N) scan over all the data.
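
To illustrate the O(N) point, here is a minimal Python sketch, not MalwareDB code and not the index structure from the TLSH papers. It assumes a stand-in distance (Hamming distance on 64-bit integers rather than the real TLSH distance) and uses a BK-tree, which relies on the triangle inequality to skip parts of the data. The names `hamming`, `linear_search`, and `BKTree` are hypothetical and exist only for this example.

```python
import random


def hamming(a: int, b: int) -> int:
    """Stand-in metric: number of differing bits (NOT the real TLSH distance)."""
    return bin(a ^ b).count("1")


def linear_search(items, query, radius):
    """O(N) baseline: compare the query against every stored digest."""
    return [x for x in items if hamming(x, query) <= radius]


class BKTree:
    """BK-tree over a discrete metric; a node is (value, {edge_distance: child})."""

    def __init__(self, distance):
        self.distance = distance
        self.root = None
        self.comparisons = 0  # distance evaluations done by the last search

    def add(self, value):
        if self.root is None:
            self.root = (value, {})
            return
        node = self.root
        while True:
            d = self.distance(value, node[0])
            child = node[1].get(d)
            if child is None:
                node[1][d] = (value, {})
                return
            node = child

    def search(self, query, radius):
        results = []
        self.comparisons = 0
        stack = [self.root] if self.root is not None else []
        while stack:
            value, children = stack.pop()
            d = self.distance(query, value)
            self.comparisons += 1
            if d <= radius:
                results.append(value)
            # Triangle inequality: a match can only sit under an edge whose
            # distance lies in [d - radius, d + radius]; skip the rest.
            for edge, child in children.items():
                if d - radius <= edge <= d + radius:
                    stack.append(child)
        return results


if __name__ == "__main__":
    random.seed(0)
    digests = [random.getrandbits(64) for _ in range(5000)]
    tree = BKTree(hamming)
    for digest in digests:
        tree.add(digest)
    query = digests[0] ^ 0b111  # a digest 3 bits away from a stored one
    hits = tree.search(query, radius=3)
    print(len(hits), "hits using", tree.comparisons, "of", len(digests), "comparisons")
```

A B-tree or hash index only helps with exact or ordered lookups, so a "within distance r" query falls back to something like `linear_search`. A metric index can prune, but how much it prunes depends on the distance distribution and the radius, so this shows the idea and is not a performance claim for TLSH.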

@ghost ghost changed the title from "Performance: pg search performance is slow" to "Performance: pg search is slow" Sep 26, 2023
@rjzak
Member

rjzak commented Sep 27, 2023

This might help: https://github.com/jinyyu/tlsh_gist
There's also this, but I don't understand it: https://zhuanlan.zhihu.com/p/497732848

Related Postgres docs:
