Hybrid-DataStructure

A hybrid data structure to store Strings, expected to be used for Heuristics for Bitcoin Forensics.
Currently focusing on Heuristic H1, wich checks for same input addresses in different transactions, it makes heavy use of intersection and union functions.
Eventually this project will become a external crate at crates.io to be used openly.
This project is also part of my scientific initiation at UFU (Federal university of Uberlândia - Brazil).

Benchmarks and tests

"cargo bench" for benchmarks and "cargo test" for tests. With the current data structure, Hybrid beats using only a BTreeSet for H1 Heuristic by roughly 100 milliseconds in this test. But it wins with one condition, since BloomFilters have a fixed maximum size, if the structure needs to allocate more it creates another one and pushes it to the list, if it needs to do it, it looses. I recommend creating big BloomFilters, using Hybrid::new(items_count, fp_rate) to avoid this issue.

Details about the project

It is a hybrid data structure that contains a BtreeSet and a BloomFilter. The BloomFilter's goal is to optimize "contains" methods without compromising insertions.
The current code is still on the optimization process but is relatively faster than just using a BTreeSet in certain cases.

Union functions

Last version's union function just pushed the other data structure to the end of the list, this thing was causing some mutability and lifetime problems, it is possible to circle around it but for this new version I've changed the insertion function so it checks for duplicates and union basically inserts other into self. I did this to avoid this mutability and lifetimes issues until I discover how to properly deal with it.

For the future updates:

Make a better way of concatenating the iterators, because of borrowing issues, otherwise from my testings it works.
Make a function that returns iterators with the intersection itself, for the heuristics just a boolean value is enough, but I want it to make this structure more useful for other cases.
I also want to test other Tree sets so the user can choose the best solution for the case.

Things I have already tried

I tried to make it generic, not only for strings, but I don't know why it lost 40% performance since typing should be dealt with in compile time, not run time.
Also tried to make only one TreeSet and a list of BloomFilters, but it ended up consuming so much memory that it crashes when I test it with millions of strings. The goal of this was to deal with duplicates from unions.
Currently using the same seed for all bloomfilters, with the hope to improve performance but haven't seen anything relevant.

Name		Name	Last commit message	Last commit date
Latest commit History 21 Commits
hybrid_data_structure		hybrid_data_structure
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Hybrid-DataStructure

Benchmarks and tests

Details about the project

Union functions

For the future updates:

Things I have already tried

About

Releases

Packages

Languages

PedroLeale/Hybrid-DataStructure

Folders and files

Latest commit

History

Repository files navigation

Hybrid-DataStructure

Benchmarks and tests

Details about the project

Union functions

For the future updates:

Things I have already tried

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages