Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Batch construction of wavelet trees #13

Closed
chrisbarber opened this issue Apr 29, 2019 · 4 comments
Closed

Batch construction of wavelet trees #13

chrisbarber opened this issue Apr 29, 2019 · 4 comments

Comments

@chrisbarber
Copy link
Contributor

chrisbarber commented Apr 29, 2019

Orthogonal to #5, it would be nice to have (and probably easier to implement) batch construction of wavelet trees which is agnostic to the underlying bitvector implementation. While still making serial inserts (push back) to the individual bitvectors, acceleration is possible as follows:

Given a string of symbols to batch-create a new wavelet tree, first scan the string and encode all the symbols using the chosen alphabet encoder. Build an "empty" wavelet tree with nodes for each of the encoded symbols. For each node, precompute the symbols which belong to its partition, as well as whether those symbols belong to its left or right child. Now, the bitvectors of each node can be populated in parallel by iterating symbols in the string, and:

  • if symbol belongs to partition for this node
    • push back a 0 onto bitvector if symbol belongs to left child
    • push back a 1 onto bitvector if symbol belongs to right child
  • else, do nothing

Since the empty tree is pre-constructed, and the bitvectors are all independent, there is no synchronization to worry about, and all threads can read from one copy of the string in memory. Run one thread per node up to a max number of concurrent threads.

Conceptually it is that simple but there are a few things to optimize, e.g. you can wait to create a node and initiate its thread until coming across the first symbol belonging to it, so you don't actually have to pre-scan the string first.

This would also work for push_back_many or even insert_many, where you have threads per node that scan the whole vector of symbols to push back, or the pairs of (symbol, position) to insert, and only do the push back or insert if the symbol belongs to its partition.

@nicolaprezza
Copy link
Member

Thanks for the suggestion! yes, it would surely work nicely. Multi-threading is feature that would be very nice to add to the library. I don't have much time to sped on it though, so if you do go ahead, I'll happily merge.

@chrisbarber
Copy link
Contributor Author

This does offer a pretty good speedup, but I suspect there might be more to be gained by implementing batch construction into the lower-levels e.g. spsi or packed_vector. Could you perhaps advise on a strategy for doing that? Could you also point to a reference for what spsi actually implements? (I see a couple papers that look right, but it would take some time for me to be sure as I am unfamiliar with the general domain.)

@nicolaprezza
Copy link
Member

Awesome, thanks! spsi implements a B-tree where the leaves are packed vectors. The best reference I can think of is the library's paper (https://core.ac.uk/download/pdf/84869324.pdf).

Packed_vector is simply a vector where all integers are packed in a fixed number of bits each. This should be easy to parallelize: instead of calling many push backs, I think you could push an entire word containing several pre-packed integers.

@nicolaprezza
Copy link
Member

Whoops, I've been too eager to merge: it doesn't compile :-/ does this work for you? can you fix it?

from DYNAMIC's main folder:

mkdir build
cd build
cmake ..
make

err.txt

nicolaprezza added a commit that referenced this issue May 8, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants