Batch construction of wavelet trees #13

chrisbarber · 2019-04-29T14:53:28Z

Orthogonal to #5, it would be nice to have (and probably easier to implement) batch construction of wavelet trees which is agnostic to the underlying bitvector implementation. While still making serial inserts (push back) to the individual bitvectors, acceleration is possible as follows:

Given a string of symbols to batch-create a new wavelet tree, first scan the string and encode all the symbols using the chosen alphabet encoder. Build an "empty" wavelet tree with nodes for each of the encoded symbols. For each node, precompute the symbols which belong to its partition, as well as whether those symbols belong to its left or right child. Now, the bitvectors of each node can be populated in parallel by iterating symbols in the string, and:

if symbol belongs to partition for this node
- push back a 0 onto bitvector if symbol belongs to left child
- push back a 1 onto bitvector if symbol belongs to right child
else, do nothing

Since the empty tree is pre-constructed, and the bitvectors are all independent, there is no synchronization to worry about, and all threads can read from one copy of the string in memory. Run one thread per node up to a max number of concurrent threads.

Conceptually it is that simple but there are a few things to optimize, e.g. you can wait to create a node and initiate its thread until coming across the first symbol belonging to it, so you don't actually have to pre-scan the string first.

This would also work for push_back_many or even insert_many, where you have threads per node that scan the whole vector of symbols to push back, or the pairs of (symbol, position) to insert, and only do the push back or insert if the symbol belongs to its partition.

The text was updated successfully, but these errors were encountered:

nicolaprezza · 2019-05-02T07:53:38Z

Thanks for the suggestion! yes, it would surely work nicely. Multi-threading is feature that would be very nice to add to the library. I don't have much time to sped on it though, so if you do go ahead, I'll happily merge.

chrisbarber · 2019-05-07T12:47:02Z

This does offer a pretty good speedup, but I suspect there might be more to be gained by implementing batch construction into the lower-levels e.g. spsi or packed_vector. Could you perhaps advise on a strategy for doing that? Could you also point to a reference for what spsi actually implements? (I see a couple papers that look right, but it would take some time for me to be sure as I am unfamiliar with the general domain.)

nicolaprezza · 2019-05-08T08:56:23Z

Awesome, thanks! spsi implements a B-tree where the leaves are packed vectors. The best reference I can think of is the library's paper (https://core.ac.uk/download/pdf/84869324.pdf).

Packed_vector is simply a vector where all integers are packed in a fixed number of bits each. This should be easy to parallelize: instead of calling many push backs, I think you could push an entire word containing several pre-packed integers.

nicolaprezza · 2019-05-08T09:05:10Z

Whoops, I've been too eager to merge: it doesn't compile :-/ does this work for you? can you fix it?

from DYNAMIC's main folder:

mkdir build
cd build
cmake ..
make

err.txt

followup to #13

chrisbarber mentioned this issue May 7, 2019

Implement multithreaded push_many on wt_string #14

Merged

chrisbarber mentioned this issue May 8, 2019

followup to #13 #15

Merged

nicolaprezza added a commit that referenced this issue May 8, 2019

Merge pull request #15 from chrisbarber/master

e537746

followup to #13

chrisbarber closed this as completed May 28, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Batch construction of wavelet trees #13

Batch construction of wavelet trees #13

chrisbarber commented Apr 29, 2019 •

edited

Loading

nicolaprezza commented May 2, 2019

chrisbarber commented May 7, 2019

nicolaprezza commented May 8, 2019

nicolaprezza commented May 8, 2019

Batch construction of wavelet trees #13

Batch construction of wavelet trees #13

Comments

chrisbarber commented Apr 29, 2019 • edited Loading

nicolaprezza commented May 2, 2019

chrisbarber commented May 7, 2019

nicolaprezza commented May 8, 2019

nicolaprezza commented May 8, 2019

chrisbarber commented Apr 29, 2019 •

edited

Loading