Performance regression in bao tree? #1288
Comments
Looked into this a bit. It seems that `iroh add` is about half as fast in 0.5.2 as in 0.4.1. Looking at the flamegraph does not reveal any immediate insights. Most time is spent in the `blake3` crate, as you would expect. Since it is half as fast as before, maybe I am doing something twice? One notable thing is that this is using the portable implementation on M1 Macs, because the neon SIMD feature is not automatically detected. See the crate docs:
We could manually enable neon on Macs.
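A sketch of what enabling that manually could look like, assuming the downstream crate controls the `blake3` dependency directly (the `neon` feature name should be checked against the `blake3` version in use):

```toml
# Hypothetical Cargo.toml snippet: force-enable the NEON backend on
# aarch64 Macs instead of relying on automatic detection.
# The `neon` feature name is an assumption; check the blake3 crate docs.
[dependencies]
blake3 = { version = "1", features = ["neon"] }
```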
It looks like abao is more efficient when computing a chunk group hash compared to bao-tree.
I still have no idea what caused the regression. The flamegraph showed that the version I have locally is using bao-tree and isn't that old; the difference might come down to release optimizations and LTO. But I looked into this a bit more. There is currently a limitation shared by all bao implementations (bao, my abao, and bao-tree): they can not tap into the full performance of the `blake3` crate. The `blake3` crate provides efficient ways to hash data (`blake3::hash` and `blake3::Hasher`), but these do not provide a way to hash data at a non-zero chunk offset or to produce non-root hashes.
The functions exposed in `blake3::guts` that do allow working with non-zero chunk offsets and producing non-root hashes (`blake3::guts::parent_cv` and `blake3::guts::ChunkState`) are not as efficient. What is needed is to expose a function `pub fn hash_block(start_chunk: u64, data: &[u8], is_root: bool) -> Hash` in `blake3::guts`, which would benefit all bao implementations.
Added a PR to BLAKE3: BLAKE3-team/BLAKE3#329
Fixed by #1322, or by the above PR if it gets merged and we can use it.
Played around with ML data sets again.
For some reason ingestion of a new file is much slower in 0.5 than it used to be in 0.4.1.
The only thing that is done when adding a file is computing the outboard, synchronously, via `bao_tree::io::outboard_post_order`.
I don't remember much changing regarding this in bao-tree. The only noteworthy thing was moving from `ouroboros` to `self_cell`.
Another culprit could be the new threading concept, but since this is working on the blocking pool I find this unlikely.