Performance regression in bao tree? #1288
Comments
Looked into this a bit. It seems that `iroh add` is about half as fast in 0.5.2 as in 0.4.1. Looking at the flamegraph does not reveal any immediate insights. Most time is spent in the `blake3` crate, as you would expect. Since it is half as fast as before, maybe I am doing something twice? One notable thing is that this is using the portable implementation on M1 Macs, because the neon SIMD feature is not automatically detected. See the crate docs:
We could manually enable neon on Macs.
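A sketch of what enabling that manually could look like, assuming the downstream crate controls the `blake3` dependency directly (the `neon` feature name should be checked against the `blake3` version in use):

```toml
# Hypothetical Cargo.toml snippet: force-enable the NEON backend on
# aarch64 Macs instead of relying on automatic detection.
# The `neon` feature name is an assumption; check the blake3 crate docs.
[dependencies]
blake3 = { version = "1", features = ["neon"] }
```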
It looks like abao is more efficient when computing a chunk group hash compared to bao-tree.
I still have no idea what caused the regression. The flamegraph showed that the version I have locally is using bao-tree and isn't that old; the difference might come down to release optimizations and LTO. But I looked into this a bit more. There is currently a limitation shared by all bao implementations (bao, my abao, and bao-tree): they can not tap into the full performance of the `blake3` crate. The `blake3` crate provides efficient ways to hash data (`blake3::hash` and `blake3::Hasher`), but these do not provide a way to hash data at a non-zero chunk offset or to produce non-root hashes.
The functions exposed in `blake3::guts` that do allow working with non-zero chunk offsets and producing non-root hashes (`blake3::guts::parent_cv` and `blake3::guts::ChunkState`) are not as efficient. What is needed is to expose a function `pub fn hash_block(start_chunk: u64, data: &[u8], is_root: bool) -> Hash` in `blake3::guts`, which would benefit all bao implementations.
Added a PR to BLAKE3: BLAKE3-team/BLAKE3#329
Fixed by #1322, or by the above PR if it gets merged and we can use it.
Played around with ML data sets again.
For some reason ingestion of a new file is much slower in 0.5 than it used to be in 0.4.1.
The only thing that is done when adding a file is computing the outboard, synchronously, via `bao_tree::io::outboard_post_order`.
I don't remember much changing regarding this in bao-tree. The only noteworthy thing was moving from `ouroboros` to `self_cell`.
Another culprit could be the new threading concept, but since this is working on the blocking pool I find this unlikely.