-
Notifications
You must be signed in to change notification settings - Fork 25
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Tolerate empty vectors of shuffle intermediates #1383
Tolerate empty vectors of shuffle intermediates #1383
Conversation
This is more important for the sharded shuffle, which for small inputs is reasonably likely to produce an empty output on some shard.
f104326
to
c764425
Compare
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## main #1383 +/- ##
==========================================
+ Coverage 93.58% 93.71% +0.12%
==========================================
Files 223 223
Lines 37165 37611 +446
==========================================
+ Hits 34781 35246 +465
+ Misses 2384 2365 -19 ☔ View full report in Codecov by Sentry. |
ipa-core/src/helpers/hashing.rs
Outdated
/// | ||
/// ## Panics | ||
/// Panics when Iterator is empty. | ||
pub fn compute_non_empty_hash<I, T, S>(input: I) -> Hash |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it better to return Option<Hash>
here and let callers explicitly decide what to do when the input is empty? I feel like treating 0 as a valid hash is dangerous if you accidentally call compute_hash
because after that all bets are off
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How often do you need an empty hash? Or could those potentially-empty cases be amended with once(foo).chain(input) at the caller?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we just saw our first case where some shards after shuffle ended up with 0 shares. To me, the fact that hash was never computed, feels like something callers need to be aware of.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Option<Hash>
would mean adding an expect
to most calls, and would also mean doing something like unwrap_or(SHA256_HASH_OF_EMPTY_MESSAGE)
for the empty-ok case, since we need to send something to the other helper to signify "I have no data".
I'd rather expose an extra function than prepend a dummy value to force a message to be non-empty, which seems like a kludge.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it just feels unsafe to allow someone to ignore the fact that hash wasn't computed at all. Yes it introduces clutter at the callsite, but it is a good thing for this use case imo. Slows down impatient writer and makes them think what exactly they want to do.
I like Martin's suggestion to flip the naming - it somewhat mitigates the issue, although not entirely imo - someone can still just copy-paste the code w/o thinking. Those things are hard to catch in review
@@ -74,9 +70,37 @@ where | |||
sha.update(&buf); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a risk that this is a bit churn-y. Is there value in writing multiple items before updating the hash? Or are we OK with our vectorization code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My guess is that this can optimize reasonably well. I'm inclined to leave it unless/until it shows up in profiling.
ipa-core/src/helpers/hashing.rs
Outdated
/// | ||
/// ## Panics | ||
/// Panics when Iterator is empty. | ||
pub fn compute_non_empty_hash<I, T, S>(input: I) -> Hash |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How often do you need an empty hash? Or could those potentially-empty cases be amended with once(foo).chain(input) at the caller?
ipa-core/src/helpers/hashing.rs
Outdated
} | ||
|
||
/// Computes Hash of serializable values from an iterator | ||
pub fn compute_hash<I, T, S>(input: I) -> Hash |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd flip the naming, so that the (maybe) empty hash requires the special name.
Another question is whether empty vs. not is even the right property to be looking at, or if we should be looking at the amount of entropy in the input (i.e. at least 128 bits) instead. On the other hand, I don't know if I necessarily think these properties should be enforced in this routine at all. The callers know more about the input and can check a more precise set of properties. Looking at the places |
This is more important for the sharded shuffle, which for small inputs is reasonably likely to produce an empty output on some shard. It adds a new
compute_non_empty_hash
function with the existing behavior of rejecting empty input, and changes thecompute_hash
function to accept an empty input. Then, it changes all of the existing calls tocompute_hash
, except the one inshuffle::malicious
, to callcompute_non_empty_hash
instead. (I did not analyze the existing uses for others that may be able to tolerate an empty input.)