Tolerate empty vectors of shuffle intermediates #1383

Merged
merged 3 commits into from
Oct 30, 2024

Conversation

andyleiserson
Collaborator

This is more important for the sharded shuffle, which for small inputs is reasonably likely to produce an empty output on some shard. It adds a new compute_non_empty_hash function with the existing behavior of rejecting empty input, and changes the compute_hash function to accept an empty input. Then, it changes all of the existing calls to compute_hash, except the one in shuffle::malicious, to call compute_non_empty_hash instead. (I did not analyze the existing uses for others that may be able to tolerate an empty input.)
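A minimal sketch of the split described above. The function names come from this PR; everything else is an assumption made for illustration: the items are plain byte slices, the standard library's `DefaultHasher` stands in for the SHA-256 based `Hash`, and the `hash_and_count` helper is hypothetical.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::Hasher;

/// Stand-in digest type (assumption: the real `Hash` wraps a SHA-256 output).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Hash(u64);

/// Hypothetical shared helper: hashes every item and reports how many it saw.
fn hash_and_count<I, T>(input: I) -> (Hash, usize)
where
    I: IntoIterator<Item = T>,
    T: AsRef<[u8]>,
{
    let mut hasher = DefaultHasher::new();
    let mut count = 0;
    for item in input {
        hasher.write(item.as_ref());
        count += 1;
    }
    (Hash(hasher.finish()), count)
}

/// Accepts empty input; the result is then the digest of the empty message.
pub fn compute_hash<I, T>(input: I) -> Hash
where
    I: IntoIterator<Item = T>,
    T: AsRef<[u8]>,
{
    hash_and_count(input).0
}

/// Keeps the previous behavior of rejecting empty input.
///
/// ## Panics
/// Panics when the iterator is empty.
pub fn compute_non_empty_hash<I, T>(input: I) -> Hash
where
    I: IntoIterator<Item = T>,
    T: AsRef<[u8]>,
{
    let (hash, count) = hash_and_count(input);
    assert!(count > 0, "cannot hash an empty input");
    hash
}
```

In this sketch both functions agree on every non-empty input, so a call site only has to choose whether an empty input is an error.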

codecov bot commented Oct 29, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 93.71%. Comparing base (4fd5e2b) to head (e9d8562).
Report is 24 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1383      +/-   ##
==========================================
+ Coverage   93.58%   93.71%   +0.12%     
==========================================
  Files         223      223              
  Lines       37165    37611     +446     
==========================================
+ Hits        34781    35246     +465     
+ Misses       2384     2365      -19     

///
/// ## Panics
/// Panics when Iterator is empty.
pub fn compute_non_empty_hash<I, T, S>(input: I) -> Hash
Collaborator

Is it better to return Option<Hash> here and let callers explicitly decide what to do when the input is empty? I feel like treating 0 as a valid hash is dangerous: if you accidentally call compute_hash, all bets are off after that.

Member

How often do you need an empty hash? Or could those potentially-empty cases be amended with once(foo).chain(input) at the caller?
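For reference, the `once(foo).chain(input)` idea could look roughly like this at a call site. This is only a sketch: `with_domain_tag` and the `tag` value are hypothetical names, and the items are assumed to be owned byte vectors.

```rust
use std::iter::once;

/// Prepends a fixed, caller-chosen value so the hashed stream is never empty.
/// (Hypothetical helper; `tag` plays the role of `foo` in the comment above.)
pub fn with_domain_tag<I>(tag: Vec<u8>, input: I) -> impl Iterator<Item = Vec<u8>>
where
    I: IntoIterator<Item = Vec<u8>>,
{
    once(tag).chain(input)
}
```

The resulting iterator always yields at least the tag, so it could be passed to a hash function that rejects empty input. Both helpers would have to prepend the same tag for their hashes to match.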

Collaborator

I think we just saw our first case where some shards ended up with 0 shares after the shuffle. To me, the fact that the hash was never computed feels like something callers need to be aware of.

Collaborator Author

Option<Hash> would mean adding an expect to most calls, and would also mean doing something like unwrap_or(SHA256_HASH_OF_EMPTY_MESSAGE) for the empty-ok case, since we need to send something to the other helper to signify "I have no data".

I'd rather expose an extra function than prepend a dummy value to force a message to be non-empty, which seems like a kludge.
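To make the trade-off concrete, here is a rough sketch of what the `Option<Hash>` variant would mean at call sites. It is not code from this PR: `try_compute_hash`, `hash_required`, `hash_or_empty`, and `hash_of_empty_message` are hypothetical names, and `DefaultHasher` again stands in for SHA-256.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::Hasher;

/// Stand-in digest type (assumption: the real `Hash` wraps a SHA-256 output).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Hash(u64);

/// Hypothetical variant: `None` means that nothing was hashed.
pub fn try_compute_hash<I, T>(input: I) -> Option<Hash>
where
    I: IntoIterator<Item = T>,
    T: AsRef<[u8]>,
{
    let mut items = input.into_iter().peekable();
    items.peek()?; // return None early when the input is empty
    let mut hasher = DefaultHasher::new();
    for item in items {
        hasher.write(item.as_ref());
    }
    Some(Hash(hasher.finish()))
}

/// Stand-in for the SHA256_HASH_OF_EMPTY_MESSAGE constant mentioned above.
pub fn hash_of_empty_message() -> Hash {
    Hash(DefaultHasher::new().finish())
}

/// Most call sites: an empty input is a bug, so they would add an `expect`.
pub fn hash_required(rows: &[Vec<u8>]) -> Hash {
    try_compute_hash(rows).expect("input must not be empty")
}

/// The empty-ok call site still needs a value to send to the other helper.
pub fn hash_or_empty(rows: &[Vec<u8>]) -> Hash {
    try_compute_hash(rows).unwrap_or_else(hash_of_empty_message)
}
```

The empty-ok path ends up producing the hash of the empty message anyway, which is the outcome an empty-tolerant `compute_hash` gives directly.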

Collaborator

It just feels unsafe to let someone ignore the fact that the hash wasn't computed at all. Yes, it introduces clutter at the call site, but that is a good thing for this use case imo: it slows down an impatient writer and makes them think about what exactly they want to do.

I like Martin's suggestion to flip the naming. It somewhat mitigates the issue, although not entirely imo: someone can still copy-paste the code without thinking, and those things are hard to catch in review.

@@ -74,9 +70,37 @@ where
sha.update(&buf);
Member

Is there a risk that this is a bit churn-y? Is there value in writing multiple items before updating the hash, or are we OK with our vectorization code?

Collaborator Author

My guess is that this can optimize reasonably well. I'm inclined to leave it unless/until it shows up in profiling.
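If profiling ever does point here, the batching alternative raised in the question could be sketched as below. This is an illustration under stated assumptions, not the PR's code: a tiny FNV-1a hasher stands in for SHA-256 (both consume bytes as a stream, so splitting the input across `update` calls does not change the result), and `hash_per_item`, `hash_batched`, and `batch_bytes` are hypothetical names.

```rust
/// Minimal streaming hasher (64-bit FNV-1a) used here as a stand-in for SHA-256.
pub struct Fnv1a(u64);

impl Fnv1a {
    pub fn new() -> Self {
        Fnv1a(0xcbf2_9ce4_8422_2325) // FNV-1a 64-bit offset basis
    }

    pub fn update(&mut self, bytes: &[u8]) {
        for &b in bytes {
            self.0 ^= u64::from(b);
            self.0 = self.0.wrapping_mul(0x0000_0100_0000_01b3); // FNV prime
        }
    }

    pub fn finalize(self) -> u64 {
        self.0
    }
}

/// One `update` call per item, mirroring the `sha.update(&buf)` loop in the diff.
pub fn hash_per_item(items: &[Vec<u8>]) -> u64 {
    let mut hasher = Fnv1a::new();
    for item in items {
        hasher.update(item);
    }
    hasher.finalize()
}

/// Collects items into a buffer and updates once per `batch_bytes` or more.
pub fn hash_batched(items: &[Vec<u8>], batch_bytes: usize) -> u64 {
    let mut hasher = Fnv1a::new();
    let mut buf = Vec::with_capacity(batch_bytes);
    for item in items {
        buf.extend_from_slice(item);
        if buf.len() >= batch_bytes {
            hasher.update(&buf);
            buf.clear();
        }
    }
    hasher.update(&buf); // flush whatever is left, possibly nothing
    hasher.finalize()
}
```

Both functions return the same value for the same items, so batching would be a pure performance change that can wait for profiling data.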

}

/// Computes Hash of serializable values from an iterator
pub fn compute_hash<I, T, S>(input: I) -> Hash
Member

I'd flip the naming, so that the (maybe) empty hash requires the special name.

@andyleiserson
Collaborator Author

Another question is whether empty vs. not is even the right property to be looking at, or if we should be looking at the amount of entropy in the input (i.e. at least 128 bits) instead. On the other hand, I don't know if I necessarily think these properties should be enforced in this routine at all. The callers know more about the input and can check a more precise set of properties. Looking at the places compute_hash gets used, I see three categories: (1) malicious shuffle verification (this PR); (2) input is an array; (3) in validate_three_two_way_sharing_of_zero, which is unused.

@andyleiserson andyleiserson merged commit 6d29275 into private-attribution:main Oct 30, 2024
12 checks passed
@andyleiserson andyleiserson deleted the empty-hash branch October 30, 2024 18:28