Discard deferred msgs #1131
/// complete after we have disconnected from the client, which would make
/// handling the decrypted value incorrect (because it may have been skipped
/// or re-sent).
pub connection_id: u64,
Just to be sure I understand it, we are really checking to be sure that the connection ID
at the time we took a message off the wire from a downstairs matches the connection ID
when we process the job in on_client_message()?
Like, we don't need to check for, and don't expect to see, a difference in connection ID
when we put a message on the wire and when we pull it off, right?
Any "in flight" messages won't make it back if we stop the client task.
> Like, we don't need to check for, and don't expect to see, a difference in connection ID
> when we put a message on the wire and when we pull it off, right?
> Any "in flight" messages won't make it back if we stop the client task.
Correct, that is what #1126 fixes. This PR adds the same "ignore messages from disconnected downstairs" handling for messages that were deferred before the downstairs was disconnected but complete their deferred operation afterwards.
@@ -489,7 +489,7 @@ pub mod repair_test {
     info!(up.log, "repair job should have got here, move it forward");
     // The repair (NoOp) job should have shown up. Move it forward.
     for cid in ClientId::iter() {
-        if cid != err_ds {
+        if cid != err_ds && cid != or_ds {
I'm still a little confused on how we were sending these responses for or_ds and not throwing off the count of skipped/done below.
Okay, I finally got it here. The test should probably bump the connected counter when it starts up, but
that's unrelated to the changes here, and it looks like having the counter as "0" does not matter anywhere
as long as the check in upstairs finds a number; then it's happy:
    pub(crate) fn get_connection_id(&self) -> Option<u64> {
        if self.client_task.client_stop_tx.is_some() {
            Some(self.stats.connected as u64)
        } else {
            None
        }
    }
This fixes another occurrence of the panic thought to be addressed in #1126:
That fix was insufficient because messages from an about-to-be-disconnected client can be deferred (for decryption), and therefore still arrive in the main task after the client connection is closed.
The fix is to tag every message with its connection index, which is just DownstairsStats::connected for that particular client. Then, when processing deferred messages, skip them if the connection index has changed.

This change also revealed a bug in our live-repair tests! After we send an error, both the error-sending and the under-repair Downstairs will be moved to DsState::Faulted; therefore, we shouldn't be sending a reply from the latter.
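
The connection-index tagging described above can be illustrated with a minimal, self-contained sketch. This is not the code from the PR: `Client`, `DeferredMessage`, `reconnect`, `process_deferred`, and the `task_running` flag are hypothetical stand-ins, and only `connected`, `connection_id`, and the shape of `get_connection_id` echo names that appear in the discussion. The sketch assumes that a message is stamped with the connection index when it comes off the wire and is compared against the client's current index when its deferred work completes.

```rust
use std::collections::VecDeque;

/// Hypothetical, simplified stand-in for a per-client handle.
struct Client {
    /// Number of times this client has connected
    /// (plays the role of `DownstairsStats::connected`).
    connected: u64,
    /// Assumption: whether the client IO task is currently running.
    task_running: bool,
}

impl Client {
    /// Returns the current connection index, or `None` if the task is stopped.
    fn get_connection_id(&self) -> Option<u64> {
        if self.task_running {
            Some(self.connected)
        } else {
            None
        }
    }

    /// Simulates a disconnect followed by a fresh connection.
    fn reconnect(&mut self) {
        self.connected += 1;
        self.task_running = true;
    }
}

/// A message whose processing (for example, decryption) was deferred.
struct DeferredMessage {
    /// Connection index recorded when the message came off the wire.
    connection_id: u64,
    payload: &'static str,
}

/// Drains the deferred queue, returning only the payloads whose recorded
/// connection index still matches the client's current one.
fn process_deferred(
    client: &Client,
    queue: &mut VecDeque<DeferredMessage>,
) -> Vec<&'static str> {
    let mut handled = Vec::new();
    while let Some(msg) = queue.pop_front() {
        if client.get_connection_id() == Some(msg.connection_id) {
            handled.push(msg.payload);
        }
        // Otherwise the client has reconnected or been stopped since the
        // message arrived, so the message is silently discarded.
    }
    handled
}
```

Because the comparison is against an `Option<u64>`, a counter that starts at 0 still matches as long as the client task is running, which is consistent with the observation in the review thread about the test's connection counter.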