Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Discard deferred msgs #1131

Merged
merged 2 commits into from
Feb 2, 2024
Merged

Conversation

mkeeter
Copy link
Contributor

@mkeeter mkeeter commented Feb 2, 2024

This fixes another occurrence of the panic thought to be addressed in #1126:

Feb 02 00:44:11.050 WARN 63a91608-1f7f-41a4-bd21-76371c72a2d0 WARNING finish job 1083 when downstairs state: Offline, client: 1, : downstairs, session_id: 610eadbf-df34-44b5-8520-b4e88207bf54
thread 'test::integration_test_problematic_downstairs' panicked at upstairs/src/client.rs:1249:13:
[1] Job 1083 completed while not InProgress: New
stack backtrace:
   0: rust_begin_unwind
             at /rustc/82e1608dfa6e0b5569232559e3d385fea5a93112/library/std/src/panicking.rs:645:5
   1: core::panicking::panic_fmt
             at /rustc/82e1608dfa6e0b5569232559e3d385fea5a93112/library/core/src/panicking.rs:72:14
   2: crucible::client::DownstairsClient::process_io_completion
             at /Users/mjk/oxide/crucible/upstairs/src/client.rs:1249:13
   3: crucible::downstairs::Downstairs::process_io_completion_inner
             at /Users/mjk/oxide/crucible/upstairs/src/downstairs.rs:3177:12
   4: crucible::downstairs::Downstairs::process_io_completion
             at /Users/mjk/oxide/crucible/upstairs/src/downstairs.rs:3063:9
   5: crucible::upstairs::Upstairs::on_client_message::{{closure}}
             at /Users/mjk/oxide/crucible/upstairs/src/upstairs.rs:1616:25
   6: crucible::upstairs::Upstairs::apply::{{closure}}
             at /Users/mjk/oxide/crucible/upstairs/src/upstairs.rs:540:43
   7: crucible::upstairs::Upstairs::run::{{closure}}
             at /Users/mjk/oxide/crucible/upstairs/src/upstairs.rs:429:32
   8: crucible::up_main::{{closure}}
             at /Users/mjk/oxide/crucible/upstairs/src/lib.rs:2962:58
   9: <core::pin::Pin<P> as core::future::future::Future>::poll
             at /rustc/82e1608dfa6e0b5569232559e3d385fea5a93112/library/core/src/future/future.rs:125:9
  10: tokio::runtime::task::core::Core<T,S>::poll::{{closure}}
             at /Users/mjk/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.35.1/src/runtime/task/core.rs:328:17
  11: tokio::loom::std::unsafe_cell::UnsafeCell<T>::with_mut
             at /Users/mjk/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.35.1/src/loom/std/unsafe_cell.rs:16:9

That fix was insufficient because messages from an about-to-be-disconnected client can be deferred (for decryption), and therefore still arrive in the main task after the client connection is closed.

The fix is to tag every message with its connection index, which is just DownstairsStats::connected for that particular client. Then, when processing deferred messages, skip them if the connection index has changed.

This change also revealed a bug in our live-repair tests! After we send an error, both the error-sending and the under-repair Downstairs will be moved to DsState::Faulted; therefore, we shouldn't be sending a reply from the latter.

@mkeeter mkeeter requested a review from leftwo February 2, 2024 01:38
@leftwo
Copy link
Contributor

leftwo commented Feb 2, 2024

I wonder if this might fix:
#1108
and
#1121

1108 I never gathered more data on, so it's rather short on details.

/// complete after we have disconnected from the client, which would make
/// handling the decrypted value incorrect (because it may have been skipped
/// or re-sent).
pub connection_id: u64,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to be sure I understand it, we are really checking to be sure that the connection ID
at the time we took a message off the wire from a downstairs matches the connection ID
when we process the job in on_client_message()?

Like, we don't need to check for, and don't expect to see, a difference in connection ID
when we put a message on the wire and when we pull it off, right?
Any "in flight" messages won't make it back if we stop the client task.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Like, we don't need to check for, and don't expect to see, a difference in connection ID
when we put a message on the wire and when we pull it off, right?
Any "in flight" messages won't make it back if we stop the client task.

Correct, that was what #1126 fixes. This adds the same "ignore messages from disconnected downstairs" for messages that were deferred before the downstairs was disconnected, but complete their deferred operation afterwards.

@@ -489,7 +489,7 @@ pub mod repair_test {
info!(up.log, "repair job should have got here, move it forward");
// The repair (NoOp) job should have shown up. Move it forward.
for cid in ClientId::iter() {
if cid != err_ds {
if cid != err_ds && cid != or_ds {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm still a little confused on how we were sending these responses for or_ds and not
throwing off the count of skipped/done below.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, I finally got it here. The test should probably bump the connected counter when it starts up, but
that's unrelated to the changes here, and it looks like having the counter as "0" does not matter anywhere
as long as the check in upstairs:

    pub(crate) fn get_connection_id(&self) -> Option<u64> {                       
        if self.client_task.client_stop_tx.is_some() {                            
            Some(self.stats.connected as u64)                                     
        } else {                                                                  
            None                                                                  
        }                                                                         
    }      

Find a number, it's happy.

@leftwo leftwo self-requested a review February 2, 2024 17:36
@mkeeter mkeeter merged commit fe857b7 into oxidecomputer:main Feb 2, 2024
18 checks passed
@mkeeter mkeeter deleted the discard-deferred-msgs branch February 2, 2024 20:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants