
Solitary USDT probe in a tokio::select! statement seems to cause other code in one select! arm to execute unexpectedly #86

Open
gjcolombo opened this issue Jan 11, 2023 · 1 comment

@gjcolombo

Propolis recently stumbled across what seems like a code generation issue in which putting a USDT probe (and only a USDT probe) in exactly the right place caused a function to careen down an unexpected path in a tokio::select! statement. We've worked around the issue in Propolis, but we need to investigate further to root-cause it, hence this tracking issue.

(I'm not certain whether this is a problem with USDT as such or an unfortunate interaction between USDT and the Rust build tools--we might investigate and find it's the latter. But USDT probes do seem to be involved somehow so I'm filing this in this repo for now.)

Most of the context, and a link to the actual code in question, is in propolis#292. It might be faster just to read that issue. Here, however, is some long-winded background:


Propolis has a tokio task that reads/writes bytes from/to a VM's virtual COM1 and from/to serial console clients connected over a websocket. The body of the task is mostly a large select! statement that handles input/output from these various sources. In pseudocode it looks like this:

  1 loop {
  2     // get a future that tries to read bytes from the guest
  3     let read_fut = get_read_fut_from_uart();
  4 
  5     // get some other futures/channels ready to go
  6 
  7     tokio::select! {
  8 
  9         // various other cases for new connections, handling
 10         // writes, etc.
 11 
 12         nread = read_fut => {
 13             match nread {
 14                 Some(0) | None => {
 15                     probes::serial_task_read!(|| { 0 });
 16                     // channel was closed or an error occurred,
 17                     // exit the task
 18                     break;
 19                 }
  20         Some(n) => {
 21                     probes::serial_task_read!(|| { n });
 22                     // process bytes that were read
 23                 }
 24             }
 25         }
 26     }
 27 }

The Propolis equivalent of the probe on line 15 has very strange behavior. If you build this code with the release profile, and this probe (and only this probe) is present in its match arm, then the serial task breaks out of the loop unexpectedly early. Any of the following changes makes the loop behave as expected:

  • building the code with the dev profile
  • removing the probe entirely
  • adding a slog::info! logging statement at line 16 (this need not refer to the value of nread or indeed to any other variables; it can just print a message)
  • hoisting the probes on lines 15 and 21 out of the match

This Propolis commit implements this last option as a workaround and shows the actual code in its full glory.
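In the same pseudocode style, the hoisting workaround amounts to firing the probe once, before the match, so that neither arm contains a solitary probe invocation (the `unwrap_or` here is illustrative; the linked commit has the real code):

    nread = read_fut => {
        // Fire the probe once, outside the match, so no arm
        // consists solely of a probe call.
        probes::serial_task_read!(|| { nread.unwrap_or(0) });
        match nread {
            Some(0) | None => {
                // channel was closed or an error occurred,
                // exit the task
                break;
            }
            Some(n) => {
                // process bytes that were read
            }
        }
    }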


I briefly rummaged through the output of cargo-show-asm with and without the offending probe in place, but quickly decided I was in over my head, since (a) I don't (yet) understand how DTrace probes work or how USDT constructs them, and (b) the only way I could find to get all the assembly and matched line numbers I needed was to generate asm for the entire propolis-server lib, which is massive and hard to rummage through at this level of detail.
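For what it's worth, cargo-show-asm can filter its output down to a single symbol, which might make the rummaging more tractable than dumping the whole lib. A sketch (the function path below is hypothetical; when the filter is ambiguous, `cargo asm` prints the list of matching symbols to choose from):

```shell
# Scope the disassembly to the serial task instead of all of propolis-server.
# --rust interleaves the Rust source lines with the generated assembly.
cargo asm -p propolis-server --lib --rust "serial_task"
```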

I suspect the next step here is to try to minimize this somewhat: can we reproduce this with a simplified task that fakes out the UART and builds in its own small library? If so that should make it much easier to understand what's happening.

@bnaecker (Collaborator) commented Jan 9, 2024

I was trying to repro this yesterday, and hit a few things I thought I'd add. First, I made a tiny program intended to be a minimal reproducer, but was not able to successfully repro the behavior. Here's the main.rs in its entirety:

#[usdt::provider]
mod probes {
    fn read_bytes(n_bytes: u64) {}
}

#[tokio::main]
async fn main() {
    usdt::register_probes().unwrap();
    let (tx, mut rx) = tokio::sync::mpsc::unbounded_channel();
    let counter = std::sync::Arc::new(std::sync::atomic::AtomicU64::new(10));
    let counter_ = counter.clone();
    let _task = tokio::task::spawn(async move {
        while counter_.load(std::sync::atomic::Ordering::SeqCst) != 0 {
            tokio::time::sleep(std::time::Duration::from_millis(100)).await;
            let n = counter_.fetch_sub(1, std::sync::atomic::Ordering::SeqCst);
            tx.send(n).unwrap();
        }
    });
    loop {
        tokio::select! {
            nread = rx.recv() => {
                match nread {
                    Some(0) | None => {
                        probes::read_bytes!(|| { 0 } );
                        break;
                    }
                    Some(n) => {
                        probes::read_bytes!(|| { n });
                    }
                }
            }
        }
    }
    assert_eq!(counter.load(std::sync::atomic::Ordering::SeqCst), 0, "Broke early");
    println!("Completed normally");
}

Here's what I see running this against rustc 1.66.0:

bnaecker@sandstone : ~/file-cabinet/tmp/usdt-select-fail $ cargo +1.66.0 --version
cargo 1.66.0 (d65d197ad 2022-11-15)
bnaecker@sandstone : ~/file-cabinet/tmp/usdt-select-fail $ cargo +1.66.0 run --release
    Finished release [optimized] target(s) in 0.10s
     Running `target/release/usdt-select-fail`
Completed normally

It's entirely possible that's not an accurate reproducer. I next tried to go back to the original commit in Propolis on which this failure was discovered, but could not actually build it due to a dependency conflict:

bnaecker@shale : ~/propolis $ git rev-list -1 HEAD
5c36674218c7a0a4551063d053462481f29e3ac1
bnaecker@shale : ~/propolis $ cargo +1.66.0 build --release --bin propolis-server
    Updating crates.io index
    Updating git repository `https://github.com/oxidecomputer/crucible`
    Updating git repository `https://github.com/oxidecomputer/progenitor`
    Updating git repository `https://github.com/oxidecomputer/dropshot`
    Updating git repository `https://github.com/oxidecomputer/omicron`
    Updating git repository `https://github.com/oxidecomputer/rfb`
    Updating git repository `https://github.com/oxidecomputer/dlpi-sys`
    Updating git repository `https://github.com/oxidecomputer/ispf`
    Updating git repository `https://github.com/oxidecomputer/p4`
    Updating git repository `https://github.com/oxidecomputer/p9fs`
    Updating git repository `https://github.com/oxidecomputer/softnpu`
error: no matching package named `omicron-common` found
location searched: https://github.com/oxidecomputer/omicron?branch=main
required by package `propolis-server v0.1.0 (/home/bnaecker/propolis/bin/propolis-server)`

I'm not sure where this error is coming from, but it seems there's some git dependency that's not pinned down to a specific revision. omicron-common is indeed part of the omicron repo, but without a lockfile I'm not sure which revision this is supposed to be pointing at. We could try to correlate the histories of the repos, and point to a revision by SHA that is from around the same time, though that seems a bit fraught. I'm not sure where to go from here, but creating a more accurate reproducer seems a better approach.
