test failed in CI: test_omdb_success_cases #6505
Bummer -- and thanks for filing this. From the output, it looks to me like the test ran the command
That's not very much to go on. We don't have more because this was a subprocess -- the test ultimately failed only because the output didn't match what it expected. I haven't totally given up yet but I've put up #6516 so that if we hit this again we'll get more information about a panic from the subprocess.
The panic message is coming from here: [...] But I think that's just propagating a panic that happened in the middle of just about anything that async-bb8-diesel was doing. There are a few unwraps in omicron/dev-tools/omdb/src/bin/omdb/db.rs (lines 2837 to 2920 at a77c31b).
But if we panicked in those, I don't think it would show up in async-bb8-diesel. I'm trying to figure out what would show up there. We're not using [...]. I'm also going to file an async-bb8-diesel bug because it seems like it could propagate more information about the panic in this situation.
Actually, I'm not sure this is an async-bb8-diesel bug. Looking more closely at the
One way I could imagine this happening is if the program started an async database operation (like
I hit this one today on PR #6652: https://buildomat.eng.oxide.computer/wg/0/details/01J8JH1GHTE595MAF5YBTA8BS6/3ybl2B1ZPCoj5D0UfV5BgvmfQpYqlpYGkOVlmy3dpaUvVN5L/01J8JH2AVHD7EAR6AT7ZG8WS72#S5595. Wanted to comment here because it looks like this particular flake may result in very different panic messages depending on which OMDB command actually hit the issue, so folks hitting this flake might report new issues for it that are actually duplicates of this.
Appears to have bitten #6698, also in an omdb test: https://buildomat.eng.oxide.computer/wg/0/details/01J8RD8168F85NSREWHQ1SDSRG/CYHvVeadfxQl3Yi4jpbKPAmUTiZePQCq9PVtL9VzofHG4CL6/01J8RD8TN8DS3QGWAZHEPVJYEW5
Could this be a shutdown task ordering issue? At https://docs.rs/crate/async-bb8-diesel/0.2.1/source/src/async_traits.rs#95, if the spawned task is cancelled due to the runtime shutting down, then there's going to be a panic here.
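For illustration, here is a minimal sketch of the pattern being described (hypothetical code, assuming the pool hands blocking Diesel work to tokio and unwraps the join result; this is not async-bb8-diesel's actual implementation):

```rust
use tokio::task;

// Hypothetical stand-in for "run a blocking database operation on a worker
// thread"; names and types here are illustrative only.
async fn run_blocking_query() -> u32 {
    let handle = task::spawn_blocking(|| {
        // Pretend this is the synchronous Diesel work.
        42
    });
    // If the runtime is shutting down, the spawned task can be cancelled
    // before (or while) it runs; `handle.await` then yields Err(JoinError),
    // and this unwrap is what turns that into a panic.
    handle.await.unwrap()
}

#[tokio::main]
async fn main() {
    println!("result: {}", run_blocking_query().await);
}
```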
Yeah, looking at it I'm pretty sure that it is a shutdown ordering issue. The API is generic over arbitrary E so there's sadly no good place to put in "child task got cancelled, runtime shutting down" at the moment. So we'll probably have to make a breaking change to async-bb8-diesel.
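To make that constraint concrete, here is a hedged sketch of the API shape (illustrative signature only, not the library's real one): because the error type E is chosen by the caller, there is no value of type E available to report that the worker task itself was cancelled.

```rust
// Illustrative only: a function generic over the caller's error type E has
// nowhere to report "the spawned task was cancelled because the runtime is
// shutting down", so the JoinError can only be unwrapped (panicking) or
// silently swallowed.
async fn run_blocking<R, E>(
    job: impl FnOnce() -> Result<R, E> + Send + 'static,
) -> Result<R, E>
where
    R: Send + 'static,
    E: Send + 'static,
{
    tokio::task::spawn_blocking(job).await.unwrap()
}
```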
Hmm, but as Dave pointed out this should only happen if the underlying task was cancelled. And the
Ah, according to tokio-rs/tokio#3805 (comment) what's happening is that the runtime is shutting down before |
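As a hedged, self-contained demonstration of that behavior (an assumed reproduction, not the test's actual code): a task that is still pending when its runtime shuts down completes with a cancelled JoinError, which is exactly what an unwrap on the join result turns into a panic.

```rust
use std::time::Duration;
use tokio::runtime::Runtime;

fn main() {
    let rt = Runtime::new().unwrap();
    // Spawn a task that won't finish before we tear the runtime down.
    let handle = rt.spawn(async {
        tokio::time::sleep(Duration::from_secs(10)).await;
        42
    });
    // Dropping the runtime shuts it down and cancels still-pending tasks.
    drop(rt);
    // Awaiting the handle afterwards (here on a fresh runtime, just so we
    // can poll it) observes the cancellation.
    let err = Runtime::new().unwrap().block_on(handle).unwrap_err();
    assert!(err.is_cancelled());
}
```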
I've put up a tentative PR at oxidecomputer/async-bb8-diesel#77. I don't feel great about it, but I also don't see another way sadly. This is going to end up infecting Omicron as well -- we'll no longer be dealing with Diesel errors, but instead with this new wrapper error type. (And here again we'd need to be careful not to panic on all errors -- instead, if the error is a shutdown error, we'd have to silently ignore it somehow.) Ugh.
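For concreteness, a rough sketch of what such a wrapper error type could look like (hypothetical names; the actual type proposed in oxidecomputer/async-bb8-diesel#77 may differ):

```rust
/// Hypothetical wrapper error: callers now have to distinguish "the
/// operation itself failed" from "the worker task was cancelled".
#[derive(Debug)]
pub enum RunError<E> {
    /// The underlying (e.g. Diesel) error returned by the operation.
    User(E),
    /// The spawned task was cancelled, typically because the tokio runtime
    /// is shutting down; callers may want to ignore this rather than panic.
    Cancelled,
}
```

Callers in Omicron would then match on the cancellation variant during shutdown and ignore it, rather than panicking on every error.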
I understand why we think async-bb8-diesel is causing this particular backtrace - the diagnosis of "We are trying to spawn new work while the tokio runtime is shutting down" seems accurate - but this seems like it might be a secondary failure, rather than the primary reason for the test failing. Framed another way: why are we trying to spawn new work amid a runtime shutdown? I think that propagating better error information from async-bb8-diesel would be worthwhile, just want to confirm my understanding here that "it's weird
I'm guessing the reason we are wondering about async-bb8-diesel is that, as I understand it, nothing else in the
Is there some kind of background task that might be hitting the DB periodically? Edit: it's a bit hard to be completely sure, but the stack trace does seem to suggest this is happening within a task.
The error is coming from here:
Yeah, this tracks with the timing when qorb was integrated into Omicron (in dd85331, which landed right before this bug was first reported). On the bright side, this doesn't seem like a bug that would impact prod, but rather a test shutdown-ordering issue. I'll look at how we're terminating the pool. If we can cleanly terminate the qorb pool when the test exits, that should also help resolve this issue.
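For reference, a hedged sketch of what "cleanly terminate the pool before the runtime shuts down" could look like in a test (hypothetical types and names, not qorb's actual API):

```rust
use std::time::Duration;
use tokio::sync::oneshot;
use tokio::task::JoinHandle;

// Hypothetical background pool: a task that periodically touches the
// database, plus a handle that lets a test shut it down deterministically.
struct BackgroundPool {
    shutdown_tx: oneshot::Sender<()>,
    worker: JoinHandle<()>,
}

impl BackgroundPool {
    fn start() -> Self {
        let (shutdown_tx, mut shutdown_rx) = oneshot::channel();
        let worker = tokio::spawn(async move {
            loop {
                tokio::select! {
                    _ = &mut shutdown_rx => break,
                    _ = tokio::time::sleep(Duration::from_millis(100)) => {
                        // Periodic work (e.g. connection health checks) that
                        // would otherwise race with runtime shutdown.
                    }
                }
            }
        });
        Self { shutdown_tx, worker }
    }

    // Ask the worker to stop and wait for it, so nothing is left trying to
    // spawn new blocking work while the runtime tears down.
    async fn terminate(self) {
        let _ = self.shutdown_tx.send(());
        let _ = self.worker.await;
    }
}

#[tokio::main]
async fn main() {
    let pool = BackgroundPool::start();
    // ... test body would go here ...
    pool.terminate().await; // clean shutdown before the runtime exits
}
```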
Sorry, I'm a little confused about our hypothesized sequence of events leading to this. Is it something like the following, in the child process:
(I feel like that's not exactly right, but I'm just trying to put together the pieces above.)
Yeah, it's worth clarifying -- there are a lot of moving pieces. This is my hypothesis:
The qorb "termination" code is pretty half-baked right now -- it just calls [...]. My plan is the following:
#6881 is my proposed fix, with an attempt to summarize "what I believe is going wrong" in the PR message.
This test failed on a CI run on pull request 6475:
https://github.com/oxidecomputer/omicron/pull/6475/checks?check_run_id=29546110600
https://buildomat.eng.oxide.computer/wg/0/details/01J6RJ0W9K2R1TX0DVBZ0RS47V/qhyGpI4O40yzHVoFHWrAhRBFaESiU4fFqaOicq5NLEyLHAz2/01J6RJ164N5KYG7G3SJ5PFFX0H
Log showing the specific test failure:
https://buildomat.eng.oxide.computer/wg/0/details/01J6RJ0W9K2R1TX0DVBZ0RS47V/qhyGpI4O40yzHVoFHWrAhRBFaESiU4fFqaOicq5NLEyLHAz2/01J6RJ164N5KYG7G3SJ5PFFX0H#S5276
Excerpt from the log showing the failure: