CI test failure on "main" in concurrent_nexus_instances_only_move_forward
#4093
At a high level, the test performs an upgrade through the following:
omicron/nexus/db-queries/src/db/datastore/db_metadata.rs Lines 417 to 427 in 22a0179
omicron/nexus/db-queries/src/db/datastore/db_metadata.rs Lines 430 to 438 in 22a0179
Then, it tries to spawn ten datastores, and verifies that "if you got past initialization, you should not be able to see the old 'widget' table": omicron/nexus/db-queries/src/db/datastore/db_metadata.rs Lines 443 to 458 in 22a0179
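Put differently, the tail of the test is shaped roughly like this (an illustrative sketch, not the real test body; `initialize_datastore` and `widget_table_is_visible` are hypothetical stand-ins for what the real code does):

```rust
// Sketch of the check: spawn several datastore initializations against the same
// database. Every one that gets past initialization must be on the newest schema,
// so the old "widget" table should no longer be visible to it.
let mut tasks = Vec::new();
for _ in 0..10 {
    // Hypothetical stand-in for constructing a DataStore, which runs schema
    // upgrades as part of initialization.
    tasks.push(tokio::spawn(initialize_datastore(pool.clone())));
}
for task in tasks {
    if let Ok(Ok(datastore)) = task.await {
        // Hypothetical helper: runs `SELECT * FROM widget` and reports whether it works.
        assert!(!widget_table_is_visible(&datastore).await);
    }
}
```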
The whole point of this test is to make sure that "old upgrades can't happen after newer upgrades succeed" -- and on the bright side, I don't think that's happening. We wrap each of the upgrade requests in a transaction that should fail if a concurrent upgrade finished, so the transaction failures are actually expected. On the less bright side, I'm seeing a failure here:
Specifically, the error message:
This is funky to me, because we're grabbing a fresh connection from the pool. This makes me suspicious that the (expected) transaction failures are leaving around connections with aborted transactions that haven't been finished, which are somehow (???) being inserted back into the connection pool.
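A minimal sketch of the failure mode I have in mind (hand-written, not taken from the test; it assumes async-bb8-diesel's `batch_execute_async`):

```rust
// If this batch fails partway through -- after the BEGIN but before the COMMIT --
// CockroachDB reports the error but leaves the explicit transaction open (aborted)
// on this connection. Nothing here ever issues a ROLLBACK.
let conn = pool.get().await.unwrap();
let _ = conn
    .batch_execute_async("BEGIN; SELECT CAST('oops' AS BOOL); COMMIT;")
    .await;

// When `conn` drops, the connection goes back into the pool still stuck inside that
// aborted transaction. The next caller handed this connection sees errors like
// "current transaction is aborted, commands ignored until end of transaction block",
// even though it never started a transaction itself.
drop(conn);
```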
Also weirdly: I am having absolutely no luck repro-ing this locally (on Linux). If anyone is able to get this to reproduce on their machine, please lemme know. Gonna try deploying to illumos and seeing if the failure is more reliable there.
Update: Each test iteration takes ~6 seconds; haven't seen failures on either machine running exclusively this test for a few minutes.
I tried updating: omicron/nexus/db-queries/src/db/datastore/db_metadata.rs Lines 397 to 411 in 22a0179
To: SELECT version = '{version}' and target_version = '{target}' and RANDOM() < 0.1
To try to trigger the transaction to abort more often. No dice.
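For context, the guard those lines build is roughly shaped like this (an illustrative paraphrase, not the exact SQL in db_metadata.rs; the failed `CAST` is what forces the whole transaction to error out when the versions don't match):

```rust
// Illustrative paraphrase of the per-upgrade guard: the whole step runs inside one
// explicit transaction, and the first statement errors out (via a failed CAST) unless
// the recorded versions still match what this Nexus expects.
fn guarded_upgrade_sql(version: &str, target: &str, schema_change: &str) -> String {
    format!(
        "BEGIN; \
         SELECT CAST(IF((SELECT version = '{version}' AND target_version = '{target}' \
            FROM omicron.public.db_metadata LIMIT 1), 'true', 'version mismatch') AS BOOL); \
         {schema_change}; \
         COMMIT;"
    )
}
```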
Okay, I found that by modifying the following lines: omicron/nexus/db-queries/src/db/pool.rs Lines 76 to 79 in 22a0179
And adding:
.queue_strategy(bb8::QueueStrategy::Lifo)
.max_size(5)
seems to cause this failure to trigger locally. I'm starting to question whether this failure is specific to this test, or whether it could occur on any test with transaction errors, and this just happens to be a test that pushes that behavior intentionally. EDIT FROM SEAN IN THE FUTURE: This is "sorta true" -- it's true for any test that can hit transaction errors, not just this one.
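For reference, roughly what the modified builder looks like (a sketch; the real builder in pool.rs has more configuration around it, and `manager` stands in for whatever connection manager pool.rs already constructs):

```rust
// LIFO hands the most recently returned connection straight to the next caller, and a
// tiny pool makes it very likely that the next caller receives the exact connection
// whose transaction was just left dangling -- which is what makes the failure reproduce.
let pool = bb8::Pool::builder()
    .queue_strategy(bb8::QueueStrategy::Lifo)
    .max_size(5)
    .build(manager)
    .await
    .expect("failed to build connection pool");
```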
... I think this may be a case where diesel has been helping us, but where doing things with manual SQL files breaks down. https://docs.diesel.rs/master/src/diesel/connection/transaction_manager.rs.html#50-71 , for example, appears to manually monitor for any errors returned after a transaction is opened.
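To make that concrete: when Diesel owns the transaction, an error in the closure gets a ROLLBACK issued for us automatically (a synchronous Diesel 2.x-style sketch, not the async wrapper we actually use):

```rust
use diesel::prelude::*;

// When the transaction goes through Diesel's API, an Err from the closure causes the
// transaction manager to issue the ROLLBACK itself before propagating the error, so
// the connection is never handed back while still inside a broken transaction.
fn guarded_update(conn: &mut PgConnection) -> QueryResult<()> {
    conn.transaction(|conn| {
        // This statement fails, and Diesel rolls the transaction back for us.
        diesel::sql_query("SELECT CAST('oops' AS BOOL)").execute(conn)?;
        Ok(())
    })
}
```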
Okay, this is truly horrific, but I tried changing the following: omicron/nexus/db-queries/src/db/datastore/db_metadata.rs Lines 274 to 281 in 22a0179
to:

```rust
async fn apply_schema_update(&self, sql: &String) -> Result<(), Error> {
    // Grab a SINGLE connection so we can rollback if we encounter an error...
    let conn = self.pool().get().await.unwrap();
    if let Err(e) = conn.batch_execute_async(&sql).await.map_err(|e| {
        Error::internal_error(&format!("Failed to execute upgrade: {e}"))
    }) {
        // If we hit an error, assume it happened while we were in a transaction.
        //
        // THIS ISN'T SAFE, WE JUST HAPPEN TO KNOW IT'S THE CASE FOR THIS TEST.
        //
        // Explicitly issue a ROLLBACK command to clean the connection.
        conn.batch_execute_async("ROLLBACK;").await.unwrap();
        return Err(e);
    };
    Ok(())
}
```

And as a result, I'm seeing this test pass 100+ times in a row. I really wish it were possible for SQL scripts to have their own control flow, with "auto-rollback on error" in a transaction, or the ability to conditionally roll back. As it stands, I guess it's just "not safe to issue a transaction that can ever fail" from within a batch SQL script, because we won't know whether we need to explicitly issue a ROLLBACK afterwards.
- Previously, schema updates encouraged authors of each change to add their own transactions, validating that the "current" and "target" versions are correct.
- This unfortunately is not handled particularly well in scripted SQL. I **incorrectly** thought that failing a transaction while `batch_execute`-ing it (e.g., via a `CAST` error) would cause the transaction to fail and roll back. **This is not true**. In CockroachDB, an error is thrown, but the transaction is not closed. This was the cause of #4093, where connections stuck in this mangled ongoing transaction state were placed back into the connection pool.
- To fix this: Nexus now explicitly wraps each schema change in a transaction using Diesel, which ensures that "on success, they're committed, and on failure, they're rolled back".
- Additionally, this PR upgrades all existing schema changes to conform to this "implied transaction from Nexus" policy, and makes it possible to upgrade using multiple transactions in a single version change.

Fixes #4093
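A sketch of the shape of that fix (simplified; the error mapping and method names approximate the real db_metadata.rs and assume async-bb8-diesel's `transaction_async`):

```rust
use async_bb8_diesel::{AsyncConnection, AsyncSimpleConnection};

// Nexus owns the transaction instead of the SQL script: if the batch fails,
// `transaction_async` issues the ROLLBACK before the connection can return to the
// pool, and if it succeeds the whole schema change commits atomically.
async fn apply_schema_update(&self, sql: &str) -> Result<(), Error> {
    let conn = self.pool().get().await.map_err(|e| {
        Error::internal_error(&format!("Failed to get connection: {e}"))
    })?;
    let sql = sql.to_string();
    conn.transaction_async(|conn| async move { conn.batch_execute_async(&sql).await })
        .await
        .map_err(|e| Error::internal_error(&format!("Failed to execute upgrade: {e}")))?;
    Ok(())
}
```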
So the original implementation of this PR used the following:
So from a Diesel / Diesel-Async point-of-view, there's no knowledge about a transaction occurring. I believe the issue here was that the transaction was started by the SQL script itself, so Diesel's transaction manager had no idea one was open. I believe that if we used the Diesel transaction API instead, the rollback-on-error handling would be done for us.
I agree with the issue raised in async-bb8-diesel#47 -- it needs to be cancel safe -- but I believe that the "mechanism on the
Sorry, what I meant was not to try to track whether a transaction was created, but rather to always issue a ROLLBACK when a connection is handed back, so the pool never holds a connection stuck mid-transaction. What I meant about common practice is that I think other connection pools have the notion of "stuff that gets done when a connection gets put back into the pool to ensure that connections all come out of the pool in a consistent state". Here's an example from deadpool's PostgreSQL crate: they provide a few different choices with tradeoffs in terms of cost vs. completeness.
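For the curious, a sketch from memory of what that looks like in deadpool-postgres (names may differ across versions; the connection config here is purely illustrative):

```rust
use deadpool_postgres::{Manager, ManagerConfig, Pool, RecyclingMethod};

// deadpool-postgres lets you choose how much work happens each time a connection is
// recycled out of the pool: `Fast` does nothing extra, `Verified` pings the connection
// first, and there are more thorough (and more expensive) options for resetting session
// state. The point is that connections are normalized before reuse.
let pg_config: tokio_postgres::Config = "host=localhost user=postgres".parse().unwrap();
let manager = Manager::from_config(
    pg_config,
    tokio_postgres::NoTls,
    ManagerConfig { recycling_method: RecyclingMethod::Verified },
);
let pool = Pool::builder(manager).max_size(16).build().unwrap();
```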
https://github.com/oxidecomputer/omicron/runs/16803639081
buildomat job 01HAAF8SJY8674XS8J70TK3BQB
@smklein do you want to take a look?