CI test failure on "main" in concurrent_nexus_instances_only_move_forward #4093

Closed
davepacheco opened this issue Sep 14, 2023 · 11 comments · Fixed by #4096

https://github.com/oxidecomputer/omicron/runs/16803639081
buildomat job 01HAAF8SJY8674XS8J70TK3BQB

4142	2023-09-14T19:07:48.263Z	        FAIL [   4.592s] nexus-db-queries db::datastore::db_metadata::test::concurrent_nexus_instances_only_move_forward
4143	2023-09-14T19:07:48.266Z	
4144	2023-09-14T19:07:48.269Z	--- STDOUT:              nexus-db-queries db::datastore::db_metadata::test::concurrent_nexus_instances_only_move_forward ---
4145	2023-09-14T19:07:48.272Z	
4146	2023-09-14T19:07:48.274Z	running 1 test
4147	2023-09-14T19:07:48.277Z	test db::datastore::db_metadata::test::concurrent_nexus_instances_only_move_forward ... FAILED
4148	2023-09-14T19:07:48.281Z	
4149	2023-09-14T19:07:48.285Z	failures:
4150	2023-09-14T19:07:48.288Z	
4151	2023-09-14T19:07:48.291Z	failures:
4152	2023-09-14T19:07:48.295Z	    db::datastore::db_metadata::test::concurrent_nexus_instances_only_move_forward
4153	2023-09-14T19:07:48.299Z	
4154	2023-09-14T19:07:48.302Z	test result: FAILED. 0 passed; 1 failed; 0 ignored; 0 measured; 147 filtered out; finished in 4.46s
4155	2023-09-14T19:07:48.305Z	
4156	2023-09-14T19:07:48.308Z	
4157	2023-09-14T19:07:48.311Z	--- STDERR:              nexus-db-queries db::datastore::db_metadata::test::concurrent_nexus_instances_only_move_forward ---
4158	2023-09-14T19:07:48.314Z	log file: /var/tmp/omicron_tmp/nexus_db_queries-47279c83bec6079e-concurrent_nexus_instances_only_move_forward.5808.0.log
4159	2023-09-14T19:07:48.318Z	note: configured to log to "/var/tmp/omicron_tmp/nexus_db_queries-47279c83bec6079e-concurrent_nexus_instances_only_move_forward.5808.0.log"
4160	2023-09-14T19:07:48.321Z	thread 'db::datastore::db_metadata::test::concurrent_nexus_instances_only_move_forward' panicked at 'Failed to query for table: Connection(Query(DatabaseError(Unknown, "current transaction is aborted, commands ignored until end of transaction block")))', nexus/db-queries/src/db/datastore/db_metadata.rs:461:22

@smklein do you want to take a look?

smklein self-assigned this Sep 14, 2023

smklein commented Sep 14, 2023

At a high level, the test performs an upgrade through the following versions:

  • v0: Starting state
  • v1: Create a table named widget

add_upgrade(
    v1.clone(),
    wrap_in_version_checking_txn(
        &v0,
        &v1,
        "SELECT pg_sleep(RANDOM()); \
         CREATE TABLE IF NOT EXISTS widget(); \
         SELECT pg_sleep(RANDOM());",
    ),
)
.await;

  • v2: Drop the widget table

add_upgrade(
    v2.clone(),
    wrap_in_version_checking_txn(
        &v1,
        &v2,
        "DROP TABLE IF EXISTS widget;",
    ),
)
.await;

Then, it tries to spawn ten datastores, and verifies that "if you got past initialization, you should not be able to see the old 'widget' table":

let _ = futures::future::join_all((0..10).map(|_| {
    let log = log.clone();
    let pool = pool.clone();
    let config = config.clone();
    tokio::task::spawn(async move {
        let datastore = DataStore::new(&log, pool, Some(&config)).await?;
        // This is the crux of this test: confirm that, as each
        // migration completes, it's not possible to see any artifacts
        // of the "v1" migration (namely: the 'Widget' table should not
        // exist).
        let result = diesel::select(
            diesel::dsl::sql::<diesel::sql_types::Bool>(
                "EXISTS (SELECT * FROM pg_tables WHERE tablename = 'widget')"
            )
        )

The whole point of this test is to make sure that "old upgrades can't happen after newer upgrades succeed" -- so if widget is visible (aka, someone concurrently re-added it), something has gone horribly wrong.

On the bright side, I don't think that's happening. We wrap each of the upgrade requests in a transaction that should fail if a concurrent upgrade finished, so the transaction failures are actually expected.

On the less bright side, I'm seeing a failure here:

.expect("Failed to query for table");

Specifically, the error message:

Failed to query for table:
  Connection(
    Query(
       DatabaseError(Unknown, "current transaction is aborted, commands ignored until end of transaction block")
    )
  )

This is funky to me, because we're grabbing a connection from datastore.pool(), and we should not be issuing a transaction within this operation.

This makes me suspicious that the (expected) transaction failures are leaving around connections with aborted transactions that haven't been finished, which are somehow (???) being inserted back into the connection pool.
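
To illustrate that suspicion, here is a minimal sketch of the suspected failure mode. This is not code from the test; the pool type paths and error handling are assumptions for illustration.

use async_bb8_diesel::AsyncSimpleConnection;

// `pool` is assumed to be the same bb8 pool of async connections the
// datastore uses; exact type paths here are illustrative.
async fn poison_a_connection(
    pool: &bb8::Pool<async_bb8_diesel::ConnectionManager<diesel::PgConnection>>,
) {
    // A schema upgrade batch-executes a script containing its own BEGIN.
    let conn = pool.get().await.unwrap();
    let res = conn
        .batch_execute_async("BEGIN; SELECT CAST('oops' AS BOOL); COMMIT;")
        .await;

    // The CAST fails and an error comes back to the caller, but nobody
    // issues a ROLLBACK: the server-side transaction is still open, just
    // aborted.
    assert!(res.is_err());

    // Dropping the handle returns the connection to the pool as-is...
    drop(conn);

    // ...so the next caller that happens to check out this connection gets
    // "current transaction is aborted, commands ignored until end of
    // transaction block" for an unrelated query.
    let conn = pool.get().await.unwrap();
    let _ = conn.batch_execute_async("SELECT 1;").await;
}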


smklein commented Sep 14, 2023

Also weirdly: I am having absolutely no luck repro-ing this locally (on Linux). If anyone is able to get this to reproduce on their machine, please lemme know.

Gonna try deploying to illumos and seeing if the failure is more reliable there.


smklein commented Sep 14, 2023

Update: each test iteration takes ~6 seconds, and I haven't seen failures on either machine while running exclusively this test for a few minutes.


smklein commented Sep 14, 2023

I tried updating:

let wrap_in_version_checking_txn = |version, target, sql| -> String {
    format!("BEGIN; \
        SELECT CAST(\
            IF(\
                (\
                    SELECT version = '{version}' and target_version = '{target}'\
                    FROM omicron.public.db_metadata WHERE singleton = true\
                ),\
                'true',\
                'Invalid starting version for schema change'\
            ) AS BOOL\
        );\
        {sql};\
        COMMIT;")
};

To:

                            SELECT version = '{version}' and target_version = '{target}' and RANDOM() < 0.1\

To try to trigger the transaction to abort more often. No dice.


smklein commented Sep 14, 2023

Okay, I found that modifying the following lines:

let pool = builder
    .connection_customizer(Box::new(DisallowFullTableScans {}))
    .error_sink(Box::new(error_sink))
    .build_unchecked(manager);

by adding:

  .queue_strategy(bb8::QueueStrategy::Lifo)
  .max_size(5)

seems to cause the failure to trigger locally.
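
Combined, the builder ends up looking like this (a sketch of the result, not the exact diff). Presumably it reproduces more readily because, with only five connections and a last-in-first-out queue, the connection that was just returned while still inside an aborted transaction is the first one handed to the next caller.

let pool = builder
    .connection_customizer(Box::new(DisallowFullTableScans {}))
    .error_sink(Box::new(error_sink))
    // Hand out the most recently returned connection first...
    .queue_strategy(bb8::QueueStrategy::Lifo)
    // ...and keep the pool small, so a poisoned connection is reused quickly.
    .max_size(5)
    .build_unchecked(manager);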

I'm starting to question if this failure is specific to this test, or could occur on any test with transaction errors, and this just happens to be a test that pushes that behavior intentionally. EDIT FROM SEAN IN THE FUTURE: This is "sorta true" -- it's true for any batch_execute-ed transactions, but those are only happening in schema upgrades right now. I think.


smklein commented Sep 15, 2023

... I think this may be a case where diesel has been helping us, but where doing things with manual SQL files breaks down.

https://docs.diesel.rs/master/src/diesel/connection/transaction_manager.rs.html#50-71 , for example, appears to manually monitor for any errors returned after a transaction BEGIN, and explicitly makes the choice to COMMIT or ROLLBACK. In contrast, when we batch execute some SQL that contains a BEGIN statement, I think we keep on executing subsequent commands, even on error?
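
Roughly, the difference looks like this (a sketch using Diesel 2.x's synchronous API for brevity; the real code goes through the async pool wrappers):

use diesel::connection::SimpleConnection;
use diesel::prelude::*;

fn contrast(conn: &mut PgConnection) -> QueryResult<()> {
    // Diesel's transaction manager issues BEGIN itself, watches the closure's
    // result, and then explicitly sends COMMIT on Ok or ROLLBACK on Err, so
    // the connection never comes back still parked inside a transaction.
    conn.transaction::<_, diesel::result::Error, _>(|conn| {
        conn.batch_execute("CREATE TABLE IF NOT EXISTS widget();")?;
        Ok(())
    })?;

    // Batch-executing a script that embeds its own BEGIN/COMMIT gives Diesel
    // no visibility into the transaction: if a statement in the middle fails,
    // the error is returned to us, no ROLLBACK is ever sent, and the
    // server-side transaction stays open-and-aborted.
    conn.batch_execute(
        "BEGIN; \
         SELECT CAST('not a bool' AS BOOL); \
         COMMIT;",
    )?;
    Ok(())
}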


smklein commented Sep 15, 2023

Okay, this is truly horrific, but I tried changing the following:

// Applies a schema update, using raw SQL read from a caller-supplied
// configuration file.
async fn apply_schema_update(&self, sql: &String) -> Result<(), Error> {
    self.pool().batch_execute_async(&sql).await.map_err(|e| {
        Error::internal_error(&format!("Failed to execute upgrade: {e}"))
    })?;
    Ok(())
}

to

    async fn apply_schema_update(&self, sql: &String) -> Result<(), Error> {
        // Grab a SINGLE connection so we can rollback if we encounter an error...
        let conn = self.pool().get().await.unwrap();

        if let Err(e) = conn.batch_execute_async(&sql).await.map_err(|e| {
            Error::internal_error(&format!("Failed to execute upgrade: {e}"))
        }) {
            // If we hit an error, assume it happened while we were in a transaction.
            //
            // THIS ISN'T SAFE, WE JUST HAPPEN TO KNOW IT'S THE CASE FOR THIS TEST.
            //
            // Explicitly issue a ROLLBACK command to clean the connection.
            conn.batch_execute_async("ROLLBACK;").await.unwrap();
            return Err(e);
        };
        Ok(())
    }

And as a result, I'm seeing this test pass 100+ times in a row.

I really wish it were possible for SQL scripts to have their own control flow: either "auto-rollback on error" within a transaction, or the ability to conditionally roll back (via something like IF(invariant, true, ROLLBACK)). However, I don't think either of these is possible.

As a result, I guess it's just "not safe to issue a transaction that can ever fail" from within a batch SQL script, because we won't know whether we need to explicitly issue a ROLLBACK as a follow-up.

smklein added a commit that referenced this issue Sep 15, 2023
- Previously, schema updates encouraged authors on each change to add
their own transactions, validating that the "current" and "target"
versions are correct.
- This unfortunately is not handled particularly well in scripted SQL. I
**incorrectly** thought that failing a transaction while
`batch_execute`-ing it (e.g., via a `CAST` error) would cause the
transaction to fail and roll back. **This is not true**. In CockroachDB,
an error is thrown, but the transaction is not closed. This was the
cause of #4093, where connections stuck in this mangled ongoing
transaction state were placed back into the connection pool.
- To fix this: Nexus now explicitly wraps each schema change in a
transaction using Diesel, which ensures that on success they're
committed, and on failure they're rolled back.
- Additionally, this PR upgrades all existing schema changes to conform
to this "implied transaction from Nexus" policy, and makes it possible
to upgrade using multiple transactions in a single version change.

Fixes #4093
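
For context, the approach described in that commit message might look roughly like the following. This is a sketch only: the transaction_async usage and error plumbing are assumptions for illustration, not the PR's literal code.

async fn apply_schema_update(&self, sql: &String) -> Result<(), Error> {
    let sql = sql.clone();
    self.pool()
        .transaction_async(|conn| async move {
            // On success the wrapper issues COMMIT; on any error it issues
            // ROLLBACK, so the connection is never handed back to the pool
            // while still sitting inside an aborted transaction.
            conn.batch_execute_async(&sql).await
        })
        .await
        .map_err(|e| {
            Error::internal_error(&format!("Failed to execute upgrade: {e}"))
        })?;
    Ok(())
}
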
davepacheco commented

I wonder if it should be part of the connection pool (or how it's set up) to rollback any in-progress transaction(s). I think that may be common practice. It seems like that'd address both this problem and the sort of thing that happened in #3207 / async-bb8-diesel#47. What do you think @smklein?


smklein commented Oct 3, 2023

So the original implementation of this PR used the following:

So from a Diesel / Diesel-Async point-of-view, there's no knowledge about a transaction occurring. I believe the issue here was that a transaction was started by batch_execute, but we had no idea if it successfully committed, or if it required a ROLLBACK statement.

I believe that if we used the .transaction method, this "auto-rollback" behavior is already implemented:

I agree with the issue raised in async-bb8-diesel#47 -- it needs to be cancel safe -- but I believe that the "mechanism on the .transaction method" makes sense. By wrapping the transaction function, we know if a ROLLBACK is necessary, but on a batch_execute function, we have no idea if a transaction is occurring.

davepacheco commented

Sorry, what I meant was not to try to track whether a transaction was created, but rather to always issue ROLLBACK when a connection is put back into the pool. This will fail if there was no open transaction, and we'd just ignore that error in this context. This would be a fallback to catch cases like this, rather than relying on the consuming code to have correctly tracked this information.

What I meant about common practice is that I think other connection pools have the notion of "stuff that gets done when a connection gets put back into the pool to ensure that connections all come out of the pool in a consistent state". Here's an example from deadpool's PostgreSQL crate:
https://docs.rs/deadpool-postgres/latest/deadpool_postgres/enum.RecyclingMethod.html

They provide a few different choices with tradeoffs in terms of cost vs. completeness. RecyclingMethod::Fast does basically nothing. RecyclingMethod::Clean runs several SQL statements that close open cursors, reset any connection-local parameters that were set with SET, unlock any locks that were somehow left locked, etc. (I'm surprised they don't seem to do a rollback!)
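
For reference, configuring that recycling behavior in deadpool-postgres looks roughly like this (a sketch based on the linked docs; the connection settings are illustrative and this is not omicron code):

use deadpool_postgres::{Config, ManagerConfig, RecyclingMethod, Runtime};

fn build_pool() -> Result<deadpool_postgres::Pool, Box<dyn std::error::Error>> {
    let mut cfg = Config::new();
    cfg.dbname = Some("example".to_string());
    // RecyclingMethod controls what runs each time a connection is recycled
    // by the pool: Fast does essentially nothing, while Clean runs the
    // cleanup statements described above (close cursors, reset session
    // parameters, release locks) so every checkout starts from a known
    // session state. Notably, per the comment above, none of the built-in
    // methods appear to issue a ROLLBACK.
    cfg.manager =
        Some(ManagerConfig { recycling_method: RecyclingMethod::Clean });
    let pool = cfg.create_pool(Some(Runtime::Tokio1), tokio_postgres::NoTls)?;
    Ok(pool)
}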


smklein commented Oct 3, 2023

I filed #4191 to track, and #4192 as a related issue.
