CI test failure on "main" in concurrent_nexus_instances_only_move_forward #4093

Closed
davepacheco opened this issue Sep 14, 2023 · 11 comments · Fixed by #4096

https://github.com/oxidecomputer/omicron/runs/16803639081
buildomat job 01HAAF8SJY8674XS8J70TK3BQB

4142	2023-09-14T19:07:48.263Z	        FAIL [   4.592s] nexus-db-queries db::datastore::db_metadata::test::concurrent_nexus_instances_only_move_forward
4143	2023-09-14T19:07:48.266Z	
4144	2023-09-14T19:07:48.269Z	--- STDOUT:              nexus-db-queries db::datastore::db_metadata::test::concurrent_nexus_instances_only_move_forward ---
4145	2023-09-14T19:07:48.272Z	
4146	2023-09-14T19:07:48.274Z	running 1 test
4147	2023-09-14T19:07:48.277Z	test db::datastore::db_metadata::test::concurrent_nexus_instances_only_move_forward ... FAILED
4148	2023-09-14T19:07:48.281Z	
4149	2023-09-14T19:07:48.285Z	failures:
4150	2023-09-14T19:07:48.288Z	
4151	2023-09-14T19:07:48.291Z	failures:
4152	2023-09-14T19:07:48.295Z	    db::datastore::db_metadata::test::concurrent_nexus_instances_only_move_forward
4153	2023-09-14T19:07:48.299Z	
4154	2023-09-14T19:07:48.302Z	test result: FAILED. 0 passed; 1 failed; 0 ignored; 0 measured; 147 filtered out; finished in 4.46s
4155	2023-09-14T19:07:48.305Z	
4156	2023-09-14T19:07:48.308Z	
4157	2023-09-14T19:07:48.311Z	--- STDERR:              nexus-db-queries db::datastore::db_metadata::test::concurrent_nexus_instances_only_move_forward ---
4158	2023-09-14T19:07:48.314Z	log file: /var/tmp/omicron_tmp/nexus_db_queries-47279c83bec6079e-concurrent_nexus_instances_only_move_forward.5808.0.log
4159	2023-09-14T19:07:48.318Z	note: configured to log to "/var/tmp/omicron_tmp/nexus_db_queries-47279c83bec6079e-concurrent_nexus_instances_only_move_forward.5808.0.log"
4160	2023-09-14T19:07:48.321Z	thread 'db::datastore::db_metadata::test::concurrent_nexus_instances_only_move_forward' panicked at 'Failed to query for table: Connection(Query(DatabaseError(Unknown, "current transaction is aborted, commands ignored until end of transaction block")))', nexus/db-queries/src/db/datastore/db_metadata.rs:461:22

@smklein do you want to take a look?

smklein self-assigned this Sep 14, 2023

smklein commented Sep 14, 2023

At a high level, the test performs an upgrade through the following versions:

  • v0: Starting state
  • v1: Create a table named widget

add_upgrade(
    v1.clone(),
    wrap_in_version_checking_txn(
        &v0,
        &v1,
        "SELECT pg_sleep(RANDOM()); \
         CREATE TABLE IF NOT EXISTS widget(); \
         SELECT pg_sleep(RANDOM());",
    ),
)
.await;

  • v2: Drop the widget table

add_upgrade(
    v2.clone(),
    wrap_in_version_checking_txn(
        &v1,
        &v2,
        "DROP TABLE IF EXISTS widget;",
    ),
)
.await;

Then, it tries to spawn ten datastores, and verifies that "if you got past initialization, you should not be able to see the old 'widget' table":

let _ = futures::future::join_all((0..10).map(|_| {
    let log = log.clone();
    let pool = pool.clone();
    let config = config.clone();
    tokio::task::spawn(async move {
        let datastore = DataStore::new(&log, pool, Some(&config)).await?;
        // This is the crux of this test: confirm that, as each
        // migration completes, it's not possible to see any artifacts
        // of the "v1" migration (namely: the 'Widget' table should not
        // exist).
        let result = diesel::select(
            diesel::dsl::sql::<diesel::sql_types::Bool>(
                "EXISTS (SELECT * FROM pg_tables WHERE tablename = 'widget')"
            )
        )

The whole point of this test is to make sure that "old upgrades can't happen after newer upgrades succeed" -- so if widget is visible (aka, someone concurrently re-added it), something has gone horribly wrong.

On the bright side, I don't think that's happening. We wrap each of the upgrade requests in a transaction that should fail if a concurrent upgrade finished, so the transaction failures are actually expected.

On the less bright side, I'm seeing a failure here:

.expect("Failed to query for table");

Specifically, the error message:

Failed to query for table:
  Connection(
    Query(
       DatabaseError(Unknown, "current transaction is aborted, commands ignored until end of transaction block")
    )
  )

This is funky to me, because we're grabbing a connection from datastore.pool(), and we should not be issuing a transaction within this operation.

This makes me suspicious that the (expected) transaction failures are leaving around connections with aborted transactions that haven't been finished, which are somehow (???) being inserted back into the connection pool.
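
To illustrate that suspicion, here is a minimal sketch of the suspected failure mode. This is not code from the test; the pool type paths and error handling are assumptions for illustration.

use async_bb8_diesel::AsyncSimpleConnection;

// `pool` is assumed to be the same bb8 pool of async connections the
// datastore uses; exact type paths here are illustrative.
async fn poison_a_connection(
    pool: &bb8::Pool<async_bb8_diesel::ConnectionManager<diesel::PgConnection>>,
) {
    // A schema upgrade batch-executes a script containing its own BEGIN.
    let conn = pool.get().await.unwrap();
    let res = conn
        .batch_execute_async("BEGIN; SELECT CAST('oops' AS BOOL); COMMIT;")
        .await;

    // The CAST fails and an error comes back to the caller, but nobody
    // issues a ROLLBACK: the server-side transaction is still open, just
    // aborted.
    assert!(res.is_err());

    // Dropping the handle returns the connection to the pool as-is...
    drop(conn);

    // ...so the next caller that happens to check out this connection gets
    // "current transaction is aborted, commands ignored until end of
    // transaction block" for an unrelated query.
    let conn = pool.get().await.unwrap();
    let _ = conn.batch_execute_async("SELECT 1;").await;
}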


smklein commented Sep 14, 2023

Also weirdly: I am having absolutely no luck repro-ing this locally (on Linux). If anyone is able to get this to reproduce on their machine, please lemme know.

Gonna try deploying to illumos and seeing if the failure is more reliable there.


smklein commented Sep 14, 2023

Update: each test iteration takes ~6 seconds, and I haven't seen failures on either machine while running exclusively this test for a few minutes.


smklein commented Sep 14, 2023

I tried updating:

let wrap_in_version_checking_txn = |version, target, sql| -> String {
    format!("BEGIN; \
        SELECT CAST(\
            IF(\
                (\
                    SELECT version = '{version}' and target_version = '{target}'\
                    FROM omicron.public.db_metadata WHERE singleton = true\
                ),\
                'true',\
                'Invalid starting version for schema change'\
            ) AS BOOL\
        );\
        {sql};\
        COMMIT;")
};

To:

                            SELECT version = '{version}' and target_version = '{target}' and RANDOM() < 0.1\

To try to trigger the transaction to abort more often. No dice.


smklein commented Sep 14, 2023

Okay, I found that modifying the following lines:

let pool = builder
    .connection_customizer(Box::new(DisallowFullTableScans {}))
    .error_sink(Box::new(error_sink))
    .build_unchecked(manager);

by adding:

  .queue_strategy(bb8::QueueStrategy::Lifo)
  .max_size(5)

seems to cause the failure to trigger locally.
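
Combined, the builder ends up looking like this (a sketch of the result, not the exact diff). Presumably it reproduces more readily because, with only five connections and a last-in-first-out queue, the connection that was just returned while still inside an aborted transaction is the first one handed to the next caller.

let pool = builder
    .connection_customizer(Box::new(DisallowFullTableScans {}))
    .error_sink(Box::new(error_sink))
    // Hand out the most recently returned connection first...
    .queue_strategy(bb8::QueueStrategy::Lifo)
    // ...and keep the pool small, so a poisoned connection is reused quickly.
    .max_size(5)
    .build_unchecked(manager);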

I'm starting to question if this failure is specific to this test, or could occur on any test with transaction errors, and this just happens to be a test that pushes that behavior intentionally. EDIT FROM SEAN IN THE FUTURE: This is "sorta true" -- it's true for any batch_execute-ed transactions, but those are only happening in schema upgrades right now. I think.


smklein commented Sep 15, 2023

... I think this may be a case where diesel has been helping us, but where doing things with manual SQL files breaks down.

https://docs.diesel.rs/master/src/diesel/connection/transaction_manager.rs.html#50-71 , for example, appears to manually monitor for any errors returned after a transaction BEGIN, and explicitly makes the choice to COMMIT or ROLLBACK. In contrast, when we batch execute some SQL that contains a BEGIN statement, I think we keep on executing subsequent commands, even on error?
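
Roughly, the difference looks like this (a sketch using Diesel 2.x's synchronous API for brevity; the real code goes through the async pool wrappers):

use diesel::connection::SimpleConnection;
use diesel::prelude::*;

fn contrast(conn: &mut PgConnection) -> QueryResult<()> {
    // Diesel's transaction manager issues BEGIN itself, watches the closure's
    // result, and then explicitly sends COMMIT on Ok or ROLLBACK on Err, so
    // the connection never comes back still parked inside a transaction.
    conn.transaction::<_, diesel::result::Error, _>(|conn| {
        conn.batch_execute("CREATE TABLE IF NOT EXISTS widget();")?;
        Ok(())
    })?;

    // Batch-executing a script that embeds its own BEGIN/COMMIT gives Diesel
    // no visibility into the transaction: if a statement in the middle fails,
    // the error is returned to us, no ROLLBACK is ever sent, and the
    // server-side transaction stays open-and-aborted.
    conn.batch_execute(
        "BEGIN; \
         SELECT CAST('not a bool' AS BOOL); \
         COMMIT;",
    )?;
    Ok(())
}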


smklein commented Sep 15, 2023

Okay, this is truly horrific, but I tried changing the following:

// Applies a schema update, using raw SQL read from a caller-supplied
// configuration file.
async fn apply_schema_update(&self, sql: &String) -> Result<(), Error> {
    self.pool().batch_execute_async(&sql).await.map_err(|e| {
        Error::internal_error(&format!("Failed to execute upgrade: {e}"))
    })?;
    Ok(())
}

to

    async fn apply_schema_update(&self, sql: &String) -> Result<(), Error> {
        // Grab a SINGLE connection so we can rollback if we encounter an error...
        let conn = self.pool().get().await.unwrap();

        if let Err(e) = conn.batch_execute_async(&sql).await.map_err(|e| {
            Error::internal_error(&format!("Failed to execute upgrade: {e}"))
        }) {
            // If we hit an error, assume it happened while we were in a transaction.
            //
            // THIS ISN'T SAFE, WE JUST HAPPEN TO KNOW IT'S THE CASE FOR THIS TEST.
            //
            // Explicitly issue a ROLLBACK command to clean the connection.
            conn.batch_execute_async("ROLLBACK;").await.unwrap();
            return Err(e);
        };
        Ok(())
    }

And as a result, I'm seeing this test pass 100+ times in a row.

I really wish it were possible for SQL scripts to have their own control flow: either "auto-rollback on error" within a transaction, or the ability to conditionally roll back (via something like IF(invariant, true, ROLLBACK)). However, I don't think either of these is possible.

As a result, I guess it's just "not safe to issue a transaction that can ever fail" from within a batch SQL script, because we won't know whether we need to explicitly issue a ROLLBACK as a follow-up.

smklein added a commit that referenced this issue Sep 15, 2023
- Previously, schema updates encouraged authors on each change to add
their own transactions, validating that the "current" and "target"
versions are correct.
- This unfortunately is not handled particularly well in scripted SQL. I
**incorrectly** thought that failing a transaction while
`batch_execute`-ing it (e.g., via a `CAST` error) would cause the
transaction to fail and roll back. **This is not true**. In CockroachDB,
an error is thrown, but the transaction is not closed. This was the
cause of #4093, where connections stuck in this mangled ongoing
transaction state were placed back into the connection pool.
- To fix this: Nexus now explicitly wraps each schema change in a
transaction using Diesel, which ensures that on success they're
committed, and on failure they're rolled back.
- Additionally, this PR upgrades all existing schema changes to conform
to this "implied transaction from Nexus" policy, and makes it possible
to upgrade using multiple transactions in a single version change.

Fixes #4093
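
For context, the approach described in that commit message might look roughly like the following. This is a sketch only: the transaction_async usage and error plumbing are assumptions for illustration, not the PR's literal code.

async fn apply_schema_update(&self, sql: &String) -> Result<(), Error> {
    let sql = sql.clone();
    self.pool()
        .transaction_async(|conn| async move {
            // On success the wrapper issues COMMIT; on any error it issues
            // ROLLBACK, so the connection is never handed back to the pool
            // while still sitting inside an aborted transaction.
            conn.batch_execute_async(&sql).await
        })
        .await
        .map_err(|e| {
            Error::internal_error(&format!("Failed to execute upgrade: {e}"))
        })?;
    Ok(())
}
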
davepacheco commented

I wonder if it should be part of the connection pool (or how it's set up) to rollback any in-progress transaction(s). I think that may be common practice. It seems like that'd address both this problem and the sort of thing that happened in #3207 / async-bb8-diesel#47. What do you think @smklein?


smklein commented Oct 3, 2023

So the original implementation of this PR used the following:

So from a Diesel / Diesel-Async point-of-view, there's no knowledge about a transaction occurring. I believe the issue here was that a transaction was started by batch_execute, but we had no idea if it successfully committed, or if it required a ROLLBACK statement.

I believe that if we used the .transaction method, this "auto-rollback" behavior is already implemented:

I agree with the issue raised in async-bb8-diesel#47 -- it needs to be cancel safe -- but I believe that the "mechanism on the .transaction method" makes sense. By wrapping the transaction function, we know if a ROLLBACK is necessary, but on a batch_execute function, we have no idea if a transaction is occurring.

davepacheco commented

Sorry, what I meant was not to try to track whether a transaction was created, but rather to always issue ROLLBACK when a connection is put back into the pool. This will fail if there was no open transaction, and we'd just ignore that error in this context. This would be a fallback to catch cases like this, rather than relying on the consuming code to have correctly tracked this information.

What I meant about common practice is that I think other connection pools have the notion of "stuff that gets done when a connection gets put back into the pool to ensure that connections all come out of the pool in a consistent state". Here's an example from deadpool's PostgreSQL crate:
https://docs.rs/deadpool-postgres/latest/deadpool_postgres/enum.RecyclingMethod.html

They provide a few different choices with tradeoffs in terms of cost vs. completeness. RecyclingMethod::Fast does basically nothing. RecyclingMethod::Clean runs several SQL statements that close open cursors, reset any connection-local parameters that were set with SET, unlock any locks that were somehow left locked, etc. (I'm surprised they don't seem to do a rollback!)
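
For reference, configuring that recycling behavior in deadpool-postgres looks roughly like this (a sketch based on the linked docs; the connection settings are illustrative and this is not omicron code):

use deadpool_postgres::{Config, ManagerConfig, RecyclingMethod, Runtime};

fn build_pool() -> Result<deadpool_postgres::Pool, Box<dyn std::error::Error>> {
    let mut cfg = Config::new();
    cfg.dbname = Some("example".to_string());
    // RecyclingMethod controls what runs each time a connection is recycled
    // by the pool: Fast does essentially nothing, while Clean runs the
    // cleanup statements described above (close cursors, reset session
    // parameters, release locks) so every checkout starts from a known
    // session state. Notably, per the comment above, none of the built-in
    // methods appear to issue a ROLLBACK.
    cfg.manager =
        Some(ManagerConfig { recycling_method: RecyclingMethod::Clean });
    let pool = cfg.create_pool(Some(Runtime::Tokio1), tokio_postgres::NoTls)?;
    Ok(pool)
}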


smklein commented Oct 3, 2023

I filed #4191 to track, and #4192 as a related issue.
