
test failed in CI: oximeter-db::integration_test test_cluster #6508

Closed
hawkw opened this issue Sep 3, 2024 · 4 comments · Fixed by #6515
Labels
Test Flake: Tests that work. Wait, no. Actually yes. Hang on. Something is broken.

Comments

hawkw (Member) commented Sep 3, 2024

This test failed on a CI run on #6503:

https://github.com/oxidecomputer/omicron/pull/6503/checks?check_run_id=29578652229

Log showing the specific test failure:

https://buildomat.eng.oxide.computer/wg/0/details/01J6T1X30H12CA5RSZT555AG5G/3vaGlLPSR6mbE4NtzsMac1xBduXXkHgAv2swx5SbzlkUQQE6/01J6T1XWB9Y0RGBXTP9ECXKC5J#S6146

Excerpt from the log showing the failure:

    test_cluster

test result: FAILED. 0 passed; 1 failed; 0 ignored; 0 measured; 1 filtered out; finished in 12.87s


--- STDERR:              oximeter-db::integration_test test_cluster ---
log file: /var/tmp/omicron_tmp/integration_test-62a6f93586c06e24-test_cluster.205201.0.log
note: configured to log to "/var/tmp/omicron_tmp/integration_test-62a6f93586c06e24-test_cluster.205201.0.log"
thread 'test_cluster' panicked at oximeter/db/tests/integration_test.rs:309:10:
failed to insert samples at server3: Database("Query failed: Code: 244. DB::Exception: Insert failed due to zookeeper error. Please retry. Reason: Connection loss. (UNEXPECTED_ZOOKEEPER_ERROR) (version 23.8.7.24 (official build))\n")
stack backtrace:
   0: rust_begin_unwind
             at /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/std/src/panicking.rs:652:5
   1: core::panicking::panic_fmt
             at /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/core/src/panicking.rs:72:14
   2: core::result::unwrap_failed
             at /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/core/src/result.rs:1679:5
   3: core::result::Result<T,E>::expect
             at /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/core/src/result.rs:1059:23
   4: {async_block#0}
             at ./tests/integration_test.rs:306:5
   5: poll<&mut dyn core::future::future::Future<Output=core::result::Result<(), anyhow::Error>>>
             at /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/core/src/future/future.rs:123:9
   6: poll<&mut core::pin::Pin<&mut dyn core::future::future::Future<Output=core::result::Result<(), anyhow::Error>>>>
             at /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/core/src/future/future.rs:123:9
   7: {closure#0}<core::pin::Pin<&mut core::pin::Pin<&mut dyn core::future::future::Future<Output=core::result::Result<(), anyhow::Error>>>>>
             at /home/build/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.39.3/src/runtime/scheduler/current_thread/mod.rs:673:57
   8: with_budget<core::task::poll::Poll<core::result::Result<(), anyhow::Error>>, tokio::runtime::scheduler::current_thread::{impl#8}::block_on::{closure#0}::{closure#0}::{closure_env#0}<core::pin::Pin<&mut core::pin::Pin<&mut dyn core::future::future::Future<Output=core::result::Result<(), anyhow::Error>>>>>>
             at /home/build/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.39.3/src/runtime/coop.rs:107:5
   9: budget<core::task::poll::Poll<core::result::Result<(), anyhow::Error>>, tokio::runtime::scheduler::current_thread::{impl#8}::block_on::{closure#0}::{closure#0}::{closure_env#0}<core::pin::Pin<&mut core::pin::Pin<&mut dyn core::future::future::Future<Output=core::result::Result<(), anyhow::Error>>>>>>
             at /home/build/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.39.3/src/runtime/coop.rs:73:5
  10: {closure#0}<core::pin::Pin<&mut core::pin::Pin<&mut dyn core::future::future::Future<Output=core::result::Result<(), anyhow::Error>>>>>
             at /home/build/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.39.3/src/runtime/scheduler/current_thread/mod.rs:673:25
  11: tokio::runtime::scheduler::current_thread::Context::enter
             at /home/build/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.39.3/src/runtime/scheduler/current_thread/mod.rs:412:19
  12: {closure#0}<core::pin::Pin<&mut core::pin::Pin<&mut dyn core::future::future::Future<Output=core::result::Result<(), anyhow::Error>>>>>
             at /home/build/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.39.3/src/runtime/scheduler/current_thread/mod.rs:672:36
  13: tokio::runtime::scheduler::current_thread::CoreGuard::enter::{{closure}}
             at /home/build/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.39.3/src/runtime/scheduler/current_thread/mod.rs:751:68
  14: tokio::runtime::context::scoped::Scoped<T>::set
             at /home/build/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.39.3/src/runtime/context/scoped.rs:40:9
  15: tokio::runtime::context::set_scheduler::{{closure}}
             at /home/build/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.39.3/src/runtime/context.rs:180:26
  16: try_with<tokio::runtime::context::Context, tokio::runtime::context::set_scheduler::{closure_env#0}<(alloc::boxed::Box<tokio::runtime::scheduler::current_thread::Core, alloc::alloc::Global>, core::option::Option<core::result::Result<(), anyhow::Error>>), tokio::runtime::scheduler::current_thread::{impl#8}::enter::{closure_env#0}<tokio::runtime::scheduler::current_thread::{impl#8}::block_on::{closure_env#0}<core::pin::Pin<&mut core::pin::Pin<&mut dyn core::future::future::Future<Output=core::result::Result<(), anyhow::Error>>>>>, core::option::Option<core::result::Result<(), anyhow::Error>>>>, (alloc::boxed::Box<tokio::runtime::scheduler::current_thread::Core, alloc::alloc::Global>, core::option::Option<core::result::Result<(), anyhow::Error>>)>
             at /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/std/src/thread/local.rs:283:12
  17: std::thread::local::LocalKey<T>::with
             at /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/std/src/thread/local.rs:260:9
  18: tokio::runtime::context::set_scheduler
             at /home/build/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.39.3/src/runtime/context.rs:180:9
  19: tokio::runtime::scheduler::current_thread::CoreGuard::enter
             at /home/build/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.39.3/src/runtime/scheduler/current_thread/mod.rs:751:27
  20: tokio::runtime::scheduler::current_thread::CoreGuard::block_on
             at /home/build/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.39.3/src/runtime/scheduler/current_thread/mod.rs:660:19
  21: {closure#0}<core::pin::Pin<&mut dyn core::future::future::Future<Output=core::result::Result<(), anyhow::Error>>>>
             at /home/build/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.39.3/src/runtime/scheduler/current_thread/mod.rs:180:28
  22: tokio::runtime::context::runtime::enter_runtime
             at /home/build/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.39.3/src/runtime/context/runtime.rs:65:16
  23: block_on<core::pin::Pin<&mut dyn core::future::future::Future<Output=core::result::Result<(), anyhow::Error>>>>
             at /home/build/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.39.3/src/runtime/scheduler/current_thread/mod.rs:168:9
  24: tokio::runtime::runtime::Runtime::block_on_inner
             at /home/build/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.39.3/src/runtime/runtime.rs:361:47
  25: block_on<core::pin::Pin<&mut dyn core::future::future::Future<Output=core::result::Result<(), anyhow::Error>>>>
             at /home/build/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.39.3/src/runtime/runtime.rs:335:13
  26: test_cluster
             at ./tests/integration_test.rs:388:5
  27: integration_test::test_cluster::{{closure}}
             at ./tests/integration_test.rs:126:28
  28: core::ops::function::FnOnce::call_once
             at /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/core/src/ops/function.rs:250:5
  29: core::ops::function::FnOnce::call_once
             at /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/core/src/ops/function.rs:250:5
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.

At a glance, this looks like a transient Zookeeper error that should probably be retried? But perhaps @bnaecker knows more about what's going on here.

hawkw added the Test Flake label on Sep 3, 2024
bnaecker (Collaborator) commented Sep 3, 2024

@andrewjstone, mind taking a look at this one?

andrewjstone (Contributor) commented Sep 3, 2024

> @andrewjstone, mind taking a look at this one?

Yup. Will dig in.

andrewjstone (Contributor) commented Sep 3, 2024

After staring at this, I'm somewhat surprised by the failure. ClickHouse server replica 3 (the one client3 is inserting samples to) must be trying to talk to ClickHouse keeper 2 after we stopped it. It probably doesn't automatically retry connecting to the others. This very old comment on an issue basically says the same thing.

It could also be that keeper 2 was the current leader, and when we stopped it we lost coordination until there was a re-election. I do see a message related to this in the ClickHouse server 3 error log:

2024.09.02 19:50:12.079302 [ 207410 ] {} <Error> virtual bool DB::DDLWorker::initializeMainThread(): Code: 999. Coordination::Exception: Coordination error: Connection loss, path /clickhouse/task_queue/ddl/. (KEEPER_EXCEPTION), Stack trace (when copying this message, always include the lines below):

I'll continue digging.

I'm somewhat hesitant to put in blanket retries here. I'd like to make it so that the test instead waits for an appropriate keeper leader and ensures client3 can use it before it tries to insert samples. However, that may not be feasible given our current interfaces.
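For illustration, a minimal sketch of that wait-for-leader approach. The `keeper_has_leader` probe is hypothetical (not an existing oximeter-db interface; it stands in for whatever the test harness could query, e.g. keeper status output), and the timeout and poll interval are arbitrary:

```rust
use std::time::{Duration, Instant};

/// Hypothetical probe for the test harness: report whether the keeper
/// cluster currently has an elected leader. How this is actually queried
/// is exactly the open interface question from the comment above.
async fn keeper_has_leader() -> anyhow::Result<bool> {
    // ... query keeper status here ...
    Ok(true)
}

/// Poll until a keeper leader exists, failing if none appears in time.
async fn wait_for_keeper_leader(timeout: Duration) -> anyhow::Result<()> {
    let deadline = Instant::now() + timeout;
    while !keeper_has_leader().await? {
        anyhow::ensure!(
            Instant::now() < deadline,
            "no keeper leader elected within {timeout:?}"
        );
        tokio::time::sleep(Duration::from_millis(250)).await;
    }
    Ok(())
}
```

The test would call `wait_for_keeper_leader(...)` after each keeper-cluster change and before letting client3 insert samples.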

andrewjstone added a commit that referenced this issue Sep 3, 2024
In a [test run](#6508) we
saw that oximeter sample inserts failed because a keeper leader election
was occurring during an insert attempt. Wrap all inserts after keeper
cluster changes in a retry to make them resilient.

Fixes #6508
andrewjstone (Contributor) commented Sep 3, 2024

I did indeed note that a leader election occurred at this time and just decided to wrap the insert calls in a retry loop (sketched after the log excerpt below):

2024.09.02 19:50:12.018300 [ 205318 ] {} <Debug> RaftInstance: system is reconfigured to have 2 servers, last config index: 1, this config index: 643
2024.09.02 19:50:12.018304 [ 205318 ] {} <Information> RaftInstance: srv_to_leave_: 2
2024.09.02 19:50:12.018308 [ 205318 ] {} <Information> RaftInstance: clearing uncommitted config at log 643, prev 1
2024.09.02 19:50:12.018320 [ 205318 ] {} <Information> RaftInstance: new configuration: log idx 643, prev log idx 1
peer 1, DC ID 0, ::1:19101, voting member, 1
peer 3, DC ID 0, ::1:19103, voting member, 1
my id: 3, leader: 3, term: 1
2024.09.02 19:50:12.018326 [ 205318 ] {} <Trace> RaftInstance: (update) new target priority: 1
2024.09.02 19:50:12.018335 [ 205318 ] {} <Debug> RaftInstance: DONE: commit upto 643, current idx 643
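A rough sketch of such a retry wrapper, not the exact implementation in #6515: `insert_with_retry` and its attempt bound and delay are illustrative, and `insert` stands in for the test's sample-insert call against the oximeter client. Transient keeper errors like the "Connection loss" seen above are retried until the budget runs out:

```rust
use std::time::Duration;

/// Retry a fallible async insert a bounded number of times, sleeping
/// between attempts so a keeper leader election has time to complete.
async fn insert_with_retry<F, Fut>(mut insert: F) -> anyhow::Result<()>
where
    F: FnMut() -> Fut,
    Fut: std::future::Future<Output = anyhow::Result<()>>,
{
    const MAX_ATTEMPTS: u32 = 10;
    for attempt in 1..=MAX_ATTEMPTS {
        match insert().await {
            Ok(()) => return Ok(()),
            // Treat failures as transient until the last attempt, then
            // propagate the final error to fail the test.
            Err(e) if attempt < MAX_ATTEMPTS => {
                eprintln!("insert attempt {attempt} failed: {e}; retrying");
                tokio::time::sleep(Duration::from_millis(500)).await;
            }
            Err(e) => return Err(e),
        }
    }
    unreachable!("the loop always returns")
}
```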
