test failed in CI: oximeter-db::integration_test test_cluster
#6508
Comments
@andrewjstone, mind taking a look at this one?
Yup. Will dig in.
After staring at this, I'm somewhat surprised by the failure. ClickHouse server replica 3 (the one client3 is inserting samples to) must be trying to talk to ClickHouse keeper 2 after we stopped it. It probably doesn't automatically retry connecting to the other keepers. This very old comment on an issue basically says the same thing. It could also be that keeper 2 was the current leader, and when we stopped it we lost coordination until a new leader was elected. I do see a message related to this in the ClickHouse server 3 error log:
I'll continue digging. I'm somewhat hesitant to put in blanket retries here. I'd like to make it so that the test instead waits for an appropriate keeper leader and ensures
I did indeed note that a leader election occurred at this time and just decided to wrap the insert calls in a retry loop:
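For readers following along, here is a minimal sketch of that kind of retry loop. It is not the actual change made to the test: `retry_with_delay` and the simulated insert closure are hypothetical names, and the sketch is synchronous and uses only the standard library, whereas the real test code may be async and use a different backoff helper.

```rust
use std::thread::sleep;
use std::time::Duration;

/// Hypothetical helper: call `op` up to `max_attempts` times, sleeping
/// `delay` after each failed attempt. Returns the first success, or the
/// last error if every attempt fails.
fn retry_with_delay<T, E, F>(max_attempts: u32, delay: Duration, mut op: F) -> Result<T, E>
where
    F: FnMut() -> Result<T, E>,
{
    assert!(max_attempts > 0, "need at least one attempt");
    let mut attempt = 1;
    loop {
        match op() {
            Ok(value) => return Ok(value),
            // Out of attempts: surface the most recent error.
            Err(e) if attempt >= max_attempts => return Err(e),
            // Otherwise wait a bit (e.g. for a leader election) and retry.
            Err(_) => {
                sleep(delay);
                attempt += 1;
            }
        }
    }
}

fn main() {
    // Simulated insert (an assumption for illustration): fails twice, as if
    // the keeper cluster were mid-election, then succeeds.
    let mut calls = 0;
    let result: Result<u32, String> = retry_with_delay(5, Duration::from_millis(10), || {
        calls += 1;
        if calls < 3 {
            Err(format!("attempt {calls}: keeper unavailable"))
        } else {
            Ok(calls)
        }
    });
    println!("{result:?}");
}
```

Bounding the attempt count keeps a genuinely broken cluster from hanging the test, while still tolerating the short window in which no keeper leader exists.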
This test failed on a CI run on #6503:
https://github.com/oxidecomputer/omicron/pull/6503/checks?check_run_id=29578652229
Log showing the specific test failure:
https://buildomat.eng.oxide.computer/wg/0/details/01J6T1X30H12CA5RSZT555AG5G/3vaGlLPSR6mbE4NtzsMac1xBduXXkHgAv2swx5SbzlkUQQE6/01J6T1XWB9Y0RGBXTP9ECXKC5J#S6146
Excerpt from the log showing the failure:
At a glance, this looks like a transient Zookeeper error that probably should be retried? But, perhaps @bnaecker knows more about what's going on here.