Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

test failure: "Failed to find a type oid for ..." #1219

Open
davepacheco opened this issue Jun 16, 2022 · 10 comments
Open

test failure: "Failed to find a type oid for ..." #1219

davepacheco opened this issue Jun 16, 2022 · 10 comments
Labels
Test Flake Tests that work. Wait, no. Actually yes. Hang on. Something is broken.

Comments

@davepacheco
Copy link
Collaborator

While trying to repro #1130, I ran into a failure I hadn't seen before and can't find a ticket for:

running 1 test
test integration_tests::disks::test_disk_create_disk_that_already_exists_fails ... FAILED

failures:

---- integration_tests::disks::test_disk_create_disk_that_already_exists_fails stdout ----
log file: "/dangerzone/omicron_tmp/try_repro.13782/test_all-d586ea57740e3382-test_disk_create_disk_that_already_exists_fails.17780.0.log"
note: configured to log to "/dangerzone/omicron_tmp/try_repro.13782/test_all-d586ea57740e3382-test_disk_create_disk_that_already_exists_fails.17780.0.log"
thread 'integration_tests::disks::test_disk_create_disk_that_already_exists_fails' panicked at 'Failed to notify Nexus about new Dataset: Error Response', /home/dap/omicron/sled-agent/src/sim/storage.rs:240:14
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
WARN: dropped CockroachInstance without cleaning it up first (there may still be a child process running and a temporary directory leaked)
WARN: temporary directory leaked: /dangerzone/omicron_tmp/try_repro.13782/.tmpSGA0LA
WARN: dropped ClickHouseInstance without cleaning it up first (there may still be a child process running and a temporary directory leaked)


failures:
    integration_tests::disks::test_disk_create_disk_that_already_exists_fails

test result: FAILED. 0 passed; 1 failed; 0 ignored; 0 measured; 74 filtered out; finished in 5.93s

Log file:

$ bunyan "/dangerzone/omicron_tmp/try_repro.13782/test_all-d586ea57740e3382-test_disk_create_disk_that_already_exists_fails.17780.0.log"
[2022-06-15T22:22:04.900085563Z]  INFO: test_disk_create_disk_that_already_exists_fails/17780 on ivanova: cockroach temporary directory: /dangerzone/omicron_tmp/try_repro.13782/.tmpSGA0LA
[2022-06-15T22:22:04.900451226Z]  INFO: test_disk_create_disk_that_already_exists_fails/17780 on ivanova: cockroach: copying from seed directory (/home/dap/omicron/target/debug/build/nexus-test-utils-308521ed0d0eed98/out/crdb-base) to storage directory (/dangerzone/omicron_tmp/try_repro.13782/.tmpSGA0LA/data)
[2022-06-15T22:22:04.907769128Z]  INFO: test_disk_create_disk_that_already_exists_fails/17780 on ivanova: cockroach command line: cockroach start-single-node --insecure --http-addr=:0 --store=path=/dangerzone/omicron_tmp/try_repro.13782/.tmpSGA0LA/data,ballast-size=0 --listen-addr 127.0.0.1:0 --listening-url-file /dangerzone/omicron_tmp/try_repro.13782/.tmpSGA0LA/listen-url
[2022-06-15T22:22:06.627376916Z]  INFO: test_disk_create_disk_that_already_exists_fails/17780 on ivanova: cockroach pid: 17785
[2022-06-15T22:22:06.628304134Z]  INFO: test_disk_create_disk_that_already_exists_fails/17780 on ivanova: cockroach listen URL: postgresql://[email protected]:64766/omicron?sslmode=disable
...
[2022-06-15T22:22:10.628406846Z]  INFO: test_disk_create_disk_that_already_exists_fails/omicron_sled_agent::sim::Server/17780 on ivanova: Created Simulated Crucible Server (kind=storage, server=b6d65341-167c-41df-9b5c-41cded99c229, sled_id=b6d65341-167c-41df-9b5c-41cded99c229, address=[::1]:61800)
[2022-06-15T22:22:10.628731621Z] DEBUG: test_disk_create_disk_that_already_exists_fails/omicron_sled_agent::sim::Server/17780 on ivanova: client request (sled_id=b6d65341-167c-41df-9b5c-41cded99c229, body=Some(Body), method=PUT)
    uri: http://127.0.0.1:36197/zpools/2cfecc89-e67a-4b10-b50c-ec9c130187c2/dataset/7b3fdf75-02ea-43d8-a29e-5f6e8d590a8c
[2022-06-15T22:22:10.629066446Z] TRACE: 5eb2887c-0c0e-4a74-87fc-60a737cec391/dropshot_internal/17780 on ivanova: incoming request (req_id=65465297-194a-47b9-97a7-a49519f3fb06, method=PUT, remote_addr=127.0.0.1:52635, local_addr=127.0.0.1:36197)
    uri: /zpools/2cfecc89-e67a-4b10-b50c-ec9c130187c2/dataset/7b3fdf75-02ea-43d8-a29e-5f6e8d590a8c
[2022-06-15T22:22:10.629428382Z]  INFO: 5eb2887c-0c0e-4a74-87fc-60a737cec391/ServerContext/17780 on ivanova: upserting dataset (address=[::1]:61800, dataset_id=7b3fdf75-02ea-43d8-a29e-5f6e8d590a8c, zpool_id=2cfecc89-e67a-4b10-b50c-ec9c130187c2)
[2022-06-15T22:22:10.792935278Z]  INFO: 5eb2887c-0c0e-4a74-87fc-60a737cec391/dropshot_internal/17780 on ivanova: request completed (req_id=65465297-194a-47b9-97a7-a49519f3fb06, method=PUT, remote_addr=127.0.0.1:52635, local_addr=127.0.0.1:36197, error_message_external="Internal Server Error", response_code=500)
    uri: /zpools/2cfecc89-e67a-4b10-b50c-ec9c130187c2/dataset/7b3fdf75-02ea-43d8-a29e-5f6e8d590a8c
    --
    error_message_internal: Unknown diesel error creating Dataset called "7b3fdf75-02ea-43d8-a29e-5f6e8d590a8c": Failed to find a type oid for `dataset_kind`
[2022-06-15T22:22:10.793844962Z] DEBUG: test_disk_create_disk_that_already_exists_fails/omicron_sled_agent::sim::Server/17780 on ivanova: client response (sled_id=b6d65341-167c-41df-9b5c-41cded99c229)
    result: Ok(Response { url: Url { scheme: "http", cannot_be_a_base: false, username: "", password: None, host: Some(Ipv4(127.0.0.1)), port: Some(36197), path: "/zpools/2cfecc89-e67a-4b10-b50c-ec9c130187c2/dataset/7b3fdf75-02ea-43d8-a29e-5f6e8d590a8c", query: None, fragment: None }, status: 500, headers: {"content-type": "application/json", "x-request-id": "65465297-194a-47b9-97a7-a49519f3fb06", "content-length": "124", "date": "Wed, 15 Jun 2022 22:22:10 GMT"} })
@davepacheco
Copy link
Collaborator Author

The key error seems to be:

Unknown diesel error creating Dataset called "7b3fdf75-02ea-43d8-a29e-5f6e8d590a8c": Failed to find a type oid for `dataset_kind`

@smklein
Copy link
Collaborator

smklein commented Jun 16, 2022

This looks like a failure while setting up the sled agent server.

The string Failed to notify Nexus about new Dataset is unique, and comes from sim::Storage::insert_dataset:

/// Adds a Dataset to the sled's simulated storage and notifies Nexus.
pub async fn insert_dataset(&mut self, zpool_id: Uuid, dataset_id: Uuid) {
// Update our local data
let dataset = self
.zpools
.get_mut(&zpool_id)
.expect("Zpool does not exist")
.insert_dataset(&self.log, dataset_id, self.crucible_ip);
// Notify Nexus
let request = DatasetPutRequest {
address: dataset.address().to_string(),
kind: DatasetKind::Crucible,
};
self.nexus_client
.dataset_put(&zpool_id, &dataset_id, &request)
.await
.expect("Failed to notify Nexus about new Dataset");
}

This function is public, but it has only one callsite:

/// Adds a Crucible Dataset within a zpool.
pub async fn create_crucible_dataset(
&self,
zpool_id: Uuid,
dataset_id: Uuid,
) {
self.storage.lock().await.insert_dataset(zpool_id, dataset_id).await;
}

Which is almost certainly being invoked by:

impl DiskTest {
// Creates fake physical storage, an organization, and a project.
pub async fn new(cptestctx: &ControlPlaneTestContext) -> Self {
let sled_agent = cptestctx.sled_agent.sled_agent.clone();
// Create a Zpool.
let zpool_id = Uuid::new_v4();
let zpool_size = ByteCount::from_gibibytes_u32(10);
sled_agent.create_zpool(zpool_id, zpool_size.to_bytes()).await;
// Create multiple Datasets within that Zpool.
let dataset_count = 3;
let dataset_ids: Vec<_> =
(0..dataset_count).map(|_| Uuid::new_v4()).collect();
for id in &dataset_ids {
sled_agent.create_crucible_dataset(zpool_id, *id).await;
// By default, regions are created immediately.
let crucible = sled_agent.get_crucible_dataset(zpool_id, *id).await;
crucible
.set_create_callback(Box::new(|_| RegionState::Created))
.await;
}
Self { sled_agent, zpool_id, zpool_size, dataset_ids }
}
}

The caller side seems "correct" here - this code path is doing what it's supposed to do, AFAICT.

However, that error message implies to me that CRDB is either:

  • Blatantly wrong (corrupted, somehow), or
  • Hasn't finished initializing

@smklein
Copy link
Collaborator

smklein commented Jun 16, 2022

Here's the diesel error:

https://github.com/diesel-rs/diesel/blob/df7ca0ae54a20e4c0d59c13e00d1c07667d61723/diesel/src/pg/backend.rs#L27-L64

dataset_kind is indeed a custom type, initialized here:

CREATE TYPE omicron.public.dataset_kind AS ENUM (
'crucible',
'cockroach',
'clickhouse'
);

And managed in Rust as this type:

impl_enum_type!(
#[derive(SqlType, Debug, QueryId)]
#[diesel(postgres_type(name = "dataset_kind"))]
pub struct DatasetKindEnum;
#[derive(Clone, Debug, AsExpression, FromSqlRow, Serialize, Deserialize, PartialEq)]
#[diesel(sql_type = DatasetKindEnum)]
pub enum DatasetKind;
// Enum values
Crucible => b"crucible"
Cockroach => b"cockroach"
Clickhouse => b"clickhouse"
);

@smklein
Copy link
Collaborator

smklein commented Jun 16, 2022

Due to the "copy-from-seed-db" behavior, which should be initialized with the dataset_kind type already created, I'm not really able to hypothesize a scenario where the request to nexus races with initialization of CRDB (and thus, risks using dataset_kind before CREATE TYPE was invoked).

@davepacheco
Copy link
Collaborator Author

I ran into something similar again:

running 1 test
test integration_tests::disks::test_disk_create_disk_that_already_exists_fails ... FAILED

failures:

---- integration_tests::disks::test_disk_create_disk_that_already_exists_fails stdout ----
log file: "/dangerzone/omicron_tmp/try_repro.6246/test_all-d586ea57740e3382-test_disk_create_disk_that_already_exists_fails.7698.0.log"
note: configured to log to "/dangerzone/omicron_tmp/try_repro.6246/test_all-d586ea57740e3382-test_disk_create_disk_that_already_exists_fails.7698.0.log"
thread 'integration_tests::disks::test_disk_create_disk_that_already_exists_fails' panicked at 'Nexus never loaded users: Internal Error: unexpected database error: Failed to find a type oid for `identity_type`', /home/dap/omicron/nexus/test-utils/src/lib.rs:113:10
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
WARN: dropped ClickHouseInstance without cleaning it up first (there may still be a child process running and a temporary directory leaked)
WARN: dropped CockroachInstance without cleaning it up first (there may still be a child process running and a temporary directory leaked)
WARN: temporary directory leaked: /dangerzone/omicron_tmp/try_repro.6246/.tmplxWdo6


failures:
    integration_tests::disks::test_disk_create_disk_that_already_exists_fails

test result: FAILED. 0 passed; 1 failed; 0 ignored; 0 measured; 74 filtered out; finished in 3.25s
$ bunyan "/dangerzone/omicron_tmp/try_repro.6246/test_all-d586ea57740e3382-test_disk_create_disk_that_already_exists_fails.7698.0.log"
[2022-06-19T09:26:25.97193573Z]  INFO: test_disk_create_disk_that_already_exists_fails/7698 on ivanova: cockroach temporary directory: /dangerzone/omicron_tmp/try_repro.6246/.tmplxWdo6
...
[2022-06-19T09:26:28.857351398Z] DEBUG: 2bc180c4-f5bf-489e-a75c-02be4911b6d0/ServerContext/7698 on ivanova: attempting to create built-in role assignments
[2022-06-19T09:26:28.858183736Z] TRACE: 2bc180c4-f5bf-489e-a75c-02be4911b6d0/ServerContext/7698 on ivanova: authorize begin (resource=Database, action=Query)
    actor: Some(Actor::UserBuiltin { user_builtin_id: 001de000-05e4-4000-8000-000000000001, .. })
[2022-06-19T09:26:28.858950249Z] DEBUG: 2bc180c4-f5bf-489e-a75c-02be4911b6d0/ServerContext/7698 on ivanova: roles (roles="RoleSet { roles: {} }")
[2022-06-19T09:26:28.865577011Z] DEBUG: 2bc180c4-f5bf-489e-a75c-02be4911b6d0/ServerContext/7698 on ivanova: authorize result (result=Ok(()), resource=Database, action=Query)
    actor: Some(Actor::UserBuiltin { user_builtin_id: 001de000-05e4-4000-8000-000000000001, .. })
[2022-06-19T09:26:29.190035668Z]  WARN: 2bc180c4-f5bf-489e-a75c-02be4911b6d0/ServerContext/7698 on ivanova: failed to list sagas (will retry after 238.566087ms): Internal Error: Unknown diesel error accessing SagaDbg ById(2bc180c4-f5bf-489e-a75c-02be4911b6d0): Failed to find a type oid for `saga_state`
[2022-06-19T09:26:29.190623132Z] ERROR: 2bc180c4-f5bf-489e-a75c-02be4911b6d0/ServerContext/7698 on ivanova: gave up trying to populate built-in PopulateBuiltinRoleAssignments
    error_message: InternalError { internal_message: "unexpected database error: Failed to find a type oid for `identity_type`" }

@smklein
Copy link
Collaborator

smklein commented Jun 21, 2022

interesting data point, that the "not found" oid changed (dataset_kind vs saga_state)

@smklein
Copy link
Collaborator

smklein commented Jun 21, 2022

Both are created ENUM types, however

@smklein
Copy link
Collaborator

smklein commented Jun 21, 2022

And identity_type too, which is also an ENUM

@smklein smklein changed the title test failure: "Failed to notify Nexus about new Dataset: Error Response" test failure: "Failed to find a type oid for ..." Jun 21, 2022
@davepacheco
Copy link
Collaborator Author

Yet another one:

ATTEMPT 693

running 1 test
test integration_tests::disks::test_disk_create_disk_that_already_exists_fails ... FAILED

failures:

---- integration_tests::disks::test_disk_create_disk_that_already_exists_fails stdout ----
log file: "/dangerzone/omicron_tmp/try_repro.7754/test_all-d586ea57740e3382-test_disk_create_disk_that_already_exists_fails.27364.0.log"
note: configured to log to "/dangerzone/omicron_tmp/try_repro.7754/test_all-d586ea57740e3382-test_disk_create_disk_that_already_exists_fails.27364.0.log"
thread 'integration_tests::disks::test_disk_create_disk_that_already_exists_fails' panicked at 'failed to make "create" request: expected status code 201 Created, found 500 Internal Server Error', /home/dap/omicron/nexus/test-utils/src/resource_helpers.rs:59:10
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
WARN: dropped CockroachInstance without cleaning it up first (there may still be a child process running and a temporary directory leaked)
WARN: temporary directory leaked: /dangerzone/omicron_tmp/try_repro.7754/.tmpSFxLri
WARN: dropped ClickHouseInstance without cleaning it up first (there may still be a child process running and a temporary directory leaked)


failures:
    integration_tests::disks::test_disk_create_disk_that_already_exists_fails

test result: FAILED. 0 passed; 1 failed; 0 ignored; 0 measured; 74 filtered out; finished in 8.52s

$ bunyan "/dangerzone/omicron_tmp/try_repro.7754/test_all-d586ea57740e3382-test_disk_create_disk_that_already_exists_fails.27364.0.log"
[2022-06-21T18:48:22.968901968Z]  INFO: test_disk_create_disk_that_already_exists_fails/27364 on ivanova: cockroach temporary directory: /dangerzone/omicron_tmp/try_repro.7754/.tmpSFxLri
...
_external/27364 on ivanova: request completed (req_id=2ff0999f-343a-4b6d-a1c2-f1b3f9d3e6cf, uri=/organizations/test-org/projects, method=POST, remote_addr=127.0.0.1:55876, local_addr=127.0.0.1:63954, error_message_external="Internal Server Error", response_code=500)
    error_message_internal: Unknown diesel error creating VpcRouter called "system": Failed to find a type oid for `vpc_router_kind`
[2022-06-21T18:48:31.405038669Z]  INFO: test_disk_create_disk_that_already_exists_fails/external client test context/27364 on ivanova: client received response (status=500)

It seems to be more common lately. That's surprising because I haven't updated my omicron commit. I'm on 095dfb6, which is nearly 236f2ac (the tip of "main" on May 27). I have recently changed to a custom CockroachDB build, but that was only Friday (after this bug was filed) and all that's changed is using malloc(3c) for the memory allocator.

@davepacheco davepacheco added the Test Flake Tests that work. Wait, no. Actually yes. Hang on. Something is broken. label Jul 15, 2022
@davepacheco
Copy link
Collaborator Author

It's conceivable that this is caused by the same underlying illumos issue described in #1146. I'm not sure how we can know this unless we (1) see this again using a cockroach binary without that fix and grab a core file from CockroachDB, or (2) see this again at all once the bug is fixed.

leftwo added a commit that referenced this issue Mar 29, 2024
Propolis changes:
Add `IntrPin::import_state` and migrate LPC UART pin states (#669)
Attempt to set WCE for raw file backends
Fix clippy/lint nits for rust 1.77.0

Crucible changes:
Correctly (and robustly) count bytes (#1237)
test-replay.sh fix name of DTrace script (#1235)
BlockReq -> BlockOp (#1234)
Simplify `BlockReq` (#1218)
DTrace, cmon, cleanup, retry downstairs connections at 10 seconds.
(#1231)
Remove `MAX_ACTIVE_COUNT` flow control system (#1217)

Crucible changes that were in Omicron but not in Propolis before this commit.
Return *410 Gone* if volume is inactive (#1232)
Update Rust crate opentelemetry to 0.22.0 (#1224)
Update Rust crate base64 to 0.22.0 (#1222)
Update Rust crate async-recursion to 1.1.0 (#1221)
Minor cleanups to extent implementations (#1230)
Update Rust crate http to 0.2.12 (#1220)
Update Rust crate reedline to 0.30.0 (#1227)
Update Rust crate rayon to 1.9.0 (#1226)
Update Rust crate nix to 0.28 (#1223)
Update Rust crate async-trait to 0.1.78 (#1219)
Various buffer optimizations (#1211)
Add low-level test for message encoding (#1214)
Don't let df failures ruin the buildomat tests (#1213)
Activate the NBD server's psuedo file (#1209)

---------

Co-authored-by: Alan Hanson <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Test Flake Tests that work. Wait, no. Actually yes. Hang on. Something is broken.
Projects
None yet
Development

No branches or pull requests

2 participants