
Implement record based Crucible reference counting #6805

Open · wants to merge 3 commits into base: main

Conversation

@jmpesp (Contributor) commented Oct 9, 2024

Crucible volumes are created by layering read-write regions over a hierarchy of read-only resources. Originally only a region snapshot could be used as a read-only resource for a volume. With the introduction of read-only regions (created during the region snapshot replacement process) this is no longer true!

Read-only resources can be used by many volumes, and because of this they need a reference count so they can be deleted when they're no longer referenced. The region_snapshot table uses a volume_references column, which counts how many uses there are. The region table does not have this column, and moreover a simple integer works for reference counting but does not tell you what volume each use is from. This can be determined (see omdb's validate volume references command), but that information is tossed out, as Nexus knows what volumes use what resources! Instead, record what read-only resources a volume uses in a new table.
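The table-based bookkeeping described above can be sketched in Rust. This is an illustrative model only, not the actual Nexus types: the variant and field names follow snippets quoted later in this review, and the `Uuid = String` alias is a stand-in to keep the sketch dependency-free.

```rust
// Illustrative sketch of per-volume usage records for read-only resources.
// `Uuid` is a plain String stand-in, not the real uuid::Uuid type.
type Uuid = String;

#[derive(Debug, Clone, PartialEq)]
enum VolumeResourceUsage {
    // A read-only region is identified by its region id alone.
    ReadOnlyRegion { region_id: Uuid },
    // A region snapshot is identified by the (dataset, region, snapshot)
    // triple that keys the region_snapshot table.
    RegionSnapshot { dataset_id: Uuid, region_id: Uuid, snapshot_id: Uuid },
}

// The reference count of a read-only resource is just the number of usage
// records that point at it; the resource is deletable when this hits zero.
// Unlike a bare integer column, the records also say which volume each
// use came from.
fn reference_count<'a>(
    records: impl Iterator<Item = &'a VolumeResourceUsage>,
    target: &VolumeResourceUsage,
) -> usize {
    records.filter(|r| *r == target).count()
}
```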

As part of the schema change to add the new volume_resource_usage table, a migration is included that will create the appropriate records for all region snapshots.

In testing, a few bugs were found: the worst being that read-only regions did not have their read_only column set to true. This would be a problem if read-only regions are created, but they're currently only created during region snapshot replacement. To detect if any of these regions were created, find all regions that were allocated for a snapshot volume:

SELECT id FROM region
WHERE volume_id IN (SELECT volume_id FROM snapshot);

A similar bug was found in the simulated Crucible agent.

This commit also reverts #6728, enabling region snapshot replacement again - it was disabled due to a lack of read-only region reference counting, so it can be enabled once again.

@jmpesp jmpesp requested review from smklein and leftwo October 9, 2024 00:29
@leftwo (Contributor) left a comment

I may have some more questions after I finish going through nexus/tests/integration_tests/volume_management.rs but I thought I should give you what I have so far.

/// this column, and moreover a simple integer works for reference counting but
/// does not tell you _what_ volume that use is from. This can be determined
/// (see omdb's validate volume references command) but it's information that is
/// tossed out, as Nexus knows what volumes use what resources! Instead, record
Contributor

The wording of the comment here, I'm having trouble tracking what the "Instead" is referencing.

We are recording what read-only resources a volume uses instead of doing what?

VolumeResourceUsage::ReadOnlyRegion {
region_id: record
.region_id
.expect("valid read-only region usage record"),
Contributor

For this and the .expects below, this message is printed when we panic, right? If so, should it be saying that we did not find a valid region usage record?

Collaborator

Conservatively, this should maybe be a TryFrom implementation. As it exists today, someone could modify a column in the database and cause Nexus to panic with these .expects
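A sketch of the TryFrom approach suggested here, replacing the .expect calls with fallible conversion. The struct and enum shapes follow snippets quoted elsewhere in this review; the usage_type string values are assumptions for illustration, not the actual database representation.

```rust
// Hypothetical sketch: convert a raw database row into the usage enum,
// returning an error instead of panicking when a column that should be
// non-NULL is NULL. `Uuid` is a String stand-in.
type Uuid = String;

struct VolumeResourceUsageRecord {
    usage_type: String,
    region_id: Option<Uuid>,
    region_snapshot_dataset_id: Option<Uuid>,
    region_snapshot_region_id: Option<Uuid>,
    region_snapshot_snapshot_id: Option<Uuid>,
}

#[derive(Debug)]
enum VolumeResourceUsage {
    ReadOnlyRegion { region_id: Uuid },
    RegionSnapshot { dataset_id: Uuid, region_id: Uuid, snapshot_id: Uuid },
}

impl TryFrom<VolumeResourceUsageRecord> for VolumeResourceUsage {
    type Error = &'static str;

    fn try_from(record: VolumeResourceUsageRecord) -> Result<Self, Self::Error> {
        // usage_type discriminants here are illustrative guesses.
        match record.usage_type.as_str() {
            "read_only_region" => Ok(VolumeResourceUsage::ReadOnlyRegion {
                // A NULL now surfaces as an error the caller can handle
                // instead of a Nexus panic.
                region_id: record
                    .region_id
                    .ok_or("read-only region usage record has NULL region_id")?,
            }),
            "region_snapshot" => Ok(VolumeResourceUsage::RegionSnapshot {
                dataset_id: record
                    .region_snapshot_dataset_id
                    .ok_or("region snapshot usage record has NULL dataset_id")?,
                region_id: record
                    .region_snapshot_region_id
                    .ok_or("region snapshot usage record has NULL region_id")?,
                snapshot_id: record
                    .region_snapshot_snapshot_id
                    .ok_or("region snapshot usage record has NULL snapshot_id")?,
            }),
            _ => Err("unrecognized usage_type"),
        }
    }
}
```

With something like this in place, a hand-edited database row produces an error response at the conversion site rather than a panic.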

conn: &async_bb8_diesel::Connection<DbConnection>,
err: OptionalError<VolumeCreationError>,
volume: Volume,
crucible_targets: CrucibleTargets,
Contributor

What are the contents of CrucibleTargets? I know it's a String, but what is it really?
Ah, okay, it looks like it's just a Vec of the String of the targets for only the read only parents. It's not the best name for that structure, but I see it existed before this PR so I can't complain too much about it here.

}));
}

// If the resource was hard-deleted, or in the off chance that the
Contributor

Is this choice expected often? Do we want to log a message if we don't expect to see it, but do?

.optional()
}

async fn read_only_target_to_volume_resource_usage(
Contributor

This is a 1-1 situation, where there should only ever be one VolumeResourceUsage for a given read only target?

for read_only_target in crucible_targets.read_only_targets {
let sub_err = OptionalError::new();

let maybe_usage = Self::read_only_target_to_volume_resource_usage(
Contributor

Nit: This line and a few others here seem to have escaped cargo fmt as they are super long.

@@ -2409,24 +2956,188 @@ impl DataStore {
read_only_parent: None,
};

let volume_data = serde_json::to_string(&vcr)
Contributor

Man, there is a lot going on inside this transaction :)

)
})
})?
// XXX be smart enough to .filter the above query
Contributor

make this a TODO, you can get smart enough!

while let Some(vcr_part) = parts.pop_front() {
match vcr_part {
VolumeConstructionRequest::Volume { sub_volumes, .. } => {
for sub_volume in sub_volumes {
Contributor

Nice, this should work for multiple sub-volumes.

.transaction_retry_wrapper("no_volume_resource_usage_records_exist")
.transaction(&conn, |conn| async move {
conn.batch_execute_async(
nexus_test_utils::db::ALLOW_FULL_TABLE_SCAN_SQL,
Contributor

Don't let dap see this.

'region_snapshot'
);

CREATE TABLE IF NOT EXISTS omicron.public.volume_resource_usage (
Collaborator

I know it exists elsewhere in this PR, but I think this table would benefit from some text explaining what it is.

Comment on lines +54 to +58
pub region_id: Option<Uuid>,

pub region_snapshot_dataset_id: Option<Uuid>,
pub region_snapshot_region_id: Option<Uuid>,
pub region_snapshot_snapshot_id: Option<Uuid>,
Collaborator

I had no idea reading the DB schema that these were two groups of columns, and "either one or the other is non-null".

If that's the intent -- and it seems to be, based on VolumeResourceUsage as an enum -- maybe we could add a CHECK on the table validating this?

Something like:

CONSTRAINT exactly_one_usage_source CHECK (
  (
    (usage_type = 'readonlyregion') AND
    (region_id IS NOT NULL) AND
    (region_snapshot_dataset_id IS NULL AND region_snapshot_region_id IS NULL AND region_snapshot_snapshot_id IS NULL)
  ) OR
  (
    (usage_type = 'regionsnapshot') AND
    (region_id IS NULL) AND
    (region_snapshot_dataset_id IS NOT NULL AND region_snapshot_region_id IS NOT NULL AND region_snapshot_snapshot_id IS NOT NULL)
  )
)


@@ -264,7 +264,7 @@ impl DataStore {
block_size,
blocks_per_extent,
extent_count,
read_only: false,
read_only: maybe_snapshot_id.is_some(),
Collaborator

Was this a bug before this PR?

enum VolumeCreationError {
#[error("Error from Volume creation: {0}")]
Public(Error),
let maybe_volume: Option<Volume> = dsl::volume
Collaborator

Volume has a time_deleted column -- do we not care about that here?

Comment on lines +3060 to +3064
// This function may be called with a replacement volume
// that is completely blank, to be filled in later by this
// function. `volume_create` will have been called but will
// not have added any volume resource usage records, because
// it was blank!
Collaborator

I'm confused about this comment, in this location - namely, I'm not sure who is calling volume_create in this situation where we're saying "volume_create will have been called". Is this something we're doing within the body of this function, and I'm not seeing it? Or is this something someone else could concurrently be doing?

// not have added any volume resource usage records, because
// it was blank!
//
// The indention leaving this transaction is that the
Collaborator

Suggested change
// The indention leaving this transaction is that the
// The intention leaving this transaction is that the

// We don't support a pure Region VCR at the volume
// level in the database, so this choice should
// never be encountered.
panic!("Region not supported as a top level volume");
Collaborator

This seems like a footgun to me, to have a pub fn with an undocumented panic?

OVERLAY(
OVERLAY(
MD5(volume.id::TEXT || dataset_id::TEXT || region_id::TEXT || snapshot_id::TEXT || snapshot_addr || volume_references::TEXT)
PLACING '4' from 13
Collaborator

What is this doing? Why '4'? Why from 13?

MD5(volume.id::TEXT || dataset_id::TEXT || region_id::TEXT || snapshot_id::TEXT || snapshot_addr || volume_references::TEXT)
PLACING '4' from 13
)
PLACING TO_HEX(volume_references) from 17
Collaborator

Why from 17?
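One plausible reading of these two OVERLAYs (an editor's guess; the author's answer is not in this thread): an MD5 digest is 32 hex characters, exactly the length of a UUID with the dashes removed. In that form, the 13th hex digit is the UUID version field and the 17th is the variant nibble, so PLACING '4' FROM 13 stamps the hash as a version-4-shaped UUID, and overlaying TO_HEX(volume_references) FROM 17 folds the reference counter in so the migration generates ids that are deterministic yet distinct per reference. A sketch of that arithmetic:

```rust
// Editor's guess at the OVERLAY expressions above, not confirmed by the
// author: derive a deterministic, UUID-shaped id from an MD5 hex digest
// plus a reference counter.
fn stamp_uuid_like(md5_hex: &str, volume_references: u64) -> String {
    let mut chars: Vec<char> = md5_hex.chars().collect();
    assert_eq!(chars.len(), 32, "MD5 digests are 32 hex characters");

    // SQL OVERLAY positions are 1-indexed: position 13 is index 12.
    // Hex digit 13 of a dashless UUID is its version field.
    chars[12] = '4';

    // Overlay the hex form of the reference count starting at position 17,
    // making each reference's generated id distinct.
    for (i, c) in format!("{:x}", volume_references).chars().enumerate() {
        chars[16 + i] = c;
    }

    chars.into_iter().collect()
}
```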

3 participants