Implement record based Crucible reference counting #6805
Conversation
Crucible volumes are created by layering read-write regions over a hierarchy of read-only resources. Originally only a region snapshot could be used as a read-only resource for a volume. With the introduction of read-only regions (created during the region snapshot replacement process) this is no longer true!

Read-only resources can be used by many volumes, and because of this they need to have a reference count so they can be deleted when they're not referenced anymore. The region_snapshot table uses a `volume_references` column, which counts how many uses there are. The region table does not have this column, and moreover a simple integer works for reference counting but does not tell you _what_ volume that use is from. This can be determined (see omdb's validate volume references command) but it's information that is tossed out, as Nexus knows what volumes use what resources! Instead, record what read-only resources a volume uses in a new table.

As part of the schema change to add the new `volume_resource_usage` table, a migration is included that will create the appropriate records for all region snapshots.

In testing, a few bugs were found: the worst being that read-only regions did not have their `read_only` column set to true. This would be a problem if read-only regions are created, but they're currently only created during region snapshot replacement. To detect if any of these regions were created, find all regions that were allocated for a snapshot volume:

```sql
SELECT id FROM region WHERE volume_id IN (SELECT volume_id FROM snapshot);
```

A similar bug was found in the simulated Crucible agent.

This commit also reverts oxidecomputer#6728, enabling region snapshot replacement again - it was disabled due to a lack of read-only region reference counting, so it can be enabled once again.
I may have some more questions after I finish going through `nexus/tests/integration_tests/volume_management.rs`, but I thought I should give you what I have so far.
```rust
/// this column, and more over a simple integer works for reference counting but
/// does not tell you _what_ volume that use is from. This can be determined
/// (see omdb's validate volume references command) but it's information that is
/// tossed out, as Nexus knows what volumes use what resources! Instead, record
```
The wording of the comment here: I'm having trouble tracking what the "Instead" is referencing. We are recording what read-only resources a volume uses instead of doing what?
```rust
VolumeResourceUsage::ReadOnlyRegion {
    region_id: record
        .region_id
        .expect("valid read-only region usage record"),
```
For this and the `.expect`s below, this message is printed when we panic, right? If so, should it be saying that we did not find a valid region usage record?
Conservatively, this should maybe be a `TryFrom` implementation. As it exists today, someone could modify a column in the database and cause Nexus to panic with these `.expect`s.
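The suggestion could look something like the following hedged sketch. The record and enum shapes here are simplified stand-ins, not the actual Nexus definitions; the point is that a malformed row becomes a recoverable `Err` instead of a panic.

```rust
// Simplified stand-ins for the Nexus types (assumed shapes, not the real
// definitions from nexus-db-model).
#[derive(Debug)]
struct VolumeResourceUsageRecord {
    usage_type: String,
    region_id: Option<u128>,
}

#[derive(Debug, PartialEq)]
enum VolumeResourceUsage {
    ReadOnlyRegion { region_id: u128 },
}

impl TryFrom<VolumeResourceUsageRecord> for VolumeResourceUsage {
    type Error = String;

    fn try_from(record: VolumeResourceUsageRecord) -> Result<Self, Self::Error> {
        match record.usage_type.as_str() {
            "read_only_region" => {
                // A NULL region_id in a read-only region row is now a
                // recoverable error instead of a Nexus-wide panic.
                let region_id = record.region_id.ok_or_else(|| {
                    String::from("read-only region usage record has NULL region_id")
                })?;
                Ok(VolumeResourceUsage::ReadOnlyRegion { region_id })
            }
            other => Err(format!("unrecognized usage type: {other}")),
        }
    }
}

fn main() {
    let bad = VolumeResourceUsageRecord {
        usage_type: String::from("read_only_region"),
        region_id: None,
    };
    assert!(VolumeResourceUsage::try_from(bad).is_err());

    let good = VolumeResourceUsageRecord {
        usage_type: String::from("read_only_region"),
        region_id: Some(7),
    };
    assert_eq!(
        VolumeResourceUsage::try_from(good).unwrap(),
        VolumeResourceUsage::ReadOnlyRegion { region_id: 7 }
    );
}
```

The caller can then surface the `Err` as a 500 plus a log line rather than taking down the process.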
```rust
conn: &async_bb8_diesel::Connection<DbConnection>,
err: OptionalError<VolumeCreationError>,
volume: Volume,
crucible_targets: CrucibleTargets,
```
What are the contents of `CrucibleTargets`? I know it's a `String`, but what is it really?

Ah, okay, it looks like it's just a `Vec` of the `String` targets for only the read-only parents. It's not the best name for that structure, but I see it existed before this PR so I can't complain too much about it here.
```rust
}));
}

// If the resource was hard-deleted, or in the off chance that the
```
Is this choice expected often? Do we want to log a message if we don't expect to see it, but do?
```rust
.optional()
}

async fn read_only_target_to_volume_resource_usage(
```
This is a 1-1 situation, where there should only ever be one `VolumeResourceUsage` for a given read-only target?
```rust
for read_only_target in crucible_targets.read_only_targets {
    let sub_err = OptionalError::new();

    let maybe_usage = Self::read_only_target_to_volume_resource_usage(
```
Nit: This line and a few others here seem to have escaped `cargo fmt`, as they are super long.
```diff
@@ -2409,24 +2956,188 @@ impl DataStore {
     read_only_parent: None,
 };

 let volume_data = serde_json::to_string(&vcr)
```
Man, there is a lot going on inside this transaction :)
```rust
    )
})
})?
// XXX be smart enough to .filter the above query
```
make this a TODO, you can get smart enough!
```rust
while let Some(vcr_part) = parts.pop_front() {
    match vcr_part {
        VolumeConstructionRequest::Volume { sub_volumes, .. } => {
            for sub_volume in sub_volumes {
```
Nice, this should work for multiple sub-volumes.
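For illustration, the traversal pattern quoted above can be sketched with a simplified stand-in enum (the real `VolumeConstructionRequest` has more variants and fields, so this is an assumption-laden sketch, not the Nexus code): a `VecDeque` walk pushes every child of each popped `Volume`, so all sub-volumes and read-only parents are visited, not just the first.

```rust
use std::collections::VecDeque;

// Simplified stand-in for the VolumeConstructionRequest tree (hypothetical
// shape for illustration only).
enum Vcr {
    Volume { sub_volumes: Vec<Vcr>, read_only_parent: Option<Box<Vcr>> },
    Region { target: String },
}

// Collect every region target in the tree, breadth-first.
fn collect_targets(root: Vcr) -> Vec<String> {
    let mut targets = Vec::new();
    let mut parts = VecDeque::new();
    parts.push_back(root);

    while let Some(vcr_part) = parts.pop_front() {
        match vcr_part {
            Vcr::Volume { sub_volumes, read_only_parent } => {
                // Every sub-volume is queued, so multi-sub-volume VCRs
                // are fully covered.
                for sub_volume in sub_volumes {
                    parts.push_back(sub_volume);
                }
                if let Some(parent) = read_only_parent {
                    parts.push_back(*parent);
                }
            }
            Vcr::Region { target } => targets.push(target),
        }
    }

    targets
}

fn main() {
    let vcr = Vcr::Volume {
        sub_volumes: vec![
            Vcr::Region { target: String::from("[fd00::1]:1001") },
            Vcr::Region { target: String::from("[fd00::2]:1002") },
        ],
        read_only_parent: Some(Box::new(Vcr::Region {
            target: String::from("[fd00::3]:1003"),
        })),
    };
    // Both sub-volumes plus the read-only parent are found.
    assert_eq!(collect_targets(vcr).len(), 3);
}
```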
```rust
.transaction_retry_wrapper("no_volume_resource_usage_records_exist")
.transaction(&conn, |conn| async move {
    conn.batch_execute_async(
        nexus_test_utils::db::ALLOW_FULL_TABLE_SCAN_SQL,
```
Don't let dap see this.
```sql
    'region_snapshot'
);

CREATE TABLE IF NOT EXISTS omicron.public.volume_resource_usage (
```
I know it exists elsewhere in this PR, but I think this table would benefit from some text explaining what it is.
```rust
pub region_id: Option<Uuid>,

pub region_snapshot_dataset_id: Option<Uuid>,
pub region_snapshot_region_id: Option<Uuid>,
pub region_snapshot_snapshot_id: Option<Uuid>,
```
I had no idea reading the DB schema that these were two groups of columns, and "either one or the other is non-null". If that's the intent, and it seems to be, based on `VolumeResourceUsage` being an enum, maybe we could add a `CHECK` on the table validating this?

Something like:

```sql
CONSTRAINT exactly_one_usage_source CHECK (
  (
    (usage_type = 'readonlyregion') AND
    (region_id IS NOT NULL) AND
    (region_snapshot_dataset_id IS NULL AND region_snapshot_region_id IS NULL AND region_snapshot_snapshot_id IS NULL)
  ) OR
  (
    (usage_type = 'regionsnapshot') AND
    (region_id IS NULL) AND
    (region_snapshot_dataset_id IS NOT NULL AND region_snapshot_region_id IS NOT NULL AND region_snapshot_snapshot_id IS NOT NULL)
  )
)
```
```rust
VolumeResourceUsage::ReadOnlyRegion {
    region_id: record
        .region_id
        .expect("valid read-only region usage record"),
```
Conservatively, this should maybe be a `TryFrom` implementation. As it exists today, someone could modify a column in the database and cause Nexus to panic with these `.expect`s.
```diff
@@ -264,7 +264,7 @@ impl DataStore {
     block_size,
     blocks_per_extent,
     extent_count,
-    read_only: false,
+    read_only: maybe_snapshot_id.is_some(),
```
Was this a bug before this PR?
```rust
enum VolumeCreationError {
    #[error("Error from Volume creation: {0}")]
    Public(Error),
```

```rust
let maybe_volume: Option<Volume> = dsl::volume
```
`Volume` has a `time_deleted` column -- do we not care about that here?
```rust
// This function may be called with a replacement volume
// that is completely blank, to be filled in later by this
// function. `volume_create` will have been called but will
// not have added any volume resource usage records, because
// it was blank!
```
I'm confused about this comment, in this location - namely, I'm not sure who is calling `volume_create` in this situation where we're saying "`volume_create` will have been called". Is this something we're doing within the body of this function, and I'm not seeing it? Or is this something someone else could concurrently be doing?
```rust
// not have added any volume resource usage records, because
// it was blank!
//
// The indention leaving this transaction is that the
```
```diff
-// The indention leaving this transaction is that the
+// The intention leaving this transaction is that the
```
```rust
// We don't support a pure Region VCR at the volume
// level in the database, so this choice should
// never be encountered.
panic!("Region not supported as a top level volume");
```
This seems like a footgun to me, to have a `pub fn` with an undocumented panic?
```sql
OVERLAY(
    OVERLAY(
        MD5(volume.id::TEXT || dataset_id::TEXT || region_id::TEXT || snapshot_id::TEXT || snapshot_addr || volume_references::TEXT)
        PLACING '4' from 13
```
What is this doing? Why `'4'`? Why `from 13`?
```sql
        MD5(volume.id::TEXT || dataset_id::TEXT || region_id::TEXT || snapshot_id::TEXT || snapshot_addr || volume_references::TEXT)
        PLACING '4' from 13
    )
    PLACING TO_HEX(volume_references) from 17
```
Why `from 17`?
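One plausible reading of the migration SQL (an assumption, not confirmed in the PR): the 32 hex characters of the MD5 digest are being shaped into a deterministic UUID-like id. With the dashes removed from a canonical UUID string, character 13 (1-based) is the version nibble, so `PLACING '4' FROM 13` stamps a "version 4" marker, and character 17 starts the next group, where `TO_HEX(volume_references)` keeps rows that differ only by reference count from colliding. A small Rust sketch of the overlay mechanics:

```rust
// Mimics SQL's 1-based OVERLAY(s PLACING replacement FROM from).
fn overlay(s: &str, replacement: &str, from: usize) -> String {
    let mut chars: Vec<char> = s.chars().collect();
    for (i, c) in replacement.chars().enumerate() {
        chars[from - 1 + i] = c;
    }
    chars.into_iter().collect()
}

fn main() {
    // Stand-in digest; the real query hashes the volume/dataset/region/
    // snapshot ids and address together.
    let md5_hex = "00112233445566778899aabbccddeeff";
    let volume_references: u32 = 3;

    let shaped = overlay(
        &overlay(md5_hex, "4", 13),
        &format!("{:x}", volume_references),
        17,
    );

    // Character 13 now reads '4' (the UUID version nibble position),
    // and character 17 carries the reference count.
    assert_eq!(&shaped[12..13], "4");
    assert_eq!(&shaped[16..17], "3");
}
```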