
Raw extent v2 #1270

Open · wants to merge 14 commits into main

Conversation

@mkeeter (Contributor) commented Apr 22, 2024

(Note: this PR is staged on top of #1268 ; only the last two commits are relevant)

The current raw extent implementation lays out on-disk data like this:

[ --- block --- | --- block --- | ... | ctx a0 | ctx a1 | ... | ctx b0 | ctx b1 | ... | metadata ]

In other words,

  • N Blocks are tightly packed at the front of the file
  • Then, there are two arrays of block contexts (one slot per block per array, so 2N slots in total)
  • Finally, there's a trailing metadata chunk

(The layout is also discussed in this block comment).
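For concreteness, here's a rough sketch of where a block and its context slot land under this layout. This is a sketch only: it uses the 4 KiB block and 48-byte context-slot sizes mentioned below, and the extent size and function names are illustrative, not Crucible's actual code.

```rust
// Sketch of V1 offsets; constants and names are illustrative, not Crucible's.
const BLOCK_SIZE: u64 = 4096; // 4 KiB blocks
const CTX_SLOT_SIZE: u64 = 48; // per-block context slot

/// Offset of block `i`'s data in a V1 extent file
fn v1_block_offset(i: u64) -> u64 {
    i * BLOCK_SIZE
}

/// Offset of block `i`'s context in slot array `a` (0 = A, 1 = B),
/// for an extent with `n` blocks
fn v1_context_offset(n: u64, i: u64, a: u64) -> u64 {
    n * BLOCK_SIZE + a * n * CTX_SLOT_SIZE + i * CTX_SLOT_SIZE
}

fn main() {
    // For an (illustrative) 8192-block extent, block 0's data and its
    // context slot are ~32 MiB apart, so one logical block IO becomes
    // two syscalls touching distant parts of the file.
    let n = 8192;
    println!(
        "data @ {}, ctx @ {}",
        v1_block_offset(0),
        v1_context_offset(n, 0, 0)
    );
}
```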

This layout prioritizes keeping blocks contiguous / aligned / tightly packed. In doing so, it makes sacrifices:

  • Context slots cannot be written atomically with their block data
    • This means that we have to have two context slot arrays, and ping-pong between them
  • It's possible for context slot arrays to get fragmented (some blocks using array A, and some using array B), which means reads have to be split between the two arrays.
    • We have extra logic to defragment if this effect becomes significant, which is subtle
  • Reading or writing a single block requires two syscalls
    • Those IO operations touch very different parts of the file!

The latter is the original sin of this layout, and is clearly visible in flamegraphs: reading the block data (4 KiB) and context slot (48 bytes) take basically the same amount of time, because (spoilers) each one has to read a full ZFS record (128 KiB).

[flamegraph: reading block data vs. reading the context slot]

(graph from https://github.com/oxidecomputer/stlouis/issues/541)


Having blocks be aligned to 512 / 4K boundaries makes sense in terms of mechanical sympathy with the drive, but we're not controlling the drive directly; we've got all of ZFS between our filesystem operations and bytes actually landing in an SSD.

I decided to experiment with an alternate layout that tightly packs blocks and context slots:

[ --- block --- | ctx 0 | ctx 1 | --- block --- | --- block --- | ctx 2 | ctx 3 | ... | metadata ]

(Naively, you could do [ block | ctx | block | ctx | ... ], but interleaving them uses fewer iovecs)

This format means that blocks and contexts can be read and written together (using preadv / pwritev), and live in the same ZFS records (so we don't have to load 2x records to get a single block's data).

Blocks and contexts are organized so that they always live in the same ZFS record (i.e. a 128 KiB chunk of the file), and are written in a single pwritev call. Based on discussion in #oxide-q&a, this means that block + context should always be written together (or not at all).
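As a sketch of the packing arithmetic this implies (the 128 KiB recordsize and 4 KiB block size come from the discussion above; the context size and constant names below are assumptions, not the PR's RawLayout code):

```rust
// Sketch: how many block + context pairs fit in one ZFS record, and how
// much padding keeps a pair from straddling a record boundary.
// CTX_SIZE is an assumption; the real per-block context may differ.
const BLOCK_SIZE: u64 = 4096;
const CTX_SIZE: u64 = 32;
const ZFS_RECORDSIZE: u64 = 128 * 1024;

fn pairs_per_record() -> u64 {
    ZFS_RECORDSIZE / (BLOCK_SIZE + CTX_SIZE)
}

fn padding_per_record() -> u64 {
    ZFS_RECORDSIZE - pairs_per_record() * (BLOCK_SIZE + CTX_SIZE)
}

fn main() {
    // With these numbers: 31 pairs per 128 KiB record, 3104 bytes of padding.
    println!(
        "{} pairs per record, {} bytes of padding",
        pairs_per_record(),
        padding_per_record()
    );
}
```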

The new format looks like a significant performance improvement. Doing random writes to a completely full 128 GiB disk, here are the numbers:

| Write size (blocks) | V1 | V2 |
|---:|---|---|
| 1 | 12.7 MiB/sec ± 328.8 KiB/sec | 17.7 MiB/sec ± 269.9 KiB/sec |
| 2 | 24.6 MiB/sec ± 764.4 KiB/sec | 34 MiB/sec ± 334.8 KiB/sec |
| 4 | 47.1 MiB/sec ± 1 MiB/sec | 65.4 MiB/sec ± 870.7 KiB/sec |
| 8 | 81 MiB/sec ± 658.1 KiB/sec | 112.1 MiB/sec ± 1.8 MiB/sec |
| 16 | 135.8 MiB/sec ± 2.7 MiB/sec | 192.5 MiB/sec ± 3.8 MiB/sec |
| 32 | 220.5 MiB/sec ± 5.1 MiB/sec | 291.3 MiB/sec ± 6.9 MiB/sec |
| 64 | 290.2 MiB/sec ± 8 MiB/sec | 368.8 MiB/sec ± 12.4 MiB/sec |
| 128 | 455.5 MiB/sec ± 5.2 MiB/sec | 549.9 MiB/sec ± 7.6 MiB/sec |
| 192 | 561.6 MiB/sec ± 6.8 MiB/sec | 651.8 MiB/sec ± 11 MiB/sec |
| 256 | 580.6 MiB/sec ± 19.2 MiB/sec | 746.2 MiB/sec ± 16.5 MiB/sec |
| 320 | 552.8 MiB/sec ± 17.4 MiB/sec | 679.9 MiB/sec ± 9.5 MiB/sec |
| 384 | 580.6 MiB/sec ± 36.2 MiB/sec | 680.1 MiB/sec ± 15.8 MiB/sec |
| 448 | 611.3 MiB/sec ± 18.6 MiB/sec | 738.4 MiB/sec ± 26 MiB/sec |
| 512 | 652 MiB/sec ± 47.6 MiB/sec | 708.8 MiB/sec ± 29.6 MiB/sec |
| 576 | 605.8 MiB/sec ± 28.7 MiB/sec | 712.3 MiB/sec ± 26.6 MiB/sec |
| 640 | 641.5 MiB/sec ± 13.4 MiB/sec | 720.8 MiB/sec ± 26 MiB/sec |
| 704 | 653.8 MiB/sec ± 8.2 MiB/sec | 747.6 MiB/sec ± 6.2 MiB/sec |
| 768 | 666.2 MiB/sec ± 13.7 MiB/sec | 756 MiB/sec ± 27.1 MiB/sec |
| 832 | 654.3 MiB/sec ± 7.6 MiB/sec | 700.9 MiB/sec ± 21.8 MiB/sec |
| 896 | 673.2 MiB/sec ± 14.1 MiB/sec | 746.7 MiB/sec ± 21.9 MiB/sec |
| 960 | 676.7 MiB/sec ± 11 MiB/sec | 776 MiB/sec ± 12.9 MiB/sec |
| 1024 | 707.7 MiB/sec ± 32.3 MiB/sec | 764.8 MiB/sec ± 24.9 MiB/sec |

In graph form:

[graph: V1 vs. V2 write throughput across write sizes]

Todo

  • Write a big comment about ZFS_RECORDSIZE shenanigans / crash consistency
  • Retest after a870b1e
  • Test read performance
  • Make integration tests generic across extent types and run them for both RawFile and RawFileV2?

layout: RawLayout,

/// Has this block been written?
block_written: Vec<bool>,


Excuse my random outsider comment:

suggestion: This seems to never change in size after creation, so it could maybe be a boxed slice?
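For what it's worth, a minimal sketch of that suggestion (the struct and constructor here are made up for illustration, not the PR's actual types):

```rust
// Hypothetical illustration of the Vec<bool> -> Box<[bool]> suggestion.
struct Example {
    /// Has this block been written? Fixed size after creation.
    block_written: Box<[bool]>,
}

impl Example {
    fn new(n_blocks: usize) -> Self {
        Self {
            // A boxed slice drops Vec's capacity field and documents that
            // the collection never grows or shrinks.
            block_written: vec![false; n_blocks].into_boxed_slice(),
        }
    }
}

fn main() {
    let e = Example::new(4);
    assert_eq!(e.block_written.len(), 4);
}
```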

@mkeeter (Contributor Author) commented Aug 2, 2024

This is now updated onto main and ready for review.

It's a little simpler than the previous description: I decided to remove the fancy interleaving of blocks and contexts, and went back to the straightforward

[block | ctx | block | ctx | ... ]

(this may change if I see compelling benchmarks, to come soon)
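For illustration, locating a block (and the context that immediately follows it) under this straightforward layout could look like the sketch below, assuming padding still keeps each block + context pair inside a single ZFS record; the context size and names are assumptions, not the actual RawLayout code.

```rust
// Sketch of per-block offsets for the [ block | ctx | ... ] layout, with
// per-record padding so a pair never straddles a 128 KiB record boundary.
const BLOCK_SIZE: u64 = 4096;
const CTX_SIZE: u64 = 32; // assumed
const RECORDSIZE: u64 = 128 * 1024;
const PAIR: u64 = BLOCK_SIZE + CTX_SIZE;
const PAIRS_PER_RECORD: u64 = RECORDSIZE / PAIR;

/// File offset of block `i`'s data; its context starts at this offset
/// plus BLOCK_SIZE, so one preadv/pwritev covers both.
fn block_offset(i: u64) -> u64 {
    let record = i / PAIRS_PER_RECORD;
    let within = i % PAIRS_PER_RECORD;
    record * RECORDSIZE + within * PAIR
}

fn main() {
    assert_eq!(block_offset(0), 0);
    assert_eq!(block_offset(30), 30 * PAIR);
    // Block 31 starts at the next 128 KiB record, past the padding.
    assert_eq!(block_offset(31), RECORDSIZE);
}
```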

@leftwo marked this pull request as ready for review August 5, 2024 17:33
};
use zerocopy::AsBytes;

pub(crate) const DEFAULT_ZFS_RECORDSIZE: u64 = 128 * 1024;
Contributor

In both cases in get_record_size where this is used, it doesn't seem like the path is on a ZFS filesystem, so maybe this should be NON_ZFS_RECORDSIZE or something?

Contributor Author

Good point, renamed to DUMMY_RECORDSIZE in f7c13d7

Comment on lines +55 to +58
/// ## Expected recordsize
/// After the block data, we store a single `u64` representing the expected
/// recordsize when the file was written. When the file is reopened, we detect
/// if its recordsize has changed, which would be surprising!
Contributor

TODO-jwm this may happen after migration

Contributor

To be clear, after region / region snapshot replacement.

Comment on lines +165 to +168
if self.layout.has_padding_after(block) {
    vs.push(IoSlice::new(&padding));
    expected_bytes += padding.len();
}
Contributor

TODO-jwm can pwritev partially fail?

Contributor

I added this as a comment for myself to verify, but my thinking here was:

  1. of course pwritev can partially fail; it can report fewer bytes written than were submitted (see the sketch below)
  2. are there any ZFS-specific considerations to think about?
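As a sketch of point 1 (illustrative only, not the PR's code), a short-write check against the queued byte count might look like:

```rust
// Sketch: detect a short pwritev by comparing against the number of bytes
// queued (the `expected_bytes` bookkeeping in the snippet above).
use std::io;
use std::os::unix::io::RawFd;

fn pwritev_all(
    fd: RawFd,
    iovecs: &[libc::iovec],
    offset: libc::off_t,
    expected_bytes: usize,
) -> io::Result<()> {
    // SAFETY: the iovecs must point at buffers that stay live and valid
    // for the duration of the call.
    let n = unsafe {
        libc::pwritev(fd, iovecs.as_ptr(), iovecs.len() as libc::c_int, offset)
    };
    if n < 0 {
        return Err(io::Error::last_os_error());
    }
    if n as usize != expected_bytes {
        // Partial write: the caller must retry the remainder or fail the IO.
        return Err(io::Error::new(
            io::ErrorKind::WriteZero,
            format!("short pwritev: {n} of {expected_bytes} bytes"),
        ));
    }
    Ok(())
}
```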

(job_id.0, self.extent_number.0, n_blocks as u64)
});

// Now execute each chunk in a separate `pwritev` call
Contributor

TODO-jwm this means the Crucible-level write can partially succeed

Contributor

To be clear! I think this was always true, but this was a TODO for myself to 1) verify that and 2) suggest we add it as documentation somewhere.

});
if self.layout.has_padding_after(block) {
    iovecs.push(libc::iovec {
        iov_base: padding.as_mut_ptr() as *mut _,
Contributor

What is preadv's behaviour when it's passed multiple iovecs with the same address?

Contributor Author

I believe they're processed in array order, so the last one wins. It doesn't really matter here, because we're not using the padding data for anything.

    iov_len: block_size,
});
iovecs.push(libc::iovec {
    iov_base: ctx as *mut _ as *mut _,
Contributor

what's going on here haha?

Contributor Author

lol it's casting from &mut [u8; 32] → *mut [u8; 32] → *mut c_void
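In standalone form (names here are purely illustrative), the chain is:

```rust
use std::ffi::c_void;

fn main() {
    let mut ctx = [0u8; 32];
    // &mut [u8; 32] -> *mut [u8; 32] -> *mut c_void, which is the type
    // that libc::iovec's iov_base field expects.
    let iov_base: *mut c_void = &mut ctx as *mut [u8; 32] as *mut c_void;
    assert!(!iov_base.is_null());
}
```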

Comment on lines 655 to 657
// If the `zfs` executable isn't present, then we're
// presumably on a non-ZFS filesystem and will use a default
// recordsize
Contributor

There's a remote possibility that PATH isn't set correctly but we are on a ZFS filesystem. I'm wondering if the downstairs should somehow expect that this binary is present if we're running on illumos (we already expect it for taking snapshots!)

Contributor Author

Good idea, changed in 5c446c7 to return an error if zfs isn't present on an illumos system.
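A rough sketch of that behavior (the exact zfs invocation, function name, and fallback constant below are illustrative assumptions, not the PR's implementation):

```rust
use std::io;
use std::path::Path;
use std::process::Command;

const DUMMY_RECORDSIZE: u64 = 128 * 1024;

/// Ask `zfs` for the recordsize covering `path` (arguments are illustrative).
/// If the `zfs` executable is missing, error out on illumos and fall back to
/// a dummy value elsewhere.
fn get_record_size(path: &Path) -> io::Result<u64> {
    match Command::new("zfs")
        .args(["get", "-Hpo", "value", "recordsize"])
        .arg(path)
        .output()
    {
        Ok(out) if out.status.success() => String::from_utf8_lossy(&out.stdout)
            .trim()
            .parse()
            .map_err(|e| io::Error::new(io::ErrorKind::InvalidData, format!("{e}"))),
        Ok(out) => Err(io::Error::new(
            io::ErrorKind::Other,
            format!("zfs get failed: {:?}", out.status),
        )),
        // On illumos we expect `zfs` to be present (it's already required
        // for snapshots), so its absence is an error rather than a fallback.
        Err(e) if cfg!(target_os = "illumos") => Err(e),
        // Elsewhere (e.g. dev machines on non-ZFS filesystems), use a
        // dummy recordsize.
        Err(_) => Ok(DUMMY_RECORDSIZE),
    }
}
```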

impl std::fmt::Debug for RawLayout {
    fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
        f.debug_struct("RawLayout")
            .field("extent_size", &self.extent_size)
Contributor

Why only this?

Contributor Author

This was copy-pasta from extent_inner_raw.rs, but I'm not sure why they're not just deriving Debug.

Fixed in 5b2a942 for both

@jmpesp (Contributor) commented Aug 7, 2024

Looks like I clicked "Comment" too early, so there's a few TODO-jwm there! :)

@leftwo (Contributor) left a comment

I think we might need some additional testing, or some updates to the testing done in integration_tests/src/lib.rs. That file by default will only create RAW and SQLite backend types, and we should go through it and figure out where we need to also do the same tests for RAW_V2 (or if all the RAW tests become RAW_V2).

We have snapshots with SQLite and we have snapshots with RAW, so we need to support the downstairs being able to "clone" either of those: integration_test_clone_raw()

Should we add a test (or do we have a test already) of downstairs migration from
SQLite to RAW_V2?

// Try to recompute the context slot from the file. If this
// fails, then we _really_ can't recover, so bail out
// unceremoniously.
self.recompute_block_written_from_file(block).unwrap();
Contributor

Is this step "putting back" block context in the event we failed a write?
I'm unclear what we are doing here exactly.

Contributor Author

This comment was out of date; fixed!

self.extent_number
)));
}
cdt::extent__read__file__done!(|| {
Contributor

for write() we still hit the dtrace __done probe even on error. We should do the
same for reads so we don't have differing behavior between the two IOs.

Contributor Author

Fixed (both here and in extent_inner_raw.rs, which had the same behavior)

@@ -3416,9 +3417,12 @@ enum WrappedStream {
/// tests, it can be useful to create volumes using older backends.
#[derive(Copy, Clone, Default, Debug, PartialEq)]
pub enum Backend {
    #[default]
    #[cfg(any(test, feature = "integration-tests"))]
Contributor

I wonder if there is some value in allowing crucible-downstairs to pick what backend it wants at creation time? I think there could be value in making "old" versions for various kinds of tests, and not hiding them behind a feature would allow that.

It's also not something to worry about for this PR, though. Why do I even bring it up then? Good question, for which I have no answer...

@mkeeter (Contributor Author) commented Aug 8, 2024

I think we might need some additional testing, or some updates to the testing done in integration_tests/src/lib.rs. That file by default will only create RAW and SQLite backend types, and we should go through it and figure out where we need to also do the same tests for RAW_V2 (or if all the RAW tests become RAW_V2).

We have snapshots with SQLite and we have snapshots with RAW, so we need to support the downstairs being able to "clone" either of those: integration_test_clone_raw()

I've gone through and made integration_tests generic across different backends (3615848).

Should we add a test (or do we have a test already) of downstairs migration from SQLite to RAW_V2?

Right now, the only migration is the automatic SQLite → RAW_V1, which happens if a read-write SQLite extent is opened. Since we don't have any read-write SQLite extents in the field (only read-only snapshots), I didn't bother implementing SQLite → RAW_V2.

@leftwo (Contributor) commented Aug 8, 2024

Should we add a test (or do we have a test already) of downstairs migration from SQLite to RAW_V2?

Right now, the only migration is the automatic SQLite → RAW_V1, which happens if a read-write SQLite extent is opened. Since we don't have any read-write SQLite extents in the field (only read-only snapshots), I didn't bother implementing SQLite → RAW_V2.

Yeah, that seems like the right call here. Thanks.
