Eliminate some scans + store jobs in BTree #738

Merged
merged 28 commits into from May 30, 2023

Conversation

Contributor

@bcantrill bcantrill commented May 18, 2023

In addition to fixing #731, this also fixes #757 -- and pulls in #733.

bcantrill and others added 10 commits May 13, 2023 23:26
The `block_contexts` table has a primary key of all the columns, but for
the unencrypted case only three have values, and this doesn't seem to
reject duplicate unencrypted contexts.

Add a separate UNIQUE INDEX for unencrypted block context rows (namely,
no nonce and tag), and change the INSERT to INSERT OR IGNORE to elide
duplicate rows.
Submit different block data as part of the IO loop. Add an option to
submit all zeroes instead if that workload is still desired.
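
A rough sketch of what the first commit above describes (a partial UNIQUE index plus INSERT OR IGNORE), assuming the rusqlite crate and hypothetical column names; the real block_contexts table has more columns for the encrypted case:

use rusqlite::{params, Connection, Result};

fn insert_unencrypted_context(
    conn: &Connection,
    block: i64,
    hash: i64,
) -> Result<()> {
    // A partial UNIQUE index covering only rows with no nonce or tag
    // (i.e. the unencrypted case) lets SQLite reject duplicates there.
    conn.execute_batch(
        "CREATE UNIQUE INDEX IF NOT EXISTS block_contexts_unencrypted
             ON block_contexts (block, hash)
             WHERE nonce IS NULL AND tag IS NULL;",
    )?;
    // INSERT OR IGNORE silently elides rows that would violate the index.
    conn.execute(
        "INSERT OR IGNORE INTO block_contexts (block, hash)
             VALUES (?1, ?2)",
        params![block, hash],
    )?;
    Ok(())
}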
Contributor

leftwo commented May 18, 2023

Love the improvements, a marked difference.

I did hit a new panic running the tools/test_reconnect.sh test, but I'm not sure yet whether it is the fault of these changes, though they are my first suspect.

The panic happens when the upstairs has decided that a downstairs has been "gone too long" and has marked that downstairs as faulted. This should result in all new IO for that downstairs, and any current IO that has not completed, being marked as "skipped". Without the LiveRepair changes in place yet, the downstairs will just stay faulted. And indeed, on stock bits this is what happens and we don't see a panic. Even when that downstairs "comes back", it remains in purgatory (OnlineRepairReady) and is not allowed to re-join the region set.

However, when this same situation happens on the perf bits, I'm hitting this panic:

Dec 28 17:25:28.389 WARN Full(5398) Failed to notify client 2 of work 5398
05398/15000 Write block 4853  len 24576  data:   4   3   2   2   3   3
Dec 28 17:25:28.389 WARN Full(5399) Failed to notify client 2 of work 5399
05399/15000 Write block 3288  len  8192  data:   3   3
Dec 28 17:25:28.390 WARN Full(5400) Failed to notify client 2 of work 5400
Dec 28 17:25:28.390 INFO [2] a1bcf246-0610-4b3a-bd97-22b1d6a593e1 (f84b9698-7b00-4e87-97e5-bbd78792b8d4) Active Active Faulted ds_transition to OnlineRepairReady
Dec 28 17:25:28.390 INFO [2] Transition from Faulted to OnlineRepairReady
Dec 28 17:25:28.390 WARN [2] new RM replaced this: Some(RegionMetadata { generation: [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], flush_numbers: [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], dirty: [false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false] })
Dec 28 17:25:28.390 INFO [2] a1bcf246-0610-4b3a-bd97-22b1d6a593e1 Enter Ready for Online Repair mode
Dec 28 17:25:28.390 INFO [2] Starts cmd_loop
Dec 28 17:25:28.390 INFO [2] 127.0.0.1:8830 task reports connection:true
thread 'tokio-runtime-worker' panicked at 'called `Option::unwrap()` on a `None` value', upstairs/src/lib.rs:2601:19
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
Dec 28 17:25:28.446 INFO accepted connection, remote_addr: 127.0.0.1:45964, local_addr: 127.0.0.1:9998

The panic is in in_progress() while attempting to pull a job off the ds_active list:

    fn in_progress(&mut self, ds_id: u64, client_id: u8) -> Option<IOop> {
        let job = self.ds_active.get_mut(&ds_id).unwrap();

Now, I don't know what job_id that was yet, and I don't know which client_id it was either; I'm trying to see if I can figure that out, and how I might also improve the panic message here.

I can re-create this panic pretty easily, so I'll just dig a little further and try to understand exactly where we hit it.

Contributor

leftwo commented May 18, 2023

Ah, I think what is missing is having something like this:

self.ds_new[client_id as usize] = Vec::new();

In ds_set_faulted().

After adding that, I no longer get the panic when a faulted downstairs attempts to rejoin the region set.

Contributor

leftwo commented May 22, 2023

Here is the diff that will fix the panic I found, plus an improvement on the panic message.

diff --git a/upstairs/src/lib.rs b/upstairs/src/lib.rs
index beb8c00..e3edc7b 100644
--- a/upstairs/src/lib.rs
+++ b/upstairs/src/lib.rs
@@ -2598,7 +2598,12 @@ impl Downstairs {
      * has no work to do, so return None.
      */
     fn in_progress(&mut self, ds_id: u64, client_id: u8) -> Option<IOop> {
-        let job = self.ds_active.get_mut(&ds_id).unwrap();
+        let job = match self.ds_active.get_mut(&ds_id) {
+            Some(job) => job,
+            None => {
+                panic!("[{client_id}] in_progress can't find job {ds_id}");
+            }
+        };
 
         // If current state is Skipped, then we have nothing to do here.
         if job.state[&client_id] == IOState::Skipped {
@@ -2948,6 +2953,8 @@ impl Downstairs {
         let mut notify_guest = false;
         let mut retire_check = vec![];
 
+        // There should be no new jobs for this downstairs
+        self.ds_new[client_id as usize] = Vec::new();
         for (ds_id, job) in self.ds_active.iter_mut() {
             let state = job.state.get(&client_id).unwrap();

@bcantrill
Contributor Author

> Here is the diff that will fix the panic I found, plus an improvement on the panic message.

I was able to reproduce this by pstop'ing a downstairs while running measure-iops. Doing this revealed several problems: there were three other places that ds_new could become out of sync with ds_active; those have now all been fixed.

@bcantrill bcantrill marked this pull request as ready for review May 29, 2023 04:32
Contributor
@jmpesp jmpesp left a comment

Overall I'm good with this PR (IOPS GO BRRRR), except:

> Doing this revealed several problems: there were three other places that ds_new could become out of sync with ds_active; those have now all been fixed.

I'm anxious about the validity of our current tests, and about future cases of drift here. All tests pass on this PR, but 68 tests fail if I add an assertion that ds_new doesn't contain the job we're trying to mark InProgress:

diff --git a/upstairs/src/lib.rs b/upstairs/src/lib.rs
index 99df635..124946e 100644
--- a/upstairs/src/lib.rs
+++ b/upstairs/src/lib.rs
@@ -2884,6 +2885,7 @@ impl Downstairs {
         let new_state = IOState::InProgress;
         let old_state = job.state.insert(client_id, new_state.clone()).unwrap();
         assert_eq!(old_state, IOState::New);
+        assert!(!self.ds_new[client_id as usize].contains(&ds_id));
         self.io_state_count.decr(&old_state, client_id);
         self.io_state_count.incr(&new_state, client_id);

The tests that fail call ds.in_progress directly instead of letting io_send do it, which properly takes care of keeping ds_new correct. If I instead assert that the ds_id is contained in ds_new, then the two new dummy downstairs tests fail (where io_send has removed the job id from ds_new before calling in_progress). I don't think that assertion is correct or should be added, but I do think more checks should be added. I've suggested a few in the comments, keeping them as debug_assert so that they are (I assume!) compiled out to preserve the performance gains seen here.
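
For reference on that assumption: debug_assert! only expands to a real check when debug-assertions are enabled (the default for dev and test profiles); in a release build it compiles away entirely, so a check along these lines (names illustrative) is free on the hot path:

fn check_not_queued(ds_new: &[Vec<u64>], client_id: u8, ds_id: u64) {
    // Compiled out unless debug-assertions are enabled in the build profile.
    debug_assert!(
        !ds_new[client_id as usize].contains(&ds_id),
        "[{client_id}] job {ds_id} is still queued as New"
    );
}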

(not blocking this PR) I wonder if it's worth it to separate ds_active, ds_new, io_state_count, and possibly more into one new struct, where correctness can be maintained through the functions of that struct.
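
A minimal sketch of what that encapsulation might look like, with hypothetical names (not the crate's actual types):

use std::collections::BTreeMap;

struct ActiveJobs<Job> {
    ds_active: BTreeMap<u64, Job>,
    ds_new: [Vec<u64>; 3],
}

impl<Job> ActiveJobs<Job> {
    // Adding a job enqueues it as New for every client in one place, so
    // ds_new can never reference a job that is missing from ds_active.
    fn insert(&mut self, ds_id: u64, job: Job) {
        self.ds_active.insert(ds_id, job);
        for queue in self.ds_new.iter_mut() {
            queue.push(ds_id);
        }
    }

    // Faulting a client drops its queued-but-unsent work in the same
    // method, rather than relying on callers to remember to do it.
    fn fault_client(&mut self, client_id: u8) {
        self.ds_new[client_id as usize].clear();
    }
}

With this shape, io_send, in_progress, and the fault path would all go through methods that maintain the ds_active/ds_new invariant.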

One suggested check, from an inline review comment on requeue_work in upstairs/src/lib.rs:

diff --git a/upstairs/src/lib.rs b/upstairs/src/lib.rs
index 99df635..80cf58f 100644
--- a/upstairs/src/lib.rs
+++ b/upstairs/src/lib.rs
@@ -3351,6 +3368,10 @@ impl Downstairs {
      * [`new_work`], presumably due to flow control.
      */
     fn requeue_work(&mut self, client_id: u8, work: &[u64]) {
+        for job_id in work {
+        let job = self.ds_active.get_mut(job_id).unwrap();
+            debug_assert_eq!(*job.state.get(&client_id).unwrap(), IOState::New);
+        }
         self.ds_new[client_id as usize].extend_from_slice(work);
     }
 

Contributor
@leftwo leftwo left a comment

I just have the one concern about guaranteeing that our ds_active, when iterating, will indeed always return in key_id order, lowest to highest. I know we always allocate in order, and add to the end of the list in order. I'm just a little nervous that there is a path back in where we could put something out of order. The LiveRepair stuff does do some reservations for job IDs before they actually exist, and this might be a place where it could happen.

I'll dig a little further to see if I can find a situation like I'm concerned about.
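
For what it's worth, this is exactly the property a BTreeMap provides: iteration is always in ascending key order, regardless of insertion order. A minimal demonstration:

use std::collections::BTreeMap;

fn main() {
    let mut jobs: BTreeMap<u64, &str> = BTreeMap::new();
    // Insert job IDs out of order, as a reserved-then-filled ID might be.
    jobs.insert(1005, "flush");
    jobs.insert(1001, "write");
    jobs.insert(1003, "read");
    // Keys always come back sorted, regardless of insertion order.
    let ids: Vec<u64> = jobs.keys().copied().collect();
    assert_eq!(ids, vec![1001, 1003, 1005]);
}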

Contributor

leftwo commented May 30, 2023

Ah, yes, it's a b-tree now. Ignore my fears about this.

@leftwo leftwo self-requested a review May 30, 2023 19:07
Contributor
@leftwo leftwo left a comment

Ok, BTree, I'm all good to go here.
