Eliminate some scans + store jobs in BTree #738

Merged
merged 28 commits into from May 30, 2023

Conversation

Contributor

@bcantrill bcantrill commented May 18, 2023

In addition to fixing #731, this also fixes #757 -- and pulls in #733.

bcantrill and others added 10 commits May 13, 2023 23:26
The `block_contexts` table has a primary key of all the columns, but for
the unencrypted case only three have values, and this doesn't seem to
reject duplicate unencrypted contexts.

Add a separate UNIQUE INDEX for unencrypted block context rows (namely,
no nonce and tag), and change the INSERT to INSERT OR IGNORE to elide
duplicate rows.
Submit different block data as part of the IO loop. Add an option to
submit all zeroes instead if that workload is still desired.
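
A rough sketch of what the first commit above describes (a partial UNIQUE index plus INSERT OR IGNORE), assuming the rusqlite crate and hypothetical column names; the real block_contexts table has more columns for the encrypted case:

use rusqlite::{params, Connection, Result};

fn insert_unencrypted_context(
    conn: &Connection,
    block: i64,
    hash: i64,
) -> Result<()> {
    // A partial UNIQUE index covering only rows with no nonce or tag
    // (i.e. the unencrypted case) lets SQLite reject duplicates there.
    conn.execute_batch(
        "CREATE UNIQUE INDEX IF NOT EXISTS block_contexts_unencrypted
             ON block_contexts (block, hash)
             WHERE nonce IS NULL AND tag IS NULL;",
    )?;
    // INSERT OR IGNORE silently elides rows that would violate the index.
    conn.execute(
        "INSERT OR IGNORE INTO block_contexts (block, hash)
             VALUES (?1, ?2)",
        params![block, hash],
    )?;
    Ok(())
}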
Contributor

leftwo commented May 18, 2023

Love the improvements, a marked difference.

I did hit a new panic running the tools/test_reconnect.sh test, but I'm not sure yet whether it is the fault of these changes, though they are my first suspect.

The panic happens when the upstairs has decided that a downstairs has been "gone too long" and has marked that downstairs as faulted. This should result in all new IO for that downstairs, and any current IO that has not completed, being marked as "skipped". Without the LiveRepair changes in place yet, the downstairs will just stay faulted. And indeed, on stock bits this is what happens and we don't see a panic. Even when that downstairs "comes back", it remains in purgatory (OnlineRepairReady) and is not allowed to re-join the region set.

However, when this same situation happens on the perf bits, I'm hitting this panic:

Dec 28 17:25:28.389 WARN Full(5398) Failed to notify client 2 of work 5398
05398/15000 Write block 4853  len 24576  data:   4   3   2   2   3   3
Dec 28 17:25:28.389 WARN Full(5399) Failed to notify client 2 of work 5399
05399/15000 Write block 3288  len  8192  data:   3   3
Dec 28 17:25:28.390 WARN Full(5400) Failed to notify client 2 of work 5400
Dec 28 17:25:28.390 INFO [2] a1bcf246-0610-4b3a-bd97-22b1d6a593e1 (f84b9698-7b00-4e87-97e5-bbd78792b8d4) Active Active Faulted ds_transition to OnlineRepairReady
Dec 28 17:25:28.390 INFO [2] Transition from Faulted to OnlineRepairReady
Dec 28 17:25:28.390 WARN [2] new RM replaced this: Some(RegionMetadata { generation: [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], flush_numbers: [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], dirty: [false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false] })
Dec 28 17:25:28.390 INFO [2] a1bcf246-0610-4b3a-bd97-22b1d6a593e1 Enter Ready for Online Repair mode
Dec 28 17:25:28.390 INFO [2] Starts cmd_loop
Dec 28 17:25:28.390 INFO [2] 127.0.0.1:8830 task reports connection:true
thread 'tokio-runtime-worker' panicked at 'called `Option::unwrap()` on a `None` value', upstairs/src/lib.rs:2601:19
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
Dec 28 17:25:28.446 INFO accepted connection, remote_addr: 127.0.0.1:45964, local_addr: 127.0.0.1:9998

The panic is in in_progress() while attempting to pull a job off the ds_active list:

    fn in_progress(&mut self, ds_id: u64, client_id: u8) -> Option<IOop> {
        let job = self.ds_active.get_mut(&ds_id).unwrap();

Now, I don't know what job_id that was yet, and I don't know which client_id it was either; I'm trying to see if I can figure that out, and how I might also improve the panic message here.

I can re-create this panic pretty easily, so I'll just dig a little further and try to understand exactly where we hit it.

Contributor

leftwo commented May 18, 2023

Ah, I think what is missing is having something like this:

self.ds_new[client_id as usize] = Vec::new();

In ds_set_faulted().

After adding that, I no longer get the panic when a faulted downstairs attempts to rejoin the region set.

Contributor

leftwo commented May 22, 2023

Here is the diff that will fix the panic I found, plus an improvement on the panic message.

diff --git a/upstairs/src/lib.rs b/upstairs/src/lib.rs
index beb8c00..e3edc7b 100644
--- a/upstairs/src/lib.rs
+++ b/upstairs/src/lib.rs
@@ -2598,7 +2598,12 @@ impl Downstairs {
      * has no work to do, so return None.
      */
     fn in_progress(&mut self, ds_id: u64, client_id: u8) -> Option<IOop> {
-        let job = self.ds_active.get_mut(&ds_id).unwrap();
+        let job = match self.ds_active.get_mut(&ds_id) {
+            Some(job) => job,
+            None => {
+                panic!("[{client_id}] in_progress can't find job {ds_id}");
+            }
+        };
 
         // If current state is Skipped, then we have nothing to do here.
         if job.state[&client_id] == IOState::Skipped {
@@ -2948,6 +2953,8 @@ impl Downstairs {
         let mut notify_guest = false;
         let mut retire_check = vec![];
 
+        // There should be no new jobs for this downstairs
+        self.ds_new[client_id as usize] = Vec::new();
         for (ds_id, job) in self.ds_active.iter_mut() {
             let state = job.state.get(&client_id).unwrap();

@bcantrill
Contributor Author

> Here is the diff that will fix the panic I found, plus an improvement on the panic message.

I was able to reproduce this by pstop'ing a downstairs while running measure-iops. Doing this revealed several problems: there were three other places that ds_new could become out of sync with ds_active; those have now all been fixed.

@bcantrill bcantrill marked this pull request as ready for review May 29, 2023 04:32
Contributor
@jmpesp jmpesp left a comment

Overall I'm good with this PR (IOPS GO BRRRR), except:

> Doing this revealed several problems: there were three other places that ds_new could become out of sync with ds_active; those have now all been fixed.

I'm anxious about the validity of our current tests, and about future cases of drift here. All tests pass on this PR, but 68 tests fail if I add an assertion that ds_new doesn't contain the job we're trying to mark InProgress:

diff --git a/upstairs/src/lib.rs b/upstairs/src/lib.rs
index 99df635..124946e 100644
--- a/upstairs/src/lib.rs
+++ b/upstairs/src/lib.rs
@@ -2884,6 +2885,7 @@ impl Downstairs {
         let new_state = IOState::InProgress;
         let old_state = job.state.insert(client_id, new_state.clone()).unwrap();
         assert_eq!(old_state, IOState::New);
+        assert!(!self.ds_new[client_id as usize].contains(&ds_id));
         self.io_state_count.decr(&old_state, client_id);
         self.io_state_count.incr(&new_state, client_id);

The tests that fail call ds.in_progress directly instead of letting io_send do it, which properly takes care of keeping ds_new correct. If I instead assert that the ds_id is contained in ds_new, then the two new dummy downstairs tests fail (where io_send has removed the job id from ds_new before calling in_progress). I don't think that assertion is correct or should be added, but I do think more checks should be added. I've suggested a few in the comments, keeping them as debug_assert so that they are (I assume!) compiled out to preserve the performance gains seen here.
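
For reference on that assumption: debug_assert! only expands to a real check when debug-assertions are enabled (the default for dev and test profiles); in a release build it compiles away entirely, so a check along these lines (names illustrative) is free on the hot path:

fn check_not_queued(ds_new: &[Vec<u64>], client_id: u8, ds_id: u64) {
    // Compiled out unless debug-assertions are enabled in the build profile.
    debug_assert!(
        !ds_new[client_id as usize].contains(&ds_id),
        "[{client_id}] job {ds_id} is still queued as New"
    );
}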

(not blocking this PR) I wonder if it's worth it to separate ds_active, ds_new, io_state_count, and possibly more into one new struct, where correctness can be maintained through the functions of that struct.
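
A minimal sketch of what that encapsulation might look like, with hypothetical names (not the crate's actual types):

use std::collections::BTreeMap;

struct ActiveJobs<Job> {
    ds_active: BTreeMap<u64, Job>,
    ds_new: [Vec<u64>; 3],
}

impl<Job> ActiveJobs<Job> {
    // Adding a job enqueues it as New for every client in one place, so
    // ds_new can never reference a job that is missing from ds_active.
    fn insert(&mut self, ds_id: u64, job: Job) {
        self.ds_active.insert(ds_id, job);
        for queue in self.ds_new.iter_mut() {
            queue.push(ds_id);
        }
    }

    // Faulting a client drops its queued-but-unsent work in the same
    // method, rather than relying on callers to remember to do it.
    fn fault_client(&mut self, client_id: u8) {
        self.ds_new[client_id as usize].clear();
    }
}

With this shape, io_send, in_progress, and the fault path would all go through methods that maintain the ds_active/ds_new invariant.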

One suggested check, from an inline review comment on requeue_work in upstairs/src/lib.rs:

diff --git a/upstairs/src/lib.rs b/upstairs/src/lib.rs
index 99df635..80cf58f 100644
--- a/upstairs/src/lib.rs
+++ b/upstairs/src/lib.rs
@@ -3351,6 +3368,10 @@ impl Downstairs {
      * [`new_work`], presumably due to flow control.
      */
     fn requeue_work(&mut self, client_id: u8, work: &[u64]) {
+        for job_id in work {
+        let job = self.ds_active.get_mut(job_id).unwrap();
+            debug_assert_eq!(*job.state.get(&client_id).unwrap(), IOState::New);
+        }
         self.ds_new[client_id as usize].extend_from_slice(work);
     }
 

Contributor
@leftwo leftwo left a comment

I just have the one concern about guaranteeing that our ds_active, when iterating, will indeed always return in key_id order, lowest to highest. I know we always allocate in order, and add to the end of the list in order. I'm just a little nervous that there is a path back in where we could put something out of order. The LiveRepair stuff does do some reservations for job IDs before they actually exist, and this might be a place where it could happen.

I'll dig a little further to see if I can find a situation like I'm concerned about.
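
For what it's worth, this is exactly the property a BTreeMap provides: iteration is always in ascending key order, regardless of insertion order. A minimal demonstration:

use std::collections::BTreeMap;

fn main() {
    let mut jobs: BTreeMap<u64, &str> = BTreeMap::new();
    // Insert job IDs out of order, as a reserved-then-filled ID might be.
    jobs.insert(1005, "flush");
    jobs.insert(1001, "write");
    jobs.insert(1003, "read");
    // Keys always come back sorted, regardless of insertion order.
    let ids: Vec<u64> = jobs.keys().copied().collect();
    assert_eq!(ids, vec![1001, 1003, 1005]);
}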

Contributor

leftwo commented May 30, 2023

Ah, yes, it's a b-tree now. Ignore my fears about this.

@leftwo leftwo self-requested a review May 30, 2023 19:07
Contributor
@leftwo leftwo left a comment

Ok, BTree, I'm all good to go here.
