-
Notifications
You must be signed in to change notification settings - Fork 41
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[nexus] add RPW for proactively pulling instance state from sled-agents #5611
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great start -- this is a first batch of comments, and I can come back and do a deeper review in a moment
nexus-config/src/nexus_config.rs
Outdated
/// maximum number of retries to attempt before considering a sled-agent | ||
/// dead. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm being nitpicky, but I think this distinction matters - I don't think we're considering that the sled-agent is necessarily dead here, but rather, it sounds like we're making some decision that the instance should e marked failed? I think that logical jump is slightly smaller.
Seems plausible to me that we could bump into this case with e.g. a network partition as well, in which case the sled may be alive.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, to be clear: I like the idea of drawing inferences from client requests to the sleds being unresponsive, but I think we may want to consider this pattern more holistically. There are a lot of ways to "check if the sled is alive or not" (e.g., sending queries, checking-in with MGS, checking-in with the PSC), but I'm not sure this is sufficient to make the call about the whole sled.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, that makes sense. Right now, we're not doing anything with this information besides logging a warning, but I think that the eventual end goal would be to use this as a signal that we should look in other places to figure out what's going on. E.g. I was also thinking that if the sled-agent isn't responding, Nexus may also want to try talking to the VMMs directly to see if they're interested in talking to it --- it might just be the sled-agent that's wedged. And, before we make determinations about the whole sled, we should absolutely look elsewhere as well.
let rsp = backoff::retry_notify( | ||
backoff, | ||
|| async { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I kinda agree with @andrewjstone 's comments here -- I'm not sure that we should be retrying in this background task.
I don't think that this task -- instance monitoring -- is ultimately responsible for deciding whether or not a sled is alive. It is, however, responsible for updating the state of an instance if the sled rebooted. In this case, if we see:
- Sled policy: not expunged
- Sled agent: not responding
... and we wait... later, we should see:
- Sled policy: not expunged
- Sled agent: responding! and I don't know about your instance.
I don't think this requires an inner retry loop, nor a "max retries" check.
As an alternative, if we expunge the sled, we'll see:
- Sled policy: expunged
- Sled agent: <we shouldn't talk to it>
We'll need to sort through #4872 to decide what to do here, but at least, we'll know that the sled being out-to-lunch is an explicitly marked operation, not something we need to infer
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, regarding this and your comments from #5611 (comment) and #5611 (comment), here's what I'm currently thinking I'm going to do as part of this PR:
- remove all the retry logic
- add a new table of "sled instance check failures" to the database. for now, i'm imagining that this would be shaped sort of like
(sled_id, instance_id, id_of_the_nexus_performing_the_health_check, time_updated, failure_count)
. we might also want to add a notion of failure kinds to this as well. - if we ask a sled-agent about an instance we believe it has, and the sled-agent responds with an HTTP error that proactively indicates that it does not know about that instance, mark that instance as
Failed
(as I believe this would indicate that the sled has been rebooted and the instance is gone1) - for all other errors (the ones that currently get retried), such as TCP connections refused/timing out, etc, increment the failure count table for that
(sled_id, instance_id, our_nexus_id)
tuple - for now, all we do is increment that error count. eventually, the error count might be exposed to an operator and/or used to drive automated decisions (e.g. if the failure count for a particular instance is > n, but the sled-agent has responded successfully for other instances, restart the instance...if the failure count for all instances on a sled is > n, power cycle the sled, etc2)
What do you think?
Footnotes
-
Potentially, if we wanted to be maximally distrustful of the
sled-agent
's understanding of the state of the world, we could also try pinging the actualpropolis-server
process for that instance, to see if it still exists. If thesled-agent
doesn't believe it exists, it shouldn't, but unfortunately, as you probably already know, distributed systems... ↩ -
This is more or less what I was using the max-retries limit for in the current implementation. ↩
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like this idea, I think my main feedback regards that "failure table". This doesn't actually even need to block us -- if you want to do a simpler version that "just logs an error", I think that is an okay place to start. I like your suggestion that an explicit response from a Sled Agent HTTP server saying "the instance is gone" should be enough info for us to mark the instance as failed - if we get a response, we have confirmation of state.
I bring all this up because "how we track+present no-response failures" is thorny. It's cool, we should be working on it, but it's big, and probably needs some design considerations, and I don't want any of that to stop this PR from moving forward.
That being said...
- Almost all of the sled agent endpoints are hit by Nexus at one point or another -- why would we treat the instance monitoring call as the only one where we check for HTTP errors? If we went down the approach of "HTTP errors are worthy of recording", shouldn't we be doing this for all our endpoints accessed by Nexus here?
- Is a counter sufficient information to diagnose what has happened? The moment we update failure_count / time_updated, we have no idea how big the window of failures was. Did we see one transient failure a second ago? or has this sled been dead for hours?
- Alternatively, if we had a "row-per-error" type of structure, or something like a window, how would we do garbage collection to prevent the table from growing forever?
- ... and, given that these are events, would it make sense to store this info in a column-based database -- like clickhouse -- instead of cockroachdb? (I'm not sure)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've a couple of random thoughts here:
if we ask a sled-agent about an instance we believe it has, and the sled-agent responds with an HTTP error that proactively indicates that it does not know about that instance, mark that instance as Failed
What happens if we fetch the sled and instance IDs, the instance migrates, and then we make the request to the original sled about it?
we might also want to add a notion of failure kinds to this as well.
I think this is really good to add early. It would help debugging a lot if we could distinguish connection timeouts from refusals, for example.
for all other errors (the ones that currently get retried), such as TCP connections refused/timing out, etc, increment the failure count table for that (sled_id, instance_id, our_nexus_id) tuple
Is the history of the failure behavior useful to track? If I understand this proposal, there is a single record with the cumulative failure count, rather than one record for each failure. The latter may be very useful in the long-run for making more complex diagnoses.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It does seem like emitting both these failures and failures from clients to sled-agent endpoints elsewhere in Nexus as Oximeter metrics is potentially a good idea. The table of failure counts was just sort of a strawman proposal so that we're doing something with the information here...I'm also happy to just log them and punt on doing anything else in this PR!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you would like to spit these out as a timeseries, it might be helpful to look at the oximeter
self-stats module, which reports statistics about its own collections. A very similar kind of reporting could be done here, I think.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A quick update on this: I'm currently working on emitting these failure counts as Oximeter metrics. I'll probably want @bnaecker's feedback on how I'm doing that once I push it up for review --- I can ping you then!
a924d8d
to
a679561
Compare
a679561
to
9ea71c5
Compare
Currently, the instance state management code in nexus is implemented as methods on the `Nexus` type. This is problematic, as these methods can only be called in HTTP endpoints, and not from background tasks or sagas. For example, as part of #5611, I needed to call the `Nexus::notify_instance_updated` method from a background task, which isn't possible with the current factoring. Therefore, this PR moves that method and its dependencies, as well as all the methods in the `instance_network` module into free functions that can be called without a `Nexus` reference. The `Nexus` methods are left in place in order to avoid having to update all the callers as part of this PR, but they now just call into the free functions. The free functions take a few more arguments that are provided by the `&self` reference in the `Nexus` methods, such as the internal DNS `Resolver` and `opctx_alloc`, so the methods pass these in from `&self`, making them a bit more convenient to call when a `Nexus` reference _is_ available. For both of those reasons, I thought keeping them around was the right thing to do. I had briefly considered moving this code to a new `nexus-instances` crate, but it seemed a bit iffy to me as `notify_instance_updated` also unregisters an Oximeter producer, and I didn't love the idea of moving Oximeter-related code to a `nexus-instances` crate --- perhaps it also ought to go in a `nexus-oximeter` crate, or something, but that felt like a rabbit hole that I was afraid I might just end up going down forever, so I'll save it for later. If there are any functional changes as part of this PR, I've made a very bad mistake; this _should_ be a pure refactor...
Hmm, upon further consideration, I think this change actually does depend on the instance-update saga that Greg proposed being implemented first. The reason for this is that i realized that in the current code, where instance states are only pushed from sled-agents to Nexus, the sled agent only calls
Footnotes
|
Or, perhaps the instance's generation number already guards against this? I'm not totally sure, I'll have to think a bit more about it... |
384a5f3
to
c3694fc
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm still responding to your main question, but GH decided I can't post single comments anymore, only reviews!
|
||
// All requests fired off! While we wait for them to come back, | ||
// let's prune old instances. | ||
let pruned = self.metrics.lock().unwrap().prune(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IIUC, here we prune instances based on whether they've been marked touched by the tasks we spawn above. Does that mean we're racing them? For example, suppose we have the following:
- We spawn the checking task, it marks the instance touched.
- The prune call swaps that with
false
so it's marked "untouched" - On the next pass, we spawn the checking task, but before it runs...
- We get here, pruning that still-untouched instance.
I think the actual data we report to ClickHouse will be handled correctly, in either case, but it does mean there will appear to be many restarts of that cumulative timeseries. Logically, it seems like we should be using the database to tell us which instances to report data for: if we have it in our list from the database, keep it, otherwise prune it. That might not play well with the pagination you have here, but I think it could be made to work and avoids the race above.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After thinking about this for a minute, I don't actually believe this is racy.
The checking task itself does not actually mutate the metrics, it just returns
a value to this task, which updates the metric and sets the touched flag. This is the only task that
mutates the metrics, and there can only be one of these tasks executing at any
given time, as activating the background task borrows it mutably. And, we won't
activate the background task again until all in-flight checks have completed,
because this task waits for all tasks spawned on the JoinSet
to complete
before this future completes. There would be
a potential race if the checking tasks were responsible for actually updating
the metrics, but they're not.
I think there is an issue here, though, but it's much simpler: the issue is
just that we should call prune
after all tasks spawned on the JoinSet
have
completed, rather than before.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the checking task itself does not actually mutate the metrics
Ah, I didn't grok that, thanks.
we should call prune after all tasks spawned on the JoinSet have completed
I think that makes sense. It seems logically equivalent to pruning any instances that are not returned from the database, right? If so, then that seems good to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, I'm very confused what's going on with GH's UI here. This was supposed to be a one-off comment answering your question. It appears I can only add it as part of a review now. Let me know if anything is unclear!
/// The number of successful checks for a single instance, VMM, and sled agent. | ||
#[derive(Clone, Debug, Metric)] | ||
struct Check { | ||
/// The string representation of the instance's state as understood by | ||
/// the VMM. If the check failed, this will generally be "failed". | ||
state: String, | ||
/// `Why the instance was marked as being in this state. | ||
/// | ||
/// If an instance was marked as "failed" due to a check failure, this | ||
/// will be a string representation of the failure reason. Otherwise, if | ||
/// the check was successful, this will be "success". Note that this may | ||
/// be "success" even if the instance's state is "failed", which | ||
/// indicates that we successfully queried the instance's state from the | ||
/// sled-agent, and the *sled-agent* reported that the instance has | ||
/// failed --- which is distinct from the instance watcher marking an | ||
/// instance as failed due to a failed check. | ||
reason: String, | ||
/// this will be a string representation of the failure reason. | ||
datum: Cumulative<u64>, | ||
} | ||
|
||
/// The number of unsuccessful checks for an instance and sled agent pair. | ||
#[derive(Clone, Debug, Metric)] | ||
struct IncompleteCheck { | ||
/// The reason why the check was unsuccessful. | ||
/// | ||
/// This is generated from the [`Incomplete`] enum's `Display` implementation. | ||
reason: String, | ||
/// The number of failed checks for this instance and sled agent. | ||
datum: Cumulative<u64>, | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a good question, and though I've been thinking for a while, I'm very sorry to say I don't have a good answer for you :/
In general, I'd probably make this decision based on how one wants to query the data. If one wants to easily break out or group by a field, they should probably be part of the same timeseries. If that's rare, then I'd probably put these mostly-unrelated metrics (instance-state vs instance-state-fetch-failures) into two different timeseries.
My only hesitation with grouping them is that it sort of complicates the meaning of the data. IIUC, then the reason
field is non-empty iff the state is "unknown"
, i.e., we failed the check. So even the successful samples carry this now-empty field. To be clear, I'm not worried about resource usage here, just confusion about what the field means, and when it's valid. Maybe changing that field name to failure_reason
is enough to mitigate that confusion, though. And this is a pretty non-specific worry -- I think it should mechanically work fine to do that, even if one wants to separately query for things like known states or failed checks. In either case, we can answer the same kinds of questions:
- what fraction of time was the instance in some state (say, stopped or running), over some period of time?
- how often did the instance change states?
- was the instance rebooted in this period?
- why did Nexus fail to determine the instance's state?
So I'd just say, choose what feels most natural to you at this point! I wish I had more concrete advice for you, but I guess we'll have to develop that together as we use it more :)
Thanks for all your patience on the back-and-forth on this PR! I've one other question, which I feel does not block integrating this, just something I'm curious about. How does the fact that each Nexus instance runs these checks impact the resulting metrics? |
Yeah, that's why I asked you whether Oximeter automatically included producer metadata or not --- I was, once again, suffering from Prometheus brain damage (prometheus always adds the IP and port of the scrape target, as well as metadata about how it was discovered). Commit 08e01ad adds the UUID of the Nexus process to the metric target. We could also add the rack UUID if you think that's worth including now as well? |
@hawkw My current stance is that more fields are usually better :) That might change in the future, but they can always be ignored in the query and storing them is inexpensive. So if you think the rack / cluster ID is something you might want to know, then go for it! |
Given that it sounds like metric schema changes are painful, I'll throw it in just in case we need it! :) |
I wouldn't call them painful, so much as impossible :) I hope to change that in the coming months. |
Currently, instance states are communicated by the sled-agents of the
sleds on which an instance is running calling into a Nexus HTTP API.
This is a "push"-shaped communication pattern, where the monitored
resources (the sled-agents) act as clients of the monitor (Nexus) to
publish their state to it. While push-shaped communications are
generally more efficient than pull-shaped communications (as they only
require communication when a monitored resource actively changes state),
they have a substantial disadvantage: if the sled-agent is gone, stuck,
or otherwise failed, Nexus will never become aware of this, and will
continue happily chugging along assuming the instances are in the last
state reported by their sled-agents.
In order to allow Nexus to determine when a sled-agent or VMM has
failed, this branch introduces a background task ("instance_watcher")
that periodically attempts to pull the states of a sled-agent's managed
instances. This way, if calls to sled agents fail, the Nexus process is
made aware that a sled-agent it expected to publish the state of its
monitored instances may no longer be doing so, and we can (eventually)
take corrective action.
Presently, the
instance_watcher
task does not perform correctiveaction, such as restarting failed instances or advancing instances to
the "failed" state in CRDB, as affirmatively detecting failures will
require other work. For example, if a sled-agent is unreachable when
checking on a particular instance, the instance itself might be fine,
and we shouldn't decide that it's gone unless we also can't reach the
propolis-server
process directly. Furthermore, there's the split-brainproblem to consider: the Nexus process performing the check might be on
one side of a network partition...and it might not be on the same side
of that partition as the majority of the rack, so we may not want to
consider an unreachable sled-agent "failed" unless all the Nexus
replicas can't talk to it. And, there are probably even more
considerations I haven't thought of yet. I'm planning on starting an RFD
to propose a comprehensive design for instance health checking.
In the meantime, though, the
instance_watcher
task just emits a bunchof Oximeter metrics describing the results of instance checks. These
metrics capture:
that something is wrong with the instance or its sled-agent or sled
error, such as a client-side request error or an error updating the
instance's database records
occurred as a result of the check:
instances
table was updated as aresult of the state returned by sled-agent
vmms
table was updated as a result ofthe state returned by sled-agent