Nexus holds on to specific service backends of crdb and crucible-pantry for too long #3763
I think ignoring the TTL makes sense (not that it's right, just that it makes sense) in the context of this code: Lines 167 to 195 in a29f08b
Our API to interface with CockroachDB from Nexus involves "constructing a URL", and right now, we embed a single point-in-time address of CockroachDB into this URL. We should probably avoid doing this and embed the service name into the URL instead, so that it actually uses DNS during later lookups. |
Understood. The current behavior, in conjunction with #3613, results in a 1 in 15 probability that a user will not be able to use the system at all. I happened to be in this situation when my workstation always used the one Nexus that had the first CRDB backend I brought down to test failover. :( |
In this case, there's no TTL: a specific pantry from the list of available ones is selected once and used for a disk until that disk is finalized. One thing that could be done is to check if the selected pantry is still responsive and choose another one if not, but it was important to keep as much code as possible out of the import chunk hot path, as any checks there will be multiplied by the number of chunks to import and slow down imports. That being said, slow imports are better than non-working imports :) I'll give this some thought. |
Sorry for not being clear in pantry's case. I understand the need to stay with the same pantry for the same disk snapshot. But the issue I ran into is with new/separate snapshot requests still holding on to the unresponsive one. |
#3783) First pass at #3763 for crdb. Even though we did query internal DNS, we were previously using only a single host as part of connecting to crdb from Nexus. And since the internal DNS server always returns records in the same order, that meant every Nexus instance was always using the same CockroachDB instance even now that we've been provisioning multiple. This also meant if that CRDB instance went down we'd be hosed (as seen in #3763).

To help with that, this PR changes Nexus to use all the CRDB hosts reported via Internal DNS when creating the connection URL. There are some comments in the code, but this is still not quite as robust as it could be; short of something cueball-like, though, it's still an improvement.

To test, I disabled the initial crdb instance Nexus connected to, and it was able to recover by connecting to the next crdb instance and continue serving requests. From the log we can see a successful query, connection errors once I disabled `fd00:1122:3344:101::5`, and then a successful query with the connection reestablished to the next crdb instance (`fd00:1122:3344:101::3`):

```
23:43:24.729Z DEBG 7be03b0d-48bf-4f43-a11e-7303236a3c5e (ServerContext): authorize result action = Query actor = Some(Actor::UserBuiltin { user_builtin_id: 001de000-05e4-4000-8000-000000000003, .. }) resource = Database result = Ok(())
23:43:24.729Z ERRO 7be03b0d-48bf-4f43-a11e-7303236a3c5e (ServerContext): database connection error database_url = postgresql://root@[fd00:1122:3344:101::5]:32221,[fd00:1122:3344:101::3]:32221,[fd00:1122:3344:101::6]:32221,[fd00:1122:3344:101::4]:32221,[fd00:1122:3344:101::7]:32221/omicron?sslmode=disable error_message = Connection error: server is shutting down
23:43:24.729Z ERRO 7be03b0d-48bf-4f43-a11e-7303236a3c5e (ServerContext): database connection error database_url = postgresql://root@[fd00:1122:3344:101::5]:32221,[fd00:1122:3344:101::3]:32221,[fd00:1122:3344:101::6]:32221,[fd00:1122:3344:101::4]:32221,[fd00:1122:3344:101::7]:32221/omicron?sslmode=disable error_message = Connection error: server is shutting down
23:43:24.729Z ERRO 7be03b0d-48bf-4f43-a11e-7303236a3c5e (ServerContext): database connection error database_url = postgresql://root@[fd00:1122:3344:101::5]:32221,[fd00:1122:3344:101::3]:32221,[fd00:1122:3344:101::6]:32221,[fd00:1122:3344:101::4]:32221,[fd00:1122:3344:101::7]:32221/omicron?sslmode=disable error_message = Connection error: server is shutting down
23:43:24.729Z ERRO 7be03b0d-48bf-4f43-a11e-7303236a3c5e (ServerContext): database connection error database_url = postgresql://root@[fd00:1122:3344:101::5]:32221,[fd00:1122:3344:101::3]:32221,[fd00:1122:3344:101::6]:32221,[fd00:1122:3344:101::4]:32221,[fd00:1122:3344:101::7]:32221/omicron?sslmode=disable error_message = Connection error: server is shutting down
23:43:24.729Z ERRO 7be03b0d-48bf-4f43-a11e-7303236a3c5e (ServerContext): database connection error database_url = postgresql://root@[fd00:1122:3344:101::5]:32221,[fd00:1122:3344:101::3]:32221,[fd00:1122:3344:101::6]:32221,[fd00:1122:3344:101::4]:32221,[fd00:1122:3344:101::7]:32221/omicron?sslmode=disable error_message = Connection error: server is shutting down
23:43:24.729Z ERRO 7be03b0d-48bf-4f43-a11e-7303236a3c5e (ServerContext): database connection error database_url = postgresql://root@[fd00:1122:3344:101::5]:32221,[fd00:1122:3344:101::3]:32221,[fd00:1122:3344:101::6]:32221,[fd00:1122:3344:101::4]:32221,[fd00:1122:3344:101::7]:32221/omicron?sslmode=disable error_message = Connection error: server is shutting down
23:43:24.730Z ERRO 7be03b0d-48bf-4f43-a11e-7303236a3c5e (ServerContext): database connection error database_url = postgresql://root@[fd00:1122:3344:101::5]:32221,[fd00:1122:3344:101::3]:32221,[fd00:1122:3344:101::6]:32221,[fd00:1122:3344:101::4]:32221,[fd00:1122:3344:101::7]:32221/omicron?sslmode=disable error_message = Connection error: server is shutting down
23:43:24.730Z ERRO 7be03b0d-48bf-4f43-a11e-7303236a3c5e (ServerContext): database connection error database_url = postgresql://root@[fd00:1122:3344:101::5]:32221,[fd00:1122:3344:101::3]:32221,[fd00:1122:3344:101::6]:32221,[fd00:1122:3344:101::4]:32221,[fd00:1122:3344:101::7]:32221/omicron?sslmode=disable error_message = Connection error: server is shutting down
23:43:30.803Z DEBG 7be03b0d-48bf-4f43-a11e-7303236a3c5e (ServerContext): roles roles = RoleSet { roles: {} }
23:43:30.804Z DEBG 7be03b0d-48bf-4f43-a11e-7303236a3c5e (ServerContext): authorize result action = Query actor = Some(Actor::UserBuiltin { user_builtin_id: 001de000-05e4-4000-8000-000000000003, .. }) resource = Database result = Ok(())
```
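For anyone following along, here is a minimal sketch of how a multi-host connection URL of the shape seen in the log above could be assembled from a list of resolved backends. It is an illustration only, not the Nexus code: the function name `crdb_url` and its parameters are hypothetical, and the defaults simply mirror the values visible in the logged `database_url`.

```python
from ipaddress import ip_address


def crdb_url(backends, user="root", database="omicron", sslmode="disable"):
    """Build a libpq-style URL that lists every backend as host:port.

    `backends` is an iterable of (address, port) pairs. This helper is
    hypothetical and only mirrors the URL shape shown in the log.
    """
    hosts = []
    for addr, port in backends:
        ip = ip_address(addr)
        # IPv6 literals must be wrapped in brackets inside a URL.
        host = f"[{ip}]" if ip.version == 6 else str(ip)
        hosts.append(f"{host}:{port}")
    return f"postgresql://{user}@{','.join(hosts)}/{database}?sslmode={sslmode}"
```

Note that the order of the host list is whatever order the caller passes in, which is why a DNS server that always returns records in the same order leads every Nexus to prefer the same first host.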
#3783 partially addresses this for crdb. We now grab all the CockroachDB hosts via internal DNS at Nexus startup and add them all to the connection string. Whenever a new connection is established, it'll try the listed hosts in order and use the first one that successfully connects. While that's an improvement in that Nexus won't fail to serve requests if one crdb instance goes down, there are still some issues:
|
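The "try the listed hosts in order and use the first one that connects" behavior described above can be sketched roughly as follows. This is not libpq's implementation; `connect_first_available` and the `connect` callable are hypothetical stand-ins used only to show the control flow.

```python
def connect_first_available(hosts, connect):
    """Return (host, connection) for the first host that accepts a connection.

    `connect` is a caller-supplied function (an assumption of this sketch)
    that returns a connection object or raises OSError on failure.
    """
    errors = []
    for host in hosts:
        try:
            return host, connect(host)
        except OSError as err:
            # Remember the failure and fall through to the next host.
            errors.append((host, err))
    raise ConnectionError(f"no backend reachable: {errors}")
```

Because the list is fixed at startup and always walked from the front, every new connection pays for the dead hosts ahead of the first live one, and hosts added later are never seen.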
I'm poking at this issue again, now that we're looking at possibly expunging zones which are running CRDB nodes.
Where the System Currently Stands
CockroachDB nodes, when initializing, access internal DNS to find the IP addresses of other nodes within the cluster: omicron/smf/cockroachdb/method_script.sh Lines 13 to 19 in cfa6bd9
Nexus, when creating a connection to a pool, performs a "one-time DNS lookup" here: Lines 189 to 224 in d80cd29
So, specifically eyeing the case of "a single CockroachDB node fails, what do we do?":
What do we want to do
I've tried to dig into the CockroachDB and Postgres docs to explore our options in this area when constructing the
Options
Specify a host address in the form of a hostname, use DNS to resolve
It should hopefully be possible to provide a hostname to the
As far as I can tell -- feedback welcome if folks see alternate pathways -- the mechanism to point
For nexus, this would mean:
Subsequent experimentation is necessary to determine how failures are propagated back to Nexus when some of these nodes die, and to identify if new nodes are actually made accessible.
Specify multiple hosts during nexus-side construction of
|
Here's the result of some of my experimentation. I'd really rather have a local test here - that's what I was trying to build out in #5628 - but it's quite difficult to create all this in an isolated environment, since changing the DNS server used by postgres relies on changing "system-wide" config in
Setup
I spun up a DNS server within Omicron, sitting on port 53:
sudo ./target/debug/dns-server --http-address [::1]:0 --dns-address [::1]:53 --config-file $PWD/dns-server/examples/config.toml
I then spun up three CockroachDB nodes talking to each other within a cluster. These are all on localhost, on ports 7709, 7710, and 7711.
cockroach start --insecure --join [::1]:7709,[::1]:7710,[::1]:7711 --store /var/tmp/crdb1 --listen-addr [::1]:7709 --http-addr :0
cockroach start --insecure --join [::1]:7709,[::1]:7710,[::1]:7711 --store /var/tmp/crdb2 --listen-addr [::1]:7710 --http-addr :0
cockroach start --insecure --join [::1]:7709,[::1]:7710,[::1]:7711 --store /var/tmp/crdb3 --listen-addr [::1]:7711 --http-addr :0
cockroach init --insecure --host [::]:7709
Then I populated my internal DNS server with these records:
# SRV records
./target/debug/dnsadm -a "[::1]:45901" add-srv control-plane.oxide.test _cockroach._tcp 0 0 7709 c6cda479-5fde-49a0-a079-7c960022baff.host.control-plane.oxide.test
./target/debug/dnsadm -a "[::1]:45901" add-srv control-plane.oxide.test _cockroach._tcp 0 0 7710 ac33791c-62c6-43b0-bcbd-b15e7727b533.host.control-plane.oxide.test
./target/debug/dnsadm -a "[::1]:45901" add-srv control-plane.oxide.test _cockroach._tcp 0 0 7711 6eb5fbb1-fa70-4ee9-aabf-53c450e138f7.host.control-plane.oxide.test
# AAAA records
./target/debug/dnsadm -a "[::1]:45901" add-aaaa control-plane.oxide.test ac33791c-62c6-43b0-bcbd-b15e7727b533.host ::1
./target/debug/dnsadm -a "[::1]:45901" add-aaaa control-plane.oxide.test 6eb5fbb1-fa70-4ee9-aabf-53c450e138f7.host ::1
./target/debug/dnsadm -a "[::1]:45901" add-aaaa control-plane.oxide.test c6cda479-5fde-49a0-a079-7c960022baff.host ::1
Next, I added my nameserver running on localhost to
To check that DNS is up and running, I used dig:
Which looks like I'd expect -- I'm seeing those 7709 - 7711 ports in the SRV records, and a bunch of references to
I can use the Cockroach shell to connect directly to a node via IP:
However, using a hostname appears to be hitting issues:
|
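To make the record layout above concrete: a client that wants to use these records has to do two lookups, SRV first (which yields a port and a target hostname) and then AAAA on each target. The sketch below walks through that with in-memory tables standing in for real DNS queries. The tables copy the records added with `dnsadm` above; the fully qualified SRV name `_cockroach._tcp.control-plane.oxide.test` and the function name `resolve_service` are assumptions of this sketch.

```python
ZONE = "control-plane.oxide.test"

# In-memory stand-ins for the records added above (not real DNS queries).
SRV = {
    f"_cockroach._tcp.{ZONE}": [
        (7709, f"c6cda479-5fde-49a0-a079-7c960022baff.host.{ZONE}"),
        (7710, f"ac33791c-62c6-43b0-bcbd-b15e7727b533.host.{ZONE}"),
        (7711, f"6eb5fbb1-fa70-4ee9-aabf-53c450e138f7.host.{ZONE}"),
    ],
}
AAAA = {target: ["::1"] for _, target in SRV[f"_cockroach._tcp.{ZONE}"]}


def resolve_service(name, srv=SRV, aaaa=AAAA):
    """Expand a service name into (address, port) pairs via SRV, then AAAA."""
    backends = []
    for port, target in srv.get(name, []):
        # The port comes from the SRV record; the address from the AAAA record.
        for addr in aaaa.get(target, []):
            backends.append((addr, port))
    return backends
```

A client that only performs the second step (hostname to address) has no way to learn the per-node port, which is the crux of the hostname experiments in this comment.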
Resolving using SRV records doesn't work with the cli (cockroachdb/cockroach#64439). But you should be able to have the nodes discover each other via DNS without explicitly listing them out for --join (though that's perhaps behind another flag --experimental-srv-dns depending on the version?) But afaik that's just limited to the initial bootstrapping and unsure how it deals with the set changing at runtime. |
Thanks for the pointer, I'll look into this flag! To be clear, that would be for CockroachDB nodes to connect to each other using DNS, right? Just being clear that it's distinct from any attempts by e.g. Nexus to use a libpq client to connect to CockroachDB |
Correct
|
Here's my follow-up -- I'm digging into
It appears the
I'm pretty sure I'm bottoming out here, because this matches the error I was seeing:
Under the hood, this appears to be calling getaddrinfo. I believe this is only compatible with A/AAAA records, and cannot properly parse SRV records -- this appears to match my experiments locally, where I spun up a small C Program, and could read from the "node name" of
Out of curiosity, I dug into
This calls into https://docs.rs/tokio/latest/tokio/net/fn.lookup_host.html , which also appears (through local testing) to "work with AAAA records when you also supply the port, but not with SRV records". This pretty much matches the |
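A small illustration of the getaddrinfo shape discussed above, using Python's thin wrapper over the same libc call. The caller supplies both a node name and a port, and gets back socket addresses carrying that same port; nothing in the interface lets a port be discovered from DNS, which is consistent with the observation that SRV records are out of its reach. The helper name `resolve` is hypothetical.

```python
import socket


def resolve(host, port):
    """Resolve `host` to (address, port) pairs the way a libc client would."""
    infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    # Each entry is (family, type, proto, canonname, sockaddr). The port in
    # sockaddr is the one passed in by the caller, never one learned from DNS.
    return [info[4][:2] for info in infos]
```

Calling `resolve("::1", 7709)` or `resolve(some_hostname, 7709)` exercises the A/AAAA path; there is no argument through which an `_cockroach._tcp` style service lookup could be requested.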
I'm kinda getting the sense we have two options for a path forward here:
There are some drawbacks to this approach that we would need to work through:
|
Right, last I looked it involved changes across multiple crates: We create an |
I'll see if I can modify Diesel and async-bb8-diesel to make that URL mutable. That seems like it would help us form a foundation to modify things from nexus, if we are unable to rely on libpq to act as a DNS client on our behalf. |
Okay, seems possible to let the database URL be mutable from Nexus. That would at least let Nexus update the "set of valid CRDB nodes" with the latest info it knows about. See: oxidecomputer/async-bb8-diesel#62 , diesel-rs/diesel#4005 |
Nice! One thing though is that with that we'd need to explicitly call that for every Nexus every time the set of CRDB nodes changes. Compared to changing |
Totally, but this was my takeaway from reading libpq's source: someone on the client-side needs to make a decision to go query DNS, get a new set of nodes, and update the set of IPs we end up talking to. If libpq handled this for us, that would be nice, but if it doesn't, I think it ends up being functionally somewhat similar for Nexus to take this responsibility too. Arguably, I think Nexus could be more optimal here, since it can perform this action as a downstream operation of an RPW to avoid unnecessary queries to DNS until the set is known to have changed. |
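A rough sketch of the "Nexus takes responsibility" idea above: something in Nexus holds the current backend set, and a downstream step replaces it only when the set is known to have changed. Everything here (`BackendSet`, `update`, `generation`, the URL defaults) is hypothetical and is not the API of Diesel, async-bb8-diesel, or Nexus.

```python
class BackendSet:
    """Hypothetical holder for the CRDB addresses a pool should use."""

    def __init__(self, backends):
        self._backends = frozenset(backends)
        # Bumped on every real change so consumers can detect staleness.
        self.generation = 0

    def update(self, latest):
        """Install the latest set; return True only if it actually changed."""
        latest = frozenset(latest)
        if latest == self._backends:
            return False
        self._backends = latest
        self.generation += 1
        return True

    def url(self):
        """Render the current set as a multi-host URL (IPv6 backends assumed)."""
        hosts = ",".join(f"[{addr}]:{port}" for addr, port in sorted(self._backends))
        return f"postgresql://root@{hosts}/omicron?sslmode=disable"
```

The point of returning a boolean from `update` is that the caller can skip rebuilding connections when nothing changed, which is the "avoid unnecessary queries until the set is known to have changed" argument made above.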
This use case is really similar to the case of managing HTTP clients for our internal services: there's DNS resolution that ideally would be decoupled from connection establishment, and connection establishment that would be decoupled from usage. This is another case where we'll want a more sophisticated connection management component, which likely means building our own pool to better control the behavior here. We basically decided in the update call today to pursue this, so the rest of this might be moot. But in case it's useful, here's responding to a few things above.
This is a good summary. The CockroachDB part (in the start method) was explicitly designed this way in order to stay in sync with the latest cluster topology and I'd be more surprised if it didn't work! But we've known from the start that there was at least some work to be done to make the pool survive database failures better, and likely we were going to have to do our own pool.
For what it's worth, I've almost never seen non-application-specific components do SRV lookups.
I think we've basically solved this part already (see above). We considered
It makes sense to me that the DNS resolution and even TCP connection establishment behavior would ultimately be application-specific and would happen outside libpq. While it's no doubt convenient that clients like |
In my most recent testing, the behavior for CRDB connection failures has changed (or might be expected?). When the crdb node in use has gone down, I saw messages like this in nexus logs:
(Note: The crdb instance in use is not necessarily the first one listed; in my case, disabling the 105 instance is what triggered the error.) Nexus was able to continue serving both read and update API requests in spite of the above error. It's possible that any ongoing requests against the crdb node right at the moment it's going down could have failed but all requests I made afterwards succeeded. @davepacheco took a look at the code to understand what changed the behavior. Here are his comments:
|
Just to be explicit: #5876 will fix this issue for CockroachDB -- in that PR, we use https://github.com/oxidecomputer/qorb to access each CRDB node individually, and create a pool of connections for each one.
|
Replaces all usage of bb8 with a new connection pooling library called [qorb](https://github.com/oxidecomputer/qorb). qorb, detailed in RFD 477, provides the following benefits over bb8:
- It allows lookup of multiple backends via DNS SRV records
- It dynamically adjusts the number of connections to each backend based on their health, and prioritizes vending out connections to healthy backends
- It should be re-usable for both our database and progenitor clients (using a different "backend connector", but the same core library and DNS resolution mechanism).
Fixes #4192
Part of #3763 (fixes CRDB portion)
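As a toy illustration of "adjust the number of connections per backend based on health", the sketch below spreads a fixed budget of connection slots across whichever backends are currently healthy. This is not qorb's algorithm or API (see RFD 477 and the qorb repository for the real policy); `allocate_slots` and its even-split rule are assumptions made purely for illustration.

```python
def allocate_slots(health, total):
    """Spread `total` connection slots evenly over the healthy backends.

    `health` maps backend name -> bool. Unhealthy backends get zero slots.
    """
    healthy = sorted(name for name, ok in health.items() if ok)
    slots = {name: 0 for name in health}
    if not healthy:
        return slots
    base, extra = divmod(total, len(healthy))
    for i, name in enumerate(healthy):
        # The first `extra` healthy backends absorb the remainder.
        slots[name] = base + (1 if i < extra else 0)
    return slots
```

Re-running the allocation whenever a health check flips is what lets a pool of this kind drain a failed backend instead of holding on to it.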
The meat of this PR is the change in implementation of `get_pantry_address`: instead of asking our internal DNS resolver to look up a crucible pantry (which does not randomize, so in practice we always get whichever pantry the DNS server listed first), we ask a Qorb connection pool for the address of a healthy client. `get_pantry_address` itself does not use the client directly and only cares about its address, but the pool does keep a client around so that it can call `pantry_status()` as a health check. (It doesn't look at the contents of the result; only whether or not the request succeeded - @jmpesp if that should be more refined, please say so.) This partially addresses #3763; once this lands, if a pantry is down or unhealthy but still present in DNS (i.e., not expunged), Qorb + the status health checks should mean we'll pick a different pantry for new operations, instead of the current behavior of always sticking to the first pantry in DNS. --------- Co-authored-by: Sean Klein <[email protected]>
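The change in selection behavior described above can be sketched as: instead of taking the first pantry DNS returns, filter by a health check that only cares whether a status request succeeded, then pick among the survivors. The names `is_up` and `pick_pantry`, the `status` callable, and the random choice among healthy pantries are all assumptions of this sketch, not the Qorb or Nexus implementation.

```python
import random


def is_up(status, addr):
    """Health check: only whether the status request succeeded matters."""
    try:
        status(addr)  # the contents of the result are deliberately ignored
    except OSError:
        return False
    return True


def pick_pantry(addresses, status, rng=random):
    """Return the address of some pantry that passes the health check."""
    healthy = [addr for addr in addresses if is_up(status, addr)]
    if not healthy:
        raise LookupError("no healthy pantry available")
    return rng.choice(healthy)
```

With this shape, a pantry that is down but still present in DNS simply drops out of `healthy`, so new operations land elsewhere.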
…es (#6836) This is a much smaller change than the diff stat implies; most of the changes are expectorate outputs because the example system we set up for tests now includes Crucible pantry zones, which shifted a bunch of other zone UUIDs. Fully supporting Crucible pantry replacement depends on #3763, which I'm continuing to work on. But the reconfigurator side of "start new pantries" is about as trivial as things go and does not depend on #3763, hence this PR.
I believe this is now fixed for the pantry too:
Given CRDB and the pantry are both addressed, I'm going to close this. If we run into stale handles to services now that the qorb integration is done, we should file new issues. |
While testing service failover on rack2, I noticed that nexus held on to the same cockroachdb backend without attempting to use any of the other nodes in the 5-node database cluster, causing requests to fail until the one it favored came back.
The same happened with pantry requests for disk import blocks / bulk writes. I haven't got to the point of seeing the TTL being exhausted. I tried waiting for up to 5 minutes and the request still couldn't succeed until the pantry zone Nexus has been using prior to its outage came up again.