Deleted instances remain as running in CRDB after propolis zone is already removed #3207
Comments
The instance state transitions above and in the logs look more or less correct to me, although I'm similarly at a loss as to why the … It's currently expected that an …

The saga execution log for the saga that tried to re-create the instance, on the other hand, has this suspicious-looking entry:
This suggests the already-extant object is a disk named …

@askfongjojo What mechanism did you use to try to re-create the instance? (Nexus doesn't appear to log the bodies of incoming HTTP requests, and I can't quite work out what value the "disks" parameter to the create request had in this case.) The instance creation API allows callers to specify whether they want to create a new disk or merely attach an existing one (in the … |
Yeah, this can be confusing as I gave the disk the same name as the instance. The script I used creates both the disk and the instance, and I just reran it as-is, expecting the disk creation to fail and the instance creation to succeed. This is something I've done many times over the last week with a loop count of 3; the only difference this time is that the count is 12:
The time_deleted is NULL for this instance, so the saga has strangely declared itself done without updating the database, or actually updated the instance state back to …
The real problem is that there is no way to stop the instance to make any changes. |
Ah, I realize my bug description was unclear. I've updated it to describe the split-brain issue. |
Thanks for the extra details! I also realized just now that I missed the Nexus log lines that show that there was an instance-already-exists conflict when trying to re-create …

CRDB's apparent behavior here scares me. The instance record in your comment here has a …

If I'm understanding things right, that means we've got two separate cases where Nexus recorded that it did something (wrote 'stopping'/'stopped' states for the instance, executed an instance delete saga) that should have been visible in CRDB, and two separate lines of observation (manual queries of CRDB, Nexus thinking the instance isn't deleted) indicating that those writes aren't visible to subsequent reads. I can't imagine anything is going to work if CRDB isn't persisting things Nexus thinks it's persisting.

If all this is correct, I think the next step here would be to see if the CRDB zone has logs or other context indicating that it's having a problem. |
I have functionally zero CockroachDB expertise but do see the following unusual events in its logs from around the time the instance re-creation attempt failed:
To be clear, I have no idea what any of this really means or what was being waited on. But it looks like this is one of only two time ranges in the logs where "slow range RPC" errors occurred (the other was around 21:01:43Z), and the 320-second interval preceding these messages would have included the times where we tried to stop and delete these instances. |
Yes, the saga table recorded nothing for a whole 6-minute window during which there were many orchestration activities:
The CockroachDB SMF service log showed that it started again at 22:35:41, though there wasn't anything indicating why. The timestamps are very confusing, as if it uses the local time zone in one place and UTC in another.
There is some matching info from another log file:
|
I spent this afternoon looking at this without too much to show for it. I'll summarize what I did find below. Since I'm about out of time and I believe this system will be shut down tonight for other reasons, I figured I'd collect what I could. Here are log files from Nexus and CockroachDB, plus a …
These were collected on gc19 (BRM42220017) using:
and
I also tried to collect the whole Cockroach store directory. I can't tell if it worked:
I suspect we got the main logs but not the ones that were called out. Anyway, that's here now:
Finally, the shell history for the CockroachDB zone is there now too:
|
I confirmed the observations that @askfongjojo and @gjcolombo reported above:
I have also checked:
But there's also a bunch of weirdness here:
I also discovered that bb8 has an error_sink that we could be using to log errors. This might help us understand how Nexus responded to the CockroachDB restart.
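For reference, a rough sketch of what wiring that up could look like. The ErrorSink trait shape here is from my reading of the bb8 docs and may not match the version we pin, and the builder wiring at the bottom is hypothetical; stderr stands in for whatever slog plumbing we'd actually want:

```rust
use bb8::ErrorSink;

/// Sketch: an error sink that logs the errors bb8 encounters while
/// managing connections.
#[derive(Debug, Clone, Copy)]
struct LoggingErrorSink;

impl<E: std::fmt::Display + 'static> ErrorSink<E> for LoggingErrorSink {
    fn sink(&self, error: E) {
        // In Nexus this would presumably go through slog instead of stderr.
        eprintln!("database connection error: {}", error);
    }

    fn boxed_clone(&self) -> Box<dyn ErrorSink<E>> {
        Box::new(*self)
    }
}

// Hypothetical wiring, for whatever connection manager the pool is built on:
//
//     let pool = bb8::Pool::builder()
//         .error_sink(Box::new(LoggingErrorSink))
//         .build(manager)
//         .await?;
```
|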
Tried again with tar's … |
On some level, there are basically three possibilities:
|
The recent discussions around async cancellation got me wondering if that behavior could cause something like what we're seeing here. Diesel itself is sync, so if there's a cancellation problem on the database side of Nexus, it'd presumably have to be in async-bb8-diesel. I did find something a little fishy in …
I didn't see anything in the code path that returns a PooledConnection to the pool that terminates any open transactions (or resets any other connection state, for that matter). At this point this is all a guess based on reading the code. I don't see what would prevent it from happening. Could it have actually happened here? To simplify things, what would have to happen is:
I don't see any reason yet why this couldn't have happened. The next question is whether this would leave any evidence we could look for. I will go back through the data I collected above and see if any of it is inconsistent with this explanation.
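To make the shape of that hazard concrete, here's a self-contained sketch that uses mock types rather than the real bb8/async-bb8-diesel code (the names are illustrative, not the actual APIs): if the future driving a transaction is dropped between BEGIN and COMMIT, nothing rolls it back before the connection goes back to the pool.

```rust
use std::sync::{Arc, Mutex};
use std::time::Duration;

/// Toy stand-in for a database connection; `in_transaction` models the
/// server-side transaction state left open on the wire.
#[derive(Default)]
struct Conn {
    in_transaction: bool,
}

/// Toy "pool": one connection behind a mutex. Releasing the guard stands in
/// for a pooled connection being handed back to the pool when it's dropped.
type Pool = Arc<Mutex<Conn>>;

/// Run `body` inside a transaction the way a naive async wrapper over a sync
/// connection might: BEGIN, await the body, COMMIT. Nothing here runs if the
/// future is dropped between BEGIN and COMMIT.
async fn transaction<F: std::future::Future<Output = ()>>(pool: &Pool, body: F) {
    pool.lock().unwrap().in_transaction = true; // BEGIN
    body.await; // <- cancellation point: dropping the future here skips COMMIT
    pool.lock().unwrap().in_transaction = false; // COMMIT
}

#[tokio::main]
async fn main() {
    let pool: Pool = Arc::new(Mutex::new(Conn::default()));

    // A "request handler" whose future is cancelled (e.g. by a client
    // timeout) while parked at an await point inside the transaction body.
    let fut = transaction(&pool, async {
        tokio::time::sleep(Duration::from_secs(60)).await;
    });
    let _ = tokio::time::timeout(Duration::from_millis(10), fut).await;

    // The connection went back to the pool with the transaction still open:
    // whatever the next borrower writes lands inside a transaction that may
    // never be committed, even though every statement appears to succeed.
    assert!(pool.lock().unwrap().in_transaction);
    println!("connection returned to the pool mid-transaction");
}
```
|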
The only thing I've found above that might be inconsistent with this explanation was that I wrote:
but it's not impossible. I was investigating many hours after this had happened. We hadn't done #3232 yet, so we'd have no log entry if connections had been closed and re-established for whatever reason. I also don't think we'd expect any log entries on cancellation, at least not in async-bb8-diesel or Dropshot. (We should add them in Dropshot at least, but after oxidecomputer/dropshot#701 this can't happen in Nexus anyway.) |
I have written a much simpler program using Dropshot, diesel, and async-bb8-diesel to reproduce the failure mode I described above. It's exactly as feared: it leads to behavior that looks a lot like what we saw here. The program behaves in every way as though the data was written to the database, but if you inspect the database, the writes never made it there. I've filed oxidecomputer/async-bb8-diesel#47 for this. Talking it over with @ahl, we realized that there should be some evidence of this: there must have been some Dropshot request that was cancelled in … |
I think this might have happened rather a lot of times, so many that I doubted my methodology, but I think this might actually make sense. I unpacked the Nexus log files we saved earlier and summarized the request ids from "authorize result" log messages:
Then I dumped request ids from "request completed" messages:
Then I used …
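The comparison boils down to a set difference: request ids that appear in the "authorize result" list but never in the "request completed" list. A standalone sketch of the same idea, with placeholder file names rather than the actual files or tooling I used:

```rust
use std::collections::HashSet;
use std::fs;

fn main() -> std::io::Result<()> {
    // One request id per line in each file (placeholder names).
    let authorized: HashSet<String> = fs::read_to_string("authorized-req-ids.txt")?
        .lines()
        .map(|s| s.to_string())
        .collect();
    let completed: HashSet<String> = fs::read_to_string("completed-req-ids.txt")?
        .lines()
        .map(|s| s.to_string())
        .collect();

    // Requests that were authorized but never logged a completion are the
    // ones that plausibly got cancelled partway through.
    let mut never_completed: Vec<_> = authorized.difference(&completed).collect();
    never_completed.sort();
    println!("{} request ids never completed", never_completed.len());
    for id in never_completed {
        println!("{}", id);
    }
    Ok(())
}
```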
That's way more than I expected. But these are mostly pretty long-running operations like creating and destroying Instances, Snapshots, etc. @ahl points out that there was a period where the CLI timeout was generally too short. It seems like that could cause this problem if any of those things were using …
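For context on how a too-short client timeout could feed into this: if the server abandons the handler's future when the client goes away (which, per my earlier comment, was possible before oxidecomputer/dropshot#701), the handler gets dropped at whatever await point it happens to be parked on. A toy illustration using plain tokio, not Dropshot itself:

```rust
use std::time::Duration;
use tokio::time::{sleep, timeout};

// Stand-in for a request handler with an await point in the middle of its
// work (think: a database round-trip).
async fn handler() -> &'static str {
    sleep(Duration::from_secs(5)).await;
    "201 Created"
}

#[tokio::main]
async fn main() {
    // Model a client that gives up after 100ms. Here the timeout drops the
    // handler future directly; in the real system the cancellation would come
    // from the server abandoning the handler when the connection closes. The
    // work the handler had already started is neither finished nor undone.
    match timeout(Duration::from_millis(100), handler()).await {
        Ok(reply) => println!("client saw: {}", reply),
        Err(_) => println!("client gave up; handler future was dropped mid-await"),
    }
}
```
|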
Okay, there's one call to … |
I can confirm that the CLI requests for create instance all hit the …

I also hit Ctrl-C at least twice to terminate some API calls. I don't know if this had the same end result as a request timing out (i.e., that the async process already had a life of its own and wouldn't cleanly cancel). |
This happened when I used a script to invoke oxide instance stop for 10 instances successively in a loop, and then delete them with the CLI in another loop. The requests came back with success responses (ignore the two destroyed instances; they will be covered in another ticket):

However, these instances are now "shell" records in the instance table; the propolis zones are already gone from the sled:
I looked into one of the instances, sysbench-mysql-7, which is located on sled gc11. It was set to stopped at 22:40:03.

The zone was removed at 22:40:08 according to the sled-agent log for gc11:
but the delete saga wasn't created until 44 seconds later, at 22:40:52, presumably because Nexus was busy handling deletion requests for the other instances:
I didn't find any saga error in the log but the saga also doesn't exist in the database:
At 22:42:40, I tried to recreate a VM with the same name, but that failed with an object already exists error.

These VMs are now ghost records and cannot be removed: they no longer have a backend, so they can't be stopped or deleted.

The complete Nexus log is located in catacomb.eng.oxide.computer:/data/staff/dogfood/may-23/.