Quick recovery after cluster member failure, how? #272
After some more testing today, on a second test run this just worked. I found an old forum post from Stéphane Graber confirming that this is indeed the way to do it. I'm happy it works, but I'm at a loss as to why it didn't work the first time. Yesterday I ran `lxc monitor` to see what was happening: the start operation was processed but appeared to be left in a "pending" state. What LXD was waiting for, and why, I can't tell. I suspect it was my environment, probably networking-related, as I've been having problems with that on the lxdvm1 instance - so my messy test environment is most likely to blame here. Anyway, this works - for now. I'll update here if I hit this particular issue again, as I'll be doing a lot of testing over the coming week or so for various scenarios. If I see nothing further I'll make sure to close this.
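If I remember the forum post correctly, the idea is roughly the following (a hedged sketch only - the container name `c1` and target member `lxdvm2` are placeholders from my sandbox, not taken from the original logs):

```shell
# Hedged sketch of the recovery steps, assuming a Ceph-backed instance.
# "c1" and "lxdvm2" are placeholder names, not from the original output.

# Reassign the instance from the dead member to a healthy one.
# With Ceph storage no data is copied - only the cluster database
# record moves, since the root disk lives in the shared pool.
lxc move c1 --target lxdvm2

# Then start it on its new home.
lxc start c1
```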
Testing MicroCloud further here, I wanted to simulate a catastrophic cluster member failure. That's easy enough to do in my sandbox setup, where I have MicroCloud running on 3 VMs on the same physical host.
I simply ran `lxc stop --force` on a VM which had a running LXD container on it, in order to simulate a crash.
I then - maybe naively - assumed that the whole point of clustered LXD with Ceph is that I should be able to spin up the container immediately on another cluster member. Right? However, I had a hard time finding recommended recovery steps online, so I just tried the following:
As you can see, the `lxc start` just hangs. I tried a few times but it just sits there, and `lxc info --show-log` reveals nothing useful.
Is this not how it's supposed to work? Surely quickly being able to recover from a node going down is one of the core points of all this clustering/ceph goodness, or am I just thinking about this wrong? :)
Thank you for any insights you can provide here.