Enhancement request for Azure Fence Agent (kdump feature) #509

grantmarcroft · 2022-11-04T23:44:10Z

In the Azure cloud, a production system outage often requires operations teams to contact the cluster software vendor for a root cause analysis of the unexpected reboot. stonith:external/sbd has a unique "crash" feature that triggers a kdump on an unhealthy node, making a deeper failure analysis possible. The Azure Fence Agent has no such feature.

What makes this feature possible from the platform perspective is the ability to trigger an NMI, which covers the case of the OS being too unresponsive to handle magic sysrq.

https://learn.microsoft.com/en-us/troubleshoot/azure/virtual-machines/serial-console-nmi-sysrq#non-maskable-interrupt-nmi

Whether the kdump succeeds or fails due to a hypervisor issue, fence_azure_arm should eventually deallocate the node as per its current function.
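
To make the requested behaviour concrete, here is a minimal sketch of the flow described above: send an NMI, give kdump a bounded amount of time, then deallocate regardless of the outcome. This is not code from fence_azure_arm. The three callables (`send_nmi`, `dump_finished`, `deallocate`) and both timing parameters are hypothetical placeholders for whatever the agent and the Azure API would actually provide.

```python
import time
from typing import Callable


def fence_with_crash_dump(
    send_nmi: Callable[[], bool],       # hypothetical: ask the platform to inject an NMI
    dump_finished: Callable[[], bool],  # hypothetical: has the node finished its kdump?
    deallocate: Callable[[], bool],     # hypothetical: the agent's existing "off" action
    dump_timeout: float = 300.0,        # assumed value, not a real agent default
    poll_interval: float = 5.0,         # assumed value, not a real agent default
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Try to crash-dump the node via NMI, then always deallocate it."""
    try:
        nmi_sent = send_nmi()
    except Exception:
        # A platform or hypervisor error must not block fencing.
        nmi_sent = False

    if nmi_sent:
        # Wait for the dump, but never longer than dump_timeout.
        deadline = clock() + dump_timeout
        while clock() < deadline:
            if dump_finished():
                break
            sleep(poll_interval)

    # The fencing result depends only on the deallocation step.
    return deallocate()
```

The point of the design is that the kdump attempt is best effort: a failed or hung dump only costs time, and the return value reflects the deallocation alone.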

At the time of writing, some Azure users use a fence topology that first attempts stonith:external/sbd fencing with "crash" and then falls back to stonith:fence_azure_arm in case SBD (or kdump itself) fails.

The additional cost of SBD storage device(s) on the platform could be eliminated with this feature.

Before this is recommended by someone else:
stonith:fence_kdump doesn't kdump a node. It is a reactive fence agent that pauses STONITH long enough to collect a kdump before the "shutoff switch is flipped" in the event of a kernel panic.
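
To illustrate what "reactive" means here, the following is a simplified sketch of the waiting side: it only listens for a notification from a node that is already dumping and never causes the crash itself. It is not the fence_kdump implementation. The function name is hypothetical, the caller chooses the socket and port, and the real agent does more validation of the message than this sketch, which accepts any datagram from the expected address.

```python
import socket
import time


def wait_for_kdump_notice(sock: socket.socket, node_ip: str, timeout: float) -> bool:
    """Return True if a datagram from node_ip arrives on sock before the timeout."""
    deadline = time.monotonic() + timeout
    while True:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            return False
        sock.settimeout(remaining)
        try:
            # For an IPv4 UDP socket the address is a (host, port) pair.
            _data, (sender_ip, _port) = sock.recvfrom(1024)
        except socket.timeout:
            return False
        if sender_ip == node_ip:
            # The node reports that it is in its crash kernel.
            return True
        # Datagram from some other host: ignore it and keep waiting.
```

If the node hangs without panicking, no notice is ever sent and the function times out, which is exactly the gap the NMI request is meant to close.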

oalbrigt · 2022-11-07T09:32:07Z

Do you have a link to the sbd agent? I think it's SUSE-specific, so I don't know where to find its source.

grantmarcroft · 2022-11-07T19:51:59Z

Hello Oyvind. Here it is: https://github.com/ClusterLabs/fence-agents/blob/main/agents/sbd/fence_sbd.py
SBD source here: https://github.com/ClusterLabs/sbd
And manual page describing crash functionality here: https://github.com/ClusterLabs/sbd/blob/main/man/sbd.8.pod.in
- Grant

wenningerk · 2022-11-08T10:24:47Z

hmm ... I know working on the sbd-setup wasn't actually what you were asking for but
maybe discussion helps getting ahead somehow.
Do you have details of this sbd topology setup mentioned?
Having sbd on one level and simply fence_azure_arm on the next sounds a bit
dangerous to me ... especially as you are mentioning backup mechanism.
The fence agent can only verify that writing the poison-pill to the device went ok.
The fence-target has to ensure by itself that it is either able to read the poison-pill within
a timeout or suicide reliably if it can't.
What I could imagine would be poison-pill & fence_kdump on one level and
fence_azure_arm as backup. That should reliably check if crashing the node had
worked - even without a watchdog-device that is considered as reliable enough for
sbd. (Are we talking of azure-bare-metal with a supported hardware-watchdog
or some setup with softdog that might not be supported together with sbd depending
on the distro?)
Alternatively to using poison pill I could imagine an sbd-configuration without
disks but without telling pacemaker that sbd is there (stonith-watchdog-timeout = 0
or banning fence_watchdog from all nodes for newer pacemaker that supports
making the hidden fence_watchdog - that always had been there with
watchdog-fencing - visible as an explicit fencing-resource).
A topology of fence_kdump on one level and fence_azure_arm on the next
should then give the target-node enough time to suicide with a kdump + verify
that this worked or fail if it didn't and fall through to azure-fencing.
Haven't tried either of those - just ideas ...
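
The two-level idea above (fence_kdump on one level, fence_azure_arm on the next) relies on how a fencing topology is evaluated: levels are tried in ascending order, every device in a level has to succeed, and a failure falls through to the next level. A minimal model of that rule, with hypothetical stand-in functions in place of the real agents:

```python
from typing import Callable, Dict, Sequence

# A fence device is modelled as a function: node name -> success flag.
FenceDevice = Callable[[str], bool]


def fence_by_topology(node: str, levels: Dict[int, Sequence[FenceDevice]]) -> bool:
    """Try each level in ascending order; a level succeeds only if all its devices do."""
    for index in sorted(levels):
        # all() stops at the first failing device in the level.
        if all(device(node) for device in levels[index]):
            return True
    return False


# Hypothetical stand-ins for the two agents discussed in this thread.
def kdump_notice_seen(node: str) -> bool:
    return False  # pretend the node never reported a running kdump


def azure_deallocate(node: str) -> bool:
    return True  # pretend the platform deallocated the VM


fenced = fence_by_topology("node1", {1: [kdump_notice_seen], 2: [azure_deallocate]})
```

In this run level 1 fails, so the model falls through to level 2 and `fenced` ends up `True`.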

oalbrigt self-assigned this Nov 7, 2022
