Enhancement request for Azure Fence Agent (kdump feature) #509
Comments
Do you have a link to the sbd agent? I think it's SUSE-specific, so I don't know where to find its source.
Hello Oyvind.
Here it is:
https://github.com/ClusterLabs/fence-agents/blob/main/agents/sbd/fence_sbd.py
SBD source here:
https://github.com/ClusterLabs/sbd
And manual page describing crash functionality here:
https://github.com/ClusterLabs/sbd/blob/main/man/sbd.8.pod.in
- Grant
On Mon, Nov 07, 2022 at 01:32:18AM -0800, Oyvind Albrigtsen wrote:
Do you have a link to the sbd agent? I think it's Suse specific, so I dont know where to find it's source.
hmm ... I know working on the sbd-setup wasn't actually what you were asking for but
In the Azure cloud, a production system outage often requires operations teams to contact the cluster software vendor for a root cause analysis of the unexpected reboot. stonith:external/sbd has a unique "crash" feature that kdumps an unhealthy node, making a deeper failure analysis possible. The Azure Fence Agent has no such feature.
From the platform perspective, what makes this feature possible is the ability to trigger an NMI, which also covers the case where the OS is too unresponsive to handle a magic SysRq.
https://learn.microsoft.com/en-us/troubleshoot/azure/virtual-machines/serial-console-nmi-sysrq#non-maskable-interrupt-nmi
Whether the kdump succeeds or fails due to a hypervisor issue, fence_azure_arm should eventually deallocate the node as per its current function.
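To make the requested behaviour concrete, here is a minimal sketch of the control flow, not actual fence_azure_arm code. The callables `trigger_nmi`, `dump_finished`, and `deallocate`, as well as the timeout values, are hypothetical placeholders for whatever the agent would really use; the only point is that deallocation runs whether or not the kdump works.

```python
import time
from typing import Callable


def fence_with_crash_dump(
    trigger_nmi: Callable[[], None],      # hypothetical: ask the platform to send an NMI
    dump_finished: Callable[[], bool],    # hypothetical: True once the kdump is known to be done
    deallocate: Callable[[], None],       # hypothetical: the agent's existing fencing action
    dump_timeout: float = 300.0,          # assumed value, not taken from the agent
    poll_interval: float = 5.0,           # assumed value, not taken from the agent
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Try to collect a crash dump, then deallocate the node in every case.

    Returns True if the dump was seen to finish before the timeout.
    """
    dumped = False
    try:
        trigger_nmi()
        deadline = clock() + dump_timeout
        while clock() < deadline:
            if dump_finished():
                dumped = True
                break
            sleep(poll_interval)
    except Exception:
        # A failed NMI or status check must not block fencing.
        dumped = False
    finally:
        # Fencing itself is unconditional, as in the current agent.
        deallocate()
    return dumped
```

The `sleep` and `clock` parameters are only there so the wait loop can be exercised without real delays.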
At the time of writing, some Azure users use a fence topology that first attempts stonith:external/sbd fencing with "crash" and then falls back to stonith:fence_azure_arm in case SBD (or kdump itself) fails.
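For readers less familiar with fence topologies, the ordering described here can be modelled roughly as below. This is only an illustration of "try level 1, fall back to level 2", not Pacemaker code; the function name and the stand-in callables are made up for the example.

```python
from typing import Callable, Sequence, Tuple


def fence_via_topology(levels: Sequence[Tuple[str, Callable[[], bool]]]) -> str:
    """Try each fencing level in order and return the name of the first that succeeds."""
    for name, attempt in levels:
        try:
            if attempt():
                return name
        except Exception:
            # Treat an error in one level as a failure and move on to the next.
            continue
    raise RuntimeError("all fencing levels failed")


# Stand-ins: the SBD "crash" level fails, so the Azure agent is used as the backup.
used = fence_via_topology([
    ("external/sbd", lambda: False),
    ("fence_azure_arm", lambda: True),
])
```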
The additional cost of SBD storage device(s) on the platform could be eliminated with this feature.
Before this is recommended by someone else:
stonith:fence_kdump doesn't kdump a node. It is a reactive fence agent that pauses STONITH long enough to collect a kdump, before the "shutoff switch is flipped", when a kernel panic has already occurred.