ZBUG-469: Apply patch to zmcontrol for VM resets caused by LDAP server hangs. #35
base: develop
Conversation
LGTM
LGTM
I can't help but feel that the troubleshooting for this (as described) is incomplete and that we're possibly addressing only a portion of the problem, or maybe even just a symptom of it, while slightly changing the semantics of zmcontrol in the process (which may or may not be OK).

Based on the original problem description, a customer on vSphere using libexec/vmware-heartbeat to monitor basic host health found all of their VMs restarted because the monitoring failed to handle an LDAP master being down. Looking at vmware-heartbeat, we can see that a call to isZcsHealthy() leads to a zmcontrol status call, and unless zmcontrol exits with a zero (success) return code, isZcsHealthy() returns false. So the "fix" here attempts to make zmcontrol more likely to return success - reasonable enough, but I want to talk through this...

What we don't know for sure is why zmcontrol status did not return success. There are a few variables potentially at play here; see https://wiki.zimbra.com/wiki/VMware_HA_script_in_Zimbra_Collaboration for details. Consider this: if the vSphere HA failure interval was set to 30 seconds, a slow zmcontrol status could cause a restart regardless of LDAP state. zmcontrol has its own internal timer of 180 seconds, and in addition the Net::LDAP new() call uses a timeout of 30 seconds, so a slow LDAP server could also lead to problems.

The proposed change here does a few things:
I think we need to address issue #1 above: is it OK to avoid the ldap_master_url list? Was it used on purpose to try to ensure we're dealing with the most up-to-date LDAP server (in case LDAP replication is slow)? Perhaps this is a moot point now, as we tend towards multi-master replication with the expectation that the masters are very nearly in sync, so $ldap_master_url and $ldap_url are perhaps identical. One option would be to use the combination of the two, with any duplicates removed (a rough sketch follows below). Perhaps the most important change being proposed is #2 above, but we need to address #3 to do this fix properly.
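For illustration only, a minimal sketch of the "combine the two lists with duplicates removed" option together with the exit-code contract that vmware-heartbeat relies on. This is not the actual zmcontrol code; the URL values and the check_ldap helper are made-up placeholders.

```perl
#!/usr/bin/perl
# Sketch only: the URL values and check_ldap are assumptions, not the
# actual zmcontrol / vmware-heartbeat code.
use strict;
use warnings;

# Assume both localconfig values are space-separated lists of LDAP URLs.
my $ldap_master_url = "ldap://master1:389 ldap://master2:389";
my $ldap_url        = "ldap://master2:389 ldap://replica1:389";

# Combine the two lists, dropping duplicates while preserving order, so the
# status check is not pinned to the master list alone.
my %seen;
my @ldap_hosts = grep { !$seen{$_}++ }
                 (split(/\s+/, $ldap_master_url), split(/\s+/, $ldap_url));

# Hypothetical stand-in for whatever check zmcontrol ultimately performs.
sub check_ldap { my ($hosts) = @_; return scalar(@{$hosts}) > 0; }

# The contract vmware-heartbeat depends on is the exit status:
# 0 means healthy, anything else makes the VM a restart candidate.
exit(check_ldap(\@ldap_hosts) ? 0 : 1);
```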
Left a long comment with my concerns.
See https://metacpan.org/pod/distribution/perl-ldap/lib/Net/LDAP.pod#CONSTRUCTOR for a description of HOST usage.
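A hedged example of that HOST usage (the hostnames and the 30-second value are placeholders, not values from this PR): Net::LDAP->new() accepts a reference to an array of URIs and tries each one in order until a connection succeeds, with the timeout option bounding each connection attempt. That is the behavior a multi-host status check would rely on.

```perl
use strict;
use warnings;
use Net::LDAP;

# HOST may be an array reference of URIs; each is tried in order until a
# connection is made, and 'timeout' bounds each connection attempt.
# Hostnames and timeout below are placeholders.
my @ldap_hosts = qw(ldap://master1.example.com:389 ldap://replica1.example.com:389);

my $ldap = Net::LDAP->new(\@ldap_hosts, timeout => 30)
    or die "could not reach any LDAP server: $@";

$ldap->unbind;
```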
Hello @bsinger, can you please address the review comments?
Copied from https://bugzilla.zimbra.com/show_bug.cgi?id=107769