ZBUG-469: Apply patch to zmcontrol for VM resets caused by LDAP server hangs. #35
base: develop
Conversation
LGTM
LGTM
I can't help but feel that the troubleshooting for this (as described) is incomplete and that we're possibly addressing only a portion of the problem, or maybe even just a symptom of it, while slightly changing the semantics of zmcontrol in the process (which may or may not be OK).

Based on the original problem description, a customer on vSphere using libexec/vmware-heartbeat to monitor basic host health found all of their VMs restarted because the monitoring failed to handle an LDAP master being down. Looking at vmware-heartbeat, we can see that a call to isZcsHealthy() leads to a zmcontrol status call, and unless zmcontrol exits with a zero (success) return code, isZcsHealthy() returns false. So the "fix" here attempts to make zmcontrol more likely to return success - reasonable enough, but I want to talk through this...

What we don't know for sure is why zmcontrol status did not return success. There are a few variables potentially at play here; see https://wiki.zimbra.com/wiki/VMware_HA_script_in_Zimbra_Collaboration for details. Consider this: if the vSphere HA failure interval was set to 30 seconds, a slow zmcontrol status could cause a restart regardless of LDAP state. zmcontrol has its own internal timer of 180 seconds, and in addition the Net::LDAP new() call uses a timeout of 30 seconds, so a slow LDAP server could also lead to problems.

The proposed change here does a few things:
I think we need to address issue #1 above: is it OK to avoid the ldap_master_url list? Was it used on purpose to try to ensure we're dealing with the most up-to-date LDAP server (in case LDAP replication is slow)? Perhaps this is a moot point now, as we tend towards multi-master replication with the expectation that the masters are very nearly in sync, so $ldap_master_url and $ldap_url are perhaps identical. One option would be to use the combination of the two, with any duplicates removed (a rough sketch follows below). Perhaps the most important change being proposed is #2 above, but we need to address #3 to do this fix properly.
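For illustration only, a minimal sketch of the "combine the two lists with duplicates removed" option together with the exit-code contract that vmware-heartbeat relies on. This is not the actual zmcontrol code; the URL values and the check_ldap helper are made-up placeholders.

```perl
#!/usr/bin/perl
# Sketch only: the URL values and check_ldap are assumptions, not the
# actual zmcontrol / vmware-heartbeat code.
use strict;
use warnings;

# Assume both localconfig values are space-separated lists of LDAP URLs.
my $ldap_master_url = "ldap://master1:389 ldap://master2:389";
my $ldap_url        = "ldap://master2:389 ldap://replica1:389";

# Combine the two lists, dropping duplicates while preserving order, so the
# status check is not pinned to the master list alone.
my %seen;
my @ldap_hosts = grep { !$seen{$_}++ }
                 (split(/\s+/, $ldap_master_url), split(/\s+/, $ldap_url));

# Hypothetical stand-in for whatever check zmcontrol ultimately performs.
sub check_ldap { my ($hosts) = @_; return scalar(@{$hosts}) > 0; }

# The contract vmware-heartbeat depends on is the exit status:
# 0 means healthy, anything else makes the VM a restart candidate.
exit(check_ldap(\@ldap_hosts) ? 0 : 1);
```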
Left a long comment with my concerns.
See https://metacpan.org/pod/distribution/perl-ldap/lib/Net/LDAP.pod#CONSTRUCTOR for a description of HOST usage.
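A hedged example of that HOST usage (the hostnames and the 30-second value are placeholders, not values from this PR): Net::LDAP->new() accepts a reference to an array of URIs and tries each one in order until a connection succeeds, with the timeout option bounding each connection attempt. That is the behavior a multi-host status check would rely on.

```perl
use strict;
use warnings;
use Net::LDAP;

# HOST may be an array reference of URIs; each is tried in order until a
# connection is made, and 'timeout' bounds each connection attempt.
# Hostnames and timeout below are placeholders.
my @ldap_hosts = qw(ldap://master1.example.com:389 ldap://replica1.example.com:389);

my $ldap = Net::LDAP->new(\@ldap_hosts, timeout => 30)
    or die "could not reach any LDAP server: $@";

$ldap->unbind;
```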
Hello @bsinger, can you please address the review comments?
Copied from https://bugzilla.zimbra.com/show_bug.cgi?id=107769