Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add a health check for connection pools, redis being connected #256

Open
TysonAndre opened this issue Mar 3, 2020 · 3 comments
Open

Add a health check for connection pools, redis being connected #256

TysonAndre opened this issue Mar 3, 2020 · 3 comments

Comments

@TysonAndre
Copy link
Contributor

http://uniqush.org/documentation/usage.html does not have a health check. The closest is /version or /psps, but those don't establish that the connections are active.

For example, in extraordinary networking outages (outgoing traffic is unacknowledged, and no error or response is received), the available file handles for a process can be used up, and calls to tls.Dial would fail for the APNs binary protocol.

[Push][Error] .... RequestID=... Service=... Subscriber=... PushServiceProvider=... DeliveryPoint=Unknown Failed: Failed to send payload to APNS: We are unable to get a connection: x509: failed to load system roots and no roots provided; open /etc/ssl/certs/ca-certificates.crt: too many open files

The affected code:

func (cm *connManagerImpl) NewConn() (net.Conn, <-chan bool, error) {
	if cm.err != nil {
		return nil, nil, fmt.Errorf("Error initializing apnsConnManager: %v", cm.err)
	}

	tlsconn, err := tls.Dial("tcp", cm.addr, cm.conf)
	if err != nil {
		if err.Error() == "EOF" {
			err = fmt.Errorf("Certificate is probably invalid/expired: %v", err)
		}
		return nil, nil, err
	}
	closed := make(chan bool, 1) // notifies worker when reader detects socket closed. Has a buffer so adding one element is non-blocking
	go resultCollector(cm.resultChan, tlsconn, closed)
	return tlsconn, closed, nil
}

Possible checks:

  1. Is it possible to retrieve an open connection (to GCM/APNs) for all pools of a PSP (e.g. everything listed in /psps) within a given timeout (e.g. 2 seconds)?
  2. Have pushes succeeded recently for a given set of pools (this may be suitable for some use cases but not others, but is still a useful warning - If the client deliberately pauses all pushes for any reason, the health check would fail)
  3. Can golang create new file handles (unix/linux). If not, the server should be marked as unhealthy and possibly in need of a restart.

It'd also be useful to know what leaked connections, if anything (apns, gcm, http clients to uniqush (probably not), or redis). Sadly, I didn't record this, but it may be possible to reproduce artificially (e.g. change all host names and dns servers to be networking black holes)

@TysonAndre
Copy link
Contributor Author

The error in question seems to be a startup error affecting the built-in crypto/x509 libraries, at least in go 1.13.7. The failure to load the system roots is cached permanently on unix/linux. This is not used on Windows.

During a period of networking outage, this could happen if the uniqush-push process was restarted (automatically?) when there were no available file descriptors - the lack of available file descriptors probably caused the singleton system certificate pool to fail to get loaded.

Checking for a failure to establish connections to APNS would catch that.

Alternatively, trying to verify all APNs client certificates in a health check might be more reliable

// src/crypto/x509/verify.go
	// Use Windows's own verification and chain building.
	if opts.Roots == nil && runtime.GOOS == "windows" {
		return c.systemVerify(&opts)
	}

	if opts.Roots == nil {
		opts.Roots = systemRootsPool()
		if opts.Roots == nil {
			return nil, SystemRootsError{systemRootsErr}
		}
	}
// src/crypto/x509/root.go 
var (
	once           sync.Once
	systemRoots    *CertPool
	systemRootsErr error
)

func systemRootsPool() *CertPool {
	once.Do(initSystemRoots)
	return systemRoots
}

func initSystemRoots() {
	systemRoots, systemRootsErr = loadSystemRoots()
	if systemRootsErr != nil {
		systemRoots = nil
	}
}
// src/crypto/x509/verify.g
func (se SystemRootsError) Error() string {
	msg := "x509: failed to load system roots and no roots provided"
	if se.Err != nil {
		return msg + "; " + se.Err.Error()
	}
	return msg
}

@TysonAndre
Copy link
Contributor Author

	manager.cert, manager.err = tls.LoadX509KeyPair(psp.FixedData["cert"], psp.FixedData["key"])
	if manager.err != nil {
		return manager
	}
	manager.conf = &tls.Config{
		Certificates:       []tls.Certificate{manager.cert},
		InsecureSkipVerify: false,
	}

https://golang.org/pkg/crypto/tls/#Certificate contains an https://golang.org/pkg/crypto/x509/#Certificate - https://golang.org/pkg/crypto/x509/#Certificate.Verify can be used to check if the certificate is still valid?

If opts.Roots is nil and system roots are unavailable the returned error will be of type SystemRootsError.

https://golang.org/pkg/crypto/x509/#InvalidReason has many possible reasons, such as expiry

@TysonAndre
Copy link
Contributor Author

A straightforward fix for this failure mode would be to refuse to start up on non-windows OSes if loadSystemRoots failed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

1 participant