Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Zookeeper client session is not reset when cluster loses session state (authentication failed: EOF) #36

Open
colinmcintosh opened this issue Jul 28, 2020 · 11 comments · May be fixed by #37
Labels
bug Something isn't working needs investigatation Need to dig into root cause

Comments

@colinmcintosh
Copy link

Hi, I noticed an issue where the go-zookeeper client will not reset the session if the Zookeeper cluster loses it's session state (instance reset, redeploy, etc) and the server responds with an EOF. When this happens the client enters an endless loop trying to authenticate with the previous Session ID. It looks like the loop happens here:

zk/conn.go

Line 435 in 50daf81

c.logger.Printf("authentication failed: %s", err)

I haven't had a chance to look through the code yet for this but I'm wondering if anyone knows what process needs to happen to reset the session in the client?

@colinmcintosh colinmcintosh linked a pull request Jul 29, 2020 that will close this issue
@nemith
Copy link

nemith commented Mar 26, 2021

Is this similer to what we see in #17?

@nemith nemith added bug Something isn't working needs investigatation Need to dig into root cause labels Mar 26, 2021
@mathispesch
Copy link

Hi there,
I'm able to reproduce this with https://github.com/samuel/go-zookeeper: I'm getting the same authentication failed: EOF error if all zookeeper nodes are unavailable at the same time and get replaced. If the issue is only due to the network, the connection recovers ok but if the zookeeper nodes get replaced (a restart is enough), the app needs to be restarted since it will try to reconnect with the stale session and fail to do so. I would say that this is the same behaviour as described in #17.

@mathispesch
Copy link

I tried to investigate this issue more but the problem here is that I don't know how to differentiate between an io.EOF error because there is no quorum and one because the whole cluster has been reset.
I tried to implement a fix based on #37, adding an additional session reset when getting a io.EOF error but this is a too broad clause and will cause the session to be reset even when there is no need to (e.g. if there is no quorum).

@nemith
Copy link

nemith commented Jun 22, 2021

Interesting. It's odd that the zookeeper server just end the connection if the session could not be found vs a more specific error (or perhaps we aren't parsing the error?)

Has anyone dug through the java client code to see if there is some special handling there?

@jpfourny
Copy link

jpfourny commented Feb 9, 2022

I created a patch to workaround this. It's based on a different fork, but it should be compatible.
Reset_session_state_when_entire_quorum_losses_our_session.patch.txt

TL;DR; The client keeps track of how long ago was the last successful authentication, and if we see an auth request fail with io.EOF AFTER the session timeout elapses, we have to assume the ZK quorum lost our session. To recover, just reset the clients session state and invalidate watches. This seems reasonable to me, and it appears to work as expected when I restart a ZK in a single-node case.

Shall I open a PR?

@inf-rno
Copy link

inf-rno commented Feb 9, 2022

This is the commit on the fork jpfourny@cd8998a
I think this is a reasonable patch and better than simply getting stalled.

@colinmcintosh
Copy link
Author

@jpfourny Your patch looks pretty similar to the PR I submitted above with this issue originally. Looking through it I think the only major difference between them is your call to invalidateWatches. If you think they are sufficiently similar enough I can update the existing PR with that call as well.

0607fee

@jpfourny
Copy link

jpfourny commented Feb 9, 2022

@jpfourny Your patch looks pretty similar to the PR I submitted above with this issue originally. Looking through it I think the only major difference between them is your call to invalidateWatches. If you think they are sufficiently similar enough I can update the existing PR with that call as well.

0607fee

That works. I do recommend you move c.invalidateWatches(err) into c.resetSession. I imagine watches would be no good if the session is reset, anyway.

I vote for getting this merged. If we lost quorum for longer than session timeout, then our session would have expired anyway. I feel it's safe for the client to abandon their session at this point, no matter what the state of the quorum actually is.

@colinmcintosh
Copy link
Author

I updated PR #37 to include the call to c.invalidateWatches in c.resetSession as @jpfourny suggested.

@LittleCuteBug
Copy link

Hi, Is there still ongoing work on this issue

@amukherj
Copy link

How can I recreate this scenario with a Zookeeper cluster (single- or 3-node) and a client that uses go-zookeeper? Any suggestions @jpfourny @colinmcintosh ?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working needs investigatation Need to dig into root cause
Projects
None yet
Development

Successfully merging a pull request may close this issue.

7 participants