Zookeeper client session is not reset when cluster loses session state (authentication failed: EOF) #36

colinmcintosh · 2020-07-28T20:37:39Z

Hi, I noticed an issue where the go-zookeeper client will not reset the session if the Zookeeper cluster loses it's session state (instance reset, redeploy, etc) and the server responds with an EOF. When this happens the client enters an endless loop trying to authenticate with the previous Session ID. It looks like the loop happens here:

zk/conn.go

Line 435 in 50daf81

c.logger.Printf("authentication failed: %s", err)

I haven't had a chance to look through the code yet for this but I'm wondering if anyone knows what process needs to happen to reset the session in the client?

nemith · 2021-03-26T16:46:12Z

Is this similer to what we see in #17?

mathispesch · 2021-06-21T14:56:17Z

Hi there,
I'm able to reproduce this with https://github.com/samuel/go-zookeeper: I'm getting the same authentication failed: EOF error if all zookeeper nodes are unavailable at the same time and get replaced. If the issue is only due to the network, the connection recovers ok but if the zookeeper nodes get replaced (a restart is enough), the app needs to be restarted since it will try to reconnect with the stale session and fail to do so. I would say that this is the same behaviour as described in #17.

mathispesch · 2021-06-22T13:57:48Z

I tried to investigate this issue more but the problem here is that I don't know how to differentiate between an io.EOF error because there is no quorum and one because the whole cluster has been reset.
I tried to implement a fix based on #37, adding an additional session reset when getting a io.EOF error but this is a too broad clause and will cause the session to be reset even when there is no need to (e.g. if there is no quorum).

nemith · 2021-06-22T14:16:59Z

Interesting. It's odd that the zookeeper server just end the connection if the session could not be found vs a more specific error (or perhaps we aren't parsing the error?)

Has anyone dug through the java client code to see if there is some special handling there?

jpfourny · 2022-02-09T16:29:22Z

I created a patch to workaround this. It's based on a different fork, but it should be compatible.
Reset_session_state_when_entire_quorum_losses_our_session.patch.txt

TL;DR; The client keeps track of how long ago was the last successful authentication, and if we see an auth request fail with io.EOF AFTER the session timeout elapses, we have to assume the ZK quorum lost our session. To recover, just reset the clients session state and invalidate watches. This seems reasonable to me, and it appears to work as expected when I restart a ZK in a single-node case.

Shall I open a PR?

inf-rno · 2022-02-09T16:46:54Z

This is the commit on the fork jpfourny@cd8998a
I think this is a reasonable patch and better than simply getting stalled.

colinmcintosh · 2022-02-09T17:26:27Z

@jpfourny Your patch looks pretty similar to the PR I submitted above with this issue originally. Looking through it I think the only major difference between them is your call to invalidateWatches. If you think they are sufficiently similar enough I can update the existing PR with that call as well.

0607fee

jpfourny · 2022-02-09T18:11:03Z

@jpfourny Your patch looks pretty similar to the PR I submitted above with this issue originally. Looking through it I think the only major difference between them is your call to invalidateWatches. If you think they are sufficiently similar enough I can update the existing PR with that call as well.

0607fee

That works. I do recommend you move c.invalidateWatches(err) into c.resetSession. I imagine watches would be no good if the session is reset, anyway.

I vote for getting this merged. If we lost quorum for longer than session timeout, then our session would have expired anyway. I feel it's safe for the client to abandon their session at this point, no matter what the state of the quorum actually is.

colinmcintosh · 2022-02-10T04:04:40Z

I updated PR #37 to include the call to c.invalidateWatches in c.resetSession as @jpfourny suggested.

LittleCuteBug · 2023-06-02T02:46:22Z

Hi, Is there still ongoing work on this issue

Signed-off-by: zwtop <[email protected]>

amukherj · 2024-03-25T14:36:03Z

How can I recreate this scenario with a Zookeeper cluster (single- or 3-node) and a client that uses go-zookeeper? Any suggestions @jpfourny @colinmcintosh ?

colinmcintosh linked a pull request Jul 29, 2020 that will close this issue

Fix for stale session issue #37

Open

nemith added bug Something isn't working needs investigatation Need to dig into root cause labels Mar 26, 2021

zwtop pushed a commit to everoute/zk that referenced this issue Jun 5, 2023

fix(go-zookeeper#36): reset session after client disconnected

5bb4629

Signed-off-by: zwtop <[email protected]>

This was referenced Jun 5, 2023

fix(go-zookeeper/zk#36): reset session after client disconnected #108

Closed

fix(go-zookeeper/zk#36): reset session after client disconnected everoute/zk#1

Merged

zwtop pushed a commit to everoute/zk that referenced this issue Jun 6, 2023

fix(go-zookeeper#36): reset session after client disconnected

da86d29

Signed-off-by: zwtop <[email protected]>

rdevmore mentioned this issue Nov 15, 2023

Fixes for EOF connection error ContinuumLLC/go-zookeeper#6

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Zookeeper client session is not reset when cluster loses session state (authentication failed: EOF) #36

Zookeeper client session is not reset when cluster loses session state (authentication failed: EOF) #36

colinmcintosh commented Jul 28, 2020

nemith commented Mar 26, 2021

mathispesch commented Jun 21, 2021

mathispesch commented Jun 22, 2021

nemith commented Jun 22, 2021

jpfourny commented Feb 9, 2022 •

edited

Loading

inf-rno commented Feb 9, 2022

colinmcintosh commented Feb 9, 2022

jpfourny commented Feb 9, 2022 •

edited

Loading

colinmcintosh commented Feb 10, 2022

LittleCuteBug commented Jun 2, 2023

amukherj commented Mar 25, 2024

Zookeeper client session is not reset when cluster loses session state (authentication failed: EOF) #36

Zookeeper client session is not reset when cluster loses session state (authentication failed: EOF) #36

Comments

colinmcintosh commented Jul 28, 2020

nemith commented Mar 26, 2021

mathispesch commented Jun 21, 2021

mathispesch commented Jun 22, 2021

nemith commented Jun 22, 2021

jpfourny commented Feb 9, 2022 • edited Loading

inf-rno commented Feb 9, 2022

colinmcintosh commented Feb 9, 2022

jpfourny commented Feb 9, 2022 • edited Loading

colinmcintosh commented Feb 10, 2022

LittleCuteBug commented Jun 2, 2023

amukherj commented Mar 25, 2024

jpfourny commented Feb 9, 2022 •

edited

Loading

jpfourny commented Feb 9, 2022 •

edited

Loading