
The driver does not deal with changing redis IP addresses in cluster mode #183

Open
akshaymankar opened this issue May 30, 2022 · 20 comments

@akshaymankar

Hedis loses its connection to a Redis cluster if the cluster nodes change their IP addresses during a restart.

How to reproduce the problem

  1. Start a Haskell app that talks to Redis in cluster mode
  2. Restart each Redis node in a way that changes its IP address. This happens when using the bitnami redis-cluster Helm chart.
  3. Observe that the app can no longer talk to Redis

While the app cannot talk to Redis, we see errors like:

service: Network.Socket.connect: <socket: 51>: does not exist (No route to host)

Impact

This makes using hedis in Kubernetes very difficult.

Speculation

This could be a result of using socketToHandle here. The docs for the function say:

Caveat Handle is not recommended for network programming in Haskell, e.g. merely performing hClose on a TCP socket won't cooperate with peer's gracefulClose, i.e. proper shutdown sequence with appropriate handshakes specified by the protocol.

It could also be some other problem; this is just speculation.
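
For reference, the alternative the network docs point to is using the socket directly rather than wrapping it in a Handle. A minimal sketch of that pattern (not hedis code, just an illustration of what the caveat describes):

import qualified Data.ByteString.Char8 as B
import Network.Socket (Socket, gracefulClose)
import Network.Socket.ByteString (recv, sendAll)

-- Talk to the socket directly instead of via a Handle;
-- gracefulClose performs the shutdown handshake the caveat mentions.
pingRaw :: Socket -> IO B.ByteString
pingRaw sock = do
  sendAll sock (B.pack "PING\r\n")
  reply <- recv sock 4096
  gracefulClose sock 5000  -- wait up to 5000 ms for the peer to close
  pure reply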

Additional Info

I create ConnectInfo like this:

Redis.defaultConnectInfo
  { Redis.connectHost = "<kubernetes-service-name>",
    Redis.connectPort = Redis.PortNumber 6379,
    Redis.connectTimeout = Just (secondsToNominalDiffTime 5),
    Redis.connectMaxConnections = 100
  }

And connect to redis like this:

Redis.connectCluster connInfo

This connection does not use TLS.
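
For completeness, a minimal loop along these lines reproduces the symptom (a sketch of my setup, with the host name as a placeholder):

import Control.Concurrent (threadDelay)
import Control.Monad (forever)
import qualified Database.Redis as Redis

-- PING once a second; after the cluster nodes rotate their IPs,
-- every call starts failing with "No route to host".
main :: IO ()
main = do
  conn <- Redis.connectCluster Redis.defaultConnectInfo
            { Redis.connectHost = "<kubernetes-service-name>" }
  forever $ do
    reply <- Redis.runRedis conn Redis.ping
    print reply
    threadDelay 1000000  -- 1 second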

@k-bx
Collaborator

k-bx commented May 30, 2022

Please see #184

@flokli

flokli commented May 30, 2022

@michaelglass, #168 suggests you're using cluster support quite extensively. Is this something you could look into?

@aravindgopall
Collaborator

We need to run refreshShardMap, which fetches the new cluster info (by issuing a cluster info call to Redis). But in your case, how would this query even run? You would need at least one node with a static IP to run the query and get the info needed to connect to all the nodes at their new IPs.

PS: the connections currently held in the pool might still be used; they are only removed after exceptions are thrown.
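
For illustration, the refresh boils down to asking any node that still answers for the cluster topology. A rough sketch using hedis's generic sendRequest (just the idea, not the actual refreshShardMap internals):

{-# LANGUAGE OverloadedStrings #-}
import Database.Redis

-- Ask one still-reachable node for the current slot-to-node mapping;
-- a new shard map can then be rebuilt from the reply.
fetchClusterTopology :: Connection -> IO (Either Reply Reply)
fetchClusterTopology conn =
  runRedis conn (sendRequest ["CLUSTER", "SLOTS"])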

@michaelglass
Contributor

Sorry about the late response. Maybe @stoeffel or @omnibs have more context. I don't write much Haskell anymore 😔

@aravindgopall
Collaborator

For anyone else looking here: we have a fork at Juspay (https://github.com/juspay/hedis) with some of these fixes already done. We would like to upstream them at some point.

@ysangkok
Contributor

@aravindgopall Glad to see you forked. Is it the fix/connection branch people are supposed to use? Why is it 104 commits behind? Would you be interested in taking over hedis maintainership?

@aravindgopall
Collaborator

Right now it is the fix/connection branch; we are testing the changes in our setup. Once done, we will merge everything into the cluster branch, followed by master.

> Why is it 104 commits behind?

The cluster branch has all of the changes from before they were merged into upstream hedis (done by @alexjg); there is a PR waiting to bring it up to date with upstream. This all might take a week or two to settle.

@ysangkok I would be very much interested in this.

@aravindgopall
Collaborator

aravindgopall commented Jan 18, 2023

JFI: the fix/connection branch is up to date with upstream master and is being tested. Once done, we will merge to master and release.

@aravindgopall
Collaborator

JFI: the cluster branch has the latest code with the above fix, plus some other fixes from the Juspay fork.

@ysangkok how do you want me to continue: shall I create a PR to upstream, or shall we continue on the Juspay fork?

@jbrechtel
Collaborator

> JFI: the cluster branch has the latest code with the above fix, plus some other fixes from the Juspay fork.
>
> @ysangkok how do you want me to continue: shall I create a PR to upstream, or shall we continue on the Juspay fork?

I would suggest creating a PR to this repo and we can get it merged.

@jbrechtel
Collaborator

> Right now it is the fix/connection branch; we are testing the changes in our setup. Once done, we will merge everything into the cluster branch, followed by master.
>
> The cluster branch has all of the changes from before they were merged into upstream hedis (done by @alexjg); there is a PR waiting to bring it up to date with upstream. This all might take a week or two to settle.

Hi @aravindgopall - I mistakenly offered to maintain this library some months ago and it turns out that I simply don't have the time. Could you email me and we can discuss handing over maintainership? james at flipstone . com

@stefanwire

> We need to run refreshShardMap, which fetches the new cluster info (by issuing a cluster info call to Redis). But in your case, how would this query even run? You would need at least one node with a static IP to run the query and get the info needed to connect to all the nodes at their new IPs.
>
> PS: the connections currently held in the pool might still be used; they are only removed after exceptions are thrown.

This issue won't be solved by refreshing the shard map from IP addresses already stored in the map: after Helm updates the Redis cluster, all Redis nodes have new IP addresses, so no IP address in the existing shard map would reach a then-running Redis node. The only way to reconnect the client is to resolve a fresh, valid IP address via the Redis cluster's DNS name and then obtain an entirely new shard map.

For our use case, we retain the DNS name outside Hedis. This is suboptimal, since the DNS name is already passed to Hedis for establishing the initial connection and could be kept for failovers when all IPs have become invalid - for instance, due to Helm being a villain.
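
Roughly, our wrapper amounts to this (all names hypothetical, not hedis API; it leans on connectCluster re-resolving connectHost):

{-# LANGUAGE ScopedTypeVariables #-}
import Control.Exception (SomeException, try)
import qualified Database.Redis as Redis

-- Keep the ConnectInfo (whose connectHost is the cluster's DNS name)
-- and rebuild the connection from scratch when a command fails, so
-- fresh node IPs are picked up via DNS.
runWithDnsFallback :: Redis.ConnectInfo -> Redis.Connection
                   -> Redis.Redis a -> IO a
runWithDnsFallback connInfo conn action = do
  r <- try (Redis.runRedis conn action)
  case r of
    Right x -> pure x
    Left (_ :: SomeException) -> do
      conn' <- Redis.connectCluster connInfo  -- re-resolves the DNS name
      Redis.runRedis conn' action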

@aravindgopall
Collaborator

aravindgopall commented Jan 31, 2023

@stefanwire This is a valid scenario. Right now the fix resolves the issue where at least one Redis node has kept its IP address, which covers the more frequent scenarios like failovers, scaling, etc.

@stefanwire

Feel free to check and adapt the code linked in my previous comment to solve the case where no IP connects anymore. Also, check my open PR, which adds the CLUSTER INFO command in order to improve robustness for Redis in cluster mode. We would be happy to use Hedis without local wrappers once this is done and merged.

@qnikst qnikst added the bug label Jun 11, 2023
@omnibs

omnibs commented Mar 11, 2024

I did a quick check on Juspay's branch with the following test scenario:

  • AWS Elasticache Redis cluster, 1 shard, 2 nodes
  • Launch program, issue some redis commands
  • Trigger "failover primary" on the EC UI
  • Try to issue any additional redis commands

It doesn't seem to work:

  • Without Juspay's branch: timeout executing redis commands, forever
  • With Juspay's branch: NoNodeException, forever

This means the shardMap MVar introduced in Juspay's branch somehow becomes empty during failover, and there's no recovery mechanism for when that happens.
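
Schematically, the failure mode I'm imagining looks like this (hypothetical code, not the branch's actual internals):

import Control.Concurrent.MVar (MVar, tryReadMVar)

-- If the shard map MVar is left empty during failover and nothing
-- ever refills it, every subsequent lookup fails.
withShardMap :: MVar shardMap -> (shardMap -> IO a) -> IO a
withShardMap var k = do
  m <- tryReadMVar var
  case m of
    Just sm -> k sm
    Nothing -> ioError (userError "NoNodeException: shard map unavailable")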

@aravindgopall
Collaborator

Hi @omnibs, can you give the Juspay master branch a try and share the results? In our internal tests it recovers without NoNodeException. Please let us know if you need any more info.

@omnibs

omnibs commented Mar 12, 2024

With the master branch it does recover from NoNodeException. Thank you for the pointer!

It seemed to take several minutes, though (I can't re-test because I ran out of test failovers for today).

I don't follow in the code what's making it recover, but I'm curious if there's something we can tweak to make it recover faster.

@aravindgopall
Collaborator

It's a retry mechanism based on MOVED/timeout errors (this can be configured to recover faster): when one occurs, the client fetches new shardMap info by picking a random node (which might itself be down, adding to the recovery time). After this it sends the command to the node that is now responsible.

It would be great to see some numbers and graphs to understand this (e.g. resharding time, throughput, how many errors, etc.).
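
Conceptually the loop is something like this (a sketch with made-up types, not the actual code):

import Control.Concurrent (threadDelay)

-- Stand-in types for the sketch.
data Reply = Ok String | Moved Int String | TimedOut
type ShardMap = [(Int, String)]  -- slot -> node address

-- On MOVED or timeout, refresh the shard map from a random live node,
-- then resend the command to the node that now owns the slot.
retryOnMoved :: Int -> IO ShardMap -> (ShardMap -> IO Reply) -> IO Reply
retryOnMoved 0 _       _    = pure TimedOut
retryOnMoved n refresh send = do
  shardMap <- refresh
  reply    <- send shardMap
  case reply of
    Moved _ _ -> threadDelay 100000 >> retryOnMoved (n - 1) refresh send
    TimedOut  -> retryOnMoved (n - 1) refresh send
    ok        -> pure ok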

@omnibs

omnibs commented Mar 13, 2024

Thank you for explaining!

So I re-ran my test in a simple manner:

  • I left curl on a loop, poking the health-check endpoint of our app in an environment every 2s
    • The health-check endpoint runs a PING through hedis, nothing more.
  • I triggered a failover
  • I waited for things to normalize

It took a pretty long time:

  • 15:12:24 - good
  • 15:12:26 - Connection Lost (failover happened)
  • 15:12:29 - NoNodeException
  • ...
  • 15:21:04 - good (recovery)

So it took 8min30s to recover from NoNodeException.

The failover itself is quick. I didn't time it, but restarting another instance of the app brought it back to life immediately, while the instance I was curling kept returning NoNodeException for several minutes after that.

> It's a retry mechanism based on MOVED/timeout errors (this can be configured to recover faster): when one occurs, the client fetches new shardMap info by picking a random node

Do my results make sense with this mechanism?

It sounds like after the failover my shardMap was suddenly empty, and stayed empty for a long time, so it couldn't have tried to pick a random node. But if this is correct, I don't understand what makes it recover after the 8min30s.

I plan to debug the mechanism further to figure out what's up, but any pointers you might have on what's happening would be appreciated!

ps: just to clarify, there's no resharding time; the failover happens within a single shard

@omnibs

omnibs commented Mar 14, 2024

OK, my test was too simplistic and, for Redis Cluster's intended usage, perhaps unrealistic.

Redis outside Elasticache requires at least 3 primaries for cluster mode; Elasticache allows you to set it up with a single primary.

I tested a local Redis cluster with 3 primaries, triggered failovers both with DEBUG SEGFAULT and manually, and didn't even notice them. No errors.

I then tested failing over an Elasticache cluster with 2 primaries, and all went well: we got a single NoNodeException, which turned into a timeout after 500ms, and everything fully recovered after 12s, which is very reasonable.

@aravindgopall one last question, then: does Juspay plan to upstream its fixes to cluster mode? I noticed there were 2 PRs for upstreaming them which got closed (#192 and #191).
