
Consider a connection issue a transient one #292

Merged — 1 commit merged into main on Apr 22, 2024
Conversation

alexsnaps
Member

This addresses the following issue: "not being able to connect to Redis" is considered partitioning, yet on startup, a failure to connect successfully makes the process err out. The issue is that we want to give it some "more" time to reconnect at startup than we allow during a "regular request". Sadly, the reconnect logic is buried deep in the crate's code...
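To make the desired startup behavior concrete, here is a minimal sketch of a connect loop that gets a more generous deadline than the per-request timeout. All names here (`connect_with_retry`, the error type, the sleep interval) are illustrative only, not the redis crate's API:

```rust
use std::time::{Duration, Instant};

// Sketch: keep retrying the initial connection until a startup deadline
// elapses, instead of failing on the first refusal. `try_connect` stands in
// for whatever actually opens the Redis connection.
fn connect_with_retry<F>(mut try_connect: F, deadline: Duration) -> Result<(), &'static str>
where
    F: FnMut() -> Result<(), &'static str>,
{
    let start = Instant::now();
    loop {
        match try_connect() {
            Ok(()) => return Ok(()),
            // Out of time: surface the last error to the caller.
            Err(e) if start.elapsed() >= deadline => return Err(e),
            // Otherwise back off briefly and try again.
            Err(_) => std::thread::sleep(Duration::from_millis(10)),
        }
    }
}

fn main() {
    // Simulate a Redis that refuses the first two attempts, then accepts.
    let mut attempts = 0;
    let res = connect_with_retry(
        || {
            attempts += 1;
            if attempts < 3 { Err("connection refused") } else { Ok(()) }
        },
        Duration::from_secs(1),
    );
    assert!(res.is_ok());
    assert_eq!(attempts, 3);
    println!("connected after {attempts} attempts");
}
```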

There were 3 options I could think of to achieve that:

  1. Set a different timeout when connecting initially, but use the response timeout (or a function thereof, see below) when reconnecting
  2. Treat the connection refusal as a transient error, but only on startup
  3. Only try to resolve a partition in the out-of-band code paths, i.e. when flushing the batcher

afaict, there is no way to implement the first option, my preferred one. The third option also ended up being my least preferred one. I'd rather resolve the partition a little slower than:

  • Wait until the flush period has elapsed
  • But most importantly, I didn't want to couple the partition-resolution logic with the logic that flushes counters to Redis... That feels... wrong. It complects two independent concerns.

So here is the fairly straightforward patch:

  • Treat connection refusal as transient
  • Transient errors on bootstrap aren't treated leniently anyway; that only happens within the CachedRedisStorage storage
  • Infer the connection timeout from the response timeout, since we know TLS 1.2 might require 2x RTT to handshake.
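The two mechanical pieces of the patch can be sketched as follows: classify a refused connection as transient rather than fatal, and derive the connection timeout from the response timeout. The multiplier, the error enum, and the use of `std::io::ErrorKind` are assumptions for illustration; the actual patch works against the redis crate's error types:

```rust
use std::io::ErrorKind;
use std::time::Duration;

// Illustrative error classification: a refused connection may resolve
// itself once Redis comes back, so treat it as transient, not fatal.
#[derive(Debug, PartialEq)]
enum StorageErr {
    Transient,
    Fatal,
}

fn classify(kind: ErrorKind) -> StorageErr {
    match kind {
        ErrorKind::ConnectionRefused => StorageErr::Transient,
        _ => StorageErr::Fatal,
    }
}

// Derive the connection timeout from the response timeout: TLS 1.2 can
// need up to 2x RTT to handshake, so budget a multiple of the per-request
// timeout (the factor of 3 is an assumed value for illustration).
fn connection_timeout(response_timeout: Duration) -> Duration {
    response_timeout * 3
}

fn main() {
    assert_eq!(classify(ErrorKind::ConnectionRefused), StorageErr::Transient);
    assert_eq!(classify(ErrorKind::PermissionDenied), StorageErr::Fatal);
    assert_eq!(
        connection_timeout(Duration::from_millis(100)),
        Duration::from_millis(300)
    );
    println!("ok");
}
```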

The most important bits of this PR are the questions it raises:

  • Do you agree with my rationale?
  • Did I miss another option that might be a better approach to this?

If we all think option 1 is really a whole lot better, we could raise it with the Redis crate maintainer and push a PR. Or look around for another Redis crate that handles initial connections and reconnects in a way better suited to our issue...

If, otoh, this makes sense, then please review the actual patch :)

@didierofrivia (Contributor) left a comment:


We could research alternatives that cover all the Redis features we're using and see whether a candidate handles this better; if not, submitting an issue upstream is probably the way to go. For now, these changes make sense.

@alexsnaps force-pushed the fix-reconnect branch 2 times, most recently from 9c204f4 to 3b346a2, on April 22, 2024 14:17
Base automatically changed from refactor-partition to main on April 22, 2024 14:43
@alexsnaps merged commit b7c748a into main on Apr 22, 2024
20 checks passed
@alexsnaps deleted the fix-reconnect branch on April 22, 2024 15:05