
Consider a connection issue a transient one #292

Merged — 1 commit merged into main on Apr 22, 2024
Conversation

alexsnaps
Member

This addresses the following issue: "not being able to connect to Redis" is considered partitioning, yet on startup, a failure to connect successfully makes the process err out. The issue is that we want to give it some "more" time to reconnect at startup than we allow during a "regular request". Sadly, the reconnect logic is buried deep in the crate's code...
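To make the desired startup behavior concrete, here is a minimal sketch of a connect loop that gets a more generous deadline than the per-request timeout. All names here (`connect_with_retry`, the error type, the sleep interval) are illustrative only, not the redis crate's API:

```rust
use std::time::{Duration, Instant};

// Sketch: keep retrying the initial connection until a startup deadline
// elapses, instead of failing on the first refusal. `try_connect` stands in
// for whatever actually opens the Redis connection.
fn connect_with_retry<F>(mut try_connect: F, deadline: Duration) -> Result<(), &'static str>
where
    F: FnMut() -> Result<(), &'static str>,
{
    let start = Instant::now();
    loop {
        match try_connect() {
            Ok(()) => return Ok(()),
            // Out of time: surface the last error to the caller.
            Err(e) if start.elapsed() >= deadline => return Err(e),
            // Otherwise back off briefly and try again.
            Err(_) => std::thread::sleep(Duration::from_millis(10)),
        }
    }
}

fn main() {
    // Simulate a Redis that refuses the first two attempts, then accepts.
    let mut attempts = 0;
    let res = connect_with_retry(
        || {
            attempts += 1;
            if attempts < 3 { Err("connection refused") } else { Ok(()) }
        },
        Duration::from_secs(1),
    );
    assert!(res.is_ok());
    assert_eq!(attempts, 3);
    println!("connected after {attempts} attempts");
}
```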

There were 3 options I could think of to achieve that:

  1. Set a different timeout when connecting initially, but use the response timeout (or a function thereof, see below) when reconnecting
  2. Treat the connection refusal as a transient error, but only on startup
  3. Only try to resolve a partition in the out-of-band code paths, i.e. when flushing the batcher

afaict, there is no way to implement the first option, my preferred one. The third option also ended up being my least preferred one. I'd rather resolve the partition a little slower than:

  • Wait until the flush period has elapsed
  • But most importantly, I didn't want to couple the partition-resolution logic with the logic that flushes counters to Redis... That feels... wrong. It complects two independent concerns.

So here is the fairly straightforward patch:

  • Treat connection refusal as transient
  • Transient errors on bootstrap aren't treated leniently anyway; that only happens within the CachedRedisStorage storage
  • Infer the connection timeout from the response timeout, since we know TLS 1.2 might require 2x RTT to handshake.
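The two mechanical pieces of the patch can be sketched as follows: classify a refused connection as transient rather than fatal, and derive the connection timeout from the response timeout. The multiplier, the error enum, and the use of `std::io::ErrorKind` are assumptions for illustration; the actual patch works against the redis crate's error types:

```rust
use std::io::ErrorKind;
use std::time::Duration;

// Illustrative error classification: a refused connection may resolve
// itself once Redis comes back, so treat it as transient, not fatal.
#[derive(Debug, PartialEq)]
enum StorageErr {
    Transient,
    Fatal,
}

fn classify(kind: ErrorKind) -> StorageErr {
    match kind {
        ErrorKind::ConnectionRefused => StorageErr::Transient,
        _ => StorageErr::Fatal,
    }
}

// Derive the connection timeout from the response timeout: TLS 1.2 can
// need up to 2x RTT to handshake, so budget a multiple of the per-request
// timeout (the factor of 3 is an assumed value for illustration).
fn connection_timeout(response_timeout: Duration) -> Duration {
    response_timeout * 3
}

fn main() {
    assert_eq!(classify(ErrorKind::ConnectionRefused), StorageErr::Transient);
    assert_eq!(classify(ErrorKind::PermissionDenied), StorageErr::Fatal);
    assert_eq!(
        connection_timeout(Duration::from_millis(100)),
        Duration::from_millis(300)
    );
    println!("ok");
}
```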

The most important bits of this PR are the questions it raises:

  • Do you agree with my rationale?
  • Did I miss another option that might be a better approach to this?

If we all think option 1 is really a whole lot better, we could raise it with the Redis crate maintainer and push a PR. Or look around for another Redis crate that handles initial connections and reconnects in a way better suited to our issue...

If, otoh, this makes sense, then please review the actual patch :)

@didierofrivia (Contributor) left a comment:


We could research alternatives that cover all the Redis features we're using and see whether a candidate handles this better; if not, submitting an issue upstream is probably the way to go. For now, these changes make sense.

@alexsnaps force-pushed the fix-reconnect branch 2 times, most recently from 9c204f4 to 3b346a2, on April 22, 2024 14:17
Base automatically changed from refactor-partition to main on April 22, 2024 14:43
@alexsnaps merged commit b7c748a into main on Apr 22, 2024
20 checks passed
@alexsnaps deleted the fix-reconnect branch on April 22, 2024 15:05