Fix for #1520: Add support for retry on transactions for resharding #2062
base: main
Conversation
- Tests the use case where a transaction is run against the wrong primary node and is automatically redirected to the correct one
- RedisHashslotMigratedAndNoRedirectException is thrown if the redirect is suppressed by the NoRedirect flag
Attempting another method for forcing a transaction to run against a specific endpoint
@@ -56,6 +58,12 @@ public Task<bool> ExecuteAsync(CommandFlags flags)
return base.ExecuteAsync(msg, proc); // need base to avoid our local wrapping override
}

internal bool ExecuteInternal(CommandFlags flags, ServerEndPoint endpoint = null)
Not a huge fan of this, but I also didn't want to change the public API of this for the sake of making a functional test work.
@@ -188,7 +188,69 @@ static string StringGet(IServer server, RedisKey key, CommandFlags flags = Comma
string e = StringGet(conn.GetServer(node.EndPoint), key);
Assert.Equal(value, e); // wrong replica, allow redirect

var ex = Assert.Throws<RedisServerException>(() => StringGet(conn.GetServer(node.EndPoint), key, CommandFlags.NoRedirect));
var ex = Assert.Throws<RedisHashslotMigratedAndNoRedirectException>(() => StringGet(conn.GetServer(node.EndPoint), key, CommandFlags.NoRedirect));
I'm not sure how much of an impact switching from RedisServerException to RedisHashslotMigratedAndNoRedirectException will have; the criteria for triggering this error are pretty specific.
One strategy that could mitigate the impact would be to make RedisHashslotMigratedAndNoRedirectException a subclass of RedisServerException (which requires unsealing that class).
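That mitigation could look roughly like the following sketch (assuming RedisServerException has been unsealed and exposes a message-only constructor to subclasses; both are assumptions, not the library's current shape):

```csharp
// Sketch only: assumes RedisServerException is unsealed and has a
// (string message) constructor accessible to derived types.
public class RedisHashslotMigratedAndNoRedirectException : RedisServerException
{
    public RedisHashslotMigratedAndNoRedirectException(string message)
        : base(message)
    {
    }
}
```

With this shape, existing `catch (RedisServerException)` handlers would continue to catch the new exception, which is the point of the mitigation.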
// still need to activate continuations for GetMessages(),
// which might be waiting for the last innerOperation to
// complete.
ResultBox?.ActivateContinuations();
If I understood the logic behind this correctly, this needs to fire in order for the last section of the GetMessage() logic to work, where it waits on the last result box of the innerOperations.
This ultimately means that in the lifetime of a QueuedMessage, it will fire/pulse twice: once when it's marked Complete() after its call to Redis finishes, and again when TransactionComplete() is called as the TransactionMessage runs its own Complete() (which calls TransactionComplete() on all of its innerOperations, calling base.Complete()).
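The two-pulse lifecycle described above can be sketched as follows (a hypothetical simplification; the real Message/ResultBox types in the library carry far more state and are not shaped exactly like this):

```csharp
// Hypothetical simplification of the QueuedMessage completion lifecycle
// described above; not the library's actual implementation.
class QueuedMessageSketch
{
    private ResultBoxSketch _resultBox = new ResultBoxSketch();

    // Pulse 1: fired when the QUEUED reply for this inner operation arrives.
    public void Complete()
    {
        // Wake anything (e.g. GetMessages()) waiting on this operation,
        // but keep the result box so the transaction can still read it later.
        _resultBox?.ActivateContinuations();
    }

    // Pulse 2: fired by the TransactionMessage's Complete(), which calls
    // TransactionComplete() on each of its inner operations.
    public void TransactionComplete()
    {
        var box = _resultBox;
        _resultBox = null; // now safe to release: the result was consumed
        box?.ActivateContinuations();
    }
}

class ResultBoxSketch
{
    public void ActivateContinuations() { /* pulse any waiters */ }
}
```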
Hey @dusty9023 - I won't have time to get to this until tonight at the earliest for a really in-depth pass (this needs some thinking time), but wanted to say thanks a ton for a great write-up here. We'll evaluate and see if this works as-is or if we need some tweaks, but your investigation and notes are a tremendous help and time saver here - thank you!
Alrighty, trying to grok the issue/changes here (thanks a ton for the descriptions and example code!). I understand what you're hitting, but would want to take a much simpler approach to solving it if possible - hope to get to it this week but depends what all else eats time (trying to get a few things in).
A proposed solution for #1520. I ask that the proposed change be thoroughly reviewed by those who are experienced with the repo and the Redis transaction code in the library.
Investigation:
To investigate, I took one of the many transaction errors we saw when increasing/decreasing the number of shards in our cluster, and used the associated keys + nodes to replay the exact same transaction under the debugger.
The error for the transaction EXEC was the same as what we saw in our logs:
Looking at the InnerOperations of the TransactionMessage, I was surprised to find that the ResultBox was null, and the result was already lost by the time SetResult() for the transaction message was called.
I eventually tracked the cause down to the base Message class's Complete() method. This removes the result box to ensure it can't be called twice, but it also meant that the results of the InnerOperations aren't available when the transaction result is being processed.
This required additional changes to ensure that the results of the InnerOperations were available to the TransactionMessage while it was processing the result, but were also cleaned up once the TransactionMessage itself was complete.
Digging into the ResultProcessor also showed that it RELIES on the RawResult error message following a very particular format in order to detect, and thus retry, messages whose hashslot had been moved.
Based on these facts, the RawResult error message had to be edited to include:
This required changes to the TransactionMessage and QueuedMessage lifecycles (their Complete() methods), relying heavily on the TransactionMessage's Complete() to mark the QueuedMessages as actually completed.
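For context, cluster redirect detection hinges on the standard `MOVED <slot> <host>:<port>` error shape defined by the Redis cluster specification. A minimal sketch of that kind of format check (illustrative only, not the library's actual ResultProcessor code):

```csharp
// Sketch only: illustrates the "MOVED <slot> <host>:<port>" error format
// that redirect detection depends on; not StackExchange.Redis's parser.
static bool TryParseMoved(string error, out int slot, out string host, out int port)
{
    slot = 0; host = null; port = 0;
    var parts = error.Split(' ');
    if (parts.Length != 3 || parts[0] != "MOVED") return false;
    if (!int.TryParse(parts[1], out slot)) return false;

    // Split "<host>:<port>" on the last ':' (hosts may contain colons, e.g. IPv6).
    var idx = parts[2].LastIndexOf(':');
    if (idx < 0) return false;
    host = parts[2].Substring(0, idx);
    return int.TryParse(parts[2].Substring(idx + 1), out port);
}
```

Because detection is string-format-sensitive like this, any change to the propagated error message (as described above) has to preserve that shape or the retry logic silently stops matching.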
Important Notes:
Due to the challenges involved, I resorted to the following for testing:
cluster | CLUSTER | April 3 2022 17:50:50 UTC | Scaling out cluster ---- from 25 shards to 30 shards
cluster | CLUSTER | April 3 2022 18:13:00 UTC | Added node n in availability zone
...
cluster | CLUSTER | April 3 2022 18:13:00 UTC | Added node n+1 in availability zone
cluster | CLUSTER | April 3 2022 18:14:39 UTC | Migrating slots from shard 0001 to 0026 to rebalance slots
...
cluster | CLUSTER | April 3 2022 18:15:40 UTC | Moved a total of 109 slots out of 109 slots from shard 0001 to shard 0026
...
During this time we saw no errors in our logs.