Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix for #1520: Add support for retry on transactions for resharding #2062

Open
wants to merge 6 commits into
base: main
Choose a base branch
from

Conversation

dusty9023
Copy link

@dusty9023 dusty9023 commented Apr 3, 2022

A proposed solution for #1520. I ask that the proposed change be thoroughly reviewed by those who are experienced with the repo, and the Redis Transaction code in the library.

Investigation:
In order to investigate, I took one of the many transaction errors we saw when increasing/decreasing the number of shards in our cluster, and used the associated keys + nodes to reply the the exact same transaction using the debugger.

The error for the transaction EXEC was the same as what we saw in our logs:

EXECABORT Transaction discarded because of previous errors.

image

Looking at the InnerOperations of the TransactionMessage, I was surprised to find that the ResultBox was null, and the result was already lost by the time the SetResult() for transaction message was called.
image

I eventually tracked the cause down to base Message class' Complete() method. This removed the resultbox to ensure it couldn't be called twice, but it also mean't that the result of the InnerOperations result aren't available when the transaction result is being processed.
image

This required additional changes to ensure that the result of the InnerOperations were available to the TransactionMessage when it was processing the result, but that they were also cleaned up when the TransactionMessage itself was complete.

Digging into the ResultProcessor, also showed that it RELIED on the RawResult error message following a very particular format in order to detect, and thus retry, messages where the hashslot had been moved.

Based on these facts, the RawResult error message had to be edited to include:

  • prefix the "MOVED (HASHSLOT) (endpoint) " to the entire raw result error if we determine that the ONLY reason that the transaction failed was due to a single hashslot being moved.
  • Also aggregate all the InnerOperation errors/fault messages, and append them to the existing RawResult from the EXEC call.

This required changes to the TransactionMessage and QueueMessage's lifecycle (Complete() method), relying heavily on the TransactionMessage Complete() to mark the QueueMessages actually completed.

Important Notes:

  • I attempted to write unit tests for these changes that didn't rely on a cluster, but most of the nested classes involved are private. Changing those to internal, also revealed that many of the dependencies, such as ServerEndPoint, PhysicalBridge, PhysicalConnection... are all sealed and cannot be mocked using MOQ.
  • In order to test these changes, you require a REDIS cluster that has at least two primaries. One that had keys written to it originally, then the hashslot with that key is moved to the other primary.
  • I've also added/updated a functional test to cover this case.
  • I noticed the "src/StackExchange.Redis/PublicAPI.Unshipped.txt", and "src/StackExchange.Redis/PublicAPI.Shipped.txt", but I'm not aware of what process you follow for moving an API from being considered Unshipped to Shipped.

Due to the challenges in testing, I resorted to the following for testing:

  • Taking a transaction that had failed in our pre-production cluster due to re-sharding, and then retrying it again using these changes.
  • After proving the solution worked on a smaller scale, we deployed the change to our pre-production environment, and scaled our cluster (resharding) from 25 to 30 shards, while having active traffic on the cluster.

cluster | CLUSTER | April 3 2022 17:50:50 UTC | Scaling out cluster ---- from 25 shards to 30 shards
cluster | CLUSTER | April 3 2022 18:13:00 UTC | Added node n in availability zone
...
cluster | CLUSTE> | April 3 2022 18:13:00 UTC | Added node n+1 in availability zone
cluster | CLUSTER | April 3 2022 18:14:39 UTC | Migrating slots from shard 0001 to 0026 to rebalance slots
...
cluster | CLUSTER| April 3 2022 18:15:40 UTC | Moved a total of 109 slots out of 109 slots from shard 0001 to shard 0026
....

During this time we saw no errors in our logs.

Dustin Durand added 6 commits April 3, 2022 12:05
- Tests the use case where a transaction is run against the wrong primary node, and it automatically redirects to the correct one
- RedisHashslotMigratedAndNoRedirectException is thrown if a NoRedirect is thrown
Attempting another method for forcing a transaction to run against a specific endpoint
@@ -56,6 +58,12 @@ public Task<bool> ExecuteAsync(CommandFlags flags)
return base.ExecuteAsync(msg, proc); // need base to avoid our local wrapping override
}

internal bool ExecuteInternal(CommandFlags flags, ServerEndPoint endpoint = null)
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not a huge fan of this, but I also didn't want to change the public API of this for the sake of making a functional test work.

@@ -188,7 +188,69 @@ static string StringGet(IServer server, RedisKey key, CommandFlags flags = Comma
string e = StringGet(conn.GetServer(node.EndPoint), key);
Assert.Equal(value, e); // wrong replica, allow redirect

var ex = Assert.Throws<RedisServerException>(() => StringGet(conn.GetServer(node.EndPoint), key, CommandFlags.NoRedirect));
var ex = Assert.Throws<RedisHashslotMigratedAndNoRedirectException>(() => StringGet(conn.GetServer(node.EndPoint), key, CommandFlags.NoRedirect));
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure how much of an impact switching from RedisServerException to RedisHashslotMigratedAndNoRedirectException will have; it's a pretty specific criteria and error to cause this.

One strategy that could be used to mitigate the impact would be to switch RedisHashslotMigratedAndNoRedirectException be a subclass of RedisServerException (requires unsealing the class).

// still need to activate continuations for GetMessages(),
// which might be waiting for the last innerOperation to
// complete.
ResultBox?.ActivateContinuations();
Copy link
Author

@dusty9023 dusty9023 Apr 3, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I understood the logic behind this correct, in order for the last section of the GetMessage() logic to work, where it waits on the last result box of the innerOperations, this needs to be fired.

This ultimately means that in the lifetime of a QueuedMessage, it will fire/pulse twice. Once, when it's marked Complete() after it finishes it's call to REDIS, and then again when TransactionComplete() is called when the TransactionMessage runs it's Complete(). (which calls TransactionComplete() on all its innerOperations -- calling base.Complete()).

@NickCraver
Copy link
Collaborator

Hey @dusty9023 - I won't have time to get to this until tonight at the earliest for a really in-depth pass (this needs some thinking time), but wanted to say thanks a ton for a great write-up here. We'll evaluate and see if this works as-is or if we need some tweaks but your investigation and notes are a tremendous help and time save here - thank you!

@NickCraver
Copy link
Collaborator

Alrighty, trying to grok the issue/changes here (thanks a ton for descriptions and example code!). I understand what you're hitting, but would want to take a much simpler approach to solving it if possible - hope to get to it this week but depends what all else eats time (trying to get a few things in).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants