Wrapping request client calls with a read/write mutex #3701
This effectively creates a semaphore that allows just one request at a time. In theory, this can affect performance if multiple goroutines try to make multiple requests simultaneously. This, in turn, can cause problems in higher layers, where some timeouts may be hit unexpectedly. Have we done any benchmarks against real Electrum servers to measure the performance impact of this change?
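For context, a minimal sketch of the pattern under discussion, assuming a wrapper struct around a shared Electrum client. Only `requestWithRetry` is named in this PR; the package, the placeholder interface, and the field names are assumptions made for illustration.

```go
package electrumconn

import (
	"context"
	"sync"
)

// electrumClient is a placeholder for the real Electrum client type.
type electrumClient interface{}

// connection is an illustrative wrapper; only requestWithRetry is named
// in the PR, the rest is made up for this sketch.
type connection struct {
	clientMutex sync.Mutex // previously an RWMutex taken with RLock here
	client      electrumClient
}

// requestWithRetry now takes the mutex exclusively, so only one request
// can be in flight at a time; every other goroutine blocks until the
// current network call returns.
func (c *connection) requestWithRetry(
	ctx context.Context,
	requestFn func(ctx context.Context, client electrumClient) error,
) error {
	c.clientMutex.Lock()
	defer c.clientMutex.Unlock()

	// Retry/backoff logic elided; the key point is that the network call
	// happens while the exclusive lock is held.
	return requestFn(ctx, c.client)
}
```

With the exclusive lock, every waiting goroutine blocks for the full duration of the current network round-trip, which is exactly the serialization effect raised above.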
I ran the integration tests and system tests, but no performance-wise tests. I'm not sure if we have any benchmarks/performance-focused tests (please correct me if I'm wrong). Yes, it might affect performance, but I guess only when we expect a lot of write requests at the same time. I'm not too familiar with the flows to answer this off the top of my head. FWIW, we had a read mutex already, and I'd expect we read more than we write, based on the fact that we have way more getter functions and this particular issue came from `GetLatestBlockHeight`. The closest thing to real-world testing would be including this fix in a release candidate -> deploying it to a testnet and seeing if we hit any timeout issues. Open to suggestions regarding performance-wise testing.
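On the performance-testing question, a hedged sketch of a micro-benchmark that could approximate the cost of serialization without a real Electrum server; the simulated round-trip delay and all names here are hypothetical.

```go
package electrumconn_test

import (
	"sync"
	"testing"
	"time"
)

// BenchmarkSerializedRequests funnels parallel callers through a single
// exclusive mutex, with fakeCall standing in for a network round-trip.
func BenchmarkSerializedRequests(b *testing.B) {
	var mu sync.Mutex
	fakeCall := func() { time.Sleep(5 * time.Millisecond) } // simulated server round-trip

	b.RunParallel(func(pb *testing.PB) {
		for pb.Next() {
			mu.Lock()
			fakeCall()
			mu.Unlock()
		}
	})
}
```

Running it with `go test -bench .` and varying `-cpu` would show throughput staying flat as parallelism grows, since all callers queue behind the single lock.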
Agree regarding the `client` instance itself - we rarely reconnect (and thus replace this instance) and mostly use it to perform network calls to the Electrum server. But my whole point was actually about the performance of those network calls. So far, with the read mutex, multiple client goroutines could use the `client` instance to make multiple network calls at the same time. Now, only one client goroutine can do so. Others have to wait for their turn. That means the last goroutines in the queue may wait too long if their predecessors stumble upon longer response times from the Electrum server (or experience other network problems). This can cause cascading failures, as those client goroutines may have their own timeouts. This is why I'm raising the performance concern.

That said, I'm wondering whether we can limit the impact of this fix. As you said (and this is my reasoning as well), the problem is related only to `GetLatestBlockHeight`; specifically, `client.SubscribeHeaders` is problematic. Can't we just wrap the `client.SubscribeHeaders` call with a mutex instead of wrapping the whole `requestWithRetry` function? This way, only one function is affected and we can easily focus on it during real-world tests. WDYT?
Hm, I'd also like to limit the impact of this fix, but limiting this mutex to `GetLatestBlockHeight` might not be enough and we could observe the same issue in other getters. The thing is that most of the getters hit `s.transport.SendMessage(bytes)` at some point, and if the transport layer is set to `ws`, we hit the following code https://github.com/keep-network/go-electrum/blob/master/electrum/transport_ws.go#L91 and here we might have another occurrence of this issue. I suspect that we observe the problem only for `GetLatestBlockHeight` because this function is hit more frequently than the other getters in a certain flow. Note that this `concurrent write to websocket connection` error doesn't happen very frequently, but when it does, it affects the client uptime.

Having said that, I guess there's a tradeoff we need to make. Limiting the scope of this mutex has a lower impact on performance, but the error might pop up in a different function. Applying the fix with a wider scope might affect performance, but no error would occur. The consequence of this error was lowering the uptime bar to 50%, because it affected the rewards distribution. Obviously, stakers wouldn't mind a client being slightly slower, as long as they don't lose the rewards. Also, I agree with your point that we want to avoid timeout issues; that might be a concern, but we don't know if it is. I also suspect that our client has some slack for cases like this and the timeouts are not balancing on the edge. I'd propose releasing a client with the wider-scope mutex fix and seeing if we hit any timeouts in a testing env. Thoughts?
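For reference, the quoted error matches the panic gorilla/websocket raises when two goroutines write to the same connection at once, since that kind of connection supports at most one concurrent writer. A sketch of how a transport could serialize writes on its own; this is an illustration, not the go-electrum code linked above.

```go
package wstransport

import (
	"sync"

	"github.com/gorilla/websocket"
)

// wsTransport is an illustration: a gorilla/websocket connection supports
// at most one concurrent writer, so guarding the write with a
// per-connection mutex is another place where the
// "concurrent write to websocket connection" panic could be prevented.
type wsTransport struct {
	conn    *websocket.Conn
	writeMu sync.Mutex // serializes all writers on this single connection
}

func (t *wsTransport) SendMessage(body []byte) error {
	t.writeMu.Lock()
	defer t.writeMu.Unlock()

	return t.conn.WriteMessage(websocket.TextMessage, body)
}
```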
Ok, this is the missing piece I was not aware of. Thanks for clarifying that!

That makes sense now, but there is one thing I would add in this case: better observability for those calls. It may happen that we won't detect anything on testnet, but this problem will manifest in the future and we won't have an easy way to figure out whether it is actually caused by throttled calls to the Electrum server. We can, for example, simply wrap all `electrum.Connection` methods with a logger that displays `starting [method_name] call to Electrum server` at the beginning and then `completed [method_name] to Electrum server with [success/error] after [time]` at the end. It will be really helpful to have such info logs next to timeout errors if they ever occur. That's just a proposition, so I'm leaving it up to you, but I think it will greatly facilitate future debugging.
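A minimal sketch of such a logging wrapper, assuming the standard library logger and treating the wrapped method abstractly; the function type and names are placeholders, not the actual `electrum.Connection` API.

```go
package electrumobs

import (
	"context"
	"log"
	"time"
)

// latestBlockHeightFn is a stand-in for one electrum.Connection method.
type latestBlockHeightFn func(ctx context.Context) (uint, error)

// withCallLogging wraps a single call with the proposed start/completion
// log lines, recording the outcome and the elapsed time.
func withCallLogging(name string, fn latestBlockHeightFn) latestBlockHeightFn {
	return func(ctx context.Context) (uint, error) {
		log.Printf("starting [%s] call to Electrum server", name)
		start := time.Now()

		result, err := fn(ctx)

		outcome := "success"
		if err != nil {
			outcome = "error"
		}
		log.Printf(
			"completed [%s] call to Electrum server with [%s] after [%s]",
			name, outcome, time.Since(start),
		)

		return result, err
	}
}
```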
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, good idea. Mind if I open this in a separate PR? This one is for fixing an existing error; the other PR will be about improving observability. Both will be included in the next release.
Sure! Leaving this discussion open for future reference.