Wrapping request client calls with a read/write mutex #3701
Conversation
It happens that the keep-core client panics when interacting with the go-electrum library. We noticed a "concurrent write to websocket connection" error thrown when calling the GetLatestBlockHeight() function. The stack trace leads to gorilla/websocket/WriteMessage, which is called from the go-electrum client. The latest block height call was already wrapped with a read mutex, but we should wrap it with a write mutex as well to prevent such concurrency errors.
In this test we spin up 20 goroutines that hit electrum's client.SubscribeHeaders() concurrently. In commit cab1168 we introduced an RW lock around the latest-block-height request. Without the W lock, the test would have failed with a 'concurrent write to websocket connection' panic; if there's no panic, the W lock works. 20 goroutines were always enough to trigger the panic with only the R lock.
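For reference, here is a minimal sketch of what such a test can look like. newTestConnection is a hypothetical helper standing in for however the suite builds a connection to a test Electrum server, and the GetLatestBlockHeight signature is assumed from the discussion above:

```go
package electrum_test

import (
	"sync"
	"testing"
)

// Sketch of the concurrency test: 20 goroutines issue the request at once.
// With only an RLock around the request this reliably reproduced the
// "concurrent write to websocket connection" panic; with the write lock the
// test must finish without panicking.
func TestGetLatestBlockHeight_Concurrent(t *testing.T) {
	// Hypothetical helper that dials a test Electrum server.
	connection := newTestConnection(t)

	var wg sync.WaitGroup
	for i := 0; i < 20; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			// An unrecovered panic in any goroutine fails the whole test run,
			// so reaching wg.Wait() means the lock prevented concurrent writes.
			if _, err := connection.GetLatestBlockHeight(); err != nil {
				t.Error(err)
			}
		}()
	}
	wg.Wait()
}
```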
```diff
@@ -771,9 +771,9 @@ func requestWithRetry[K interface{}](
 	requestCtx, requestCancel := context.WithTimeout(ctx, c.config.RequestTimeout)
 	defer requestCancel()

-	c.clientMutex.RLock()
+	c.clientMutex.Lock()
```
This effectively creates a semaphore that allows just one request at a time. In theory, this can affect performance if multiple goroutines try to make multiple requests simultaneously. This, in turn, can cause problems in higher layers where some timeouts may be hit unexpectedly. Have we done any benchmarks against real Electrum servers to measure the performance impact of that change?
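To make the concern concrete: with sync.RWMutex, RLock lets many callers hold the lock in parallel, while Lock serializes them, so the change effectively turns the client into a one-request-at-a-time queue. A standalone illustration (not keep-core code):

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

func main() {
	var mu sync.RWMutex
	var wg sync.WaitGroup

	start := time.Now()
	for i := 0; i < 5; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			// With RLock all five simulated requests overlap and the program
			// finishes in roughly 100ms. Replace RLock/RUnlock with
			// Lock/Unlock and they run one by one, taking roughly 500ms.
			mu.RLock()
			defer mu.RUnlock()
			time.Sleep(100 * time.Millisecond) // simulated network call
		}()
	}
	wg.Wait()
	fmt.Println("elapsed:", time.Since(start))
}
```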
I ran the integration tests and system tests, no performance tests. I'm not sure if we have any benchmarks/performance-focused tests (please correct me if I'm wrong). Yes, it might affect performance, but I guess only when we expect a lot of write requests at the same time. I'm not familiar enough with the flows to answer this off the top of my head. FWIW we had a read mutex already, and I'd expect we read more than we write, based on the fact that we have way more getter functions and this particular issue came from GetLatestBlockHeight.
The closest thing to real-world testing would be including this fix in a release candidate, deploying it to a testnet, and seeing if we hit any timeout issues. I'm open to suggestions regarding performance testing.
Agree regarding the client instance itself - we rarely reconnect (thus replace this instance) and mostly use it to perform network calls to the Electrum server. But my whole point was actually about the performance of those network calls. So far, with the read mutex, multiple client goroutines could use the client instance to make multiple network calls at the same time. Now, only one client goroutine can do so; the others have to wait for their turn. That means the last goroutines in the queue may wait too long if their predecessors stumble upon longer response times from the Electrum server (or experience other network problems). This can cause cascading failures, as those client goroutines may have their own timeouts. This is why I'm raising the performance concern.
That said, I'm wondering whether we can limit the impact of this fix. You said (and this is my reasoning as well) that the problem is related only to GetLatestBlockHeight; specifically, client.SubscribeHeaders is problematic. Can't we just wrap the client.SubscribeHeaders call with a mutex instead of wrapping the whole requestWithRetry function? This way, only one function is affected and we can easily focus on it during real-world tests. WDYT?
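A rough sketch of that narrower alternative, assuming a dedicated mutex field on keep-core's Connection struct; the field and method names below are made up for illustration, and go-electrum's SubscribeHeaders(ctx) signature is assumed:

```go
package electrum

import (
	"context"
	"sync"

	"github.com/checksum0/go-electrum/electrum" // import path assumed
)

// Connection is reduced here to the single field relevant for this sketch;
// subscribeHeadersMutex is a hypothetical dedicated mutex.
type Connection struct {
	subscribeHeadersMutex sync.Mutex
}

// subscribeHeaders guards only the call that performs the websocket write on
// subscription, instead of serializing every request in requestWithRetry.
func (c *Connection) subscribeHeaders(
	ctx context.Context,
	client *electrum.Client,
) (<-chan *electrum.SubscribeHeadersResult, error) {
	c.subscribeHeadersMutex.Lock()
	defer c.subscribeHeadersMutex.Unlock()

	return client.SubscribeHeaders(ctx)
}
```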
Hm, I'd also like to limit the impact of this fix, but limiting this mutex to GetLatestBlockHeight only might not be enough and we could observe the same issue in other getters. The thing is that most of the getters hit the s.transport.SendMessage(bytes) at some point and if the transport layer is set to ws then we hit the following code https://github.com/keep-network/go-electrum/blob/master/electrum/transport_ws.go#L91 and here we might have another occurrence of this issue. I suspect that we observe the problem only for GetLatestBlockHeight because this function is hit more frequently than the other getters in a certain flow. Note that this concurrent write to websocket connection error does not happen very frequently, but when it does, it affects the client uptime.
So having said that, I guess there's a tradeoff we need to make. Limiting the scope of this mutex has a lower impact on performance, but the error might pop up in a different function. Leaving the fix with a wider scope might affect performance, but no error would occur. The consequence of this error was lowering the uptime bar to 50%, because it affected the rewards distribution. Obviously, stakers wouldn't mind a slightly slower client, but they would mind losing the rewards. Also, I agree with your point that we want to avoid timeout issues; they might be a concern, but we don't know if they are. I also suspect that our client has some slack for cases like this and the timeouts are not balancing on the edge. I'd propose releasing a client with a wider scope mutex fix and seeing if we hit any timeouts in a testing env. Thoughts?
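For completeness, the other place the write could be serialized is inside go-electrum's websocket transport itself, around the WriteMessage call that the link above points to. A sketch of that option, where wsTransport is a stand-in for the library's actual transport type rather than its real name:

```go
package electrum

import (
	"sync"

	"github.com/gorilla/websocket"
)

// wsTransport is a stand-in for go-electrum's websocket transport; the real
// type lives in electrum/transport_ws.go.
type wsTransport struct {
	conn    *websocket.Conn
	writeMu sync.Mutex // guards conn.WriteMessage
}

// SendMessage serializes writes because gorilla/websocket allows at most one
// concurrent writer per connection.
func (t *wsTransport) SendMessage(body []byte) error {
	t.writeMu.Lock()
	defer t.writeMu.Unlock()

	return t.conn.WriteMessage(websocket.TextMessage, body)
}
```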
The thing is that most of the getters hit the s.transport.SendMessage(bytes) at some point and if the transport layer is set to ws then we hit the following code https://github.com/keep-network/go-electrum/blob/master/electrum/transport_ws.go#L91 and here we might have another occurrence of this issue
Ok, this is the missing piece I was not aware of. Thanks for clarifying that!
I'd propose releasing a client with a wider scope mutex fix and seeing if we hit any timeouts in a testing env.
That makes sense now, but there is one thing I would add in this case: better observability for those calls. It may happen that we won't detect anything on testnet but the problem will manifest in the future, and we won't have an easy way to figure out whether it is actually caused by throttled calls to the Electrum server. We can, for example, simply wrap all electrum.Connection methods with a logger that displays "starting [method_name] call to Electrum server" at the beginning and "completed [method_name] to Electrum server with [success/error] after [time]" at the end. It will be really helpful to have such info logs next to timeout errors if they ever occur. That's just a suggestion, so I'm leaving it up to you, but I think it will greatly facilitate future debugging.
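A sketch of what such a wrapper could look like; the helper name, the generic shape, and the infoLogger interface are made up for illustration, and keep-core would plug in its own logger at the call sites:

```go
package electrum

import "time"

// infoLogger is a minimal stand-in for whatever logger keep-core already uses.
type infoLogger interface {
	Infof(format string, args ...interface{})
}

// requestWithLogging is a hypothetical helper that wraps a single Electrum
// call with the proposed start/finish log lines and a duration measurement.
func requestWithLogging[T any](
	logger infoLogger,
	methodName string,
	call func() (T, error),
) (T, error) {
	logger.Infof("starting [%s] call to Electrum server", methodName)

	start := time.Now()
	result, err := call()
	elapsed := time.Since(start)

	if err != nil {
		logger.Infof(
			"completed [%s] to Electrum server with [error: %v] after [%s]",
			methodName, err, elapsed,
		)
	} else {
		logger.Infof(
			"completed [%s] to Electrum server with [success] after [%s]",
			methodName, elapsed,
		)
	}

	return result, err
}
```

Each electrum.Connection getter would then route its underlying request through this helper, e.g. requestWithLogging(logger, "GetLatestBlockHeight", ...).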
Yes, good idea. Mind if I address this in a separate PR? This one fixes an existing error; another PR will be about improving observability. Both will be included in the next release.
Sure! Leaving this discussion open for future reference.
Commenting out this test because, as a side effect, it exposed another known issue that now surfaces very often. It's becoming annoying having red builds because of this test. It should be uncommented once the fix for checksum0/go-electrum#10 is in place.
Refs the discussion #3701 (comment). Electrum client calls are sometimes prone to timeouts or other issues. For debugging purposes we wrap Electrum client calls with a logger and measure the time it takes to complete a call.
Refs #3699
Depends on #3704
It happens that the keep-core client panics when interacting with the go-electrum library. We noticed a "concurrent write to websocket connection" error was thrown when calling the GetLatestBlockHeight() function. The stack trace leads to gorilla/websocket/WriteMessage, which is called from the go-electrum library. The latest block height call was already wrapped with a read mutex, but we should also wrap it with the write mutex to prevent such concurrent errors.