Wrapping request client calls with a read/write mutex #3701

Merged · 7 commits · Sep 14, 2023
8 changes: 4 additions & 4 deletions pkg/bitcoin/electrum/electrum.go
@@ -34,7 +34,7 @@ var (
type Connection struct {
parentCtx context.Context
client *electrum.Client
- clientMutex *sync.RWMutex
+ clientMutex *sync.Mutex
config Config
}

@@ -59,7 +59,7 @@ func Connect(parentCtx context.Context, config Config) (bitcoin.Chain, error) {
c := &Connection{
parentCtx: parentCtx,
config: config,
- clientMutex: &sync.RWMutex{},
+ clientMutex: &sync.Mutex{},
}

if err := c.electrumConnect(); err != nil {
@@ -771,9 +771,9 @@ func requestWithRetry[K interface{}](
requestCtx, requestCancel := context.WithTimeout(ctx, c.config.RequestTimeout)
defer requestCancel()

- c.clientMutex.RLock()
+ c.clientMutex.Lock()
lukasz-zimnoch marked this conversation as resolved.
Member

This effectively creates a semaphore that allows just one request at a time. In theory, this can affect performance if multiple goroutines try to make multiple requests simultaneously. This, in turn, can cause problems in higher layers, where some timeouts may be hit unexpectedly. Have we done any benchmarks against real Electrum servers to measure the performance impact of that change?

Contributor Author

I ran the integration tests and system tests, but no performance tests. I'm not sure if we have any benchmarks or performance-focused tests (please correct me if I'm wrong). Yes, it might affect performance, but I guess only when we expect a lot of write requests at the same time. I'm not familiar enough with the flows to answer this off the top of my head. FWIW, we already had a read mutex, and I'd expect us to read more than we write, given that we have far more getter functions and that this particular issue came from GetLatestBlockHeight.

The closest thing to real-world testing would be including this fix in a release candidate, deploying it to a testnet, and seeing if we hit any timeout issues. I'm open to suggestions regarding performance testing.

Member

Agreed regarding the client instance itself - we rarely reconnect (and thus replace this instance) and mostly use it to perform network calls to the Electrum server. But my whole point was actually about the performance of those network calls. So far, with the read mutex, multiple client goroutines could use the client instance to make multiple network calls at the same time. Now, only one client goroutine can do so. Others have to wait for their turn. That means the last goroutines in the queue may wait too long if their predecessors stumble upon longer response times from the Electrum server (or experience other network problems). This can cause cascading failures, as those client goroutines may have their own timeouts. This is why I'm raising the performance concern.

That said, I'm wondering whether we can limit the impact of this fix. You said (and this is my reasoning as well) that the problem is related only to GetLatestBlockHeight; specifically, client.SubscribeHeaders is problematic. Can't we just wrap the client.SubscribeHeaders call with a mutex instead of wrapping the whole requestWithRetry function? This way, only one function is affected and we can easily focus on it during real-world tests. WDYT?
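To make that alternative concrete, here is a minimal sketch of what guarding only the SubscribeHeaders call could look like. It is not the actual keep-core implementation: the function shape and the way the header channel is consumed are assumptions (it assumes go-electrum's SubscribeHeaders returns a channel of header notifications carrying a Height field); only the client and mutex types mirror the diff above.

```go
package electrum

import (
	"context"
	"fmt"
	"sync"

	"github.com/checksum0/go-electrum/electrum"
)

// latestBlockHeight sketches the narrower fix: only the SubscribeHeaders
// call is serialized; other client calls stay unguarded.
func latestBlockHeight(
	ctx context.Context,
	client *electrum.Client,
	clientMutex *sync.Mutex,
) (uint, error) {
	clientMutex.Lock()
	headersChan, err := client.SubscribeHeaders(ctx)
	clientMutex.Unlock()
	if err != nil {
		return 0, fmt.Errorf("failed to subscribe for headers: [%w]", err)
	}

	// The first notification on the channel describes the current chain tip.
	tip := <-headersChan
	if tip.Height < 0 {
		return 0, fmt.Errorf("unexpected block height: [%d]", tip.Height)
	}

	return uint(tip.Height), nil
}
```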

Contributor Author
@dimpar Sep 13, 2023

Hm, I'd also like to limit the impact of this fix, but limiting this mutex only to GetLatestBlockHeight might not be enough, and we could observe the same issue in other getters. The thing is that most of the getters hit s.transport.SendMessage(bytes) at some point, and if the transport layer is set to ws, we hit the following code https://github.com/keep-network/go-electrum/blob/master/electrum/transport_ws.go#L91, where we might have another occurrence of this issue. I suspect that we observe the problem only for GetLatestBlockHeight because this function is hit more frequently than the other getters in a certain flow. Note that this concurrent write to websocket connection error doesn't happen very frequently, but when it does, it affects the client uptime.

So having said that, I guess there's a tradeoff we need to make. Limiting the scope of this mutex has a lower impact on performance, but the error might pop up in a different function. Keeping the fix at the wider scope might affect performance, but no error would occur. The consequence of this error was lowering the uptime bar to 50%, because it affected the rewards distribution. Obviously, stakers wouldn't mind a slightly slower client, but they would mind losing rewards. Also, I agree with your point that we want to avoid timeout issues; they might be a concern, but we don't know yet whether they are. I also suspect that our client has some slack for cases like this and the timeouts are not balancing on the edge. I'd propose releasing a client with the wider-scope mutex fix and seeing if we hit any timeouts in a testing env. Thoughts?
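For reference, the other place this panic could be addressed is the transport layer itself. Below is a hypothetical illustration; it assumes the ws transport is built on gorilla/websocket (which allows at most one concurrent writer) and is not go-electrum's actual code.

```go
package electrum

import (
	"sync"

	"github.com/gorilla/websocket"
)

// wsTransport is a hypothetical stand-in for a websocket-based transport.
type wsTransport struct {
	conn    *websocket.Conn
	writeMu sync.Mutex // serializes writers to the single connection
}

// SendMessage holds the write mutex around the write so that concurrent
// callers cannot trigger "concurrent write to websocket connection".
func (t *wsTransport) SendMessage(body []byte) error {
	t.writeMu.Lock()
	defer t.writeMu.Unlock()

	return t.conn.WriteMessage(websocket.TextMessage, body)
}
```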

Member

> The thing is that most of the getters hit s.transport.SendMessage(bytes) at some point, and if the transport layer is set to ws, we hit the following code https://github.com/keep-network/go-electrum/blob/master/electrum/transport_ws.go#L91, where we might have another occurrence of this issue

Ok, this is the missing piece I was not aware of. Thanks for clarifying that!

> I'd propose releasing a client with the wider-scope mutex fix and seeing if we hit any timeouts in a testing env.

That makes sense now, but there is one thing I would add in this case: better observability for those calls. It may happen that we won't detect anything on testnet, but the problem will manifest in the future and we won't have an easy way to figure out whether it is actually caused by throttled calls to the Electrum server. We can, for example, simply wrap all electrum.Connection methods with a logger that displays "starting [method_name] call to Electrum server" at the beginning and then "completed [method_name] call to Electrum server with [success/error] after [time]" at the end. It will be really helpful to have such info logs next to timeout errors if they ever occur. That's just a suggestion, so I'm leaving it up to you, but I think it will greatly facilitate future debugging.
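A rough sketch of such a wrapper is below. The helper name is made up and the standard library logger is used only to keep the example self-contained; keep-core has its own logging setup.

```go
package electrum

import (
	"log"
	"time"
)

// loggedCall logs when a call to the Electrum server starts and when it
// completes, together with the outcome and duration.
func loggedCall[K any](methodName string, callFn func() (K, error)) (K, error) {
	log.Printf("starting [%s] call to Electrum server", methodName)
	start := time.Now()

	result, err := callFn()

	outcome := "success"
	if err != nil {
		outcome = "error"
	}
	log.Printf(
		"completed [%s] call to Electrum server with [%s] after [%s]",
		methodName,
		outcome,
		time.Since(start),
	)

	return result, err
}
```

Each electrum.Connection method would then route its client call through a helper like this, passing its own method name.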

Contributor Author

Yes, good idea. Mind if I open this in a separate PR? This one is for fixing an existing error; the other PR will be about improving observability. Both will be included in the next release.

Member

Sure! Leaving this discussion open for future reference.

r, err := requestFn(requestCtx, c.client)
- c.clientMutex.RUnlock()
+ c.clientMutex.Unlock()

if err != nil {
return fmt.Errorf("request failed: [%w]", err)
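For context, below is a simplified, self-contained sketch of the request path after this change: the exclusive mutex is held around the client call, so requests to the Electrum server are fully serialized. The Config and Connection stand-ins are trimmed to what the diff shows, and the retry/backoff logic of the real requestWithRetry is omitted.

```go
package electrum

import (
	"context"
	"fmt"
	"sync"
	"time"

	"github.com/checksum0/go-electrum/electrum"
)

// Config and Connection are minimal stand-ins for the types in the diff,
// just enough to make the sketch compile.
type Config struct {
	RequestTimeout time.Duration
}

type Connection struct {
	parentCtx   context.Context
	client      *electrum.Client
	clientMutex *sync.Mutex
	config      Config
}

// request serializes every client call behind the exclusive mutex.
func request[K any](
	ctx context.Context,
	c *Connection,
	requestFn func(ctx context.Context, client *electrum.Client) (K, error),
) (K, error) {
	requestCtx, requestCancel := context.WithTimeout(ctx, c.config.RequestTimeout)
	defer requestCancel()

	c.clientMutex.Lock()
	result, err := requestFn(requestCtx, c.client)
	c.clientMutex.Unlock()

	if err != nil {
		return result, fmt.Errorf("request failed: [%w]", err)
	}

	return result, nil
}
```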
43 changes: 43 additions & 0 deletions pkg/bitcoin/electrum/electrum_integration_test.go
@@ -232,6 +232,49 @@ func TestGetTransactionConfirmations_Negative_Integration(t *testing.T) {
})
}

// TODO: We should uncomment this test once https://github.com/checksum0/go-electrum/issues/10
// is fixed. This test was added to validate the fix for the following issue:
// https://github.com/keep-network/keep-core/issues/3699 but at the same time it
// made `panic: assignment to entry in nil map` happen very frequently, which is
// disruptive during development and when running the existing integration tests.

// func TestGetLatestBlockHeightConcurrently_Integration(t *testing.T) {
// 	goroutines := 20

// 	for testName, testConfig := range testConfigs {
// 		t.Run(testName+"_get", func(t *testing.T) {
// 			electrum, cancelCtx := newTestConnection(t, testConfig.clientConfig)
// 			defer cancelCtx()

// 			var wg sync.WaitGroup

// 			for i := 0; i < goroutines; i++ {
// 				wg.Add(1)

// 				go func() {
// 					defer wg.Done()

// 					result, err := electrum.GetLatestBlockHeight()
// 					if err != nil {
// 						// t.Fatal must not be called from a goroutine other
// 						// than the one running the test; report the error
// 						// with t.Error and return instead.
// 						t.Error(err)
// 						return
// 					}

// 					if result == 0 {
// 						t.Error("returned block height is 0")
// 					}
// 				}()
// 			}

// 			wg.Wait()
// 		})

// 		// Passed if no "panic: concurrent write to websocket connection"
// 	}
// }

func TestGetLatestBlockHeight_Integration(t *testing.T) {
expectedBlockHeightRef := map[string]uint{}
results := map[string]map[string]uint{}