xdsclient: stop caching xdsChannels for potential reuse, after all references are released #7924
Conversation
Codecov Report
Attention: Patch coverage is

Additional details and impacted files

@@            Coverage Diff             @@
##           master    #7924      +/-   ##
==========================================
- Coverage   81.98%   81.96%   -0.03%
==========================================
  Files         377      379       +2
  Lines       38180    38230      +50
==========================================
+ Hits        31302    31334      +32
- Misses       5571     5583      +12
- Partials     1307     1313       +6
Overall LGTM. A few minor comments.
xds/internal/xdsclient/authority.go
Outdated
@@ -687,7 +694,7 @@ func (a *authority) unwatchResource(rType xdsresource.Type, resourceName string,
 delete(state.watchers, watcher)
 if len(state.watchers) > 0 {
 if a.logger.V(2) {
-a.logger.Infof("%d more watchers exist for type %q, resource name %q", rType.TypeName(), resourceName)
+a.logger.Infof("More watchers exist for type %q, resource name %q", rType.TypeName(), resourceName)
"Watchers still exist for type %q, resource name %q"? considering that "more" is applicable with quantitative comparison
Done.
 // 3. The same resource is requested again. The test verifies that the request
-// is sent with the previously ACKed version.
+// is sent with an empty version string.
It looks like you are verifying this by ensuring that the re-request is identical to the first request. Should we mention that here?
Done.
}
streamRequestCh.Drain()
Why do we have to drain? It may be good to add a comment to explain.
Added a comment, and slightly reworked the test since I saw another flake on GitHub Actions.
we drain the request channel here so that if an empty discovery request was received, it is pulled out of the request channel
Based on the PR description, I thought this should never happen anymore?
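
As an aside, here is a minimal, self-contained sketch of what this kind of drain guards against. A plain Go channel stands in for the test's stream request channel (the actual test helpers are not shown here), and the "empty discovery request" is just an illustrative string:

package main

import "fmt"

func main() {
	// Stand-in for the test's stream request channel. Depending on timing,
	// an empty discovery request may or may not have been pushed onto it by
	// the time the last watch is canceled.
	streamRequestCh := make(chan string, 8)
	streamRequestCh <- "empty discovery request" // possibly-present request

	// Drain whatever happens to be buffered so that later assertions start
	// from a clean channel and are not confused by the optional request.
	for {
		select {
		case req := <-streamRequestCh:
			fmt.Println("drained:", req)
		default:
			fmt.Println("channel is empty")
			return
		}
	}
}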
// containing both the old good update and the latest NACK error. The test
// verifies that when a new watch is registered for the same resource, the new
// watcher receives the good update followed by the NACK error.
func (s) TestLDSWatch_ResourceCaching_NACKError(t *testing.T) {
So, we are not verifying whether the update is coming from the resource cache?
This is the scenario being tested:
- watcher1 registers a watch for resource1, and gets a response at version1
- version1 is ACKed
- management server then sends resource1 at version2
- version2 is NACKed
- watcher2 registers a watch for resource1
- if this watch is not handled from the cache and instead goes to the management server, watcher2 will only get the response at version2. Since we are verifying that this watcher gets the good update at version1 followed by the error at version2, we are indirectly verifying that the watch was handled from the cache (a toy sketch of this cache-replay behavior follows the list).
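
The sketch below is a toy model of that caching behavior, not the actual xDS client code; all types and names are made up for illustration. The cache remembers both the last good update and the latest NACK error, and replays both to a watcher that registers later, without sending a new request to the management server:

package main

import (
	"errors"
	"fmt"
)

// watcher is a stand-in for an xDS resource watcher.
type watcher struct{ name string }

func (w watcher) OnUpdate(u string) { fmt.Printf("%s: good update %q\n", w.name, u) }
func (w watcher) OnError(e error)   { fmt.Printf("%s: error: %v\n", w.name, e) }

// cacheEntry is a stand-in for the client's cached state for one resource.
type cacheEntry struct {
	goodUpdate string // last ACKed version of the resource (version1)
	nackErr    error  // latest NACK error, if any (version2)
}

func main() {
	entry := cacheEntry{
		goodUpdate: "resource1@version1",
		nackErr:    errors.New("resource1@version2 NACKed"),
	}

	// watcher2 registers after the NACK. It is served from the cache, so it
	// sees the good update followed by the NACK error; no new request goes
	// out to the management server.
	w2 := watcher{name: "watcher2"}
	w2.OnUpdate(entry.goodUpdate)
	if entry.nackErr != nil {
		w2.OnError(entry.nackErr)
	}
}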
if err != nil {
	t.Fatalf("Failed to receive new connection from wrapped listener: %v", err)
}
conn := val.(*testutils.ConnWrapper)
Would this type assertion always succeed? If not, should we use conn, ok and check the assertion actually succeeded before continuing? Ditto for all other newly added type assertions.
This type assertion would always succeed because we pass a listener wrapper created in line 818 with the call to testutils.NewListenerWrapper. net.Conn instances returned by this listener are always of type testutils.ConnWrapper. See: https://github.com/grpc/grpc-go/blob/master/internal/testutils/wrappers.go#L27
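
For reference, a self-contained sketch of the checked form being suggested; connWrapper below is a stand-in for testutils.ConnWrapper, and since the wrapped listener guarantees the concrete type, the unchecked form used in the test is also safe:

package main

import (
	"fmt"
	"net"
)

// connWrapper stands in for testutils.ConnWrapper in this sketch.
type connWrapper struct{ net.Conn }

func main() {
	var val any = &connWrapper{}

	// The two-value form of the type assertion avoids a panic if the
	// concrete type is ever something other than *connWrapper.
	conn, ok := val.(*connWrapper)
	if !ok {
		fmt.Printf("unexpected connection type %T\n", val)
		return
	}
	fmt.Printf("got connection of type %T\n", conn)
}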
-// Wait for the CDS resource to be not requested anymore.
+// Wait for the CDS resource to be not requested anymore, or the connection
+// to the management server to be closed (which happens as part of the last
+// resource watch being canceled).
Does the order of the connection being closed and the CDS resource watch being canceled matter here (and would that need to be reflected in the tests)? My understanding is yes: when the last watch is canceled (comes first), the connection/channel then gets closed (comes second), and then we may or may not see an empty discovery request.
When the last watch is canceled by calling the cancel function returned from WatchCluster, the xDS client implementation first unsubscribes from the resource. This is done by invoking the unsubscribe method on the underlying xdsChannel, which represents the connection to the management server. After this, if the xDS client implementation determines that there are no more resource watches on this management server, it closes the corresponding xdsChannel.

While the closing of the xdsChannel is handled synchronously, by ensuring that the channel is actually closed and all Go resources (and goroutines) allocated by it are freed by the time close returns, the unsubscription of a resource is done asynchronously. That is the reason why the order of these two events is not guaranteed.
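
Since the order is not guaranteed, the test has to be satisfied by whichever event it observes first. A minimal sketch of that wait-for-either pattern, with plain channels standing in for the two signals (the actual test uses the management server's request channel and the wrapped connection, not shown here):

package main

import (
	"context"
	"fmt"
	"time"
)

func main() {
	// Stand-ins for the two signals: either a discovery request without the
	// resource is seen, or the connection to the management server is
	// closed. Unsubscription is asynchronous while the channel close is
	// synchronous, so either may happen first.
	resourceUnrequestedCh := make(chan struct{}, 1)
	connClosedCh := make(chan struct{}, 1)

	go func() { connClosedCh <- struct{}{} }()

	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()
	select {
	case <-resourceUnrequestedCh:
		fmt.Println("resource is no longer requested")
	case <-connClosedCh:
		fmt.Println("connection to the management server was closed")
	case <-ctx.Done():
		fmt.Println("timed out waiting for either event")
	}
}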
// state can only be accessed in the context of an
// xdsClientSerializer callback. Hence making a copy of the error
// here for watchCallbackSerializer.
err := state.md.ErrState.Err
For my understanding, why would we need to make a copy of the error for the callback serializer? What is the consequence of just using state.md.ErrState.Err directly as previously? Is it for easier testing?
state must only be accessed in the context of an xdsClientSerializer callback. The serializer ensures that only one callback is active at any given time, thereby guaranteeing mutual exclusion, and therefore we don't need to guard accesses to state with a mutex. See: https://github.com/grpc/grpc-go/blob/master/xds/internal/xdsclient/authority.go#L90

If you observe closely, the call to watcher.OnError does not happen in the context of an xdsClientSerializer callback. It happens in the context of a watcherCallbackSerializer callback. So, the race detector was complaining about a data race. Making a copy of the error solves the issue. If you look at line 646, the same is done when watcher.OnUpdate is called.
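
A self-contained sketch of the pattern being described: copy the field into a local while still inside the context where touching state is allowed, and let the closure scheduled on the other serializer capture only the copy. The serializer type here is a toy stand-in, not the serializer actually used by the xDS client:

package main

import (
	"errors"
	"fmt"
	"sync"
)

// serializer is a toy stand-in for a callback serializer: callbacks run one
// at a time, on their own goroutine.
type serializer struct {
	ch chan func()
	wg sync.WaitGroup
}

func newSerializer() *serializer {
	s := &serializer{ch: make(chan func(), 16)}
	s.wg.Add(1)
	go func() {
		defer s.wg.Done()
		for f := range s.ch {
			f()
		}
	}()
	return s
}

func (s *serializer) Schedule(f func()) { s.ch <- f }
func (s *serializer) Close()            { close(s.ch); s.wg.Wait() }

// resourceState is a toy stand-in for the per-resource state that must only
// be accessed from the client serializer's callbacks.
type resourceState struct {
	err error
}

func main() {
	state := &resourceState{err: errors.New("NACK: resource rejected")}
	watcherSerializer := newSerializer()

	// Copy the error into a local variable here, where accessing state is
	// safe. The closure below runs on a different serializer, so reading
	// state.err inside it could race with later writes to state.
	err := state.err
	watcherSerializer.Schedule(func() { fmt.Println("watcher.OnError:", err) })

	watcherSerializer.Close()
}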
xds/internal/xdsclient/clientimpl.go
Outdated
c.channelsMu.Unlock()
}))
c.channelsMu.Unlock()
state.channel.close()
To confirm, is it thread-safe to access the current channel state object without the lock?
If we want to reason about it, it does look like it is a safe thing to do, because state.channel is written to when the state instance is created and then never changed, which is probably why the race detector also did not complain about it.

But I've gone ahead and made a copy of the pointer to the underlying channel before releasing the lock. Thanks.
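
For completeness, a toy sketch of the shape this ends up with; the names below are illustrative stand-ins, not the actual clientimpl.go code. The pointer to the channel is copied while the lock is held, and close is called after the lock is released:

package main

import (
	"fmt"
	"sync"
)

// xdsChannelStub stands in for the real xdsChannel.
type xdsChannelStub struct{ server string }

func (c *xdsChannelStub) close() { fmt.Println("closing channel to", c.server) }

type channelState struct {
	channel *xdsChannelStub // written once at creation, never reassigned
}

type clientStub struct {
	channelsMu sync.Mutex
	channels   map[string]*channelState
}

func (c *clientStub) releaseChannel(server string) {
	c.channelsMu.Lock()
	state := c.channels[server]
	delete(c.channels, server)
	// Copy the pointer while the lock is held ...
	channel := state.channel
	c.channelsMu.Unlock()

	// ... and close the channel outside the lock, so that a potentially
	// blocking close does not hold up other operations on the map.
	channel.close()
}

func main() {
	c := &clientStub{channels: map[string]*channelState{
		"xds-server": {channel: &xdsChannelStub{server: "xds-server"}},
	}}
	c.releaseChannel("xds-server")
}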
LGTM. One follow-up question about the empty request and channel draining.
}
streamRequestCh.Drain()
we drain the request channel here so that if an empty discovery request was received, it is pulled out of the request channel
Based on the PR description, I thought this should never happen anymore?
This is from the PR description: "This change does lead to a minor behavior change. Earlier when the last watch on an xdsChannel was being canceled, we would see a discovery request with no resources specified in it. But with the current change, when the last watch on an xdsChannel is being canceled, the channel itself would get closed. Hence we might not see the empty discovery request on the wire." The above paragraph does use "might" to indicate the uncertainty. Is that not clear? If so, let me know how you want me to change it.
LGTM (I do not have permission to approve in GitHub)! Thanks for the detailed explanation; this really helped me understand the gRPC Go xDS stack at a deeper level, and I can see the parallels with the gRPC Java xDS stack.
Prior to this PR, xdsChannels were cached for a configurable amount of time after all references to them were released. This was in the hope that an xdsChannel might get used again very soon.
But with fallback support, we want channels to lower-priority servers to be closed as soon as a higher-priority server comes up. If we ever want to bring back support for caching unused channels, it can be implemented in the transport builder when someone asks for it.
Link to internal discussion: https://chat.google.com/room/AAAAbkw9L3c/oa6GgM1MPlk
This change does lead to a minor behavior change. Earlier when the last watch on an xdsChannel was being canceled, we would see a discovery request with no resources specified in it. But with the current change, when the last watch on an xdsChannel is being canceled, the channel itself would get closed. Hence we might not see the empty discovery request on the wire.
Summary of test changes:
- TestClose in cdsbalancer_test.go was verifying that an empty discovery request was being sent out when the LB policy was closed. This does not happen anymore. So, the test is changed to only verify that the child policy is closed when the parent is closed.
- TestADS_ACK_NACK_ResourceIsNotRequestedAnymore in ads_stream_ack_nack_test.go:
- TestADS_ResourcesAreRequestedAfterStreamRestart in ads_stream_restart_test.go:
- xds/internal/xdsclient/tests/authority_test.go
- TestLDSWatch_ResourceCaching_NACKError in lds_watchers_test.go was being flaky because it was validating that exactly two discovery requests were being sent. But since the resource in that test was being NACKed, the management server will keep resending the same resource and therefore the client will keep NACKing it.

RELEASE NOTES: