
xdsclient: stop caching xdsChannels for potential reuse, after all references are released #7924

Merged

easwars merged 4 commits into grpc:master on Dec 13, 2024

Conversation

Contributor

@easwars easwars commented Dec 12, 2024

Prior to this PR, xdsChannels were cached for a configurable amount of time after all references to them were released. This was in the hope that an xdsChannel might get used again very soon.

But with fallback support, we want channels to lower-priority servers to be closed as soon as a higher-priority server comes up. If we ever want to bring back support for caching unused channels, it can be implemented in the transport builder when someone asks for it.

Link to internal discussion: https://chat.google.com/room/AAAAbkw9L3c/oa6GgM1MPlk

This change does lead to a minor behavior change. Earlier when the last watch on an xdsChannel was being canceled, we would see a discovery request with no resources specified in it. But with the current change, when the last watch on an xdsChannel is being canceled, the channel itself would get closed. Hence we might not see the empty discovery request on the wire.
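For reference, the "empty discovery request" mentioned above is an ADS DiscoveryRequest whose resource_names list is empty. A rough sketch of what the old test expectation looked like, assuming go-control-plane's v3 types (the exact assertions in the tests may differ):

import (
	v3discoverypb "github.com/envoyproxy/go-control-plane/envoy/service/discovery/v3"
)

// Unsubscription request the tests used to look for on the wire: same type
// URL as the watched resource, but no resource names. With this change, the
// xdsChannel may be closed before such a request is ever sent.
var wantUnsubscribeReq = &v3discoverypb.DiscoveryRequest{
	TypeUrl:       "type.googleapis.com/envoy.config.cluster.v3.Cluster",
	ResourceNames: nil,
}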

Summary of test changes:

  • TestClose in cdsbalancer_test.go was verifying that an empty discovery request was being sent out when the LB policy was closed. This does not happen anymore. So, the test is changed to only verify that the child policy is closed when the parent is closed.
  • TestADS_ACK_NACK_ResourceIsNotRequestedAnymore in ads_stream_ack_nack_test.go:
    • The test was checking that an empty discovery request was being sent when a previously requested resource was no longer being requested. This is being changed to verify that the underlying connection to the management server is closed instead.
    • Similarly, when a watch for the same resource is re-registered, the test earlier expected a request with the previously cached version number. Now, since the channel is deleted, the new watch results in a request with an empty version string.
  • TestADS_ResourcesAreRequestedAfterStreamRestart in ads_stream_restart_test.go:
    • Removed the check that looks for an empty discovery request.
  • xds/internal/xdsclient/tests/authority_test.go
    • One of the tests was specifically verifying that an xdsChannel was not being closed when the last watch was canceled. That test is changed to verify that the xdsChannel is being closed when the last watch is canceled.
    • A couple of tests are deleted because they no longer make sense.
  • TestLDSWatch_ResourceCaching_NACKError in lds_watchers_test.go was flaky because it validated that exactly two discovery requests were sent. But since the resource in that test is NACKed, the management server keeps resending the same resource and the client keeps NACKing it, so more than two requests can show up on the wire.

RELEASE NOTES:

  • TBD

@easwars easwars requested a review from purnesh42H December 12, 2024 02:07
@easwars easwars added the Type: Internal Cleanup Refactors, etc label Dec 12, 2024
@easwars easwars added this to the 1.70 Release milestone Dec 12, 2024
Contributor Author

@easwars easwars commented Dec 12, 2024

@danielzhaotongliu


codecov bot commented Dec 12, 2024

Codecov Report

Attention: Patch coverage is 76.19048% with 5 lines in your changes missing coverage. Please review.

Project coverage is 81.96%. Comparing base (c1b6b37) to head (53b8f9c).
Report is 3 commits behind head on master.

Files with missing lines               Patch %   Lines
xds/internal/xdsclient/authority.go     50.00%   2 Missing ⚠️
xds/internal/xdsclient/clientimpl.go    66.66%   2 Missing ⚠️
internal/testutils/channel.go           83.33%   1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #7924      +/-   ##
==========================================
- Coverage   81.98%   81.96%   -0.03%     
==========================================
  Files         377      379       +2     
  Lines       38180    38230      +50     
==========================================
+ Hits        31302    31334      +32     
- Misses       5571     5583      +12     
- Partials     1307     1313       +6     
Files with missing lines                      Coverage Δ
xds/internal/xdsclient/client_new.go          85.71% <100.00%> (-0.65%) ⬇️
xds/internal/xdsclient/client_refcounted.go   82.05% <100.00%> (ø)
internal/testutils/channel.go                 83.33% <83.33%> (ø)
xds/internal/xdsclient/authority.go           76.87% <50.00%> (+1.76%) ⬆️
xds/internal/xdsclient/clientimpl.go          74.84% <66.66%> (+0.36%) ⬆️

... and 19 files with indirect coverage changes

Contributor

@purnesh42H purnesh42H left a comment

Overall lgtm. Few minor comments.

@@ -687,7 +694,7 @@ func (a *authority) unwatchResource(rType xdsresource.Type, resourceName string,
delete(state.watchers, watcher)
if len(state.watchers) > 0 {
if a.logger.V(2) {
a.logger.Infof("%d more watchers exist for type %q, resource name %q", rType.TypeName(), resourceName)
a.logger.Infof("More watchers exist for type %q, resource name %q", rType.TypeName(), resourceName)
Contributor

"Watchers still exist for type %q, resource name %q"? considering that "more" is applicable with quantitative comparison

Contributor Author

Done.

// 3. The same resource is requested again. The test verifies that the request
-// is sent with the previously ACKed version.
+// is sent with an empty version string.
Contributor

It looks like you are verifying this by ensuring that the re-request is identical to the first request. We should mention that here?

Contributor Author

Done.

}
streamRequestCh.Drain()
Contributor

Why do we have to drain? It may be good to add a comment explaining this.

Contributor Author

Added a comment, and slightly reworked the test since I saw another flake on GitHub Actions.

Contributor

we drain the request channel here so that if an empty discovery request was received, it is pulled out of the request channel

Based on the PR description, I thought this should never happen anymore?

// containing both the old good update and the latest NACK error. The test
// verifies that when a new watch is registered for the same resource, the new
// watcher receives the good update followed by the NACK error.
func (s) TestLDSWatch_ResourceCaching_NACKError(t *testing.T) {
Contributor

@purnesh42H purnesh42H Dec 13, 2024

So, we are not verifying whether the update is coming from the resource cache?

Contributor Author

This is the scenario being tested:

  • watcher1 registers a watch for resource1, and gets a response at version1
  • version1 is ACKed
  • management server then sends resource1 at version2
  • version2 is NACKed
  • watcher2 registers a watch for resource1
    • if this watch is not served from the cache and instead goes to the management server, it will only get the response at version2. Since we are verifying that this watcher gets the good update at version1 followed by the error at version2, we are indirectly verifying that the watch was served from the cache.
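A rough sketch of that scenario as a test (registerWatch, verifyGoodUpdate and verifyNACKError below are hypothetical stand-ins for the actual test helpers, not grpc-go APIs):

// watcher1 registers a watch for resource1; the response at version1 is
// received and ACKed.
cancel1 := registerWatch(client, "resource1", watcher1)
verifyGoodUpdate(t, watcher1)
// The management server then sends resource1 at version2, which is invalid
// and gets NACKed; version1 remains in the cache.
verifyNACKError(t, watcher1)

// watcher2 registers a watch for the same resource. Because the watch is
// served from the xDS client's cache, watcher2 sees the good update at
// version1 followed by the NACK error for version2. A fresh request to the
// management server could only have produced the version2 response.
cancel2 := registerWatch(client, "resource1", watcher2)
verifyGoodUpdate(t, watcher2)
verifyNACKError(t, watcher2)
cancel1()
cancel2()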

@purnesh42H purnesh42H assigned easwars and unassigned purnesh42H Dec 13, 2024
@easwars easwars assigned purnesh42H and unassigned easwars Dec 13, 2024
if err != nil {
t.Fatalf("Failed to receive new connection from wrapped listener: %v", err)
}
conn := val.(*testutils.ConnWrapper)
Contributor

Would this type assertion always succeed? If not, should we use conn, ok and check the assertion actually succeeded before continuing?

Ditto for all other newly added type assertions.

Contributor Author

This type assertion will always succeed because we pass a listener wrapper created on line 818 with the call to testutils.NewListenerWrapper. net.Conn instances returned by this listener are always of type *testutils.ConnWrapper. See: https://github.com/grpc/grpc-go/blob/master/internal/testutils/wrappers.go#L27
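For reference, the defensive form suggested in the review would look roughly like this (a sketch, not the test's actual code):

conn, ok := val.(*testutils.ConnWrapper)
if !ok {
	t.Fatalf("Received connection of type %T, want *testutils.ConnWrapper", val)
}
_ = conn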

-// Wait for the CDS resource to be not requested anymore.
+// Wait for the CDS resource to be not requested anymore, or the connection
+// to the management server to be closed (which happens as part of the last
+// resource watch being canceled).
Contributor

Does the order of the connection being closed and the CDS resource watch being cancelled matter here (and would that need to be reflected in the tests)? My understanding is: the last watch is cancelled first, then the connection/channel gets closed, and then we may or may not see the empty discovery request.

Contributor Author

When the last watch is canceled by calling the cancel function returned from WatchCluster, the xDS client implementation first unsubscribes from the resource. This is done by invoking the unsubscribe method on the underlying xdsChannel, which represents the connection to the management server. After this, if the xDS client implementation determines that there are no more resource watches on this management server, it closes the corresponding xdsChannel.

While the closing of the xdsChannel is handled synchronously, by ensuring that the channel is actually closed and all Go resources (and goroutines) allocated by it are freed by the time close returns, the unsubscription of a resource is done asynchronously. That is why the order of these two events is not guaranteed.
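A sketch of how a test can tolerate that non-determinism (emptyRequestCh and connClosedCh are hypothetical plain channels fed by the test's fake management server and wrapped connection):

select {
case <-emptyRequestCh:
	// The async unsubscription won the race: an empty discovery request was
	// seen before the xdsChannel was torn down.
case <-connClosedCh:
	// The close won the race: the connection was closed before any empty
	// request made it onto the wire.
case <-ctx.Done():
	t.Fatal("Timeout waiting for an empty discovery request or a connection close")
}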

// state can only be accessed in the context of an
// xdsClientSerializer callback. Hence making a copy of the error
// here for watchCallbackSerializer.
err := state.md.ErrState.Err
Contributor

For my understanding, why would we need to make a copy of the error for the callback serializer? What is the consequence of just using state.md.ErrState.Err directly as previously? Is it for easier testing?

Contributor Author

state must only be accessed in the context of an xdsClientSerializer callback. The serializer ensures that only one callback is active at any given time, guaranteeing mutual exclusion, so we don't need to guard accesses to state with a mutex. See: https://github.com/grpc/grpc-go/blob/master/xds/internal/xdsclient/authority.go#L90

If you observe closely, the call to watcher.OnError does not happen in the context of an xdsClientSerializer callback. It happens in the context of a watcherCallbackSerializer callback. So, the race detector was complaining about a data race. Making a copy of the error solves the issue. If you look at line 646, the same is done when watcher.OnUpdate is called.
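In code, the pattern looks roughly like this (a sketch of the idea rather than the exact diff; the serializer method name and the onDone argument are assumptions based on the surrounding code):

// Still inside an xdsClientSerializer callback, so reading state is safe.
err := state.md.ErrState.Err
a.watcherCallbackSerializer.TrySchedule(func(context.Context) {
	// Only the copied err is captured by this closure; state itself is never
	// touched outside the xdsClientSerializer, so there is no data race.
	watcher.OnError(err, onDone)
})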

c.channelsMu.Unlock()
}))
c.channelsMu.Unlock()
state.channel.close()
Contributor

To confirm, is it thread-safe to access the current channel state object without the lock?

Contributor Author

If we want to reason about it, it does look like a safe thing to do, because state.channel is written when the state instance is created and never changed afterwards. Which is probably why the race detector did not complain about it.

But I've gone ahead and made a copy of the pointer to the underlying channel before releasing the lock. Thanks.
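The resulting shape is roughly (a sketch, not the exact diff):

c.channelsMu.Lock()
channelToClose := state.channel // copy the pointer while holding the lock
c.channelsMu.Unlock()
channelToClose.close() // close the xdsChannel outside the critical section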

Contributor

@purnesh42H purnesh42H left a comment

LGTM. One follow-up question about the empty request and channel draining.

}
streamRequestCh.Drain()
Contributor

we drain the request channel here so that if an empty discovery request was received, it is pulled out of the request channel

Based on the PR description, I thought this should never happen anymore?

@purnesh42H purnesh42H assigned easwars and unassigned purnesh42H Dec 13, 2024
Contributor Author

@easwars easwars commented Dec 13, 2024

we drain the request channel here so that if an empty discovery request was received, it is pulled out of the request channel

Based on the PR description, I thought this should never happen anymore?

This is from the PR description:

This change does lead to a minor behavior change. Earlier when the last watch on an xdsChannel was being canceled, we would see a discovery request with no resources specified in it. But with the current change, when the last watch on an xdsChannel is being canceled, the channel itself would get closed. Hence we might not see the empty discovery request on the wire.

The above paragraph does use "might" to indicate the uncertainty. Is that not clear? If so, let me know how you want me to change it.
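For reference, this is roughly how the test copes with that uncertainty using the newly added Drain helper (a sketch; the actual test differs in its details):

// Drain whatever is already buffered (which may or may not include the
// empty unsubscription request) so that the next Receive only sees requests
// triggered by the watch the test registers next.
streamRequestCh.Drain()
// ... re-register the watch ...
gotReq, err := streamRequestCh.Receive(ctx)
if err != nil {
	t.Fatalf("Timeout waiting for discovery request: %v", err)
}
_ = gotReq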

@danielzhaotongliu
Contributor

LGTM (I do not have permission to approve in GitHub)! Thanks for the detailed explanation; this really helped me understand the gRPC Go xDS stack at a deeper level, and I can see the parallels with the gRPC Java xDS stack.

@easwars easwars merged commit 3f76275 into grpc:master Dec 13, 2024
15 checks passed
@easwars easwars deleted the delete_authority_idle_cache branch December 13, 2024 22:14