
xdsclient: stop caching xdsChannels for potential reuse, after all references are released #7924

Merged

easwars merged 4 commits into grpc:master on Dec 13, 2024

Conversation

Contributor

@easwars easwars commented Dec 12, 2024

Prior to this PR, xdsChannels were cached for a configurable amount of time after all references to them were released. This was in the hope that an xdsChannel might get used again very soon.

But with fallback support, we want channels to lower-priority servers to be closed as soon as a higher-priority server comes up. If we ever want to bring back support for caching unused channels, it can be implemented in the transport builder when someone asks for it.

Link to internal discussion: https://chat.google.com/room/AAAAbkw9L3c/oa6GgM1MPlk

This change does lead to a minor behavior change. Earlier when the last watch on an xdsChannel was being canceled, we would see a discovery request with no resources specified in it. But with the current change, when the last watch on an xdsChannel is being canceled, the channel itself would get closed. Hence we might not see the empty discovery request on the wire.
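For reference, the "empty discovery request" mentioned above is an ADS DiscoveryRequest whose resource_names list is empty. A rough sketch of what the old test expectation looked like, assuming go-control-plane's v3 types (the exact assertions in the tests may differ):

import (
	v3discoverypb "github.com/envoyproxy/go-control-plane/envoy/service/discovery/v3"
)

// Unsubscription request the tests used to look for on the wire: same type
// URL as the watched resource, but no resource names. With this change, the
// xdsChannel may be closed before such a request is ever sent.
var wantUnsubscribeReq = &v3discoverypb.DiscoveryRequest{
	TypeUrl:       "type.googleapis.com/envoy.config.cluster.v3.Cluster",
	ResourceNames: nil,
}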

Summary of test changes:

  • TestClose in cdsbalancer_test.go was verifying that an empty discovery request was being sent out when the LB policy was closed. This does not happen anymore. So, the test is changed to only verify that the child policy is closed when the parent is closed.
  • TestADS_ACK_NACK_ResourceIsNotRequestedAnymore in ads_stream_ack_nack_test.go:
    • The test was checking that an empty discovery request was being sent when a previously requested resource was no longer being requested. This is being changed to verify that the underlying connection to the management server is closed instead.
    • Similarly, when a watch for the same resource is re-registered, the test earlier expected a request with the previously cached version number. Now, since the channel is deleted, the new watch results in a request with an empty version string.
  • TestADS_ResourcesAreRequestedAfterStreamRestart in ads_stream_restart_test.go:
    • Removed the check that looks for an empty discovery request.
  • xds/internal/xdsclient/tests/authority_test.go
    • One of the tests was specifically verifying that an xdsChannel was not being closed when the last watch was canceled. That test is changed to verify that the xdsChannel is being closed when the last watch is canceled.
    • A couple of tests are deleted because they no longer make sense.
  • TestLDSWatch_ResourceCaching_NACKError in lds_watchers_test.go was flaky because it validated that exactly two discovery requests were sent. But since the resource in that test is NACKed, the management server keeps resending the same resource and the client keeps NACKing it, so more than two requests can show up on the wire.

RELEASE NOTES:

  • TBD

@easwars easwars requested a review from purnesh42H December 12, 2024 02:07
@easwars easwars added the Type: Internal Cleanup Refactors, etc label Dec 12, 2024
@easwars easwars added this to the 1.70 Release milestone Dec 12, 2024
Contributor Author

@easwars easwars commented Dec 12, 2024

@danielzhaotongliu


codecov bot commented Dec 12, 2024

Codecov Report

Attention: Patch coverage is 76.19048% with 5 lines in your changes missing coverage. Please review.

Project coverage is 81.96%. Comparing base (c1b6b37) to head (53b8f9c).
Report is 3 commits behind head on master.

Files with missing lines               Patch %   Lines
xds/internal/xdsclient/authority.go     50.00%   2 Missing ⚠️
xds/internal/xdsclient/clientimpl.go    66.66%   2 Missing ⚠️
internal/testutils/channel.go           83.33%   1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #7924      +/-   ##
==========================================
- Coverage   81.98%   81.96%   -0.03%     
==========================================
  Files         377      379       +2     
  Lines       38180    38230      +50     
==========================================
+ Hits        31302    31334      +32     
- Misses       5571     5583      +12     
- Partials     1307     1313       +6     
Files with missing lines                      Coverage Δ
xds/internal/xdsclient/client_new.go          85.71% <100.00%> (-0.65%) ⬇️
xds/internal/xdsclient/client_refcounted.go   82.05% <100.00%> (ø)
internal/testutils/channel.go                 83.33% <83.33%> (ø)
xds/internal/xdsclient/authority.go           76.87% <50.00%> (+1.76%) ⬆️
xds/internal/xdsclient/clientimpl.go          74.84% <66.66%> (+0.36%) ⬆️

... and 19 files with indirect coverage changes

Contributor

@purnesh42H purnesh42H left a comment

Overall lgtm. Few minor comments.

@@ -687,7 +694,7 @@ func (a *authority) unwatchResource(rType xdsresource.Type, resourceName string,
delete(state.watchers, watcher)
if len(state.watchers) > 0 {
if a.logger.V(2) {
a.logger.Infof("%d more watchers exist for type %q, resource name %q", rType.TypeName(), resourceName)
a.logger.Infof("More watchers exist for type %q, resource name %q", rType.TypeName(), resourceName)
Contributor

"Watchers still exist for type %q, resource name %q"? considering that "more" is applicable with quantitative comparison

Contributor Author

Done.

// 3. The same resource is requested again. The test verifies that the request
-// is sent with the previously ACKed version.
+// is sent with an empty version string.
Contributor

It looks like you are verifying this by ensuring that the re-request is identical to the first request. We should mention that here?

Contributor Author

Done.

}
streamRequestCh.Drain()
Contributor

Why do we have to drain? It may be good to add a comment explaining this.

Contributor Author

Added a comment, and slightly reworked the test since I saw another flake on GitHub Actions.

Contributor

we drain the request channel here so that if an empty discovery request was received, it is pulled out of the request channel

Based on the PR description, I thought this should never happen anymore?

// containing both the old good update and the latest NACK error. The test
// verifies that when a new watch is registered for the same resource, the new
// watcher receives the good update followed by the NACK error.
func (s) TestLDSWatch_ResourceCaching_NACKError(t *testing.T) {
Contributor

@purnesh42H purnesh42H Dec 13, 2024

So, we are not verifying whether the update is coming from the resource cache?

Contributor Author

This is the scenario being tested:

  • watcher1 registers a watch for resource1, and gets a response at version1
  • version1 is ACKed
  • management server then sends resource1 at version2
  • version2 is NACKed
  • watcher2 registers a watch for resource1
    • if this watch is not served from the cache and instead goes to the management server, it will only get the response at version2. Since we are verifying that this watcher gets the good update at version1 followed by the error at version2, we are indirectly verifying that the watch was served from the cache.
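A rough sketch of that scenario as a test (registerWatch, verifyGoodUpdate and verifyNACKError below are hypothetical stand-ins for the actual test helpers, not grpc-go APIs):

// watcher1 registers a watch for resource1; the response at version1 is
// received and ACKed.
cancel1 := registerWatch(client, "resource1", watcher1)
verifyGoodUpdate(t, watcher1)
// The management server then sends resource1 at version2, which is invalid
// and gets NACKed; version1 remains in the cache.
verifyNACKError(t, watcher1)

// watcher2 registers a watch for the same resource. Because the watch is
// served from the xDS client's cache, watcher2 sees the good update at
// version1 followed by the NACK error for version2. A fresh request to the
// management server could only have produced the version2 response.
cancel2 := registerWatch(client, "resource1", watcher2)
verifyGoodUpdate(t, watcher2)
verifyNACKError(t, watcher2)
cancel1()
cancel2()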

@purnesh42H purnesh42H assigned easwars and unassigned purnesh42H Dec 13, 2024
@easwars easwars assigned purnesh42H and unassigned easwars Dec 13, 2024
if err != nil {
t.Fatalf("Failed to receive new connection from wrapped listener: %v", err)
}
conn := val.(*testutils.ConnWrapper)
Contributor

Would this type assertion always succeed? If not, should we use conn, ok and check the assertion actually succeeded before continuing?

Ditto for all other newly added type assertions.

Contributor Author

This type assertion will always succeed because we pass a listener wrapper created on line 818 with the call to testutils.NewListenerWrapper. net.Conn instances returned by this listener are always of type *testutils.ConnWrapper. See: https://github.com/grpc/grpc-go/blob/master/internal/testutils/wrappers.go#L27
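For reference, the defensive form suggested in the review would look roughly like this (a sketch, not the test's actual code):

conn, ok := val.(*testutils.ConnWrapper)
if !ok {
	t.Fatalf("Received connection of type %T, want *testutils.ConnWrapper", val)
}
_ = conn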

-// Wait for the CDS resource to be not requested anymore.
+// Wait for the CDS resource to be not requested anymore, or the connection
+// to the management server to be closed (which happens as part of the last
+// resource watch being canceled).
Contributor

Does the order of the connection being closed and the CDS resource watch being cancelled matter here (and would that need to be reflected in the tests)? My understanding is: the last watch is cancelled first, then the connection/channel gets closed, and then we may or may not see the empty discovery request.

Contributor Author

When the last watch is canceled by calling the cancel function returned from WatchCluster, the xDS client implementation first unsubscribes from the resource. This is done by invoking the unsubscribe method on the underlying xdsChannel, which represents the connection to the management server. After this, if the xDS client implementation determines that there are no more resource watches on this management server, it closes the corresponding xdsChannel.

While the closing of the xdsChannel is handled synchronously, by ensuring that the channel is actually closed and all Go resources (and goroutines) allocated by it are freed by the time close returns, the unsubscription of a resource is done asynchronously. That is why the order of these two events is not guaranteed.
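A sketch of how a test can tolerate that non-determinism (emptyRequestCh and connClosedCh are hypothetical plain channels fed by the test's fake management server and wrapped connection):

select {
case <-emptyRequestCh:
	// The async unsubscription won the race: an empty discovery request was
	// seen before the xdsChannel was torn down.
case <-connClosedCh:
	// The close won the race: the connection was closed before any empty
	// request made it onto the wire.
case <-ctx.Done():
	t.Fatal("Timeout waiting for an empty discovery request or a connection close")
}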

// state can only be accessed in the context of an
// xdsClientSerializer callback. Hence making a copy of the error
// here for watchCallbackSerializer.
err := state.md.ErrState.Err
Contributor

For my understanding, why would we need to make a copy of the error for the callback serializer? What is the consequence of just using state.md.ErrState.Err directly as previously? Is it for easier testing?

Contributor Author

state must only be accessed in the context of an xdsClientSerializer callback. The serializer ensures that only one callback is active at any given time, guaranteeing mutual exclusion, so we don't need to guard accesses to state with a mutex. See: https://github.com/grpc/grpc-go/blob/master/xds/internal/xdsclient/authority.go#L90

If you observe closely, the call to watcher.OnError does not happen in the context of an xdsClientSerializer callback. It happens in the context of a watcherCallbackSerializer callback. So, the race detector was complaining about a data race. Making a copy of the error solves the issue. If you look at line 646, the same is done when watcher.OnUpdate is called.
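In code, the pattern looks roughly like this (a sketch of the idea rather than the exact diff; the serializer method name and the onDone argument are assumptions based on the surrounding code):

// Still inside an xdsClientSerializer callback, so reading state is safe.
err := state.md.ErrState.Err
a.watcherCallbackSerializer.TrySchedule(func(context.Context) {
	// Only the copied err is captured by this closure; state itself is never
	// touched outside the xdsClientSerializer, so there is no data race.
	watcher.OnError(err, onDone)
})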

c.channelsMu.Unlock()
}))
c.channelsMu.Unlock()
state.channel.close()
Contributor

To confirm, is it thread-safe to access the current channel state object without the lock?

Contributor Author

If we want to reason about it, it does look like a safe thing to do, because state.channel is written when the state instance is created and never changed afterwards. Which is probably why the race detector did not complain about it.

But I've gone ahead and made a copy of the pointer to the underlying channel before releasing the lock. Thanks.
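The resulting shape is roughly (a sketch, not the exact diff):

c.channelsMu.Lock()
channelToClose := state.channel // copy the pointer while holding the lock
c.channelsMu.Unlock()
channelToClose.close() // close the xdsChannel outside the critical section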

Contributor

@purnesh42H purnesh42H left a comment

LGTM. One follow-up question about the empty request and channel draining.

}
streamRequestCh.Drain()
Contributor

we drain the request channel here so that if an empty discovery request was received, it is pulled out of the request channel

Based on the PR description, I thought this should never happen anymore?

@purnesh42H purnesh42H assigned easwars and unassigned purnesh42H Dec 13, 2024
Contributor Author

@easwars easwars commented Dec 13, 2024

we drain the request channel here so that if an empty discovery request was received, it is pulled out of the request channel

Based on the PR description, I thought this should never happen anymore?

This is from the PR description:

This change does lead to a minor behavior change. Earlier when the last watch on an xdsChannel was being canceled, we would see a discovery request with no resources specified in it. But with the current change, when the last watch on an xdsChannel is being canceled, the channel itself would get closed. Hence we might not see the empty discovery request on the wire.

The above paragraph does use "might" to indicate the uncertainty. Is that not clear? If so, let me know how you want me to change it.
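For reference, this is roughly how the test copes with that uncertainty using the newly added Drain helper (a sketch; the actual test differs in its details):

// Drain whatever is already buffered (which may or may not include the
// empty unsubscription request) so that the next Receive only sees requests
// triggered by the watch the test registers next.
streamRequestCh.Drain()
// ... re-register the watch ...
gotReq, err := streamRequestCh.Receive(ctx)
if err != nil {
	t.Fatalf("Timeout waiting for discovery request: %v", err)
}
_ = gotReq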

@danielzhaotongliu
Contributor

LGTM (I do not have permission to approve in GitHub)! Thanks for the detailed explanation; this really helped me understand the gRPC Go xDS stack at a deeper level, and I can see the parallels with the gRPC Java xDS stack.

@easwars easwars merged commit 3f76275 into grpc:master Dec 13, 2024
15 checks passed
@easwars easwars deleted the delete_authority_idle_cache branch December 13, 2024 22:14