
Performance and efficiency improvements in daemon/server mode #1154

Merged: 13 commits merged into k8snetworkplumbingwg:master on Sep 15, 2023

Conversation

@dcbw (Member) commented Sep 8, 2023

When running in server mode we can make a number of performance and efficiency improvements:

  1. Use a shared informer to listen for Pod events from the apiserver and read pod info from its cache rather than making a direct apiserver request each time. This reduces apiserver load and retry latency, since Multus can poll the local cache far more frequently than it could reasonably make direct apiserver requests.
  2. Use a higher QPS on the long-lived kube client, since Multus is a performance-critical component. This prevents client-side throttling when the cluster is under heavy CNI load, such as during mass pod/namespace deletion.
  3. Use protobuf instead of JSON in the kube client. The more efficient encoding reduces load on the apiserver. (A sketch of this client tuning follows the list.)
  4. Don't check the readiness indicator file on every CNI operation in server mode when the ConfigManager is active, since the ConfigManager already watches the readiness indicator and exits if it gets deleted. It turns out that os.Stat()-ing the file often returns "file not found" (even though the file exists) when Multus is running in a container, and that triggers the backoff behavior, which adds large latency to CNI operations.
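A minimal sketch of the client tuning behind items 2 and 3, assuming client-go's rest.Config; the helper name and the concrete QPS/Burst numbers are illustrative, not the values used in this PR (the informer and readiness-check changes are sketched alongside their commit messages further down):

```go
package example

import (
    "k8s.io/client-go/kubernetes"
    "k8s.io/client-go/rest"
)

// newTunedInClusterClient is a hypothetical helper showing the general idea:
// raise the client-side rate limits so a performance-critical component is
// not throttled, and ask the apiserver for protobuf instead of JSON.
func newTunedInClusterClient() (kubernetes.Interface, error) {
    cfg, err := rest.InClusterConfig()
    if err != nil {
        return nil, err
    }

    // Placeholder values, higher than client-go's defaults (QPS 5, Burst 10),
    // to avoid client-side throttling during mass pod churn.
    cfg.QPS = 50
    cfg.Burst = 100

    // Protobuf is cheaper to encode and decode than JSON for built-in types.
    cfg.ContentType = "application/vnd.kubernetes.protobuf"
    cfg.AcceptContentTypes = "application/vnd.kubernetes.protobuf,application/json"

    return kubernetes.NewForConfig(cfg)
}
```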

@s1061123 @dougbtv

@coveralls commented Sep 8, 2023

Coverage Status

coverage: 64.54% (+0.6%) from 63.938% when pulling d9c06e9 on dcbw:shared-informer into 02ce071 on k8snetworkplumbingwg:master.

Code scanning (CodeQL) reported a check failure on this excerpt of the diff:

```go
    Jitter: 0.1,
}
if utilwait.ExponentialBackoff(backoff, func() (bool, error) {
    _, err := os.Stat(readinessIndicatorFile)
```

Uncontrolled data used in path expression (High): this path depends on a user-provided value.
@dcbw (Member, Author) replied: This commit was dropped and the code referenced is no longer in this PR.

@dcbw changed the title from "server: use a shared informer pod cache rather than direct apiserver access" to "Performance and efficiency improvements in daemon/server mode" on Sep 13, 2023
@dcbw force-pushed the shared-informer branch 3 times, most recently from 065803b to 93a9089, on September 13, 2023 at 04:27
We want the in-cluster client that the multus server uses to use
the same client config (QPS, protobuf, grpc, etc) as the regular
client.

Signed-off-by: Dan Williams <[email protected]>
Multus is a pretty critical piece of infrastructure, so it shouldn't
be subject to the same lower QPS limits as most components are.

Signed-off-by: Dan Williams <[email protected]>
Then we can just use the Server struct kube client and exec rather
than passing them through the function parameters.

Signed-off-by: Dan Williams <[email protected]>
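As a rough illustration of that refactor (the struct and field names here are hypothetical stand-ins, not Multus's actual types):

```go
package example

import (
    "github.com/containernetworking/cni/pkg/invoke"
    "k8s.io/client-go/kubernetes"
)

// Server is a stand-in for the daemon's server object. Handlers read the
// kube client and the CNI exec implementation from the struct instead of
// having them threaded through every function's parameter list.
type Server struct {
    kubeClient kubernetes.Interface
    exec       invoke.Exec
}
```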
We'll need these for the next commit.

Signed-off-by: Dan Williams <[email protected]>
Move server start code to a common function that both regular
and test code can use. Also shut down the server from the
testcases.

Signed-off-by: Dan Williams <[email protected]>
@dcbw changed the title from "Performance and efficiency improvements in daemon/server mode" to "[wip] Performance and efficiency improvements in daemon/server mode" on Sep 13, 2023
@dcbw force-pushed the shared-informer branch 3 times, most recently from fbfe495 to 8212539, on September 13, 2023 at 17:49
@dcbw changed the title from "[wip] Performance and efficiency improvements in daemon/server mode" to "Performance and efficiency improvements in daemon/server mode" on Sep 13, 2023
```go
if clientInfo == nil {
    return nil, fmt.Errorf("failed to create in-cluster kube client")
}
return clientInfo, err
```
A reviewer (Member) commented on this diff: @s1061123 just want to make sure this approach works with your incoming changes?

@dougbtv (Member) left a comment:

I'm OK to move forward with this changeset as-is. Overall this looks like great improvements to the thick plugin approach and it's nice to get another take on it.

I would like to get Tomo's take on it as well, especially in consideration of some incoming changes that Tomo has staged.

Thanks for breaking it into commits; let's not squash them when we merge, since they are logically separate. Also, thanks for the thorough refactoring around changed parameter counts and unused exports, as well as the testing.

@dcbw (Member, Author) commented Sep 14, 2023

Fixed up the codecheck comments (missing function doc and an if/else simplification). No functional changes.

@maiqueb (Collaborator) commented Sep 14, 2023

/cc

This is really good.

Conceptually really on-board w/ this, especially using protobuf instead of JSON, and informers over frying the K8S API.

I'll take a look ASAP.

server: use a shared informer pod cache rather than direct apiserver access

When running in server mode we can use a shared informer to listen for
Pod events from the apiserver, and grab pod info from that cache rather
than doing direct apiserver requests each time.

This reduces apiserver load and retry latency, since multus can poll
the local cache more frequently than it should do direct apiserver
requests.

Oddly, static pods don't show up in the informer cache before the
timeout and require a direct apiserver request. Since static pods are
not common and are typically long-running, it should not be a big
issue to fall back to direct apiserver access for them.

Signed-off-by: Dan Williams <[email protected]>
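A minimal sketch of this cache-first lookup with the static-pod fallback, using client-go's shared informer machinery; the function names and wiring are illustrative, not Multus's actual code:

```go
package example

import (
    "context"
    "fmt"
    "time"

    corev1 "k8s.io/api/core/v1"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/informers"
    "k8s.io/client-go/kubernetes"
    listers "k8s.io/client-go/listers/core/v1"
    "k8s.io/client-go/tools/cache"
)

// startPodInformer builds a shared informer factory and returns a pod lister
// backed by the local watch cache.
func startPodInformer(client kubernetes.Interface, stopCh <-chan struct{}) (listers.PodLister, error) {
    factory := informers.NewSharedInformerFactory(client, time.Minute)
    podLister := factory.Core().V1().Pods().Lister()
    factory.Start(stopCh)
    if !cache.WaitForCacheSync(stopCh, factory.Core().V1().Pods().Informer().HasSynced) {
        return nil, fmt.Errorf("pod informer cache failed to sync")
    }
    return podLister, nil
}

// getPod consults the informer cache first and only falls back to a direct
// apiserver GET when the pod is not in the cache (e.g. static pods).
func getPod(ctx context.Context, client kubernetes.Interface, lister listers.PodLister, namespace, name string) (*corev1.Pod, error) {
    if pod, err := lister.Pods(namespace).Get(name); err == nil {
        return pod, nil
    }
    return client.CoreV1().Pods(namespace).Get(ctx, name, metav1.GetOptions{})
}
```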
A couple of the setup variables for NewManager*() are already in the
multus config it gets passed, so use those instead of passing them
explicitly.

Signed-off-by: Dan Williams <[email protected]>
Simplify setup by moving the post-creation operations like
GenerateConfig() and PersistMultusConfig() into a new Start() function
that also begins watching the configuration directory. This better
encapsulates the manager functionality in the object.

We can also get rid of the done channel passed to the config
manager and just use the existing WaitGroup to determine when to
exit the daemon main().

Signed-off-by: Dan Williams <[email protected]>
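A rough, self-contained sketch of that shape; the type and method names below are stand-ins for the real ConfigManager, and the point is only that main() can wg.Wait() instead of draining a dedicated done channel:

```go
package example

import (
    "context"
    "sync"
)

// configManager stands in for the real ConfigManager; its methods here are
// no-op placeholders.
type configManager struct{}

func (m *configManager) generateConfig() error      { return nil }
func (m *configManager) persistMultusConfig() error { return nil }
func (m *configManager) watchConfigDir(ctx context.Context) {
    <-ctx.Done() // placeholder for the real config-directory watch loop
}

// Start performs the post-creation steps (generate + persist the config) and
// then watches the config directory in the background, tracking the goroutine
// on the caller's WaitGroup.
func (m *configManager) Start(ctx context.Context, wg *sync.WaitGroup) error {
    if err := m.generateConfig(); err != nil {
        return err
    }
    if err := m.persistMultusConfig(); err != nil {
        return err
    }
    wg.Add(1)
    go func() {
        defer wg.Done()
        m.watchConfigDir(ctx)
    }()
    return nil
}
```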
The test was comparing the same configuration to itself, since
nothing in the changed CNI configuration is used in the written
multus configuration.

Instead, make sure the updated CNI config contains something that
will be reflected in the written multus configuration and, while
we're there, use gomega.Eventually() as a more robust way to wait
for the config to be written.

Signed-off-by: Dan Williams <[email protected]>
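A standalone illustration of the gomega.Eventually() polling pattern described above; the config path and the marker string are invented for the example:

```go
package example_test

import (
    "os"
    "testing"
    "time"

    . "github.com/onsi/gomega"
)

func TestMultusConfigIsRewritten(t *testing.T) {
    g := NewWithT(t)

    // Poll the written multus config until it reflects the updated CNI
    // config, instead of comparing the old config to itself.
    g.Eventually(func() string {
        data, _ := os.ReadFile("/tmp/multus.d/00-multus.conf")
        return string(data)
    }, 5*time.Second, 100*time.Millisecond).Should(ContainSubstring("my-updated-plugin"))
}
```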
…Manager

For whatever reason calling os.Stat() on the readiness indicator file
from CmdAdd()/CmdDel() when multus is running in server mode and is
containerized often returns "file not found", which triggers the
polling behavior of GetReadinessIndicatorFile(). This greatly delays
CNI operations that should be pretty quick. Even if an exponential
backoff is used, os.Stat() can still return "file not found"
multiple times, even though the file clearly exists.

But it turns out we don't need to check the readiness file in server
mode when running with MultusConfigFile == "auto". In this mode the
server starts the ConfigManager, which (a) waits until the file
exists, (b) watches the readiness file with fsnotify, and (c) exits
the daemon immediately if the file is deleted or moved.

This means we can assume that, while the daemon is running and the
server is handling CNI requests, the readiness file exists; otherwise
the daemon would have exited. Thus CmdAdd/CmdDel don't need to run a
lot of possibly failing os.Stat() calls in the CNI hot paths.

Signed-off-by: Dan Williams <[email protected]>
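A self-contained sketch of the guarantee this commit relies on, using fsnotify directly; the function and its wiring are illustrative, not the actual ConfigManager code:

```go
package example

import (
    "log"
    "path/filepath"

    "github.com/fsnotify/fsnotify"
)

// watchReadinessFile shows the invariant: while the daemon is running,
// something watches the readiness indicator file and shuts the daemon down
// the moment the file is deleted or moved, so per-request os.Stat() checks
// in the CNI hot paths become unnecessary.
func watchReadinessFile(readinessIndicatorFile string, shutdown func()) error {
    watcher, err := fsnotify.NewWatcher()
    if err != nil {
        return err
    }
    // Watch the directory rather than the file itself so renames are seen.
    if err := watcher.Add(filepath.Dir(readinessIndicatorFile)); err != nil {
        return err
    }

    go func() {
        defer watcher.Close()
        for event := range watcher.Events {
            if event.Name != readinessIndicatorFile {
                continue
            }
            if event.Op&(fsnotify.Remove|fsnotify.Rename) != 0 {
                log.Printf("readiness indicator file %q was removed; shutting down", readinessIndicatorFile)
                shutdown()
                return
            }
        }
    }()
    return nil
}
```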
@dcbw (Member, Author) commented Sep 14, 2023

One thing to note is that static pods don't seem to show up via the informer, but are present in the apiserver when queried directly. Server-side pod watch filtering doesn't seem to be the issue, since this happened even before I added the env bits to the daemonset yaml. But given that static pods are fairly rare and usually long-lived, it shouldn't be a problem to fall back to apiserver access for them.

@maiqueb (Collaborator) commented Sep 14, 2023

> One thing to note is that static pods don't seem to show up via the informer, but are present in the apiserver when queried directly. [...]

Good catch. IMO it is perfectly reasonable to fall back to GETting the pod via the API if it's not found in the informer cache.

@dougbtv (Member) left a comment:
lgtm! thank you!

@dougbtv merged commit ddb977f into k8snetworkplumbingwg:master on Sep 15, 2023
24 checks passed