server start failed for reason 'oversized message' #102

yylt · 2024-08-13T06:39:27Z

nri-daemon log

time="2024-08-10T06:23:28Z" level=info msg="Configuring plugin 05-Controller for runtime containerd/v1.7.16"
time="2024-08-10T06:23:28Z" level=info msg="Started plugin 05-Controller..."
time="2024-08-10T06:23:29Z" level=error msg="error receiving message" error="failed to discard after receiving oversized message: cannot allocate memory"

containerd log

time="2024-08-10T14:25:55.098248241+08:00" level=info msg="synchronizing plugin 05-Controller"
time="2024-08-10T14:25:57.098681078+08:00" level=info msg="failed to synchronize plugin: context deadline exceeded"
time="2024-08-10T14:25:57.098762853+08:00" level=info msg="plugin \"05-Controller\" connected"

Sure the issue is likely within the ttrpc repository, there is a check on the window during the reception process within ttrpc. https://github.com/containerd/ttrpc/blob/655622931dab8c39a563e8c82ae90cdc748f72a1/channel.go#L126

However, there are questions:

Is ttrpc the appropriate RPC framework for this context?
Should exit when encountering such an error?

/cc @klihub

The text was updated successfully, but these errors were encountered:

klihub · 2024-08-13T07:33:53Z

@yylt Is that a reproducible problem in your environment ? Can you give a bit more details ? I'd be interested at least in the number of pods and containers you have running in your system.

yylt · 2024-08-13T08:33:27Z

There are 150 pods, primarily distributed in the same namespace, such as "default", might be related to the number of env.

The issue can be consistently reproduced in my environment, maybe a certain number of pods would be required when reproduced.

klihub · 2024-08-13T08:46:29Z

And how many containers do you have altogether in those pods ?

yylt · 2024-08-13T08:59:09Z

And how many containers do you have altogether in those pods ?

The number of containers per pod should not seem to affect this, as each sync operation is independent of the others.

klihub · 2024-08-13T10:47:03Z

Not per pod. The number of total containers in all pods. I assume we hit the ttrpc messageLengthMax limit with the sync request, so what matters is both the total number of pods and the total number of containers. That's why I'd like to know it. IOW what does crictl ps | grep -v CONTAINER | wc -l tell on the failing host ?

klihub · 2024-08-13T13:08:26Z

Also it would be interesting to see the result of these:

crictl pods -o json | wc -c
crictl ps -o json | wc -c

yylt · 2024-08-13T13:23:52Z

# ctr -n k8s.io c ls  |wc -l
643

# crictl ps | grep -v CONTAINER | wc -l
156

# crictl pods -o json | wc -c
256810

# crictl ps -o json | wc -c
200467

klihub · 2024-08-14T11:43:59Z

@yylt I have a branch with a fix for kicking out plugins if synchronization fail, which alone would provide more graceful behavior, by kicking the plugin out if synchronization fails.

I also have an initial fix attempt for the size overflow, and a v.17.16 redirected to compile with those fixes. With that in place, my local test which used to trigger the error is now gone and the plugin registers successfully. Would you be able to give it a try, compile it, drop it into your test cluster to see if gets rid of the problems on your side, too ? I could then try to polish/finalize it a bit more then file PRs with the fixes.

yylt · 2024-08-14T12:07:41Z

@yylt I have a branch with a fix for kicking out plugins if synchronization fail, which alone would provide more graceful behavior, by kicking the plugin out if synchronization fails.

I also have an initial fix attempt for the size overflow, and a v.17.16 redirected to compile with those fixes. With that in place, my local test which used to trigger the error is now gone and the plugin registers successfully. Would you be able to give it a try, compile it, drop it into your test cluster to see if gets rid of the problems on your side, too ? I could then try to polish/finalize it a bit more then file PRs with the fixes.

ok, Is https://github.com/klihub/nri/tree/fixes/yylt-sync-failure, right?

klihub · 2024-08-14T12:09:01Z

@yylt Yes, but I have a directly patched 1.17.16 containerd tree pointing at that nri version and re-vendored here, so it's easier to just compile and use that...

https://github.com/klihub/containerd/tree/fixes/yylt-sync-failure

klihub · 2024-08-14T12:10:22Z

Oh, and you will need to recompile your plugin against that NRI tree as well. Otherwise the runtime-side will detect that the plugin does not have the necessary support compiled in and will kick it out during synchronization.

yylt · 2024-08-14T12:56:33Z

After replacing both nri-daemon and containerd, the sync error no longer occurs upon restart.

If nri-daemon is replaced individually, the issue still persists.

klihub · 2024-08-14T14:23:16Z

After replacing both nri-daemon and containerd, the sync error no longer occurs upon restart.

If nri-daemon is replaced individually, the issue still persists.

Yes, that is the expected behavior. And if you only update containerd, but run with an old plugin ('nri-daemon' I believe in your case), then the plugin should get disconnected during synchronization...

mikebrow added this to the 1.0 milestone Aug 22, 2024

mikebrow added the v1.0 required label Aug 22, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

server start failed for reason 'oversized message' #102

server start failed for reason 'oversized message' #102

yylt commented Aug 13, 2024

klihub commented Aug 13, 2024

yylt commented Aug 13, 2024

klihub commented Aug 13, 2024

yylt commented Aug 13, 2024

klihub commented Aug 13, 2024 •

edited

Loading

klihub commented Aug 13, 2024

yylt commented Aug 13, 2024 •

edited

Loading

klihub commented Aug 14, 2024 •

edited

Loading

yylt commented Aug 14, 2024

klihub commented Aug 14, 2024

klihub commented Aug 14, 2024 •

edited

Loading

yylt commented Aug 14, 2024

klihub commented Aug 14, 2024

server start failed for reason 'oversized message' #102

server start failed for reason 'oversized message' #102

Comments

yylt commented Aug 13, 2024

klihub commented Aug 13, 2024

yylt commented Aug 13, 2024

klihub commented Aug 13, 2024

yylt commented Aug 13, 2024

klihub commented Aug 13, 2024 • edited Loading

klihub commented Aug 13, 2024

yylt commented Aug 13, 2024 • edited Loading

klihub commented Aug 14, 2024 • edited Loading

yylt commented Aug 14, 2024

klihub commented Aug 14, 2024

klihub commented Aug 14, 2024 • edited Loading

yylt commented Aug 14, 2024

klihub commented Aug 14, 2024

klihub commented Aug 13, 2024 •

edited

Loading

yylt commented Aug 13, 2024 •

edited

Loading

klihub commented Aug 14, 2024 •

edited

Loading

klihub commented Aug 14, 2024 •

edited

Loading