server start failed for reason 'oversized message' #102
@yylt Is that a reproducible problem in your environment? Can you give a bit more details? I'd be interested at least in the number of pods and containers you have running in your system.
There are about 150 pods, mostly in the same namespace ("default"), so this might be related to the number of pods. The issue can be consistently reproduced in my environment; a certain number of pods may be required to reproduce it.
And how many containers do you have altogether in those pods?
The number of containers per pod does not seem to affect this, as each sync operation is independent of the others.
Not per pod. The total number of containers in all pods. I assume we hit the ttrpc messageLengthMax limit with the sync request, so what matters is both the total number of pods and the total number of containers. That's why I'd like to know it. IOW, what does
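For context on the limit referred to here: ttrpc caps the size of a single message (`messageLengthMax`, 4 MiB at the time of writing), and the NRI synchronization request carries every pod sandbox and container in one message. The back-of-envelope sketch below only illustrates how such a payload can exceed that cap; the per-object sizes and counts are made-up numbers, not measurements:

```go
// Rough, illustrative estimate only: the real limit lives in ttrpc
// (messageLengthMax) and the real payload is the protobuf-encoded NRI
// synchronization request, not these assumed per-object sizes.
package main

import "fmt"

const (
	ttrpcMessageLengthMax = 4 << 20 // assumed 4 MiB per-message cap in ttrpc
	bytesPerPod           = 2048    // assumed average encoded pod sandbox size
	bytesPerContainer     = 4096    // assumed average encoded container size
)

func main() {
	pods, containers := 150, 1500 // hypothetical cluster numbers

	estimate := pods*bytesPerPod + containers*bytesPerContainer
	fmt.Printf("estimated sync payload: %d bytes (limit %d)\n",
		estimate, ttrpcMessageLengthMax)
	if estimate > ttrpcMessageLengthMax {
		fmt.Println("a single sync message this large would be rejected as oversized")
	}
}
```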
Also it would be interesting to see the result of these:
@yylt I have a branch with a fix for kicking plugins out if synchronization fails, which alone would provide more graceful behavior. I also have an initial fix attempt for the size overflow, and a v1.7.16 containerd tree redirected to compile with those fixes. With that in place, the error my local test used to trigger is now gone and the plugin registers successfully. Would you be able to give it a try: compile it and drop it into your test cluster to see if it gets rid of the problems on your side, too? I could then try to polish/finalize it a bit more and then file PRs with the fixes.
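I don't know the exact shape of the size-overflow fix in that branch, but a generic way to stay under a per-message cap is to split a large synchronization payload into several smaller batches. The sketch below is only an illustration of that idea, with made-up types and sizes; it is not code from the branch:

```go
// Hypothetical sketch of batching a large sync payload so each message
// stays under a size budget. The item type, its encoded size, and the
// budget are all assumptions.
package main

import "fmt"

type item struct {
	name string
	size int // assumed encoded size in bytes
}

// splitIntoBatches greedily packs items into batches whose combined
// encoded size stays at or below budget.
func splitIntoBatches(items []item, budget int) [][]item {
	var batches [][]item
	var cur []item
	curSize := 0
	for _, it := range items {
		if curSize+it.size > budget && len(cur) > 0 {
			batches = append(batches, cur)
			cur, curSize = nil, 0
		}
		cur = append(cur, it)
		curSize += it.size
	}
	if len(cur) > 0 {
		batches = append(batches, cur)
	}
	return batches
}

func main() {
	pods := []item{{"pod-a", 1 << 20}, {"pod-b", 2 << 20}, {"pod-c", 2 << 20}}
	for i, batch := range splitIntoBatches(pods, 3<<20) {
		fmt.Printf("batch %d: %d item(s)\n", i, len(batch))
	}
}
```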
OK. Is https://github.com/klihub/nri/tree/fixes/yylt-sync-failure the right one?
@yylt Yes, but I have a directly patched 1.7.16 containerd tree pointing at that NRI version and re-vendored here, so it's easier to just compile and use that: https://github.com/klihub/containerd/tree/fixes/yylt-sync-failure
Oh, and you will need to recompile your plugin against that NRI tree as well. Otherwise the runtime-side will detect that the plugin does not have the necessary support compiled in and will kick it out during synchronization. |
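Assuming the plugin is an ordinary Go module that depends on `github.com/containerd/nri`, one way to recompile it against the patched tree is a `replace` directive pointing at a local checkout of that branch. The module path and version below are placeholders:

```
// go.mod of the plugin; the module path and version are placeholders, and
// ../nri is assumed to be a local checkout of the fixes/yylt-sync-failure
// branch of https://github.com/klihub/nri.
module example.com/nri-daemon

go 1.21

require github.com/containerd/nri v0.6.0 // placeholder version

// Build against the locally checked-out, patched NRI tree instead of the
// upstream module.
replace github.com/containerd/nri => ../nri
```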
After replacing both containerd and the plugin, the error is gone. If only one of them is replaced, what is the expected behavior?
Yes, that is the expected behavior. And if you only update containerd, but run with an old plugin ('nri-daemon' I believe in your case), then the plugin should get disconnected during synchronization... |
nri-daemon log
containerd log
Sure, the issue likely lies within the ttrpc repository: there is a check on the received message size in ttrpc's receive path. https://github.com/containerd/ttrpc/blob/655622931dab8c39a563e8c82ae90cdc748f72a1/channel.go#L126
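For illustration, a receive-side guard of that kind can be sketched as below; the constants, header layout, and error text are illustrative and not copied from ttrpc:

```go
// Minimal sketch of a receive-side message-length guard, in the spirit of
// the check linked above.
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
	"io"
)

const messageLengthMax = 4 << 20 // assumed 4 MiB per-message cap

// recvHeader reads a fixed-size length prefix and refuses to read the
// payload if the declared length exceeds the cap.
func recvHeader(r io.Reader) (uint32, error) {
	var hdr [4]byte
	if _, err := io.ReadFull(r, hdr[:]); err != nil {
		return 0, err
	}
	length := binary.BigEndian.Uint32(hdr[:])
	if length > messageLengthMax {
		return 0, fmt.Errorf("oversized message: %d > %d", length, messageLengthMax)
	}
	return length, nil
}

func main() {
	// Simulate a header that declares an 8 MiB payload.
	var hdr [4]byte
	binary.BigEndian.PutUint32(hdr[:], 8<<20)
	if _, err := recvHeader(bytes.NewReader(hdr[:])); err != nil {
		fmt.Println(err) // oversized message: 8388608 > 4194304
	}
}
```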
However, there is a question: is ttrpc the appropriate RPC framework for this context? /cc @klihub