Support for alternative runtimes #111
Comments
Need `--runtime` flag support.
Hi, there are two problems. In this example I use the containerd-shim-spin:
As a solution to the first problem I introduced the … For the second problem, I had no better idea than to store the runtime handler as a label on the pause Docker container that represents the pod. Maybe someone has a better idea. What do you think?
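Roughly, the hack looks like this (a hand-wavy sketch with made-up names; in particular, the label key is hypothetical, not necessarily what the actual patch uses):

```bash
# Remember the pod's RuntimeClass handler by stamping it onto the
# sandbox ("pause") container, which is the only per-pod object we have.
docker run -d \
  --label io.kubernetes.cri.runtime-handler=spin \
  registry.k8s.io/pause:3.9

# When a workload container for the same pod is created later, read the
# handler back off the sandbox and reuse it as docker's --runtime value.
handler=$(docker inspect \
  --format '{{ index .Config.Labels "io.kubernetes.cri.runtime-handler" }}' \
  <pause-container-id>)
docker run -d --runtime "$handler" <workload-image>
```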
Ok, so:
From a practical POV, the kind of validation in your PR doesn't really belong in cri-dockerd. The …
@evol262 This is a bit more complicated for two reasons:
In short, the config for setting a default runtime is currently a bit busted: the code is so brittle that it's blocked on a major refactor of config handling. So for now, …
Thanks, @evol262! I just realized that I was confused by the term "alternative runtimes"; it means OCI-compliant runtimes other than runc. What I am working on is using alternative containerd shims as described in https://docs.docker.com/desktop/wasm/. And alternative containerd shims are also passed via `--runtime`. Thanks @neersighted, would it be an option to pass the …
That would address @evol262's concern that Docker should be the leading system for managing the runtimes (and shims).
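For reference, this is the shape of the Docker Desktop Wasm workflow mentioned above -- the full shim name goes straight into `--runtime` (a sketch; the image is a placeholder):

```bash
# Run a Wasm workload through a containerd shimv2 runtime by passing
# the shim's full name to docker's --runtime flag.
docker run --rm \
  --runtime=io.containerd.wasmedge.v1 \
  --platform=wasi/wasm32 \
  <your-wasm-image>
```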
I think it would make sense to pass values through without any opinion in cri-dockerd. I just explained why daemon.json is not an option right now -- that will take some time, upstream, to fix. I also want to push back on your config snippet -- … (or, as @corhere explained it to me, the shim is the OCI runtime and runc is just an implementation detail; this makes sense when you consider the Kata and gVisor process models).

In short, no opinions in cri-dockerd makes sense, and the work on making alternate runtimes easier to use continues upstream, though most of the people involved in making it happen are working on other issues right now. I can bring it up in today's maintainers' call to see where people are at on returning to it.
Understood, I can see that there is no way to solve this problem in cri-dockerd. Thanks for the discussion 😊
Blindly passing a containerd-compatible runtime name is the correct way to use it. Any shimv2 runtime on containerd's …
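(For context, this is the shimv2 naming convention in play; Kata here is just the example:)

```bash
# containerd resolves a shimv2 runtime name to a shim binary name:
#   io.containerd.<name>.<version>  ->  containerd-shim-<name>-<version>
# so a runtime called "io.containerd.kata.v2" is serviced by whichever
# containerd-shim-kata-v2 binary containerd finds on its $PATH.
command -v containerd-shim-kata-v2
```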
I disagree. Mixing runtimes for containers in a pod could break Kata and other runtimes which aren't namespace-based. Kata, for instance, runs all containers in a pod inside a single VM and likely wouldn't take too kindly to being asked to start a container in a sandbox it didn't also create.
FTFY. Kubernetes …
Looks like I missed the follow-up PR, thanks for clarifying, @corhere!
Blindly passing a containerd-compatible runtime name is the correct way to use it for moby. cri-dockerd can take it as a given that the administrator has hopefully set up the right node annotations so the scheduler will place it somewhere that it's configured, but the feedback loop around failing to schedule/start when blindly passing is terrible, and a richer way to do it is much better.
It's not mixing runtimes for containers. The only purpose of the pod-infra-container-image is to create a namespace which the containers can graft onto, and that is completely irrelevant with Kata anyway, since the isolation is via KVM. Support for Kata with pods is finicky enough as it is, because the container's config needs to be annotated to tell Kata how to handle it regardless -- something cri-dockerd can't do anyway, since docker/moby handles that passthrough to containerd. It's not in scope here.
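For the record, the "graft onto" mechanics with a plain namespace-based runtime look roughly like this (a simplified sketch, not cri-dockerd's actual code):

```bash
# The sandbox ("pause") container owns the pod's shared namespaces.
docker run -d --name pod-sandbox registry.k8s.io/pause:3.9

# Workload containers join the sandbox's namespaces instead of
# getting their own -- this is the "grafting" in question.
docker run -d \
  --network container:pod-sandbox \
  --ipc container:pod-sandbox \
  --pid container:pod-sandbox \
  nginx
```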
So should cri-dockerd validate the name of the runtime it passes to the daemon, or not?
What were you suggesting, then? I read it as "don't bother starting the pause container with the pod's configured runtime; using the daemon's default runtime to start pause containers is always fine, even when the workload containers are to be started with a different runtime." Using a different runtime to start the sandbox container than the pod's application containers would lead to broken behaviour with some runtimes which use isolation techniques other than Linux namespaces.
That's the only purpose of the pause container with namespace-based runtimes. The image may be irrelevant to Kata, but the act of starting a pause container is necessary for other purposes.
This feature request is titled "Support for alternative runtimes." Kata is an alternative runtime. While it may be the case that extra work above and beyond what is being considered here may be needed before …

As an aside, I have been working on adding first-class support for Kata to moby. While cri-dockerd can't ask the daemon to add OCI annotations today, it'd be fairly trivial to add that functionality to the daemon. Kata is already ready for us, too.
No, but frankly, "just pass it blindly and hope it doesn't bomb with no meaningful feedback" when we pass unsanitized input to Docker isn't great either.
I'm suggesting that this is not even remotely the right place to have that discussion, that k8s doesn't actually care about the "sandbox container" except insofar as it provides somewhere the rest of them can hook onto (which could also be a VM), but that the discussion is utterly moot until we can ask the daemon to add OCI annotations. I'm suggesting "don't bother getting into thorny discussions about edge cases which are not even technically fulfillable yet, and don't cherry-pick a driver which has a different isolation model as an edge case." That discussion can happen once the rest of the pieces are there.
@corhere. Stop. That is the purpose of the pause container in the kubernetes model. It provides a holding point for a shared process/network namespace and a nominal PID1. The "act of starting a pause container" is functionally equivalent to "tell Kata via annotations to start a new VM and stick this container in it". Really, this discussion is not appropriate here. It is mixing paradigms in a bad way between what k8s does/expects and other container stuff.
Again, stop. Step away from the GH issue. "We're not going to do anything right now because there is not enough in Docker to hook onto to provide a user experience or even develop user stories" is not actively hostile. It is, quite literally, the status quo with the current state.
Great. We'll wait for it.
To be perfectly clear, the scope of this is small. Hypothetical discussions about what could or could not (or should or should not) be done about Firecracker, Kata, and other container drivers with different isolation models are a waste of time and energy for everyone involved until whatever time the support is actually in the daemon, exposed in the API, and widely available enough that there's an actual user request for fulfilling it. Kata/Firecracker need to start some container just for coherency with what the k8s domain model says should be there, but "these other container drivers which aren't in a state where cri-dockerd could meaningfully consume them anyway don't use namespaces/cgroups for isolation" is both true and meaningless at this point in time. In the meantime, the questions are narrow:
If we actually get back some …
The …
Thanks, @corhere. Getting an …
I agree that cri-dockerd should not have a mapping, and I just hacked something together for a demo. But in my example I created a Pod …
That leads to the error that the handler is not valid, @corhere.
That's unfortunate. I misread the Kubernetes docs and overlooked that dots are disallowed in handler strings. So there will need to be a way to configure aliases for shimv2 runtimes in the daemon to unblock this. I've opened moby/moby#45030 to track any necessary daemon-side work.
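To make the constraint and the proposed fix concrete (the runtime names are illustrative, and the daemon-side config shape is whatever moby/moby#45030 settles on):

```bash
# A RuntimeClass handler must be a DNS-1123 label, so a dotted shim
# name like "io.containerd.spin.v1" is rejected, while "spin" is fine:
kubectl apply -f - <<'EOF'
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: spin
handler: spin
EOF

# The daemon then needs to map the "spin" alias to the actual shim,
# e.g. via an /etc/docker/daemon.json entry along these lines:
#   "runtimes": { "spin": { "runtimeType": "io.containerd.spin.v1" } }
```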
That is a blocker, yes. The other blocker is "other than a single query for discussion six months ago, there was no work done on this, because the amount of development time for it is small, and 'nice to have' questions for discussion about eventual implementation with no concrete use case don't get prioritized." Patches welcome. When the daemon-side work is done would be a good time. Not to put too fine a point on it, but the speed of development here is often driven by the fact that there's only a very small number of people writing patches, and the discrete needs (rather than asks, mostly) tend to get the focus. moby/moby#45032 got out really fast (thanks!), but it's a poor solution for us, even after moby/moby#45032 is applied. It's true that that more or less matches the containerd experience, and it's a fine starting point, but it's also true that we have room to do a little bit better even by surfacing …
The daemon runtimes config is surfaced to the client at `/info`:

```
# curl -s --unix-socket /var/run/docker.sock http://./v1.42/info | jq .Runtimes
{
  "crun": {
    "runtimeType": "io.containerd.runc.v2",
    "options": {
      "BinaryName": "/usr/local/bin/crun"
    }
  },
  "runc": {
    "path": "runc"
  }
}
```
Yes, but speaking of potential RPC overhead, it's not exactly GraphQL, and there are a lot of fields to serialize/deserialize if we potentially hit that every single time a container is scheduled with some custom … Or take the risk of caching it aside in …
Any update? `default-runtime` is …
Docker recently started supporting alternative runtimes with the `--runtime` flag. This allows users to use alternative runtimes to runc, such as Kata (which I'd like to use on some pods due to the additional benefits of kernel isolation on external-facing applications).
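For example, with Kata registered in the daemon's configuration under a name like `kata` (name illustrative), a single container can opt into VM-backed isolation:

```bash
# Run one container under the Kata runtime instead of the default runc;
# inside a Kata container, uname reports the guest VM's kernel.
docker run --rm --runtime kata alpine uname -r
```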
Additionally, cri-dockerd support was just merged into K3s, allowing users to continue to use Docker after upgrading to Kubernetes 1.24 and above.
Unfortunately, those who choose to keep using Docker with Kubernetes will be constrained by the lack of alternative runtime support, meaning the native Kubernetes mechanism in the form of RuntimeClasses is made redundant, as the CRI layer is the bottleneck.
I think adding support for passing through the requested runtime should not be too difficult, so I was wondering if it's something that could be implemented?