Support device mapping in kube play similar do Kubernetes device plugins #17833

rhatdan · 2023-03-17T15:57:22Z

Discussed in #14934

^{Originally posted by bachp July 14, 2022}
Kubernetes provides ways to map devices into pods and containers.

One way is to map devices into containers via volumes but this requires privileged containers.

The more flexible way to add devices tvia Kubernetes Device Plugins

It abstracts the provisioning and mapping of devices to containers. So when specifying a Pod it's only needed to add an abstract name of the hardware, e.g. hardware-vendor.example/foo and the amount of it. The device plugin will then take care of making this available to containers. In YAML this will look like:

resources:
        limits:
          hardware-vendor.example/foo: 2

For podman kube play it would be useful to also support this resources.limits to map devices. As this would allow to deploy the same pod spec on both Kubernetes and podman and it would be transparent to the user.

However I don't think it makes sense to implement the full device plugin interface as it assumes a daemon (kubelet) running.

So for podman I think a better option is to implement a kind of hook mechanism that would get resource.limits and outputs the required --devices parameters, that would be applied before the pod is run.

The text was updated successfully, but these errors were encountered:

rhatdan · 2023-03-17T15:57:44Z

@umohnani8 PTAL

github-actions · 2023-04-17T00:06:38Z

A friendly reminder that this issue had no activity for 30 days.

andrew-kennedy · 2023-05-14T09:00:54Z

just want to note that this support would be welcomed, I ran into this issue as noted here: #18266

fpoirotte · 2023-06-14T17:04:13Z

Hello,

I would like to voice support in favor of this feature as well.

(damn, I wish I knew enough about Go and podman's codebase to implement this 😢)

rhatdan · 2023-06-20T18:38:28Z

@ygalblum @umohnani8 Anyone else PTAL
@fpoirotte interested in opening a PR?

bblenard · 2023-11-07T04:06:45Z

I'm trying to understand this issue to see if I would be up to the task and I have a few questions and I want to also make sure I understand what is going on here. Basically we want to make podman kube play .... sensitive to the spec.containers[*].resources.limits entry in a Podspec and automatically apply the devices listed there in the same way that running podman run --device .... would?

One point of confusion I have is that in the original comment it says:

One way is to map devices into containers via volumes but this requires privileged containers.

Is this accurate or does it actually depend on the device itself and whether or not elevated permissions are required to interact with it. When tracing through the code I wasn't really seeing that the --device option was handled much differently than a normal --volume. In fact unless I'm mistaken --device is really just a prebaked --volume with some default options specified for the user (based off of pkg/specgen/generate/config_linux.go)

rhatdan · 2023-11-08T21:54:04Z

--device works much differently in rootless and rootful containers. In rootful containers, it actually creates the device within the containers mountspace with the correct major/minor number and labels the new device with the correct SELinux label. This requires CAP_SYS_MKNOD so it only available to rootful containers.

In rootless containers we just bind mount the device from the host into the container, which is all we can. In some cases the device has an SELinux label, that would prevent it from being used in the container, so you need to disable SELinux use of the device. So in rootless mode there is little different from --device /dev/foobar and --volume /dev/foobar;/dev/foobar

In certain cases the UID/GID access to the device is not available to processes inside of the container. For example if you have group access based on being in the eng group. A container will not be in the eng group, since in rootless mode, there is a user namespace that does not include eng. In this case you can leak the eng group access into the container using the --group-add keep-groups flag.

bblenard · 2023-11-09T03:09:48Z

Okay I see. Here is how I understand this issue so far and I'm still tracing through the code so let me know if I am missing something as I explain my thoughts.

It seems to me that for this change I wouldn't actually need to worry too much about how the devices are setup by the underlying runtimes and instead the task mostly just consists of:

Update how Podman understands the K8s PodSpec.Containers[*].Resources.Limits so that a k8s device plugin like name can be recognized
When handling handling the various K8s play stuff (Pods, Deployments, Daemonsets) make sure the Device resources in the updated PodSpec get passed into the SpecGen code before it is passed to the underlying runtime / generate.MakeContainer

As long as the information correctly makes it into the SpecGen I think all of the privileged/nonprivileged dev mapping magic is all taken care of for me already

My next question would be since the K8s device plugins work by registering a name to kubelet and there is the whole notion of kubelet + the rpc plugin provider that podman does not have:

How should the limits specified in the podspec be matched to devices on the system? It looks like the generic-device-plugin seems to run with some CLI flags so maybe we provide a similar configuration in one of the podman config files?

bblenard · 2023-11-14T02:07:42Z

I started to try to implement this but I do think I need some clarification on my last question. Without that information I feel that it will be hard to modify the ResourceLimits in a way that makes sense.

In order to hit the

For podman kube play it would be useful to also support this resources.limits to map devices. As this would allow to deploy the same pod spec on both Kubernetes and podman and it would be transparent to the user.

part of this issue the resource limit type, specifically the resource name part, will need to carry extra meaning so that something like

hardware-vendor.nvidia/gpus: 2

gives the podman runtime enough information when generating the container spec to resolve the name to devices on the system. My thought would be to allow the podman user to define a device map of some sort in the system podman configuration so that hardware-vendor.nvidia/gpus -> /dev/nvidia* (maybe video cards aren't a great example but...)

Also from what I understand podman's configuration files are parsed with code from other repositories (https://github.com/containers/podman/blob/9c954739e9555c0940238f71ba3cc205deaa0e5e/docs/tutorials/rootless_tutorial.md?plain=1#L139C58-L139C94) so that to me means that in order to add the extra device mapping information to podman the configuration file structure would have to be modified in those other repos first

Maybe I'm way off here but thats why I'm seeking clarification :)

bachp · 2023-12-21T14:08:34Z

@bblenard I think the CDI configuration would contain the necessary info to map the name to the device.

So for example, if we use CDI support from Nvidia

resources:
  limits:
    nvidia.com/gpu=0:1

What's not yet entirely clear to me is how the CDI devices are reflected in the limits: section. As there seems to be an indirection via the device plugin. At least this is how I understand KEP-4009: Add CDI devices to device plugin API.
So I think the CDI name is not directly given but the kublet does a request for a a device type, e.g. nvidia.com/gpu: 2 and the device plugin would then return two unique CDI devices (e.g. nvidia.com/gpu=0 and nvidia.com/gpu=1).

If we say that we allow to directly specify the unique name, as in the example above, we can map it 1:1 to the --device parameter. I'm not sure if this leads to compatibility problems with Kubernetes tough.

bblenard · 2024-01-03T23:01:06Z

@bachp -- I need to familiarize myself with the things you referenced, just letting you know that I saw your message ( finally ;) )

bblenard · 2024-01-11T03:20:33Z

Okay so update with what I currently understand about how these things relate. As @bachp pointed out the CDI seemingly contains all the information the runtime would need to take a CDI kind(?) and map it to a device path on the system. I also think I see what @bachp is saying with it not quite fitting into the limits section.

I could be wrong but it looks like podman's type for PodSpec.Containers[*].Resources has diverged from Kubernetes' Container.Resources type. Kubernetes' ResourceClaim has an addition field that seems to contain the "name" described in the KEP-4009 referenced above.

I've tried to confirm my understanding by tracing how I believe Kubernetes handles the CDIDevice message that eventually gets passed to the container runtime.

So currently I would potentially purpose re-syncing Podman's container.resources type to match Kubernetes and then pass that down to the podspec similar to how I mentioned it earlier. Hopefully that makes sense :)

I'm not sure if additional support needs to be added so podman can handle CDI things (parsing the config under /etc/cdi/ / /var/run/cdi) but the Nvidia documentation @bachp linked indicates that podman can handle it.

phew... with all that being said I'd love some input on all that

bachp · 2024-01-11T08:52:56Z

@bblenard Podman already supports CDI device. You can pass a CDI device name via the --device command line flag. I tested this with both the nvidia and my own CDI passing in a TTY device.

It also works in podman-compose via the devices section:

devices:
  - nvidia.com/gpu=all

bblenard · 2024-01-25T02:09:47Z

I'm gonna at @rhatdan b/c he is one of the Containers folks that had some activity on this issue previously. Before I dig in and start on this issue properly I just want to make sure someone official is able to add their 2 cents.

Now that we've sorted out the podman support side and the major question (as I see it) is just how implement this into the Podman Resource Spec. Assuming I understand everything correctly I figure we can either:

Grab just the fields we need from the K8s ResourceClaim data structure and mirror it in the Podman version (probably the easiest unless we have good test coverage in that part of the codebase)
Update the entire Resource data structure to match up with the current K8s version of it (more "unrelated" work but potentially the right way to do it... I'm not sure)

rhatdan · 2024-02-03T12:35:33Z

This SGTM.

bblenard · 2024-02-08T03:22:06Z

Okay I have a way forward I think. I'll give this feature my best shot :)

bblenard · 2024-03-07T02:58:12Z

Update:

Still working on this just have been busy lately :)

I have code that I think works in a way that makes sense, I want to clean it up a bit though. Once its in a halfway decent state I'll probably push it up to my fork of podman.

bachp · 2024-03-12T08:17:30Z

@bblenard If you like to push something early I would love to review and give some feedback.

bblenard · 2024-03-13T03:16:08Z

@bachp -- I added some code here: https://github.com/bblenard/podman/tree/issues-17833-draft

The bulk of the code I wrote is here: pkg/k8s.io/api/resource/dynamic_resources_podman_types.go

My idea was to make the podman kube play feel like what I understand Kubernetes will be doing with dynamic device stuff based on KEP-4009 and KEP-3063. Obviously the differences here are that Kubernetes has a whole controller ecosystem that would typically handle the details when K8s ingests those types whereas Podman does not.

My "work around" to this was to add an additional argument to the playKube* functions (dynamicDeviceManager), this type is defined in pkg/k8s.io/api/resource/dynamic_resources_podman_types.go and is basically a bare bones struct that can tie these K8s types together that would normally be tracked by a resource controller. Other than that once the ResourceClaims are resolved to either a simple device path /dev/.... or a cdi device nvidia.com/gpu=1 those devices are just injected into the specGen.Devices field where podman handles them how it does with any other --device argument (unless I made some mistake ;) )

Let me know what you think. I was finding it hard to keep things straight in my mind while reviewing KEP-3063 especially following the YAML examples.

PS: One big thing I have to do still is write tests but I have to figure that out still

CircuitCipher · 2024-11-10T03:59:15Z

Any planned movement on this? I recently started down the path of setting up a JellyFin server with HW acceleration (via nvidia and CDI) and prefer the descriptiveness of the kube files for my deployments. Interesting that --device is easily accessible via run commands but functionality is missing in kube. Anywhere extra hands can used? Or are we just waiting to hear on some form of direction?

robertgzr · 2024-11-19T13:19:04Z

@bblenard i just took a look at your code and rebased it onto main feel free to just take the commits from here: https://github.com/robertgzr/podman/tree/issues-17833-draft--main

if i understand correctly, you implemented support for DRA as documented here: https://kubernetes.io/docs/concepts/scheduling-eviction/dynamic-resource-allocation/, which looks like this when expressed in YAML:

apiVersion: simpledevice.resource.podman.io/v1
kind: ResourceClaimParameters
metadata:
  name: kmsg-parameters
spec:
  hostpath: /dev/kmsg  
---
apiVersion: resource.k8s.io/v1alpha2
kind: ResourceClaimTemplate
metadata:
  name: kmsg-template
spec:
  resourceClassName: PodmanResourceClass
  parametersRef: kmsg-parameters
---
apiVersion: v1
kind: Pod
[...]
resources:
  claims:
  - name: kmsg
resourceClaims:
- name: kmsg
  resourceClaimTemplateName: kmsg-template

That is quite verbose compared to what I imagined (using a podman-specific CDI string):

resources:
  limits:
    io.podman.device/kmsg: 1

the bits that support the configuration as shared by @bachp #17833 (comment) are also still missing, correct?

bachp · 2024-12-19T08:42:52Z

As a workaround for this and future similar cases, would it be acceptable to have a podman specific annotation that would allow to specify argument directly to podman? Basically, an annotation that could be set and allows to set parameters like --device directly.

This would provide an escape hatch until features are properly implemented in the yaml.

robertgzr · 2025-01-14T13:12:03Z

@rhatdan @bachp i gave this another shot. my pr tries to be more pragmatic and omits the resourceClaim implementation in favor of relying on the existing support for handling CDI names. ptal :)

#25010

rhatdan added the kube label Mar 17, 2023

github-actions bot added the stale-issue label Apr 17, 2023

fpoirotte mentioned this issue Apr 28, 2023

Support CDI devices in containers.conf in podman play kube #18266

Closed

github-actions bot removed the stale-issue label May 15, 2023

rhatdan added the Good First Issue This issue would be a good issue for a first time contributor to undertake. label May 15, 2023

rhatdan added the volunteers-wanted Issues good for community/volunteer contributions label Jun 20, 2023

robertgzr mentioned this issue Jan 14, 2025

Add kube play support for CDI resource allocation #25010

Merged

openshift-merge-bot bot closed this as completed in #25010 Jan 14, 2025

openshift-merge-bot bot closed this as completed in 0d0a78c Jan 14, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Support device mapping in kube play similar do Kubernetes device plugins #17833

Support device mapping in kube play similar do Kubernetes device plugins #17833

rhatdan commented Mar 17, 2023

rhatdan commented Mar 17, 2023

github-actions bot commented Apr 17, 2023

andrew-kennedy commented May 14, 2023

fpoirotte commented Jun 14, 2023

rhatdan commented Jun 20, 2023

bblenard commented Nov 7, 2023 •

edited

Loading

rhatdan commented Nov 8, 2023

bblenard commented Nov 9, 2023

bblenard commented Nov 14, 2023

bachp commented Dec 21, 2023

bblenard commented Jan 3, 2024

bblenard commented Jan 11, 2024

bachp commented Jan 11, 2024

bblenard commented Jan 25, 2024 •

edited

Loading

rhatdan commented Feb 3, 2024

bblenard commented Feb 8, 2024

bblenard commented Mar 7, 2024

bachp commented Mar 12, 2024

bblenard commented Mar 13, 2024

CircuitCipher commented Nov 10, 2024

robertgzr commented Nov 19, 2024 •

edited

Loading

bachp commented Dec 19, 2024

robertgzr commented Jan 14, 2025 •

edited

Loading

Support device mapping in kube play similar do Kubernetes device plugins #17833

Support device mapping in kube play similar do Kubernetes device plugins #17833

Comments

rhatdan commented Mar 17, 2023

Discussed in #14934

rhatdan commented Mar 17, 2023

github-actions bot commented Apr 17, 2023

andrew-kennedy commented May 14, 2023

fpoirotte commented Jun 14, 2023

rhatdan commented Jun 20, 2023

bblenard commented Nov 7, 2023 • edited Loading

rhatdan commented Nov 8, 2023

bblenard commented Nov 9, 2023

bblenard commented Nov 14, 2023

bachp commented Dec 21, 2023

bblenard commented Jan 3, 2024

bblenard commented Jan 11, 2024

bachp commented Jan 11, 2024

bblenard commented Jan 25, 2024 • edited Loading

rhatdan commented Feb 3, 2024

bblenard commented Feb 8, 2024

bblenard commented Mar 7, 2024

bachp commented Mar 12, 2024

bblenard commented Mar 13, 2024

CircuitCipher commented Nov 10, 2024

robertgzr commented Nov 19, 2024 • edited Loading

bachp commented Dec 19, 2024

robertgzr commented Jan 14, 2025 • edited Loading

bblenard commented Nov 7, 2023 •

edited

Loading

bblenard commented Jan 25, 2024 •

edited

Loading

robertgzr commented Nov 19, 2024 •

edited

Loading

robertgzr commented Jan 14, 2025 •

edited

Loading