Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support device mapping in kube play similar do Kubernetes device plugins #17833

Closed
rhatdan opened this issue Mar 17, 2023 Discussed in #14934 · 23 comments · Fixed by #25010
Closed

Support device mapping in kube play similar do Kubernetes device plugins #17833

rhatdan opened this issue Mar 17, 2023 Discussed in #14934 · 23 comments · Fixed by #25010
Labels
Good First Issue This issue would be a good issue for a first time contributor to undertake. kube volunteers-wanted Issues good for community/volunteer contributions

Comments

@rhatdan
Copy link
Member

rhatdan commented Mar 17, 2023

Discussed in #14934

Originally posted by bachp July 14, 2022
Kubernetes provides ways to map devices into pods and containers.

One way is to map devices into containers via volumes but this requires privileged containers.

The more flexible way to add devices tvia Kubernetes Device Plugins

It abstracts the provisioning and mapping of devices to containers. So when specifying a Pod it's only needed to add an abstract name of the hardware, e.g. hardware-vendor.example/foo and the amount of it. The device plugin will then take care of making this available to containers. In YAML this will look like:

resources:
        limits:
          hardware-vendor.example/foo: 2

For podman kube play it would be useful to also support this resources.limits to map devices. As this would allow to deploy the same pod spec on both Kubernetes and podman and it would be transparent to the user.

However I don't think it makes sense to implement the full device plugin interface as it assumes a daemon (kubelet) running.

So for podman I think a better option is to implement a kind of hook mechanism that would get resource.limits and outputs the required --devices parameters, that would be applied before the pod is run.

@rhatdan rhatdan added the kube label Mar 17, 2023
@rhatdan
Copy link
Member Author

rhatdan commented Mar 17, 2023

@umohnani8 PTAL

@github-actions
Copy link

A friendly reminder that this issue had no activity for 30 days.

@andrew-kennedy
Copy link

just want to note that this support would be welcomed, I ran into this issue as noted here: #18266

@rhatdan rhatdan added the Good First Issue This issue would be a good issue for a first time contributor to undertake. label May 15, 2023
@fpoirotte
Copy link
Contributor

Hello,

I would like to voice support in favor of this feature as well.

(damn, I wish I knew enough about Go and podman's codebase to implement this 😢)

@rhatdan
Copy link
Member Author

rhatdan commented Jun 20, 2023

@ygalblum @umohnani8 Anyone else PTAL
@fpoirotte interested in opening a PR?

@rhatdan rhatdan added the volunteers-wanted Issues good for community/volunteer contributions label Jun 20, 2023
@bblenard
Copy link

bblenard commented Nov 7, 2023

I'm trying to understand this issue to see if I would be up to the task and I have a few questions and I want to also make sure I understand what is going on here. Basically we want to make podman kube play .... sensitive to the spec.containers[*].resources.limits entry in a Podspec and automatically apply the devices listed there in the same way that running podman run --device .... would?

One point of confusion I have is that in the original comment it says:

One way is to map devices into containers via volumes but this requires privileged containers.

Is this accurate or does it actually depend on the device itself and whether or not elevated permissions are required to interact with it. When tracing through the code I wasn't really seeing that the --device option was handled much differently than a normal --volume. In fact unless I'm mistaken --device is really just a prebaked --volume with some default options specified for the user (based off of pkg/specgen/generate/config_linux.go)

@rhatdan
Copy link
Member Author

rhatdan commented Nov 8, 2023

--device works much differently in rootless and rootful containers. In rootful containers, it actually creates the device within the containers mountspace with the correct major/minor number and labels the new device with the correct SELinux label. This requires CAP_SYS_MKNOD so it only available to rootful containers.

In rootless containers we just bind mount the device from the host into the container, which is all we can. In some cases the device has an SELinux label, that would prevent it from being used in the container, so you need to disable SELinux use of the device. So in rootless mode there is little different from --device /dev/foobar and --volume /dev/foobar;/dev/foobar

In certain cases the UID/GID access to the device is not available to processes inside of the container. For example if you have group access based on being in the eng group. A container will not be in the eng group, since in rootless mode, there is a user namespace that does not include eng. In this case you can leak the eng group access into the container using the --group-add keep-groups flag.

@bblenard
Copy link

bblenard commented Nov 9, 2023

Okay I see. Here is how I understand this issue so far and I'm still tracing through the code so let me know if I am missing something as I explain my thoughts.

It seems to me that for this change I wouldn't actually need to worry too much about how the devices are setup by the underlying runtimes and instead the task mostly just consists of:

  1. Update how Podman understands the K8s PodSpec.Containers[*].Resources.Limits so that a k8s device plugin like name can be recognized
  2. When handling handling the various K8s play stuff (Pods, Deployments, Daemonsets) make sure the Device resources in the updated PodSpec get passed into the SpecGen code before it is passed to the underlying runtime / generate.MakeContainer

As long as the information correctly makes it into the SpecGen I think all of the privileged/nonprivileged dev mapping magic is all taken care of for me already

My next question would be since the K8s device plugins work by registering a name to kubelet and there is the whole notion of kubelet + the rpc plugin provider that podman does not have:

  • How should the limits specified in the podspec be matched to devices on the system? It looks like the generic-device-plugin seems to run with some CLI flags so maybe we provide a similar configuration in one of the podman config files?

@bblenard
Copy link

I started to try to implement this but I do think I need some clarification on my last question. Without that information I feel that it will be hard to modify the ResourceLimits in a way that makes sense.

In order to hit the

For podman kube play it would be useful to also support this resources.limits to map devices. As this would allow to deploy the same pod spec on both Kubernetes and podman and it would be transparent to the user.

part of this issue the resource limit type, specifically the resource name part, will need to carry extra meaning so that something like

hardware-vendor.nvidia/gpus: 2

gives the podman runtime enough information when generating the container spec to resolve the name to devices on the system. My thought would be to allow the podman user to define a device map of some sort in the system podman configuration so that hardware-vendor.nvidia/gpus -> /dev/nvidia* (maybe video cards aren't a great example but...)

Also from what I understand podman's configuration files are parsed with code from other repositories (https://github.com/containers/podman/blob/9c954739e9555c0940238f71ba3cc205deaa0e5e/docs/tutorials/rootless_tutorial.md?plain=1#L139C58-L139C94) so that to me means that in order to add the extra device mapping information to podman the configuration file structure would have to be modified in those other repos first

Maybe I'm way off here but thats why I'm seeking clarification :)

@bachp
Copy link

bachp commented Dec 21, 2023

@bblenard I think the CDI configuration would contain the necessary info to map the name to the device.

So for example, if we use CDI support from Nvidia

resources:
  limits:
    nvidia.com/gpu=0:1

What's not yet entirely clear to me is how the CDI devices are reflected in the limits: section. As there seems to be an indirection via the device plugin. At least this is how I understand KEP-4009: Add CDI devices to device plugin API.
So I think the CDI name is not directly given but the kublet does a request for a a device type, e.g. nvidia.com/gpu: 2 and the device plugin would then return two unique CDI devices (e.g. nvidia.com/gpu=0 and nvidia.com/gpu=1).

If we say that we allow to directly specify the unique name, as in the example above, we can map it 1:1 to the --device parameter. I'm not sure if this leads to compatibility problems with Kubernetes tough.

@bblenard
Copy link

bblenard commented Jan 3, 2024

@bachp -- I need to familiarize myself with the things you referenced, just letting you know that I saw your message ( finally ;) )

@bblenard
Copy link

Okay so update with what I currently understand about how these things relate. As @bachp pointed out the CDI seemingly contains all the information the runtime would need to take a CDI kind(?) and map it to a device path on the system. I also think I see what @bachp is saying with it not quite fitting into the limits section.

I could be wrong but it looks like podman's type for PodSpec.Containers[*].Resources has diverged from Kubernetes' Container.Resources type. Kubernetes' ResourceClaim has an addition field that seems to contain the "name" described in the KEP-4009 referenced above.

I've tried to confirm my understanding by tracing how I believe Kubernetes handles the CDIDevice message that eventually gets passed to the container runtime.

So currently I would potentially purpose re-syncing Podman's container.resources type to match Kubernetes and then pass that down to the podspec similar to how I mentioned it earlier. Hopefully that makes sense :)

I'm not sure if additional support needs to be added so podman can handle CDI things (parsing the config under /etc/cdi/ / /var/run/cdi) but the Nvidia documentation @bachp linked indicates that podman can handle it.

phew... with all that being said I'd love some input on all that

@bachp
Copy link

bachp commented Jan 11, 2024

@bblenard Podman already supports CDI device. You can pass a CDI device name via the --device command line flag. I tested this with both the nvidia and my own CDI passing in a TTY device.

It also works in podman-compose via the devices section:

devices:
  - nvidia.com/gpu=all

@bblenard
Copy link

bblenard commented Jan 25, 2024

I'm gonna at @rhatdan b/c he is one of the Containers folks that had some activity on this issue previously. Before I dig in and start on this issue properly I just want to make sure someone official is able to add their 2 cents.

Now that we've sorted out the podman support side and the major question (as I see it) is just how implement this into the Podman Resource Spec. Assuming I understand everything correctly I figure we can either:

  1. Grab just the fields we need from the K8s ResourceClaim data structure and mirror it in the Podman version (probably the easiest unless we have good test coverage in that part of the codebase)
  2. Update the entire Resource data structure to match up with the current K8s version of it (more "unrelated" work but potentially the right way to do it... I'm not sure)

@rhatdan
Copy link
Member Author

rhatdan commented Feb 3, 2024

This SGTM.

@bblenard
Copy link

bblenard commented Feb 8, 2024

Okay I have a way forward I think. I'll give this feature my best shot :)

@bblenard
Copy link

bblenard commented Mar 7, 2024

Update:

Still working on this just have been busy lately :)

I have code that I think works in a way that makes sense, I want to clean it up a bit though. Once its in a halfway decent state I'll probably push it up to my fork of podman.

@bachp
Copy link

bachp commented Mar 12, 2024

@bblenard If you like to push something early I would love to review and give some feedback.

@bblenard
Copy link

@bachp -- I added some code here: https://github.com/bblenard/podman/tree/issues-17833-draft

The bulk of the code I wrote is here: pkg/k8s.io/api/resource/dynamic_resources_podman_types.go

My idea was to make the podman kube play feel like what I understand Kubernetes will be doing with dynamic device stuff based on KEP-4009 and KEP-3063. Obviously the differences here are that Kubernetes has a whole controller ecosystem that would typically handle the details when K8s ingests those types whereas Podman does not.

My "work around" to this was to add an additional argument to the playKube* functions (dynamicDeviceManager), this type is defined in pkg/k8s.io/api/resource/dynamic_resources_podman_types.go and is basically a bare bones struct that can tie these K8s types together that would normally be tracked by a resource controller. Other than that once the ResourceClaims are resolved to either a simple device path /dev/.... or a cdi device nvidia.com/gpu=1 those devices are just injected into the specGen.Devices field where podman handles them how it does with any other --device argument (unless I made some mistake ;) )

Let me know what you think. I was finding it hard to keep things straight in my mind while reviewing KEP-3063 especially following the YAML examples.

PS: One big thing I have to do still is write tests but I have to figure that out still

@CircuitCipher
Copy link

Any planned movement on this? I recently started down the path of setting up a JellyFin server with HW acceleration (via nvidia and CDI) and prefer the descriptiveness of the kube files for my deployments. Interesting that --device is easily accessible via run commands but functionality is missing in kube. Anywhere extra hands can used? Or are we just waiting to hear on some form of direction?

@robertgzr
Copy link
Contributor

robertgzr commented Nov 19, 2024

@bblenard i just took a look at your code and rebased it onto main feel free to just take the commits from here: https://github.com/robertgzr/podman/tree/issues-17833-draft--main

if i understand correctly, you implemented support for DRA as documented here: https://kubernetes.io/docs/concepts/scheduling-eviction/dynamic-resource-allocation/, which looks like this when expressed in YAML:

apiVersion: simpledevice.resource.podman.io/v1
kind: ResourceClaimParameters
metadata:
  name: kmsg-parameters
spec:
  hostpath: /dev/kmsg  
---
apiVersion: resource.k8s.io/v1alpha2
kind: ResourceClaimTemplate
metadata:
  name: kmsg-template
spec:
  resourceClassName: PodmanResourceClass
  parametersRef: kmsg-parameters
---
apiVersion: v1
kind: Pod
[...]
resources:
  claims:
  - name: kmsg
resourceClaims:
- name: kmsg
  resourceClaimTemplateName: kmsg-template

That is quite verbose compared to what I imagined (using a podman-specific CDI string):

resources:
  limits:
    io.podman.device/kmsg: 1

the bits that support the configuration as shared by @bachp #17833 (comment) are also still missing, correct?

@bachp
Copy link

bachp commented Dec 19, 2024

As a workaround for this and future similar cases, would it be acceptable to have a podman specific annotation that would allow to specify argument directly to podman? Basically, an annotation that could be set and allows to set parameters like --device directly.

This would provide an escape hatch until features are properly implemented in the yaml.

@robertgzr
Copy link
Contributor

robertgzr commented Jan 14, 2025

@rhatdan @bachp i gave this another shot. my pr tries to be more pragmatic and omits the resourceClaim implementation in favor of relying on the existing support for handling CDI names. ptal :)

#25010

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Good First Issue This issue would be a good issue for a first time contributor to undertake. kube volunteers-wanted Issues good for community/volunteer contributions
Projects
None yet
Development

Successfully merging a pull request may close this issue.

7 participants