pass through hostdevice by pcie path #11815
/cc @vladikr @victortoso
Attempted to work around it with a hook sidecar. I think there needs to be another hook for modifying the pod, but I'm not even sure what to add to the pod.
I implemented my own device plugin now that passes … The same host works fine with … Related issues have been closed without resolution, so I can't figure out what the underlying issue is.
Finally figured it out. The plugin is here: https://github.com/kraudcloud/vf-device-plugin, but I don't think this can be upstreamed. In order to distinguish devices by path, I had to create a k8s resource for each one of them. The request to the plugin already contains a pick; the plugin doesn't have any influence on which one is chosen. This seems like a pretty bad design from k8s itself. There's also no way to pass any annotations from the pod to the device plugin, so you can't even do any preparations on behalf of the pod. It could probably be upstreamed by making it less specific to Ethernet, i.e. literally creating a resource per path, as I originally posted, but I'm not sure that's really all that useful. For storage we will instead create yet another plugin that creates a k8s resource for each chassis bay, instead of hardcoding all the PCIe paths.
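For illustration, this is roughly how a VM ends up consuming these per-path resources (the kraud.cloud/... resource names are just placeholders for whatever the external plugin advertises):

```yaml
# Sketch only: the deviceName values are hypothetical per-path resource names
# advertised by an external device plugin such as vf-device-plugin.
apiVersion: kubevirt.io/v1
kind: VirtualMachineInstance
metadata:
  name: storage-vm
spec:
  domain:
    resources:
      requests:
        memory: 2Gi
    devices:
      hostDevices:
      - name: nvme-bay-0
        deviceName: kraud.cloud/pci-0000-81-00-0   # one resource name per PCIe address
      - name: nvme-bay-1
        deviceName: kraud.cloud/pci-0000-82-00-0
```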
@aep yes, this is a limitation that we can overcome with DRA. We have a research project which will investigate the integration between DRA and KubeVirt: kubevirt/community#254. However, this won't happen soon. You can try having a look at the Akri project; they should have solved the same problem. Unfortunately, I'm not very familiar with the project, and it isn't directly compatible with KubeVirt because of environment variables. AFAIU, they also create a single device resource name per device in order to identify each individual device and avoid this random assignment.
Thanks for the input. Do you think I should prepare this for upstreaming into KubeVirt (pciPathSelector, create one resource per PCIe path in the config), or would it be rejected anyway because DRA is the better long-term solution?
Hard to say. DRA doesn't directly depend on us. It might still make sense to add it.
@aep if you can, you could attend the community meeting on Wednesday and present the problem. I think it will be the fastest way to receive feedback.
...
That's my understanding too. Each device has an ID, and the kubelet's device manager will request that ID. If you have two NVMe devices on the same resource name, we can't guarantee which one is requested. The solution using device plugins really comes down to more specific selectors, but that ends up being a worse experience for the user, as the admin will need to populate them (e.g. by path or some other unique metadata). To my knowledge, this should be solved by DRA, but that might take some time to be adopted in KubeVirt.
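To make the ambiguity concrete, here is a sketch of the current selector model, where one vendor:device ID maps to one resource name and every matching device on the node is treated as interchangeable (the vendor:device ID and resource name are only illustrative):

```yaml
# Current model (sketch): every NVMe controller matching this vendor:device ID
# is exposed under one resource name, so a VM asking for one of them can be
# handed any of the identical devices on the node.
apiVersion: kubevirt.io/v1
kind: KubeVirt
metadata:
  name: kubevirt
  namespace: kubevirt
spec:
  configuration:
    permittedHostDevices:
      pciHostDevices:
      - pciVendorSelector: "144D:A808"   # illustrative Samsung NVMe vendor:device ID
        resourceName: example.com/nvme   # all matching controllers share this name
```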
IMHO, yes. An optional selector that would solve your problem and not affect current use cases should be considered.
Network devices are handled correctly when Multus (CNI) is used. I do not know enough to comment regarding storage.
For storage we are completely missing this mapping, and it is a general problem for all PCI passthrough devices. But it is particularly relevant for storage since the devices definitely have a state and data :)
I believe network devices are NOT handled by Multus when trying to do PCIe passthrough in a KubeVirt VM ("type": "host-device" in the NetworkAttachmentDefinition), especially because there is no frontend to choose that allows it. In my particular case, my network devices do not support SR-IOV, so I am able to pass them to the VM with pciHostDevices, but every deployment gets the NICs in a different order, which makes it unusable.
Unfortunately I couldn't really figure out how the community meeting works; the sound was incredibly bad. Anyway, I still think the easiest solution is to just have a PCIe path selector in the config. It's clunky, but it will probably get most users around the issue until the proper solution lands in k8s.
Well, if you try to work with network devices without KubeVirt knowing it is a network device, then it is indeed the case that Multus or any other available mechanism is not involved. The solution in this case is most likely to create a custom network binding plugin [1] for yourself. Beyond creating the binding plugin, you will also need a DP (device plugin) and a CNI to reflect the data through Multus.
[1] https://github.com/kubevirt/kubevirt/blob/main/docs/network/network-binding-plugin.md
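For reference, registering such a binding plugin in the KubeVirt CR looks roughly like the sketch below; this is an assumption based on my reading of the linked doc, the plugin name, sidecar image, and NetworkAttachmentDefinition reference are placeholders, and the exact field layout may differ between versions:

```yaml
# Rough sketch, assuming the binding-plugin registration described in [1];
# "mybinding" and the image/NAD references are placeholders.
apiVersion: kubevirt.io/v1
kind: KubeVirt
metadata:
  name: kubevirt
  namespace: kubevirt
spec:
  configuration:
    network:
      binding:
        mybinding:
          sidecarImage: registry.example.com/my-binding-sidecar:latest
          networkAttachmentDefinition: default/my-host-device-net
```

If I read the doc correctly, a VMI would then select it per interface via spec.domain.devices.interfaces[].binding.name.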
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with /lifecycle stale
Stale issues rot after 30d of inactivity. If this issue is safe to close now please do so with /lifecycle rotten
Rotten issues close after 30d of inactivity. /close
@kubevirt-bot: Closing this issue. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
Currently a PCIe host device is selected by vendor/device ID.
This is fine if you only have one per host, but as soon as you have more than one it falls apart.
For example, two VMs, each with an NVMe device passed through, will suddenly swap storage devices at random.
We actually have NVMe-only nodes with up to 48 NVMe drives per host, as well as Mellanox cards with 32 VFs each.
I would suggest having something like the following:
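Something along these lines (pciPathSelector is a hypothetical new field, and the resource names are just examples):

```yaml
# Hypothetical sketch: pciPathSelector does not exist in KubeVirt today;
# it would pin each resource name to one specific PCIe address.
apiVersion: kubevirt.io/v1
kind: KubeVirt
metadata:
  name: kubevirt
  namespace: kubevirt
spec:
  configuration:
    permittedHostDevices:
      pciHostDevices:
      - pciPathSelector: "0000:81:00.0"        # proposed: match exactly this PCIe path
        resourceName: example.com/nvme-bay-0   # one resource per path
      - pciPathSelector: "0000:82:00.0"
        resourceName: example.com/nvme-bay-1
```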
It seems likely that this wasn't implemented because it makes very little sense for GPUs, which are fungible, and the hosts might not all have the same PCI layout. However, storage and L3 network devices are not fungible and really need to be mapped to specific VMs.