Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Host Network Attachment Definition #192

Open
wants to merge 4 commits into
base: main
Choose a base branch
from

Conversation

SamD2021
Copy link

TL;DR

This PR makes some changes to how we handle Network Attachment Definitions (NAD) in the DPU Operator. The goal is to centralize the NAD setup, and automate the deployment of the NAD for the Host. I'm also hoping to discuss if these changes should be moved to the daemon so we can support auto mode as well.

What’s Changed

Host NAD YAML:

Added a manifest for Host NAD to centralize its configuration in the operator itself.
The namespace follows what we use in Cluster Deployment Automation (CDA), so the default resources get nad properly.

NAD Reconciliation:

Changed the check for NAD to ensure both Host and DPU mode have deployed its NADs while skipping it for Auto mode.

Auto Mode Handling:

If the configuration is set to auto, the operator will skip NAD deployment since the controller can’t detect the hardware in this mode. This keeps things cleaner and avoids unnecessary errors.

Why These Changes Matter

By centralizing the Host NAD setup and improving how we handle reconciliation, this makes the operator more reliable and lessens the burden of configuration from the user. I also hope to have a discussion about whether NAD management should be the job of the controller or the daemon, since the latter is the one that can auto detect the hardware.

root added 3 commits October 21, 2024 11:44
…t definition

By including the Host nad yaml file we are able to centralize the NAD
setup wihtin the operator itself

Currently the Namespace follows the CDA implementation where it uses the
Default Namespace
…oNAD to include host setup as well

- The idea is to move the check from outside the nad and always
  reconcile the NAD and apply the appropriate manifest but currently the
auto mode can't be used to apply the Host NAD since it requires the
daemon so we keep a check to skip it.
- This implementation has problems with auto since this mode relies on
  the daemon, which opens a bigger discussion on whether managing the
NAD is the job of the daemon or the controller.
We should skip deploying the NAD ourselves if the configuration spec
mode is `auto`, since this implementation can't auto detect the hardware in the
controller.
Copy link

openshift-ci bot commented Oct 21, 2024

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: SamD2021
Once this PR has been reviewed and has the lgtm label, please assign vrindle for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Oct 21, 2024
Copy link

openshift-ci bot commented Oct 21, 2024

Hi @SamD2021. Thanks for your PR.

I'm waiting for a openshift member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@@ -99,14 +99,13 @@ func (r *DpuOperatorConfigReconciler) Reconcile(ctx context.Context, req ctrl.Re
return ctrl.Result{}, err
}

if dpuOperatorConfig.Spec.Mode == "dpu" {
if dpuOperatorConfig.Spec.Mode != "auto" {
Copy link
Contributor

@SalDaniele SalDaniele Oct 21, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bn222 @wizhaoredhat discussed this with Sam. Should we move creating the net-attach-def for the host to the daemon as well?

The issue being that the top level operator does not have access to the platform info to detect whether it should create a NAD for the host-side or dpu-side, other than the dpu operator config spec.

However I don't think we can rely on this spec in the case we auto detect mode.

I know Balazs is working on a refactor to the CRs, so maybe this will clear this point up.

@SalDaniele
Copy link
Contributor

As part of this PR I think we should remove creating the nad in CDA, and add a bump to CDA as a commit to make sure e2e tests are still working.

As part of bringing Host DPU NAD creation into the DPU-Operator,we need
to also remove the steps where the cda creates the host dpu nad.
@openshift-merge-robot openshift-merge-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 21, 2024
@openshift-merge-robot
Copy link

PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

"name": "dpu-cni",
"ipam": {
"type": "host-local",
"subnet": "10.56.217.0/24"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know how useful this change is for the host is.

  1. As you noticed "ipam" can be configured by the nad. The IP Address "10.56.217.0/24"" is not relevant for most if not all customer setups. Yes it is relevant for CDA, but we aren't building this for CDA.
  2. Also there are different types of IPAMs
  3. dpu-cni could configure parameters like MTU on the host.
  4. In short, the reason why I only did the NAD for the DPU is that it is assumed the DPU's Network Functions will be created by the operator only. But the host NAD would be consumed by workload pods that are outside our control.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

host NAD will be created by the operator as a consequence of creating a DpuNetwork CR

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This means that the DpuNetwork CR must contain IPAM settings for the NAD in the future.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants