Update UDEV rules for 5 V100's in OpenShift Prod #735

Open
4 of 5 tasks
joachimweyl opened this issue Sep 17, 2024 · 9 comments
Labels
openshift This issue pertains to NERC OpenShift

Comments

@joachimweyl
Contributor

joachimweyl commented Sep 17, 2024

Motivation

5 of the 7 V100's are not working properly; they require additional UDEV rules to function. They have been cordoned until the proper rules are applied.

Completion Criteria

UDEV rules updated for these 5 V100's

Description

  • Provide a list of 5 V100s
  • Update UDEV rules
  • Test that they are working
  • Uncordon them
  • Merge PR during a Maintenance window

Completion dates

Desired - 2024-10-02
Required - TBD

@joachimweyl added the openshift label ("This issue pertains to NERC OpenShift") on Sep 17, 2024
@joachimweyl
Contributor Author

@msdisme

msdisme commented Sep 18, 2024

Goal is to do this as soon as possible. Identify blackout dates for courses/presentations. Identify who to notify and whether in mailing lists.

  1. Students.
  2. TA/TFs, including NEU Hema and the cloud computing course. Details for courses are in "Find out what courses will be using NERC Fall 2024" #576.

@msdisme

msdisme commented Sep 25, 2024

@jtriley, you mentioned several possible ways to make this not require a system outage in the future. If they're not already written down in another issue, could you please add them here? If they are, a pointer is enough.

@msdisme

msdisme commented Sep 25, 2024

@joachimweyl , is this meant to be in the icebox?

@jtriley

jtriley commented Sep 25, 2024

> @jtriley, you mentioned several possible ways to make this not require a system outage in the future - if they're not already written down in another issue, could you please add them here, and if they are just a pointer?

Couple of ways around this:

  1. Custom machine config pools (e.g. https://access.redhat.com/solutions/5688941). Each node could be added to a custom pool as needed, and changes to that pool would apply only to its hosts rather than to all worker nodes in the system. Custom worker pools inherit from the base worker pool, so they would still get the updates that are meant to be applied cluster-wide. The downside is that over time we might be juggling a bunch of these, which makes for a more complicated setup than the stock two-pool setup that comes out of the box (i.e. one controller pool, one worker pool).

  2. Abandon the udev rule approach (which exists to give us consistent nic1 and nic2 device names across the cluster) and use the devices as they're named by the kernel. The downside is that we'd need to manage custom NNCP configs for each host to handle differences in device names. The upside is that we wouldn't need to reboot all worker nodes in the cluster or manage custom machine config pools when new/different NIC devices show up in the cluster.
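
For illustration only, option 1 might look roughly like the sketch below. It builds a MachineConfigPool manifest whose selector matches both the base `worker` role and a custom role, which is how a custom pool keeps inheriting the cluster-wide worker configs. The pool name `gpu-v100` and the resulting role label are hypothetical placeholders and are not taken from nerc-ocp-config; the actual names would need to be agreed on.

```python
import json


def custom_pool_manifest(pool_name: str) -> dict:
    """Build a MachineConfigPool manifest for a custom pool layered on 'worker'.

    pool_name is a hypothetical placeholder chosen for this sketch.
    """
    return {
        "apiVersion": "machineconfiguration.openshift.io/v1",
        "kind": "MachineConfigPool",
        "metadata": {"name": pool_name},
        "spec": {
            # Match MachineConfigs for the base worker role AND the custom role,
            # so the pool still receives cluster-wide worker changes.
            "machineConfigSelector": {
                "matchExpressions": [
                    {
                        "key": "machineconfiguration.openshift.io/role",
                        "operator": "In",
                        "values": ["worker", pool_name],
                    }
                ]
            },
            # Only nodes carrying this role label become members of the pool.
            "nodeSelector": {
                "matchLabels": {f"node-role.kubernetes.io/{pool_name}": ""}
            },
        },
    }


if __name__ == "__main__":
    # Print the manifest as JSON for review before anything is applied.
    print(json.dumps(custom_pool_manifest("gpu-v100"), indent=2))
```

A node would join such a pool only after it is given the matching role label, so the udev MachineConfig for the new NICs could then target the custom role instead of `worker`.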

@jtriley

jtriley commented Sep 30, 2024

Just noting the list of hosts from OCP-on-NERC/nerc-ocp-config#536:

wrk-10[2,3,6,7,8]
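
As a small convenience, the bracket notation above can be expanded into individual host names with a sketch like the following. It assumes the usual reading of the pattern (a shared prefix plus a comma-separated list of suffixes) and that the short names match the node names in the cluster, which may not hold if nodes are registered by FQDN. The helper `expand_hosts` is hypothetical, and the `oc adm uncordon` lines are only printed for review, not executed.

```python
import re


def expand_hosts(pattern: str) -> list[str]:
    """Expand a 'prefix[a,b,c]suffix' pattern into individual host names."""
    match = re.fullmatch(r"([^\[\]]*)\[([^\[\]]+)\]([^\[\]]*)", pattern)
    if match is None:
        # No bracket group: treat the input as a single host name.
        return [pattern]
    prefix, items, suffix = match.groups()
    return [f"{prefix}{item.strip()}{suffix}" for item in items.split(",")]


if __name__ == "__main__":
    for host in expand_hosts("wrk-10[2,3,6,7,8]"):
        # Printed only; run the commands by hand once the nodes are verified.
        print(f"oc adm uncordon {host}")
```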

@joachimweyl
Contributor Author

@jtriley, with the manual steps you just did, are we able to close this issue, or would you rather leave it open for cleanup?

@jtriley

jtriley commented Oct 9, 2024

@joachimweyl I suppose we could, but we still need to merge OCP-on-NERC/nerc-ocp-config#536 during a maintenance window to be fully complete.

@joachimweyl
Contributor Author

joachimweyl commented Oct 9, 2024

Then I will extend it and push it out to the next sprint. Thank you.
