Upgrade from 4.9.0-0.okd-2022-02-12-140851 to 4.10.0-0.okd-2022-03-07-131213 results in failed networking #1169
Comments
It would be helpful if you are able to create a must-gather.
Hi @fortinj66, of course - I'm on an iPad at the moment, trying to figure out the best way to do it.
must-gather - the password for this is openshift
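For reference, a minimal sketch of how a must-gather bundle is typically generated and packaged for sharing (the destination directory name is illustrative):

```bash
# Collect cluster diagnostics into a local directory
oc adm must-gather --dest-dir=./must-gather

# Package it for upload (how the archive is protected/shared is up to the reporter)
tar czf must-gather.tar.gz ./must-gather
```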
Just a thought:
Fun test - I manually placed the same config into /host/var/run/multus/cni/net.d/80-openshift-network.conf copied from a working node. Multus immediately recognised that the config existed, before the config was wiped. Can you point me towards which container would ordinarily be managing the creation of that config file? Is it the Multus additional plugins container?
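As a sketch of how that comparison can be reproduced (node names are placeholders; the path seen as /host/var/run/... inside the multus pod corresponds to /var/run/... on the host):

```bash
# Read the generated CNI config from a healthy worker
oc debug node/<healthy-node> -- chroot /host cat /var/run/multus/cni/net.d/80-openshift-network.conf

# Check whether the same file exists (or keeps disappearing) on a broken worker
oc debug node/<broken-node> -- chroot /host ls -l /var/run/multus/cni/net.d/
```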
I've been able to establish two things. TL;DR:
Slightly longer explanation:
After a short time running with this configuration, I'm still seeing issues with the freshly provisioned workers, where the nodes are flapping between Ready and NotReady, so this doesn't totally solve the issue - just a step forward.
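A minimal sketch of how the flapping can be observed while it happens (the node name is a placeholder):

```bash
# Watch node conditions change in real time
oc get nodes -w

# Show the condition block for one affected worker
oc describe node <flapping-node> | sed -n '/Conditions:/,/Addresses:/p'

# Recent node-related events, newest last
oc get events --field-selector involvedObject.kind=Node --sort-by=.lastTimestamp
```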
Hi @abaxo, would you be able to grab the worker node logs and an updated must-gather? I'd like to see how the failure mechanism has changed.
Hi @rvanderp3 - the latest must-gather is located at must-gather; the password is openshift.
I tried replacing the template in my environment with the latest stable RHCOS for 4.10 and ran into the issue I had earlier in the process where the SDN didn't come up, but I could also see that the cluster was still pivoting from that template to the defined FCOS machine content. Not really an improvement, but it could be useful. Probably worth noting as well that this cluster has been running for over 2 years - I think I initially deployed it at 4.1 - so it could be representative of a customer journey with a long-running cluster.
@rvanderp3 - I have just applied the user/group config as per #1168 (comment) and it appears to have done the trick, allowing me to restart the SDN service on the host and the SDN pod successfully. Weirdly, I've only needed to do it on nodes that are flapping; it doesn't seem to be all nodes. Remember that these are all fresh nodes, provisioned from a base template of FCOS coreos-35.20220227.3.0-stable. I wonder if there is some config still being pushed out from the upgrade which changes permissions and causes a restart of the SDN.
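The exact user/group change from #1168 isn't reproduced here; as a hedged sketch, this is one way to check what ownership the host-side Open vSwitch directories ended up with and to recycle the SDN pod on a single node (the node name is a placeholder):

```bash
# Check the openvswitch user/group and the ownership of the OVS config/runtime directories
oc debug node/<node-name> -- chroot /host sh -c 'id openvswitch; ls -ln /etc/openvswitch /run/openvswitch'

# Delete the SDN pod on that node so its DaemonSet recreates it after the fix is applied
oc -n openshift-sdn delete pod -l app=sdn --field-selector spec.nodeName=<node-name>
```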
I am leaning further towards this issue being environmental - I have just discovered an issue with garbage collection caused by a broken CRD, which will absolutely not have helped here.
Okay, so having spent a few hours cleaning up and remediating the garbage collection, I've had the cluster in a good enough state to progress with swapping the vSphere CSI driver from the VMware-provided driver to the built-in OpenShift version. As part of that process I did a full cluster stop and start, primarily to clear the disk attachments cache. Doing this reboot has caused the SDN errors to recur, and has sent every worker except the one I applied the permission fix to into dracut emergency mode. VMware shows that the guest OS halted the CPU, so none of those workers are bootable. Oh dear. I think, but am not sure how to troubleshoot it, that I am running into coreos/fedora-coreos-tracker#784 or something similar. I'm not certain it is the same issue because that one looks to have been fixed in an earlier CoreOS version.
I've updated the machine config API configmap. This seems to have solved my SDN and OS instability issues, at least until it's time to try a fixed OS version.
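The comment doesn't show which configmap was edited; assuming it refers to pinning the OS image that the machine-config-operator pivots nodes to, the machine-config-osimageurl configmap is the usual place to look (treat the name and approach as an assumption rather than the author's exact steps):

```bash
# Inspect the osImageURL currently handed to nodes by the machine-config-operator
oc -n openshift-machine-config-operator get configmap machine-config-osimageurl -o yaml

# Edit it to pin a known-good OS image (note: the operator may revert manual edits)
oc -n openshift-machine-config-operator edit configmap machine-config-osimageurl
```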
Do you have SELinux modifications? The AVC messages above could indicate this. We had to make some, and with the newer upgrades we suddenly get failures on the nodes; this is because CoreOS does not handle this gracefully yet. Run:
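As an illustration (not necessarily the exact command meant above), checks along these lines are commonly used to look for local SELinux modifications and recent denials on a node; access via oc debug is assumed:

```bash
# From a debug shell on the node (oc debug node/<node-name>, then chroot /host):
getenforce                  # current SELinux mode
semodule --list-modules     # installed policy modules; local additions stand out here
ausearch -m avc -ts boot    # AVC denials recorded since the last boot
```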
Describe the bug
After an upgrade from 4.9.0-0.okd-2022-02-12-140851 to 4.10.0-0.okd-2022-03-07-131213, existing nodes failed to update and ended up in dracut emergency mode with no ability to recover back to an older version. I am not certain that this event is related to the network issues that I am seeing in the cluster, but I am noting it in case it is relevant. This impacted 6 of 8 worker nodes; the 2 that were not affected were held by the upgrade process.
The upgrade of the cluster itself, as far as the control plane and cluster operators go, was successful, including master node upgrades and reboots, so I provisioned 6 new worker nodes in the same machine set to replace the failed nodes gracefully, on the assumption that if the config is somehow broken on the old nodes, fresh configuration would be applied to fresh nodes. The new nodes came up and show as provisioned in the cluster, and they join the cluster but do not enter a ready state. It appears that the SDN is not coming up. The cluster is configured with OpenShiftSDN. The multus pods have started on the new nodes, including the multus-additional-plugins container, which has successfully copied plugins to /host/opt/cni/bin/. The SDN pod on these nodes does not come up cleanly (container logs below).
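A short sketch of the commands used to inspect the networking pods on one of the new workers (node and pod names are placeholders; "sdn" is the usual container name for OpenShiftSDN):

```bash
# SDN and multus pods scheduled on one of the new workers
oc -n openshift-sdn get pods -o wide --field-selector spec.nodeName=<new-node>
oc -n openshift-multus get pods -o wide --field-selector spec.nodeName=<new-node>

# Logs from the SDN container on that node's SDN pod
oc -n openshift-sdn logs <sdn-pod> -c sdn
```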
Version
4.10.0-0.okd-2022-03-07-131213 - IPI on vSphere
How reproducible
Every new node provisioned by the cloud controller has this issue - I have provisioned 6 nodes so far.
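For completeness, a sketch of how additional workers can be provisioned to reproduce this (the machineset name is environment-specific):

```bash
# List machinesets and scale one up to add a worker
oc -n openshift-machine-api get machinesets
oc -n openshift-machine-api scale machineset <machineset-name> --replicas=<n>

# Watch the new machine and node come up
oc -n openshift-machine-api get machines -w
oc get nodes -w
```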
Log bundle
OpenShift SDN container logs:
Multus container logs:
Node Logs: