wait for cloud-init execution to finish breaks previous behavior #684
Comments
Thank you for reporting this issue. An internal bug has been filed. |
@vitality411, thanks for the detailed description of your issue and for mentioning canonical/cloud-init#4188, reported by @lethargosapatheia. First of all, there is a behaviour change with cd995a5 since VMware Tools version 12.1.5 when users want to customize a Linux VM (update hostname, network, etc.) and at the same time feed userdata to cloud-init to apply in the same boot. We made this change because users reported that the reboot triggered by guest customization (as @lethargosapatheia mentioned: by calling /etc/telinit 6) interrupts cloud-init execution. In your working case, I see the reboot happened right after the init-local stage, but in other cases the reboot could happen in the init-network stage or even the modules-config/modules-final stage, which applies userdata, so we want to let cloud-init finish its work before rebooting the VM. Now I think it's tricky to set default value of
@lethargosapatheia, for your case, you mentioned userdata can be applied in 30 seconds, but packer (and terraform for that matter) will connect to the virtual machines over ssh before a last reboot, and the provisioning will be interrupted. I understand that before this behaviour change ssh was not ready before the reboot, but now it is. The workaround provided by @vitality411 should work for you. In the meantime, can you hold provisioning until the reboot happens? There is always a reboot. I would like to understand your workflow better; could you share your packer configuration here? We have an internal bug tracking this issue, thanks @jonathanvmw . Best regards, |
@PengpengSun Thanks. When I set |
@PengpengSun The short answer is "no", I cannot stop the provisioning, there is no way packer or terraform can automatically know when the right moment to connect to the virtual machine over ssh is. They just try until the ssh server is up and running. So, for instance, the whole process starting from the cloning of the cloudimage with packer is automatic. The only thing I could probably do is maybe delay the ssh connectivity, which, of course, is a bad solution, because that interval will be relative.
I beg to differ and I think the crux of the problem is exactly the fact that users don't have a choice here. Just like cloud-init doesn't require rebooting the machine if you don't tell it to, there is no reason why vmware-tools should do that or better said, there is no reason why the user shouldn't be able to decide whether the virtual machine reboots or not. And the proof for that is exactly my workaround, which is removing the telinit symlink, which in turn makes the whole process much faster without any apparent issues. vitaly411's workaround (setting So the real solution is just this: give users the possibility of not rebooting the virtual machine. You don't need to change the default. |
@vitality411 and @lethargosapatheia Thanks for your reply. I understand your workflows a bit better now; customizing the network and applying cloud-init userdata together in a single boot is a scenario we want to make work smoothly.
As for the suggestion not to reboot, we have a few blockers to resolve, e.g. how to make sure customized network settings take effect (not only on Ubuntu, but on all Linux distros we support), and what happens if the hardware changes (network adapters added or removed) before the VM's network settings are customized. Back to this behavior change: it is another solution we want to provide, see solution 1 in KB 90331. As I mentioned, before this change the reboot could happen at any stage of cloud-init execution. |
@PengpengSun I am running on VMware Cloud Director 10.4. According to vmware/cluster-api-provider-cloud-director#506 and the linked Slack thread it is currently not possible to use native cloud-init customization. |
@vitality411 Yes, this is a long-standing issue, and we have recently started working on addressing it. One workaround is described in this KB: https://kb.vmware.com/s/article/71264. In your case, you can add one more runcmd in userdata to add the setting |
@PengpengSun I would appreciate if you could share the progress on this issue. |
@vitality411 We are working on a change to add a new field to the customization specification. When people create a spec, they can set this new field; its value (a number in seconds) will overwrite the value of |
Could this be related to my problem? The behaviour sounds quite similar to my problem. |
After looking at your issue, it doesn't seem to be related. Have you taken a look at |
@vitality411 yeah thanks, it looks like a change in cloud-init's ds-identify causes my problem |
@PengpengSun Any news on the change you were working on? |
@vitality411 The solution of adding a new field to the customization specification is not available yet, but I think the issue mentioned in vmware/cluster-api-provider-cloud-director#506 has been addressed recently; please check canonical/cloud-init#4997. For now, I still suggest you try one of the solutions below, as I mentioned:
|
@PengpengSun Any news on the change you were working on?
These "solutions" are unacceptable. I just want to be able to use the cloudimg without having to build my own template where I only have to set Not even a few packages can be installed undisturbed:
|
Firstly, I want to share this long standing issue has been resolved by commit canonical/cloud-init@9929a00, cloud-init 24.2 release contains this commit.
I totally understand your case, the
Here is the thing: if I update the Linux customization spec, it also requires a VMware Cloud Director change, similar to vmware/cluster-api-provider-cloud-director#506. So I prefer a solution which changes open-vm-tools only; when the ubuntu cloud image updates its bundled open-vm-tools version, people can set this |
@PengpengSun thanks for sharing. When do you think your preferred solution will be implemented? |
@vitality411 I will update here when I have the release info. |
Describe the bug
Hello,
thanks to @lethargosapatheia and issue canonical/cloud-init#4188 in cloud-init repository I was able to find the root cause of the following issue.
Since the Ubuntu 22.04 20230602 cloud image version the behavior of cloud-init has unexpectedly changed. Until this version the virtual machine would start, run the cloud-init init-local stage, reboot and run the remaining cloud-init stages correctly. In this version cloud-init starts additional stages besides 'init-local' during first boot (see attached cloud-init analyze show output). During these stages it is terminated prematurely by deployPkg. I found out that this is due to cd995a5, which changed the deployPkg plugin behavior to wait for cloud-init execution to finish. If the cloud-init execution is not finished within the default timeout of 30s, it is killed.
This behavior disturbs automatic provisioning, which relies on correct application of cloud-init settings. In my case, kubermatic/machine-controller starts doing the provisioning through userdata and it's interrupted by the reboot, making it impossible to automatically provision new Kubernetes nodes.
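To make the described behavior concrete, here is a minimal Python model of a "wait for cloud-init, but only up to a timeout" loop. It is only a sketch of the logic as described in this issue, not the actual deployPkg code (which is not shown here); the function name `wait_for_cloud_init`, the `is_done` callback and the polling interval are all hypothetical.

```python
import time


def wait_for_cloud_init(is_done, timeout_s=30.0, poll_s=1.0,
                        clock=time.monotonic, sleep=time.sleep):
    """Poll is_done() until it returns True or timeout_s elapses.

    Hypothetical model of the behavior described in this issue:
    returns True if cloud-init finished within the timeout, False if
    the caller would go on (and reboot) while cloud-init is still running.
    A timeout of 0 means no waiting at all.
    """
    deadline = clock() + timeout_s
    while clock() < deadline:
        if is_done():
            return True
        sleep(poll_s)
    # Timeout reached: one last check, then the caller proceeds regardless.
    return is_done()
```

With the 30s default, userdata that needs 45 seconds makes this return False, which corresponds to the "broken" case reported here; userdata that finishes in 5 seconds returns True.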
I can confirm the provisioning works properly with version 20230518 of Ubuntu 22.04, where cloud-init is executed correctly, without being terminated prematurely.
Environment details
Cloud-init versions are identical on both cloudimage versions:
open-vm-tools
Operating System Distribution: Ubuntu 22.04.2 LTS
Cloud provider, platform or installer type: VMware Cloud Director/OVA
Logs
I'm uploading the relevant logs for both images.
broken.tar.gz
working.tar.gz
Best regards!
Reproduction steps
Working
Deploy VM using Ubuntu 22.04 20230518 cloud image on VMware environment providing userdata which takes longer than 30sec to execute. For example, installing multiple packages and configuring the VM for Kubernetes. I attached an example userdata file.
Verify cloud-init was able to execute only the init-local stage on first boot:
Broken
Deploy VM using Ubuntu 22.04 20230602 cloud image on VMware environment providing userdata which takes longer than 30sec to execute. For example, installing multiple packages and configuring the VM for Kubernetes. I attached an example userdata file.
Verify cloud-init tried to execute multiple stages besides init-local stage on first boot:
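For the verification steps above, a small Python helper can group the stages that cloud-init started by boot record. This is a sketch that assumes the `cloud-init analyze show` output uses `-- Boot Record NN --` and `Starting stage: <name>` lines; the helper name `stages_per_boot` and the sample text are illustrative and are not taken from the attached logs.

```python
import re


def stages_per_boot(analyze_output):
    """Map boot record number -> list of stages started in that boot.

    Assumes the text format of `cloud-init analyze show`
    ("-- Boot Record NN --" and "Starting stage: <name>" lines).
    """
    boots = {}
    current = None
    for raw in analyze_output.splitlines():
        line = raw.strip()
        match = re.match(r"^-- Boot Record (\d+) --$", line)
        if match:
            current = int(match.group(1))
            boots[current] = []
            continue
        if current is not None and line.startswith("Starting stage:"):
            boots[current].append(line.split(":", 1)[1].strip())
    return boots


# Illustrative sample only (not real log data from this issue).
SAMPLE = """\
-- Boot Record 01 --
Starting stage: init-local
Finished stage: (init-local) 00.50000 seconds

-- Boot Record 02 --
Starting stage: init-network
Starting stage: modules-config
Starting stage: modules-final
"""
```

In the working case, the first boot record would list only `init-local`; in the broken case it would also list later stages such as `init-network`.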
Expected behavior
No breaking change
I understand that this change was required to resolve issues where users want to set a vm's networking and apply cloud-init userdata together before the vm is booted. But still I find it bad practice to change the previous default and break previous working configuration for others. Previously I was able to use the cloud image without modification. Now I have to build my own image with
wait-cloudinit-timeout=0
just to restore the previous behavior.
Additional context
No response
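As a sketch of the `wait-cloudinit-timeout=0` workaround mentioned under "Expected behavior", the Python helper below writes that key into a tools.conf-style file. It assumes the key belongs in a `[deployPkg]` section of the VMware Tools configuration file (typically /etc/vmware-tools/tools.conf); the helper name is hypothetical, and note that `configparser` drops any comments already present in the file when it rewrites it.

```python
import configparser
from pathlib import Path


def set_wait_cloudinit_timeout(conf_path, seconds=0):
    """Set wait-cloudinit-timeout in the [deployPkg] section of conf_path.

    Assumption: the option lives under [deployPkg]. Existing comments in
    the file are not preserved by configparser.
    """
    if seconds < 0:
        raise ValueError("timeout must not be negative")
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str  # keep option names exactly as written
    path = Path(conf_path)
    if path.exists():
        parser.read(path)
    if not parser.has_section("deployPkg"):
        parser.add_section("deployPkg")
    parser.set("deployPkg", "wait-cloudinit-timeout", str(seconds))
    with path.open("w") as handle:
        parser.write(handle, space_around_delimiters=False)
    return path.read_text()
```

Calling `set_wait_cloudinit_timeout("/etc/vmware-tools/tools.conf", 0)` while building the image would be one way to bake the setting into a custom template, which is exactly the extra step this issue asks to avoid.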