Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

wait for cloud-init execution to finish breaks previous behavior #684

Open
vitality411 opened this issue Aug 28, 2023 · 18 comments
Open

wait for cloud-init execution to finish breaks previous behavior #684

vitality411 opened this issue Aug 28, 2023 · 18 comments
Labels

Comments

@vitality411
Copy link

Describe the bug

Hello,
thanks to @lethargosapatheia and issue canonical/cloud-init#4188 in cloud-init repository I was able to find the root cause of the following issue.

Since Ubuntu 22.04 20230602 cloud image version the behavior of cloud-init has unexpectedly changed. Until this version the virtual machine would start, run cloud-init init-local stage, reboot and run the remaining cloud-init stages correctly. In this version cloud-init starts additional stages besides 'init-local' during first boot (see attached cloud-init analyze show output). During these stages it is terminated prematurely by deployPkg. I found out that this is due to cd995a5 which changed deployPkg plugin behavior to wait for cloud-init execution to finish. If the cloud-init execution is not finished during default timeout 30s it is killed.

This behavior disturbs automatic provisioning, which rely on correct application of cloud-init settings. In my case, kubermatic/machine-controller starts doing the provisioning through userdata and it's interrupted by the reboot, making it impossible to automatically provision new Kubernetes nodes.

I can confirm the provisioning works properly with version 20230518 of Ubuntu 22.04, where cloud-init is executed correctly, without being terminated prematurely.

Environment details
Cloud-init versions are identical on both cloudimage versions:

ii  cloud-guest-utils                    0.32-22-g45fe84a5-0ubuntu1              all          cloud guest utilities
ii  cloud-init                           23.1.2-0ubuntu0~22.04.1                 all          initialization and customization tool for cloud instances
ii  cloud-initramfs-copymods             0.47ubuntu1                             all          copy initramfs modules into root filesystem for later use
ii  cloud-initramfs-dyn-netconf          0.47ubuntu1                             all          write a network interface file in /run for BOOTIF

open-vm-tools

Working: VMware Tools Version: 12.1.0.37487 (build-20219665)
Broken: VMware Tools Version: 12.1.5.39265 (build-20735119)

Operating System Distribution: Ubuntu 22.04.2 LTS
Cloud provider, platform or installer type: VMware Cloud Director/OVA

Logs
I'm uploading the relevant logs for both images.
broken.tar.gz
working.tar.gz

Best regards!

Reproduction steps

Working

  1. Deploy VM using Ubuntu 22.04 20230518 cloud image on VMware environment providing userdata which takes longer than 30sec to execute. For example, installing multiple packages and configuring the VM for Kubernetes. I attached an example userdata file.

  2. Verify cloud-init was able to execute only the init-local stage on first boot:

cloud-init analyze show
-- Boot Record 01 --
The total time elapsed since completing an event is printed after the "@" character.
The time the event takes is printed after the "+" character.

Starting stage: init-local
|`->no cache found @00.00300s +00.00000s
|`->found local data from DataSourceOVF @00.01200s +00.00700s
Finished stage: (init-local) 00.25800 seconds

Total Time: 0.25800 seconds

-- Boot Record 02 --
The total time elapsed since completing an event is printed after the "@" character.
The time the event takes is printed after the "+" character.

Starting stage: init-local
|`->cache invalid in datasource: DataSourceOVF [seed=com.vmware.guestInfo] @00.00400s +00.00900s
|`->found local data from DataSourceOVF @00.01400s +00.00700s
Finished stage: (init-local) 00.06100 seconds

Starting stage: init-network
...
  1. Verify /var/log/vmware-imc/toolsDeployPkg.log does not contain the following lines:
[2023-08-25T08:18:48.556Z] [    info] Do not trigger reboot if cloud-init is executing.
[2023-08-25T08:18:48.857Z] [    info] Cloud-init status is 'running'.
...
[2023-08-25T08:19:16.029Z] [    info] Cloud-init status is 'running'.
[2023-08-25T08:19:21.029Z] [    info] Timed out waiting for cloud-init execution done.

Broken

  1. Deploy VM using Ubuntu 22.04 20230602 cloud image on VMware environment providing userdata which takes longer than 30sec to execute. For example, installing multiple packages and configuring the VM for Kubernetes. I attached an example userdata file.

  2. Verify cloud-init tried to execute multiple stages besides init-local stage on first boot:

-- Boot Record 01 --
The total time elapsed since completing an event is printed after the "@" character.
The time the event takes is printed after the "+" character.

Starting stage: init-local
|`->no cache found @00.00500s +00.00000s
|`->found local data from DataSourceOVF @00.12200s +00.00800s
Finished stage: (init-local) 00.42600 seconds

Starting stage: init-network
...
Finished stage: (init-network) 01.55400 seconds

Starting stage: modules-config
...

Total Time: 14.33200 seconds

-- Boot Record 02 --
The total time elapsed since completing an event is printed after the "@" character.
The time the event takes is printed after the "+" character.

Starting stage: init-local
|`->cache invalid in datasource: DataSourceOVF [seed=com.vmware.guestInfo] @00.00400s +00.01000s
|`->found local data from DataSourceOVF @00.01500s +00.00700s
Finished stage: (init-local) 00.06700 seconds

Starting stage: init-network
...
  1. Check /var/log/vmware-imc/toolsDeployPkg.log for the following lines:
[2023-08-25T08:18:48.556Z] [    info] Do not trigger reboot if cloud-init is executing.
[2023-08-25T08:18:48.857Z] [    info] Cloud-init status is 'running'.
...
[2023-08-25T08:19:16.029Z] [    info] Cloud-init status is 'running'.
[2023-08-25T08:19:21.029Z] [    info] Timed out waiting for cloud-init execution done.
[2023-08-25T08:19:21.130Z] [    info] Trigger reboot.
[2023-08-25T08:19:22.049Z] [    info] Rebooting.
[2023-08-25T08:19:23.150Z] [    info] Reboot has been triggered.

  1. Check journalctl for prematurely termination of cloud-init services
Aug 25 08:19:21 broken systemd[1]: cloud-final.service: Main process exited, code=exited, status=1/FAILURE
Aug 25 08:19:21 broken systemd[1]: cloud-final.service: Failed with result 'exit-code'.
Aug 25 08:19:21 broken systemd[1]: Stopped Execute cloud user/final scripts.
-- Boot 1673d9c3f266400881bfd1b73933b494 --
Aug 25 08:19:54 broken systemd[1]: Starting Execute cloud user/final scripts...

Expected behavior

No breaking change

I understand that this change was required to resolve issues where users want to set a vm's networking and apply cloud-init userdata together before the vm is booted. But still I find it bad practice to change the previous default and break previous working configuration for others. Previously I was able to use the cloud image without modification. Now I have to build my own image with wait-cloudinit-timeout=0 just to restore the previous behavior.

Additional context

No response

@jonathanvmw
Copy link
Contributor

@vitality411

Thank you for reporting this issue. An internal bug has been filed.

@PengpengSun
Copy link

PengpengSun commented Aug 29, 2023

@vitality411 , Thanks for the detailed description of your issue and mentioning the issue in canonical/cloud-init#4188 reported by @lethargosapatheia

First of all, there is a behaviour change with cd995a5 since VMware Tools version 12.1.5 when users want to customize a Linux VM(update hostname, network and etc) and at the same time feed userdata to cloud-init to apply in the same boot.

The reason we made this change is users reported the reboot triggered by guest customization (as @lethargosapatheia mentioned: by calling /etc/teliniit 6) interupts cloud-init execution. In your working case, I see reboot happened right after init-local stage, but in some other cases, reboot could happen in init-network or even modules-config/modules-final stage which is applying userdata, so we want to let cloud-init finish its work before reboot vm.

Now I think it's tricky to set default value of wait-cloudinit-timeout, 30 seconds are not enough for some userdata. @vitality411 What if you set wait-cloudinit-timeout=300 to let cloud-init finish execution before the reboot, will it break your workflow? I see in your working case, the total cloud-init execution time is 290 seconds. I know this still can NOT resolve your problem, still need manually build image.

Total Time: 290.29000 seconds

2 boot records analyzed

@lethargosapatheia , For your case, you mentioned userdata can be applied in 30 seconds, but packer (and terraform for that matter) will connect to the virtual machines over ssh before a last reboot, and the provisioning will be interrupted. I understand that ssh is not ready before reboot before this behaviour change, but now ssh is ready before reboot. The workaround provided by @vitality411 shall work for you. In the meanwhile, can you hold provisioning until the reboot happenes? There is always a reboot. Want to understand your workflow more, share your packer configuration here?

We have an internal bug tracking this issue, thanks @jonathanvmw .

Best regards,
Pengpeng

@vitality411
Copy link
Author

@PengpengSun Thanks. When I set wait-cloudinit-timeout=300 cloud-init is able to finish execution before the reboot and the workflow is not terminated prematurely by deployPkg. Still requires a custom image.

@lethargosapatheia
Copy link

@PengpengSun The short answer is "no", I cannot stop the provisioning, there is no way packer or terraform can automatically know when the right moment to connect to the virtual machine over ssh is. They just try until the ssh server is up and running.

So, for instance, the whole process starting from the cloning of the cloudimage with packer is automatic. The only thing I could probably do is maybe delay the ssh connectivity, which, of course, is a bad solution, because that interval will be relative.

There is always a reboot.

I beg to differ and I think the crux of the problem is exactly the fact that users don't have a choice here. Just like cloud-init doesn't require rebooting the machine if you don't tell it to, there is no reason why vmware-tools should do that or better said, there is no reason why the user shouldn't be able to decide whether the virtual machine reboots or not. And the proof for that is exactly my workaround, which is removing the telinit symlink, which in turn makes the whole process much faster without any apparent issues.

vitaly411's workaround (setting wait-cloudinit-timeout to 0) will probably solve the issue (I haven't tested it) in that the vmware-tools will reboot earlier. But as far as I understand, this relies on what is, for all intents and purposes, a hack, where vmware-tools would more or less randomly reboot the virtual machine just before cloud-init reaches the init-network stage. It's not as if vmware-tools explicitly wait for init-local stage to finish and then reboot, right? It just happens that it reboots before the init-network is initiated.

So the real solution is just this: give users the possibility of not rebooting the virtual machine. You don't need to change the default.

@PengpengSun
Copy link

@vitality411 and @lethargosapatheia Thanks for your reply. I understand your workflows a bit more, customizing network and applying cluod-init userdata together in a single boot is a scenario which we want to make it work smoothly.
I see @vitality411 is putting userdata to ovfenv.xml format and set in guestinfo per DataSourceOVF [seed=com.vmware.guestInfo], I'm not sure which data transfort @lethargosapatheia is using. But I have 2 recommendations:

  1. Have you checked https://cloudinit.readthedocs.io/en/latest/reference/datasources/vmware.html#guestinfo-keys, you can put network configuration into guestinfo.metadata and userdata into guestinfo.userdata, in this way, all networking and userdata will be handled by cloud-init, no need customization task involved, no rebooting.
  2. You can set both network configuration and userdata in customization spec, check the solution 3 in this KB 90331, now you are using solution 1. This API doc is telling how to put both network configuration and userdata into customization spec "CloudInitPrep", no rebooting.

And for the not rebooting suggestion, we have few blockers to resolve, ex: how to make sure customized network settings take effect(Not only ubuntu, but all Linux distroes we support), what if a hardware change happenes(adding/removing network adapters) before customize vm network settings.

Back to this behavior change, this is another solution we want to provide, see solution 1 in KB 90331. As I mentioned, rebooting could happen at any stages of cloud-init execution before this change.
Now we have an internal bug to trace this issue, will post here if I have any updates.

@vitality411
Copy link
Author

@PengpengSun I am running on VMware Cloud Director 10.4. According to vmware/cluster-api-provider-cloud-director#506 and the linked Slack thread it is currently to possible to use native cloud-init customization.

@PengpengSun
Copy link

@vitality411 Yes, this is a long standing issue, but we are working on addressing it recently. One workaround descripbed in this KB: https://kb.vmware.com/s/article/71264, in your case, you can add one more runcmd in userdata to add the setting manual_cache_clean: True to the /etc/cloud/cloud.cfg during the first boot, then reboot won't set network to fallback settings.

@vitality411
Copy link
Author

@PengpengSun I would appreciate if you could share the progress on this issue.

@PengpengSun
Copy link

@vitality411 We are working on a change to add a new field into customization specification, when people creating a spec, they can set this new field, the value(number in seconds) of this field will overwrite the value of wait-cloudinit-timeout in vm. For your case, you can set this field to 0 or 300.
How does this change sound to you?

@erSitzt
Copy link

erSitzt commented Feb 29, 2024

Could this be related to my problem ?

canonical/cloud-init#4977

The behaviour sound quite similar to my problem..

@vitality411
Copy link
Author

After looking at your issue, it doesn't seem to be related.
To verify you can try if it works with Ubuntu 22.04 20230518 and no longer works with Ubuntu 22.04 20230602.

Have you taken a look at cloud-init analyze show for both cases?

@erSitzt
Copy link

erSitzt commented Mar 6, 2024

@vitality411 yeah thanks, it looks like a change in cloud-init`s ds-identify causes my problem

@vitality411
Copy link
Author

@PengpengSun Any news on the change you were working on?

@PengpengSun
Copy link

@vitality411 The adding a new field into customization specification solution is not available yet, but I think the issue mentioned vmware/cluster-api-provider-cloud-director#506 has been addressed recently, please check canonical/cloud-init#4997.

For now, I still suggest you try one of below solutions as I mentioned:

  1. https://cloudinit.readthedocs.io/en/latest/reference/datasources/vmware.html#guestinfo-keys, you can put network configuration into guestinfo.metadata and userdata into guestinfo.userdata, in this way, all networking and userdata will be handled by cloud-init, no need customization task involved, no rebooting.
  2. You can set both network configuration and userdata in customization spec, check the solution 3 in this KB 90331, now you are using solution 1. This API doc is telling how to put both network configuration and userdata into customization spec "CloudInitPrep", no rebooting.

@vitality411
Copy link
Author

@PengpengSun Any news on the change you were working on?

For now, I still suggest you try one of below solutions as I mentioned:

  1. https://cloudinit.readthedocs.io/en/latest/reference/datasources/vmware.html#guestinfo-keys, you can put network configuration into guestinfo.metadata and userdata into guestinfo.userdata, in this way, all networking and userdata will be handled by cloud-init, no need customization task involved, no rebooting.
  2. You can set both network configuration and userdata in customization spec, check the solution 3 in this KB 90331, now you are using solution 1. This API doc is telling how to put both network configuration and userdata into customization spec "CloudInitPrep", no rebooting.

These "solutions" are unacceptable. I just want to be able to use the cloudimg without having to build my own template where I only have to set wait-cloudinit-timeout 0 to get it working again. Is that really too much to ask? cloud-init can't be used alone until vmware/cluster-api-provider-cloud-director#506 is implemented.

Not even a few packages can be installed undisturbed:

Aug 06 15:18:07 ubuntu-jammy-2204-cloudimg-20240802-001 cloud-init[1531]: Get:33 http://archive.ubuntu.com/ubuntu jammy-backports/universe amd64 c-n-f Metadata [672 B]
Aug 06 15:18:07 ubuntu-jammy-2204-cloudimg-20240802-001 cloud-init[1531]: Get:34 http://archive.ubuntu.com/ubuntu jammy-backports/multiverse amd64 c-n-f Metadata [116 B]
Aug 06 15:18:16 ubuntu-jammy-2204-cloudimg-20240802-001 cloud-init[1531]: Fetched 27.1 MB in 9s (3161 kB/s)
Aug 06 15:18:17 ubuntu-jammy-2204-cloudimg-20240802-001 systemd[1]: cloud-final.service: Main process exited, code=exited, status=1/FAILURE
Aug 06 15:18:17 ubuntu-jammy-2204-cloudimg-20240802-001 systemd[1]: cloud-final.service: Failed with result 'exit-code'.
Aug 06 15:18:17 ubuntu-jammy-2204-cloudimg-20240802-001 systemd[1]: cloud-final.service: Unit process 1530 (sh) remains running after unit stopped.
Aug 06 15:18:17 ubuntu-jammy-2204-cloudimg-20240802-001 systemd[1]: cloud-final.service: Unit process 1531 (tee) remains running after unit stopped.
Aug 06 15:18:17 ubuntu-jammy-2204-cloudimg-20240802-001 systemd[1]: cloud-final.service: Unit process 1532 (apt-get) remains running after unit stopped.
Aug 06 15:18:17 ubuntu-jammy-2204-cloudimg-20240802-001 systemd[1]: Stopped Execute cloud user/final scripts.
Aug 06 15:18:17 ubuntu-jammy-2204-cloudimg-20240802-001 systemd[1]: cloud-final.service: Consumed 11.321s CPU time.
Aug 06 15:18:19 ubuntu-jammy-2204-cloudimg-20240802-001 cloud-init[1531]: Reading package lists...
-- Boot 5ba9c264c7f045888626576c9be2d18f --
Aug 06 15:18:48 ubuntu-jammy-2204-cloudimg-20240802-001 systemd[1]: Starting Execute cloud user/final scripts...

@PengpengSun
Copy link

@vitality411 Yes, this is a long standing issue, but we are working on addressing it recently. One workaround descripbed in this KB: https://kb.vmware.com/s/article/71264, in your case, you can add one more runcmd in userdata to add the setting manual_cache_clean: True to the /etc/cloud/cloud.cfg during the first boot, then reboot won't set network to fallback settings.

Firstly, I want to share this long standing issue has been resolved by commit canonical/cloud-init@9929a00, cloud-init 24.2 release contains this commit.

These "solutions" are unacceptable. I just want to be able to use the cloudimg without having to build my own template where I only have to set wait-cloudinit-timeout 0 to get it working again.

I totally understand your case, the wait-cloudinit-timeout shall be configurable without image editing. And yes, native cloud-init (CloudInitPrep customization spec) can't be used alone until vmware/cluster-api-provider-cloud-director#506 is implemented.

We are working on a change to add a new field into customization specification, when people creating a spec, they can set this new field, the value(number in seconds) of this field will overwrite the value of wait-cloudinit-timeout in vm. For your case, you can set this field to 0 or 300.

Here is the thing, if I update Linux customization spec, it also requires VMware Cloud Director change which is similar with vmware/cluster-api-provider-cloud-director#506
Step 1: Puts a new configurable wait-cloudinit-timeout into Linux customization spec.
Step 2: VMware Cloud Director need implement to support this new configurable data in Linux customization spec.

So I prefer to a solution which changes open-vm-tools only, when ubuntu cloud image updates bundled open-vm-tools version, people can set this wait-cloudinit-timeout somewhere and it just takes effect, no VMware Cloud Director change needed.

@vitality411
Copy link
Author

@PengpengSun thanks for sharing. When do I think your prefered solution will be implemented?

@PengpengSun
Copy link

@vitality411 I will update here when I have the release info.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

5 participants