Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cloud-init changes behaviour with ubuntu cloud image ova starting from version 20230602 #4188

Closed
lethargosapatheia opened this issue Jun 15, 2023 · 16 comments
Labels
bug Something isn't working correctly incomplete Action required by submitter

Comments

@lethargosapatheia
Copy link

Bug report

Since the Ubuntu 22.04 20230602 (https://cloud-images.ubuntu.com/jammy/20230602/) cloud image version (tested on OVA) the behaviour of cloudinit has changed. Until this version appeared the virtual machine would start, apply the cloudinit settings, but it wouldn't boot fully, that is to say, I wasn't getting a login prompt and, more importantly, the SSH server didn't start. This is the expected behaviour.

Since 20230602 this behaviour has unexpectedly changed. In this intermediary phase, the virtual machine does boot fully, I get a login prompt and the SSH starts.

For me this looks like a bug, because I expect the SSH server not to start before all the cloud-init settings have been applied, meaning, before the virtual machine actually reboots.

Of course, this behaviour completely disturbs automatic provisioning with tools like Packer, which rely on SSH server being available at the right time, when everything is ready. In my case, packer (1.7.9) starts doing the provisioning through ansible (2.14.6) and it's interrupted by the reboot, making it impossible to automate the process.

I can confirm the provisioning works properly with version 20230524 (https://cloud-images.ubuntu.com/jammy/20230524/) of Ubuntu 22.04, where the VM doesn't boot fully and restarts correctly, without starting the SSH server before rebooting.

I am using vSphere 6.7.0.54000 to provision my virtual machines.

Environment details

  • Cloud-init version on both cloudimage versions:
ii  cloud-guest-utils               0.32-22-g45fe84a5-0ubuntu1              all          cloud guest utilities
ii  cloud-init                      23.1.2-0ubuntu0~22.04.1                 all          initialization and customization tool for cloud instances
ii  cloud-initramfs-copymods        0.47ubuntu1                             all          copy initramfs modules into root filesystem for later use
ii  cloud-initramfs-dyn-netconf     0.47ubuntu1                             all          write a network interface file in /run for BOOTIF
  • Operating System Distribution: Ubuntu 22.04.2
  • Cloud provider, platform or installer type: vSphere/OVA

cloud-init logs

I'm uploading the cloudinit logs for both images using cloud-init collect-logs and those from var/log/cloud-init.log.

cloud-init-20230602.tar.gz
var-log-cloud-init-20230602.tar.gz

cloud-init-20230524.tar.gz
var-log-cloud-init-20230524.tar.gz

Please let me know if I need to add any information.

Best regards!

@lethargosapatheia lethargosapatheia added bug Something isn't working correctly new An issue that still needs triage labels Jun 15, 2023
@blackboxsw
Copy link
Collaborator

Thank you @lethargosapatheia for filing a bug and trying to improve cloud-init. Given that cloud-init is the same version in both images it is highly unlikely that cloud-init is the problem, but something outside of cloud-init in images affecting the behavior you are seeing with login prompt changes and reboots.

The initial difference I see in cloud-init behavior between 5/24 and 06/02 is that the first boot in 0524 only runs cloud-init's init-local stage (you can see the boot stages run per boot with cloud-init analyze show which exits with a NOOP with the message in /var/log/cloud-init.log:

2023-06-15 11:47:06,317 - main.py[DEBUG]: [local] Exiting. datasource DataSourceOVF [seed=iso] not in local mode.

On 0602 image that first boot actually runs both init-local and init stages, so an additional cloud-init stage was able to run on 0602 images that wasn't run on 0524.

Something seems to have triggered reboot of the machine later on 0624, or for some reason systemd didn't include cloud-init.service as a boot goal in the earlier 0524 image first boot.

Unfortunately, the journal.txt included by default in cloud-init collect-logs is only the most recent boot (from journalctl -b 0) what I think we'd need to see is journalctl -b 1 output from both images. that'll give us service information run on previous boot to make sure there isn't some sort of systemd ordering cycle that kicked out cloud-init system services from the boot goals.

@blackboxsw blackboxsw added incomplete Action required by submitter and removed new An issue that still needs triage labels Jun 15, 2023
@lethargosapatheia
Copy link
Author

lethargosapatheia commented Jun 15, 2023

Thank you for your fast response. I'm adding here the two logs (these are newly created templates using the same two images, but I'm guessing it shouldn't be a problem, as the behaviour is consistent).
Let me know if this is provides enough information.
journald-b1-20230602.log
journal-b1-20230524.log

When using cloud-init analyze show I see init-local and init-network both running during the first boot in the 06/02 image. So when you say 'init stages', I'm guessing you mean also the other stages, which in the case of 06/02 it seems to be only init-network. Then during the second boot goes through these two stages again and then the rest.
So I'm guessing that the init-network stage is actually the difference.

@lethargosapatheia lethargosapatheia changed the title cloud-init changes behaviour with ubuntu cloudinit image ova starting from version 20230602 cloud-init changes behaviour with ubuntu cloud image ova starting from version 20230602 Jun 15, 2023
@lethargosapatheia
Copy link
Author

lethargosapatheia commented Jun 26, 2023

I've also tried using bootcmd: systemctl stop sshd and runcmd: systemctl start sshd, but that leads to a very high delay which doubles the time of provisioning, so that's not very practical. I wonder if I'm the only one who uses the templates that way (e.g. with packer), it seems so weird that people haven't come across this issue. Packer relies exactly on this trigger, that the ssh is activated, and this way Packer knows that it can connect.

Maybe there's a way to force ssh not to start at first boot.

@lethargosapatheia
Copy link
Author

Ok, I seem to have found the solution. I was able to tell cloud-init to stop the ssh service only once, before rebooting, and then not run the command again, as it did when simply running bootcmd:

bootcmd:
  - [ cloud-init-per, once, systemctl, stop, sshd ]

@holmanb
Copy link
Member

holmanb commented Jun 29, 2023

Ok, I seem to have found the solution.

Glad you found a workaround to your issue. I'm going to close this issue. Please reopen with more details if you see further issue related to cloud-init.

@holmanb holmanb closed this as completed Jun 29, 2023
@lethargosapatheia
Copy link
Author

lethargosapatheia commented Jun 29, 2023

Well, my solution is a completely different story than the fact that cloudinit has changed its behaviour in the ova ubuntu cloudimages. So someone made a really important change, which now seems to be ignored because I have found a solution.
I find it a little bit hard to understand this approach.

@holmanb
Copy link
Member

holmanb commented Jun 29, 2023

cloudinit has changed its behaviour in the ova ubuntu cloudimages

Are you sure? @blackboxsw already pointed out that this is the same version of cloud-init, so I'd be surprised if it was a cloud-init behavior change.

Something changed, agreed. Perhaps a service that cloud-init depends on didn't start for some reason?

@lethargosapatheia
Copy link
Author

lethargosapatheia commented Jun 30, 2023

Well, initially I mentioned this on the #ubuntu-server (not on the cloud-init one) irc channel and @blackboxsw told me that I should file a bug here, which I was a little bit surprised of, also because he himself said that he suspected the cloud-init versions are the same.

I also kind of expected a feedback related to the logs that I added afterwards, but I guess there's nothing to infer from there, as far as I understand, so I'm putting it down to a lack of communication.

That said, where could I then file a bug related to this issue?

This is where I've also posted, but the status now is 'expired' :D
https://answers.launchpad.net/ubuntu/+question/706971

@lethargosapatheia
Copy link
Author

I'm still at a loss as to why this issue is ignored. The change is really important and no one seems to be bothered by this. I've actually underestimated the issue, because this doesn't only affect the packer templates. When cloning them and using cloud-init again, I get the same behaviour where the virtual machine restarts after terraform connects to it.

@lethargosapatheia
Copy link
Author

This seems to be related to vmtools (perl guest os costumization in vSphere) which by default reboots the virtual machine after cloud-init is finished. It's kind of a mess and rather difficult to get rid of.

@vitality411
Copy link

Hello @lethargosapatheia,
thanks to you and this issue I was able to find out the cause for my issues with Ubuntu 22.04 20230602 cloud image and newer. Please see the linked issue in the open-vm-tools repo for details. I am certain that it is also the cause for this issue. As a workaroud I set wait-cloudinit-timeout to 0 in order to archive the previous behaviour in my cloud-init userdata:

bootcmd:
    - echo '[deployPkg]\nwait-cloudinit-timeout=0' >> /etc/vmware-tools/tools.conf

Best regards

@lethargosapatheia
Copy link
Author

lethargosapatheia commented Aug 28, 2023

Hi @vitality411. Thanks for reaching back!
I also wrote about the issue on the vmware forums (https://communities.vmware.com/t5/Virtual-Machine-Guest-OS-and-VM/how-to-turn-off-vm-toools-rebooting-ubuntu-VM-based-on/m-p/2979757#M49207), but there was no reply. Posting on github, like you did, is probably a better idea, indeed.

What I ended up doing was completely removing the symlink /sbin/telinit that the vmware tools use in order to reboot the server. vmware-tools complains that it doesn't find the command andthat's about it. It's a horrid solution to my mind, but it was the necessary at the point.

Your solution seems much more elegant, of course, because at least it hopefully observes the logic of vmware tools.
Can you explain the consequences of your solution? Intuitively I'd expect it to mean vmware-tools won't wait for cloud-init to finish at all and it would restart immediately, but I'm guessing it doesn't do that, right? Does it simply stop it from rebooting the system?

P.S. I also want to reiterate that, in contrast to your issue as you describe it here (vmware/open-vm-tools#684), in my case the cloud-init scripts do run successfully, but that packer (and terraform for that matter) will connect to the virtual machines over ssh before a last reboot, and the provisioning will be interrupted. On the other hand I can't say for certain that my cloud-init scripts exceed 30 seconds, but I'm not ruling out anything.

I use ansible with packes to provision the templates and I use ssh for terraform to retrieve certain pieces of information. In both cases the process is interrupted.

@vitality411
Copy link

The consequences of my workaround is that vmware-tools behaves like before: just executing gosc script without waiting for cloud-init and rebooting. In my case cloud-init is able to only execute the init-local stage. After the reboot cloud-init will run again with all stages and finish the customization.

Even when your case is a little different I am sure that the cause is the same as ssh server will only start after cloud-init is able to finish init-network and the following stages. Before 20230602 it was only executing the init-local stage on the first boot.

I was able to see the difference clearly by running cloud-init analyze show on both images and comparing the output side by side.

@vitality411
Copy link

order to archive the previous behaviour in my cloud-init userdata:

bootcmd:
    - echo '[deployPkg]\nwait-cloudinit-timeout=0' >> /etc/vmware-tools/tools.conf

I have to correct my workaroud. This does not work on an upstream image as open-vm-tools.service starts before cloud-init and so is running with the default timeout (30s). The only working workaround I have is building my own image with wait-cloudinit-timeout=0 already set.

@lethargosapatheia
Copy link
Author

For me changing the cloudimage beforehand is a no-go. That's exactly why I use the cloud-image, so that I have a baseline image that I can change afterwards as I like, without having to edit it beforehand and that's the reason why I switched from ubuntu iso images where I'd go through the whole install process to the cloud-image. I also rely on the updates that cloud-images provide and the image is ready-made.

I know this has become rather off-topic, but I think this is important for people who come across this issue nonetheless.

@vitality411
Copy link

I couldn't agree more. For me changing cloudimage beforehand is also a no-go.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working correctly incomplete Action required by submitter
Projects
None yet
Development

No branches or pull requests

4 participants