🐛 fix: properly restart cloud-init #5116
base: main
@faiq: The label(s) In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
/test ? |
@faiq: The following commands are available to trigger required jobs:
The following commands are available to trigger optional jobs:
Use
In response to this:
/test pull-cluster-api-provider-aws-e2e |
/test pull-cluster-api-provider-aws-e2e |
/retest |
/test pull-cluster-api-provider-aws-e2e-eks |
efc8257
to
48c7db2
Compare
/retest |
/test pull-cluster-api-provider-aws-e2e |
/retest |
It didn't work :(
It definitely rebooted the instance at the expected time, but apparently cloud-init doesn't like it when /var/lib/cloud/instance hangs around.
Let me test a couple potential options and see which one works.
Sep 05 20:41:24 ip-10-80-81-103 cloud-init[587]: +++ [2024-09-05T20:41:24+00:00] appending data to temporary file /etc/secret-userdata.txt.gz
Sep 05 20:41:24 ip-10-80-81-103 cloud-init[587]: +++ [2024-09-05T20:41:24+00:00] getting userdata from AWS Secrets Manager
Sep 05 20:41:24 ip-10-80-81-103 cloud-init[587]: +++ [2024-09-05T20:41:24+00:00] getting secret value from AWS Secrets Manager
Sep 05 20:41:25 ip-10-80-81-103 cloud-init[587]: +++ [2024-09-05T20:41:25+00:00] AWS CLI reported successful execution for SecretsManager::GetSecretValue
Sep 05 20:41:25 ip-10-80-81-103 cloud-init[587]: +++ [2024-09-05T20:41:25+00:00] appending data to temporary file /etc/secret-userdata.txt.gz
Sep 05 20:41:25 ip-10-80-81-103 cloud-init[587]: +++ [2024-09-05T20:41:25+00:00] deleting secret from AWS Secrets Manager
Sep 05 20:41:26 ip-10-80-81-103 cloud-init[587]: +++ [2024-09-05T20:41:26+00:00] AWS CLI reported successful execution for SecretsManager::DeleteSecret
Sep 05 20:41:26 ip-10-80-81-103 cloud-init[587]: +++ [2024-09-05T20:41:26+00:00] deleting secret from AWS Secrets Manager
Sep 05 20:41:27 ip-10-80-81-103 cloud-init[587]: +++ [2024-09-05T20:41:27+00:00] AWS CLI reported successful execution for SecretsManager::DeleteSecret
Sep 05 20:41:27 ip-10-80-81-103 cloud-init[587]: +++ [2024-09-05T20:41:27+00:00] decompressing userdata to /etc/secret-userdata.txt
Sep 05 20:41:27 ip-10-80-81-103 cloud-init[587]: +++ [2024-09-05T20:41:27+00:00] restarting cloud-init
Sep 05 20:41:28 ip-10-80-81-103 cloud-init[587]: Failed to connect to bus: No such file or directory
Sep 05 20:41:31 ip-10-80-81-103 passwd[1309]: password for 'ubuntu' changed by 'root'
Sep 05 20:41:32 ip-10-80-81-103 systemd[1]: cloud-init.service: Main process exited, code=exited, status=120/n/a
Sep 05 20:41:32 ip-10-80-81-103 systemd[1]: cloud-init.service: Failed with result 'exit-code'.
Sep 05 20:41:32 ip-10-80-81-103 systemd[1]: Stopped cloud-init.service - Cloud-init: Network Stage.
Sep 05 20:41:32 ip-10-80-81-103 systemd[1]: cloud-init.service: Consumed 6.430s CPU time.
-- Boot ff2115d67d254a9ca580ea0d70ae67b1 --
Sep 05 20:41:51 ip-10-80-81-103 systemd[1]: Starting cloud-init.service - Cloud-init: Network Stage...
Sep 05 20:41:52 ip-10-80-81-103 cloud-init[702]: Cloud-init v. 24.2-0ubuntu1~24.04.2 running 'init' at Thu, 05 Sep 2024 20:41:52 +0000. Up 9.45 seconds.
Sep 05 20:41:52 ip-10-80-81-103 cloud-init[702]: ci-info: ++++++++++++++++++++++++++++++++++++++Net device info+++++++++++++++++++++++++++++++++++++++
Sep 05 20:41:52 ip-10-80-81-103 cloud-init[702]: ci-info: +--------+------+-----------------------------+---------------+--------+-------------------+
Sep 05 20:41:52 ip-10-80-81-103 cloud-init[702]: ci-info: | Device | Up | Address | Mask | Scope | Hw-Address |
Sep 05 20:41:52 ip-10-80-81-103 cloud-init[702]: ci-info: +--------+------+-----------------------------+---------------+--------+-------------------+
Sep 05 20:41:52 ip-10-80-81-103 cloud-init[702]: ci-info: | ens5 | True | 10.80.81.103 | 255.255.254.0 | global | 0e:74:35:1c:fa:fd |
Sep 05 20:41:52 ip-10-80-81-103 cloud-init[702]: ci-info: | ens5 | True | fe80::c74:35ff:fe1c:fafd/64 | . | link | 0e:74:35:1c:fa:fd |
Sep 05 20:41:52 ip-10-80-81-103 cloud-init[702]: ci-info: | lo | True | 127.0.0.1 | 255.0.0.0 | host | . |
Sep 05 20:41:52 ip-10-80-81-103 cloud-init[702]: ci-info: +--------+------+-----------------------------+---------------+--------+-------------------+
Sep 05 20:41:52 ip-10-80-81-103 cloud-init[702]: ci-info: +++++++++++++++++++++++++++++Route IPv4 info++++++++++++++++++++++++++++++
Sep 05 20:41:52 ip-10-80-81-103 cloud-init[702]: ci-info: +-------+-------------+------------+-----------------+-----------+-------+
Sep 05 20:41:52 ip-10-80-81-103 cloud-init[702]: ci-info: | Route | Destination | Gateway | Genmask | Interface | Flags |
Sep 05 20:41:52 ip-10-80-81-103 cloud-init[702]: ci-info: +-------+-------------+------------+-----------------+-----------+-------+
Sep 05 20:41:52 ip-10-80-81-103 cloud-init[702]: ci-info: | 0 | 0.0.0.0 | 10.80.80.1 | 0.0.0.0 | ens5 | UG |
Sep 05 20:41:52 ip-10-80-81-103 cloud-init[702]: ci-info: | 1 | 10.80.80.0 | 0.0.0.0 | 255.255.254.0 | ens5 | U |
Sep 05 20:41:52 ip-10-80-81-103 cloud-init[702]: ci-info: | 2 | 10.80.80.1 | 0.0.0.0 | 255.255.255.255 | ens5 | UH |
Sep 05 20:41:52 ip-10-80-81-103 cloud-init[702]: ci-info: | 3 | 10.80.80.2 | 0.0.0.0 | 255.255.255.255 | ens5 | UH |
Sep 05 20:41:52 ip-10-80-81-103 cloud-init[702]: ci-info: +-------+-------------+------------+-----------------+-----------+-------+
Sep 05 20:41:52 ip-10-80-81-103 cloud-init[702]: ci-info: +++++++++++++++++++Route IPv6 info+++++++++++++++++++
Sep 05 20:41:52 ip-10-80-81-103 cloud-init[702]: ci-info: +-------+-------------+---------+-----------+-------+
Sep 05 20:41:52 ip-10-80-81-103 cloud-init[702]: ci-info: | Route | Destination | Gateway | Interface | Flags |
Sep 05 20:41:52 ip-10-80-81-103 cloud-init[702]: ci-info: +-------+-------------+---------+-----------+-------+
Sep 05 20:41:52 ip-10-80-81-103 cloud-init[702]: ci-info: | 0 | fe80::/64 | :: | ens5 | U |
Sep 05 20:41:52 ip-10-80-81-103 cloud-init[702]: ci-info: | 1 | local | :: | ens5 | U |
Sep 05 20:41:52 ip-10-80-81-103 cloud-init[702]: ci-info: | 2 | multicast | :: | ens5 | U |
Sep 05 20:41:52 ip-10-80-81-103 cloud-init[702]: ci-info: +-------+-------------+---------+-----------+-------+
Sep 05 20:41:52 ip-10-80-81-103 cloud-init[702]: 2024-09-05 20:41:52,520 - main.py[ERROR]: failed stage init
Sep 05 20:41:52 ip-10-80-81-103 cloud-init[702]: Traceback (most recent call last):
Sep 05 20:41:52 ip-10-80-81-103 cloud-init[702]: File "/usr/lib/python3/dist-packages/cloudinit/cmd/main.py", line 797, in status_wrapper
Sep 05 20:41:52 ip-10-80-81-103 cloud-init[702]: ret = functor(name, args)
Sep 05 20:41:52 ip-10-80-81-103 cloud-init[702]: ^^^^^^^^^^^^^^^^^^^
Sep 05 20:41:52 ip-10-80-81-103 cloud-init[702]: File "/usr/lib/python3/dist-packages/cloudinit/cmd/main.py", line 436, in main_init
Sep 05 20:41:52 ip-10-80-81-103 cloud-init[702]: iid = init.instancify()
Sep 05 20:41:52 ip-10-80-81-103 cloud-init[702]: ^^^^^^^^^^^^^^^^^
Sep 05 20:41:52 ip-10-80-81-103 cloud-init[702]: File "/usr/lib/python3/dist-packages/cloudinit/stages.py", line 541, in instancify
Sep 05 20:41:52 ip-10-80-81-103 cloud-init[702]: return self._reflect_cur_instance()
Sep 05 20:41:52 ip-10-80-81-103 cloud-init[702]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Sep 05 20:41:52 ip-10-80-81-103 cloud-init[702]: File "/usr/lib/python3/dist-packages/cloudinit/stages.py", line 461, in _reflect_cur_instance
Sep 05 20:41:52 ip-10-80-81-103 cloud-init[702]: util.del_file(self.paths.instance_link)
Sep 05 20:41:52 ip-10-80-81-103 cloud-init[702]: File "/usr/lib/python3/dist-packages/cloudinit/util.py", line 2069, in del_file
Sep 05 20:41:52 ip-10-80-81-103 cloud-init[702]: os.unlink(path)
Sep 05 20:41:52 ip-10-80-81-103 cloud-init[702]: IsADirectoryError: [Errno 21] Is a directory: '/var/lib/cloud/instance'
Sep 05 20:41:52 ip-10-80-81-103 cloud-init[702]: failed run of stage init
Sep 05 20:41:52 ip-10-80-81-103 cloud-init[702]: ------------------------------------------------------------
Sep 05 20:41:52 ip-10-80-81-103 cloud-init[702]: Traceback (most recent call last):
Sep 05 20:41:52 ip-10-80-81-103 cloud-init[702]: File "/usr/lib/python3/dist-packages/cloudinit/cmd/main.py", line 797, in status_wrapper
Sep 05 20:41:52 ip-10-80-81-103 cloud-init[702]: ret = functor(name, args)
Sep 05 20:41:52 ip-10-80-81-103 cloud-init[702]: ^^^^^^^^^^^^^^^^^^^
Sep 05 20:41:52 ip-10-80-81-103 cloud-init[702]: File "/usr/lib/python3/dist-packages/cloudinit/cmd/main.py", line 436, in main_init
Sep 05 20:41:52 ip-10-80-81-103 cloud-init[702]: iid = init.instancify()
Sep 05 20:41:52 ip-10-80-81-103 cloud-init[702]: ^^^^^^^^^^^^^^^^^
Sep 05 20:41:52 ip-10-80-81-103 cloud-init[702]: File "/usr/lib/python3/dist-packages/cloudinit/stages.py", line 541, in instancify
Sep 05 20:41:52 ip-10-80-81-103 cloud-init[702]: return self._reflect_cur_instance()
Sep 05 20:41:52 ip-10-80-81-103 cloud-init[702]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Sep 05 20:41:52 ip-10-80-81-103 cloud-init[702]: File "/usr/lib/python3/dist-packages/cloudinit/stages.py", line 461, in _reflect_cur_instance
Sep 05 20:41:52 ip-10-80-81-103 cloud-init[702]: util.del_file(self.paths.instance_link)
Sep 05 20:41:52 ip-10-80-81-103 cloud-init[702]: File "/usr/lib/python3/dist-packages/cloudinit/util.py", line 2069, in del_file
Sep 05 20:41:52 ip-10-80-81-103 cloud-init[702]: os.unlink(path)
Sep 05 20:41:52 ip-10-80-81-103 cloud-init[702]: IsADirectoryError: [Errno 21] Is a directory: '/var/lib/cloud/instance'
Sep 05 20:41:52 ip-10-80-81-103 cloud-init[702]: ------------------------------------------------------------
Sep 05 20:41:52 ip-10-80-81-103 systemd[1]: cloud-init.service: Main process exited, code=exited, status=2/INVALIDARGUMENT
Sep 05 20:41:52 ip-10-80-81-103 systemd[1]: cloud-init.service: Failed with result 'exit-code'.
Sep 05 20:41:52 ip-10-80-81-103 systemd[1]: Failed to start cloud-init.service - Cloud-init: Network Stage.
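The traceback above shows cloud-init calling os.unlink() on its instance link and failing because /var/lib/cloud/instance is a real directory. A minimal shell sketch for checking which case a host is in; the function name classify_instance_path is hypothetical and not part of cloud-init:

```shell
# Report what kind of filesystem object a path is.
# The -L test comes first because -d also succeeds for a symlink
# that points at a directory.
classify_instance_path() {
  if [ -L "$1" ]; then
    echo "symlink"
  elif [ -d "$1" ]; then
    echo "directory"
  elif [ -e "$1" ]; then
    echo "other"
  else
    echo "missing"
  fi
}
```

Running classify_instance_path /var/lib/cloud/instance on an affected machine would be expected to print "directory", matching the IsADirectoryError in the log.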
77b4080
to
ffb8254
Compare
/test pull-cluster-api-provider-aws-e2e |
After trying everything and getting nowhere, I tried just rebooting the machine out of desperation. It worked. So maybe we just do this.
rm -rf /var/lib/cloud/instances
cloud-init clean --reboot
rm -rf /var/lib/cloud/instances
cloud-init clean --reboot
reboot
We tried the following that worked without reboot:
rm -rf /var/lib/cloud/instances
cloud-init clean
systemctl restart cloud-init-local
systemctl restart cloud-init
systemctl restart cloud-config
systemctl restart cloud-final
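The sequence above can be wrapped in a small helper that stops at the first failing step. This is only a sketch: the run and rerun_cloud_init names and the DRY_RUN switch are assumptions added for illustration, not something taken from the PR.

```shell
# Execute a command, or just print it when DRY_RUN=1 is set.
# DRY_RUN is a hypothetical switch added here for safe experimentation.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Clean cloud-init state, then restart its units in boot-stage order,
# aborting as soon as any step fails.
rerun_cloud_init() {
  run rm -rf /var/lib/cloud/instances || return 1
  run cloud-init clean || return 1
  for unit in cloud-init-local cloud-init cloud-config cloud-final; do
    run systemctl restart "$unit" || return 1
  done
}
```

Calling DRY_RUN=1 rerun_cloud_init prints the six commands without touching the machine, which makes it easy to compare against the list in this comment before running it for real as root.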
rm -rf /var/lib/cloud/instances
cloud-init clean --reboot
rm -rf /var/lib/cloud/instances
cloud-init clean --reboot
reboot
im going to try again with some updates i made to failing tests #5118
@SriRamanujam the tests are all passing - mind building and trying again locally?
We also hit this issue and found that a reboot is not needed after cloud-init clean. We just needed to restart all the cloud-init services in order.
rm -rf /var/lib/cloud/instances
cloud-init clean --reboot
rm -rf /var/lib/cloud/instances
cloud-init clean
systemctl restart cloud-init-local
systemctl restart cloud-init
systemctl restart cloud-config
systemctl restart cloud-final
/retest |
ffb8254
to
66cc7fa
Compare
/retest |
/test pull-cluster-api-provider-aws-e2e |
/retest |
The e2e tests seem to pass. Does anyone have suggestions on what other tests I should run?
/test ? |
@richardcase: The following commands are available to trigger required jobs:
The following commands are available to trigger optional jobs:
Use
In response to this:
Let's also run the eks e2e just in case:
/test pull-cluster-api-provider-aws-e2e-eks
Until the eks e2e passes:
/hold
I think this looks good to me. @faiq would you be able to add a note on any manual testing you have done with this?
Both the non-eks and eks e2e tests are passing with this change. |
/cherrypick release-2.6 |
@richardcase: once the present PR merges, I will cherry-pick it on top of release-2.6 in a new PR and assign it to you. In response to this:
/approve |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: richardcase The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
I ran another test, using each of the three variants discussed above. tl;dr As before, rebooting is the only thing that works for me.
Test notes
Kubernetes: 1.29.6
Procedure: build and deploy PR branch to k8s cluster, then attempt to stand up a CAPI cluster, wait for the first control plane machine to come up,
Expectation: the control plane comes up and the node goes ready with no intervention.
@SriRamanujam what AMI are you using to test? Is it public? @phoban01 and I have tested these two AMIs with the 3 listed methods and all work:
These two are the new CAPA AMIs located in
We did receive a warning to run
@Nalum - I'm using an internally built AMI that's based on Ubuntu AMI
@richardcase @faiq - If others are seeing success and the CI is passing, please don't block this on me. I think there are enough confounding variables in my specific case that there's probably something else going on.
I think we're safe to merge this. Thank you @SriRamanujam and @Nalum for verifying the solution! |
I must be doing something wrong (or perhaps there is an issue with the AMIs I'm using) but none of the proposed solutions are working for me 😢
Hello @faiq @SriRamanujam @Nalum @zarcen @richardcase / all
Cloud-init developer here, I'd like to try to help if possible. I don't think that restarting cloud-init or rebooting the instance is desirable, nor should it be necessary. See my reasoning below.
How things currently work
If I understand correctly, the goal of the bootscript is: Get user-data from the secret store, then make cloud-init run it. As implemented, this script:
A little bit about cloud-init
Restarting cloud-init before it is completed may have unexpected consequences. Similarly, restarting the instance may break things in unexpected ways (and is obviously slower than a non-reboot solution).
Cloud-init has a code concept called "datasources". These classes define where user-data comes from. This is what makes it possible to run the same cloud-init package on EC2 and on other clouds. From my perspective, a custom datasource would be the preferred solution. The current boothook script attempts to do the same thing that a cloud-init datasource does, but in a way that assumes things about cloud-init that it probably shouldn't.
I wrote a hackish proof of concept of a custom datasource which appears to work from my limited testing. I wouldn't propose it as-is, but perhaps we can use it as a straw-man to find a more robust and maintainable path forward. The commit message explains how it works and how to install it. Please let me know if you have any questions.
Some questions
How is the AWS CLI installed?
Wow! that sounds amazing. Restarting cloud-init is definitely hack-y and im glad you came up with something that doesn't require it
AWS CLI is installed through image builder and we get the images built via that. code linked here: https://github.com/kubernetes-sigs/image-builder/blob/2f188e738f961730645269fe942cfcbb0925db7a/images/capi/ansible/roles/providers/tasks/awscliv2.yml#L76 |
@holmanb This is super interesting, thanks for writing that up! @dlipovetsky this may be relevant to our interests re: getting rid of cloud-init hackiness entirely |
That's great, thanks for the info and example @holmanb 👍 |
Fantastic, thank you @holmanb ❤️ Having the sample is excellent, i will give it a go today. @chrischdi - you may like this as well....especially based on your suggestion yesterday after the CAPI meeting. |
@holmanb - also to add further to @faiq response to the questions
|
@faiq @SriRamanujam - i'm going to build a custom AMI with the ds and config in for testing this morning. I will post the ami id if you want to try it. |
Thanks for the feedback @richardcase @Nalum @SriRamanujam @faiq! Based on @faiq and @richardcase's responses, it sounds like including a drop-in cloud-init datasource should work. @richardcase Thanks for testing! I don't know what exactly the image build process includes (I didn't read all of the ansible bits linked), but I just want to note that for cloud-init to consider it a "first boot", you'll need to run |
Thanks @holmanb. We created a test AMI with these changes and have been testing them with CAPA from this PR's branch. The image-builder changes so far are on this WIP PR: kubernetes-sigs/image-builder#1583.
We're running into an error when the boothook script runs in the local stage, as the network and AWS creds are not set up yet and so the AWS CLI calls fail. However, with this CAPA branch there is still a reboot, so when the machine comes up the second time the boothook script runs (as the network and creds are set up) and we get k8s coming up. Not ideal but it's working, a lot further than any previous attempt 😄
Tomorrow we'll look at changing the boothook script logic (and maybe the "local" datasource) to handle this situation better and remove the reboot.
I'll start staying logged into IRC... I shut it down last night. Maybe time to look at quassel or something similar.
Thanks again @holmanb 🙇♂️
Right, boothooks normally run in network stage. I saw that failure during testing, but I didn't look too hard at the warning since it succeeded when it tried again during network stage. It should be trivial to disable trying during local stage. I commented on your PR with a suggestion.
I saw your comment but not until after you had left. We do have channel logs but typically don't bother replying if the person asking has already left. FWIW I run quassel-core on a cheap cloud instance and I'm happy with it. I can connect using quassel-client from any computer (the android app isn't bad either) without losing history.
Happy to help! |
What type of PR is this?
/kind fix
What this PR does / why we need it:
It seems that systemctl restart cloud-init is no longer sufficient to start the cloud-init process again with the secret userdata. After some googling I found the following steps to restart it without needing to reboot the machine. The steps come from here: https://cloudinit.readthedocs.io/en/latest/howto/rerun_cloud_init.html
We should move towards an approach that doesn't require a restart of cloud-init to handle secret userdata.
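For comparison, a manual re-run can also be driven through the cloud-init CLI stage by stage instead of through systemd units. The sketch below only emits the command list; the function name cloud_init_rerun_commands is hypothetical, and it assumes the installed cloud-init version still exposes the per-stage subcommands shown. It is meant for an interactive shell on a booted machine, not for use inside the boothook itself.

```shell
# Print, one per line, the commands for a manual stage-by-stage re-run.
# Nothing is executed here; the caller decides how to run the list.
cloud_init_rerun_commands() {
  cat <<'EOF'
cloud-init clean --logs
cloud-init init --local
cloud-init init
cloud-init modules --mode=config
cloud-init modules --mode=final
EOF
}
```

The list can be reviewed first and then executed with something like cloud_init_rerun_commands | sudo sh -ex, which stops at the first failing stage.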
Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #5115
Special notes for your reviewer:
Checklist:
Release note: