Move first boot to new service #215

Merged: RamLavi merged 11 commits into kiagnose:main from move_first_boot_to_new_service on Jan 10, 2024

RamLavi commented Jan 9, 2024

When the vm-under-test and traffic-gen container-disk images are built, the first-boot script passed to the virt-builder command is run by a service that virt-builder creates.

This service is guaranteed to run in the final stages of the boot process [0].
This can race with the checkup, which waits for the agentConnected condition [1] to be added before it runs the checkup's executor package.
The race occurs because the guest-agent service also runs during the final stages of the boot process, so the agent may report as connected before the first-boot script has finished.

This PR resolves the race by moving the content of the first-boot script of both VMs to a new service that is guaranteed to run before the guest-agent service.

Fixes #210

[0] https://www.libguestfs.org/virt-builder.1.html
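[1] https://github.com/kiagnose/kubevirt-dpdk-checkup/blob/b9b6a472fe92583a0db46361289239e0e8d06284/pkg/internal/checkup/checkup.go#L237

if condition.Type == kvcorev1.VirtualMachineInstanceAgentConnected && condition.Status == k8scorev1.ConditionTrue {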
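
The ordering described above can be sketched as a systemd unit written from a shell function. This is a minimal sketch and not the PR's actual unit: the unit name `dpdk-checkup-boot.service`, the function `write_first_boot_unit`, and the paths passed to it are hypothetical, and `qemu-guest-agent.service` is assumed to be the guest-agent unit name in the image.

```shell
#!/usr/bin/env bash
# Sketch only: unit name, script path, and marker path are hypothetical.

write_first_boot_unit() {
  local unit_dir=$1
  local script_path=$2
  local marker_path=$3

  mkdir -p "${unit_dir}"
  cat > "${unit_dir}/dpdk-checkup-boot.service" <<EOF
[Unit]
Description=Checkup guest first-boot configuration
# Ordering: this unit must finish starting before the guest agent starts.
Before=qemu-guest-agent.service
# Skip the unit once the marker file exists.
ConditionPathExists=!${marker_path}

[Service]
Type=oneshot
ExecStart=${script_path}
ExecStartPost=/usr/bin/touch ${marker_path}

[Install]
WantedBy=multi-user.target
EOF
}
```

With `Type=oneshot`, systemd waits for the `ExecStart` command to exit before it starts units ordered after this one, which is the property that closes the race with the guest agent.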

Moving the snippet disabling the services to a function.

Signed-off-by: Ram Lavi <[email protected]>
Moving the snippet setting hugepages to a function.

Signed-off-by: Ram Lavi <[email protected]>
@RamLavi RamLavi force-pushed the move_first_boot_to_new_service branch 2 times, most recently from d274be6 to 3357895 Compare January 10, 2024 10:42

RamLavi commented Jan 10, 2024

Passes e2e on a CNV 4.15 cluster:

make test/e2e
podman run --rm \
           --volume /home/ralavi/go/src/github.com/kiagnose/kubevirt-dpdk-checkup:/home/ralavi/go/src/github.com/kiagnose/kubevirt-dpdk-checkup:Z \
           --volume /home/ralavi/.kube/sno01-cnvqe2-rdu2:/root/.kube:Z,ro \
           --workdir /home/ralavi/go/src/github.com/kiagnose/kubevirt-dpdk-checkup \
           -e KUBECONFIG=/root/.kube/kubeconfig \
           -e TEST_CHECKUP_IMAGE=quay.io/ramlavi/kubevirt-dpdk-checkup:latest \
           -e TEST_NAMESPACE=dpdk-checkup-ns-1 \
           -e NETWORK_ATTACHMENT_DEFINITION_NAME=dpdk-sriovnetwork-ns-1 \
           -e TRAFFIC_GEN_CONTAINER_DISK_IMAGE=quay.io/ramlavi/kubevirt-dpdk-checkup-traffic-gen:latest \
           -e VM_UNDER_TEST_CONTAINER_DISK_IMAGE=quay.io/ramlavi/kubevirt-dpdk-checkup-vm:latest \
           docker.io/library/golang:1.20.12 go test ./tests/... -test.v -test.timeout=1h -ginkgo.v -ginkgo.timeout=1h
=== RUN   TestKubevirtDpdkCheckup
Running Suite: KubevirtDpdkCheckup Suite - /home/ralavi/go/src/github.com/kiagnose/kubevirt-dpdk-checkup/tests
==============================================================================================================
Random Seed: 1704886098

Will run 1 of 1 specs
------------------------------
[BeforeSuite] 
/home/ralavi/go/src/github.com/kiagnose/kubevirt-dpdk-checkup/tests/test_suite_test.go:56
[BeforeSuite] PASSED [0.003 seconds]
------------------------------
Execute the checkup Job should complete successfully
/home/ralavi/go/src/github.com/kiagnose/kubevirt-dpdk-checkup/tests/checkup_test.go:83
• [394.488 seconds]
------------------------------

Ran 1 of 1 Specs in 394.490 seconds
SUCCESS! -- 1 Passed | 0 Failed | 0 Pending | 0 Skipped
--- PASS: TestKubevirtDpdkCheckup (394.49s)
PASS
ok  	github.com/kiagnose/kubevirt-dpdk-checkup/tests	394.507s

The logs show the correct Linux kernel command line for both VMs:

2024/01/10 11:32:45 VMI under test guest kernel Args: cat /proc/cmdline
BOOT_IMAGE=(hd0,gpt2)/vmlinuz-4.18.0-365.el8.x86_64 root=UUID=a9332d7d-1762-41cd-a702-6b2cc556c248 ro console=tty0 rd_NO_PLYMOUTH crashkernel=auto resume=UUID=d4ac9572-e828-4c87-9b76-e59c9fa6e426 console=ttyS0,115200 default_hugepagesz=1GB hugepagesz=1G hugepages=1 skew_tick=1 tsc=reliable rcupdate.rcu_normal_after_boot=1 nohz=on nohz_full=2-7 rcu_nocbs=2-7 tuned.non_isolcpus=00000003 intel_pstate=disable nosoftlockup
[root@vmi-under-test-pf728 cloud-user]# 
2024/01/10 11:32:47 traffic generator guest kernel Args: cat /proc/cmdline
BOOT_IMAGE=(hd0,gpt2)/vmlinuz-4.18.0-365.el8.x86_64 root=UUID=a9332d7d-1762-41cd-a702-6b2cc556c248 ro console=tty0 rd_NO_PLYMOUTH crashkernel=auto resume=UUID=d4ac9572-e828-4c87-9b76-e59c9fa6e426 console=ttyS0,115200 default_hugepagesz=1GB hugepagesz=1G hugepages=1 skew_tick=1 tsc=reliable rcupdate.rcu_normal_after_boot=1 nohz=on nohz_full=2-7 rcu_nocbs=2-7 tuned.non_isolcpus=00000003 intel_pstate=disable nosoftlockup
[root@dpdk-traffic-gen-pf728 cloud-user]#
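
One way to check a logged command line like the ones above is to test it for the arguments that are expected to be present. The following is a rough sketch, not part of the checkup: the function name `missing_kernel_args` is hypothetical, and the list of arguments to check is an assumption taken from the log output.

```shell
#!/usr/bin/env bash
# Sketch only: which arguments to require is an assumption based on the log above.

missing_kernel_args() {
  local cmdline=$1
  shift
  local arg
  local missing=""

  for arg in "$@"; do
    # Pad with spaces so that only whole arguments match.
    case " ${cmdline} " in
      *" ${arg} "*) ;;
      *) missing="${missing}${missing:+ }${arg}" ;;
    esac
  done

  # Prints the missing arguments, or an empty line when all are present.
  echo "${missing}"
}

# Example, run inside the guest:
#   missing_kernel_args "$(cat /proc/cmdline)" hugepages=1 nohz_full=2-7 rcu_nocbs=2-7
```

An empty result means every listed argument was found as a whole word in the command line.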

@orelmisan orelmisan left a comment

Thank you

Comment on lines 32 to 35
TREX_URL=https://trex-tgn.cisco.com/trex/release
TREX_VERSION=v3.03
TREX_ARCHIVE_NAME=${TREX_VERSION}.tar.gz
TREX_DIR=/opt/trex
Member

These variables are considered global, since they are not prefixed with local.
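
The scoping point in this comment can be illustrated with a small sketch. The function names below are hypothetical and the lowercase variable names are chosen only for the example; the values are the ones quoted in the review.

```shell
#!/usr/bin/env bash
# Sketch only: function names are hypothetical.

setup_trex_leaky() {
  # No "local": the assignment creates or overwrites a global variable.
  TREX_VERSION=v3.03
}

setup_trex_scoped() {
  # "local" confines the variables to this function call.
  local trex_version=v3.03
  local trex_archive_name="${trex_version}.tar.gz"
  echo "${trex_archive_name}"
}

setup_trex_leaky
leaky_result="${TREX_VERSION:-unset}"    # the value is visible after the call

scoped_output="$(setup_trex_scoped)"
setup_trex_scoped > /dev/null
scoped_result="${trex_version:-unset}"   # nothing leaked into the caller

echo "leaky: ${leaky_result}, scoped: ${scoped_result}, archive: ${scoped_output}"
```

Declaring the variables with `local` keeps each function in the script from depending on, or clobbering, state set by another.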

Collaborator Author

fixed

vms/vm-under-test/scripts/customize-vm (resolved)
Moving trex setup to a function

Signed-off-by: Ram Lavi <[email protected]>
Moving setting unsafe_no_io_mmu_mode to a function.

Signed-off-by: Ram Lavi <[email protected]>
The first-boot script passed to the virt-builder command is run by a
service that virt-builder creates. This service is guaranteed to run in
the final stages of the boot process [0].
This may create a race with the checkup, which waits for the
agentConnected condition [1] to be added before running the checkup's
executor package. The race occurs because the guest-agent service
also runs during the final stages of the boot process.

To eliminate this race, remove the first-boot script and move its
content into a new service. This service is
- manually created by the customize-vm script.
- explicitly scheduled to run before the guest-agent service runs.
- running the content of the former first-boot script.
- running the tuned-adm command formerly set in the customize-vm script,
  because this command needs to run on a running guest.
Moreover, since this command must run only on the first boot of the
image, the script checks for a marker file and creates the file only
after the first run of the snippet.

[0] https://www.libguestfs.org/virt-builder.1.html
[1]
https://github.com/kiagnose/kubevirt-dpdk-checkup/blob/b9b6a472fe92583a0db46361289239e0e8d06284/pkg/internal/checkup/checkup.go#L237
[2] https://www.freedesktop.org/software/systemd/man/latest/systemd.unit.html#ConditionPathExists=

Signed-off-by: Ram Lavi <[email protected]>
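
The marker-file guard described in this commit message could look roughly like the sketch below. It is an illustration under assumptions, not the code from the PR: the function name `run_once` and the marker path in the usage comment are hypothetical, and the tuned profile name shown there is assumed rather than taken from the source.

```shell
#!/usr/bin/env bash
# Sketch only: function name and marker path are hypothetical.

run_once() {
  local marker=$1
  shift

  # A previous boot already ran the command: nothing to do.
  if [ -e "${marker}" ]; then
    return 0
  fi

  # Run the command and create the marker only if it succeeded,
  # so that a failed first attempt is retried on the next boot.
  "$@" && touch "${marker}"
}

# Example (profile name is an assumption):
#   run_once /var/lib/first-boot.done tuned-adm profile cpu-partitioning
```

Creating the marker only after a successful run is a design choice of this sketch; the source only states that the file is created after the first run of the snippet.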
Moving the snippet disabling the services to a function.

Signed-off-by: Ram Lavi <[email protected]>
Moving the snippet setting hugepages to a function.

Signed-off-by: Ram Lavi <[email protected]>
Moving setting unsafe_no_io_mmu_mode to a function.

Signed-off-by: Ram Lavi <[email protected]>
It makes sense to log the guest's kernel args of both VMIs for debugging
purposes.

Signed-off-by: Ram Lavi <[email protected]>
@RamLavi RamLavi force-pushed the move_first_boot_to_new_service branch from 3357895 to a7fbd81 Compare January 10, 2024 13:28

RamLavi commented Jan 10, 2024

Change: Review fixes


RamLavi commented Jan 10, 2024

Passed e2e on a CNV 4.15 cluster:

make test/e2e
podman run --rm \
           --volume /home/ralavi/go/src/github.com/kiagnose/kubevirt-dpdk-checkup:/home/ralavi/go/src/github.com/kiagnose/kubevirt-dpdk-checkup:Z \
           --volume /home/ralavi/.kube/sno01-cnvqe2-rdu2:/root/.kube:Z,ro \
           --workdir /home/ralavi/go/src/github.com/kiagnose/kubevirt-dpdk-checkup \
           -e KUBECONFIG=/root/.kube/kubeconfig \
           -e TEST_CHECKUP_IMAGE=quay.io/ramlavi/kubevirt-dpdk-checkup:latest \
           -e TEST_NAMESPACE=dpdk-checkup-ns-1 \
           -e NETWORK_ATTACHMENT_DEFINITION_NAME=dpdk-sriovnetwork-ns-1 \
           -e TRAFFIC_GEN_CONTAINER_DISK_IMAGE=quay.io/ramlavi/kubevirt-dpdk-checkup-traffic-gen:latest \
           -e VM_UNDER_TEST_CONTAINER_DISK_IMAGE=quay.io/ramlavi/kubevirt-dpdk-checkup-vm:latest \
           docker.io/library/golang:1.20.12 go test ./tests/... -test.v -test.timeout=1h -ginkgo.v -ginkgo.timeout=1h
=== RUN   TestKubevirtDpdkCheckup
Running Suite: KubevirtDpdkCheckup Suite - /home/ralavi/go/src/github.com/kiagnose/kubevirt-dpdk-checkup/tests
==============================================================================================================
Random Seed: 1704907747

Will run 1 of 1 specs
------------------------------
[BeforeSuite] 
/home/ralavi/go/src/github.com/kiagnose/kubevirt-dpdk-checkup/tests/test_suite_test.go:56
[BeforeSuite] PASSED [0.002 seconds]
------------------------------
Execute the checkup Job should complete successfully
/home/ralavi/go/src/github.com/kiagnose/kubevirt-dpdk-checkup/tests/checkup_test.go:83
• [378.229 seconds]
------------------------------

Ran 1 of 1 Specs in 378.232 seconds
SUCCESS! -- 1 Passed | 0 Failed | 0 Pending | 0 Skipped
--- PASS: TestKubevirtDpdkCheckup (378.23s)
PASS
ok  	github.com/kiagnose/kubevirt-dpdk-checkup/tests	378.254s

The kernel args are correctly updated in the log:

2024/01/10 17:33:22 VMI under test guest kernel Args: cat /proc/cmdline
BOOT_IMAGE=(hd0,gpt2)/vmlinuz-4.18.0-365.el8.x86_64 root=UUID=a9332d7d-1762-41cd-a702-6b2cc556c248 ro console=tty0 rd_NO_PLYMOUTH crashkernel=auto resume=UUID=d4ac9572-e828-4c87-9b76-e59c9fa6e426 console=ttyS0,115200 default_hugepagesz=1GB hugepagesz=1G hugepages=1 skew_tick=1 tsc=reliable rcupdate.rcu_normal_after_boot=1 nohz=on nohz_full=2-7 rcu_nocbs=2-7 tuned.non_isolcpus=00000003 intel_pstate=disable nosoftlockup
[root@vmi-under-test-52qlz cloud-user]# 
2024/01/10 17:33:24 traffic generator guest kernel Args: cat /proc/cmdline
BOOT_IMAGE=(hd0,gpt2)/vmlinuz-4.18.0-365.el8.x86_64 root=UUID=a9332d7d-1762-41cd-a702-6b2cc556c248 ro console=tty0 rd_NO_PLYMOUTH crashkernel=auto resume=UUID=d4ac9572-e828-4c87-9b76-e59c9fa6e426 console=ttyS0,115200 default_hugepagesz=1GB hugepagesz=1G hugepages=1 skew_tick=1 tsc=reliable rcupdate.rcu_normal_after_boot=1 nohz=on nohz_full=2-7 rcu_nocbs=2-7 tuned.non_isolcpus=00000003 intel_pstate=disable nosoftlockup
[root@dpdk-traffic-gen-52qlz cloud-user]# 

@RamLavi RamLavi merged commit 1f60b85 into kiagnose:main Jan 10, 2024
6 checks passed
Successfully merging this pull request may close these issues.

VM container-disk: VMI does not include tuned-adm kernel-args