Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

kmt: fixes for local macOS VMs #1183

Merged
merged 4 commits into from
Oct 14, 2024
Merged

Conversation

gjulianm
Copy link
Contributor

@gjulianm gjulianm commented Oct 11, 2024

What does this PR do?

Two fixes for macOS local VMs:

  • Explicitly set dumpCore to on. This is required by newer QEMU versions but libvirt isn't setting the defaults correctly yet. Will likely be needed in Linux too, in any case it just affects behavior of core dumps performed if the VM crashes.
  • Make network ID depend on the stack name and not on the VMset tag. This might be the cause of the problems seen where the local KMT VMs would lose connection randomly.

Which scenarios this will impact?

Motivation

Additional Notes

Validated in agent pipeline: https://gitlab.ddbuild.io/DataDog/datadog-agent/-/pipelines/46393212

@gjulianm gjulianm self-assigned this Oct 11, 2024
@gjulianm gjulianm marked this pull request as ready for review October 14, 2024 08:47
@gjulianm gjulianm requested a review from a team as a code owner October 14, 2024 08:47
// We have to use QEMU network devices because libvirt does not support the macOS
// network devices.
netID := pulumi.Sprintf("net%s", set.ID)
netID := pulumi.Sprintf("net%s", e.Ctx().Stack())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will this not make VMs on the same set to get the same ID?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only for the network, which I think is what we should do. Previously all VMs had the same network ID (set.ID ended up being always distro_local or something to that effect) and I think that might have been confusing the macOS virtual networking, maybe it didn't properly restart a network that was previously used and was disabled.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure about that. These values are local to a running QEMU instance. -netdev id= specifies an identifier for the network backend, which can be used to connect devices to it. This id should not be visible or have any meaning outside the current QEMU instance.

Atleast this is my understanding. Maybe mac is doing some weird stuff parsing these value...?

In any case, rather than using the pulumi generated stack name, maybe you can add an extra tag (like the KMT stack name) in the vmconfig generated by the python code, so the IDs get dedupped. We derive names and ids everywhere from the vmset itself. It would be better to keep that consistent here as well.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think macOS does some weird stuff, yes. Because I can't really reproduce the problem, I'm still at the stage of trying things a bit blindly.

maybe you can add an extra tag (like the KMT stack name) i

Maybe using the same schema we use for the domain name? like domainName := libvirtResourceName(e.Ctx().Stack(), domain.domainID), then netID := libvirtResourceName(domainName, 'net') ?

Copy link
Contributor

@usamasaqib usamasaqib Oct 14, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think macOS does some weird stuff, yes

Maybe try setting it to a static value to see if it event effects anything.

Maybe using the same schema we use for the domain name?

That would work

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe try setting it to a static value to see if it event effects anything.

The main problem is that I don't have a way to reproduce the bug :D so I can't really know if it's changing things or not

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we make this change if there is no indication that it does anything?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately the only way to test this is to see if the same issue happens again in dev laptops...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The only "clue" I have is a flare from @brycekahle where he had a VM that was turned on and another that was turned off, so a "network conflict" was a possible theory.

@gjulianm
Copy link
Contributor Author

/merge

@dd-devflow
Copy link

dd-devflow bot commented Oct 14, 2024

🚂 MergeQueue: pull request added to the queue

The median merge time in main is 20m.

Use /merge -c to cancel this operation!

@dd-mergequeue dd-mergequeue bot merged commit c3726ba into main Oct 14, 2024
9 checks passed
@dd-mergequeue dd-mergequeue bot deleted the guillermo.julian/macos-vm-fixes branch October 14, 2024 15:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants