WIP: Hypervisor per ESXi #233

Closed
wants to merge 7,522 commits into from

Conversation

fwiesel
Member

@fwiesel fwiesel commented Aug 26, 2021

Currently, the code only collects stats for the different nodes... Now I have to look into how to represent that to placement.
After that, the question is how to migrate the VMs between the hypervisor nodes.

imitevm and others added 14 commits August 8, 2022 15:46
While nova already cleaned up its original image-cache, our patches
also added images as templates into the mix. These were not cleaned up
prior to this patch and thus filled up the datastores.

Change-Id: I2fc631d6ce0a9339be9237d63ae7e86d94779dcc
(cherry picked from commit 97780bb)
(cherry picked from commit ce5f026)
If we sort the tasks by creation time, we can stop looping over them
once we find an event that's too old. All other VMs we didn't find an
event for yet must be expired anyway.

Change-Id: I196bb9ab48867f314d9a5f3b7566384fa72778df
(cherry picked from commit 642bcc7)
(cherry picked from commit 2190382)
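A minimal sketch of the early-exit pattern described above (the helper name and the (vm_name, created_at) tuple shape are made up, not the driver's actual data structures):

    from datetime import datetime, timedelta

    def find_recently_used_vms(events, max_age=timedelta(days=7), now=None):
        """events: iterable of (vm_name, created_at) tuples."""
        now = now or datetime.utcnow()
        recently_used = set()
        for vm_name, created_at in sorted(events, key=lambda e: e[1], reverse=True):
            if now - created_at > max_age:
                break  # sorted newest-first: everything after this is older still
            recently_used.add(vm_name)
        return recently_used
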
Since we want to keep images in the owner's folder again instead of
uploading them to all projects using them, we also need to make sure we
clean up everywhere - even in projects that have no instances.

Change-Id: I9344a9bb8c0436cb5c87f3328fa3b9d31dedbdbc
(cherry picked from commit 943babe)
(cherry picked from commit b1af943)
We need to support Python3 in the future and this is a low-hanging
fruit.

Change-Id: I86804af29f4c6aff14fa5495e5344377488ba8fe
(cherry picked from commit 95c177f)
(cherry picked from commit bce80c6)
We want to make sure we only clean up image-cache templates and no other
VMs that might happen to lie in the "Images" folder.

Change-Id: I55b7fa7ebbe14b13f579ebc39dcae2549ddedc9a
(cherry picked from commit 67fda8f)
(cherry picked from commit d18fb11)
Since every nova-compute node will be running the image-cache cleanup,
we have to make sure this works in parallel. Therefore, we limit the
cleanup of image-cache templates to the datastores the nova-compute node
is responsible for - the ones configured in the datastore_regex config
option.

Change-Id: I5fae822a08bcc06565c64f959553cf7082bb2423
(cherry picked from commit 0d8d760)
(cherry picked from commit 5f21c7d)
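For illustration, the existing `datastore_regex` option in the [vmware] section of nova.conf scopes the cleanup to the node's own datastores (the regex value here is made up):

    [vmware]
    # only clean up image-cache templates on datastores this node manages
    datastore_regex = ^eph-bb091.*
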
Even with image-as-template disabled, we have to run the image-template
cleanup to ensure removal after the setting is switched off, and because
we switched all our image-uploads to using those image-templates.

Change-Id: I924fbc42f014ecf4c0342246d48f27b6ba1d5c77
(cherry picked from commit 44b8620)
(cherry picked from commit 7331446)
Otherwise, we might copy the same image to the datastore multiple times,
just because VMs from multiple projects are deployed from that (shared)
image.

Change-Id: I02721da655ec505d7c3c5d3c9faca1be77dce813
(cherry picked from commit 95b6176)
(cherry picked from commit aea1268)
Every image we upload automatically ends up as a template-VM. We can
leverage this in normal deployments and copy the VMDK from the
image-template, if it still exists. If it doesn't exist anymore, we can
try to copy an image-template from another datastore and use its
VMDK instead - that should still be faster than copying stuff from
glance again.

If all fails, we turn back to the original approach of loading the image
from glance.

Change-Id: I659d22e494a86fe4e567306062784313432d11ee
(cherry picked from commit cfa3eb8)
(cherry picked from commit a0d9db9)
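A rough sketch (not the driver's actual code) of the fallback order described above - try each source in turn and fall through on failure:

    def first_working_source(sources, image_id):
        # sources: ordered callables, e.g. [copy_from_local_template,
        #          copy_from_other_datastore_template, download_from_glance]
        # (the callable names are hypothetical)
        for source in sources:
            try:
                return source(image_id)
            except Exception:
                continue  # template missing/broken: try the next source
        raise LookupError('no usable source for image %s' % image_id)
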
Until now, the logs during a clean-up showed "Destroyed VM" messages,
but one could not find out which VM was destroyed. Therefore, we add an
additional log message stating which image-template VM we're going to
destroy.

Change-Id: I7429fca0175ec4593689466be6dcc0cb2482cb9f
(cherry picked from commit 557e34e)
(cherry picked from commit 2d6f122)
There are cases where we don't want to blindly try to delete a VM, but
instead want to know the outcome, at least for certain types of errors.
To support this, vm_util.destroy_vm() has to re-raise certain types
of exceptions. In the case we're looking for, that's the
"InvalidArgument" fault. For this to work like before, code calling
destroy_vm() needs to catch these exceptions, too.

Change-Id: I28d2f1b94b8439cfea88146746ae6e59d61f087c
(cherry picked from commit e112c5e)
(cherry picked from commit 73a9a27)
While we don't know why yet, we've seen image-cache templates missing
their directory on the datastore.

For the image-template cleanup this means that Nova cannot destroy
those templates, as this raises an InvalidArgument fault on VMware's
side, telling us that ConfigSpec.files.vmPathName is invalid. Since we
need to clean those templates up anyway to create a new, usable one, we
catch those errors and call UnregisterVM as a fallback.

For getting templates from other datastores or re-using the already
existing template to copy its disk, we have to catch
FileNotFoundException. If this situation occurs, we have to clean that
broken template up and also let nova continue the search for a source of
the disk for the new VM.

Change-Id: Id6d1cc1cd7a50958c77a1417e3f2aed7b9672e15
(cherry picked from commit 1b342cd)
(cherry picked from commit 13940c5)
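A hedged sketch of that fallback, with plain stand-ins for the vSphere fault and API calls rather than the actual nova helpers:

    class InvalidArgument(Exception):
        """Stand-in for the vSphere InvalidArgument fault."""

    def cleanup_image_template(destroy_vm, unregister_vm, template_ref):
        # destroy_vm/unregister_vm are injected callables standing in for
        # the Destroy_Task and UnregisterVM API calls.
        try:
            destroy_vm(template_ref)
        except InvalidArgument:
            # the template lost its files on the datastore, so destroying it
            # fails; dropping it from the inventory is enough
            unregister_vm(template_ref)
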
If the DRS policy is not set to "fullyAutomated", i.e. if DRS will not
take action on existing, running VMs, we have no chance to free up a
host and thus don't have to try to find one. Instead, we error out to
tell nova-bigvm to search for another cluster.

Change-Id: Idbdfe82b4057844401e710fb9d87141478bb3353
(cherry picked from commit e98d6ae)
(cherry picked from commit 0e8f17e)
Support for the E1000 driver is being phased out in guest operating
systems, so we switch to E1000e, which is also faster and has more
hardware offloading capabilities.

Change-Id: I08ac32f914a57d3eb7328351a07a20a2ef212cb8
(cherry picked from commit 5e7556d)

fix unit tests for vmware: Switch default VIF model

(cherry picked from commit 425070f)
(cherry picked from commit a3b173a)
grandchild and others added 25 commits January 26, 2023 12:01
Original code can be found at: https://opendev.org/x/nova-mksproxy

Integrate the code into nova as a builtin command.

Subclass `base.SecurityProxy` into `MksSecurityProxy`. Because MKS'
`connect_info.internal_access_path` contains more than just a `path`
(namely a JSON object with "ticket", "cfgFile" & "thumbprint"), add a
new `parse_internal_access_path()` method in
`NovaProxyRequestHandler`. This method tries to parse
`internal_access_path` as JSON and, if that fails, puts the contents in
the "path" key of a new `internal_access_path_data` dict.

Co-authored-by: Johannes Kulik <[email protected]>
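A sketch of the parsing behaviour described above (not the actual method body):

    import json

    def parse_internal_access_path(internal_access_path):
        try:
            data = json.loads(internal_access_path)
            if isinstance(data, dict):
                # e.g. {"ticket": ..., "cfgFile": ..., "thumbprint": ...}
                return data
        except (TypeError, ValueError):
            pass
        # plain path (the non-MKS case): wrap it under a "path" key
        return {'path': internal_access_path}
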
Has to be "nova-{NAME}proxy".
The old script had a --verbose flag that we are using in the k8s
deployment definition for all regions. This will include both Xena
and older Rocky deployments. The new name of this flag is
--mks-verbose.

Instead of adding an image version check in the helm-chart, add an
alias for the flag for the duration of the update, and revert this
after the Xena upgrade is through everywhere.
Physical CPUs are faster than hyperthreaded CPUs. So by default VMware
spreads a VM's CPU cores onto physical cores, spanning multiple NUMA
nodes if needed, even though the vCPUs would fit onto a smaller number
of NUMA nodes.

HANA VMs prefer low latency (and thus NUMA locality) over raw
performance, so they set the VMware config `preferHT` to not get
spread out. Before, this setting was only applied if the VM's flavor
qualified as a "Big VM" flavor (i.e. >1GiB in size). This excluded
smaller HANA flavors, and a single-NUMA-node VM got spread over two
nodes.

Make `preferHT` depend on whether the flavor requires one of our
`CUSTOM_NUMASIZE_*` traits instead.
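A hypothetical sketch of the new condition; only the `CUSTOM_NUMASIZE_` prefix is from this change, the concrete trait name in the comment is made up:

    def wants_prefer_ht(flavor_extra_specs):
        # e.g. {'trait:CUSTOM_NUMASIZE_C48_M729': 'required'} -> True
        return any(key.startswith('trait:CUSTOM_NUMASIZE_') and value == 'required'
                   for key, value in flavor_extra_specs.items())
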
Another best-practice recommendation for HANA-on-VMware.

The code pretty much duplicates the one for setting the DRS partially-
automated behavior. I decided against deduplication and abstraction in
favor of readability and at the cost of LOC.
Testing showed that single- and half-NUMA-node VMs get spawned across
multiple NUMA nodes due to `numa.autosize.vcpu.maxPerVirtualNode`
being too low. Set the value explicitly to the same value that's used
for the `CoresPerSocket` cluster-vm setting.
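Illustrative extraConfig value - the 8 below is only an example; the real value follows the cluster's CoresPerSocket setting:

    cores_per_socket = 8  # example only; taken from the CoresPerSocket cluster-vm setting
    extra_config = {'numa.autosize.vcpu.maxPerVirtualNode': str(cores_per_socket)}
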
Introduced in 'vmware: Restart BigVMs with "high" priority', the
function "update_cluster_das_vm_override()" did not work, because it
created a "ClusterDasVmConfigInfo" instead of a "ClusterDasVmConfigSpec"
object. This led to us not being able to spawn BigVMs with the
following error:

    Exception in ReconfigureComputeResource_Task.
    Cause: Type not found: 'operation'

Change-Id: If9acf9ee07e373b7b24c14c642d0d99fe2a41db1
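An illustrative sketch of the structural difference, with SimpleNamespace standing in for the suds-generated vSphere types:

    from types import SimpleNamespace

    das_vm_info = SimpleNamespace(key='vm-123',  # the VM's managed object reference
                                  dasSettings=SimpleNamespace(restartPriority='high'))

    # The bug: the ClusterDasVmConfigInfo was attached directly. The fix: wrap
    # it in a ClusterDasVmConfigSpec-like object that also carries the
    # 'operation' field.
    das_vm_spec = SimpleNamespace(operation='add', info=das_vm_info)
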
Copy the relevant code from the libvirt driver.

This reintroduces the code that was changed (presumably by accident) in
  fd4f43b VmWare: Refactored resource and inventory collection
Fixup that commit with this one when forward porting!
`remove_unused_original_minimum_age_seconds` along with several other
config options related to image cache management were moved to their
own config group `image_cache` in Ussuri (nova 21), but the cherry-
pick in
    9de18bd "vmware: Update expired image-template-VM handling"
failed to account for this.
Was removed in
    f3cc311 "vmware: Remove vestigial nova-network support"
but not taken into account by
    aba0dbb "VMware: Image as VM template".
In
    daeeafd "baremetal quota handling"
the `_build_per_flavor_limits()` method is incorrectly
rewritten to match surrounding code, esp. `_build_absolute_limits()`,
and uses not only the limit, but both limit and in_use as the value
for the resulting limit dict.

The original commit,
    16857ed "reintroduce baremetal quota handling"
used `absolute_limits` directly as method parameter.
For historic reasons, there is a mapping between the volume-uuid and
the device[].backing.uuid in vmware through extraConfig. In most cases
they are the same, but sometimes they are not. And apparently, the
extraConfig can even hold wrong information.

Since the risk of a collision of uuids is quite low, and the lookup is
an iteration over a small list of volumes, we first try it that way,
avoiding possibly incorrect information in the extraConfig. Failing
that, we still do the additional API call to get the mapping, and try
that.

Work around type errors mypy raises in the changed file

Change-Id: Ifcdf96cfc6d00473299c1f2a7cb9d23d03294027
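A rough sketch of the lookup order described above (names are illustrative, not the actual nova code):

    def find_disk_for_volume(disks, volume_id, get_extraconfig_mapping):
        # first pass: match the volume uuid directly against the disk backings
        for disk in disks:
            if getattr(disk.backing, 'uuid', None) == volume_id:
                return disk
        # fallback: extra API call for the extraConfig mapping, then retry
        mapped_uuid = get_extraconfig_mapping(volume_id)
        for disk in disks:
            if getattr(disk.backing, 'uuid', None) == mapped_uuid:
                return disk
        return None
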
The ironic-python-agent uses qemu-img to convert various image formats
to the raw disk format. QCow2 is already correctly marked accordingly,
but vmdk was missing.

Change-Id: Ifd868e6951452b291184bd848d4244d37649f1ed
We need to separate the arguments for tox from the ones for the linter.

Change-Id: I6d7ee1d95f0ca17fa6c04a3546a7f907cc6393f1
Nested `ClusterDasVmSettings` inside new `ClusterDasVmConfigInfo`
objects don't get created properly. This seems to have worked on Rocky
but fails on Xena.
Otherwise, it will try to decode as UTF-8-encoded text during read().
We have added custom filters and weighers for our common use-case
VMware, and already excluded baremetal there.

For libvirt as an additional hypervisor, we also want to skip those
filters/weighers, so we now test whether the request is scheduled for
vmware, either due to a flavor extra_spec or an image property.
If nothing is set, we default to vmware for backwards compatibility.

Change-Id: I3223aee433ba9009d2cd6387eeda8e70dbbb3cde
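A hedged sketch of that check; the exact extra_spec and image-property keys used by the patch may differ:

    def is_vmware_request(flavor_extra_specs, image_properties):
        hv_type = (flavor_extra_specs.get('capabilities:hypervisor_type')
                   or image_properties.get('img_hv_type'))
        # nothing set: default to vmware for backwards compatibility
        return hv_type is None or hv_type.lower() == 'vmware'
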
The code assumes that the libvirt version reported by the hypervisor
matches the version of the python library installed. Since we run the
agent in a container, that is not correct, and the attribute will not
be in the library, as it has been built for an older version.

Change-Id: I156047b1f9829a49b429d51ca7f7777606b10a56
Change-Id: I3edf0d3c6deb8385adc1095c7e3c09526e2d4d65
This command detects changes in the flavors and syncs the instances'
flavor-info with the new changes.

`nova-manage sap sync_instances_flavor`
The vcenter doesn't like a reconfiguration call with an empty spec and
raises an error. So we skip that and save ourselves some work on top.

Change-Id: I1da3309500a2cd384c5d7cd431e71297ef5d5a3a
You can now configure

    [vmware]
    hypervisor_mode=<cluster,esxi_to_cluster,cluster_to_esxi>

to choose to expose the individual esxi hosts instead of the cluster.
It will only show the esxi host as hypervisor node, but will not
honour the node parameter for placing the instance during instance
creation, migration, etc...

Change-Id: I264fd242c0de01ae8442c03bc726a0abfbe176ef
The code assumes that there is a single compute-node per host,
which is incorrect for ironic. By summing over all compute-nodes of a
host, we also get a correct response for compute-hosts with more than
one compute-node.

Change-Id: Iaf6e2a72f6649e234660de47bec8b1da1ea1571e
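A minimal sketch of summing over all compute-nodes of a host instead of assuming a 1:1 mapping (field names are illustrative):

    from collections import defaultdict

    def totals_per_host(compute_nodes):
        # compute_nodes: iterable of dicts with 'host', 'vcpus', 'memory_mb'
        totals = defaultdict(lambda: {'vcpus': 0, 'memory_mb': 0})
        for node in compute_nodes:
            totals[node['host']]['vcpus'] += node['vcpus']
            totals[node['host']]['memory_mb'] += node['memory_mb']
        return dict(totals)
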
@grandchild

that's quite a few commits. is this meant for xena?

@fwiesel fwiesel closed this May 2, 2023
@fwiesel
Member Author

fwiesel commented May 2, 2023

The PR was meant for rocky, but I've updated the branch for xena.
I've closed this in favour of #423

@grandchild

fyi, it should be possible to change the target branch within the existing PR -- but all good, let's continue in #423.
