
Port Rocky patches onto Xena #350

Merged
merged 284 commits into stable/xena-m3 on Aug 8, 2022
Conversation

joker-at-work

This includes some master fixes, because the tests wouldn't run through otherwise.

notandy and others added 30 commits August 8, 2022 15:40
(cherry picked from commit bd338c3)
(cherry picked from commit 7976a56)
(e.g. when all hosts are in maintenance)

(cherry picked from commit 4a32b79)
(cherry picked from commit fa208ae)
(cherry picked from commit a78c03e)
(cherry picked from commit 70e9676)
…onsole - raising exception.ConsoleTypeUnavailable if the AcquireTicket operation throws InvalidPowerState

(cherry picked from commit 7c6c326)
(cherry picked from commit 248ab92)
This fix adds a configuration flag for avoiding migrations across
clusters. Setting the `always_resize_on_same_host` config option
puts the instance's host into `force_hosts`, resulting in the
instance being resized on the same host.

(cherry picked from commit 935060e)
(cherry picked from commit 1d38c7f)
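
Purely as an illustration of the mechanism described above (the function and field names here are assumptions, not the patch's actual code), a flag-controlled resize could pin the scheduler to the current host like this:

```python
# Hypothetical sketch: pin a resize to the instance's current host when the
# (assumed) always_resize_on_same_host option is enabled.
def apply_same_host_resize(request_spec, instance, always_resize_on_same_host):
    """Populate force_hosts so the scheduler can only pick the current host."""
    if always_resize_on_same_host and instance.get("host"):
        request_spec["force_hosts"] = [instance["host"]]
    return request_spec

# Example: the resize stays on the host the instance already runs on.
spec = apply_same_host_resize({}, {"host": "nova-compute-bb001"}, True)
assert spec["force_hosts"] == ["nova-compute-bb001"]
```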
Nova regularly checks the power state of every VM on the hypervisor. The
default behaviour for VMs not matching the requested power state is to
call the stop API. But VMware hypervisors are actually a cluster of
hypervisors that do HA on their own. While an HA failover is in
progress, especially if that failover is delayed for some time, e.g.
because of a still-held lock, VMs are reported as SHUTDOWN by the API.
If we call the stop API then, Nova will shut down the VM once it's
RUNNING again, because the stop API sets the expected/user-wanted VM
state to SHUTDOWN. This would effectively disable VMware HA.

Therefore, we're introducing a new setting
`sync_power_state_unexpected_call_stop` which can be used to disable the
call to the stop API.

(cherry picked from commit d915a1b)
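
A minimal sketch of the gating logic described here, with assumed names (the real code lives in the compute manager's power-state sync task):

```python
# Hedged sketch: only call the stop API for an unexpectedly-SHUTDOWN VM when
# the (assumed) sync_power_state_unexpected_call_stop option is enabled.
def handle_unexpected_shutdown(instance, hypervisor_power_state, conf, stop_api):
    if instance["vm_state"] == "active" and hypervisor_power_state == "SHUTDOWN":
        if not conf.get("sync_power_state_unexpected_call_stop", True):
            # Leave the VM alone; the VMware cluster's own HA will restart it.
            return "skipped"
        stop_api(instance)  # would set the expected state to SHUTDOWN
        return "stopped"
    return "in-sync"
```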

Sync `vm_state` on power sync, too

This ensures the customer can call "server start" again if they manually
shut down the VM from inside. To handle this, we also have to disable
calling the stop API on finding a VM unexpectedly running.

(cherry picked from commit 525380a)

Sync SUSPENDED/PAUSED state into `vm_state`, too

Since we handle `CONF.sync_power_state_unexpected_call_stop` in all the
other places, we should be consistent and sync PAUSED and SUSPENDED,
too, even though those states should only occur if someone clicks
around in the vCenter.

(cherry picked from commit beef5df)

Add option to reserve all memory

If we don't reserve all memory, there's a swap file pre-created for
every VM on the ephemeral storage - even if it's unused.

But this option comes at the price of not being able to over-commit
memory.

Related issue: CCM-9589

(cherry picked from commit a435012)

Fix fetch image incorrectly returning the folder path instead of the VMDK path, to address an OVA image deploy error for incorrect parameter fileType

(cherry picked from commit f93a4b3)

Handle VM name duplication for OVA image import

(cherry picked from commit 90f1629)

tests: Fix `test_sync_power_states_instance_not_found`

This test fails from time to time because we added a random sleep time
in commit
	> flatten power_sync_state to intervall duration

(cherry picked from commit 977552b)
(cherry picked from commit 5569c64)
… instead of from the shadow VM

(cherry picked from commit 18b4d49)

fix detaching volume created from image

(cherry picked from commit 446ac66)

fix unit tests and feedback for attach/detach volume created from image

(cherry picked from commit c71d177)
(cherry picked from commit ed19142)
Add global setting `bigvm_mb`

This is necessary in multiple places e.g. the scheduler and
compute-driver to handle big VMs differently. We cannot currently use
flavors for that, because there are still custom flavors in the system
that we don't control.

(cherry picked from commit 881889d)

[scheduler] Add BigVmFilter

It filters out hosts having more than `bigvm_memory_utilization_max`
memory utilization. This is necessary because otherwise it takes too
long to find free space for the big VM on a host.

(cherry picked from commit d284fc3)

[vmware] Add DRS override for "big" VMs

Since DRS doesn't play well with large differences in VM size and HANA
doesn't play well with vMotion, we're disabling automatic re-balancing
actions for "big" VMs by the DRS. On spawn, we add a
`ClusterDrsVmConfigSpec` for instances having more than 1 TB of memory
by default - configurable via the `bigvm_mb` configuration variable. The
vCenter deletes the overrides automatically when a VM is deleted.

One can view current overrides in the vCenter UI via $cluster (e.g.
productionbb098) -> Configure -> VM Overrides.

We're always adding the rule in `update_cluster_drs_vm_override`,
because we only call it in the driver's `spawn` method which seems to be
called only when a new vm object is created. Therefore, there should not
be any existing rule for the VM.

(cherry picked from commit ae68374)

[vmware] Always reserve all memory for big VMs

Since there are still custom flavors in use for big VMs, we cannot rely
on the flavors to set the reserved memory for big VMs, which is
necessary because otherwise small VMs will compete with the big VMs'
HANA, which does not work well.

We're using `CONF.bigvm_mb` here to identify a big VM and reserve all
memory regardless of `CONF.reserve_all_memory`.

(cherry picked from commit 706bcd8)

[scheduler] Make BigVmFilter more dynamic

Instead of having a static threshold for the `BigVmFilter`, we compute
the threshold depending on the hypervisor size and the requested RAM of
the big VM. The reasoning behind this is that on bigger hypervisors a
small big VM doesn't matter as much as a huge one, e.g. 1.4 TB on a 6
TB hypervisor vs. 5.9 TB on a 6 TB hypervisor. We basically aim at 50 %
of the requested RAM being free on average over all hosts in the cluster.

The filter thus currently only works if a compute-node is in a host
aggregate defining the `_AGGREGATE_KEY` (currently "hv_size_mb") having
an integer defining the hypervisor's RAM size as a value.

(cherry picked from commit 6ad6a2c)
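
To make the 50 % rule concrete, here is a hedged sketch of how such a dynamic utilization threshold could be computed (function names and the percentage form are assumptions, not the filter's actual code):

```python
# Sketch: allow a host only if the cluster's average memory utilization still
# leaves roughly half of the requested RAM free per host.
def dynamic_utilization_threshold(hv_size_mb, requested_ram_mb, target_free_ratio=0.5):
    """Return the maximum allowed memory-utilization percentage."""
    free_needed_mb = target_free_ratio * requested_ram_mb
    return max(0.0, (hv_size_mb - free_needed_mb) / hv_size_mb * 100.0)

def host_passes(avg_utilization_pct, hv_size_mb, requested_ram_mb):
    return avg_utilization_pct <= dynamic_utilization_threshold(hv_size_mb, requested_ram_mb)

# A 1.4 TB request on a 6 TB hypervisor tolerates far more utilization
# than a 5.9 TB request on the same hypervisor.
print(dynamic_utilization_threshold(6 * 1024 * 1024, 1.4 * 1024 * 1024))  # ~88.3
print(dynamic_utilization_threshold(6 * 1024 * 1024, 5.9 * 1024 * 1024))  # ~50.8
```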

[scheduler] Add BigVmHypervisorRamFilter

Since we've got multiple BigVM-filters now, this also renames
`BigVmFilter` to `BigVmClusterUtilizationFilter` for clarity.

With more diversity in hypervisor sizes (e.g. 6 TB hypervisors
right around the corner), we have to make sure big VMs actually fit the
hypervisor. For that, we introduce a new scheduler filter -
`BigVmHypervisorRamFilter` - that prohibits scheduling of big VMs not
fitting the hypervisor.

In KVM-deployments, this would be the job of the `RamFilter`, but since
we run VMware and the scheduler only sees the cluster's and not a single
host's resources, we use a special host aggregate attribute -
"hv_size_mb" - to retrieve the hypervisor size as already done for
`BigVmClusterUtilizationFilter`.

(cherry picked from commit 4b6867a)

[scheduler] pep8 cleanup

(cherry picked from commit 39231e4)

Add `is_big_vm` utility function

This function should be used to check whether one is handling a big VM.
All necessary conditions are gathered in one place so one cannot
forget a condition. This also makes it easy to change the conditions
later on.

(cherry picked from commit cc46cf4)
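
As a hedged illustration of the idea (threshold and names assumed from the `bigvm_mb` default mentioned above, not copied from the helper itself):

```python
# Sketch: one central predicate for "is this a big VM?" so no caller can
# forget a condition; here the only condition shown is the memory threshold.
BIGVM_MB = 1024 * 1024  # assumed default of CONF.bigvm_mb: 1 TiB in MiB

def is_big_vm(memory_mb, bigvm_mb=BIGVM_MB):
    return memory_mb >= bigvm_mb

assert is_big_vm(1024 * 1024) and not is_big_vm(512 * 1024)
```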

[scheduler] Remove BigVmHypervisorRamFilter

We actually don't need it, because our filter queries the placement API
for resource providers capable of fulfilling the request. With the
vmwareapi driver setting the `max_unit` value for `MEMORY_MB` to the
cluster's maximum host size, we don't have to add another filter into
the scheduler, as too small hosts won't even end up as scheduling
options.

(cherry picked from commit c2fc281)

[scheduler] Use placement API to get HV size

We recently discovered that the hypervisor size is actually available
in the placement API. Every compute host has a resource provider with an
inventory. There, the vmwareapi virt driver sets the `max_unit` for the
`MEMORY_MB` resource class to the biggest host found in the cluster.
This means we don't have to use aggregates to get the hypervisor size
in the BigVm filters, but can retrieve it from the placement API.

We're building a cache of the value that expires after 10 min. This
guards against the case where a lot of VMs are scheduled in a short
period of time - even though tests showed that querying the inventory of
all hosts in QA did not take longer than a second. Cache expiration also
guards against stale values for newly-created hypervisors that have not
set their inventory yet.

(cherry picked from commit 965d3b4)
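
A rough sketch of the caching behaviour described above, with assumed names; the fetch callback stands in for the actual placement query:

```python
# Hedged sketch: remember the hypervisor size read from placement's MEMORY_MB
# max_unit for 10 minutes per compute node.
import time

_HV_SIZE_CACHE = {}        # compute-node UUID -> (value_mb, fetched_at)
_HV_SIZE_CACHE_TTL = 600   # seconds

def get_hv_size_mb(compute_node_uuid, fetch_from_placement):
    """fetch_from_placement(uuid) must return the MEMORY_MB max_unit."""
    now = time.time()
    cached = _HV_SIZE_CACHE.get(compute_node_uuid)
    if cached and now - cached[1] < _HV_SIZE_CACHE_TTL:
        return cached[0]
    value = fetch_from_placement(compute_node_uuid)
    _HV_SIZE_CACHE[compute_node_uuid] = (value, now)
    return value
```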

[vmware] Make `partiallyAutomated` a constant

We usually use variables as constants instead of directly embedding
strings.

(cherry picked from commit 81097b0)
(cherry picked from commit fbf0a2d)
Fix for bug where the ClusterInfoEx has no group attribute despite
the fact that the VM should be in a server group

(cherry picked from commit da99dc3)
(cherry picked from commit a784bd2)
This allows the cloud-provider to give users a way to identify
the cloud platform they are running on. We want this so AS ABAP can do
group licensing for CCloud.

(cherry picked from commit 044d78b)
(cherry picked from commit 58ed345)
We have different datastores for VM/ephemeral and volume storage. Moving
the volume to the VM's datastore would defeat the purpose of having
different datastores and would make a Cinder volume end up on ephemeral
storage. This cannot be intended behaviour and thus we don't do it.

(cherry picked from commit 69f2fd2)
(cherry picked from commit 8a6a814)
When booting from volume, we want the resize to work regardless of the
new flavor's root disk size, as we don't touch the root disk anyway.

(cherry picked from commit ddfcfc2)
(cherry picked from commit 11b49cd)

[vmware] Skip disk resize for boot from volume
When booting from volume, the flavor's root disk size doesn't matter, so
we don't have to resize the volume.

(cherry picked from commit ddfcfc2)
(cherry picked from commit 11b49cd)
When booting from volume, the driver should ignore the volume size and
not error out if the root disk is bigger than what the flavor would have
created as an ephemeral disk.

(cherry picked from commit 533c503)
(cherry picked from commit bae6465)
When retrieving a volume from cinder, we translate the object to a dict.
This dict now also includes an `owner` attribute read from
`os-vol-tenant-attr:tenant_id` - if it exists. It might not exist if
the Cinder policy does not allow the user to see that extension
attribute.

We need that attribute for boot-from-volume where the image-metadata is
read from the volume instead of the image. Image metadata contains an
owner attribute.

(cherry picked from commit 63af5a0)
(cherry picked from commit 154acce)
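
Illustration only (simplified; the attribute name is taken from the message above, the helper name is assumed): the translation could look roughly like this:

```python
# Sketch: copy the tenant id into an `owner` key if the cinder policy exposed it.
def volume_to_dict(cinder_volume):
    vol = {"id": cinder_volume.get("id"), "size": cinder_volume.get("size")}
    owner = cinder_volume.get("os-vol-tenant-attr:tenant_id")
    if owner is not None:
        vol["owner"] = owner  # needed to build image metadata for boot-from-volume
    return vol
```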
When reporting the memory and cpu resources of a cluster, we must not
include failover hosts as they can only be used by the HA process to
start VMs from failing hosts.

(cherry picked from commit 1be80cb)
(cherry picked from commit d62ed1a)
This can be used to have different CPU/memory reservations for hosts
depending on the hostgroup they're in. We can also set defaults in
percent with this instead of only static values as supported by
CONF.reserved_host_memory_mb and CONF.reserved_host_cpus.

NOTE: Setting these values too differently per hostgroup for the same
cluster might result in the placement API getting confused and
scheduling becoming impossible, as the `max_unit` is still set to the
maximum per cluster. In case of our HANA HVs, placement would therefore
always consider the HANA HVs' max unit. That should be no problem as
smaller VMs should always fit.

(cherry picked from commit 5250436)
(cherry picked from commit 28f4eb8)
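
A hedged sketch of the percent-or-absolute reservation idea (the option format shown is an assumption, not the actual config schema):

```python
# Sketch: a per-hostgroup reservation like "10%" or "4096" (MB); fall back to
# the static CONF.reserved_host_memory_mb-style value when nothing is set.
def effective_reserved_memory_mb(host_memory_mb, hostgroup_value, static_reserved_mb):
    if hostgroup_value is None:
        return static_reserved_mb
    value = str(hostgroup_value).strip()
    if value.endswith("%"):
        return int(host_memory_mb * float(value[:-1]) / 100.0)
    return int(value)

assert effective_reserved_memory_mb(1048576, "10%", 8192) == 104857
assert effective_reserved_memory_mb(1048576, None, 8192) == 8192
```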
While the placement-api already adhered to the reservations set per
hostgroup, the os-hypervisors/details endpoint still only used the
reserved values from the config file. With this commit Nova also uses
the reservations to compute a hypervisor's used cpus/memory in the
hypervisor-view. This is important to us, as we get the quota capacity
from there.

(cherry picked from commit e1c4020)

Fix hostgroup reservations for production configurationEx

The tests worked fine, but production actually has a configurationEx
that does not support `.get()` which thus raises an AttributeError on
nova-compute start.

(cherry picked from commit 927cb88)

Fix tests for "tests: Set more fake HostSystem attributes"

We now actually return stuff in the tests and thus comparing to "null"
is not correct anymore.

(cherry picked from commit fa46579)
(cherry picked from commit 2329086)
As the comment states, it's only relevant for XML and should be removed
in the future. The future is now, as we want fully colored logs :D

(cherry picked from commit b332328)

pep8 for "[api] Don't remove ANSI from console output"

(cherry picked from commit c71cab2)
(cherry picked from commit 7c70c9f)
Instead of checking for the flavor attribute `quota:separate` we now use
the same flavor attribute as the scheduler: `capabilities:cpu_arch`.
This is done for consistency reasons and to make sure we can also use
separate quota for the big VMs if we want to.

(cherry picked from commit 551a457)
(cherry picked from commit e3ba555)
We don't show the total values for RAM/CPU anymore, but only the ones
reduced by the reserved resources. We need this, because limes retrieves
the full quota capacity from the total values.

I opted for changing the resources (for that single call, hence
copy.deepcopy) instead of subtracting the reserved values from the
totals afterwards, because if an attribute is unset on the `ComputeNode`
object, we cannot access it - not even with `getattr`. It just raises a
`NotImplementedError` because it cannot fetch the attribute from
somewhere (e.g. the DB). But since we need to subtract from the total,
we would have to access the attribute. Catching the
`NotImplementedError` didn't seem like a good option to me.

(cherry picked from commit 9ca7315)
(cherry picked from commit fd2c02e)
The function could not handle deleted flavors and thus prohibited quota
handling for certain projects by raising an exception. Additionally, it
was inefficiently using multiple queries to retrieve missing flavors.

The new function now retrieves the flavor information from the
`InstanceExtra` object, which saves the previous and current flavor in
JSON format, in case the flavor gets deleted. If it cannot find that
information - which should basically never happen, as that information
has been saved there basically forever - it falls back to the old
behaviour of querying the cell DB.

(cherry picked from commit 8e377a6)
(cherry picked from commit 0a18154)
This filter is supposed to only let full- or half-node-size big VMs onto
a host. This is our quickfix to get some NUMA-alignment for big VMs going
until the vmwareapi driver properly supports NUMA-aware scheduling.

(cherry picked from commit a320feb)
(cherry picked from commit fc725d8)

[scheduler] Make big VM host fractions configurable

Since the original implementation only supported full- and half-node
sizes hard-coded in the code and we now want only full-node size to be
supported, we make the host fractions configurable via settings.
The value set in the settings for a host fraction will be multiplied by
the hypervisor's memory.

(cherry picked from commit f4978ab)
(cherry picked from commit 5a7fd2a)

[scheduler] Support multiple extra_specs values in big VM filter

We want to define flavors that can be used both for a full and a half
hypervisor, so customers don't have to juggle too many flavors.

(cherry picked from commit b19e37c)
(cherry picked from commit 347d7c8)

[scheduler] Fix host_fractions setting in big VM filter

As the ini-style config file only allows the user to provide
string values, we need to convert the values to floats explicitly. The
tests could not catch that case, because we set a dict there and could
use floats directly.

(cherry picked from commit 6e3692a)
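
For illustration (names assumed), the conversion plus the multiplication against the hypervisor memory could look like this:

```python
# Sketch: ini values arrive as strings, so convert explicitly before using them.
def parse_host_fractions(raw):
    """raw e.g. {'full': '1.0', 'half': '0.5'} as read from the config file."""
    return {name: float(value) for name, value in raw.items()}

def allowed_memory_mb(hv_size_mb, fractions):
    return {name: hv_size_mb * fraction for name, fraction in fractions.items()}

print(allowed_memory_mb(3 * 1024 * 1024,
                        parse_host_fractions({"full": "1.0", "half": "0.5"})))
# {'full': 3145728.0, 'half': 1572864.0}
```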

[nova][scheduler] Add logging for no matching host fraction

That way we can see more easily whether everything is configured correctly.

(cherry picked from commit b788650)
(cherry picked from commit 9082948)
It's possible to create a boot-from-volume instance with
instance.image_ref set to something - even something different than the
actually booted volume's image according to
https://bugs.launchpad.net/nova/+bug/1648501. Therefore, we cannot rely
on an empty image_ref for BFV detection and instead use
nova.compute.utils.is_volume_backed_instance now. This prohibits the
vmware driver from creating a disk on ephemeral store in addition to
using the volume and booting from the volume.

(cherry picked from commit 395c76d)
(cherry picked from commit b08bf77)
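
A simplified sketch of the detection change (not the actual `is_volume_backed_instance` implementation): decide from the block-device mappings rather than from `image_ref`:

```python
# Sketch: an instance is volume-backed if its root device (boot_index 0) is a volume.
def is_volume_backed(block_device_mappings):
    for bdm in block_device_mappings:
        if bdm.get("boot_index") == 0:
            return bdm.get("destination_type") == "volume"
    return False

assert is_volume_backed([{"boot_index": 0, "destination_type": "volume"}])
assert not is_volume_backed([{"boot_index": 0, "destination_type": "local"}])
```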
Instead of returning an error when a duplicate is found with pagination
enabled, we fetch the full amount of instances from the cells regardless
of open build requests and slice the result at the end to the requested
length.

(cherry picked from commit 841e2ce)
(cherry picked from commit 8b7183b)
We cannot use `orig_limit` when applying the limit to the list of
instances, because using the IP filter needs the full list. The code
handled this before by setting `limit` to `None`. Since we removed the
code subtracting the already found instances from `limit`, we can safely
use `limit` instead of `orig_limit` to make sure we return an unfiltered
list if IP-filtering is requested.

(cherry picked from commit e8862db)
(cherry picked from commit eef803e)
(cherry picked from commit dac4481)

[vmware] finish_revert_migration: relocate after the disks are updated

The relocate should happen after the _revert_migration_update_disks()
has been executed, otherwise reverting to the initial disk size could fail.

(cherry picked from commit 4a1957a)

[vmware] resize/migrate: fix RelocateVM api calls

- fix network devices backing while performing RelocateVM
- use migration.dest_host and migration.source_host to determine
  if we should move the instance to another cluster
- make use of resize_instance parameter passed to the driver

(cherry picked from commit 4e7a3de)

[vmware] _resize_disk: use instance.old_flavor for comparison

Since _resize_disk is executed on the finish_migration step,
we must use instance.old_flavor.root_gb to see whether we are
actually increasing the size of the disk.

(cherry picked from commit c8f96d0)

fix pep8 warnings and remove some unused code

(cherry picked from commit ed2db87)

fixes for pr comments

(cherry picked from commit 7b525c3)

fail if a vif is not found on the vm while relocating

(cherry picked from commit 5472f5e)

create an extra step for attaching volumes after relocating

RESIZE_TOTAL_STEPS is now 7

(cherry picked from commit 53c241c)

add log for Relocated VM on finish_revert_migration

(cherry picked from commit 458376c)

fix tests for '[vmware] allow to relocate a VM to another cluster'

add assertion for update_cluster_placement call

(cherry picked from commit 81f1070)

[vmware] attach the block devices by the boot_index order

When attaching the volumes after a migration, we must guarantee
the order of the boot devices, otherwise the instance may not
boot because the boot disk is not attached first.
Therefore, we first attach the volumes that specify a boot_index,
ordered by that boot_index. The volumes with the default (-1) or
without a boot_index are attached afterwards in the order they came
from the manager.

(cherry picked from commit adff90c)
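
A small sketch of the described ordering rule (field names assumed to follow the usual BDM layout):

```python
# Sketch: volumes with a boot_index come first, sorted by that index; the rest
# keep the order they came from the manager.
def order_block_devices(bdms):
    indexed = [b for b in bdms if b.get("boot_index", -1) >= 0]
    rest = [b for b in bdms if b.get("boot_index", -1) < 0]
    return sorted(indexed, key=lambda b: b["boot_index"]) + rest

bdms = [{"id": "data", "boot_index": -1}, {"id": "root", "boot_index": 0}]
assert [b["id"] for b in order_block_devices(bdms)] == ["root", "data"]
```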

[vmware] handle relocate failure and rollback the detached volumes (#143)

If a relocate fails, we must ensure the detached volumes
aren't left hanging, so we make sure they get reattached after an
exception is thrown.

(cherry picked from commit 748d0d9)
(cherry picked from commit f4b0146)
Backport of fix that can be found here:
https://review.opendev.org/#/c/678958/

ATTENTION: There are more changes in that fix that we didn't have to
backport for Queens. Please check the other changes when re-applying to
newer versions of Nova.

(cherry picked from commit 4790b8f)
(cherry picked from commit a3b78af)
This patch enables us to add more DRS groups to a VM than only the
affinity/anti-affinity groups. This is supposed to be extended in future
commits to return more than one group as configured on certain
conditions, e.g. big VMs.

(cherry picked from commit 6d4ae3b)
(cherry picked from commit 61a6562)
Some of our VMs have special spawning needs, which means they need to
spawn on their own hypervisor at first. This function helps us define
these VMs in a central place.

(cherry picked from commit e9829a7)
(cherry picked from commit 6f4ed6a)
fwiesel and others added 29 commits August 8, 2022 15:40
We changed the code to ignore the file name,
as a vMotion will result in renaming of the files,
breaking the heuristic to detect the root disk.
Instead we were taking the first disk
when the uuid parameter was set.

The uuid parameter is not set when working with shadow-VMs
and VMs for image import. So no special handling is
needed; we always want the first disk in those cases too,
and so we can scrap the uuid argument.

Change-Id: Ib3088cfce4f7a0b24f05d45e7830b011c4a39f42
(cherry picked from commit bd7925e)
VMwareAPISession has been moved to its own module,
and this change should reflect that in the test case.

Change-Id: Ie0878986db41887f9f0de0bc820135d5284df403
(cherry picked from commit 9854168)
The vmwareapi driver uses Managed-Object references throughout
the code with the assumption that they are stable. They are, however,
database ids, which may change during the runtime of the compute node.
If an instance is unregistered and re-registered in the vCenter,
the moref will change.

By wrapping a moref in a proxy object, with an additional method
to resolve the openstack object to a moref, we can hide those changes
from a caller.

For that the initial search/resolution needs to wrap the resulting
moref in such a proxy.

Change-Id: I40568d365e98359dbe90663c400e87be024df2eb
(cherry picked from commit 89b5c6e)

Vmware: MoRef implementation with closure

This should ease the transition to stable mo-refs.
One simply has to pass the search function as a closure
to the MoRef instance, and the very same method will
be called when an exception is raised for the stored
reference.

Change-Id: I98b59603a8ef3b91114f378d82cd7418d26a1c52
(cherry picked from commit c854d41)
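
Roughly, the closure idea can be pictured like this (a simplified stand-in, not the driver's actual MoRef class):

```python
# Sketch: keep the search function alongside the stored reference so the
# reference can be re-resolved after a ManagedObjectNotFound-style failure.
class MoRefSketch:
    def __init__(self, search_fn):
        self._search_fn = search_fn   # closure returning a fresh moref
        self._ref = search_fn()

    @property
    def ref(self):
        return self._ref

    def resolve_again(self):
        """Called when the stored reference turned out to be stale."""
        self._ref = self._search_fn()
        return self._ref
```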

Vmware: Implement StableMoRefProxy for VM references

By encapsulating all the parameters for searching for
the vm-ref again, we can move the retry logic to the
session object, where we can try to recover the vm-ref
should it result in a ManagedObjectNotFound exception

Change-Id: Id382cadd685a635cc7a4a83f69b58075521c8771
(cherry picked from commit bc23e94)

Vmwareapi: Move equality test to tests

The equality test is only used by the tests
so it is better implemented there.

Change-Id: I51ee54265c4cc2b4f40c0b83f785a49f8a8ebce4
(cherry picked from commit 84f3e06)

Vmwareapi: Stable Volume refs

The connection_info['data'] contains the managed-object
reference (moref) as well as the uuid of the volume.

Should the moref become invalid for some reason,
we can recover it by searching for the volume-uuid
as the `config.instanceUuid` attribute of the shadow-vm.

Change-Id: I0ae008fa15a7894e485370e7b585821eeb389a93
(cherry picked from commit a71ddf0)
The clone created in a snapshot would also contain
the nvp.vm-uuid field in the extra-config.
If we then delete the original VM, the fallback mechanism
of searching for the VM by extra-config would trigger,
find the snapshot and delete that instead.

Change-Id: I6a66fa07dfe864ad4deedc1cafe537959cd969f4
(cherry picked from commit 90a9f4e)
Remove datastore_regex from VMWareLiveMigrateData

This was a leftover of some part of the development process and never
used. Thus, we remove it again.

Change-Id: I37ce67b4773375e31f18ac809a6029aa41702a3b
(cherry picked from commit 17928f7)

vmware: ds_util.get_datastore() supports hagroups

We're going to implement hagroups of datastores, and for that we need to
be able to select a datastore from a specified hagroup. This is
currently planned via matching the name of the datastore against a
regex that can extract the hagroup from the name.

This commit adds retrieving the hagroup and checking it against the
requested one to ds_util.get_datastore().

Change-Id: Ie3432a8e0b020ca9bf41abc098c0fac059af0df9
(cherry picked from commit f8e452a)
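
For illustration only - the datastore naming scheme and regex below are assumptions, not the deployment's actual values:

```python
# Sketch: extract the hagroup from a datastore name and compare it with the
# requested one.
import re

DATASTORE_HAGROUP_REGEX = re.compile(r"^eph-(?P<hagroup>[ab])-\d+$", re.I)

def matches_hagroup(datastore_name, wanted_hagroup):
    m = DATASTORE_HAGROUP_REGEX.match(datastore_name)
    return bool(m) and m.group("hagroup").lower() == wanted_hagroup.lower()

assert matches_hagroup("eph-a-001", "a")
assert not matches_hagroup("eph-b-001", "a")
```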

vmware: Add setting datastore_hagroup_regex

This setting will be used to enable distribution of ephemeral root disks
between hagroups of datastores. The hagroups are found by applying this
regex onto the found datastore names and should be named "a" or "b".

Change-Id: I45da5dd5c46a4ba64ea521a0e0975f133b5801f1
(cherry picked from commit c10d4e8)

vmware: Distribute VM root disks via hagroups

We want to distribute the ephemeral root-disk of VMs belonging to the
same server-group between groups of datastores (hagroups). This commit
adds the mentioned functionality for spawning new VMs, offline and
online migration.

Change-Id: I889514432f491bac7f7b6dccc4683f414baac167
(cherry picked from commit 6feb47d)

vmware: Add method to svMotion config/root-disk

For distributing ephemeral root disks of VMs belonging to the same
server-group between 2 hagroups, we need to be able to move the
disk/config of a VM to another ephemeral datastore.

This method will do an svMotion by specifying a datastore for all disks
and the VMs. The ephemeral disks - found by using the datastore_regex -
receive the target datastore while all other disks, which should be
volumes, receive their current datastore as target.

Change-Id: Iac9f2a2e35571bef3a58a22f6d96608f2b0bf343
(cherry picked from commit 01b9876)

vmware: Ignore bfv instances for hagroups

Boot-from-volume instances do not matter for our ephemeral-root-disk
anti-affinity as Cinder manages anti-affinity for volumes and
config-files going down with a datastore do not bring the instance down,
but only make it inaccessible / unmanageable. The swap file could become
a problem if it lives on the same datastore as the config-files, but
newer compute-nodes store the swap files on node-local NVMe swap
datastores in our environment, so we ignore this for now. We could solve
this by passing in a config option that determines whether we should
ignore bfv instances or not depending on if we detect node-local swap
datastores or not.

We move the generation of hagroup-relevant members of a server-group
into its own function.

Change-Id: Id7a7186909e236b7c81b4b8c8489e84f1067f2d4
(cherry picked from commit 2c7e2cc)

vmware: Add hagroup disk placement remediation

Every time a server-group is updated through the API, we call this
method to verify and remedy the disk-placement of VMs in the
server-group according to their hagroups.

Change-Id: I7ba6b14f5c969fb77dc5ce0fed63a6d9251f556e
(cherry picked from commit cc50e0d)

vmware: Validate hagroup disk placement in server-group sync-loop

This replaces adding an additional nanny to catch when Nova missed an
update to a server-group e.g. because of a restart.

Change-Id: I9aa516bfe6be127a011539d9d22a78d1f38aba13
(cherry picked from commit 09a32e2)

vmware: Use instance lock for ephemeral svMotion

When moving the ephemeral root-disk and the VM's config files, we take
the instance lock to serialize changes to the VM. This makes sure
that we don't squeeze our task between other tasks in the vCenter, which
would make us read an inconsistent state of the VM.

Change-Id: I04fc39bd48896bfd8010f17baa934f6f828edcef
(cherry picked from commit 4f5eda3)

vmware: Place VMs to hagroups more randomly

The previous implementation of placing a VM onto an hagroup based on its
index in the server-group has a big disadvantage for the common
use-case of replacing instances during upgrades one by one: every VM
added to the end would end up on the same hagroup.

To work against this, we put VMs onto hagroups randomly by taking the
first character of their UUID and using it modulo 2 as the deciding
factor. Since these UUIDs are already generated randomly, we don't need
to hash them or anything.

Change-Id: Ib0d9f24ae7d5e0d4e2dceeb77a1513a8657976d2
(cherry picked from commit 52b5d4b)
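
The described rule is small enough to show directly; a hedged sketch (group labels "a"/"b" as in the hagroup commits above):

```python
# Sketch: the first hex character of the instance UUID, modulo 2, picks the hagroup.
def hagroup_for_instance(instance_uuid):
    return "ab"[int(instance_uuid[0], 16) % 2]

assert hagroup_for_instance("4a1f0d9e-1111-4a2b-8c3d-5e6f7a8b9c0d") == "a"  # 4 % 2 == 0
assert hagroup_for_instance("52b5d4b0-2222-4a2b-8c3d-5e6f7a8b9c0d") == "b"  # 5 % 2 == 1
```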

vmware: datastore_hagroup_regex ignores case

When finding hagroups in datastores with the regex from
datastore_hagroup_regex, we use re.I to ignore the case so that an error
made by an operator in naming the datastore does not break the feature.

Change-Id: I4de760d99513abc9977f698aaba85b6456709ca6
Prior to this, the driver was performing migration/resize in a
way that could lead a VM into an inconsistent state and was not
following the way nova does the allocations during a migration.
Nova expects the driver to do the following steps:
* migrate_disk_and_power_off() - copies the disk to the dest compute
* finish_resize() - powers up the VM on the dest compute
This change removes the RelocateVM_Task and introduces a new
CloneVM_Task instead, in migrate_disk_and_power_off().

The CloneVM_Task also allows now cross-vcenter migrations.

Co-Authored-by: Marius Leustean <[email protected]>
Change-Id: I9d6f715faecc6782f93a3cd7f83f85f5ece02e60
(cherry picked from commit 95f9036)
If we attach a volume to a VM, we have to set the storage-profile.
Otherwise, the VM will not be compliant to the profile and - especially
on VMFS datastores - cannot be storage-vMotioned around if the
storage-profile includes storage-IO control. With setting the profile
for each disk-attachment, the VM also shows compliant to these profiles
in the UI.

Change-Id: Idad6293dc7dfdf46fed584b9c116c03f928d44fe
(cherry picked from commit dabcbca)
If a shadow-vm is missing, we raise an AttributeError,
which does not clearly identify the reason for the failure.
We better re-raise the original ManagedObjectNotFound exception,
so it is more clearly identifiable.

Change-Id: I954c57e97961833208743bc88e3ce75ad23cfe8c
(cherry picked from commit a5a9dd9)
If multiple NICs are attached, they need different device-keys,
otherwise the vmwareapi will reject the request.

Change-Id: I0aa58ad11c499e9423c7ecc7998325b05dd9147e
(cherry picked from commit 8ba8b32)
When spawning a VM with more than 128 cores, we set numCoresPerSocket
and some flags, e.g. vvtdEnabled. We missed adding the same flags when
resizing to a VM with more than 128 cores. This patch remedies that.

Change-Id: I381a413ecf80af14dd4bf1dfde2d070976b6477a
(cherry picked from commit cfd906b)
When simply cloning the original VM, the size might not
fit on the target hypervisor.
Resizing it to the target size might not fit on the
source hypervisor.
So we simply scale it to minimal size, as we are going
to reconfigure it to the proper size on the target
hypervisor anyway.

Change-Id: Ia05e5b3a5d6913bfcef01fa97465a1aaa69872d0
(cherry picked from commit 40d6589)

Vmware: Warn about failed drs override removal

An error implies manual intervention is needed, and an exception implies
a developer has to debug something. This, however, is a known behaviour
which can potentially lead to problems, hence a warning.

Change-Id: I9479fb6405485e763a6344e7f44a60f75891adcb
(cherry picked from commit f88a96c)
When VMs with lots of CPUs are running for a longer period of time, a
task to reconfigure the VM might end up hanging in the vCenter.

According to VMware support, this problem happens when those VMs have
been running for a longer period of time and, with the large number of
CPUs, have accumulated enough differences between those CPUs that
getting them all into a state where a reconfigure can be executed takes
more time than the default 2 s (iirc). The advanced setting to increase
this time is "migration.dataTimeout".

For simplicity reasons and because it shouldn't hurt (according to
VMware), we set it on all big VMs. That way, we do not have to figure
out if the VM consumes enough CPUs of the hypervisor to need this
setting.

Change-Id: Id8bda847c9e48997b385d9e1079ee9e99af9b8e8
(cherry picked from commit 2f7393c)
Until now, we only kept image-template-VMs that had tasks showing
their usage - but VMs cloned from another image-template-VM don't have
any tasks. Thus we immediately removed VMs we cloned to another BB. This
could even happen while the copying of the disk into the cache directory
was still in progress.

To counter this, we now take the "createDate" of the VM into account and
only delete image-cache-VMs that were created more than
CONF.remove_unused_original_minimum_age_seconds ago. Additionally, we
take the same lock we also take when deploying image-cache-VMs and copying
their files. This should protect against deleting the VM while a
copy operation is still in progress.

Deleting the VM while copying is still in progress does not stop the
copying. Therefore, this race-condition might be responsible for a lot
of orphan vmdks of image-cache-VMs on our ephemeral datastores.

Change-Id: Ic0a694a8c4df203c8c100abf5b8d2e9ee73866f7
(cherry picked from commit d8f3ddf)
This filter enables selecting the same host-aggregate/shard/VC for an
instance resize, because it could take more time to migrate the volumes
across other shards.

(cherry picked from commit f648b9b)
Resizing to a different flavour may also imply
a different hw-version, so we need to set it,
otherwise it will stay on the previous one,
which may be incompatible with the desired configuration.

Only upgrading is possible, though.

Change-Id: I7976a377c3e8944483a10fdada391e8c51640e30
(cherry picked from commit 28fb1a4)

Vmware: Only change hw_version by flavor

Be more strict in the upgrade policy and only upgrade on resize
if the flavor demands it, not if the default has changed.

Change-Id: I25a6eb352316f986b179204199b098a418991860
When switching to filtering the AZ via placement, we need the bigvm
resource provider to be in the AZ aggregate in addition to being in the
aggregate of the host's resource provider. Therefore, we find the host
aggregate by seeing which aggregate is also a hypervisor uuid.

Change-Id: I250f203b3bb24e084ec1b499a923f7f66e638102
(cherry picked from commit 29ce312)

bigvm: Do not remove parent provider's previous aggregates

When we filter AZs in placement, we don't want nova-bigvm to remove the
aggregates already present on our resource providers, as they represent
the AZ. Therefore, we query the aggregates of the "parent" provider and
make sure to include these aggregates if we have to set the resource
provider's UUID as an aggregate, too.

Change-Id: If3986df022273f20e109816f2752ce0254db4f10
(cherry picked from commit 2e98cd4)

bigvm: Ignore deleted ComputeNode instances

Querying via ComputeNodeList also returns deleted ComputeNode instances.
Therefore, we might create bigvm-deployment resource providers for a
deleted instance instead of the right instance and thus for a wrong
resource provider. By ignoring deleted ComputeNode instances, this
should not happen anymore.

Change-Id: I5a4c6c5a1894d1f6f5cff6e3475670c27bb97f28
(cherry picked from commit f7f5f0c)
There can be Ironic hosts that only have nodes assigned when those
nodes are getting repaired or being built up. Those Ironic hosts would
come up empty when searching for ComputeNodes in the sync_aggregates
command and would be reported as a problem, which makes the command
fail with exit-code 5. Since it's no problem if an Ironic host doesn't
have a ComputeNode, because each node is its own resource provider in
placement anyway, we now ignore Ironic hosts without nodes in the
error-reporting.

Change-Id: I163f3e46f2e375531b870a363b84bba67816954d
(cherry picked from commit 67779eb)
The DRS rules can be read from the "rule" attribute, not from the
"rules" attribute. We found this, because Nova wasn't deleting
DRS rules for no-longer-existing server-groups.

Change-Id: I86f7ca85d9b0edc1406a54a6f392bfff8f0af00d
(cherry picked from commit 562b084)
When syncing a server-group's DRS rule(s) we now also enable a found
rule in case it is disabled. We don't know how this happens, but
sometimes rules get disabled and we need them to be enabled to guarantee
the customer the appropriate (anti-)affinity settings.

Change-Id: Ibc8eb6800640855513716412266fcbb9fbc4db42
(cherry picked from commit d712c23)
When we don't find any datastores for whatever reason, we don't have the
"dc_info" variable set and thus cannot call
self._age_cached_image_templates() with it, as doing so results in an
UnboundLocalError.

Change-Id: I2dca6d2d6ab7ca5cbc4ef7d2c316faaf6edfee7d
(cherry picked from commit d2cf44f)
The properties may not be set, if the host is disconnected.

Change-Id: I1c53477e891b5b95859ca267fcad8cd1bff260ef
(cherry picked from commit 0cb8b61)
Most code related to VMs is in vmops, not in the driver,
so we move this code there too.

Change-Id: I1b801c8f12b377dd74a31ef646216c564631fe7f
(cherry picked from commit ade6f4c)
This requires a change to oslo.vmware to accept a string
instead of only a cookiejar.

Depends-On: Ia9f16758c388afe0fe05034162f516844ebc6b2b
Change-Id: I34a0c275ed48489954e50eb15f8ea11c4f6b1aa6
(cherry picked from commit 726d7a2)
While we cannot live-migrate CD-Roms directly between vcenters,
we can copy the data and detach/reattach the device manually.

Change-Id: I88b4903f745e1bcfe957ddc07c6e9c040820ed6b
(cherry picked from commit 14f9a5f)
Since the mission is to delete the attachment, Cinder returning a 404 on
the attachment deletion call can be ignored. We've seen this happen
where Cinder took some time to delete the attachment, so Nova retried as
it got a 500 back. On this retry, Nova got a 404 and left the BDM entry
behind while aborting a deletion that had already happened in the driver.

Change-Id: I15dd7b59a2b3c528ecad3b337b92885b4d7bd68f
(cherry picked from commit 82992a5)
Apparently, the volume-id is not consistently
stored as volume_id in connection_info.
Use the block_device.get_volume_id function to handle
the fallback.

Change-Id: If5a8527578db8e4690595524e0785ee8b4de1d79
(cherry picked from commit 607fd0d)
Since we don't explicitly set a disk as boot disk and instead rely on
the order the disks have on the VirtualMachine, we need to make sure we
attach the root disk first.

Change-Id: I3ae6b5f053a3b171ed0a80215fc4204a2bf32481
(cherry picked from commit 7e6dc54)
We've recently changed it so that not all large VMs need DRS disabled -
only the ones over 512 GiB of memory. But we still need memory
reservations for VMs of 230 GiB - 512 GiB, which was previously handled
by them being large VMs. While we could do this via the flavor, we
failed to do so. Additionally, this would limit the amount of large VMs
we can spawn on a cluster.

To keep the same behavior we previously had for large VMs, we now split
memory reservations from big/large VM detection with the following
result:
1) a big VM will get DRS disabled - big VMs are VMs bigger than 1024 GiB
2) a large VM will get DRS disabled - large VMs are VMs bigger than 512
GiB
3) all VMs defining CUSTOM_MEMORY_RESERVABLE_MB resources in their
flavor get that amount of memory reserved
4) all VMs above the full_reservation_memory_mb config setting get all
their memory reserved

Therefore, is_big_vm() and is_large_vm() now only handle DRS settings
and special spawning behavior.

A side effect is that nova-bigvm, or rather the special spawning code,
now doesn't consider 230 GiB - 512 GiB VMs as non-movable anymore and
thus finds more free hosts.

Change-Id: I2088afecf367efc380f9a0a88e5d18251a19e3a5
(cherry picked from commit dca6fe6)
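
A hedged summary of the resulting thresholds in code form (the thresholds, option and resource names come from the message above; the exact comparison operators are assumptions):

```python
GiB = 1024  # MiB per GiB

def is_big_vm(memory_mb):       # DRS disabled + special spawning handling
    return memory_mb > 1024 * GiB

def is_large_vm(memory_mb):     # DRS disabled
    return memory_mb > 512 * GiB

def reserved_memory_mb(memory_mb, flavor_resources, full_reservation_memory_mb):
    """Memory reservation is now decided independently of big/large detection."""
    if memory_mb >= full_reservation_memory_mb:
        return memory_mb                                   # rule 4: reserve all
    return min(memory_mb,
               flavor_resources.get("CUSTOM_MEMORY_RESERVABLE_MB", 0))  # rule 3
```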
joker-at-work merged commit 6574a3b into stable/xena-m3 on Aug 8, 2022