Ironic rebalance (victoria) #6

Open
Wants to merge 63 commits into base: stackhpc/victora

Conversation

@jovial commented May 14, 2021

Porting patches from https://github.com/stackhpc/nova/tree/jg-ironic-rebalance-ussuri to victoria. I cherry-picked these fresh from Gerrit.

The only conflict was in manager.py in 054b08e

sbauza and others added 6 commits October 14, 2020 21:00
Some GPU drivers like i915 don't provide a name attribute for mdev types.
As we don't use this attribute yet, let's just make sure we support the fact
it's optional.

Change-Id: Ia745ed7095c74e2bfba38379e623a3f81e7799eb
Closes-Bug: #1896741
(cherry picked from commit 416cd1a)
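
For illustration, a minimal sketch of treating the name element as optional when parsing a libvirt-style mdev type; the helper and XML layout here are assumptions, not nova's actual parser:

```python
# Hypothetical parser: tolerate a missing <name> element instead of failing.
from xml.etree import ElementTree


def parse_mdev_type(xml):
    node = ElementTree.fromstring(xml)
    name = node.find('name')
    return {
        'type': node.get('id'),
        # i915 does not provide a name, so default to None.
        'name': name.text if name is not None else None,
    }


print(parse_mdev_type('<type id="nvidia-11"><name>GRID M60-0B</name></type>'))
print(parse_mdev_type('<type id="i915-GVTg_V5_8"/>'))
```
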
Bug 1853009 describes a race condition involving multiple nova-compute
services with ironic. As the compute services start up, the hash ring
rebalances, and the compute services have an inconsistent view of which
is responsible for a compute node.

The sequence of actions here is adapted from a real world log [1], where
multiple nova-compute services were started simultaneously. In some
cases mocks are used to simulate race conditions.

There are three main issues with the behaviour:

* host2 deletes the orphaned compute node after host1 has taken
  ownership of it.

* host1 assumes that another compute service will not delete its nodes.
  Once a node is in rt.compute_nodes, it is not removed again unless the
  node is orphaned. This prevents host1 from recreating the compute
  node.

* host1 assumes that another compute service will not delete its
  resource providers. Once an RP is in the provider tree, it is not
  removed.

This functional test documents the current behaviour, with the idea that
it can be updated as this behaviour is fixed.

[1] http://paste.openstack.org/show/786272/

Co-Authored-By: Matt Riedemann <[email protected]>

Change-Id: Ice4071722de54e8d20bb8c3795be22f1995940cd
Related-Bug: #1853009
Related-Bug: #1853159
(cherry picked from commit 4a894fefc4f99db81d967cae0c23454c5a62b8ba)
There is a race condition in nova-compute with the ironic virt driver as
nodes get rebalanced. It can lead to compute nodes being removed in the
DB and not repopulated. Ultimately this prevents these nodes from being
scheduled to.

The issue being addressed here is that if a compute node is deleted by a host
which thinks it is an orphan, then the compute host that actually owns the node
might not recreate it if the node is already in its resource tracker cache.

This change fixes the issue by clearing nodes from the resource tracker cache
for which a compute node entry does not exist. Then, when the available
resource for the node is updated, the compute node object is not found in the
cache and gets recreated.

Change-Id: I39241223b447fcc671161c370dbf16e1773b684a
Partial-Bug: #1853009
(cherry picked from commit 8f5a078dd7bbe5b6b38cf8e04d916281dc418409)
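
A minimal sketch of the cache-clearing idea described above, assuming a simplified resource tracker; the helper and exception names are hypothetical stand-ins for nova's API:

```python
class ComputeNodeNotFound(Exception):
    """Raised by the DB layer when a compute node record does not exist."""


def remove_deleted_nodes_from_cache(rt, context, db):
    """Forget cached compute nodes whose DB records were deleted elsewhere."""
    for nodename in list(rt.compute_nodes):
        try:
            db.compute_node_get_by_host_and_nodename(context, rt.host, nodename)
        except ComputeNodeNotFound:
            # A rebalancing peer deleted the record; drop the cached object so
            # the next update_available_resource() recreates the compute node.
            del rt.compute_nodes[nodename]
```
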
There is a race condition in nova-compute with the ironic virt driver
as nodes get rebalanced. It can lead to compute nodes being removed in
the DB and not repopulated. Ultimately this prevents these nodes from
being scheduled to.

The issue being addressed here is that if a compute node is deleted by a
host which thinks it is an orphan, then the resource provider for that
node might also be deleted. The compute host that owns the node might
not recreate the resource provider if it exists in the provider tree
cache.

This change fixes the issue by clearing resource providers from the
provider tree cache for which a compute node entry does not exist. Then,
when the available resource for the node is updated, the resource
providers are not found in the cache and get recreated in placement.

Change-Id: Ia53ff43e6964963cdf295604ba0fb7171389606e
Related-Bug: #1853009
Related-Bug: #1841481
(cherry picked from commit 7dcc6bfa63c1ab06d86b07bcdd05838a8ad35dec)
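
An illustrative sketch of the matching cleanup on the placement side, assuming a simplified provider tree cache; method names are hypothetical:

```python
def remove_deleted_providers_from_tree(provider_tree, known_node_uuids):
    """Forget cached resource providers whose compute nodes no longer exist."""
    for uuid in list(provider_tree.get_provider_uuids()):
        if uuid not in known_node_uuids:
            # The provider was deleted in placement by another host; drop it
            # from the cache so it is recreated on the next resource update.
            provider_tree.remove(uuid)
```
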
There is a race condition in nova-compute with the ironic virt driver as
nodes get rebalanced. It can lead to compute nodes being removed in the
DB and not repopulated. Ultimately this prevents these nodes from being
scheduled to.

The main race condition involved is in update_available_resources in
the compute manager. When the list of compute nodes is queried, there is
a compute node belonging to the host that it does not expect to be
managing, i.e. it is an orphan. Between that time and deleting the
orphan, the real owner of the compute node takes ownership of it (in
the resource tracker). However, the node is still deleted as the first
host is unaware of the ownership change.

This change prevents this from occurring by filtering on the host when
deleting a compute node. If another compute host has taken ownership of
a node, it will have updated the host field and this will prevent
deletion from occurring. The first host sees this has happened via the
ComputeHostNotFound exception, and avoids deleting its resource
provider.

Closes-Bug: #1853009
Related-Bug: #1841481
Change-Id: I260c1fded79a85d4899e94df4d9036a1ee437f02
(cherry picked from commit 65a6756fa17ebfd644c76953c943ba8a4ad3787d)
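
A hedged sketch of the host-filtered delete described above; the DB helper and exception names are hypothetical stand-ins for nova's API:

```python
class ComputeHostNotFound(Exception):
    """Raised when no compute node matches the given host and node."""


def destroy_orphan_compute_node(db, context, node_uuid, my_host):
    try:
        # Filtering on host makes the delete a no-op if another compute
        # service has already taken ownership and updated the host field.
        db.compute_node_delete_by_host_and_node(context, my_host, node_uuid)
        return True
    except ComputeHostNotFound:
        # Ownership changed under us: keep the node and its resource provider.
        return False
```
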
In the fix for bug 1839560 [1][2], soft-deleted compute nodes may be
restored, to ensure we can reuse ironic node UUIDs as compute node
UUIDs. While this seems to largely work, it results in some nasty errors
being generated [3]:

    InvalidRequestError This session is in 'inactive' state, due to the
    SQL transaction being rolled back; no further SQL can be emitted
    within this transaction.

This happens because compute_node_create is decorated with
pick_context_manager_writer, which begins a transaction. While
_compute_node_get_and_update_deleted claims that calling a second
pick_context_manager_writer decorated function will begin a new
subtransaction, this does not appear to be the case.

This change removes pick_context_manager_writer from the
compute_node_create function, and adds a new _compute_node_create
function which ensures the transaction is finished if
_compute_node_get_and_update_deleted is called.

The new unit test added here fails without this change.

This change marks the removal of the final FIXME from the functional
test added in [4].

[1] https://bugs.launchpad.net/nova/+bug/1839560
[2] https://git.openstack.org/cgit/openstack/nova/commit/?id=89dd74ac7f1028daadf86cb18948e27fe9d1d411
[3] http://paste.openstack.org/show/786350/
[4] https://review.opendev.org/#/c/695012/

Change-Id: Iae119ea8776bc7f2e5dbe2e502a743217beded73
Closes-Bug: #1853159
Related-Bug: #1853009
(cherry picked from commit 54038b7f914d624a6684b5c0f168bdf84872a60c)
@jovial requested review from JohnGarbutt and markgoddard May 14, 2021 17:02

@markgoddard previously approved these changes May 17, 2021

@markgoddard left a comment

Compared with upstream patches, looks reasonable.

Balazs Gibizer and others added 21 commits May 17, 2021 16:35
The nova-compute service fails to start if the hypervisor has PCI
addresses with a 32 bit domain.

Change-Id: I48dcb7faa17fe9f8346445a1746cff5845baf358
Related-Bug: #1897528
(cherry picked from commit 976ac72)
Nova and QEMU[1] support PCI devices with a PCI address that has a 16 bit
domain. However, there are hypervisors that report PCI addresses with a
32 bit domain. While today we cannot assign these to guests, this should
not prevent the nova-compute service from starting.

This patch changes the PCI manager to ignore such PCI devices.

Please note that this patch does not change the fact that nova does not
allow specifying PCI addresses with a 32 bit domain in the
[pci]/passthrough_whitelist configuration. Such configuration is still
rejected at nova-compute service startup.

Closes-Bug: #1897528

[1] https://github.com/qemu/qemu/blob/f2a1cf9180f63e88bb38ff21c169da97c3f2bad5/hw/core/qdev-properties.c#L993

Change-Id: I59a0746b864610b6a314078cf5661d3d2b84b1d4
(cherry picked from commit 8c9d6fc)
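
A minimal sketch of ignoring devices whose PCI domain does not fit in 16 bits; the address format is the usual dddd:bb:ss.f string and the helper is illustrative rather than nova's PCI manager code:

```python
MAX_PCI_DOMAIN = 0xFFFF


def usable_pci_devices(devices):
    usable = []
    for dev in devices:
        domain = int(dev['address'].split(':')[0], 16)
        if domain > MAX_PCI_DOMAIN:
            # Skip devices with a 32 bit domain instead of failing startup.
            print('Ignoring %s: 32 bit PCI domain is not supported'
                  % dev['address'])
            continue
        usable.append(dev)
    return usable


print(usable_pci_devices([{'address': '0000:04:00.1'},
                          {'address': '10000:00:02.0'}]))
```
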
Our console proxies (novnc, serial, spice) run in a websockify server
whose request handler inherits from the python standard
SimpleHTTPRequestHandler. There is a known issue [1] in the
SimpleHTTPRequestHandler which allows open redirects by way of URLs
in the following format:

  http://vncproxy.my.domain.com//example.com/%2F..

which, if visited, will redirect a user to example.com.

We can intercept requests and reject those that pass a redirection URL
beginning with "//" by overriding the
SimpleHTTPRequestHandler.send_head() method containing the
vulnerability, rejecting such requests with a 400 Bad Request.

This code is copied from a patch suggested in one of the issue comments
[2].

Closes-Bug: #1927677

[1] https://bugs.python.org/issue32084
[2] https://bugs.python.org/issue32084#msg306545

Conflicts:
    nova/console/websocketproxy.py
    nova/tests/unit/console/test_websocketproxy.py

NOTE(melwitt): The conflicts are because the following changes are not
in Victoria:

  Ib2c406327fef2fb4868d8050fc476a7d17706e23 (Remove six.moves)
  I58b0382c86d4ef798572edb63d311e0e3e6937bb (Refactor and rename
    test_tcp_rst_no_compute_rpcapi)

Change-Id: Ie36401c782f023d1d5f2623732619105dc2cfa24
(cherry picked from commit 781612b)
(cherry picked from commit 4709256)
A while back, change I25baf5edd25d9e551686b7ed317a63fd778be533 moved
rbd_utils out from the libvirt driver and into a central location under
nova/storage. This move missed adding a __init__.py file under
nova/tests/unit/storage, so unit test discovery wasn't picking up the
rbd_utils tests and couldn't run them.

This adds a __init__.py file under nova/tests/unit/storage to get the
tests running again.

This also fixes a small bug introduced by change
I3032bbe6bd2d6acc9ba0f0cac4d00ed4b4464ceb in RbdTestCase.setUp() that
passed nonexistent self.images_rbd_pool to self.flags. It should be
self.rbd_pool.

Closes-Bug: #1928007

Change-Id: Ic03a5336abdced883f62f395690c0feac12075c8
(cherry picked from commit 8b647f1)
(cherry picked from commit 8f018d7)
If2608406776e0d5a06b726e65b55881e70562d18 dropped the single node
grenade job from the integrated-gate-compute template as it duplicates
the existing grenade-multinode job. However, it doesn't remove the
remaining single node grenade job still present in the Nova project.

This change replaces the dsvm based nova-grenade-multinode job with the
zuulv3 native grenade-multinode based job.

Various legacy playbooks and hook scripts are also removed as they are
no longer used. Note that this does result in a loss of coverage for
ceph that should be replaced as soon as a zuulv3 native ceph based
multinode job is available.

Change-Id: I02b2b851a74f24816d2f782a66d94de81ee527b0
(cherry picked from commit 91e53e4)
(cherry picked from commit c45bedd)
Change Ifb3afb13aff7e103c2e80ade817d0e63b624604a added a nova side
config option for specifying neutron client retries that maps to the
ksa connect_retries config option to provide parity with the cinder and
glance clients that have nova side config options.

That change missed passing CONF.neutron.http_retries to the manual
client used for calling the port binding API. This sets the
connect_retries attribute on the manual ksa client so http_retries
will be honored.

Closes-Bug: #1929886
Related-Bug: #1866937

Change-Id: I8296e4be9f0706fab043451b856efadbb7bd41f6
(cherry picked from commit 56eb253)
(cherry picked from commit 46aa3f4)
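
A hedged sketch of propagating the retry value to a manually built keystoneauth1 client; the retry count is passed in here rather than read from nova's CONF, and the client setup is illustrative:

```python
from keystoneauth1 import adapter, session


def get_port_binding_client(http_retries, auth=None):
    sess = session.Session(auth=auth)
    client = adapter.Adapter(session=sess, service_type='network',
                             raise_exc=False)
    # Make ksa retry connection failures as configured on the nova side.
    client.connect_retries = http_retries
    return client
```
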
NOTE(melwitt): This is a combination of two changes to avoid
intermittent test failure that was introduced by the original bug fix,
and was fixed by change I2bd360dcc6501feea7baf02d4510b282205fc061.

We have discovered that if an exception is raised at any point during
the running of the init_application WSGI script in an apache/mod_wsgi
Daemon Mode environment, it will prompt apache/mod_wsgi to re-run the
script without starting a fresh python process. Because we initialize
global data structures during app init, subsequent runs of the script
blow up as some global data do *not* support re-initialization. It is
anyway not safe to assume that init of global data is safe to run
multiple times.

This mod_wsgi behavior appears to be a special situation that does not
behave the same as a normal reload in Daemon Mode as the script file is
being reloaded upon failure instead of the daemon process being
shutdown and restarted as described in the documentation [1].

In order to handle this situation, we can move the initialization of
global data structures to a helper method that is decorated to run only
once per python interpreter instance. This way, we will not attempt to
re-initialize global data that are not safe to init more than once.

Co-Authored-By: Michele Baldessari <[email protected]>
Co-Authored-By: melanie witt <[email protected]>

Conflicts:
    nova/api/openstack/wsgi_app.py

NOTE(melwitt): The conflict is because change
If4783adda92da33d512d7c2834f0bb2e2a9b9654 (Support sys.argv in wsgi
app) is not in Victoria.

Closes-Bug: #1882094

[1] https://modwsgi.readthedocs.io/en/develop/user-guides/reloading-source-code.html#reloading-in-daemon-mode

Reset global wsgi app state in unit test

Since I2bd360dcc6501feea7baf02d4510b282205fc061 there is a global state
set during the wsgi_app init, making our unit test cases
non-deterministic based on their order. This patch makes sure
that the global state is reset for each test case.

Closes-Bug: #1921098
(cherry picked from commit bc2c19b)

Change-Id: I2bd360dcc6501feea7baf02d4510b282205fc061
(cherry picked from commit 7c9edc0)
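
A hedged sketch of a "run once per interpreter" helper in the spirit of the fix described above; the decorator below is illustrative, not the exact helper added to nova:

```python
import functools
import threading


def run_once(func):
    lock = threading.Lock()

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        with lock:
            if not wrapper.called:
                wrapper.result = func(*args, **kwargs)
                wrapper.called = True
        return wrapper.result

    def reset():
        # Lets unit tests restore a clean global state between cases.
        wrapper.called = False
        wrapper.result = None

    wrapper.called = False
    wrapper.result = None
    wrapper.reset = reset
    return wrapper


@run_once
def init_application_globals():
    print('initializing config, objects and RPC exactly once')
```
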
Error-out the migrations (cold and live) whenever the
anti-affinity policy is violated. This addresses
violations when multiple concurrent migrations are
requested.

Added detection on:
- prep_resize
- check_can_live_migration_destination
- pre_live_migration

The improved method of detection now locks based on group_id
and considers other in-progress migrations as well.

Closes-bug: #1821755
Change-Id: I32e6214568bb57f7613ddeba2c2c46da0320fabc
(cherry picked from commit 33c8af1)
(cherry picked from commit 8b62a4e)
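
An illustrative sketch (not nova's code) of the late anti-affinity check: serialize per server group and count in-progress migrations as well as existing members on the destination host:

```python
from oslo_concurrency import lockutils


class AntiAffinityViolation(Exception):
    pass


def validate_anti_affinity(group_id, dest_host, member_hosts, migrating_hosts):
    # Serialize checks for the same server group so two concurrent
    # migrations cannot both pass the check for one destination host.
    with lockutils.lock('server-group-%s' % group_id):
        occupied = set(member_hosts) | set(migrating_hosts)
        if dest_host in occupied:
            raise AntiAffinityViolation(
                'anti-affinity policy of group %s violated on host %s'
                % (group_id, dest_host))
```
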
By default, we don't currently allow the Nova microversion
header for CORS requests.  It should be something that is
included out of the box because it's part of the core API.

Deployers can workaround this by overriding allow_headers,
but we should provide a better experience out of the box.

Closes-Bug: #1931908
Change-Id: Idf4650f36952331f02d7512580c21451f3ee3b63
(cherry picked from commit b02a95a)
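
A hedged sketch of including the microversion headers in the default CORS configuration via oslo.middleware; the header list nova actually ships is longer, this only shows the idea:

```python
from oslo_middleware import cors


def set_cors_defaults():
    # Allow the microversion headers out of the box instead of requiring
    # deployers to override allow_headers.
    cors.set_defaults(
        allow_headers=['X-Auth-Token',
                       'OpenStack-API-Version',
                       'X-OpenStack-Nova-API-Version'])
```
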
This currently runs in the 'check' pipeline, as part of the pep8 job,
which causes otherwise perfectly valid backports to report as failing
CI. There's no reason a stable core shouldn't be encouraged to review
these patches: we simply want to prevent them *merging* before their
parent(s). Resolve this conflict by moving the check to separate voting
job in the 'gate' pipeline as well as a non-voting job in the 'check'
pipeline to catch more obvious issues.

NOTE(lyarwood): Conflict as I672904e9bfb45a66a82331063c7d49c4bc0439df
isn't present on stable/victoria.

Conflicts:
  .zuul.yaml

Change-Id: Id3e4452883f6a3cf44ff58b39ded82e882e28c23
Signed-off-by: Stephen Finucane <[email protected]>
(cherry picked from commit 98b01c9)
(cherry picked from commit fef0305)
During the VM boot process Nova asks Neutron for the security groups of
the project. If no fields are specified, Neutron will prepare the list of
security groups with all fields, including rules. If the project has many
SGs, this may take a long time, as rules need to be loaded separately for
each SG on Neutron's side.

During booting of the VM, Nova really needs only the "id" and "name" of
the security groups, so this patch limits the request to only those two
fields.

This lazy loading of the SG rules was introduced in Neutron in [1] and
[2].

[1] https://review.opendev.org/#/c/630401/
[2] https://review.opendev.org/#/c/637407/

Related-Bug: #1865223
Change-Id: I15c3119857970c38541f4de270bd561166334548
(cherry picked from commit 388498a)
(cherry picked from commit 4f49545)
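
A hedged sketch of limiting the neutron query to the fields nova needs at boot, assuming a python-neutronclient Client instance is passed in:

```python
def get_security_group_names(neutron, project_id):
    # Ask only for id and name so neutron does not lazily load rules for
    # every security group in the project.
    sgs = neutron.list_security_groups(
        tenant_id=project_id, fields=['id', 'name'])
    return {sg['id']: sg['name'] for sg in sgs['security_groups']}
```
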
If the ceph df command fails in the get_pool_info
method of RBD utils, the actual command executed
is seen by users in the fault error message.

This change hides the command behind a StorageError
exception and logs the exception instead of leaking
it to users.

Change-Id: I6e3a73f2e04d1a7636daf96d5af73c9cf2fbe220
Closes-Bug: 1926978
(cherry picked from commit 86af7fe)
(cherry picked from commit 5ede75c)
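
An illustrative sketch of the wrapping: log the real exception with the command details, raise a generic error to the caller; the callable passed in is a stand-in for the actual ceph df invocation:

```python
import logging

LOG = logging.getLogger(__name__)


class StorageError(Exception):
    """Storage failure that does not leak the underlying command."""


def get_pool_info(run_ceph_df):
    try:
        return run_ceph_df()
    except Exception:
        # The full command and error stay in the logs only.
        LOG.exception('Could not determine pool usage from ceph df')
        raise StorageError('Unable to retrieve pool information')
```
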
Consider the following situation:

- Using the Ironic virt driver
- Replacing (so removing and re-adding) all baremetal nodes
  associated with a single nova-compute service

The update resources periodic will have destroyed the compute node
records because they're no longer being reported by the virt driver.
If we then attempt to manually delete the compute service record, the
database layer will raise an exception, as there are no longer any
compute node records for the host. This exception gets bubbled up as
an error 500 in the API. This patch adds a unit test to demonstrate
this.

Related bug: 1860312
Change-Id: I03eec634b25582ec9643cacf3e5868c101176983
(cherry picked from commit 32257a2)
(cherry picked from commit e6cd23c)
Consider the following situation:

- Using the Ironic virt driver
- Replacing (so removing and re-adding) all baremetal nodes
  associated with a single nova-compute service

The update resources periodic will have destroyed the compute node
records because they're no longer being reported by the virt driver.
If we then attempt to manually delete the compute service record, the
database layer will raise an exception, as there are no longer any
compute node records for the host. Previously, this exception would
get bubbled up as an error 500 in the API. This patch catches it and
allows service deletion to complete successfully.

Closes bug: 1860312
Change-Id: I2f9ad3df25306e070c8c3538bfed1212d6d8682f
(cherry picked from commit 880611d)
(cherry picked from commit df5158b)
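
An illustrative sketch of tolerating the missing compute node records when deleting the service; the DB helper and exception names are hypothetical:

```python
class ComputeHostNotFound(Exception):
    """Raised when a host has no compute node records."""


def delete_compute_service(db, context, service):
    try:
        nodes = db.compute_nodes_get_by_service(context, service.id)
    except ComputeHostNotFound:
        # All baremetal nodes were replaced and the periodic already removed
        # the compute node records; there is nothing left to clean up.
        nodes = []
    for node in nodes:
        db.compute_node_delete(context, node.id)
    db.service_destroy(context, service.id)
```
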
Zuul and others added 28 commits September 3, 2021 19:36
Ie36401c782f023d1d5f2623732619105dc2cfa24 was intended
to address OSSA-2021-002 (CVE-2021-3654), however after its
release it was discovered that the fix only worked
for URLs with 2 leading slashes or more than 4.

This change addresses the missing edge case of 3 leading slashes
and also maintains support for rejecting 2+.

Conflicts:
  nova/tests/unit/console/test_websocketproxy.py

NOTE: conflict is due to I58b0382c86d4ef798572edb63d311e0e3e6937bb
is missing in Victoria and Ie36401c782f023d1d5f2623732619105dc2cfa24
backport contained conflicts and methods order was swapped.

Change-Id: I95f68be76330ff09e5eabb5ef8dd9a18f5547866
co-authored-by: Matteo Pozza
Closes-Bug: #1927677
(cherry picked from commit 6fbd0b7)
(cherry picked from commit 47dad48)
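
A simplified sketch of the overall guard: reject any request whose path starts with two or more slashes before SimpleHTTPRequestHandler's directory redirect can turn it into an open redirect. The real proxy handler subclasses websockify, but the check itself looks roughly like this:

```python
import http.server


class NoOpenRedirectHandler(http.server.SimpleHTTPRequestHandler):
    def send_head(self):
        if self.path.startswith('//'):
            # A path such as //example.com/%2F.. would otherwise become a
            # redirect to example.com.
            self.send_error(400, 'URI must not start with //')
            return None
        return super().send_head()
```
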
Add functional tests to reproduce the race between resize_instance()
and update_available_resources().

Related-Bug: #1944759

Change-Id: Icb7e3379248fe00f9a94f9860181b5de44902379
(cherry picked from commit 3e4e448)
(cherry picked from commit e6c6880)
(cherry picked from commit 140ae45)
During resize, on the source host, in resize_instance(), the instance.host
and .node are updated to point to the destination host. This indicates to
the source host's resource tracker that the allocation of this instance
does not need to be tracked as an instance but as an outbound migration
instead. However, for the source host's resource tracker to do that, it
needs to use the instance.old_flavor. Unfortunately the
instance.old_flavor is only set during finish_resize() on the
destination host (resize_instance casts to finish_resize). So it is
possible that a running resize_instance() sets the instance.host to point
to the destination and then, before finish_resize could set the
old_flavor, an update_available_resources periodic runs on the source
host. This causes the allocation of this instance to be tracked neither
as an instance (as instance.host points to the destination) nor as a
migration (as instance.old_flavor is not yet set). So the allocation on
the source host is simply dropped by the periodic job.

When such a migration is confirmed, confirm_resize() tries to drop
the same resource allocation again but fails, as the pinned CPUs of the
instance have already been freed.

When such a migration is reverted instead, the revert succeeds, but the
source host resource allocation will not contain the resource allocation
of the instance until the next update_available_resources periodic runs
and corrects it.

This does not affect resources tracked exclusively in placement (e.g.
VCPU, MEMORY_MB, DISK_GB), but it does affect NUMA related resources
that are still tracked in the resource tracker (e.g. huge pages, pinned
CPUs).

This patch moves the setting of instance.old_flavor on the source node
into the same transaction that sets the instance.host to point to the
destination host, hence solving the race condition.

Change-Id: Ic0d6c59147abe5e094e8f13e0d3523b178daeba9
Closes-Bug: #1944759
(cherry picked from commit b841e55)
(cherry picked from commit d4edcd6)
(cherry picked from commit c8b04d1)
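
A hedged sketch of the ordering fix: record old_flavor in the same save() that repoints the instance at the destination, so a periodic on the source never sees host == dest without old_flavor being set. The helper name is hypothetical:

```python
def point_instance_at_destination(instance, dest_host, dest_node):
    instance.old_flavor = instance.flavor
    instance.host = dest_host
    instance.node = dest_node
    instance.save()  # a single update closes the race window
```
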
The libvirt driver power on and hard reboot first destroy the domain
and unplug the vifs, then recreate the domain and replug the vifs.
However, nova does not wait for the network-vif-plugged event before
unpausing the domain. This can cause the domain to start running and
requesting an IP via DHCP before the networking backend has finished
plugging the vifs.

So this patch adds a workaround config option to nova to wait for
network-vif-plugged events during hard reboot the same way as nova waits
for this event during new instance spawn.

This logic cannot be enabled unconditionally, as not all neutron
networking backends send plug time events to wait for. Also, the logic
needs to be vnic_type dependent, as ml2/ovs and the in-tree sriov backend
are often deployed together on the same compute. While ml2/ovs sends plug
time events, the sriov backend does not send them reliably. So the
configuration is not just a boolean flag but a list of vnic_types
instead. This way, waiting for the plug time event is possible for a vif
handled by ml2/ovs while the instance has other vifs handled by the sriov
backend where no event can be expected.

Conflicts:
      nova/virt/libvirt/driver.py both
      I73305e82da5d8da548961b801a8e75fb0e8c4cf1 and
      I0b93bdc12cdce591c7e642ab8830e92445467b9a are not in
      stable/victoria

The stable/victoria specific changes:

* The list of accepted vnic_type-s is adapted to what is supported by
  neutron in this release. So vdpa, accelerator-direct, and
  accelerator-direct-physical are removed as they are only added in
  stable/wallaby

Change-Id: Ie904d1513b5cf76d6d5f6877545e8eb378dd5499
Closes-Bug: #1946729
(cherry picked from commit 68c970e)
(cherry picked from commit 0c41bfb)
(cherry picked from commit 89c4ff5)
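
A hedged sketch of a vnic_type-scoped workaround option as described above; the option and group names follow the description but should be treated as illustrative:

```python
from oslo_config import cfg

CONF = cfg.CONF
CONF.register_opts(
    [cfg.ListOpt('wait_for_vif_plugged_event_during_hard_reboot',
                 default=[],
                 help='vnic_types whose plug-time events should be awaited '
                      'before unpausing a hard-rebooted domain.')],
    group='workarounds')


def vifs_to_wait_for(network_info):
    """Return the vifs whose vnic_type is configured to send plug events."""
    wanted = set(
        CONF.workarounds.wait_for_vif_plugged_event_during_hard_reboot)
    return [vif for vif in network_info if vif.get('vnic_type') in wanted]
```
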
Currently neutron can report ports as having MAC addresses
in upper case when they're created like that. Meanwhile, the
libvirt configuration file always stores MACs in lower case,
which leads to a KeyError when trying to retrieve the migrate_vif.

Conflicts:
  nova/tests/unit/virt/libvirt/test_migration.py

Note: conflict is caused by not having six.text_type removal patch
I779bd1446dc1f070fa5100ccccda7881fa508d79 in stable/victoria.
Original assertion method was preserved to resolve this conflict.

Closes-Bug: #1945646
Change-Id: Ie3129ee395427337e9abcef2f938012608f643e1
(cherry picked from commit 6a15169)
(cherry picked from commit 63a6388)
(cherry picked from commit 6c3d5de)
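
A hedged sketch of the normalization: key the migrate_vif lookup on lower-cased MAC addresses so a port created with an upper-case MAC still matches the lower-case MAC found in the libvirt domain XML; the helper is illustrative:

```python
def find_migrate_vif(migrate_vifs, mac_from_domain_xml):
    by_mac = {vif['address'].lower(): vif for vif in migrate_vifs}
    return by_mac[mac_from_domain_xml.lower()]
```
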
This patch adds a functional test that reproduces a race between
incoming migration and the update_available_resource periodic

Fixes:
    - Added more memory to mock 'host_info', since the default would not
      fit the instance. Default was changed in later releases

Change-Id: I4be429c56aaa15ee12f448978c38214e741eae63
Related-Bug: #1953359
(cherry picked from commit c59224d)
(cherry picked from commit f0a6d94)
(cherry picked from commit d8859e4)
This patch extends the original reproduction
I4be429c56aaa15ee12f448978c38214e741eae63 to cover
bug 1952915 as well as they have a common root cause.

Change-Id: I57982131768d87e067d1413012b96f1baa68052b
Related-Bug: #1953359
Related-Bug: #1952915
(cherry picked from commit 9f296d7)
(cherry picked from commit 0411962)
(cherry picked from commit 94f17be)
There is a race condition between an incoming resize and an
update_available_resource periodic in the resource tracker. The race
window starts when the resize_instance RPC finishes  and ends when the
finish_resize compute RPC finally applies the migration context on the
instance.

In the race window, if the update_available_resource periodic is run on
the destination node, then it will see the instance as being tracked on
this host as the instance.node is already pointing to the dest. But the
instance.numa_topology still points to the source host topology as the
migration context is not applied yet. This leads to CPU pinning error if
the source topology does not fit to the dest topology. Also it stops the
periodic task and leaves the tracker in an inconsistent state. The
inconsistent state only cleanup up after the periodic is run outside of
the race window.

This patch applies the migration context temporarily to the specific
instances during the periodic to keep resource accounting correct.

Change-Id: Icaad155e22c9e2d86e464a0deb741c73f0dfb28a
Closes-Bug: #1953359
Closes-Bug: #1952915
(cherry picked from commit 32c1044)
(cherry picked from commit 1235dc3)
(cherry picked from commit 5f2f283)
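
An illustrative sketch of "apply the migration context temporarily": account with the destination NUMA topology during the periodic, then restore the original value; attribute names follow the description above but the context manager is hypothetical:

```python
import contextlib


@contextlib.contextmanager
def applied_migration_context(instance):
    if getattr(instance, 'migration_context', None) is None:
        yield instance
        return
    original = instance.numa_topology
    instance.numa_topology = instance.migration_context.new_numa_topology
    try:
        yield instance
    finally:
        # Restore the pre-migration view once the accounting is done.
        instance.numa_topology = original
```
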
During the PTG the TC discussed the topic and decided to drop the job
completely. Since the latest job configuration broke all stable gates
for nova (older than yoga), this is needed there to unblock our gates.
For dropping the job on master, let's wait for the resolution, as the
gate is not broken there; hence the patch is stable-only.

Conflicts:
  .zuul.yaml
  lower-constraints.txt

NOTE(elod.illes): conflict is due to branch specific settings (job
template names, lower constraints changes).

Change-Id: I514f6b337ffefef90a0ce9ab0b4afd083caa277e
(cherry picked from commit 15b7271)
(cherry picked from commit ba3c5b8)
(cherry picked from commit 327693a)
When tox 'docs' target is called, first it installs the dependencies
(listed in 'deps') in 'installdeps' phase, then it installs nova (with
its requirements) in the 'develop-inst' phase. In the latter case 'deps' is
not used, so the constraints defined in 'deps' are not applied.
This could lead to failures on stable branches when new packages are
released that break the build. To avoid this, the simplest solution is
to pre-install requirements, i.e. add requirements.txt to 'docs' tox
target.

Conflicts:
  tox.ini

NOTE(elod.illes): conflict is due to branch specific upper constraints
file link.

Change-Id: I4471d4488d336d5af0c23028724c4ce79d6a2031
(cherry picked from commit 494e8d7)
(cherry picked from commit 1ac0d69)
(cherry picked from commit 64cc084)
(cherry picked from commit f66a570)
We have the placement-nova-tox-functional-py38 job defined and run on the
placement gate[1] to run the nova functional tests excluding the api and
notification _sample_tests, and db-related tests, but that job skips
those tests via tox_extra_args, which is not the right way to do it, as
we are currently facing errors when tox_extra_args is included in the
tox siblings task:
- https://opendev.org/zuul/zuul-jobs/commit/c02c28a982da8d5a9e7b4ca38d30967f6cd1531d

- https://zuul.openstack.org/build/a8c186b2c7124856ae32477f10e2b9a4

Let's define a new tox env which can exclude the required tests
in the stestr command itself.

Conflicts:
    tox.ini

NOTE(melwitt): The conflict is because change
I672904e9bfb45a66a82331063c7d49c4bc0439df (Add functional-py39 testing)
is not in Victoria.

[1] https://opendev.org/openstack/placement/src/commit/bd5b19c00e1ab293fc157f4699bc4f4719731c25/.zuul.yaml#L83

Change-Id: I20d6339a5203aed058f432f68e2ec1af57030401
(cherry picked from commit 7b063e4)
(cherry picked from commit 64f5c1c)
(cherry picked from commit baf0d93)
This patch is based upon a downstream patch which came up in discussion
amongst the ironic community when some operators began discussing a case
where resource providers had disappeared from a running deployment with
several thousand baremetal nodes.

Discussion amongst operators and developers ensued and we were able
to determine that this was still an issue in the current upstream code
and that time difference between collecting data and then reconciling
the records was a source of the issue. Per Arun, they have been running
this change downstream and had not seen any recurrences of the issue
since the patch was applied.

This patch was originally authored by Arun S A G, and below is his
original commit message.

An instance could be launched and scheduled to a compute node between
get_uuids_by_host() call and _get_node_list() call. If that happens
the ironic node.instance_uuid may not be None but the instance_uuid
will be missing from the instance list returned by get_uuids_by_host()
method. This is possible because _get_node_list() takes several minutes to return
in large baremetal clusters and a lot can happen in that time.

This causes the compute node to be orphaned and associated resource
provider to be deleted from placement. Once the resource provider is
deleted, it is never created again until the service restarts. Since the
resource provider is deleted, subsequent boots/rebuilds to the same
host will fail.

This behaviour is visible on VMbooter nodes because they constantly
launch and delete instances, thereby increasing the likelihood
of this race condition happening in large ironic clusters.

To reduce the chance of this race condition we call _get_node_list()
first followed by get_uuids_by_host() method.

Change-Id: I55bde8dd33154e17bbdb3c4b0e7a83a20e8487e8
Co-Authored-By: Arun S A G <[email protected]>
Related-Bug: #1841481
(cherry picked from commit f84d591)
(cherry picked from commit 0c36bd2)
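
A hedged sketch of the reordering: fetch the slow ironic node list first and the instance UUID list second, so an instance scheduled in between shows up in the UUID list instead of silently orphaning its node. The wrapper and its filtering predicate are illustrative, not the driver's exact logic:

```python
def nodes_managed_by_this_host(driver, host):
    node_list = driver._get_node_list()               # slow call, do it first
    instance_uuids = set(driver.get_uuids_by_host(host))
    return [node for node in node_list
            if node.instance_uuid is None
            or node.instance_uuid in instance_uuids]
```
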
When the nova-compute service starts, by default it attempts to
start up instance configuration states for aspects such as networking.
This is fine in most cases, and makes a lot of sense if the
nova-compute service is just managing virtual machines on a hypervisor.

This is done, one instance at a time.

However, when the compute driver is ironic, the networking is managed
as part of the physical machine lifecycle potentially all the way into
committed switch configurations. As such, there is no need to attempt
to call ``plug_vifs`` on every single instance managed by the
nova-compute process which is backed by Ironic.

Additionally, a nova-compute service backed by ironic tends to manage far
more physical machines per service instance than when operating
co-installed with a hypervisor. Often this means a cluster of a thousand machines,
with three controllers, will see thousands of un-needed API calls upon
service start, which elongates the entire process and negatively
impacts operations.

In essence, nova.virt.ironic's plug_vifs call now does nothing,
and merely issues a debug LOG entry when called.

Closes-Bug: #1777608
Change-Id: Iba87cef50238c5b02ab313f2311b826081d5b4ab
(cherry picked from commit 7f81cf2)
(cherry picked from commit eb6d70f)
(cherry picked from commit f210115)
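
A hedged sketch of the resulting no-op behaviour in the ironic virt driver:

```python
import logging

LOG = logging.getLogger(__name__)


def plug_vifs(instance, network_info):
    # Ironic manages networking as part of the node lifecycle, so there is
    # nothing for nova-compute to do here at service startup.
    LOG.debug('VIF plug called for instance %s with the ironic driver; '
              'no action required.', instance)
```
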
This change adds a reproducer test for evacuating
a VM in the powering-off state.

While backporting to stable/victoria, conflicts in
functional.integrated_helpers were fixed by:
1 - Adding a placeholder NOT_SPECIFIED object as the default
    parameter value for _evacuate_server
2 - Updating the _evacuate_server definition to allow the function
    to wait until the expected server state is reached
3 - Adding _start_server and _stop_server
4 - Updating the _evacuate_server arguments for test_evacuate
    as per the updated _evacuate_server signature

Related-Bug: #1978983
Change-Id: I5540df6c7497956219c06cff6f15b51c2c8bc299
(cherry picked from commit 5904c7f)
(cherry picked from commit 6bd0bf0)
(cherry picked from commit 1e0af92)
(cherry picked from commit b57b0ee)
Ignore the instance task state and continue with VM evacuation.

Closes-Bug: #1978983
Change-Id: I5540df6c7497956219c06cff6f15b51c2c8bc29d
(cherry picked from commit db919aa)
(cherry picked from commit 6d61fcc)
(cherry picked from commit 8e9aa71)
(cherry picked from commit 0b8124b)
Nova's use of libvirt's compareCPU() API served its purpose
over the years, but its design limitations break live migration in
subtle ways.  For example, the compareCPU() API compares against the
host physical CPUID.  Some of the features from this CPUID are not
exposed by KVM, and then there are some features that KVM emulates that
are not in the host CPUID.  The latter can cause bogus live migration
failures.

With QEMU >=2.9 and libvirt >= 4.4.0, libvirt will do the right thing in
terms of CPU compatibility checks on the destination host during live
migration.  Nova satisfies these minimum version requirements by a good
margin.  So, provide a workaround to skip the CPU comparison check on
the destination host before migrating a guest, and let libvirt handle it
correctly.  This workaround will be removed once Nova replaces the older
libvirt APIs with their newer and improved counterparts[1][2].

                - - -

Note that Nova's libvirt driver calls compareCPU() in another method,
_check_cpu_compatibility(); I did not remove its usage yet, as it needs
more careful combing of the code, and then:

  - where possible, remove the usage of compareCPU() altogether, and
    rely on libvirt doing the right thing under the hood; or

  - where Nova _must_ do the CPU comparison checks, switch to the better
    libvirt CPU APIs -- baselineHypervisorCPU() and
    compareHypervisorCPU() -- that are described here[1].  This is work
    in progress[2].

[1] https://opendev.org/openstack/nova-specs/commit/70811da221035044e27
[2] https://review.opendev.org/q/topic:bp%252Fcpu-selection-with-hypervisor-consideration

Change-Id: I444991584118a969e9ea04d352821b07ec0ba88d
Closes-Bug: #1913716
Signed-off-by: Kashyap Chamarthy <[email protected]>
Signed-off-by: Balazs Gibizer <[email protected]>
(cherry picked from commit 267a406)
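
A hedged sketch of gating the legacy check behind a workaround flag; the option name follows the description above but the snippet is illustrative rather than nova's actual code:

```python
from oslo_config import cfg

CONF = cfg.CONF
CONF.register_opts(
    [cfg.BoolOpt('skip_cpu_compare_on_dest', default=False,
                 help='Rely on libvirt to check CPU compatibility on the '
                      'destination host during live migration.')],
    group='workarounds')


def check_destination_cpu(compare_cpu_fn, guest_cpu_xml):
    if CONF.workarounds.skip_cpu_compare_on_dest:
        # libvirt >= 4.4.0 with QEMU >= 2.9 performs this check correctly
        # on the destination, so skip the legacy comparison entirely.
        return
    compare_cpu_fn(guest_cpu_xml)
```
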
@markgoddard

It would be clearer to push the upstream changes to stackhpc/victoria, then rebase on top.
