Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[wip] #2411

Draft
wants to merge 52 commits into
base: master
Choose a base branch
from
Draft

[wip] #2411

wants to merge 52 commits into from

Conversation

kyrtapz
Copy link
Contributor

@kyrtapz kyrtapz commented Jan 8, 2025

No description provided.

npinaeva and others added 30 commits December 18, 2024 20:23
Handle host-network pods as default network.
Don't return per-pod errors on startup.
Remove nadController from UDNHostIsolationManager as we don't use it
anymore to find pod's UDN based on NADs that exist in the namespace.

Signed-off-by: Nadia Pinaeva <[email protected]>
This code isnt being used anymore. We dont expect users
to upgrade directly from code which contained the legacy LRPs,
therefore its safe to remove.

Signed-off-by: Martin Kennelly <[email protected]>
In an unlikely scenario where the service doesn't exist
and there was an issue getting the current active network
the code should not use the service object for the returned error.

Signed-off-by: Patryk Diak <[email protected]>
Previously, if the services controller failed
to start it would not be retried.

Signed-off-by: Patryk Diak <[email protected]>
L2 UDN: EgressIP hosted by primary interface (`breth0`)
If EncapIP is configured, it means it is different from the node's
primary address. Do not update EncapIP when node's primary address
changes.

Signed-off-by: Yun Zhou <[email protected]>
Assign network ID from network manager running in cluster manager. The
network ID is included in NetInfo and annotated on the NAD along with
the network name. Network managers running in zone & node controllers
will read the network ID from the annotation to set it on NetInfo.

On startup, network manager running in cluster manager will read the
network IDs annotated on the nodes to cover for the upgrade scenario.
Network IDs will still be annotated on the nodes because this PR does
not transition all the code to use the network ID from the NetInfo
instead of the node annotation. That will have to be done progressively.

This have several benefits, among them:
- NetworkID is available sooner overall since we dont have to wait for
  all the nodes to be annotated
- No need to unmarshall the node annotation to get the network IDs, they
  are available in NetInfo
- No need to unmashall the NAD to get the network name, can be accessed
  directly from the annotation.

If a network is replaced with a different one with the same name, the
network ID is reused as the respective network controller will not start
as the previous one is stopped and cleaned up so it shouldn't be a
problem.

Signed-off-by: Jaime Caamaño Ruiz <[email protected]>
Instead of considering managed VRFs those that follow the mp<id>-udn-vrf
naming template, use the table number: those vrfs associated to a table
within our reserved block of table numbers are managed by us. The block
right now is anything higher than RoutingTableIDStart (1000). This
allows to manage VRFs with any name which is desirable if the name is
going to be exposed through BGP.

Signed-off-by: Jaime Caamaño Ruiz <[email protected]>
Anticipating that these VRF names are going to be exposed through BGP,
we should to use friendlier names for our VRFs. The most natural name to
use is the network name. Thus giving a cluster UDN a name below 15
characters that matches an already existing VRF not managed by ovn-k
will fail. This is considered an admin problem and not an ovn-k problem
for now.

Signed-off-by: Jaime Caamaño Ruiz <[email protected]>
Signed-off-by: Jaime Caamaño Ruiz <[email protected]>
Signed-off-by: Jaime Caamaño Ruiz <[email protected]>
Was causing deadlocks in unit tests

Signed-off-by: Jaime Caamaño Ruiz <[email protected]>
…heir subcontrollers

Signed-off-by: Jaime Caamaño Ruiz <[email protected]>
Signed-off-by: Jaime Caamaño Ruiz <[email protected]>
Assuming that there is three types of controllers, being: network
agnostic, network aware and network specific; we were already notifying
network specific controllers of network changes. But network aware
controllers, controllers for which we have a single instance capable of
managing multiple networks, had no code path to be informed of netwokr
changes.

This commit adds a code path for that and makes the RouteAdvertisments
controller aware of network changes.

Changed ClusterManager to be the controller manager for cluster manager
instead of secondaryNetworkClusterManager. It just makes more sense that
way sice ClusterManager is the top level manager.

Signed-off-by: Jaime Caamaño Ruiz <[email protected]>
On CUDN cleanup is inconsistent as we see some flaky tests due to CUDN
"already exist" errors, implying object are not actually deleted.

Wait for CUDN object be gone when deleted

Signed-off-by: Or Mergi <[email protected]>
CUDN is cluster-scoped object, in case tests running in parallel,
having random names avoids conflicting with other tests.

Use random metadata.name for CUDN objects.

The "isolates overlapping CIDRs" tests create objects based on the
'red' and 'blue' variables, including CUDN objects.
Change the tests CUDN creation use random names and update the given
'networkAttachmentConfigParams' with the new generated name.
Update 'red' & 'blue' vaiables with the generated name, carried by
'networkAttachmentConfigParams' (netConfig.name).

The pod2Egress tests asserts on the CUDN object name given by 'userDefinedNetworkName'.
In practice the tests netConfigParam.name is userDefinedNetworkName.
Change the assertion to check the given netConfigParam.

Signed-off-by: Or Mergi <[email protected]>
Reconcile RouteAdvertisements in cluster manager
Add missing enum validation for RouteAdvertisements
Compare annotations directly if possible.
For network specific map entries only compare raw json
entries without parsing the map in full.

Co-authored-by: Tim Rozet <[email protected]>
Signed-off-by: Patryk Diak <[email protected]>
kyrtapz and others added 18 commits January 7, 2025 10:00
Instead of always parsing all node/join subnets
parse the raw json map and only compute the results
for the affected network.

Signed-off-by: Patryk Diak <[email protected]>
Increases async performance of informer cache being able to always queue
events and not blocking while performing ADD/UPDATE/DELETE operation.

Signed-off-by: Tim Rozet <[email protected]>
Add a pool of Event handlers instead of a single (federated) event
handler per informer.  Ensure a controller always gets registers with
the same event handler.

Set the pool size to 201 (200 for secondary controllers and one, index
0, reserved for the default network).

Always use pool entry with index 0 for the default network controller.

Signed-off-by: Dumitru Ceara <[email protected]>
Keep the initial add/sync queue small enough though.  This is needed to
avoid contention on handler addition initial processing.

Signed-off-by: Dumitru Ceara <[email protected]>
k8s.ovn.org/user-defined-network is now required to be labeled on a
namespace at namespace creation time in order to use a primary UDN. The
following conditions are true:

1. If namespace is missing the label, and a pod is created, it attaches
   to default network.
2. If the namespace is missing the label, and a primary UDN or CUDN is
   created that matches that namespace, the UDN/CUDN will report error
   status and the NAD will not be generated.
3. If the namespace is missing the label, and a primary UDN/CUDN exists,
   a pod in the namespace will be created and attached to default
   network.
4. If the namespace has the label, and a primary UDN/CUDN does not exist
   a pod in the namespace will fail creation until the UDN/CUDN is
   created.

Also includes some fixes to unit tests that were brought to light by
this PR. For example, the layer 2 multi-network tests were adding
invalid annotations for node-subnets, etc.

Signed-off-by: Tim Rozet <[email protected]>
Was using ipv6 on ipv4 cluster.

Signed-off-by: Tim Rozet <[email protected]>
The unit tests run with race detection enabled and on constrained
environments (e.g., default GitHub runners) and run out of resources
when using such large event queues.

This change doesn't affect e2e tests in any way.  Those will use default
event queue sizes (1K) in order to test what gets deployed on actual
clusters.

Signed-off-by: Dumitru Ceara <[email protected]>
Secondary network controllers should ingore resources
that do not belong to the current network.

Signed-off-by: Patryk Diak <[email protected]>
@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jan 8, 2025
Copy link
Contributor

openshift-ci bot commented Jan 8, 2025

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

Copy link
Contributor

openshift-ci bot commented Jan 8, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: kyrtapz
Once this PR has been reviewed and has the lgtm label, please assign knobunc for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@kyrtapz
Copy link
Contributor Author

kyrtapz commented Jan 8, 2025

/test images

Copy link
Contributor

openshift-ci bot commented Jan 8, 2025

@kyrtapz: all tests passed!

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

kyrtapz and others added 4 commits January 8, 2025 16:56
When multinetwork policies support is disabled
WatchNamespaces should run to completion for
primary UDNs. This was not happening because a primary
UDN is also a secondary network.

Signed-off-by: Patryk Diak <[email protected]>
Signed-off-by: Patryk Diak <[email protected]>
Otherwise there's a data race because subnets can be allocated while we
read the current usage/count.

Signed-off-by: Dumitru Ceara <[email protected]>
@openshift-merge-robot openshift-merge-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jan 26, 2025
@openshift-merge-robot
Copy link
Contributor

PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD.
Projects
None yet
Development

Successfully merging this pull request may close these issues.