Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ironic Standalone Operator #461

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

dtantsur
Copy link
Member

This design proposal discusses ironic-standalone-operator: provides a
motivation for its creation, describes its goals and outlines the
current design and features, as well as the plans for the nearest
future.

Signed-off-by: Dmitry Tantsur [email protected]

@metal3-io-bot
Copy link
Contributor

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@metal3-io-bot metal3-io-bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jul 25, 2024
@metal3-io-bot metal3-io-bot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Jul 25, 2024
@dtantsur dtantsur force-pushed the ironic-standalone-operator branch 4 times, most recently from 27ec8ca to 04f201c Compare August 1, 2024 16:06
@dtantsur dtantsur marked this pull request as ready for review August 1, 2024 16:07
@metal3-io-bot metal3-io-bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Aug 1, 2024
This design proposal discusses ironic-standalone-operator: provides a
motivation for its creation, describes its goals and outlines the
current design and features, as well as the plans for the nearest
future.

Signed-off-by: Dmitry Tantsur <[email protected]>

## Motivation

Ironic is not trivial to install and operate. Also we provide
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"Even if we provide" ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also-vs-Although 🤦🏽 Will fix if I need to update.

- `credentialsRef` - a reference to a secret with credentials (generated if
missing)
- `databaseRef` - a reference to an `IronicDatabase` object (if needed)
- `distributed` - a boolean flag that enables the HA architecture

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not just call this field highlyAvailable or highAvailability

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a good question. I guess distributed is more precise technically, while highAvailability may be more user-friendly. I'd like to hear more opinions on the topic.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would vote for highAvailability or haEnabled


- `apiPort`, `imageServerPort`, `imageServerTLSPort` allow overriding listening
ports for the services
- `bindInterface` - a boolean flag that makes Ironic listen on only the

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It may sound too simplistic, but can we call this provisioningInterfaceOnly? )bindInterface can sound like another kind of interface.)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I concur but see below about the word "provisioning".

- `dhcp` - another nested structure with DHCP parameters (see below)
- `externalIP` - IP through which nodes deployed over virtual media access
Ironic and HTTPD
- `interface`, `ipAddress`, `macAddresses` - various ways to specify the

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similarly, how does provisioningInterface, provisioningIPAddress and provisioningMACAddress sound?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a potential for confusion here. We call "provisioning network" a dedicated network for the PXE and other provisioning traffic. Ironic can work without it, and the variables work regardless of whether you have a "provisioning network".

Maybe I'm overthinking it? Would like more opinions here as well.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we do decide to change this, I'd rather rename the whole "networking" sub-field into something like "provisioningNetwork" to avoid duplicate prefixes.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But the rest of the sub-fields seem unrelated to the provisioning network (like dhcp and external ip)?
How would the interface, ipAddress and macAddresses fields work if there is no provisioning interface?
(I am not too familiar with ironic's networking, so just trying to understand the expected behaviour)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

DHCP is very-very much related to the provisioning network: it's DHCP on the provisioning network. ExternalIP is not directly related indeed, maybe we'd need to move it out then.

How would the interface, ipAddress and macAddresses fields work if there is no provisioning interface?

There must be a provisioning interface, but it can be on some existing network rather than an isolated provisioning network. Confusing, I know. This is why I'm pondering our usage of "provisioning" here.


After each full reconciliation, the operator will store the version of Ironic
it has just installed in the `status`. In the future, this will allow to apply
any logic on upgrade. However, we will try to keep ironic-image self-upgrading

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

'However, we will try to keep ironic-image self-upgrading for the sake of users that do not use ironic-standalone-operator.' What do you mean?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mean, if you just use ironic-image, and you upgrade it by replacing the container, we will try to support this case still. Any ideas on how to phrase it better?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You mean that for example, if someone was running ironic-standalone-operator v1.0 (with ironic-image v1.0), and just replaced their ironic-image with v1.1, ironic-standalone-operator will continue to work?
If that's what you mean, then 'for the sake of users that do not use ironic-standalone-operator.' can be confusing..

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I see the source of confusion. What I wanted to say was: users of ironic-image without the operator will be able to do it still. I'll work on a different way to phrase it.

avoid running it 3+ times in parallel. Maybe it will take a form of a *Job*.
- BMO will need to be updated to handle more cases of unexpected provision
state. E.g. what needs to be done when the BMH is *inspecting* but the node
is found in a completely wrong state like `active` or `clean wait`.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is an additional complexity to consider if we're going to support mariadb combined with persistent volumes, when the mariadb version changes you either have to ensure clean shutdown of the container or enable some kind of backup/restore workflow due to the mariadb checks around versioning for the redo log - see here for example.

This can likely be resolved with with pod lifecycle hooks, or perhaps some kind of job pre/post upgrade but it's additional complexity which I don't think we currently consider in the community deployment examples.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very good point, thank you!


Q: Why using *DaemonSets* if *StatefulSets* provide us an easier way to address
separate Ironic instances? A: While we're relying on host networking, making
several Ironic instances co-exist on the same Kubernetes node is too complex.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From Kubernetes HA perspective it is also more reliable to have the ironic pods present on separate control plane nodes. In other words clustering all Ironic pods on e.g. 1 controlplane node wouldn't be HA anyways as that node could be lost.

Copy link
Member

@Rozzii Rozzii left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/approve

Copy link
Member

@Rozzii Rozzii left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm
From my perspective it is fine given that you will address the comments.

@metal3-io-bot metal3-io-bot added the lgtm Indicates that a PR is ready to be merged. label Sep 25, 2024
@Rozzii
Copy link
Member

Rozzii commented Sep 25, 2024

/hold
Just holding to avoid someone putting approve on it while my lgtm is active and merging it

@metal3-io-bot metal3-io-bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Sep 25, 2024
Copy link
Member

@kashifest kashifest left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot @dtantsur few questions/nits inline

#### HA architecture

The *non-HA* architecture is the architecture that Metal3 uses now. All Ironic
components are run in a *deployment*.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
components are run in a *deployment*.
components are run in a single *deployment*.


When the HA architecture is enabled via a flag on the `Ironic` resource, all
Ironic components (except for MariaDB and dnsmasq) will be installed in a
*DaemonSet* instead of a *Deployment*.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we explain the reasoning here ?


## Design Details

This proposal adds a new project under the Metal3 umbrella:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is it a project or a repository?

its private key certificate and sign the certificate with this CA. The CA will
be trusted **only** for the RPC purpose, removing the possibility of abuse.
To reduce the number of code paths, this process will happen unconditionally,
even when TLS for Ironic itself is not enabled.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just for my understanding what prevents us from enabling TLS unconditionally for ironic itself?

- `credentialsRef` - a reference to a secret with credentials (generated if
missing)
- `databaseRef` - a reference to an `IronicDatabase` object (if needed)
- `distributed` - a boolean flag that enables the HA architecture
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would vote for highAvailability or haEnabled

@kashifest
Copy link
Member

I can approve it already and hold can be taken back when the reviews are addressed
/approve

@metal3-io-bot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: kashifest, Rozzii

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@metal3-io-bot metal3-io-bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Sep 25, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. lgtm Indicates that a PR is ready to be merged. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants