Pre v1alpha4 rel #428

Merged
roehrich-hpe merged 28 commits into releases/v0 from pre-v1alpha4-rel
Dec 9, 2024

roehrich-hpe
Contributor

No description provided.

matthew-richerson and others added 28 commits September 26, 2024 15:45
* Add timeout when creating fan-out child resources

The ClientMount, NnfNodeStorage, and NnfNodeBlockStorage resources are fanned out to the
Rabbit and compute nodes. If the correct controllers aren't running on one or more of those
nodes, then the workflow will not progress but won't give an error. Add an optional timeout
that checks whether the controller on the Rabbit/compute node has added its finalizer within
a configurable amount of time. If the finalizer hasn't been added, then return an error.
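
A minimal sketch of the timeout check described above, using only the Go standard library. The `child` type, `checkChild`, the error value, and the finalizer name are hypothetical stand-ins for illustration; the actual controller works against the Kubernetes API objects and may differ in how it measures elapsed time.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// child is a hypothetical stand-in for a fanned-out resource such as a
// ClientMount; only the fields needed for the check are modeled.
type child struct {
	Name       string
	Created    time.Time
	Finalizers []string
}

// errChildTimeout is an illustrative error, not one defined by the project.
var errChildTimeout = errors.New("controller did not add its finalizer in time")

// checkChild returns an error when the expected finalizer is still missing
// after the timeout has elapsed. A zero timeout disables the check.
func checkChild(c child, finalizer string, timeout time.Duration, now time.Time) error {
	for _, f := range c.Finalizers {
		if f == finalizer {
			return nil // the node-side controller has picked up the resource
		}
	}
	if timeout > 0 && now.Sub(c.Created) > timeout {
		return fmt.Errorf("%s: %w", c.Name, errChildTimeout)
	}
	return nil // still within the grace period; keep waiting
}

func main() {
	created := time.Date(2024, 9, 26, 15, 45, 0, 0, time.UTC)
	c := child{Name: "rabbit-node-1", Created: created}
	// One minute after creation with a 30s timeout and no finalizer.
	fmt.Println(checkChild(c, "example.com/finalizer", 30*time.Second, created.Add(time.Minute)))
}
```

Passing `now` as a parameter keeps the check deterministic, which makes it easy to exercise in a unit test.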

Signed-off-by: Matt Richerson <[email protected]>

* use default child timeout value instead of returning error

Signed-off-by: Matt Richerson <[email protected]>

---------

Signed-off-by: Matt Richerson <[email protected]>
The workflow controller adds/removes owner labels on the PersistentStorageInstance
resource in the teardown phase of the create_persistent/destroy_persistent directives.
This is so that the later call to DeleteChildren() will find (or not find) the persistent
storage and delete it if necessary. The call to DeleteChildren() may do the wrong thing
if the PersistentStorageInstance resource in the cache is stale. This commit adds a check
after the labels are changed to make sure the changes are visible in our client cache.
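
The cache check can be pictured as a comparison between the labels just written and the labels on the copy read back from the client cache. The sketch below assumes plain maps instead of Kubernetes objects, and `labelsVisible` and the label keys are hypothetical names, not the controller's own.

```go
package main

import "fmt"

// labelsVisible reports whether every wanted label has the expected value in
// the cached copy and every removed label key is absent from it.
func labelsVisible(cached, want map[string]string, removed []string) bool {
	for k, v := range want {
		if got, ok := cached[k]; !ok || got != v {
			return false // the cache has not caught up with an added label
		}
	}
	for _, k := range removed {
		if _, ok := cached[k]; ok {
			return false // the cache still shows a label that was removed
		}
	}
	return true
}

func main() {
	// Hypothetical label keys, for illustration only.
	stale := map[string]string{"example.com/owner": "old-workflow"}
	want := map[string]string{"example.com/owner": "new-workflow"}
	fmt.Println(labelsVisible(stale, want, nil)) // the stale copy does not match
}
```

When the check fails, the reconciler would requeue and look again instead of calling DeleteChildren() against the stale object.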

Also, change the Requeue while waiting for children to delete to a RequeueAfter.

Fix a bug in the NnfSystemStorage and NnfAccess tests. The Storage resources are created
by the SystemConfiguration controller, so the tests don't need to create or delete them.
Signed-off-by: Matt Richerson <[email protected]>
Save the working rules so they can be found quickly some other day.

Signed-off-by: Dean Roehrich <[email protected]>
Role rules to monitor API Priority and Fairness
Create v1alpha3 APIs.

This used "kubebuilder create api --resource --controller=false"
for each API.

Signed-off-by: Matt Richerson <[email protected]>
Copy API content from v1alpha2 to v1alpha3.

Move the kubebuilder:storageversion marker from v1alpha2 to v1alpha3.

Set localSchemeBuilder var in api/v1alpha2/groupversion_info.go
to satisfy zz_generated.conversion.go.

Signed-off-by: Matt Richerson <[email protected]>
Move the existing webhooks from v1alpha2 to v1alpha3.

Signed-off-by: Matt Richerson <[email protected]>
Create conversion webhooks and hub routines for v1alpha3.

This may have used "kubebuilder create webhook --conversion" for any
API that did not already have a webhook.

Any newly-created api/v1alpha3/*_webhook_test.go is empty and
does not need content at this time. It has been updated with a comment
to explain where conversion tests are located.

ACTION: Any new tests added to
  github/cluster-api/util/conversion/conversion_test.go
  may need to be manually adjusted. Look for the "ACTION" comments
  in this file.

This may have added a new SetupWebhookWithManager() to suite_test.go,
though a later step will complete the changes to that file.

Signed-off-by: Matt Richerson <[email protected]>
Create conversion routines and tests for v1alpha2.

Switch api/v1alpha2/conversion.go content from hub to spoke.

These conversion.go ConvertTo()/ConvertFrom() routines are complete
and do not require manual adjustment at this time, because v1alpha2 is
currently identical to the new hub v1alpha3.
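
While the spoke and the hub are identical, the conversion routines amount to a field-by-field copy in each direction. The sketch below mirrors the ConvertTo()/ConvertFrom() method names but uses hypothetical local types (`spokeProfile`, `hubProfile`) and concrete pointer parameters so it runs with the standard library alone; the real routines in conversion.go delegate to the generated conversion functions and take the hub interface type.

```go
package main

import "fmt"

// hubProfile and spokeProfile are hypothetical stand-ins for a hub (v1alpha3)
// and spoke (v1alpha2) type that currently have identical fields.
type hubProfile struct {
	Name string
	Data map[string]string
}

type spokeProfile struct {
	Name string
	Data map[string]string
}

// copyMap returns an independent copy so the two versions never share state.
func copyMap(in map[string]string) map[string]string {
	if in == nil {
		return nil
	}
	out := make(map[string]string, len(in))
	for k, v := range in {
		out[k] = v
	}
	return out
}

// ConvertTo copies the spoke into the hub.
func (s *spokeProfile) ConvertTo(dst *hubProfile) error {
	dst.Name = s.Name
	dst.Data = copyMap(s.Data)
	return nil
}

// ConvertFrom fills the spoke from the hub.
func (s *spokeProfile) ConvertFrom(src *hubProfile) error {
	s.Name = src.Name
	s.Data = copyMap(src.Data)
	return nil
}

func main() {
	old := &spokeProfile{Name: "default", Data: map[string]string{"k": "v"}}
	var hub hubProfile
	if err := old.ConvertTo(&hub); err != nil {
		fmt.Println("convert to hub failed:", err)
		return
	}
	var back spokeProfile
	if err := back.ConvertFrom(&hub); err != nil {
		fmt.Println("convert from hub failed:", err)
		return
	}
	fmt.Println(back.Name, back.Data["k"])
}
```

Once the hub gains or changes a field, this symmetry breaks and the spoke routines need manual handling for the fields that no longer map one-to-one.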

ACTION: The api/v1alpha2/conversion_test.go may need to be
  manually adjusted for your needs, especially if it has been manually
  adjusted in earlier spokes.

ACTION: Any new tests added to internal/controller/conversion_test.go
  may need to be manually adjusted.

This added api/v1alpha2/doc.go to hold the k8s:conversion-gen
marker that points to the new hub.

Signed-off-by: Matt Richerson <[email protected]>
Point controllers at new hub v1alpha3

Point conversion fuzz test at new hub. These routines are still
valid for the new hub because it is currently identical to the
previous hub.

ACTION: Some controllers may have been referencing one of these
  non-local APIs. Verify that these APIs are being referenced
  by their correct versions:
  DirectiveBreakdown, Workflow
Signed-off-by: Matt Richerson <[email protected]>
Point earlier spoke APIs at new hub v1alpha3.

The conversion_test.go and the ConvertTo()/ConvertFrom() routines in
conversion.go are still valid for the new hub because it is currently
identical to the previous hub.

Update the k8s:conversion-gen marker in doc.go to point to the new hub.

ACTION: Some API libraries may have been referencing one of these
  non-local APIs. Verify that these APIs are being referenced
  by their correct versions:
  DirectiveBreakdown, Workflow
Signed-off-by: Matt Richerson <[email protected]>
Make the auto-generated files.

Update the SRC_DIRS spoke list in the Makefile.

make manifests && make generate && make generate-go-conversions
make fmt

ACTION: If any of the code in this repo was referencing non-local
  APIs, the references to them may have been inadvertently
  modified. Verify that any non-local APIs are being referenced
  by their correct versions.

ACTION: Begin by running "make vet". Repair any issues that it finds.
  Then run "make test" and continue repairing issues until the tests
  pass.
Signed-off-by: Matt Richerson <[email protected]>
A rabbit that has lost its NoSchedule taint, but retains its
nnf.cray.hpe.com/taints_and_labels_completed=true label, was not able to
repair its taints.

This change allows the nnf_systemconfiguration_controller to examine the node
and determine whether the label is stale with respect to the state of the
taints, and to correct the taints if necessary.
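
A rough sketch of the staleness test, assuming simplified `node` and `taint` types in place of the Kubernetes ones. The label key comes from the commit message; the taint key and the `labelIsStale` helper are hypothetical.

```go
package main

import "fmt"

// taint and node are simplified stand-ins for the Kubernetes node types.
type taint struct {
	Key    string
	Effect string
}

type node struct {
	Labels map[string]string
	Taints []taint
}

const completedLabel = "nnf.cray.hpe.com/taints_and_labels_completed"

// labelIsStale reports whether the node claims its taints were applied while
// the required NoSchedule taint is actually missing.
func labelIsStale(n node, requiredTaintKey string) bool {
	if n.Labels[completedLabel] != "true" {
		return false // the label makes no claim, so it cannot be stale
	}
	for _, t := range n.Taints {
		if t.Key == requiredTaintKey && t.Effect == "NoSchedule" {
			return false // the taint is present; the label is accurate
		}
	}
	return true
}

func main() {
	// "example.com/rabbit" is a hypothetical taint key used for illustration.
	n := node{Labels: map[string]string{completedLabel: "true"}}
	fmt.Println(labelIsStale(n, "example.com/rabbit"))
}
```

A controller that finds the label stale would reapply the taint before trusting the label again.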

Signed-off-by: Dean Roehrich <[email protected]>
Add two new fields to the NnfStorageProfile: postActivate and preDeactivate. These
are free-form string lists that let an admin list commands to run on the
Rabbit after a file system has been activated or before it is deactivated.

Signed-off-by: Matt Richerson <[email protected]>
* Use a file-based database for nnf-ec

Mount /localdisk (the M.2) into the nnf-node-manager pods. Use the default database
in nnf-ec (badger) and change the working directory of the container to /localdisk
so the database file is created in the correct spot.

Signed-off-by: Matt Richerson <[email protected]>

* add type field to localdisk volumes

Signed-off-by: Matt Richerson <[email protected]>

---------

Signed-off-by: Matt Richerson <[email protected]>
User jobs currently have no way to retrieve the Servers resource
for a workflow. Access to the Servers resource can provide Lustre
information, such as which Rabbit nodes are being used for MDTs/OSTs.

This creates a file (`./.nnf-servers.json`) at the root of the Lustre
file system that contains the MDTs/OSTs. It can then be parsed using `jq` to
retrieve the pertinent information.

Examples:

```
# non-persistent
flux run -N4 --setattr=dw="#DW jobdw name=blake type=lustre capacity=30GB" bash -c "cat \$DW_JOB_blake/.nnf-servers.json | jq '.ost'"

# persistent
flux run -N4 --setattr=dw="#DW persistentdw name=blake-persistent" bash -c "cat \$DW_PERSISTENT_blake_persistent/.nnf-servers.json | jq '.ost'"

```
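
For jobs that prefer not to shell out to `jq`, the same lookup can be sketched in Go. The only thing assumed about the file is what the `jq '.ost'` examples imply: a JSON object with a top-level `ost` key. The `extractOST` helper and the sample document are invented for illustration, and the real layout under `ost` is not modeled.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// extractOST returns the raw JSON stored under the top-level "ost" key,
// without assuming anything about its inner structure.
func extractOST(doc []byte) (json.RawMessage, error) {
	var top map[string]json.RawMessage
	if err := json.Unmarshal(doc, &top); err != nil {
		return nil, fmt.Errorf("parse servers file: %w", err)
	}
	ost, ok := top["ost"]
	if !ok {
		return nil, errors.New(`no "ost" key in servers file`)
	}
	return ost, nil
}

func main() {
	// Sample input is a placeholder; a job would read the file found under
	// the directory named by its DW_JOB_* or DW_PERSISTENT_* variable.
	sample := []byte(`{"ost": ["placeholder"]}`)
	ost, err := extractOST(sample)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(string(ost))
}
```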

Signed-off-by: Blake Devcich <[email protected]>
Do not print the always-nil 'err' value when we time out while waiting for
a VG to appear.

Signed-off-by: Dean Roehrich <[email protected]>
…urrent spec (#410)

For NnfStorage and NnfAccess resources created by the NnfSystemStorage, the spec section may
change as Storage resources are disabled/enabled. When aggregating status from child objects
(NnfNodeBlockStorage, NnfNodeStorage, and ClientMounts), only check the status from child
resources that are currently requested by the spec. This avoids trying to collect status
from Rabbits that are disabled.
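
The aggregation rule can be sketched as a loop driven by the spec instead of by the set of existing children. The `allRequestedReady` helper and the node names are hypothetical, and readiness is reduced to a boolean per node for illustration.

```go
package main

import "fmt"

// allRequestedReady returns true when every node named in the current spec
// has a child that reports ready. Children belonging to nodes that are no
// longer in the spec (for example, disabled Rabbits) are ignored.
func allRequestedReady(specNodes []string, childReady map[string]bool) bool {
	for _, n := range specNodes {
		if !childReady[n] {
			return false // missing or not-ready child for a requested node
		}
	}
	return true
}

func main() {
	spec := []string{"rabbit-1", "rabbit-2"}
	children := map[string]bool{
		"rabbit-1": true,
		"rabbit-2": true,
		"rabbit-3": false, // disabled and dropped from the spec, so not consulted
	}
	fmt.Println(allRequestedReady(spec, children))
}
```

Iterating over the spec means a leftover child on a disabled Rabbit can never hold the parent's status back.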

Signed-off-by: Matt Richerson <[email protected]>
Signed-off-by: Dean Roehrich <[email protected]>
@roehrich-hpe roehrich-hpe requested a review from bdevcich December 9, 2024 20:53
@roehrich-hpe roehrich-hpe merged commit 08e8c53 into releases/v0 Dec 9, 2024
3 checks passed
@roehrich-hpe roehrich-hpe deleted the pre-v1alpha4-rel branch December 9, 2024 21:21
3 participants