Release v0.0.5 #224

Merged
matthew-richerson merged 21 commits into releases/v0 from release-v0.0.5 on Sep 8, 2023
Conversation

matthew-richerson
Contributor

No description provided.

roehrich-hpe and others added 21 commits June 15, 2023 12:22
This affected some related functions, so I ended up reworking their function
signatures and collecting them in a new nnfcontainerprofile_helpers.go.

Signed-off-by: Dean Roehrich <[email protected]>
…ofile

Fix and use verifyPinnedContainerProfile() in tests
As the Ginkgo book says, "Declare in container nodes, initialize in setup
nodes".

I was stumped by why this test wasn't failing. When I corrected the
declare-before-initialize problem, it became clearer.
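
A minimal sketch of that Ginkgo pattern (the profile type and names here are illustrative stand-ins, not the actual test code):

```go
package profiles_test

import (
	"testing"

	. "github.com/onsi/ginkgo/v2"
	. "github.com/onsi/gomega"
)

// ExampleProfile is a hypothetical stand-in for the real profile type.
type ExampleProfile struct{ Name string }

var _ = Describe("NnfContainerProfile", func() {
	// Declare in the container node: the variable exists but has no value yet.
	var profile *ExampleProfile

	// Initialize in the setup node, so every spec starts from a fresh value.
	BeforeEach(func() {
		profile = &ExampleProfile{Name: "example"}
	})

	It("sees an initialized profile", func() {
		Expect(profile).NotTo(BeNil())
	})
})

func TestProfiles(t *testing.T) {
	RegisterFailHandler(Fail)
	RunSpecs(t, "Profile Suite")
}
```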

Signed-off-by: Dean Roehrich <[email protected]>
`containerHandler()` was much too large. This change moves all of the
underlying User Container code into its own helper file. To reduce the
number of function parameters (e.g. workflow, profile, ctx) passed into
these functions, a new struct was added so that the helper functions
could make use of it.
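
A rough sketch of that refactor pattern (the types and fields here are illustrative placeholders, not the real ones):

```go
package main

import "context"

// Hypothetical stand-ins for the real workflow and profile types.
type Workflow struct{ Name string }
type ContainerProfile struct{ Name string }

// nnfUserContainer bundles the values every helper needs, so helpers
// hang off a receiver instead of repeating (ctx, workflow, profile, ...)
// in every signature.
type nnfUserContainer struct {
	ctx      context.Context
	workflow *Workflow
	profile  *ContainerProfile
}

// Helpers read shared state from the receiver.
func (c *nnfUserContainer) launcherName() string {
	return c.workflow.Name + "-launcher"
}

func main() {
	c := &nnfUserContainer{
		ctx:      context.Background(),
		workflow: &Workflow{Name: "my-workflow"},
		profile:  &ContainerProfile{Name: "example-mpi"},
	}
	_ = c.launcherName()
}
```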

Also renamed it to `userContainerHandler()`.

Signed-off-by: Blake Devcich <[email protected]>
Add an optional UserID/GroupID field to NnfContainerProfileData.  If either
of these is specified, then the chosen container profile must match the
user ID or group ID specified in the workflow resource that references the
profile.
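
A hedged sketch of that matching rule (the function and its signature are illustrative, not the actual implementation):

```go
package main

import "fmt"

// checkProfileIdentity enforces the rule described above: when the
// profile pins a UserID or GroupID, the workflow's UID/GID must match.
func checkProfileIdentity(profileUID, profileGID *uint32, wfUID, wfGID uint32) error {
	if profileUID != nil && *profileUID != wfUID {
		return fmt.Errorf("profile is restricted to user ID %d", *profileUID)
	}
	if profileGID != nil && *profileGID != wfGID {
		return fmt.Errorf("profile is restricted to group ID %d", *profileGID)
	}
	return nil
}

func main() {
	uid := uint32(1050)
	fmt.Println(checkProfileIdentity(&uid, nil, 1050, 100)) // <nil>: UIDs match
}
```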

Add tests.

Signed-off-by: Dean Roehrich <[email protected]>
The NnfContainerProfile test was targeting a failure in the kube-apiserver
by tripping a kubebuilder:validation marker for RetryLimit. That has been
changed to something that will pass the kube-apiserver and then proceed on
to fail in the webhook.

Now we can distinguish when Create() errors are responses from the webhook,
versus when they are errors about the webhook not being installed.

Signed-off-by: Dean Roehrich <[email protected]>
On LLNL systems, the host's default capabilities are different from those
on HPE systems, due to differences in the container runtimes and their
configuration.

This change drops all capabilities and then adds in the required ones to
ensure mpirun can access the worker pods.
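
A minimal sketch of the drop-then-add approach, assuming the standard Kubernetes SecurityContext types; the Add list below is illustrative, not the exact set the controller uses:

```go
package main

import corev1 "k8s.io/api/core/v1"

// userContainerSecurityContext drops every default capability, then adds
// back only what is needed; the two capabilities named here are examples.
func userContainerSecurityContext() *corev1.SecurityContext {
	return &corev1.SecurityContext{
		Capabilities: &corev1.Capabilities{
			Drop: []corev1.Capability{"ALL"},
			Add:  []corev1.Capability{"NET_BIND_SERVICE", "SYS_CHROOT"},
		},
	}
}

func main() {
	_ = userContainerSecurityContext()
}
```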

Tested capabilities by:
- Dropping ALL capabilities without the specific Adds (and failing)
- Then adding in the specific Adds (and passing)

Signed-off-by: Blake Devcich <[email protected]>
…e resources (#212)

The system profiles must be in the NNF_STORAGE_PROFILE_NAMESPACE and
NNF_CONTAINER_PROFILE_NAMESPACE namespaces, as specified in the Deployment
spec.

Make the 'pinned' flag immutable.

Signed-off-by: Dean Roehrich <[email protected]>
This adds `DW_GLOBAL_*` storage support, backed by the `LustreFilesystem`
resource, to container workflows/profiles. This allows users to mount
global lustre filesystems into user containers.

An administrator must first edit the `LustreFilesystem` resource to add the
workflow's namespace (e.g. `default`) to the list of namespaces, along
with a mode. The mode defaults to `ReadWriteMany` if not set. Doing so
will create a PVC that can be used to mount the filesystem.

A `DW_GLOBAL_*` storage must also be added to the NnfContainerProfile.
See `config/examples` and `config/samples` for more detail.

The value for the `DW_GLOBAL_*` parameter is the path of the lustre
filesystem, e.g.:

```
#DW jobdw type=gfs2 name=my-local-storage capacity=100GB
#DW container name=my-container-workflow profile=example-mpi \
    DW_JOB_foo_local_storage=my-local-storage \
    DW_GLOBAL_foo_global_lustre=/lus/sawbill
```

Signed-off-by: Blake Devcich <[email protected]>
* github-46: Use ResourceError when returning errors

This commit uses the new ResourceError struct embedded in the status section of the DWS/nnf-sos
resources. When returning an error, use the NewResourceError() call to return a ResourceError and
fill it in with the correct information. This allows the end user and WLM to make informed decisions
about what to do when there's an error.
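
As a rough illustration of the pattern (a hypothetical shape, not the actual DWS definition):

```go
package main

// ResourceError is a hypothetical sketch of an error struct that a
// controller embeds in a resource's status, giving the end user and the
// WLM enough context to decide how to react.
type ResourceError struct {
	UserMessage  string `json:"userMessage,omitempty"`  // safe to show end users
	DebugMessage string `json:"debugMessage,omitempty"` // detail for admins
	Type         string `json:"type,omitempty"`         // e.g. user vs. internal
	Severity     string `json:"severity,omitempty"`     // e.g. minor vs. fatal
}

// Status shows the embedding: the error rides along in the status section.
type Status struct {
	Ready         bool `json:"ready"`
	ResourceError `json:",inline"`
}

func main() {}
```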

Signed-off-by: Matt Richerson <[email protected]>

* copyrights and print column ordering

Signed-off-by: Matt Richerson <[email protected]>

* re-vendor

Signed-off-by: Matt Richerson <[email protected]>

---------

Signed-off-by: Matt Richerson <[email protected]>
#214)

This change allows container workflows to open ports. These ports are
opened on the host nodes (i.e. NNF nodes) where the containers are
running, which enables traffic from outside of the network to reach the
containers through the IP address of the NNF node and the port. An
application on the compute node can contact the container with
<NNF_NODE_IP>:<PORT>.

The port number(s) can be retrieved via the NNF_CONTAINER_PORTS
environment variable. This environment variable is available inside of
the containers. It is also provided to the Workflow so that Flux can
inform the application on the compute node of which port(s) to use. If
multiple ports are desired, the environment variable will provide a
comma-separated list of port numbers.
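
For example, a container application might read its allocated ports like this (a sketch, assuming only the comma-separated format described above):

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

func main() {
	// NNF_CONTAINER_PORTS holds one port, or a comma-separated list
	// when the profile requests more than one.
	raw := os.Getenv("NNF_CONTAINER_PORTS")
	if raw == "" {
		fmt.Println("no ports allocated")
		return
	}
	for _, port := range strings.Split(raw, ",") {
		fmt.Printf("allocated port: %s\n", port)
	}
}
```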

Ports are requested via the NnfContainerProfile's `numPorts`. **A system
admin must enable the `Ports` port range in the `SystemConfiguration`
before ports can be requested**. If not, the NnfPortManager will not
allocate any ports.

More details:
- Enabled default NnfPortManager to manage port allocation
- Port allocation occurs in the Setup State
- Port de-allocation occurs in the Teardown State
- User Container Pods are now destroyed in the Teardown State prior to
  Port de-allocation
- Added `example-mpi-webserver` NnfContainerProfile to show use of the
  environment variable with a simple webserver
- Added container teardown + port allocation to workflow deletion

Signed-off-by: Blake Devcich <[email protected]>
Create an option in the Lustre section of the storage profile to only create an MGT. This
option will be used by admins to create a pool of MGTs that can be used as external MGTs
for other Lustre file systems.

Signed-off-by: Matt Richerson <[email protected]>
Once ports have been released, they now go into a cooldown period. The
cooldown period can be set via the SystemConfiguration. It defaults to
60s, which corresponds to the kernel's default TIME_WAIT.

Once the cooldown period expires, the port is then removed from the
allocation list - freeing it up for reuse.

The cooldown period is checked in the following situations:

1. When a new user requests a port
2. When a user has requested a port but there are none left (forcing a
   requeue)
3. When a user has released a port

For #2, the reconciler will requeue for the cooldown period in hopes
that a port has been freed in that time. If not, the requeue will
continue until the request has been satisfied.

If these situations do not occur, the port will remain in cooldown even
though the time has expired. Since there is no reason to reconcile, the
cooldown period will not be checked until a new port is requested or
released (or the reconcile fires for another reason).
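
A simplified sketch of that bookkeeping (types and names are illustrative):

```go
package main

import "time"

// portAllocation tracks one port; released ports keep their release time
// so the cooldown can be enforced.
type portAllocation struct {
	port       uint16
	inUse      bool
	releasedAt time.Time
}

// expireCooldowns drops released ports whose cooldown (default 60s) has
// elapsed, freeing them for reuse; everything else is kept.
func expireCooldowns(allocs []portAllocation, cooldown time.Duration, now time.Time) []portAllocation {
	kept := allocs[:0]
	for _, a := range allocs {
		if !a.inUse && now.Sub(a.releasedAt) >= cooldown {
			continue // cooldown expired; remove from the allocation list
		}
		kept = append(kept, a)
	}
	return kept
}

func main() {
	allocs := []portAllocation{{port: 5000, releasedAt: time.Now().Add(-2 * time.Minute)}}
	_ = expireCooldowns(allocs, 60*time.Second, time.Now())
}
```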

Signed-off-by: Blake Devcich <[email protected]>
Use the routines provided by DWS.

Signed-off-by: Dean Roehrich <[email protected]>
* Use external MGS from a pool of PersistentStorageInstance MGSs

 - Add an NnfStorageProfile option "standaloneMgtPoolName" to create a Lustre file system
   that only has an MGT. This option only works with the "create_persistent" directive.
 - Apply a label to the PersistentStorageInstance with the value of the "standaloneMgtPoolName"
   option. This adds the PersistentStorageInstance to a pool of ExternalMGSs.
 - Change the externalMgs option in the NnfStorageProfile to also accept "pool:[poolname]", where
   "poolname" is the name of an MGS pool (see the sketch below).
 - Modify the Setup phase to pick an MGS from the pool and add a reference to the PersistentStorageInstance.
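
A sketch of how that option value might be interpreted (illustrative only, not the actual implementation):

```go
package main

import (
	"fmt"
	"strings"
)

// parseExternalMgs distinguishes the new pool form from a plain MGS NID:
// "pool:example-pool" selects from the named MGS pool, anything else is
// treated as an explicit external MGS address as before.
func parseExternalMgs(value string) (pool string, fromPool bool) {
	if name, ok := strings.CutPrefix(value, "pool:"); ok {
		return name, true
	}
	return "", false
}

func main() {
	fmt.Println(parseExternalMgs("pool:example-pool")) // example-pool true
	fmt.Println(parseExternalMgs("172.0.0.1@tcp"))     // "" false
}
```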

Signed-off-by: Matt Richerson <[email protected]>

* review comments

Signed-off-by: Matt Richerson <[email protected]>

---------

Signed-off-by: Matt Richerson <[email protected]>
- Adds a timeout to Prerun and errors out if the containers do not start
- Moved timeout anonymous functions to be normal functions
- Fixed some issues with the example profiles for testing + added
  example-mpi-fail
- Improved error handling in Postrun
- Made Pre/PostRunTimeSeconds a pointer to allow for disabling and defaulting
  (see the sketch below)
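
A small sketch of those pointer semantics (the default value here is made up):

```go
package main

import (
	"fmt"
	"time"
)

// preRunTimeout distinguishes "unset" from "explicitly zero": nil means
// use the default, 0 disables the timeout, anything else is the timeout.
func preRunTimeout(seconds *int64) (timeout time.Duration, enabled bool) {
	if seconds == nil {
		return 300 * time.Second, true // hypothetical default
	}
	if *seconds == 0 {
		return 0, false // explicitly disabled
	}
	return time.Duration(*seconds) * time.Second, true
}

func main() {
	zero := int64(0)
	fmt.Println(preRunTimeout(nil))   // 5m0s true
	fmt.Println(preRunTimeout(&zero)) // 0s false
}
```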

Signed-off-by: Blake Devcich <[email protected]>
There is an intermittent issue where mpirun cannot contact the workers
even though nslookup can successfully resolve their DNS hostnames in the
Init Container. This is seen somewhat infrequently, but has happened
often enough. The end result causes the user containers to restart (if
restartLimit > 0), and it always seems to work on the second try.

This change appears to solve the issue by having the Init Container use
mpirun to contact the workers and simply retrieve their hostnames. This
replaces the use of nslookup and ensures that mpirun can be successful
on the launcher. To support this, the Init Container must run as the
given UID/GID rather than root.
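
A hedged sketch of what such an Init Container might look like (the name, image, and arguments are illustrative):

```go
package main

import corev1 "k8s.io/api/core/v1"

// mpiInitContainer sketches the idea: one init container runs mpirun
// across every worker, asking only for hostnames, to prove the launcher
// can reach all workers before the real job starts. It runs with the
// workflow's UID/GID rather than root so mpirun behaves as it will later.
func mpiInitContainer(image string, workerHosts string, uid, gid int64) corev1.Container {
	return corev1.Container{
		Name:    "mpi-wait-for-workers",
		Image:   image,
		Command: []string{"mpirun", "--host", workerHosts, "hostname"},
		SecurityContext: &corev1.SecurityContext{
			RunAsUser:  &uid,
			RunAsGroup: &gid,
		},
	}
}

func main() {
	_ = mpiInitContainer("example-mpi:latest", "worker-0,worker-1", 1050, 100)
}
```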

It also speeds up container start times as we only need to run 1 Init
Container for all of the workers rather than an Init Container for each
worker.

I have not been able to reproduce the original error using int-test,
which would (in)frequently catch this.

Signed-off-by: Blake Devcich <[email protected]>
* Use GetUserMessage() for resource errors

Use the GetUserMessage() receiver function when getting the user message for a resource error.
This will properly prefix the message with the error type.
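
A sketch of the described behavior, building on the hypothetical ResourceError shape sketched earlier (the exact format string is illustrative):

```go
package main

import "fmt"

// ResourceError repeats the hypothetical shape sketched earlier.
type ResourceError struct {
	UserMessage string
	Type        string
}

// GetUserMessage prefixes the user message with the error type, as the
// commit describes.
func (e *ResourceError) GetUserMessage() string {
	return fmt.Sprintf("%s error: %s", e.Type, e.UserMessage)
}

func main() {
	e := &ResourceError{Type: "user", UserMessage: "directive is invalid"}
	fmt.Println(e.GetUserMessage()) // user error: directive is invalid
}
```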

Signed-off-by: Matt Richerson <[email protected]>

* re-vendor

Signed-off-by: Matt Richerson <[email protected]>

---------

Signed-off-by: Matt Richerson <[email protected]>
@matthew-richerson matthew-richerson requested review from roehrich-hpe, bdevcich and ajfloeder and removed request for roehrich-hpe and bdevcich September 7, 2023 19:29
@roehrich-hpe
Contributor

Do NOT use squash!

@matthew-richerson matthew-richerson merged commit 2af650d into releases/v0 Sep 8, 2023
@matthew-richerson matthew-richerson deleted the release-v0.0.5 branch September 8, 2023 13:58