Release v0.0.5 #224
Merged
Conversation
…ofile Fix and use verifyPinnedContainerProfile() in tests
This affected some related functions, so I ended up reworking their function signatures and collecting them in a new nnfcontainerprofile_helpers.go. Signed-off-by: Dean Roehrich <[email protected]>
As the Ginkgo book says, "Declare in container nodes, initialize in setup nodes". I was stumped by why this test wasn't failing. When I corrected the declare-before-initialize, it became more clear. Signed-off-by: Dean Roehrich <[email protected]>
`containerHandler()` was much too large. This change moves all of the underlying User Container code into its own helper file. To reduce the number of function parameters (e.g. workflow, profile, ctx) passed into these functions, a new struct was added so that the helper functions could make use of it. Also renamed it to `userContainerHandler()`. Signed-off-by: Blake Devcich <[email protected]>
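A minimal sketch of the parameter-bundling pattern described in that commit, using hypothetical stand-in types (the real helper works with the DWS Workflow and NnfContainerProfile API types, and its field names may differ):

```go
package main

import (
	"context"
	"fmt"
)

// Hypothetical stand-ins for the workflow and profile resources.
type Workflow struct{ Name, Namespace string }
type ContainerProfile struct{ Name string }

// userContainerHelper bundles the state the user-container helper functions
// all need, so they no longer take (ctx, workflow, profile, ...) individually.
type userContainerHelper struct {
	ctx      context.Context
	workflow *Workflow
	profile  *ContainerProfile
}

func (u *userContainerHelper) describe() string {
	return fmt.Sprintf("workflow %s/%s with profile %s",
		u.workflow.Namespace, u.workflow.Name, u.profile.Name)
}

func main() {
	h := &userContainerHelper{
		ctx:      context.Background(),
		workflow: &Workflow{Name: "my-container-workflow", Namespace: "default"},
		profile:  &ContainerProfile{Name: "example-mpi"},
	}
	fmt.Println(h.describe())
}
```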
Add an optional UserID/GroupID field to NnfContainerProfileData. If either of these is specified, then the chosen container profile must match the user ID or group ID specified in the workflow resource that references the profile. Add tests. Signed-off-by: Dean Roehrich <[email protected]>
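A simplified sketch of the matching rule described above, assuming pointer-valued UserID/GroupID fields (nil meaning "unrestricted"); the real field names and error wording in nnf-sos may differ:

```go
package main

import "fmt"

// Hypothetical, simplified stand-ins for the relevant fields.
type containerProfileData struct {
	UserID  *uint32 // if set, the workflow's user ID must match
	GroupID *uint32 // if set, the workflow's group ID must match
}

type workflowSpec struct {
	UserID  uint32
	GroupID uint32
}

// checkProfilePermissions mirrors the rule: a profile that pins a UserID or
// GroupID may only be used by a workflow with that same ID.
func checkProfilePermissions(p containerProfileData, w workflowSpec) error {
	if p.UserID != nil && *p.UserID != w.UserID {
		return fmt.Errorf("profile restricted to user ID %d, workflow has %d", *p.UserID, w.UserID)
	}
	if p.GroupID != nil && *p.GroupID != w.GroupID {
		return fmt.Errorf("profile restricted to group ID %d, workflow has %d", *p.GroupID, w.GroupID)
	}
	return nil
}

func main() {
	uid := uint32(1050)
	err := checkProfilePermissions(containerProfileData{UserID: &uid}, workflowSpec{UserID: 2000, GroupID: 100})
	fmt.Println(err) // profile restricted to user ID 1050, workflow has 2000
}
```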
The NnfContainerProfile test was targeting a failure in the kube-apiserver, by tripping a kubebuilder:validation for RetryLimit. That has been changed to something that will pass the kube-apiserver, and then will proceed on to fail in the webhook. Now we can distinguish when Create() errors are responses from the webhook, versus when they are errors about the webhook not being installed. Signed-off-by: Dean Roehrich <[email protected]>
On LLNL systems, the host's default capabilities differ from those on HPE systems due to differences in the container runtime and its configuration. This change drops all capabilities and then adds back the required ones to ensure mpirun can access the worker pods. Tested capabilities by:
- Dropping ALL capabilities without the specific Adds (and failing)
- Then adding in the specific Adds (and passing)
Signed-off-by: Blake Devcich <[email protected]>
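For reference, a sketch of the "drop ALL, add back only what is needed" pattern using the Kubernetes SecurityContext API; the capabilities added back here are placeholders, not necessarily the ones nnf-sos actually adds:

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

func main() {
	// Drop every capability the runtime would grant by default, then add back a
	// minimal set. "SETUID"/"SETGID" are placeholder values for illustration.
	sc := &corev1.SecurityContext{
		Capabilities: &corev1.Capabilities{
			Drop: []corev1.Capability{"ALL"},
			Add:  []corev1.Capability{"SETUID", "SETGID"},
		},
	}
	fmt.Printf("%+v\n", sc.Capabilities)
}
```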
Signed-off-by: Dean Roehrich <[email protected]>
…e resources (#212) The system profiles must be in the NNF_STORAGE_PROFILE_NAMESPACE and NNF_CONTAINER_PROFILE_NAMESPACE namespaces, as specified in the Deployment spec. Make the 'pinned' flag immutable. Signed-off-by: Dean Roehrich <[email protected]>
This adds `DW_GLOBAL_*` storage support to container workflows/profiles, backed by the `LustreFilesystem` resource. This allows users to mount global lustre filesystems into user containers. An administrator must first edit the `LustreFilesystem` resource to add the workflow's namespace (e.g. `default`) to the list of namespaces, along with a mode. The mode defaults to `ReadWriteMany` if not set. Doing so will create a PVC that can be used to mount the filesystem. A `DW_GLOBAL_*` storage must also be added to the NnfContainerProfile. See `config/examples` and `config/samples` for more detail. The value for the `DW_GLOBAL_*` parameter is the path of the lustre filesystem, e.g.:

```
#DW jobdw type=gfs2 name=my-local-storage capacity=100GB
#DW container name=my-container-workflow profile=example-mpi \
    DW_JOB_foo_local_storage=my-local-storage \
    DW_GLOBAL_foo_global_lustre=/lus/sawbill
```

Signed-off-by: Blake Devcich <[email protected]>
* github-46: Use ResourceError when returning errors. This commit uses the new ResourceError struct embedded in the status section of the DWS/nnf-sos resources. When returning an error, use the NewResourceError() call to return a ResourceError and fill it in with the correct information. This allows the end user and WLM to make informed decisions about what to do when there's an error. Signed-off-by: Matt Richerson <[email protected]>
* copyrights and print column ordering. Signed-off-by: Matt Richerson <[email protected]>
* re-vendor. Signed-off-by: Matt Richerson <[email protected]>
Signed-off-by: Matt Richerson <[email protected]>
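A hypothetical, much-simplified picture of the ResourceError idea: a structured error carrying a user-facing message that can be surfaced in a resource's status. The real type and its NewResourceError()/GetUserMessage() helpers live in the DWS API; the fields and message format below are illustrative only:

```go
package main

import "fmt"

// resourceError is an illustrative stand-in for the DWS ResourceError struct.
type resourceError struct {
	UserMessage  string // safe to show to the end user / WLM
	DebugMessage string // detail for administrators and developers
	Type         string // e.g. "User", "WLM", "Fatal" (placeholder categories)
}

func (e *resourceError) Error() string { return e.DebugMessage }

// GetUserMessage prefixes the user-facing message with the error type.
func (e *resourceError) GetUserMessage() string {
	return fmt.Sprintf("%s error: %s", e.Type, e.UserMessage)
}

func main() {
	err := &resourceError{
		UserMessage:  "container profile 'example-mpi' not found",
		DebugMessage: "NnfContainerProfile example-mpi not found in profile namespace",
		Type:         "User",
	}
	fmt.Println(err.GetUserMessage())
}
```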
#214) This change allows container workflows to open ports. These ports are opened on the host nodes (i.e. NNF nodes) where the containers are running. This enables traffic from outside of the network through the IP address of the NNF node and the port. An application on the compute node can contact the container with <NNF_NODE_IP>:<PORT>. The port number(s) can be retrieved via the NNF_CONTAINER_PORTS environment variable. This environment variable is available inside of the containers. It is also provided to the Workflow so that Flux can inform the application on the compute node of which port(s) to use. If multiple ports are desired, the environment variable will provide a comma-separated list of port numbers. Ports are requested via the NnfContainerProfile's `numPorts`. **A system admin must enable the `Ports` port range in the `SystemConfiguration` before ports can be requested**. If not, the NnfPortManager will not allocate any ports. More details:
- Enabled default NnfPortManager to manage port allocation
- Port allocation occurs in the Setup State
- Port de-allocation occurs in the Teardown State
- User Container Pods are now destroyed in the Teardown State prior to Port de-allocation
- Added `example-mpi-webserver` NnfContainerProfile to show use of the environment variable with a simple webserver
- Added container teardown + port allocation to workflow deletion
Signed-off-by: Blake Devcich <[email protected]>
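A minimal sketch of how code running inside the user container might read the NNF_CONTAINER_PORTS variable described above, assuming only what the commit states (a single port or a comma-separated list of port numbers):

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// containerPorts returns the port numbers allocated to this container, parsed
// from NNF_CONTAINER_PORTS (e.g. "8080" or "8080,8081").
func containerPorts() ([]int, error) {
	val := os.Getenv("NNF_CONTAINER_PORTS")
	if val == "" {
		return nil, fmt.Errorf("NNF_CONTAINER_PORTS is not set; does the profile request numPorts?")
	}

	var ports []int
	for _, p := range strings.Split(val, ",") {
		port, err := strconv.Atoi(strings.TrimSpace(p))
		if err != nil {
			return nil, fmt.Errorf("invalid port %q: %w", p, err)
		}
		ports = append(ports, port)
	}
	return ports, nil
}

func main() {
	ports, err := containerPorts()
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println("allocated ports:", ports)
}
```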
Create an option in the Lustre section of the storage profile to only create an MGT. This option will be used by admins to create a pool of MGTs that can be used as external MGTs for other Lustre file systems. Signed-off-by: Matt Richerson <[email protected]>
Once ports have been released, they now go into a Cooldown period. The cooldown period can be set via the SystemConfiguration. This defaults to 60s and corresponds with the kernel's default TIME_WAIT. Once the cooldown period expires, the port is then removed from the allocation list, freeing it up for reuse. The cooldown period is checked in the following situations:
1. When a new user requests a port
2. When a user has requested a port but there are none left (forcing a requeue)
3. When a user has released a port
For #2, the reconciler will requeue for the cooldown period in hopes that a port has been freed in that time. If not, the requeue will continue until that request has been satisfied. If these situations do not occur, the port will remain in Cooldown even though the time has expired. Since there is no reason to reconcile, the cooldown period will not be checked until a new port is requested/released (or the reconcile fires for another reason). Signed-off-by: Blake Devcich <[email protected]>
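A hypothetical sketch of the cooldown bookkeeping described above; the real NnfPortManager tracks allocations in its resource status, and the types and field names here are illustrative only:

```go
package main

import (
	"fmt"
	"time"
)

// portAllocation is an illustrative stand-in for a port manager status entry.
type portAllocation struct {
	Port       uint16
	InUse      bool
	ReleasedAt time.Time // set when the port is released into cooldown
}

// expireCooldowns drops released ports whose cooldown has elapsed, freeing
// them for reuse. cooldown comes from the SystemConfiguration (default 60s,
// matching the kernel's default TIME_WAIT).
func expireCooldowns(allocs []portAllocation, cooldown time.Duration, now time.Time) []portAllocation {
	var kept []portAllocation
	for _, a := range allocs {
		if !a.InUse && now.Sub(a.ReleasedAt) >= cooldown {
			continue // cooldown expired: drop the allocation so the port can be reused
		}
		kept = append(kept, a)
	}
	return kept
}

func main() {
	now := time.Now()
	allocs := []portAllocation{
		{Port: 5000, InUse: true},
		{Port: 5001, InUse: false, ReleasedAt: now.Add(-90 * time.Second)}, // past cooldown
		{Port: 5002, InUse: false, ReleasedAt: now.Add(-10 * time.Second)}, // still cooling down
	}
	fmt.Println(expireCooldowns(allocs, 60*time.Second, now))
}
```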
Use the routines provided by DWS. Signed-off-by: Dean Roehrich <[email protected]>
* Use external MGS from a pool of PersistentStorageInstance MGSs
- Add an NnfStorageProfile option "standaloneMgtPoolName" to create a Lustre file system that only has an MGT. This option only works with the "create_persistent" directive.
- Apply a label to the PersistentStorageInstance with the value of the "standaloneMgtPoolName" option. This adds the PersistentStorageInstance to a pool of ExternalMGSs.
- Change the externalMgs option in the NnfStorageProfile to also accept "pool:[poolname]" where "poolname" is the name of an MGS pool.
- Modify the Setup phase to pick an MGS from the pool and add a reference to the PersistentStorageInstance.
Signed-off-by: Matt Richerson <[email protected]>
* review comments
Signed-off-by: Matt Richerson <[email protected]>
Signed-off-by: Matt Richerson <[email protected]>
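A hypothetical, simplified picture of the pool-selection step described above. In the real controller the pool members are PersistentStorageInstance resources carrying a label whose value is the "standaloneMgtPoolName", and selection is done with a label-selector List call; the label key and types below are placeholders:

```go
package main

import "fmt"

// persistentStorage is an illustrative stand-in for PersistentStorageInstance.
type persistentStorage struct {
	Name   string
	Labels map[string]string
}

// pickMgsFromPool returns the first storage whose pool label matches poolName,
// as a workflow using externalMgs "pool:[poolname]" would need.
func pickMgsFromPool(storages []persistentStorage, poolName string) (*persistentStorage, error) {
	for i := range storages {
		if storages[i].Labels["nnf.example/standalone-mgt-pool"] == poolName { // placeholder label key
			return &storages[i], nil
		}
	}
	return nil, fmt.Errorf("no MGS available in pool %q", poolName)
}

func main() {
	pool := []persistentStorage{
		{Name: "mgt-0", Labels: map[string]string{"nnf.example/standalone-mgt-pool": "mgt-pool"}},
	}
	mgs, err := pickMgsFromPool(pool, "mgt-pool")
	fmt.Println(mgs.Name, err)
}
```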
Signed-off-by: Blake Devcich <[email protected]>
- Adds a timeout to Prerun and errors out if the containers do not start
- Moved timeout anonymous functions to be normal functions
- Fixed some issues with the example profiles for testing + added example-mpi-fail
- Improved error handling in Postrun
- Made Pre/PostRunTimeSeconds a pointer to allow for disabling and defaulting
Signed-off-by: Blake Devcich <[email protected]>
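A brief sketch of the pointer-field pattern behind the last item: with a plain integer there is no way to tell "not set" apart from an explicit 0, so a pointer lets nil mean "use the default" while 0 disables the timeout. The field name and default value here are illustrative:

```go
package main

import "fmt"

// containerSpec is an illustrative stand-in for the profile field.
type containerSpec struct {
	PreRunTimeoutSeconds *int64
}

// preRunTimeout resolves the effective timeout: nil falls back to the default,
// an explicit 0 disables the timeout entirely.
func preRunTimeout(spec containerSpec, defaultSeconds int64) (seconds int64, disabled bool) {
	if spec.PreRunTimeoutSeconds == nil {
		return defaultSeconds, false // not set: use the default
	}
	if *spec.PreRunTimeoutSeconds == 0 {
		return 0, true // explicitly disabled
	}
	return *spec.PreRunTimeoutSeconds, false
}

func main() {
	zero := int64(0)
	fmt.Println(preRunTimeout(containerSpec{}, 300))                            // 300 false
	fmt.Println(preRunTimeout(containerSpec{PreRunTimeoutSeconds: &zero}, 300)) // 0 true
}
```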
There is an intermittent issue where mpirun cannot contact the workers even though nslookup can successfully resolve their DNS hostnames in the Init Container. This is seen somewhat infrequently, but has happened enough. The end result causes the user containers to restart (if restartLimit > 0), and it always seems to work on the second try. This change appears to solve the issue by having the Init Container use mpirun to contact the workers and simply get their hostnames. This replaces the use of nslookup and ensures that mpirun can be successful on the launcher. To support this, the Init Container must run as the given UID/GID rather than root. It also speeds up container start times, as we only need to run one Init Container for all of the workers rather than an Init Container for each worker. I have not been able to reproduce the original error using int-test, which would (in)frequently catch this. Signed-off-by: Blake Devcich <[email protected]>
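An illustrative sketch of the kind of Init Container this describes: it runs mpirun as the workflow's UID/GID rather than root, just to collect the workers' hostnames and prove mpirun can reach them before the real job starts. The image, command arguments, and ID values below are placeholders, not the ones nnf-sos uses:

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

func main() {
	uid, gid := int64(1050), int64(1051)
	initContainer := corev1.Container{
		Name:    "mpi-wait-for-workers",
		Image:   "example/mpi-launcher:latest", // placeholder image
		Command: []string{"mpirun", "--host", "worker-0,worker-1", "hostname"},
		SecurityContext: &corev1.SecurityContext{
			RunAsUser:  &uid, // run as the workflow's UID/GID, not root
			RunAsGroup: &gid,
		},
	}
	fmt.Printf("%s runs as %d:%d\n", initContainer.Name,
		*initContainer.SecurityContext.RunAsUser, *initContainer.SecurityContext.RunAsGroup)
}
```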
* Use GetUserMessage() for resource errors. Use the GetUserMessage() receiver function when getting the user message for a resource error. This will properly prefix the message with the error type. Signed-off-by: Matt Richerson <[email protected]>
* re-vendor. Signed-off-by: Matt Richerson <[email protected]>
Signed-off-by: Matt Richerson <[email protected]>
matthew-richerson requested review from roehrich-hpe, bdevcich, and ajfloeder and removed request for roehrich-hpe and bdevcich on September 7, 2023 19:29
bdevcich approved these changes on Sep 7, 2023
ajfloeder approved these changes on Sep 7, 2023
roehrich-hpe approved these changes on Sep 7, 2023
Do NOT use squash!