Release v0.0.5 #224

Merged
matthew-richerson merged 21 commits into releases/v0 from release-v0.0.5 on Sep 8, 2023
Conversation

matthew-richerson
Contributor

No description provided.

roehrich-hpe and others added 21 commits June 15, 2023 12:22
This affected some related functions, so I ended up reworking their function
signatures and collecting them in a new nnfcontainerprofile_helpers.go.

Signed-off-by: Dean Roehrich <[email protected]>
…ofile

Fix and use verifyPinnedContainerProfile() in tests
As the Ginkgo book says, "Declare in container nodes, initialize in setup
nodes".

I was stumped by why this test wasn't failing. When I corrected the
declare-before-initialize problem, it became clearer.
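
A minimal sketch of that Ginkgo pattern (the profile type and names here are illustrative stand-ins, not the actual test code):

```go
package profiles_test

import (
	"testing"

	. "github.com/onsi/ginkgo/v2"
	. "github.com/onsi/gomega"
)

// ExampleProfile is a hypothetical stand-in for the real profile type.
type ExampleProfile struct{ Name string }

var _ = Describe("NnfContainerProfile", func() {
	// Declare in the container node: the variable exists but has no value yet.
	var profile *ExampleProfile

	// Initialize in the setup node, so every spec starts from a fresh value.
	BeforeEach(func() {
		profile = &ExampleProfile{Name: "example"}
	})

	It("sees an initialized profile", func() {
		Expect(profile).NotTo(BeNil())
	})
})

func TestProfiles(t *testing.T) {
	RegisterFailHandler(Fail)
	RunSpecs(t, "Profile Suite")
}
```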

Signed-off-by: Dean Roehrich <[email protected]>
`containerHandler()` was much too large. This change moves all of the
underlying User Container code into its own helper file. To reduce the
number of function parameters (e.g. workflow, profile, ctx) passed into
these functions, a new struct was added so that the helper functions
could make use of it.
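
A rough sketch of that refactor pattern (the types and fields here are illustrative placeholders, not the real ones):

```go
package main

import "context"

// Hypothetical stand-ins for the real workflow and profile types.
type Workflow struct{ Name string }
type ContainerProfile struct{ Name string }

// nnfUserContainer bundles the values every helper needs, so helpers
// hang off a receiver instead of repeating (ctx, workflow, profile, ...)
// in every signature.
type nnfUserContainer struct {
	ctx      context.Context
	workflow *Workflow
	profile  *ContainerProfile
}

// Helpers read shared state from the receiver.
func (c *nnfUserContainer) launcherName() string {
	return c.workflow.Name + "-launcher"
}

func main() {
	c := &nnfUserContainer{
		ctx:      context.Background(),
		workflow: &Workflow{Name: "my-workflow"},
		profile:  &ContainerProfile{Name: "example-mpi"},
	}
	_ = c.launcherName()
}
```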

Also renamed it to `userContainerHandler()`.

Signed-off-by: Blake Devcich <[email protected]>
Add an optional UserID/GroupID field to NnfContainerProfileData.  If either
of these is specified, then the chosen container profile must match the
user ID or group ID specified in the workflow resource that references the
profile.
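
A hedged sketch of that matching rule (the function and its signature are illustrative, not the actual implementation):

```go
package main

import "fmt"

// checkProfileIdentity enforces the rule described above: when the
// profile pins a UserID or GroupID, the workflow's UID/GID must match.
func checkProfileIdentity(profileUID, profileGID *uint32, wfUID, wfGID uint32) error {
	if profileUID != nil && *profileUID != wfUID {
		return fmt.Errorf("profile is restricted to user ID %d", *profileUID)
	}
	if profileGID != nil && *profileGID != wfGID {
		return fmt.Errorf("profile is restricted to group ID %d", *profileGID)
	}
	return nil
}

func main() {
	uid := uint32(1050)
	fmt.Println(checkProfileIdentity(&uid, nil, 1050, 100)) // <nil>: UIDs match
}
```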

Add tests.

Signed-off-by: Dean Roehrich <[email protected]>
The NnfContainerProfile test was targeting a failure in the kube-apiserver
by tripping a kubebuilder:validation marker for RetryLimit. That has been
changed to something that will pass the kube-apiserver and then proceed on
to fail in the webhook.

Now we can distinguish when Create() errors are responses from the webhook,
versus when they are errors about the webhook not being installed.

Signed-off-by: Dean Roehrich <[email protected]>
On LLNL systems, the host's default capabilities are different from those
on HPE systems, due to differences in the container runtimes and their
configuration.

This change drops all capabilities and then adds in the required ones to
ensure mpirun can access the worker pods.
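
A minimal sketch of the drop-then-add approach, assuming the standard Kubernetes SecurityContext types; the Add list below is illustrative, not the exact set the controller uses:

```go
package main

import corev1 "k8s.io/api/core/v1"

// userContainerSecurityContext drops every default capability, then adds
// back only what is needed; the two capabilities named here are examples.
func userContainerSecurityContext() *corev1.SecurityContext {
	return &corev1.SecurityContext{
		Capabilities: &corev1.Capabilities{
			Drop: []corev1.Capability{"ALL"},
			Add:  []corev1.Capability{"NET_BIND_SERVICE", "SYS_CHROOT"},
		},
	}
}

func main() {
	_ = userContainerSecurityContext()
}
```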

Tested capabilities by:
- Dropping ALL capabilities without the specific Adds (and failing)
- Then adding in the specific Adds (and passing)

Signed-off-by: Blake Devcich <[email protected]>
…e resources (#212)

The system profiles must be in the NNF_STORAGE_PROFILE_NAMESPACE and
NNF_CONTAINER_PROFILE_NAMESPACE namespaces, as specified in the Deployment
spec.

Make the 'pinned' flag immutable.

Signed-off-by: Dean Roehrich <[email protected]>
This adds `DW_GLOBAL_*` storage support, backed by the `LustreFilesystem`
resource, to container workflows/profiles. This allows users to mount
global lustre filesystems into user containers.

An administrator must first edit the `LustreFilesystem` resource to add the
workflow's namespace (e.g. `default`) to the list of namespaces, along
with a mode. The mode defaults to `ReadWriteMany` if not set. Doing so
will create a PVC that can be used to mount the filesystem.

A `DW_GLOBAL_*` storage must also be added to the NnfContainerProfile.
See `config/examples` and `config/samples` for more detail.

The value for the `DW_GLOBAL_*` parameter is the path of the lustre
filesystem, e.g.:

```
#DW jobdw type=gfs2 name=my-local-storage capacity=100GB
#DW container name=my-container-workflow profile=example-mpi \
    DW_JOB_foo_local_storage=my-local-storage \
    DW_GLOBAL_foo_global_lustre=/lus/sawbill
```

Signed-off-by: Blake Devcich <[email protected]>
* github-46: Use ResourceError when returning errors

This commit uses the new ResourceError struct embedded in the status section of the DWS/nnf-sos
resources. When returning an error, use the NewResourceError() call to return a ResourceError and
fill it in with the correct information. This allows the end user and WLM to make informed decisions
about what to do when there's an error.
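
As a rough illustration of the pattern (a hypothetical shape, not the actual DWS definition):

```go
package main

// ResourceError is a hypothetical sketch of an error struct that a
// controller embeds in a resource's status, giving the end user and the
// WLM enough context to decide how to react.
type ResourceError struct {
	UserMessage  string `json:"userMessage,omitempty"`  // safe to show end users
	DebugMessage string `json:"debugMessage,omitempty"` // detail for admins
	Type         string `json:"type,omitempty"`         // e.g. user vs. internal
	Severity     string `json:"severity,omitempty"`     // e.g. minor vs. fatal
}

// Status shows the embedding: the error rides along in the status section.
type Status struct {
	Ready         bool `json:"ready"`
	ResourceError `json:",inline"`
}

func main() {}
```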

Signed-off-by: Matt Richerson <[email protected]>

* copyrights and print column ordering

Signed-off-by: Matt Richerson <[email protected]>

* re-vendor

Signed-off-by: Matt Richerson <[email protected]>

---------

Signed-off-by: Matt Richerson <[email protected]>
#214)

This change allows container workflows to open ports. These ports are
opened on the host nodes (i.e. NNF nodes) where the containers are
running, which enables traffic from outside of the network to reach the
containers through the IP address of the NNF node and the port. An
application on the compute node can contact the container with
<NNF_NODE_IP>:<PORT>.

The port number(s) can be retrieved via the NNF_CONTAINER_PORTS
environment variable. This environment variable is available inside of
the containers. It is also provided to the Workflow so that Flux can
inform the application on the compute node of which port(s) to use. If
multiple ports are desired, the environment variable will provide a
comma-separated list of port numbers.
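
For example, a container application might read its allocated ports like this (a sketch, assuming only the comma-separated format described above):

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

func main() {
	// NNF_CONTAINER_PORTS holds one port, or a comma-separated list
	// when the profile requests more than one.
	raw := os.Getenv("NNF_CONTAINER_PORTS")
	if raw == "" {
		fmt.Println("no ports allocated")
		return
	}
	for _, port := range strings.Split(raw, ",") {
		fmt.Printf("allocated port: %s\n", port)
	}
}
```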

Ports are requested via the NnfContainerProfile's `numPorts`. **A system
admin must enable the `Ports` port range in the `SystemConfiguration`
before ports can be requested**. If not, the NnfPortManager will not
allocate any ports.

More details:
- Enabled default NnfPortManager to manage port allocation
- Port allocation occurs in the Setup State
- Port de-allocation occurs in the Teardown State
- User Container Pods are now destroyed in the Teardown State prior to
  Port de-allocation
- Added `example-mpi-webserver` NnfContainerProfile to show use of the
  environment variable with a simple webserver
- Added container teardown + port allocation to workflow deletion

Signed-off-by: Blake Devcich <[email protected]>
Create an option in the Lustre section of the storage profile to only create an MGT. This
option will be used by admins to create a pool of MGTs that can be used as external MGTs
for other Lustre file systems.

Signed-off-by: Matt Richerson <[email protected]>
Once ports have been released, they now go into a cooldown period. The
cooldown period can be set via the SystemConfiguration. It defaults to
60s, which corresponds to the kernel's default TIME_WAIT.

Once the cooldown period expires, the port is then removed from the
allocation list - freeing it up for reuse.

The cooldown period is checked in the following situations:

1. When a new user requests a port
2. When a user has requested a port but there are none left (forcing a
   requeue)
3. When a user has released a port

For #2, the reconciler will requeue for the cooldown period in hopes
that a port has been freed in that time. If not, the requeue will
continue until the request has been satisfied.

If these situations do not occur, the port will remain in cooldown even
though the time has expired. Since there is no reason to reconcile, the
cooldown period will not be checked until a new port is requested or
released (or the reconcile fires for another reason).
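
A simplified sketch of that bookkeeping (types and names are illustrative):

```go
package main

import "time"

// portAllocation tracks one port; released ports keep their release time
// so the cooldown can be enforced.
type portAllocation struct {
	port       uint16
	inUse      bool
	releasedAt time.Time
}

// expireCooldowns drops released ports whose cooldown (default 60s) has
// elapsed, freeing them for reuse; everything else is kept.
func expireCooldowns(allocs []portAllocation, cooldown time.Duration, now time.Time) []portAllocation {
	kept := allocs[:0]
	for _, a := range allocs {
		if !a.inUse && now.Sub(a.releasedAt) >= cooldown {
			continue // cooldown expired; remove from the allocation list
		}
		kept = append(kept, a)
	}
	return kept
}

func main() {
	allocs := []portAllocation{{port: 5000, releasedAt: time.Now().Add(-2 * time.Minute)}}
	_ = expireCooldowns(allocs, 60*time.Second, time.Now())
}
```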

Signed-off-by: Blake Devcich <[email protected]>
Use the routines provided by DWS.

Signed-off-by: Dean Roehrich <[email protected]>
* Use external MGS from a pool of PersistentStorageInstance MGSs

 - Add an NnfStorageProfile option "standaloneMgtPoolName" to create a Lustre file system
   that only has an MGT. This option only works with the "create_persistent" directive.
 - Apply a label to the PersistentStorageInstance with the value of the "standaloneMgtPoolName"
   option. This adds the PersistentStorageInstance to a pool of ExternalMGSs.
 - Change the externalMgs option in the NnfStorageProfile to also accept "pool:[poolname]", where
   "poolname" is the name of an MGS pool (see the sketch below).
 - Modify the Setup phase to pick an MGS from the pool and add a reference to the PersistentStorageInstance.
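
A sketch of how that option value might be interpreted (illustrative only, not the actual implementation):

```go
package main

import (
	"fmt"
	"strings"
)

// parseExternalMgs distinguishes the new pool form from a plain MGS NID:
// "pool:example-pool" selects from the named MGS pool, anything else is
// treated as an explicit external MGS address as before.
func parseExternalMgs(value string) (pool string, fromPool bool) {
	if name, ok := strings.CutPrefix(value, "pool:"); ok {
		return name, true
	}
	return "", false
}

func main() {
	fmt.Println(parseExternalMgs("pool:example-pool")) // example-pool true
	fmt.Println(parseExternalMgs("172.0.0.1@tcp"))     // "" false
}
```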

Signed-off-by: Matt Richerson <[email protected]>

* review comments

Signed-off-by: Matt Richerson <[email protected]>

---------

Signed-off-by: Matt Richerson <[email protected]>
- Adds a timeout to Prerun and errors out if the containers do not start
- Moved timeout anonymous functions to be normal functions
- Fixed some issues with the example profiles for testing + added
  example-mpi-fail
- Improved error handling in Postrun
- Made Pre/PostRunTimeSeconds a pointer to allow for disabling and defaulting
  (see the sketch below)
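
A small sketch of those pointer semantics (the default value here is made up):

```go
package main

import (
	"fmt"
	"time"
)

// preRunTimeout distinguishes "unset" from "explicitly zero": nil means
// use the default, 0 disables the timeout, anything else is the timeout.
func preRunTimeout(seconds *int64) (timeout time.Duration, enabled bool) {
	if seconds == nil {
		return 300 * time.Second, true // hypothetical default
	}
	if *seconds == 0 {
		return 0, false // explicitly disabled
	}
	return time.Duration(*seconds) * time.Second, true
}

func main() {
	zero := int64(0)
	fmt.Println(preRunTimeout(nil))   // 5m0s true
	fmt.Println(preRunTimeout(&zero)) // 0s false
}
```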

Signed-off-by: Blake Devcich <[email protected]>
There is an intermittent issue where mpirun cannot contact the workers
even though nslookup can successfully resolve their DNS hostnames in the
Init Container. This is seen somewhat infrequently, but has happened
often enough. The end result causes the user containers to restart (if
restartLimit > 0), and it always seems to work on the second try.

This change appears to solve the issue by having the Init Container use
mpirun to contact the workers and simply retrieve their hostnames. This
replaces the use of nslookup and ensures that mpirun can be successful
on the launcher. To support this, the Init Container must run as the
given UID/GID rather than root.
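
A hedged sketch of what such an Init Container might look like (the name, image, and arguments are illustrative):

```go
package main

import corev1 "k8s.io/api/core/v1"

// mpiInitContainer sketches the idea: one init container runs mpirun
// across every worker, asking only for hostnames, to prove the launcher
// can reach all workers before the real job starts. It runs with the
// workflow's UID/GID rather than root so mpirun behaves as it will later.
func mpiInitContainer(image string, workerHosts string, uid, gid int64) corev1.Container {
	return corev1.Container{
		Name:    "mpi-wait-for-workers",
		Image:   image,
		Command: []string{"mpirun", "--host", workerHosts, "hostname"},
		SecurityContext: &corev1.SecurityContext{
			RunAsUser:  &uid,
			RunAsGroup: &gid,
		},
	}
}

func main() {
	_ = mpiInitContainer("example-mpi:latest", "worker-0,worker-1", 1050, 100)
}
```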

It also speeds up container start times as we only need to run 1 Init
Container for all of the workers rather than an Init Container for each
worker.

I have not been able to reproduce the original error using int-test,
which would (in)frequently catch this.

Signed-off-by: Blake Devcich <[email protected]>
* Use GetUserMessage() for resource errors

Use the GetUserMessage() receiver function when getting the user message for a resource error.
This will properly prefix the message with the error type.
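
A sketch of the described behavior, building on the hypothetical ResourceError shape sketched earlier (the exact format string is illustrative):

```go
package main

import "fmt"

// ResourceError repeats the hypothetical shape sketched earlier.
type ResourceError struct {
	UserMessage string
	Type        string
}

// GetUserMessage prefixes the user message with the error type, as the
// commit describes.
func (e *ResourceError) GetUserMessage() string {
	return fmt.Sprintf("%s error: %s", e.Type, e.UserMessage)
}

func main() {
	e := &ResourceError{Type: "user", UserMessage: "directive is invalid"}
	fmt.Println(e.GetUserMessage()) // user error: directive is invalid
}
```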

Signed-off-by: Matt Richerson <[email protected]>

* re-vendor

Signed-off-by: Matt Richerson <[email protected]>

---------

Signed-off-by: Matt Richerson <[email protected]>
@matthew-richerson matthew-richerson requested review from roehrich-hpe, bdevcich and ajfloeder and removed request for roehrich-hpe and bdevcich September 7, 2023 19:29
@roehrich-hpe
Contributor

Do NOT use squash!

@matthew-richerson matthew-richerson merged commit 2af650d into releases/v0 Sep 8, 2023
@matthew-richerson matthew-richerson deleted the release-v0.0.5 branch September 8, 2023 13:58