Commit
* Debug build for NVM Format command
Signed-off-by: Nate Thornton <[email protected]>
* Upgrade nnf-ec to latest (e4ba0b)
Signed-off-by: Nate Thornton <[email protected]>
* Upgrade nnf-ec to latest (96d6a3)
Signed-off-by: Nate Thornton <[email protected]>
* Upgrade nnf-ec to latest (83d47b)
Signed-off-by: Nate Thornton <[email protected]>
* Use DWS variable for storage label (#114)
Signed-off-by: Dean Roehrich <[email protected]>
* Use the DWS workflowname vars for label names (#115)
Signed-off-by: Dean Roehrich <[email protected]>
* RABSW-1069: Support for refactored DWS Storage resource and NNF Fencing functionality (#113)
Support for NNF Node Fencing with DWS Storage interaction
Signed-off-by: Nate Thornton <[email protected]>
* Add known controller-manager secret
Signed-off-by: Nate Thornton <[email protected]>
* Disable EC Data Controller for unit tests
Signed-off-by: Nate Thornton <[email protected]>
* Ignore not found resource on undeploy
Signed-off-by: Nate Thornton <[email protected]>
* RABSW-1081: Support multiple MDTs (#117)
* RABSW-1081: Support multiple MDTs
- Update the DirectiveBreakdown to ask for more than one MDT if necessary
- Only use a combined MGT/MDT for the first allocation listed in the mgtmdt allocation set. All other allocations will only be MDTs.
- Fix an accounting error in the Servers resource where the allocated capacity was not summed across multiple NnfNodeStorages on the same Rabbit.
Signed-off-by: Matt Richerson <[email protected]>
Signed-off-by: Matt Richerson <[email protected]>
* Re-vendor
Signed-off-by: Matt Richerson <[email protected]>
Signed-off-by: Matt Richerson <[email protected]>
Signed-off-by: Matt Richerson <[email protected]>
Co-authored-by: Matt Richerson <[email protected]>
* RABSW-1099: Vendor DWS (#122)
Pick up the changes to the PersistentStorage state fields.
Signed-off-by: Matt Richerson <[email protected]>
Signed-off-by: Matt Richerson <[email protected]>
* Refactor Job/Persistent directive references to use "DW" prefix
Signed-off-by: Nate Thornton <[email protected]>
* RABSW-1097: Pass UserID and GroupID to ClientMount (#129)
* RABSW-1097: Pass UserID and GroupID to ClientMount
Pass the UserID and GroupID from the workflow, through the NnfAccess, and to the ClientMount. This is used to set the owner/group of Raw devices on the compute node.
Signed-off-by: Matt Richerson <[email protected]>
* re-vendor
Signed-off-by: Matt Richerson <[email protected]>
Signed-off-by: Matt Richerson <[email protected]>
* RABSW-1122: Don't allow staging to Raw allocations (#131)
Do some more sanity checks on staging directives:
- Don't allow staging to/from raw allocations
- Match allocation directives based on name and command "jobdw/persistentdw" since names can collide between the two types.
Signed-off-by: Matt Richerson <[email protected]>
Signed-off-by: Matt Richerson <[email protected]>
* RABSW-1097: Use "raw" instead of "lvm" for Raw allocation FsType (#132)
* RABSW-1097: Use "raw" instead of "lvm" for Raw allocation FsType
nnf-ec now understands the "raw" file system type.
* re-vendor
Signed-off-by: Matt Richerson <[email protected]>
Signed-off-by: Matt Richerson <[email protected]>
* Allow builds on all branches. (#134)
Loosen the branch filter for pushes. Print some event context, for future debugging. Remove an unused variable from verify_tag. Rename some jobs to give them unique names, to help with debugging.
Signed-off-by: Dean Roehrich <[email protected]>
* RABSW-1124: Change NnfAccess TeardownState for servers (#136)
* RABSW-1124: Change NnfAccess TeardownState for servers
The data movement code was mounting and unmounting the Rabbit nodes during the DataIn and DataOut phases of the workflow. A stale workflow resource in the client cache could cause the NnfAccess to be re-mounted after it had already been unmounted.
This commit changes the NnfAccess Teardown state logic to do the unmounts in PreRun and Teardown instead of DataIn and DataOut.
Signed-off-by: Matt Richerson <[email protected]>
* review comments
Signed-off-by: Matt Richerson <[email protected]>
---------
Signed-off-by: Matt Richerson <[email protected]>
* Update to latest nnf-ec with support for additional LVM commands (#130)
Signed-off-by: Nate Thornton <[email protected]>
* Fix for 'lockStart' typo
Signed-off-by: Nate Thornton <[email protected]>
* Added PR builds for feature branches (#138)
Signed-off-by: Blake Devcich <[email protected]>
* RABSW-1128: Make fake mounts on kind Rabbit nodes (#139)
* RABSW-1128: Make fake mounts on kind Rabbit nodes
Create empty directories on the Rabbit nodes in the clientmount reconciler to better fake out data movement and user containers.
Signed-off-by: Matt Richerson <[email protected]>
* review comments
Signed-off-by: Matt Richerson <[email protected]>
---------
Signed-off-by: Matt Richerson <[email protected]>
* Shorten LVM names (#141)
* RABSW-1129: Shorten LVM names
The LV and VG names were too long and caused an error during the lvcreate. This commit changes the VG name to use a truncated version of the file share ID which includes the workflow name/namespace, directive index, and allocation index. This string is combined with the UUID of the workflow. The LV name was changed to be "lv" for all logical volumes since there is only ever the single LV in each VG.
Signed-off-by: Matt Richerson <[email protected]>
* review comments
Signed-off-by: Matt Richerson <[email protected]>
---------
Signed-off-by: Matt Richerson <[email protected]>
* Ensure the NNF Node resource fencing status is cleared prior updating… (#126)
* Ensure the NNF Node resource fencing status is cleared prior to updating the Storage resource
* Refactor to use DWS Storage Controller
Signed-off-by: Nate Thornton <[email protected]>
* Remove finalizer on the new DWS Storage Controller (#144)
* Remove finalizer on the new DWS Storage Controller
Signed-off-by: Nate Thornton <[email protected]>
* RABSW-1139: Fix ClientMount directory create/remove for kind environment (#145)
* RABSW-1139: Fix ClientMount directory create/remove for kind environment
In the ClientMount controller for kind nodes, check whether the directory exists before creating or removing it. Re-vendor dws
Signed-off-by: Matt Richerson <[email protected]>
* MkdirAll() already handles when the directory exists. Don't check beforehand with a Stat() call.
Signed-off-by: Matt Richerson <[email protected]>
* re-vendor
Signed-off-by: Matt Richerson <[email protected]>
---------
Signed-off-by: Matt Richerson <[email protected]>
* Ensure 'key' values in the ruleset match against the exact string
Signed-off-by: Nate Thornton <[email protected]>
* Handle Retryable EC errors for File Share (#133)
* Handle Retryable EC errors
Signed-off-by: Nate Thornton <[email protected]>
* Drive Slot information for drives which are offline (#152)
* Upgrade nnf-ec to latest (ca4975)
* Pull in drive Slot information from storage resource
* update go.sum after running 'go mod tidy'
---------
Signed-off-by: Nate Thornton <[email protected]>
* add --ignore-not-found to uninstall
Signed-off-by: Nate Thornton <[email protected]>
* github-151: Fix LVM issues with gfs2 (#157)
* github-151: Fix LVM issues with gfs2
This commit fixes two issues that were affecting gfs2 file systems:
- The dlm lock manager was failing to lock because the VG name was too long
- The lvcreate command needs an "--activate ys" to activate a shared volume
Signed-off-by: Matt Richerson <[email protected]>
* use --extents
Signed-off-by: Matt Richerson <[email protected]>
---------
Signed-off-by: Matt Richerson <[email protected]>
* Update nnf-ec to 1dce5b
Signed-off-by: Nate Thornton <[email protected]>
* Add container support to workflows (#159)
Note: This is an experimental feature, but is working end-to-end (i.e. Proposal to Teardown) via the workflow process.
This feature adds experimental support for NNF Containers in workflows. Container workflows are created by using the `#DW container` directive. An `NnfContainerProfile` must be supplied to the directive to instruct the workflow on what containers to create and which volumes to mount inside of the container. Look at the sample container profile in the `config/samples` directory for more information.
The example profiles in the `config/examples` directory are deployed on the system, but they do not contain the full documentation.
The computes resource must also be updated to instruct the workflow on where to place the container pods. The provided compute nodes will be traced back to their local rabbit node, which will be used as the targets for the pods.
Containers are created during `PreRun` through the use of Kubernetes Jobs. Each rabbit node will be the target of one Kubernetes Job, which will manage the successful completion of the container. `PreRun` will progress to `ready:true` when the pods have started successfully.
Each container has volumes mounted inside of it that are defined by the container profile. The mount paths for these volumes are exposed to the container via environment variables that match the storage names provided by the container directive's arguments (e.g. DW_JOB_foo-local-storage). A storage can be marked as optional or not. If a storage is not optional and its argument isn't supplied to the directive, the workflow will fail in the `Proposal` state.
Once the workflow has progressed to `PostRun`, the workflow will start to check if the pods have finished. Once finished, `PostRun` will progress to `ready:true` if all pods (i.e. k8s jobs) have completed successfully. If not, `PostRun` will remain in `ready:false`.
Example container directive:
```
#DW jobdw name=my-gfs2 type=gfs2 capacity=50GB
#DW persistentdw name=my-persistent
#DW container name=my-container profile=example-randomly-fail DW_JOB_foo-local-storage=my-gfs2 DW_PERSISTENT_foo-persistent-storage=my-persistent
```
---------
Signed-off-by: Blake Devcich <[email protected]>
Signed-off-by: Nate Thornton <[email protected]>
Co-authored-by: Nate Thornton <[email protected]>
* Make sure example container profiles contain retryLimit
Signed-off-by: Blake Devcich <[email protected]>
* NNF Port Manager (#163)
NNF Port Manager infrastructure and tests
---------
Signed-off-by: Nate Thornton <[email protected]>
* Nnf ec enhanced logging (#166)
* NNF-EC logger
* upgrade to nnf-ec master (47eb7a)
* expose zap options
---------
Signed-off-by: Nate Thornton <[email protected]>
* Containers: Add non-root support
This uses SecurityContext and inherits the Workflow's user/group ID.
Signed-off-by: Blake Devcich <[email protected]>
* Containers: Check for XFS/Raw filesystems (#167)
These filesystems can only be mounted once - they are not supported for containers.
Signed-off-by: Blake Devcich <[email protected]>
* Vendor latest nnf-ec to fix namespace attach failures (#170)
Signed-off-by: Anthony Floeder <[email protected]>
* RABSW-1150: Add ServiceAccount for NNF fencing agent (#171)
Create a ServiceAccount for the NNF fencing agent that allows read and write access to Node and NnfNode resources.
Signed-off-by: Matt Richerson <[email protected]>
* Containers: Fix Error Output
A few situations were being reported as errors when they should not be:
Job creation loop. Since the job is being reused for each rabbit node and with the possibility of updating the job, make sure the pod selector is empty. Do this by making sure the job structure for creating new jobs is fresh by doing DeepCopy.
See more here: https://kubernetes.io/docs/concepts/workloads/controllers/job/#pod-selector
Job container volumes: If the NNFAccess mount is not ready, requeue rather than return an error.
Job container start: It's possible that while waiting for the job containers to start, the jobs themselves don't exist or aren't queryable yet. Requeue.
Signed-off-by: Blake Devcich <[email protected]>
* Incorporate latest nnf-ec to fix format issue (#173)
Signed-off-by: Anthony Floeder <[email protected]>
* Use the live k8s client object in suite_test.go. (#175)
In the kubebuilder book: https://book.kubebuilder.io/cronjob-tutorial/writing-tests.html
It explains that we should be using the "live" k8s client rather than the one from the manager:
"Note that we set up both a “live” k8s client and a separate client from the manager. This is because when making assertions in tests, you generally want to assert against the live state of the API server. If you use the client from the manager (k8sManager.GetClient), you’d end up asserting against the contents of the cache instead, which is slower and can introduce flakiness into your tests."
* Upgrade controller-runtime and friends (#174)
Upgrade controller-runtime, ginkgo, gomega. Revendor dws and pick up the new API for status updater. Upgrade controller-gen and env-k8s-version.
Signed-off-by: Dean Roehrich <[email protected]>
* Github #39: Separate NnfAccess mount/unmount code paths (#176)
* Github #39: Separate NnfAccess mount/unmount code paths
This commit separates out the logic for mounting and unmounting an NnfAccess. This was to provide proper unlocking of the NnfStorage for XFS and raw allocations.
Signed-off-by: Matt Richerson <[email protected]>
* review comments
Signed-off-by: Matt Richerson <[email protected]>
---------
Signed-off-by: Matt Richerson <[email protected]>
* Added NnfContainerProfile validation webhook + unit tests (#172)
- Added validation webhook for container profiles
- Fix a bug in the container filesystem check for persistent filesystems
- Add unit tests for container directives, most notably the storages in the profile and in the container directive arguments
- Add integration test to ensure that targeted compute nodes select the correct local NNF nodes for container workflows
Signed-off-by: Blake Devcich <[email protected]>
* main.go has too many calls to controllers.NnfPortManagerReconciler (#178)
Keep the one for the SLC, remove the one in main().
Signed-off-by: Dean Roehrich <[email protected]>
* RABSW-1096: Add Lustre target allocation hints (#179)
* RABSW-1096: Add Lustre target allocation hints
This commit adds three new fields to the NnfStorageProfile that are used to direct the WLM on how many Lustre targets to create. The three new fields are:
- Count: Specify how many Lustre targets to create
- Scale: A unitless 1-10 value that the WLM uses with other information to come up with a target count
- ColocateComputes: Limit the Lustre targets to the Rabbits in the same chassis as the compute nodes.
These NnfStorageProfile fields are used to fill in the DirectiveBreakdown correctly.
Signed-off-by: Matt Richerson <[email protected]>
* Review comments
Signed-off-by: Matt Richerson <[email protected]>
* re-vendor
Signed-off-by: Matt Richerson <[email protected]>
---------
Signed-off-by: Matt Richerson <[email protected]>
* Add MPI support to containers via mpi-operator (#177)
This adds in a new way to create containers using mpi-operator. mpi-operator is now a requirement of nnf-sos in order to run MPI containers. Users can now launch MPI container workflows. This is done via the NnfContainerProfile.
Container workflows can be executed in two ways:
- MPI (launcher/worker model)
- Non-MPI (one command for all containers)
The launcher/worker model allows users to run `mpirun` on the launcher pod and then use the workers as nodes for `mpirun`. See the mpi-operator docs for more: https://www.kubeflow.org/docs/components/training/mpi/.
Major Changes:
- Added `MPISpec` to define `MPIJobs` to container profile
- Moved original container implementation from `Template` to `Spec` to mimic `MPISpec` name. User now only defines the PodSpec rather than the PodTemplateSpec
- Added example-mpi NnfContainerProfile (used for testing)
- Added permissions to both MPI and non-MPI containers to run as non root users (i.e. `user` or `mpiuser`).
- Reworked PreRun to create either type and watch for successful container start for Ready logic.
- Reworked PostRun to watch for completion for either type and determine Ready state if containers completed successfully.
- Added InitContainers to map the `user` or `mpiuser` to the workflow's User and Group ID. This allows ssh to work properly for mpirun.
- New functions added to support both MPI and non-MPI container creation logic.
- Use server-side deployment to work around MPIJob's large CRD annotations
Signed-off-by: Blake Devcich <[email protected]>
* Add support for extra dcp and dryrun options in NnfDataMovementSpec (#180)
In order to support per-DM configuration options, we need to add some options to the spec. These values will override/supplement the existing data movement configuration that is defined in the nnf-dm-config ConfigMap.
In this case, LLNL has a need to add extra dcp options for a given data movement request. This will be done via the Copy Offload API.
For debugging purposes, the dryrun option has also been added to fake out data movement.
Signed-off-by: Blake Devcich <[email protected]>
* Wait for DWS webhook (#181)
Wait for the DWS webhook to be ready when doing a fresh deploy. Update an out of date CRD.
Signed-off-by: Dean Roehrich <[email protected]>
* RABSW-1159: Update deploy.sh to look at the deployment ready count (#182)
The deploy.sh was looking for a "1/1" ready field for the dws webhook. There may not be enough worker nodes on the system to run all 3 DWS webhooks, so some of the webhook pods may not be ready. If one of these pods shows up first in the pod list, then the deploy.sh script will hang forever. Instead, look at the number of ready replicas in the dws webhook deployment to be one or more.
Signed-off-by: Matt Richerson <[email protected]>
* Use the new "lus" API group for lustre-fs-operator (#183)
Use the new "lus" API group for lustre-fs-operator
Signed-off-by: Dean Roehrich <[email protected]>
* RABSW-1158: Update nnf-ec and add timeout environment variable (#184)
* RABSW-1158: Update nnf-ec and add timeout environment variable
Make use of the new timeout in nnf-ec when running commands. Timeout commands after 90 seconds and return an error.
Signed-off-by: Matt Richerson <[email protected]>
* use nnf-ec timeout env variable in seconds
Signed-off-by: Matt Richerson <[email protected]>
* re-vendor
Signed-off-by: Matt Richerson <[email protected]>
* go.mod/go.sum merge error
Signed-off-by: Matt Richerson <[email protected]>
---------
Signed-off-by: Matt Richerson <[email protected]>
* DM Types: Add options to log/store stdout (#185)
Signed-off-by: Blake Devcich <[email protected]>
* Github action triggers on master and release branches
Signed-off-by: Matt Richerson <[email protected]>
---------
Signed-off-by: Nate Thornton <[email protected]>
Signed-off-by: Dean Roehrich <[email protected]>
Signed-off-by: Matt Richerson <[email protected]>
Signed-off-by: Matt Richerson <[email protected]>
Signed-off-by: Blake Devcich <[email protected]>
Signed-off-by: Anthony Floeder <[email protected]>
Co-authored-by: Nate Thornton <[email protected]>
Co-authored-by: Dean Roehrich <[email protected]>
Co-authored-by: Matt Richerson <[email protected]>
Co-authored-by: Blake Devcich <[email protected]>
Co-authored-by: Blake Devcich <[email protected]>
Co-authored-by: Tony Floeder <[email protected]>