Release v0.0.2 (#186)
* Debug build for NVM Format command

Signed-off-by: Nate Thornton <[email protected]>

* Upgrade nnf-ec to latest (e4ba0b)

Signed-off-by: Nate Thornton <[email protected]>

* Upgrade nnf-ec to latest (96d6a3)

Signed-off-by: Nate Thornton <[email protected]>

* Upgrade nnf-ec to latest (83d47b)

Signed-off-by: Nate Thornton <[email protected]>

* Use DWS variable for storage label (#114)

Signed-off-by: Dean Roehrich <[email protected]>

* Use the DWS workflowname vars for label names (#115)

Signed-off-by: Dean Roehrich <[email protected]>

* RABSW-1069: Support for refactored DWS Storage resource and NNF Fencing functionality (#113)

Support for NNF Node Fencing with DWS Storage interaction

Signed-off-by: Nate Thornton <[email protected]>

* Add known controller-manager secret

Signed-off-by: Nate Thornton <[email protected]>

* Disable EC Data Controller for unit tests

Signed-off-by: Nate Thornton <[email protected]>

* Ignore not found resource on undeploy

Signed-off-by: Nate Thornton <[email protected]>

* RABSW-1081: Support multiple MDTs (#117)

* RABSW-1081: Support multiple MDTs

 - Update the DirectiveBreakdown to ask for more than one MDT if necessary
 - Only use a combined MGT/MDT for the first allocation listed in the mgtmdt allocation set.
   All other allocations will only be MDTs.
 - Fix an accounting error in the Servers resource where the allocated capacity was not summed
   across multiple NnfNodeStorages on the same Rabbit.

Signed-off-by: Matt Richerson <[email protected]>
Signed-off-by: Matt Richerson <[email protected]>

* Re-vendor

Signed-off-by: Matt Richerson <[email protected]>

Signed-off-by: Matt Richerson <[email protected]>
Signed-off-by: Matt Richerson <[email protected]>
Co-authored-by: Matt Richerson <[email protected]>

* RABSW-1099: Vendor DWS (#122)

Pick up the changes to the PersistentStorage state fields.

Signed-off-by: Matt Richerson <[email protected]>

Signed-off-by: Matt Richerson <[email protected]>

* Refactor Job/Persistent directive references to use "DW" prefix

Signed-off-by: Nate Thornton <[email protected]>

* RABSW-1097: Pass UserID and GroupID to ClientMount (#129)

* RABSW-1097: Pass UserID and GroupID to ClientMount

Pass the UserID and GroupID from the workflow, through the NnfAccess, and to the
ClientMount. This is used to set the owner/group of Raw devices on the compute
node.

Signed-off-by: Matt Richerson <[email protected]>

* re-vendor

Signed-off-by: Matt Richerson <[email protected]>

Signed-off-by: Matt Richerson <[email protected]>

* RABSW-1122: Don't allow staging to Raw allocations (#131)

Do some more sanity checks on staging directives:
 - Don't allow staging to/from raw allocations
 - Match allocation directives based on name and command "jobdw/persistentdw" since
   names can collide between the two types.

Signed-off-by: Matt Richerson <[email protected]>

Signed-off-by: Matt Richerson <[email protected]>

* RABSW-1097: Use "raw" instead of "lvm" for Raw allocation FsType (#132)

* RABSW-1097: Use "raw" instead of "lvm" for Raw allocation FsType

nnf-ec now understands the "raw" file system type.

* re-vendor

Signed-off-by: Matt Richerson <[email protected]>

Signed-off-by: Matt Richerson <[email protected]>

* Allow builds on all branches. (#134)

Loosen the branch filter for pushes.
Print some event context, for future debugging.
Remove an unused variable from verify_tag.
Rename some jobs to give them unique names, to help with debugging.

Signed-off-by: Dean Roehrich <[email protected]>

* RABSW-1124: Change NnfAccess TeardownState for servers (#136)

* RABSW-1124: Change NnfAccess TeardownState for servers

The data movement code was mounting and unmounting the Rabbit nodes during the DataIn
and DataOut phases of the workflow. A stale workflow resource in the client cache could
cause the NnfAccess to be re-mounted after it had already been unmounted. This commit
changes the NnfAccess Teardown state logic to do the unmounts in PreRun and Teardown
instead of DataIn and DataOut.

Signed-off-by: Matt Richerson <[email protected]>

* review comments

Signed-off-by: Matt Richerson <[email protected]>

---------

Signed-off-by: Matt Richerson <[email protected]>

* Update to latest nnf-ec with support for additional LVM commands (#130)

Signed-off-by: Nate Thornton <[email protected]>

* Fix for 'lockStart' typo

Signed-off-by: Nate Thornton <[email protected]>

* Added PR builds for feature branches (#138)

Signed-off-by: Blake Devcich <[email protected]>

* RABSW-1128: Make fake mounts on kind Rabbit nodes (#139)

* RABSW-1128: Make fake mounts on kind Rabbit nodes

Create empty directories on the Rabbit nodes in the clientmount reconciler
to better fake out data movement and user containers.

Signed-off-by: Matt Richerson <[email protected]>

* review comments

Signed-off-by: Matt Richerson <[email protected]>

---------

Signed-off-by: Matt Richerson <[email protected]>

* Shorten LVM names (#141)

* RABSW-1129: Shorten LVM names

The LV and VG names were too long and caused an error during the lvcreate. This commit
changes the VG name to use a truncated version of the file share ID which includes the
workflow name/namespace, directive index, and allocation index. This string is combined
with the UUID of the workflow. The LV name was changed to be "lv" for all logical volumes
since there is only ever the single LV in each VG.
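
As a rough illustration of the naming scheme described above (the truncation length, separator, and ID layout are assumptions, not the actual implementation):

```go
// Illustrative sketch of the shortened LVM naming; values are assumed.
package main

import "fmt"

// volumeGroupName builds a VG name from a truncated file share ID (which
// encodes workflow name/namespace, directive index, and allocation index)
// combined with the workflow UID, keeping it short enough for lvcreate.
func volumeGroupName(fileShareID, workflowUID string) string {
	const maxIDLen = 16 // assumed truncation length
	if len(fileShareID) > maxIDLen {
		fileShareID = fileShareID[:maxIDLen]
	}
	return fileShareID + "_" + workflowUID
}

// logicalVolumeName is simply "lv", since each VG only ever holds one LV.
func logicalVolumeName() string { return "lv" }

func main() {
	fmt.Println(volumeGroupName("wf-default-0-1-some-long-suffix", "3f9c2a7e"))
	fmt.Println(logicalVolumeName())
}
```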

Signed-off-by: Matt Richerson <[email protected]>

* review comments

Signed-off-by: Matt Richerson <[email protected]>

---------

Signed-off-by: Matt Richerson <[email protected]>

* Ensure the NNF Node resource fencing status is cleared prior updating… (#126)

* Ensure the NNF Node resource fencing status is cleared prior to updating the Storage resource
* Refactor to use DWS Storage Controller

Signed-off-by: Nate Thornton <[email protected]>

* Remove finalizer on the new DWS Storage Controller (#144)

* Remove finalizer on the new DWS Storage Controller

Signed-off-by: Nate Thornton <[email protected]>

* RABSW-1139: Fix ClientMount directory create/remove for kind environment (#145)

* RABSW-1139: Fix ClientMount directory create/remove for kind environment

In the ClientMount controller for kind nodes, check whether the directory exists
before creating or removing it.

Re-vendor dws

Signed-off-by: Matt Richerson <[email protected]>

* MkdirAll() already handles when the directory exists. Don't check beforehand with
a Stat() call.
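
A minimal sketch of that simplification using only the standard library (function names are illustrative):

```go
// Directory handling for fake mounts on kind nodes: MkdirAll is already a
// no-op when the directory exists, so no Stat() pre-check is needed.
package kindmount

import (
	"errors"
	"os"
)

// createMountDir creates the fake mount point; it returns nil if the
// directory already exists.
func createMountDir(path string) error {
	return os.MkdirAll(path, 0755)
}

// removeMountDir removes the fake mount point, treating "not found" as
// success so repeated teardowns stay idempotent.
func removeMountDir(path string) error {
	if err := os.Remove(path); err != nil && !errors.Is(err, os.ErrNotExist) {
		return err
	}
	return nil
}
```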

Signed-off-by: Matt Richerson <[email protected]>

* re-vendor

Signed-off-by: Matt Richerson <[email protected]>

---------

Signed-off-by: Matt Richerson <[email protected]>

* Ensure 'key' values in the ruleset match against the exact string

Signed-off-by: Nate Thornton <[email protected]>

* Handle Retryable EC errors for File Share (#133)

* Handle Retryable EC errors

Signed-off-by: Nate Thornton <[email protected]>

* Drive Slot information for drives which are offline (#152)

* Upgrade nnf-ec to latest (ca4975)
* Pull in drive Slot information from storage resource
* update go.sum after running 'go mod tidy'

---------

Signed-off-by: Nate Thornton <[email protected]>

* add --ignore-not-found to uninstall

Signed-off-by: Nate Thornton <[email protected]>

* github-151: Fix LVM issues with gfs2 (#157)

* github-151: Fix LVM issues with gfs2

This commit fixes two issues that were affecting gfs2 file systems:
 - The dlm lock manager was failing to lock because the VG name was too long
 - The lvcreate command needs "--activate ys" to activate a shared volume

Signed-off-by: Matt Richerson <[email protected]>

* use --extents

Signed-off-by: Matt Richerson <[email protected]>

---------

Signed-off-by: Matt Richerson <[email protected]>

* Update nnf-ec to 1dce5b

Signed-off-by: Nate Thornton <[email protected]>

* Add container support to workflows (#159)

Note: This is an experimental feature, but is working end-to-end (i.e. Proposal to Teardown) via the workflow process.

This feature adds experimental support for NNF Containers in workflows. Container workflows are created by using the `#DW container` directive. An `NnfContainerProfile` must be supplied to the directive to instruct the workflow on which containers to create and which volumes to mount inside of the container. Look at the sample container profile in the `config/samples` directory for more information. The profiles in the `config/examples` directory are deployed on the system as examples, but do not contain the full documentation.

The computes resource must also be updated to instruct the workflow on where to place the container pods. The provided compute nodes will be traced back to their local rabbit nodes, which will be used as the targets for the pods.

Containers are created during `PreRun` through the use of Kubernetes Jobs. Each rabbit node is the target of one Kubernetes Job, which manages the successful completion of the container. `PreRun` progresses to `ready:true` when the pods have started successfully. Each container has volumes mounted inside it that are defined by the container profile. The mount paths for these volumes are exposed to the container via environment variables that match the storage names provided by the container directive's arguments (e.g. `DW_JOB_foo-local-storage`). Each storage can be marked optional or required; if a required storage's argument isn't supplied to the directive, the workflow will fail in the `Proposal` state.

Once the workflow has progressed to `PostRun`, the workflow starts checking whether the pods have finished. Once finished, `PostRun` progresses to `ready:true` if all pods (i.e. k8s jobs) have completed successfully. If not, `PostRun` remains in `ready:false`.

Example container directive:

```
#DW jobdw name=my-gfs2 type=gfs2 capacity=50GB
#DW persistentdw name=my-persistent
#DW container name=my-container profile=example-randomly-fail
       DW_JOB_foo-local-storage=my-gfs2
       DW_PERSISTENT_foo-persistent-storage=my-persistent
```
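
As a rough sketch of the naming convention implied above (assumed, not taken from the implementation), the environment variable seen inside the container joins a storage prefix to the storage name from the directive arguments:

```go
// Hypothetical helper; the real mapping lives inside nnf-sos.
package containerenv

// storageEnvVarName joins the prefix ("DW_JOB" for jobdw storages,
// "DW_PERSISTENT" for persistentdw storages) to the storage name from the
// container directive arguments, e.g. "DW_JOB_foo-local-storage".
func storageEnvVarName(prefix, storageName string) string {
	return prefix + "_" + storageName
}
```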

---------

Signed-off-by: Blake Devcich <[email protected]>
Signed-off-by: Nate Thornton <[email protected]>
Co-authored-by: Nate Thornton <[email protected]>

* Make sure example container profiles contain retryLimit

Signed-off-by: Blake Devcich <[email protected]>

* NNF Port Manager (#163)

NNF Port Manager infrastructure and tests

---------

Signed-off-by: Nate Thornton <[email protected]>

* Nnf ec enhanced logging (#166)

* NNF-EC logger
* upgrade to nnf-ec master (47eb7a)
* expose zap options

---------

Signed-off-by: Nate Thornton <[email protected]>

* Containers: Add non-root support

This uses SecurityContext and inherits the Workflow's user/group ID.

Signed-off-by: Blake Devcich <[email protected]>

* Containers: Check for XFS/Raw filesystems (#167)

These filesystems can only be mounted once - they are not supported for
containers.

Signed-off-by: Blake Devcich <[email protected]>

* Vendor latest nnf-ec to fix namespace attach failures (#170)

Signed-off-by: Anthony Floeder <[email protected]>

* RABSW-1150: Add ServiceAccount for NNF fencing agent (#171)

Create a ServiceAccount for the NNF fencing agent that allows read and write access
to Node and NnfNode resources.

Signed-off-by: Matt Richerson <[email protected]>

* Containers: Fix Error Output

A few situations were being reported as errors when they should not have been:

Job creation loop. Since the job structure is reused for each rabbit node,
and the job may be updated, make sure the pod selector is empty. Do this
by starting each new job from a fresh DeepCopy of the job structure. See more here:
https://kubernetes.io/docs/concepts/workloads/controllers/job/#pod-selector
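
A hedged sketch of that DeepCopy pattern (the helper and field choices are illustrative, not the controller's actual code):

```go
// Build one Job per rabbit node, always starting from a fresh DeepCopy of
// the template so a previously populated pod selector never leaks into the
// next Create call.
package jobsketch

import (
	batchv1 "k8s.io/api/batch/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func jobsForNodes(template *batchv1.Job, nodes []string) []*batchv1.Job {
	jobs := make([]*batchv1.Job, 0, len(nodes))
	for _, node := range nodes {
		job := template.DeepCopy() // fresh structure; Spec.Selector stays empty
		job.ObjectMeta = metav1.ObjectMeta{
			Name:      template.Name + "-" + node,
			Namespace: template.Namespace,
		}
		job.Spec.Template.Spec.NodeName = node
		jobs = append(jobs, job)
	}
	return jobs
}
```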

Job container volumes: If the NnfAccess mount is not ready, requeue
rather than returning an error.

Job container start: It's possible that while waiting for the job
containers to start, the jobs themselves don't exist or aren't queryable
yet. Requeue.

Signed-off-by: Blake Devcich <[email protected]>

* Incorporate latest nnf-ec to fix format issue (#173)

Signed-off-by: Anthony Floeder <[email protected]>

* Use the live k8s client object in suite_test.go. (#175)

In the kubebuilder book:
https://book.kubebuilder.io/cronjob-tutorial/writing-tests.html
It explains that we should be using the "live" k8s client rather than the one
from the manager:

"Note that we set up both a “live” k8s client and a separate client from the
manager. This is because when making assertions in tests, you generally want to
assert against the live state of the API server. If you use the client from the
manager (k8sManager.GetClient), you’d end up asserting against the contents of
the cache instead, which is slower and can introduce flakiness into your
tests."

* Upgrade controller-runtime and friends (#174)

Upgrade controller-runtime, ginkgo, gomega.  Revendor dws and pick up the
new API for status updater.

Upgrade controller-gen and env-k8s-version.

Signed-off-by: Dean Roehrich <[email protected]>

* Github #39: Separate NnfAccess mount/unmount code paths (#176)

* Github #39: Separate NnfAccess mount/unmount code paths

This commit separates out the logic for mounting and unmounting an NnfAccess. This was to
provide proper unlocking of the NnfStorage for XFS and raw allocations.

Signed-off-by: Matt Richerson <[email protected]>

* review comments

Signed-off-by: Matt Richerson <[email protected]>

---------

Signed-off-by: Matt Richerson <[email protected]>

* Added NnfContainerProfile validation webhook + unit tests (#172)

- Added validation webhook for container profiles
- Fix a bug in the container filesystem check for persistent filesystems
- Add unit tests for container directives, most notably the storages in
  the profile and in the container directive arguments
- Add integration test to ensure that targeted compute nodes select the
  correct local NNF nodes for container workflows

Signed-off-by: Blake Devcich <[email protected]>

* main.go has too many calls to controllers.NnfPortManagerReconciler (#178)

Keep the one for the SLC, remove the one in main().

Signed-off-by: Dean Roehrich <[email protected]>

* RABSW-1096: Add Lustre target allocation hints (#179)

* RABSW-1096: Add Lustre target allocation hints

This commit adds three new fields to the NnfStorageProfile that are used to direct the WLM
on how many Lustre targets to create. The three new fields are:
- Count: Specify how many Lustre targets to create
- Scale: A unitless 1-10 value that the WLM uses with other information to come up with a target count
- ColocateComputes: Limit the Lustre targets to the Rabbits in the same chassis as the compute nodes.

These NnfStorageProfile fields are used to fill in the DirectiveBreakdown correctly.
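
A hedged sketch of those three hint fields (the actual NnfStorageProfile layout and json tags may differ):

```go
// Illustrative Go representation of the Lustre target allocation hints.
package profilesketch

type LustreTargetHints struct {
	// Count is an explicit number of Lustre targets to create.
	Count int `json:"count,omitempty"`

	// Scale is a unitless 1-10 value the WLM combines with other
	// information to come up with a target count.
	Scale int `json:"scale,omitempty"`

	// ColocateComputes limits the Lustre targets to the Rabbits in the
	// same chassis as the compute nodes.
	ColocateComputes bool `json:"colocateComputes,omitempty"`
}
```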

Signed-off-by: Matt Richerson <[email protected]>

* Review comments

Signed-off-by: Matt Richerson <[email protected]>

* re-vendor

Signed-off-by: Matt Richerson <[email protected]>

---------

Signed-off-by: Matt Richerson <[email protected]>

* Add MPI support to containers via mpi-operator (#177)

This adds a new way to create containers using mpi-operator.
mpi-operator is now a requirement of nnf-sos in order to run MPI
containers.

Users can now launch MPI container workflows. This is done via the
NnfContainerProfile. Container workflows can be executed in two ways:
  - MPI (launcher/worker model)
  - Non-MPI (one command for all containers)

The launcher/worker model allows users to run `mpirun` on the launcher
pod and then use the workers as nodes for `mpirun`. See the mpi-operator
docs for more: https://www.kubeflow.org/docs/components/training/mpi/.

Major Changes:

- Added `MPISpec` to define `MPIJobs` to container profile
- Moved original container implementation from `Template` to `Spec` to
  mimic the `MPISpec` name. Users now only define the PodSpec rather than
  the PodTemplateSpec
- Added example-mpi NnfContainerProfile (used for testing)
- Added permissions to both MPI and non-MPI containers to run as non-root
  users (i.e. `user` or `mpiuser`).
- Reworked PreRun to create either type and watch for successful
  container start for Ready logic.
- Reworked PostRun to watch for completion for either type and determine
  Ready state if containers completed successfully.
- Added InitContainers to map the `user` or `mpiuser` to the workflow's
  User and Group ID. This allows ssh to work properly for mpirun.
- New functions added to support both MPI and non-MPI container creation
  logic.
- Use server-side deployment to work around MPIJob's large CRD annotations

Signed-off-by: Blake Devcich <[email protected]>

* Add support for extra dcp and dryrun options in NnfDataMovementSpec (#180)

In order to support per-DM configuration options, we need to add some
options to the spec. These values will override/supplement the existing
data movement configuration that is defined in the nnf-dm-config
ConfigMap.

In this case, LLNL has a need to add extra dcp options for a given data
movement request. This will be done via the Copy Offload API.

For debugging purposes, the dryrun option has also been added to fake
out data movement.
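
A minimal sketch of a per-request override using the new `UserConfig` field; the field names match the `NnfDataMovementConfig` type added in this commit, while the helper itself is illustrative:

```go
// Apply per-DM options that override the nnf-dm-config ConfigMap defaults.
package dmsketch

import (
	nnfv1alpha1 "github.com/NearNodeFlash/nnf-sos/api/v1alpha1"
)

func withExtraDcpOptions(dm *nnfv1alpha1.NnfDataMovement) {
	dm.Spec.UserConfig = &nnfv1alpha1.NnfDataMovementConfig{
		DCPOptions: "--sparse", // extra dcp options for this request (example value)
		LogStdout:  true,       // also log dcp stdout on success
	}
}
```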

Signed-off-by: Blake Devcich <[email protected]>

* Wait for DWS webhook (#181)

Wait for the DWS webhook to be ready when doing a fresh deploy.

Update an out of date CRD.

Signed-off-by: Dean Roehrich <[email protected]>

* RABSW-1159: Update deploy.sh to look at the deployment ready count (#182)

The deploy.sh script was looking for a "1/1" ready field for the dws webhook. There may not
be enough worker nodes on the system to run all 3 DWS webhooks, so some of the webhook
pods may not be ready. If one of these pods shows up first in the pod list, then the
deploy.sh script will hang forever. Instead, wait for the number of ready replicas in the
dws webhook deployment to reach one or more.

Signed-off-by: Matt Richerson <[email protected]>

* Use the new "lus" API group for lustre-fs-operator (#183)

Use the new "lus" API group for lustre-fs-operator

Signed-off-by: Dean Roehrich <[email protected]>

* RABSW-1158: Update nnf-ec and add timeout environment variable (#184)

* RABSW-1158: Update nnf-ec and add timeout environment variable

Make use of the new timeout in nnf-ec when running commands. Timeout commands
after 90 seconds and return an error.
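
A rough sketch of an environment-variable-driven command timeout expressed in seconds (the variable name is an assumption, not necessarily the one nnf-ec reads):

```go
// Run a command with a timeout taken from the environment, defaulting to
// the 90 seconds described above.
package cmdtimeout

import (
	"context"
	"os"
	"os/exec"
	"strconv"
	"time"
)

func runWithTimeout(name string, args ...string) ([]byte, error) {
	seconds := 90
	if v := os.Getenv("NNF_EC_COMMAND_TIMEOUT_SECONDS"); v != "" { // assumed name
		if s, err := strconv.Atoi(v); err == nil {
			seconds = s
		}
	}
	ctx, cancel := context.WithTimeout(context.Background(), time.Duration(seconds)*time.Second)
	defer cancel()
	return exec.CommandContext(ctx, name, args...).CombinedOutput()
}
```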

Signed-off-by: Matt Richerson <[email protected]>

* use nnf-ec timeout env variable in seconds

Signed-off-by: Matt Richerson <[email protected]>

* re-vendor

Signed-off-by: Matt Richerson <[email protected]>

* go.mod/go.sum merge error

Signed-off-by: Matt Richerson <[email protected]>

---------

Signed-off-by: Matt Richerson <[email protected]>

* DM Types: Add options to log/store stdout (#185)

Signed-off-by: Blake Devcich <[email protected]>

* Github action triggers on master and release branches

Signed-off-by: Matt Richerson <[email protected]>

---------

Signed-off-by: Nate Thornton <[email protected]>
Signed-off-by: Dean Roehrich <[email protected]>
Signed-off-by: Matt Richerson <[email protected]>
Signed-off-by: Matt Richerson <[email protected]>
Signed-off-by: Blake Devcich <[email protected]>
Signed-off-by: Anthony Floeder <[email protected]>
Co-authored-by: Nate Thornton <[email protected]>
Co-authored-by: Dean Roehrich <[email protected]>
Co-authored-by: Matt Richerson <[email protected]>
Co-authored-by: Blake Devcich <[email protected]>
Co-authored-by: Blake Devcich <[email protected]>
Co-authored-by: Tony Floeder <[email protected]>
7 people authored May 1, 2023
1 parent 3a40435 commit f2d5fcc
Showing 1,276 changed files with 131,920 additions and 42,275 deletions.
6 changes: 6 additions & 0 deletions .github/workflows/main.yml
@@ -11,6 +11,7 @@ on:
branches:
- 'master'
- 'releases/v*'
- 'feature/*'

env:
# TEST_TARGET: Name of the testing target in the Dockerfile
@@ -27,6 +28,11 @@ jobs:
runs-on: ubuntu-latest

steps:
- name: "Build context"
run: |
echo "ref is ${{ github.ref }}"
echo "ref_type is ${{ github.ref_type }}"
- name: "Checkout repository"
id: checkout_repo
uses: actions/checkout@v3
10 changes: 6 additions & 4 deletions .github/workflows/verify_tag.yml
@@ -7,13 +7,15 @@ on:
tags:
- "v*"

env:
IMAGE_NAME: ${{ github.repository }}

jobs:
build:
verify_tag:
runs-on: ubuntu-latest
steps:
- name: "Verify context"
run: |
echo "ref is ${{ github.ref }}"
echo "ref_type is ${{ github.ref_type }}"
- uses: actions/checkout@v3
# actions/checkout@v3 breaks annotated tags by converting them into
# lightweight tags, so we need to force fetch the tag again
4 changes: 3 additions & 1 deletion .vscode/launch.json
@@ -28,7 +28,9 @@
"-ginkgo.progress"
],
"env": {
"KUBEBUILDER_ASSETS": "${workspaceFolder}/testbin/bin"
"KUBEBUILDER_ASSETS": "${workspaceFolder}/bin/k8s/1.25.0-darwin-amd64",
"GOMEGA_DEFAULT_EVENTUALLY_TIMEOUT": "10m",
"GOMEGA_DEFAULT_EVENTUALLY_POLLING_INTERVAL": "100ms"
},
"showLog": true
},
8 changes: 4 additions & 4 deletions Makefile
@@ -60,7 +60,7 @@ IMAGE_TAG_BASE ?= ghcr.io/nearnodeflash/nnf-sos
# You can use it as an arg. (E.g make bundle-build BUNDLE_IMG=<some-registry>/<project-name-bundle>:<tag>)

# ENVTEST_K8S_VERSION refers to the version of kubebuilder assets to be downloaded by envtest binary.
ENVTEST_K8S_VERSION = 1.25.0
ENVTEST_K8S_VERSION = 1.26.0

# Jenkins behaviors
# pipeline_service builds its target docker image and stores it into 1 of 3 destination folders.
@@ -223,7 +223,7 @@ test: manifests generate fmt vet envtest ## Run tests.
export GOMEGA_DEFAULT_EVENTUALLY_INTERVAL=${EVENTUALLY_INTERVAL}; \
export WEBHOOK_DIR=${ENVTEST_ASSETS_DIR}/webhook; \
for subdir in ${TESTDIRS}; do \
KUBEBUILDER_ASSETS="$(shell $(ENVTEST) use $(ENVTEST_K8S_VERSION) -p path --bin-dir $(LOCALBIN))" go test -v ./$$subdir/... -coverprofile cover.out -ginkgo.v -ginkgo.progress $$failfast; \
KUBEBUILDER_ASSETS="$(shell $(ENVTEST) use $(ENVTEST_K8S_VERSION) -p path --bin-dir $(LOCALBIN))" go test -v ./$$subdir/... -coverprofile cover.out -ginkgo.v $$failfast; \
done

##@ Build
@@ -254,7 +254,7 @@ install: manifests kustomize ## Install CRDs into the K8s cluster specified in ~
$(KUSTOMIZE) build config/crd | kubectl apply -f -

uninstall: manifests kustomize ## Uninstall CRDs from the K8s cluster specified in ~/.kube/config.
$(KUSTOMIZE) build config/crd | kubectl delete -f -
$(KUSTOMIZE) build config/crd | kubectl delete --ignore-not-found -f -

deploy: VERSION ?= $(shell cat .version)
deploy: .version kustomize ## Deploy controller to the K8s cluster specified in ~/.kube/config.
@@ -285,7 +285,7 @@ ENVTEST ?= $(LOCALBIN)/setup-envtest

## Tool Versions
KUSTOMIZE_VERSION ?= v4.5.7
CONTROLLER_TOOLS_VERSION ?= v0.9.2
CONTROLLER_TOOLS_VERSION ?= v0.11.1

KUSTOMIZE_INSTALL_SCRIPT ?= "https://raw.githubusercontent.com/kubernetes-sigs/kustomize/master/hack/install_kustomize.sh"
.PHONY: kustomize
20 changes: 20 additions & 0 deletions PROJECT
@@ -105,4 +105,24 @@ resources:
kind: NnfNodeECData
path: github.com/NearNodeFlash/nnf-sos/api/v1alpha1
version: v1alpha1
- api:
crdVersion: v1
namespaced: true
domain: cray.hpe.com
group: nnf
kind: NnfContainerProfile
path: github.com/NearNodeFlash/nnf-sos/api/v1alpha1
version: v1alpha1
webhooks:
validation: true
webhookVersion: v1
- api:
crdVersion: v1
namespaced: true
controller: true
domain: cray.hpe.com
group: nnf
kind: NnfPortManager
path: github.com/NearNodeFlash/nnf-sos/api/v1alpha1
version: v1alpha1
version: "3"
8 changes: 7 additions & 1 deletion api/v1alpha1/nnf_access_types.go
@@ -35,7 +35,7 @@ type NnfAccessSpec struct {

// TeardownState is the desired state of the workflow for this NNF Access resource to
// be torn down and deleted.
// +kubebuilder:validation:Enum:=DataIn;PreRun;PostRun;DataOut
// +kubebuilder:validation:Enum:=PreRun;PostRun;Teardown
// +kubebuilder:validation:Type:=string
TeardownState dwsv1alpha1.WorkflowState `json:"teardownState"`

@@ -45,6 +45,12 @@ type NnfAccessSpec struct {
// +kubebuilder:validation:Enum=single;all
Target string `json:"target"`

// UserID for the new mount. Currently only used for raw
UserID uint32 `json:"userID"`

// GroupID for the new mount. Currently only used for raw
GroupID uint32 `json:"groupID"`

// ClientReference is for a client resource. (DWS) Computes is the only client
// resource type currently supported
ClientReference corev1.ObjectReference `json:"clientReference,omitempty"`
35 changes: 33 additions & 2 deletions api/v1alpha1/nnf_datamovement_types.go
@@ -59,6 +59,11 @@ type NnfDataMovementSpec struct {
// Set to true if the data movement operation should be canceled.
// +kubebuilder:default:=false
Cancel bool `json:"cancel,omitempty"`

// User defined configuration on how data movement should be performed. This overrides the
// configuration defined in the nnf-dm-config ConfigMap. These values are typically set by the
// Copy Offload API.
UserConfig *NnfDataMovementConfig `json:"userConfig,omitempty"`
}

// DataMovementSpecSourceDestination defines the desired source or destination of data movement
@@ -72,6 +77,31 @@ type NnfDataMovementSpecSourceDestination struct {
StorageReference corev1.ObjectReference `json:"storageReference,omitempty"`
}

// NnfDataMovementConfig provides a way for a user to override the data movement behavior on a
// per DM basis.
type NnfDataMovementConfig struct {

// Fake the Data Movement operation. The system "performs" Data Movement but the command to do so
// is trivial. This means a Data Movement request is still submitted but the IO is skipped.
// +kubebuilder:default:=false
Dryrun bool `json:"dryrun,omitempty"`

// Extra options to pass to the dcp command (used to perform data movement).
DCPOptions string `json:"dcpOptions,omitempty"`

// If true, enable the command's stdout to be saved in the log when the command completes
// successfully. On failure, the output is always logged.
// Note: Enabling this option may degrade performance.
// +kubebuilder:default:=false
LogStdout bool `json:"logStdout,omitempty"`

// Similar to LogStdout, store the command's stdout in Status.Message when the command completes
// successfully. On failure, the output is always stored.
// Note: Enabling this option may degrade performance.
// +kubebuilder:default:=false
StoreStdout bool `json:"storeStdout,omitempty"`
}

// DataMovementCommandStatus defines the observed status of the underlying data movement
// command (MPI File Utils' `dcp` command).
type NnfDataMovementCommandStatus struct {
@@ -89,7 +119,7 @@ type NnfDataMovementCommandStatus struct {

// LastMessage reflects the last message received over standard output or standard error as
// captured by the underlying data movement command.
LastMessage string `json:"message,omitempty"`
LastMessage string `json:"lastMessage,omitempty"`

// LastMessageTime reflects the time at which the last message was received over standard output or
// standard error by the underlying data movement command.
@@ -106,7 +136,8 @@ type NnfDataMovementStatus struct {
// +kubebuilder:validation:Enum=Success;Failed;Invalid;Cancelled
Status string `json:"status,omitempty"`

// Message contains any text that explains the Status.
// Message contains any text that explains the Status. If Data Movement failed or storeStdout is
// enabled, this will contain the command's output.
Message string `json:"message,omitempty"`

// StartTime reflects the time at which the Data Movement operation started.
3 changes: 3 additions & 0 deletions api/v1alpha1/nnf_node_types.go
@@ -52,6 +52,9 @@ type NnfNodeStatus struct {

Health NnfResourceHealthType `json:"health,omitempty"`

// Fenced is true when the NNF Node is fenced by the STONITH agent, and false otherwise.
Fenced bool `json:"fenced,omitempty"`

Capacity int64 `json:"capacity,omitempty"`
CapacityAllocated int64 `json:"capacityAllocated,omitempty"`

136 changes: 136 additions & 0 deletions api/v1alpha1/nnf_port_manager_types.go
@@ -0,0 +1,136 @@
/*
* Copyright 2023 Hewlett Packard Enterprise Development LP
* Other additional copyright holders may be indicated within.
*
* The entirety of this work is licensed under the Apache License,
* Version 2.0 (the "License"); you may not use this file except
* in compliance with the License.
*
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/

package v1alpha1

import (
"github.com/HewlettPackard/dws/utils/updater"
corev1 "k8s.io/api/core/v1"
metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// EDIT THIS FILE! THIS IS SCAFFOLDING FOR YOU TO OWN!
// NOTE: json tags are required. Any new fields you add must have json tags for the fields to be serialized.

// NnfPortManagerAllocationSpec defines the desired state for a single port allocation
type NnfPortManagerAllocationSpec struct {
// Requester is an object reference to the requester of a ports.
Requester corev1.ObjectReference `json:"requester"`

// Count is the number of desired ports the requester needs. The port manager
// will attempt to allocate this many ports.
// +kubebuilder:default:=1
Count int `json:"count"`
}

// NnfPortManagerSpec defines the desired state of NnfPortManager
type NnfPortManagerSpec struct {
// INSERT ADDITIONAL SPEC FIELDS - desired state of cluster
// Important: Run "make" to regenerate code after modifying this file

// SystemConfiguration is an object reference to the system configuration. The
// Port Manager will use the available ports defined in the system configuration.
SystemConfiguration corev1.ObjectReference `json:"systemConfiguration"`

// Allocations is a list of allocation requests that the Port Manager will attempt
// to satisfy. To request port resources from the port manager, clients should add
// an entry to the allocations. Entries must be unique. The port manager controller
// will attempt to allocate port resources for each allocation specification in the
// list. To remove an allocation and free up port resources, remove the allocation
// from the list.
Allocations []NnfPortManagerAllocationSpec `json:"allocations"`
}

// AllocationStatus is the current status of a port requestor. A port that is in use by the respective owner
// will have a status of "InUse". A port that is freed by the owner but not yet reclaimed by the port manager
// will have a status of "Free". Any other status value indicates a failure of the port allocation.
// +kubebuilder:validation:Enum:=InUse;Free;InvalidConfiguration;InsufficientResources
type NnfPortManagerAllocationStatusStatus string

const (
NnfPortManagerAllocationStatusInUse NnfPortManagerAllocationStatusStatus = "InUse"
NnfPortManagerAllocationStatusFree NnfPortManagerAllocationStatusStatus = "Free"
NnfPortManagerAllocationStatusInvalidConfiguration NnfPortManagerAllocationStatusStatus = "InvalidConfiguration"
NnfPortManagerAllocationStatusInsufficientResources NnfPortManagerAllocationStatusStatus = "InsufficientResources"
// NOTE: You must ensure any new value is added to the above kubebuilder validation enum
)

// NnfPortManagerAllocationStatus defines the allocation status of a port for a given requester.
type NnfPortManagerAllocationStatus struct {
// Requester is an object reference to the requester of the port resource, if one exists, or
// empty otherwise.
Requester *corev1.ObjectReference `json:"requester,omitempty"`

// Ports is list of ports allocated to the owning resource.
Ports []uint16 `json:"ports,omitempty"`

// Status is the ownership status of the port.
Status NnfPortManagerAllocationStatusStatus `json:"status"`
}

// PortManagerStatus is the current status of the port manager.
// +kubebuilder:validation:Enum:=Ready;SystemConfigurationNotFound
type NnfPortManagerStatusStatus string

const (
NnfPortManagerStatusReady NnfPortManagerStatusStatus = "Ready"
NnfPortManagerStatusSystemConfigurationNotFound NnfPortManagerStatusStatus = "SystemConfigurationNotFound"
// NOTE: You must ensure any new value is added in the above kubebuilder validation enum
)

// NnfPortManagerStatus defines the observed state of NnfPortManager
type NnfPortManagerStatus struct {
// INSERT ADDITIONAL STATUS FIELD - define observed state of cluster
// Important: Run "make" to regenerate code after modifying this file

// Allocations is a list of port allocation status'.
Allocations []NnfPortManagerAllocationStatus `json:"allocations,omitempty"`

// Status is the current status of the port manager.
Status NnfPortManagerStatusStatus `json:"status"`
}

//+kubebuilder:object:root=true
//+kubebuilder:subresource:status

// NnfPortManager is the Schema for the nnfportmanagers API
type NnfPortManager struct {
metav1.TypeMeta `json:",inline"`
metav1.ObjectMeta `json:"metadata,omitempty"`

Spec NnfPortManagerSpec `json:"spec,omitempty"`
Status NnfPortManagerStatus `json:"status,omitempty"`
}

func (mgr *NnfPortManager) GetStatus() updater.Status[*NnfPortManagerStatus] {
return &mgr.Status
}

//+kubebuilder:object:root=true

// NnfPortManagerList contains a list of NnfPortManager
type NnfPortManagerList struct {
metav1.TypeMeta `json:",inline"`
metav1.ListMeta `json:"metadata,omitempty"`
Items []NnfPortManager `json:"items"`
}

func init() {
SchemeBuilder.Register(&NnfPortManager{}, &NnfPortManagerList{})
}
21 changes: 21 additions & 0 deletions api/v1alpha1/nnf_resource_status_type.go
@@ -20,6 +20,8 @@
package v1alpha1

import (
dwsv1alpha1 "github.com/HewlettPackard/dws/api/v1alpha1"

sf "github.com/NearNodeFlash/nnf-ec/pkg/rfsf/pkg/models"
)

@@ -93,6 +95,25 @@ func (rst NnfResourceStatusType) UpdateIfWorseThan(status *NnfResourceStatusType
}
}

func (rst NnfResourceStatusType) ConvertToDWSResourceStatus() dwsv1alpha1.ResourceStatus {
switch rst {
case ResourceStarting:
return dwsv1alpha1.StartingStatus
case ResourceReady:
return dwsv1alpha1.ReadyStatus
case ResourceDisabled:
return dwsv1alpha1.DisabledStatus
case ResourceNotPresent:
return dwsv1alpha1.NotPresentStatus
case ResourceOffline:
return dwsv1alpha1.OfflineStatus
case ResourceFailed:
return dwsv1alpha1.FailedStatus
default:
return dwsv1alpha1.UnknownStatus
}
}

// StaticResourceStatus will convert a Swordfish ResourceStatus to the NNF Resource Status.
func StaticResourceStatus(s sf.ResourceStatus) NnfResourceStatusType {
switch s.State {
