divide by zero in nnf-ec/pkg/manager-nvme/manager.go #162

Closed

roehrich-hpe opened this issue Jun 7, 2024 · 4 comments
roehrich-hpe commented Jun 7, 2024

Using nnf-deploy-v0.1.2

$ kubectl logs -n nnf-system nnf-node-manager-lr5qn
[...]
2024-06-07T07:07:13.896-0700	INFO	Observed a panic in reconciler: runtime error: integer divide by zero	{"controller": "nnfnode", "controllerGroup": "nnf.cray.hpe.com", "controllerKind": "NnfNode", "NnfNode": {"name":"nnf-nlc","namespace":"elcap886"}, "namespace": "elcap886", "name": "nnf-nlc", "reconcileID": "3421a141-edb3-48b2-987d-d2b9caaef995"}
panic: runtime error: integer divide by zero [recovered]
	panic: runtime error: integer divide by zero

goroutine 606 [running]:
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile.func1()
	/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:116 +0x1fa
panic({0x197a480, 0x2c78ec0})
	/usr/local/go/src/runtime/panic.go:884 +0x212
github.com/NearNodeFlash/nnf-ec/pkg/manager-nvme.(*Manager).StorageIdStoragePoolsStoragePoolIdGet(0xc0049f7650?, {0x1c8407b, 0x2}, {0x1bdc14e, 0x1}, 0xc0049d6ba8)
	/workspace/vendor/github.com/NearNodeFlash/nnf-ec/pkg/manager-nvme/manager.go:1140 +0x278
github.com/NearNodeFlash/nnf-sos/internal/controller.updateDrives(0xc00429d880, {{0x1ece1a8?, 0xc00498bf50?}, 0xc0004bcb40?})
	/workspace/internal/controller/nnf_node_controller.go:482 +0x925
github.com/NearNodeFlash/nnf-sos/internal/controller.(*NnfNodeReconciler).Reconcile(0xc0001e5a40, {0x1ecafc8, 0xc00498bf20}, {{{0xc00317e410, 0x8}, {0xc00317e406, 0x7}}})
	/workspace/internal/controller/nnf_node_controller.go:294 +0x837
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile(0x1ece1a8?, {0x1ecafc8?, 0xc00498bf20?}, {{{0xc00317e410?, 0xb?}, {0xc00317e406?, 0x0?}}})
	/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:119 +0xc8
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler(0xc000397860, {0x1ecaf20, 0xc0001df840}, {0x1a147a0?, 0xc0006d2520?})
	/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:316 +0x3f9
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem(0xc000397860, {0x1ecaf20, 0xc0001df840})
	/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:266 +0x1d9
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2()
	/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:227 +0x85
created by sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2
	/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:223 +0x333
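
The crash site is StorageIdStoragePoolsStoragePoolIdGet at manager-nvme/manager.go:1140, reached from updateDrives in nnf_node_controller.go:482. The sketch below is not the actual nnf-ec code; the names, and the idea that the divisor is a per-device value (such as a block size) left at zero by a failed initialization, are assumptions. It only illustrates the kind of zero check that would turn this panic into an ordinary error:

// Hypothetical sketch only; field and function names are assumptions,
// not the real nnf-ec manager.go code.
package main

import "fmt"

type storage struct {
	unallocatedBytes uint64
	blockSizeBytes   uint64 // stays 0 if the drive never answered Identify
}

// poolCapacityBlocks mirrors the kind of division that can panic when a
// drive failed initialization and its geometry was never filled in.
func poolCapacityBlocks(s *storage) (uint64, error) {
	if s.blockSizeBytes == 0 {
		// Guard: treat an uninitialized drive as unavailable instead of panicking.
		return 0, fmt.Errorf("storage device not initialized (block size is zero)")
	}
	return s.unallocatedBytes / s.blockSizeBytes, nil
}

func main() {
	bad := &storage{unallocatedBytes: 1920383410176} // blockSizeBytes == 0
	if _, err := poolCapacityBlocks(bad); err != nil {
		fmt.Println("skipping drive:", err)
	}
}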

roehrich-hpe commented Jun 7, 2024

The NnfNode resource:

apiVersion: nnf.cray.hpe.com/v1alpha1
kind: NnfNode
metadata:
  creationTimestamp: "2024-06-05T18:28:28Z"
  generation: 1
  name: nnf-nlc
  namespace: elcapX
  resourceVersion: "129973439"
  uid: b14de82a-0884-43f9-b92c-f3237091d873
spec:
  name: elcapX
  pod: nnf-node-manager-lr5qn
  state: Enable
status:
  capacity: 17283450691584
  drives:
  - capacity: 1920383410176
    firmwareVersion: 1TCRS104
    health: OK
    id: "0"
    model: KIOXIA KCM7DRJE1T92
    serialNumber: 3D50A11N0U61
    slot: "8"
    status: Ready
  - capacity: 1920383410176
    firmwareVersion: 1TCRS104
    health: OK
    id: "1"
    model: KIOXIA KCM7DRJE1T92
    serialNumber: 3D50A09R0U61
    slot: "7"
    status: Ready
  - capacity: 1920383410176
    firmwareVersion: 1TCRS104
    health: OK
    id: "2"
    model: KIOXIA KCM7DRJE1T92
    serialNumber: 3D50A1970U61
    slot: "15"
    status: Ready
  - capacity: 1920383410176
    firmwareVersion: 1TCRS104
    health: OK
    id: "3"
    model: KIOXIA KCM7DRJE1T92
    serialNumber: 3D50A18R0U61
    slot: "16"
    status: Ready
  - capacity: 1920383410176
    firmwareVersion: 1TCRS104
    health: OK
    id: "4"
    model: KIOXIA KCM7DRJE1T92
    serialNumber: 3D50A0D00U61
    slot: "17"
    status: Ready
  - capacity: 1920383410176
    firmwareVersion: 1TCRS104
    health: OK
    id: "5"
    model: KIOXIA KCM7DRJE1T92
    serialNumber: 3D50A18S0U61
    slot: "18"
    status: Ready
  - capacity: 1920383410176
    firmwareVersion: 1TCRS104
    health: OK
    id: "6"
    model: KIOXIA KCM7DRJE1T92
    serialNumber: 3D60A03X0U61
    slot: "14"
    status: Ready
  - health: Critical
    id: "7"
    slot: "13"
    status: Offline
  - capacity: 1920383410176
    firmwareVersion: 1TCRS104
    health: OK
    id: "8"
    model: KIOXIA KCM7DRJE1T92
    serialNumber: 3D50A0GD0U61
    slot: "12"
    status: Ready
  - capacity: 1920383410176
    firmwareVersion: 1TCRS104
    health: Critical
    id: "9"
    model: KIOXIA KCM7DRJE1T92
    serialNumber: 3D50A18H0U61
    slot: "4"
    status: Offline
  - capacity: 1920383410176
    firmwareVersion: "\0\0\0\0\0\0\0\0"
    health: Critical
    id: "10"
    model: "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"
    serialNumber: "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"
    slot: "5"
    status: Offline
  - capacity: 1920383410176
    firmwareVersion: 1TCRS104
    health: Critical
    id: "11"
    model: KIOXIA KCM7DRJE1T92
    serialNumber: 3D50A11H0U61
    slot: "6"
    status: Offline
  - capacity: 1920383410176
    firmwareVersion: 1TCRS104
    health: Critical
    id: "12"
    model: KIOXIA KCM7DRJE1T92
    serialNumber: 3D60A03H0U61
    slot: "2"
    status: Offline
  - health: Critical
    id: "13"
    slot: "1"
    status: Offline
  - capacity: 1920383410176
    firmwareVersion: 1TCRS104
    health: Critical
    id: "14"
    model: KIOXIA KCM7DRJE1T92
    serialNumber: 3D50A11J0U61
    slot: "9"
    status: Offline
  - capacity: 1920383410176
    firmwareVersion: "\0\0\0\0\0\0\0\0"
    health: OK
    id: "15"
    model: "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"
    serialNumber: "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"
    slot: "10"
    status: Ready
  - capacity: 1920383410176
    firmwareVersion: 1TCRS104
    health: Critical
    id: "16"
    model: KIOXIA KCM7DRJE1T92
    serialNumber: 3D60A04U0U61
    slot: "11"
    status: Offline
  - capacity: 1920383410176
    firmwareVersion: 1TCRS104
    health: OK
    id: "17"
    model: KIOXIA KCM7DRJE1T92
    serialNumber: 3D50A18G0U61
    slot: "3"
    status: Ready
  health: OK
  lnetNid: 183802@kfi4
  servers:
  - health: OK
    hostname: elcapX
    id: "0"
    name: Rabbit
    status: Ready
  - [...]
  status: Ready
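
Note that drives 10 and 15 report all-null model and serial strings (drive 15 even reports status Ready), and several other drives are Offline. A minimal sketch of filtering out such drives before making per-drive queries; the types here are hypothetical and only mirror the status fields above, not the actual updateDrives code:

// Hypothetical sketch; the drive fields mirror the NnfNode status above,
// not the real nnf-sos types used by updateDrives.
package main

import (
	"fmt"
	"strings"
)

type drive struct {
	id, model, status string
}

// usableDrives drops drives that are Offline or whose identify data came back
// as null bytes (ids 10 and 15 above), before any per-drive pool queries.
func usableDrives(drives []drive) []drive {
	var out []drive
	for _, d := range drives {
		if d.status != "Ready" || strings.Trim(d.model, "\x00") == "" {
			continue
		}
		out = append(out, d)
	}
	return out
}

func main() {
	drives := []drive{
		{id: "0", model: "KIOXIA KCM7DRJE1T92", status: "Ready"},
		{id: "7", status: "Offline"},
		{id: "15", model: strings.Repeat("\x00", 40), status: "Ready"},
	}
	fmt.Println(len(usableDrives(drives))) // 1
}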

roehrich-hpe commented:

Earlier in the log:

2024-06-07T07:12:22.088-0700    INFO    ec.nvme.16      Initialize storage device       {"storageId": "16", "slot": 11}
2024-06-07T07:12:22.092-0700    ERROR   ec.nvme Failed to initialize storage device     {"slot": 11, "switchId": "1", "portId": "17", "error": "Initialize Storage 16: Failed to indentify common controller: Error: Device 0x1500@/dev/switchtec0: Failed NVMe Command: OpCode: Identify (0x06): Error: NVMe Status: UNKNOWN (0x001) CRD: 0 More: false DNR: true"}
github.com/NearNodeFlash/nnf-ec/pkg/manager-nvme.(*Storage).LinkEstablishedEventHandler
        /workspace/vendor/github.com/NearNodeFlash/nnf-ec/pkg/manager-nvme/manager.go:908
github.com/NearNodeFlash/nnf-ec/pkg/manager-nvme.(*Manager).EventHandler
        /workspace/vendor/github.com/NearNodeFlash/nnf-ec/pkg/manager-nvme/manager.go:884
github.com/NearNodeFlash/nnf-ec/pkg/manager-event.(*manager).Publish
        /workspace/vendor/github.com/NearNodeFlash/nnf-ec/pkg/manager-event/manager.go:176
github.com/NearNodeFlash/nnf-ec/pkg/manager-fabric.(*Switch).refreshPortStatus
        /workspace/vendor/github.com/NearNodeFlash/nnf-ec/pkg/manager-fabric/manager.go:508
github.com/NearNodeFlash/nnf-ec/pkg/manager-fabric.Start
        /workspace/vendor/github.com/NearNodeFlash/nnf-ec/pkg/manager-fabric/manager.go:1109
github.com/NearNodeFlash/nnf-ec/pkg/manager-fabric.(*DefaultApiRouter).Start
        /workspace/vendor/github.com/NearNodeFlash/nnf-ec/pkg/manager-fabric/router.go:57
github.com/NearNodeFlash/nnf-ec/pkg/ec.(*Controller).initialize
        /workspace/vendor/github.com/NearNodeFlash/nnf-ec/pkg/ec/ec.go:171
github.com/NearNodeFlash/nnf-ec/pkg/ec.(*Controller).Init
        /workspace/vendor/github.com/NearNodeFlash/nnf-ec/pkg/ec/ec.go:320
github.com/NearNodeFlash/nnf-sos/internal/controller.(*NnfNodeECDataReconciler).Start
        /workspace/internal/controller/nnf_node_ec_data_controller.go:112
sigs.k8s.io/controller-runtime/pkg/manager.(*runnableGroup).reconcile.func1
        /workspace/vendor/sigs.k8s.io/controller-runtime/pkg/manager/runnable_group.go:223
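
So storage 16 (slot 11) failed NVMe Identify during link-established handling. One plausible chain, stated as an assumption rather than a confirmed diagnosis: the storage object is left partially populated after that failure, and the later StorageIdStoragePoolsStoragePoolIdGet call divides by a field that was never filled in. A minimal sketch (hypothetical names, not the real nnf-ec manager) of recording the failure so later lookups return an error instead of panicking:

// Hypothetical sketch; names are assumptions, not the real nnf-ec manager.
package main

import (
	"errors"
	"fmt"
)

type storageDevice struct {
	id    string
	ready bool // false when Identify failed during link-established handling
}

// onLinkEstablished stands in for the event handler above: if initialization
// fails, the device is kept but flagged not-ready instead of half-populated.
func (s *storageDevice) onLinkEstablished(initialize func() error) {
	if err := initialize(); err != nil {
		fmt.Printf("failed to initialize storage %s: %v\n", s.id, err)
		s.ready = false
		return
	}
	s.ready = true
}

// getStoragePool refuses to serve a device that never initialized, so callers
// like updateDrives get an error instead of a divide-by-zero panic.
func (s *storageDevice) getStoragePool() error {
	if !s.ready {
		return errors.New("storage " + s.id + " is not initialized")
	}
	return nil // ... real pool accounting would go here
}

func main() {
	st := &storageDevice{id: "16"}
	st.onLinkEstablished(func() error { return errors.New("NVMe Identify failed") })
	if err := st.getStoragePool(); err != nil {
		fmt.Println("skip:", err)
	}
}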

roehrich-hpe commented:

The entire log:
nnf-node-manager-lr5qn.log

roehrich-hpe self-assigned this Jun 7, 2024
github-project-automation bot moved this from 📋 Open to ✅ Closed in Issues Dashboard Jun 7, 2024