test(robot): add node down during migration test cases #2208

Open
wants to merge 1 commit into base: master
67 changes: 67 additions & 0 deletions e2e/tests/negative/live_migration.robot
@@ -37,3 +37,70 @@ Migration Confirmation After Migration Node Down
Then Wait for volume 0 to migrate to node 1
And Wait for volume 0 healthy
And Check volume 0 data is intact

Migration Rollback After Migration Node Down
Given Create volume 0 with migratable=True accessMode=RWX dataEngine=${DATA_ENGINE}
And Attach volume 0 to node 0
And Wait for volume 0 healthy
And Write data to volume 0

And Attach volume 0 to node 1
And Wait for volume 0 migration to be ready

# power off migration node
When Power off node 1
# migration rollback by detaching from the migration node
And Detach volume 0 from node 1

# migration rollback succeeds
Then Wait for volume 0 to stay on node 0
And Wait for volume 0 degraded
And Check volume 0 data is intact

Migration Confirmation After Original Node Down
Given Create volume 0 with migratable=True accessMode=RWX dataEngine=${DATA_ENGINE}
And Attach volume 0 to node 0
And Wait for volume 0 healthy
And Write data to volume 0

And Attach volume 0 to node 1
And Wait for volume 0 migration to be ready

# power off original node
When Power off node 0
# migration confirmation by detaching from the original node
And Detach volume 0 from node 0
Collaborator:

Do we need to detach explicitly from node 0? Should we just wait for the instance-manager to terminate?

Member Author:

The test case is Migration Confirmation After Original Node Down, so we should detach the volume from the original node to confirm the migration.

If no action were taken after the original node went down, that would be a different test case, something like Original Node Down After Migration Ready.
I just tested this scenario: after the migration was ready, I powered off the original node and then did nothing. Eventually the volume ended up detached:

$ kubectl get volumes -n longhorn-system pvc-8bf2ae3a-be4d-4110-b9e0-a9a124db8095 -oyaml -w | grep -i nodeid
  migrationNodeID: ""
  nodeID: ""
  currentMigrationNodeID: ""
  currentNodeID: ""
  pendingNodeID: ""

Even after powering the original node back on, the volume remains detached permanently.

supportbundle_ef8a6972-0c68-48f9-bd45-1e575e519d95_2024-12-19T00-29-41Z.zip

If we need this test case, @derekbit and @PhanLe1010 need to confirm first whether this is the expected behavior.
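
For reference, a minimal sketch of such an Original Node Down After Migration Ready case could look like the following. It reuses the keywords already in this suite; Wait for volume 0 detached is a hypothetical keyword that would need to exist (or be added) first:

Original Node Down After Migration Ready
Given Create volume 0 with migratable=True accessMode=RWX dataEngine=${DATA_ENGINE}
And Attach volume 0 to node 0
And Wait for volume 0 healthy
And Write data to volume 0

And Attach volume 0 to node 1
And Wait for volume 0 migration to be ready

# power off original node and take no further action
When Power off node 0

# based on the observation above, the migration is eventually given up and the
# volume ends up fully detached (nodeID and migrationNodeID both empty)
# NOTE: Wait for volume 0 detached is a hypothetical keyword, not part of this suite yet
Then Wait for volume 0 detached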

Contributor:

Yes, this is expected behavior. The volume will remain detached until the CSI flow or the user detaches it from the original node. This is the new design: when the volume crashes, the migration is stopped and the volume stays detached, waiting for the user/CSI to decide which single node Longhorn should attach it to. The reasoning is that the volume has already crashed, so live migration is no longer needed; this reduces the risk of unnecessary migration and chaos.

Ref longhorn/longhorn#8735 (comment)
Manual test case is updated at #1948
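
If that is the agreed design, the user/CSI decision described here could be sketched as a continuation of the hypothetical Original Node Down After Migration Ready case above (again only a sketch; the degraded expectation is an assumption, since one replica remains on the powered-off node):

# continuation of the hypothetical case above: after the crash-induced detach,
# the user/CSI explicitly decides which node the volume should attach to
When Attach volume 0 to node 1
# assumption: the volume comes up degraded because one replica is still on the
# powered-off original node
Then Wait for volume 0 degraded
And Check volume 0 data is intact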


# migration is stuck until the Kubernetes pod eviction controller decides to
# terminate the instance-manager pod that was running on the original node.
# then Longhorn detaches the volume and cleanly reattaches it to the migration node.
Then Wait for volume 0 to migrate to node 1
And Wait for volume 0 degraded
And Check volume 0 data is intact

Migration Rollback After Original Node Down
Given Create volume 0 with migratable=True accessMode=RWX dataEngine=${DATA_ENGINE}
And Attach volume 0 to node 0
And Wait for volume 0 healthy
And Write data to volume 0

And Attach volume 0 to node 1
And Wait for volume 0 migration to be ready

# power off original node
When Power off node 0
# migration rollback by detaching from the migration node
And Detach volume 0 from node 1

# migration is stuck until the Kubernetes pod eviction controller decides to
# terminate the instance-manager pod that was running on the original node.
# then Longhorn detaches the volume and attempts to cleanly reattach it to the original node,
# but it is stuck in attaching until the node comes back.
Then Check volume 0 kept in attaching

# power on original node
When Power on off nodes

Then Wait for volume 0 to stay on node 0
And Wait for volume 0 healthy
And Check volume 0 data is intact
Comment on lines +81 to +106

⚠️ Potential issue

Ensure proper test isolation and cleanup for node power operations

The test manipulates node power state, which could affect other tests. Consider:

  1. Adding verification steps in the test teardown to ensure nodes are powered back on
  2. Verifying cluster state is fully restored before the next test

Add these steps to the test teardown:

[Teardown]    Run Keywords    Power on off nodes
...    AND    Wait for all nodes ready    timeout=300
...    AND    Cleanup test resources
