test(robot): add node down during migration test cases #2208
base: master
Conversation
Signed-off-by: Yang Chiu <[email protected]>
Walkthrough: This pull request adds four new negative test cases to e2e/tests/negative/live_migration.robot.
Actionable comments posted: 1
🔭 Outside diff range comments (1)
e2e/tests/negative/live_migration.robot (1)
Line range hint 1-106: Consider refactoring common test patterns

The test cases share similar setup and verification patterns. Consider:
- Creating a custom keyword for the common volume setup sequence
- Enhancing data integrity checks to verify more than just presence
- Adding comprehensive volume state validation keywords
Example keyword extraction:
*** Keywords ***
Setup Migration Test Volume
    [Arguments]    ${node_index}
    Create volume 0 with migratable=True accessMode=RWX dataEngine=${DATA_ENGINE}
    Attach volume 0 to node ${node_index}
    Wait for volume 0 healthy
    Write data to volume 0

Verify Volume State And Data
    [Arguments]    ${expected_node}    ${expected_state}
    Wait for volume 0 ${expected_state}
    Wait for volume 0 to stay on node ${expected_node}
    Check volume 0 data is intact
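For illustration, a hypothetical test body built from these extracted keywords (the test name and node indices are illustrative; the remaining steps reuse keywords already present in this suite):

Example Migration Test Using Extracted Keywords
    Given Setup Migration Test Volume    0
    And Attach volume 0 to node 1
    And Wait for volume 0 migration to be ready
    When Power off node 1
    And Detach volume 0 from node 1
    Then Verify Volume State And Data    0    healthy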
🧹 Nitpick comments (2)
e2e/tests/negative/live_migration.robot (2)
Line range hint 1-40: Consider adding explicit timeout parameters for wait operations

The test case effectively validates migration behavior when the migration node fails. However, consider adding explicit timeout parameters to wait operations to handle varying cluster response times.
Example modification:
-    Then Check volume 0 kept in attaching
+    Then Check volume 0 kept in attaching    timeout=300
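A minimal sketch of how such a keyword could expose the timeout, assuming a polling-style implementation (the ${timeout} default, the loop, and the helper keyword Get volume 0 state are illustrative, not the suite's actual internals):

*** Keywords ***
Check volume 0 kept in attaching
    [Arguments]    ${timeout}=300
    # Poll once per second for the whole window and fail if the volume
    # ever leaves the attaching state (helper keyword name is hypothetical).
    FOR    ${i}    IN RANGE    ${timeout}
        ${state} =    Get volume 0 state
        Should Be Equal    ${state}    attaching
        Sleep    1s
    END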
41-59: Add validation for error messages and events

While the test case verifies the rollback behavior and data integrity, it would be valuable to also validate that appropriate error messages or events are generated during the rollback process.
Consider adding validation steps like:
And Verify error event "Migration failed due to node down" exists
And Verify volume condition "Degraded" has reason "NodeDown"
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
e2e/tests/negative/live_migration.robot (1 hunks)
🔇 Additional comments (1)
e2e/tests/negative/live_migration.robot (1)
60-80: Address potential test flakiness due to K8s pod eviction timing
The test depends on Kubernetes pod eviction controller timing, which could make it flaky. Consider:
- Adding retry mechanisms with appropriate timeouts (see the sketch after this list)
- Moving the detailed comments about K8s behavior to test documentation
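A hedged sketch of such a retry wrapper using Robot Framework's built-in Wait Until Keyword Succeeds; the wrapped keyword comes from the diff below, and the 5-minute/10-second window is illustrative, not the suite's default:

# Keep retrying the eviction-dependent assertion instead of evaluating it once,
# since the stuck-in-attaching state only appears after the pod eviction
# controller terminates the instance-manager pod.
Wait Until Keyword Succeeds    5 min    10 sec
...    Check volume 0 kept in attaching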
Let's check if similar timing issues exist in other test files:
Migration Rollback After Original Node Down
    Given Create volume 0 with migratable=True accessMode=RWX dataEngine=${DATA_ENGINE}
    And Attach volume 0 to node 0
    And Wait for volume 0 healthy
    And Write data to volume 0

    And Attach volume 0 to node 1
    And Wait for volume 0 migration to be ready

    # power off original node
    When Power off node 0
    # migration rollback by detaching from the migration node
    And Detach volume 0 from node 1

    # migration is stuck until the Kubernetes pod eviction controller decides to
    # terminate the instance-manager pod that was running on the original node.
    # then Longhorn detaches the volume and attempts to cleanly reattach it to the original node,
    # but it is stuck in attaching until the node comes back.
    Then Check volume 0 kept in attaching

    # power on original node
    When Power on off nodes

    Then Wait for volume 0 to stay on node 0
    And Wait for volume 0 healthy
    And Check volume 0 data is intact
Ensure proper test isolation and cleanup for node power operations
The test manipulates node power state which could affect other tests. Consider:
- Adding verification steps in the test teardown to ensure nodes are powered back on
- Verifying cluster state is fully restored before the next test
Add these steps to the test teardown:
[Teardown]    Run Keywords    Power on off nodes
...    AND    Wait for all nodes ready    timeout=300
...    AND    Cleanup test resources
    # power off original node
    When Power off node 0
    # migration confirmation by detaching from the original node
    And Detach volume 0 from node 0
Do we need to detach explicitly from node 0? Should we just wait for the instance-manager to terminate?
The test case is Migration Confirmation After Original Node Down, so we should detach from the original node to confirm the migration. If no action were taken after the original node went down, that would be a different test case, called Original Node Down After Migration Ready.
And I just tested this scenario: after the migration was ready, powered off the original node, and then did nothing. Eventually the volume was clearly detached:
$ kubectl get volumes -n longhorn-system pvc-8bf2ae3a-be4d-4110-b9e0-a9a124db8095 -oyaml -w | grep -i nodeid
migrationNodeID: ""
nodeID: ""
currentMigrationNodeID: ""
currentNodeID: ""
pendingNodeID: ""
Even after powering on the original node again, the volume remains in the detached state permanently.
supportbundle_ef8a6972-0c68-48f9-bd45-1e575e519d95_2024-12-19T00-29-41Z.zip
If we need this test case, @derekbit and @PhanLe1010 should first confirm whether this is the expected behavior.
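If that scenario is added, a rough sketch of what it could look like, reusing step keywords from the diff above (the test name follows the comment above; Wait for volume 0 detached is a hypothetical keyword for the detached-state check):

Original Node Down After Migration Ready
    Given Create volume 0 with migratable=True accessMode=RWX dataEngine=${DATA_ENGINE}
    And Attach volume 0 to node 0
    And Wait for volume 0 healthy
    And Write data to volume 0
    And Attach volume 0 to node 1
    And Wait for volume 0 migration to be ready

    # power off the original node and take no further action
    When Power off node 0

    # per the observation above, the volume eventually ends up detached
    # and stays detached even after the original node is powered back on
    Then Wait for volume 0 detached
    When Power on off nodes
    Then Wait for volume 0 detached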
Yes, this is expected behavior. The volume will remain detached until the CSI flow or the user detaches it from the original node. This is the new design: when the volume crashes, the migration is stopped and the volume stays in the detached state, waiting for the user/CSI to decide which single node Longhorn should attach it to. The reasoning is that the volume has already crashed, so live migration is no longer needed; this reduces the risk of unnecessary migration and chaos.
Ref longhorn/longhorn#8735 (comment)
Manual test case is updated at #1948
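A short sketch of the resolution step that design implies, i.e. the user/CSI picking the single node to attach to after the crash (keywords reused from the diff above; the choice of node 1 is illustrative):

    # After the crash leaves the volume detached, one explicit detach/attach
    # decision tells Longhorn the only node it should attach to; no live
    # migration is involved anymore.
    When Detach volume 0 from node 0
    And Attach volume 0 to node 1
    Then Wait for volume 0 to stay on node 1
    And Wait for volume 0 healthy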
Which issue(s) this PR fixes:
Issues: longhorn/longhorn#9905, longhorn/longhorn#8817, longhorn/longhorn#8906
What this PR does / why we need it:
Add test cases for node down during migration.
Special notes for your reviewer:
Additional documentation or context