test(robot): v2 volume should cleanup resources when instance manager is deleted #2199

Conversation

@c3y1huang c3y1huang commented Dec 17, 2024

Which issue(s) this PR fixes:

Issue longhorn/longhorn#9959, longhorn/longhorn#9989

What this PR does / why we need it:

Introduce a new Robot Framework test case.

Special notes for your reviewer:

None

Additional documentation or context

None

Summary by CodeRabbit

  • New Features

    • Added keywords for uncordoning nodes and validating device existence in volume management.
    • Introduced a new test case for verifying resource cleanup when an instance manager is deleted.
  • Bug Fixes

    • Corrected variable naming inconsistencies in test scripts.
  • Documentation

    • Enhanced documentation for test cases related to node draining.
  • Style

    • Minor formatting adjustments for improved readability in code and logging statements.

@c3y1huang c3y1huang self-assigned this Dec 17, 2024

coderabbitai bot commented Dec 17, 2024

Walkthrough

The pull request introduces enhancements to Longhorn's end-to-end testing framework, focusing on node management, device handling, and test coverage. New keywords and methods have been added to facilitate better testing of volume and node operations, particularly around device management and instance manager interactions. The changes include methods for listing devices, retrieving instance managers, and adding new test scenarios to validate resource cleanup and node management processes.

Changes

  • e2e/keywords/k8s.resource: added the Uncordon node ${node_id} keyword and corrected the instance_manager spelling
  • e2e/keywords/longhorn.resource: simplified the instance manager deletion keywords
  • e2e/keywords/volume.resource: added keywords to assert device existence for volumes
  • e2e/libs/node/node.py: added the list_dm_devices and list_volume_devices methods
  • e2e/libs/keywords/node_keywords.py: added methods to list device types on nodes
  • e2e/libs/keywords/volume_keywords.py: added the get_volume_instance_manager method
  • e2e/libs/node_exec/node_exec.py: updated logging and added a node toleration
  • e2e/tests/regression/test_v2.robot: added a test case for volume resource cleanup

Assessment against linked issues

  • Clean up orphan devices when IM pod crashes [#9959]
  • Verify device cleanup for v2 volumes

Possibly related PRs

Suggested reviewers

  • chriscchien

Poem

🐰 Nodes dance, devices clean,
Volumes whisper their testing dream
Uncordon, list, and verify with glee
Longhorn's robustness, now we see!
A rabbit's test of pure delight 🚀


📜 Recent review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 75b888b and c7d5648.

📒 Files selected for processing (11)
  • e2e/keywords/k8s.resource (2 hunks)
  • e2e/keywords/longhorn.resource (2 hunks)
  • e2e/keywords/volume.resource (1 hunks)
  • e2e/libs/backing_image/crd.py (6 hunks)
  • e2e/libs/keywords/node_keywords.py (1 hunks)
  • e2e/libs/keywords/volume_keywords.py (3 hunks)
  • e2e/libs/keywords/workload_keywords.py (1 hunks)
  • e2e/libs/node/node.py (1 hunks)
  • e2e/libs/node_exec/node_exec.py (2 hunks)
  • e2e/tests/negative/node_drain.robot (1 hunks)
  • e2e/tests/regression/test_v2.robot (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (8)
  • e2e/libs/node_exec/node_exec.py
  • e2e/libs/node/node.py
  • e2e/libs/keywords/workload_keywords.py
  • e2e/libs/keywords/node_keywords.py
  • e2e/keywords/k8s.resource
  • e2e/tests/negative/node_drain.robot
  • e2e/libs/backing_image/crd.py
  • e2e/libs/keywords/volume_keywords.py
👮 Files not reviewed due to content moderation or server errors (3)
  • e2e/keywords/longhorn.resource
  • e2e/tests/regression/test_v2.robot
  • e2e/keywords/volume.resource

@c3y1huang c3y1huang force-pushed the 9959-orphan-device-and-dm-device-when-im-pod-crash branch from 35896a0 to 954f69a Compare December 17, 2024 04:18
@c3y1huang
Collaborator Author

@coderabbitai review

coderabbitai bot commented Dec 17, 2024

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (7)
e2e/libs/keywords/workload_keywords.py (1)

49-49: Consider standardizing logging format across similar methods.

For consistency, consider adding namespace information to logs in related methods like create_pod(). This would maintain a uniform logging pattern throughout the class.

Example for create_pod():

     def create_pod(self, pod_name, claim_name):
-        logging(f'Creating pod {pod_name} using pvc {claim_name}')
+        logging(f'Creating pod {pod_name} using pvc {claim_name} in namespace {namespace}')
         create_pod(new_busybox_manifest(pod_name, claim_name))
e2e/tests/regression/test_v2.robot (2)

82-93: Consider adding cleanup verification after test completion.

While the test properly sets up the volumes and writes data, it would be beneficial to verify that all resources are properly cleaned up after the test, even in failure scenarios.

Consider adding these verifications in the test teardown (a rough helper is sketched after this list):

  • Verify no orphaned devices remain
  • Ensure instance managers are properly terminated
  • Check for any leaked resources
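
A minimal sketch of such a teardown check in Python, assuming the list_dm_devices and list_volume_devices helpers added in e2e/libs/node/node.py return newline-separated device names and that v2 volumes surface devices named after the volume; the import path and helper name below are illustrative, not part of this PR:

# Hypothetical teardown helper built on the node library from this PR.
# The import path and device-naming assumptions are for illustration only.
from node import Node

def assert_no_orphan_devices(node_name, volume_names):
    node = Node()
    dm_devices = node.list_dm_devices(node_name).splitlines()
    volume_devices = node.list_volume_devices(node_name).splitlines()
    for volume_name in volume_names:
        # After cleanup, a volume should leave neither a dm device nor a
        # /dev/longhorn/<volume> block device behind on the node.
        assert volume_name not in dm_devices, \
            f"orphan dm device for {volume_name} on {node_name}"
        assert volume_name not in volume_devices, \
            f"orphan /dev/longhorn device for {volume_name} on {node_name}"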

95-103: Consider adding error handling for node operations.

The test should handle potential failures in node cordoning and instance manager deletion gracefully.

Consider (a node-state check helper is sketched after this list):

  • Adding timeout and retry logic for node operations
  • Verifying node state before proceeding with device checks
  • Adding cleanup steps if the test fails at this stage
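
One way to cover the node-state check is a small helper that polls the node with the official kubernetes Python client until the cordon has taken effect before any device assertions run; the function name and retry defaults here are illustrative:

import time

from kubernetes import client, config

def wait_for_node_cordoned(node_name, retry_counts=150, retry_interval=1):
    # Poll the node spec until unschedulable is set, so later device checks
    # only run once the cordon has actually been applied.
    config.load_kube_config()
    api = client.CoreV1Api()
    for _ in range(retry_counts):
        node = api.read_node(node_name)
        if node.spec.unschedulable:
            return
        time.sleep(retry_interval)
    raise AssertionError(
        f"node {node_name} was not cordoned within {retry_counts * retry_interval}s")
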
e2e/libs/node/node.py (2)

291-294: Consider adding error handling for dmsetup command.

While the implementation is clean and focused, it could benefit from error handling in case the dmsetup command is not available on the node.

 def list_dm_devices(self, node_name):
-    cmd = "dmsetup ls | awk '{print $1}'"
+    cmd = "if command -v dmsetup >/dev/null 2>&1; then dmsetup ls | awk '{print $1}'; else echo 'dmsetup command not found' >&2; exit 1; fi"
     res = NodeExec(node_name).issue_cmd(cmd)
     return res

296-299: Consider adding error handling for missing directory.

While the implementation is clean and focused, it could benefit from error handling in case the Longhorn device directory doesn't exist.

 def list_volume_devices(self, node_name):
-    cmd = "ls /dev/longhorn/"
+    cmd = "if [ -d /dev/longhorn/ ]; then ls /dev/longhorn/; else echo 'Longhorn device directory not found' >&2; exit 1; fi"
     res = NodeExec(node_name).issue_cmd(cmd)
     return res
e2e/keywords/volume.resource (1)

392-402: LGTM: Well-structured volume device assertion keyword.

The keyword follows the same pattern as the DM device assertion:

  • Consistent implementation style
  • Clear condition handling
  • Proper error handling

Consider adding documentation to clarify the expected format of device names and the difference between DM devices and volume devices.

e2e/tests/negative/node_drain.robot (1)

Line range hint 1-220: Well-structured test suite with comprehensive coverage.

The test suite demonstrates excellent test design practices:

  • Clear documentation for each test case with detailed steps
  • Proper test setup and teardown
  • Comprehensive coverage of node drain scenarios
  • Good use of Robot Framework's Given-When-Then style
  • Proper validation of data integrity after operations

Consider adding the following test scenarios if not covered elsewhere:

  1. Node drain during volume expansion
  2. Node drain with multiple concurrent workloads
  3. Node drain during backup/restore operations
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 5c4ab15 and 954f69a.

📒 Files selected for processing (11)
  • e2e/keywords/k8s.resource (2 hunks)
  • e2e/keywords/longhorn.resource (2 hunks)
  • e2e/keywords/volume.resource (1 hunks)
  • e2e/libs/backing_image/crd.py (6 hunks)
  • e2e/libs/keywords/node_keywords.py (1 hunks)
  • e2e/libs/keywords/volume_keywords.py (3 hunks)
  • e2e/libs/keywords/workload_keywords.py (1 hunks)
  • e2e/libs/node/node.py (1 hunks)
  • e2e/libs/node_exec/node_exec.py (2 hunks)
  • e2e/tests/negative/node_drain.robot (1 hunks)
  • e2e/tests/regression/test_v2.robot (2 hunks)
✅ Files skipped from review due to trivial changes (1)
  • e2e/libs/backing_image/crd.py
🔇 Additional comments (14)
e2e/libs/keywords/workload_keywords.py (1)

49-49: LGTM! Enhanced logging with namespace context.

The addition of namespace information to the log message improves observability and debugging capabilities, especially useful for tracking pod deletions across different namespaces.

e2e/libs/keywords/node_keywords.py (1)

79-83: LGTM! Clean implementation of device listing methods.

The new methods follow the established pattern in the class, providing clear interfaces to the underlying node functionality.
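
For reference, the wrappers likely take roughly the following shape; the method names and import path below are assumptions rather than verbatim code from e2e/libs/keywords/node_keywords.py:

# Rough shape of the keyword wrappers; names and import path are assumed.
from node import Node

class node_keywords:

    def __init__(self):
        self.node = Node()

    def list_dm_devices_on_node(self, node_name):
        # Thin pass-through so Robot tests can assert on dm devices directly.
        return self.node.list_dm_devices(node_name)

    def list_volume_devices_on_node(self, node_name):
        # Same pattern for the /dev/longhorn/ block devices.
        return self.node.list_volume_devices(node_name)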

e2e/keywords/longhorn.resource (1)

72-73: LGTM! Good refactoring of instance manager deletion.

The changes simplify the logic by directly retrieving the instance manager, making the code more maintainable and reducing duplication.

Also applies to: 78-79

e2e/tests/regression/test_v2.robot (2)

75-81: LGTM! Well-structured test case with clear documentation.

The test case effectively validates the resource cleanup behavior with good coverage of both v1 and v2 volumes.


104-116: Verify volume health check timeouts.

The test waits for volumes to become healthy but doesn't specify timeouts. This could lead to test hangs in failure scenarios.

Let's check the default timeout values:

✅ Verification successful

Let me search for the timeout configurations in the robot framework files.


Let me search for the implementation of the volume health check wait function.


Let me check the implementation of the volume status wait function to find the timeout configuration.


Let me check the values of the timeout constants used in the wait function.


Based on the findings, I can now generate the final response:

The volume health check has appropriate timeout configurations.

The wait_for_volume_healthy function uses the following timeout settings:

  • Default retry count: 150 attempts (RETRY_COUNTS)
  • Long retry count: 360 attempts (RETRY_COUNTS_LONG)
  • Retry interval: 1 second (RETRY_INTERVAL)

This means the health check will wait up to:

  • 150 seconds (2.5 minutes) by default
  • 360 seconds (6 minutes) for long operations

These timeouts are sufficient to handle normal volume operations while preventing indefinite hangs. The test framework also uses these same constants consistently across different test scenarios.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Search for volume health check timeout configurations
rg -A 2 "wait.*health" --glob '*.{py,robot,resource}'

Length of output: 89535


Script:

#!/bin/bash
# Search for volume health check timeout configurations in robot files
rg -A 2 "Wait for volume.*healthy" --glob '*.robot'

Length of output: 25981


Script:

#!/bin/bash
# Search for the implementation of volume health check wait function
rg -A 5 "def wait_for_volume.*healthy" --glob '*.{py,robot}'

Length of output: 2125


Script:

#!/bin/bash
# Search for the implementation of wait_for_volume_status function
rg -A 10 "def wait_for_volume_status" --glob '*.{py,robot}'

Length of output: 905


Script:

#!/bin/bash
# Search for the definition of RETRY constants
rg "RETRY_COUNTS|RETRY_INTERVAL|RETRY_COUNTS_LONG" --glob '*.py'

Length of output: 40264

e2e/libs/node_exec/node_exec.py (2)

54-54: LGTM! Improved log readability.

The addition of a newline character enhances log readability by clearly separating the command result from the log message.


103-108: LGTM! Essential toleration for cordoned nodes.

The added toleration is crucial as it enables command execution on cordoned nodes, which is necessary for node management operations. The configuration follows Kubernetes best practices.
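
The exact toleration added in node_exec.py is not quoted in this review, but tolerating the unschedulable taint is the usual way to let an exec pod be admitted on a cordoned node; a sketch of how that typically appears in a pod manifest dict (names and image are illustrative):

# Illustrative pod manifest fragment; the tolerations entry is the point.
pod_manifest = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "node-exec"},
    "spec": {
        "tolerations": [
            {
                # Cordoning marks the node unschedulable, which is surfaced as
                # the node.kubernetes.io/unschedulable taint; tolerating it
                # lets the exec pod still be placed on that node.
                "key": "node.kubernetes.io/unschedulable",
                "operator": "Exists",
                "effect": "NoSchedule",
            }
        ],
        "containers": [
            {
                "name": "shell",
                "image": "ubuntu:22.04",
                "command": ["sleep", "infinity"],
            }
        ],
    },
}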

e2e/keywords/k8s.resource (2)

75-77: LGTM! Well-structured node management keyword.

The new keyword follows consistent patterns with other node management keywords and provides essential functionality for uncordoning nodes by their index.
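
Under the hood such a keyword typically clears the node's unschedulable flag through the Kubernetes API, the programmatic equivalent of kubectl uncordon; a minimal Python sketch (not the actual library code):

from kubernetes import client, config

def uncordon_node(node_name):
    # Clear spec.unschedulable so pods can be scheduled on the node again.
    config.load_kube_config()
    api = client.CoreV1Api()
    api.patch_node(node_name, {"spec": {"unschedulable": False}})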


124-125: LGTM! Fixed argument name typo.

Corrected the argument name from instance_manger to instance_manager for better consistency.

e2e/libs/keywords/volume_keywords.py (3)

17-17: LGTM: Clean import addition.

Added import for VolumeRest to support the new instance manager retrieval functionality.


74-78: LGTM: Well-implemented instance manager retrieval.

The new method get_volume_instance_manager is well-implemented with proper validation (a rough sketch follows the list below):

  • Retrieves volume details using VolumeRest
  • Validates single controller existence
  • Returns the instance manager name
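
A rough sketch of what such a method could look like, assuming the REST volume payload exposes a controllers list whose entries carry an instanceManagerName field (the field names are assumptions, not confirmed by this review):

def get_volume_instance_manager(volume):
    # volume: dict payload returned by the Longhorn REST client for a volume.
    controllers = volume.get("controllers", [])
    # Validate that exactly one controller exists before reading its
    # instance manager, as described above.
    assert len(controllers) == 1, f"expected 1 controller, got {len(controllers)}"
    return controllers[0]["instanceManagerName"]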

133-133: LGTM: Clean formatting.

Removed unnecessary line break, improving code readability.

e2e/keywords/volume.resource (1)

380-390: LGTM: Well-structured DM device assertion keyword.

The keyword implementation is clean and includes:

  • Proper variable resolution
  • Clear condition handling
  • Appropriate error handling for invalid conditions
e2e/tests/negative/node_drain.robot (1)

220-220: Fix typo in variable name.

Good catch on fixing the typo from instance_manger to instance_manager. This maintains consistency with the variable naming across the codebase.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (1)
e2e/tests/regression/test_v2.robot (1)

99-118: LGTM: Comprehensive test assertions with proper validation flow

The test case thoroughly validates the cleanup process by:

  • Verifying device cleanup for affected volumes
  • Ensuring v1 volumes remain unaffected
  • Confirming data integrity after recovery
  • Using appropriate assertions for both positive and negative cases

The loop structure ensures reliability by testing multiple iterations.

Consider adding error handling for node operations to make the test more robust in case of infrastructure issues:

  • Timeout handling for cordon/uncordon operations
  • Retry mechanism for health checks
  • Cleanup steps in case of intermediate failures
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 954f69a and 75b888b.

📒 Files selected for processing (6)
  • e2e/keywords/k8s.resource (2 hunks)
  • e2e/keywords/volume.resource (1 hunks)
  • e2e/libs/keywords/node_keywords.py (1 hunks)
  • e2e/libs/node/node.py (1 hunks)
  • e2e/libs/node_exec/node_exec.py (2 hunks)
  • e2e/tests/regression/test_v2.robot (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (5)
  • e2e/libs/node_exec/node_exec.py
  • e2e/libs/node/node.py
  • e2e/libs/keywords/node_keywords.py
  • e2e/keywords/k8s.resource
  • e2e/keywords/volume.resource
🔇 Additional comments (4)
e2e/tests/regression/test_v2.robot (4)

17-17: LGTM: Resource import is properly placed

The addition of k8s.resource is well-organized with other resource imports and provides necessary node management keywords for the new test case.


75-81: LGTM: Well-structured test case with clear documentation

The test case is properly structured with:

  • Clear, descriptive name
  • Appropriate test tag
  • Comprehensive documentation explaining the purpose and linking to the related issue

82-94: LGTM: Comprehensive test setup with proper volume initialization

The setup creates a robust test environment by:

  • Creating both v2 and v1 volumes for comparison
  • Properly waiting for volume health
  • Including data writing for validation

95-98: Verify instance manager deletion success

While the test flow is logical, it would be beneficial to verify that the instance manager deletion was successful before proceeding with the assertions.

Consider adding a verification step after line 97:

 When Cordon node 0
 And Delete instance-manager of volume 0
+And Wait for instance-manager deletion complete

Signed-off-by: Chin-Ya Huang <[email protected]>
Signed-off-by: Chin-Ya Huang <[email protected]>
… is deleted

longhorn/longhorn-9959
longhorn/longhorn-9989

Signed-off-by: Chin-Ya Huang <[email protected]>
@c3y1huang c3y1huang force-pushed the 9959-orphan-device-and-dm-device-when-im-pod-crash branch from 75b888b to c7d5648 Compare December 19, 2024 09:09
Contributor

@chriscchien chriscchien left a comment

LGTM, I have verified longhorn/longhorn#9959 without any problem.

@c3y1huang c3y1huang merged commit 19fa09f into longhorn:master Dec 19, 2024
9 checks passed
@c3y1huang c3y1huang deleted the 9959-orphan-device-and-dm-device-when-im-pod-crash branch December 19, 2024 09:29