test(robot): v2 volume should cleanup resources when instance manager is deleted #2199

Conversation

@c3y1huang c3y1huang commented Dec 17, 2024

Which issue(s) this PR fixes:

Issue longhorn/longhorn#9959, longhorn/longhorn#9989

What this PR does / why we need it:

Introduce a new Robot Framework test case.

Special notes for your reviewer:

None

Additional documentation or context

None

Summary by CodeRabbit

  • New Features

    • Added keywords for uncordoning nodes and validating device existence in volume management.
    • Introduced a new test case for verifying resource cleanup when an instance manager is deleted.
  • Bug Fixes

    • Corrected variable naming inconsistencies in test scripts.
  • Documentation

    • Enhanced documentation for test cases related to node draining.
  • Style

    • Minor formatting adjustments for improved readability in code and logging statements.

@c3y1huang c3y1huang self-assigned this Dec 17, 2024

coderabbitai bot commented Dec 17, 2024

Walkthrough

The pull request introduces enhancements to Longhorn's end-to-end testing framework, focusing on node management, device handling, and test coverage. New keywords and methods have been added to facilitate better testing of volume and node operations, particularly around device management and instance manager interactions. The changes include methods for listing devices, retrieving instance managers, and adding new test scenarios to validate resource cleanup and node management processes.

Changes

  • e2e/keywords/k8s.resource: added the Uncordon node ${node_id} keyword and corrected the instance_manager spelling
  • e2e/keywords/longhorn.resource: simplified the instance manager deletion keywords
  • e2e/keywords/volume.resource: added keywords to assert device existence for volumes
  • e2e/libs/node/node.py: added the list_dm_devices and list_volume_devices methods
  • e2e/libs/keywords/node_keywords.py: added methods to list device types on nodes
  • e2e/libs/keywords/volume_keywords.py: added the get_volume_instance_manager method
  • e2e/libs/node_exec/node_exec.py: updated logging and added a node toleration
  • e2e/tests/regression/test_v2.robot: added a test case for volume resource cleanup

Assessment against linked issues

  • Clean up orphan devices when IM pod crashes [#9959]
  • Verify device cleanup for v2 volumes

Possibly related PRs

Suggested reviewers

  • chriscchien

Poem

🐰 Nodes dance, devices clean,
Volumes whisper their testing dream
Uncordon, list, and verify with glee
Longhorn's robustness, now we see!
A rabbit's test of pure delight 🚀


📜 Recent review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 75b888b and c7d5648.

📒 Files selected for processing (11)
  • e2e/keywords/k8s.resource (2 hunks)
  • e2e/keywords/longhorn.resource (2 hunks)
  • e2e/keywords/volume.resource (1 hunks)
  • e2e/libs/backing_image/crd.py (6 hunks)
  • e2e/libs/keywords/node_keywords.py (1 hunks)
  • e2e/libs/keywords/volume_keywords.py (3 hunks)
  • e2e/libs/keywords/workload_keywords.py (1 hunks)
  • e2e/libs/node/node.py (1 hunks)
  • e2e/libs/node_exec/node_exec.py (2 hunks)
  • e2e/tests/negative/node_drain.robot (1 hunks)
  • e2e/tests/regression/test_v2.robot (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (8)
  • e2e/libs/node_exec/node_exec.py
  • e2e/libs/node/node.py
  • e2e/libs/keywords/workload_keywords.py
  • e2e/libs/keywords/node_keywords.py
  • e2e/keywords/k8s.resource
  • e2e/tests/negative/node_drain.robot
  • e2e/libs/backing_image/crd.py
  • e2e/libs/keywords/volume_keywords.py
👮 Files not reviewed due to content moderation or server errors (3)
  • e2e/keywords/longhorn.resource
  • e2e/tests/regression/test_v2.robot
  • e2e/keywords/volume.resource

@c3y1huang c3y1huang force-pushed the 9959-orphan-device-and-dm-device-when-im-pod-crash branch from 35896a0 to 954f69a Compare December 17, 2024 04:18
@c3y1huang
Collaborator Author

@coderabbitai review

coderabbitai bot commented Dec 17, 2024

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (7)
e2e/libs/keywords/workload_keywords.py (1)

49-49: Consider standardizing logging format across similar methods.

For consistency, consider adding namespace information to logs in related methods like create_pod(). This would maintain a uniform logging pattern throughout the class.

Example for create_pod():

     def create_pod(self, pod_name, claim_name):
-        logging(f'Creating pod {pod_name} using pvc {claim_name}')
+        logging(f'Creating pod {pod_name} using pvc {claim_name} in namespace {namespace}')
         create_pod(new_busybox_manifest(pod_name, claim_name))
e2e/tests/regression/test_v2.robot (2)

82-93: Consider adding cleanup verification after test completion.

While the test properly sets up the volumes and writes data, it would be beneficial to verify that all resources are properly cleaned up after the test, even in failure scenarios.

Consider adding these verifications in the test teardown (a rough helper is sketched after this list):

  • Verify no orphaned devices remain
  • Ensure instance managers are properly terminated
  • Check for any leaked resources
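
A minimal sketch of such a teardown check in Python, assuming the list_dm_devices and list_volume_devices helpers added in e2e/libs/node/node.py return newline-separated device names and that v2 volumes surface devices named after the volume; the import path and helper name below are illustrative, not part of this PR:

# Hypothetical teardown helper built on the node library from this PR.
# The import path and device-naming assumptions are for illustration only.
from node import Node

def assert_no_orphan_devices(node_name, volume_names):
    node = Node()
    dm_devices = node.list_dm_devices(node_name).splitlines()
    volume_devices = node.list_volume_devices(node_name).splitlines()
    for volume_name in volume_names:
        # After cleanup, a volume should leave neither a dm device nor a
        # /dev/longhorn/<volume> block device behind on the node.
        assert volume_name not in dm_devices, \
            f"orphan dm device for {volume_name} on {node_name}"
        assert volume_name not in volume_devices, \
            f"orphan /dev/longhorn device for {volume_name} on {node_name}"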

95-103: Consider adding error handling for node operations.

The test should handle potential failures in node cordoning and instance manager deletion gracefully.

Consider (a node-state check helper is sketched after this list):

  • Adding timeout and retry logic for node operations
  • Verifying node state before proceeding with device checks
  • Adding cleanup steps if the test fails at this stage
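
One way to cover the node-state check is a small helper that polls the node with the official kubernetes Python client until the cordon has taken effect before any device assertions run; the function name and retry defaults here are illustrative:

import time

from kubernetes import client, config

def wait_for_node_cordoned(node_name, retry_counts=150, retry_interval=1):
    # Poll the node spec until unschedulable is set, so later device checks
    # only run once the cordon has actually been applied.
    config.load_kube_config()
    api = client.CoreV1Api()
    for _ in range(retry_counts):
        node = api.read_node(node_name)
        if node.spec.unschedulable:
            return
        time.sleep(retry_interval)
    raise AssertionError(
        f"node {node_name} was not cordoned within {retry_counts * retry_interval}s")
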
e2e/libs/node/node.py (2)

291-294: Consider adding error handling for dmsetup command.

While the implementation is clean and focused, it could benefit from error handling in case the dmsetup command is not available on the node.

 def list_dm_devices(self, node_name):
-    cmd = "dmsetup ls | awk '{print $1}'"
+    cmd = "if command -v dmsetup >/dev/null 2>&1; then dmsetup ls | awk '{print $1}'; else echo 'dmsetup command not found' >&2; exit 1; fi"
     res = NodeExec(node_name).issue_cmd(cmd)
     return res

296-299: Consider adding error handling for missing directory.

While the implementation is clean and focused, it could benefit from error handling in case the Longhorn device directory doesn't exist.

 def list_volume_devices(self, node_name):
-    cmd = "ls /dev/longhorn/"
+    cmd = "if [ -d /dev/longhorn/ ]; then ls /dev/longhorn/; else echo 'Longhorn device directory not found' >&2; exit 1; fi"
     res = NodeExec(node_name).issue_cmd(cmd)
     return res
e2e/keywords/volume.resource (1)

392-402: LGTM: Well-structured volume device assertion keyword.

The keyword follows the same pattern as the DM device assertion:

  • Consistent implementation style
  • Clear condition handling
  • Proper error handling

Consider adding documentation to clarify the expected format of device names and the difference between DM devices and volume devices.

e2e/tests/negative/node_drain.robot (1)

Line range hint 1-220: Well-structured test suite with comprehensive coverage.

The test suite demonstrates excellent test design practices:

  • Clear documentation for each test case with detailed steps
  • Proper test setup and teardown
  • Comprehensive coverage of node drain scenarios
  • Good use of Robot Framework's Given-When-Then style
  • Proper validation of data integrity after operations

Consider adding the following test scenarios if not covered elsewhere:

  1. Node drain during volume expansion
  2. Node drain with multiple concurrent workloads
  3. Node drain during backup/restore operations
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 5c4ab15 and 954f69a.

📒 Files selected for processing (11)
  • e2e/keywords/k8s.resource (2 hunks)
  • e2e/keywords/longhorn.resource (2 hunks)
  • e2e/keywords/volume.resource (1 hunks)
  • e2e/libs/backing_image/crd.py (6 hunks)
  • e2e/libs/keywords/node_keywords.py (1 hunks)
  • e2e/libs/keywords/volume_keywords.py (3 hunks)
  • e2e/libs/keywords/workload_keywords.py (1 hunks)
  • e2e/libs/node/node.py (1 hunks)
  • e2e/libs/node_exec/node_exec.py (2 hunks)
  • e2e/tests/negative/node_drain.robot (1 hunks)
  • e2e/tests/regression/test_v2.robot (2 hunks)
✅ Files skipped from review due to trivial changes (1)
  • e2e/libs/backing_image/crd.py
🔇 Additional comments (14)
e2e/libs/keywords/workload_keywords.py (1)

49-49: LGTM! Enhanced logging with namespace context.

The addition of namespace information to the log message improves observability and debugging capabilities, especially useful for tracking pod deletions across different namespaces.

e2e/libs/keywords/node_keywords.py (1)

79-83: LGTM! Clean implementation of device listing methods.

The new methods follow the established pattern in the class, providing clear interfaces to the underlying node functionality.
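
For reference, the wrappers likely take roughly the following shape; the method names and import path below are assumptions rather than verbatim code from e2e/libs/keywords/node_keywords.py:

# Rough shape of the keyword wrappers; names and import path are assumed.
from node import Node

class node_keywords:

    def __init__(self):
        self.node = Node()

    def list_dm_devices_on_node(self, node_name):
        # Thin pass-through so Robot tests can assert on dm devices directly.
        return self.node.list_dm_devices(node_name)

    def list_volume_devices_on_node(self, node_name):
        # Same pattern for the /dev/longhorn/ block devices.
        return self.node.list_volume_devices(node_name)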

e2e/keywords/longhorn.resource (1)

72-73: LGTM! Good refactoring of instance manager deletion.

The changes simplify the logic by directly retrieving the instance manager, making the code more maintainable and reducing duplication.

Also applies to: 78-79

e2e/tests/regression/test_v2.robot (2)

75-81: LGTM! Well-structured test case with clear documentation.

The test case effectively validates the resource cleanup behavior with good coverage of both v1 and v2 volumes.


104-116: Verify volume health check timeouts.

The test waits for volumes to become healthy but doesn't specify timeouts. This could lead to test hangs in failure scenarios.

Let's check the default timeout values:

✅ Verification successful

Let me search for the timeout configurations in the robot framework files.


Let me search for the implementation of the volume health check wait function.


Let me check the implementation of the volume status wait function to find the timeout configuration.


Let me check the values of the timeout constants used in the wait function.


Based on the findings, I can now generate the final response:

The volume health check has appropriate timeout configurations.

The wait_for_volume_healthy function uses the following timeout settings:

  • Default retry count: 150 attempts (RETRY_COUNTS)
  • Long retry count: 360 attempts (RETRY_COUNTS_LONG)
  • Retry interval: 1 second (RETRY_INTERVAL)

This means the health check will wait up to:

  • 150 seconds (2.5 minutes) by default
  • 360 seconds (6 minutes) for long operations

These timeouts are sufficient to handle normal volume operations while preventing indefinite hangs. The test framework also uses these same constants consistently across different test scenarios.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Search for volume health check timeout configurations
rg -A 2 "wait.*health" --glob '*.{py,robot,resource}'

Length of output: 89535


Script:

#!/bin/bash
# Search for volume health check timeout configurations in robot files
rg -A 2 "Wait for volume.*healthy" --glob '*.robot'

Length of output: 25981


Script:

#!/bin/bash
# Search for the implementation of volume health check wait function
rg -A 5 "def wait_for_volume.*healthy" --glob '*.{py,robot}'

Length of output: 2125


Script:

#!/bin/bash
# Search for the implementation of wait_for_volume_status function
rg -A 10 "def wait_for_volume_status" --glob '*.{py,robot}'

Length of output: 905


Script:

#!/bin/bash
# Search for the definition of RETRY constants
rg "RETRY_COUNTS|RETRY_INTERVAL|RETRY_COUNTS_LONG" --glob '*.py'

Length of output: 40264

e2e/libs/node_exec/node_exec.py (2)

54-54: LGTM! Improved log readability.

The addition of a newline character enhances log readability by clearly separating the command result from the log message.


103-108: LGTM! Essential toleration for cordoned nodes.

The added toleration is crucial as it enables command execution on cordoned nodes, which is necessary for node management operations. The configuration follows Kubernetes best practices.
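
The exact toleration added in node_exec.py is not quoted in this review, but tolerating the unschedulable taint is the usual way to let an exec pod be admitted on a cordoned node; a sketch of how that typically appears in a pod manifest dict (names and image are illustrative):

# Illustrative pod manifest fragment; the tolerations entry is the point.
pod_manifest = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "node-exec"},
    "spec": {
        "tolerations": [
            {
                # Cordoning marks the node unschedulable, which is surfaced as
                # the node.kubernetes.io/unschedulable taint; tolerating it
                # lets the exec pod still be placed on that node.
                "key": "node.kubernetes.io/unschedulable",
                "operator": "Exists",
                "effect": "NoSchedule",
            }
        ],
        "containers": [
            {
                "name": "shell",
                "image": "ubuntu:22.04",
                "command": ["sleep", "infinity"],
            }
        ],
    },
}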

e2e/keywords/k8s.resource (2)

75-77: LGTM! Well-structured node management keyword.

The new keyword follows consistent patterns with other node management keywords and provides essential functionality for uncordoning nodes by their index.
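
Under the hood such a keyword typically clears the node's unschedulable flag through the Kubernetes API, the programmatic equivalent of kubectl uncordon; a minimal Python sketch (not the actual library code):

from kubernetes import client, config

def uncordon_node(node_name):
    # Clear spec.unschedulable so pods can be scheduled on the node again.
    config.load_kube_config()
    api = client.CoreV1Api()
    api.patch_node(node_name, {"spec": {"unschedulable": False}})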


124-125: LGTM! Fixed argument name typo.

Corrected the argument name from instance_manger to instance_manager for better consistency.

e2e/libs/keywords/volume_keywords.py (3)

17-17: LGTM: Clean import addition.

Added import for VolumeRest to support the new instance manager retrieval functionality.


74-78: LGTM: Well-implemented instance manager retrieval.

The new method get_volume_instance_manager is well-implemented with proper validation (a rough sketch follows the list below):

  • Retrieves volume details using VolumeRest
  • Validates single controller existence
  • Returns the instance manager name
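
A rough sketch of what such a method could look like, assuming the REST volume payload exposes a controllers list whose entries carry an instanceManagerName field (the field names are assumptions, not confirmed by this review):

def get_volume_instance_manager(volume):
    # volume: dict payload returned by the Longhorn REST client for a volume.
    controllers = volume.get("controllers", [])
    # Validate that exactly one controller exists before reading its
    # instance manager, as described above.
    assert len(controllers) == 1, f"expected 1 controller, got {len(controllers)}"
    return controllers[0]["instanceManagerName"]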

133-133: LGTM: Clean formatting.

Removed unnecessary line break, improving code readability.

e2e/keywords/volume.resource (1)

380-390: LGTM: Well-structured DM device assertion keyword.

The keyword implementation is clean and includes:

  • Proper variable resolution
  • Clear condition handling
  • Appropriate error handling for invalid conditions
e2e/tests/negative/node_drain.robot (1)

220-220: Fix typo in variable name.

Good catch on fixing the typo from instance_manger to instance_manager. This maintains consistency with the variable naming across the codebase.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (1)
e2e/tests/regression/test_v2.robot (1)

99-118: LGTM: Comprehensive test assertions with proper validation flow

The test case thoroughly validates the cleanup process by:

  • Verifying device cleanup for affected volumes
  • Ensuring v1 volumes remain unaffected
  • Confirming data integrity after recovery
  • Using appropriate assertions for both positive and negative cases

The loop structure ensures reliability by testing multiple iterations.

Consider adding error handling for node operations to make the test more robust in case of infrastructure issues:

  • Timeout handling for cordon/uncordon operations
  • Retry mechanism for health checks
  • Cleanup steps in case of intermediate failures
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 954f69a and 75b888b.

📒 Files selected for processing (6)
  • e2e/keywords/k8s.resource (2 hunks)
  • e2e/keywords/volume.resource (1 hunks)
  • e2e/libs/keywords/node_keywords.py (1 hunks)
  • e2e/libs/node/node.py (1 hunks)
  • e2e/libs/node_exec/node_exec.py (2 hunks)
  • e2e/tests/regression/test_v2.robot (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (5)
  • e2e/libs/node_exec/node_exec.py
  • e2e/libs/node/node.py
  • e2e/libs/keywords/node_keywords.py
  • e2e/keywords/k8s.resource
  • e2e/keywords/volume.resource
🔇 Additional comments (4)
e2e/tests/regression/test_v2.robot (4)

17-17: LGTM: Resource import is properly placed

The addition of k8s.resource is well-organized with other resource imports and provides necessary node management keywords for the new test case.


75-81: LGTM: Well-structured test case with clear documentation

The test case is properly structured with:

  • Clear, descriptive name
  • Appropriate test tag
  • Comprehensive documentation explaining the purpose and linking to the related issue

82-94: LGTM: Comprehensive test setup with proper volume initialization

The setup creates a robust test environment by:

  • Creating both v2 and v1 volumes for comparison
  • Properly waiting for volume health
  • Including data writing for validation

95-98: Verify instance manager deletion success

While the test flow is logical, it would be beneficial to verify that the instance manager deletion was successful before proceeding with the assertions.

Consider adding a verification step after line 97:

 When Cordon node 0
 And Delete instance-manager of volume 0
+And Wait for instance-manager deletion complete

Signed-off-by: Chin-Ya Huang <[email protected]>
Signed-off-by: Chin-Ya Huang <[email protected]>
… is deleted

longhorn/longhorn-9959
longhorn/longhorn-9989

Signed-off-by: Chin-Ya Huang <[email protected]>
@c3y1huang c3y1huang force-pushed the 9959-orphan-device-and-dm-device-when-im-pod-crash branch from 75b888b to c7d5648 Compare December 19, 2024 09:09
Contributor

@chriscchien chriscchien left a comment

LGTM, I have verified longhorn/longhorn#9959 without any problem.

@c3y1huang c3y1huang merged commit 19fa09f into longhorn:master Dec 19, 2024
9 checks passed
@c3y1huang c3y1huang deleted the 9959-orphan-device-and-dm-device-when-im-pod-crash branch December 19, 2024 09:29