
test(robot): Add case DR Volume Node Reboot During Restoration #2165

Open
wants to merge 2 commits into base: master

Conversation

roger-ryao (Contributor) commented Nov 15, 2024

Which issue(s) this PR fixes:

longhorn/longhorn#8425

What this PR does / why we need it:

Implement the manual test case "The node the DR volume attached to is rebooted" in Robot Framework

Special notes for your reviewer:

Additional documentation or context

Summary by CodeRabbit

  • New Features

    • Introduced a new test file for validating disaster recovery (DR) volume node reboot functionality, covering both initial and incremental restoration scenarios.
    • Added new keywords for creating volume backups and monitoring volume restoration processes.
  • Bug Fixes

    • Removed redundant backup verification steps in existing tests to streamline the process.
  • Documentation

    • Deleted outdated documentation related to node reboot scenarios with DR volumes.

@roger-ryao roger-ryao self-assigned this Nov 15, 2024
@roger-ryao roger-ryao requested a review from a team as a code owner November 15, 2024 09:33

coderabbitai bot commented Nov 15, 2024

Walkthrough

This pull request deletes the manual documentation for the disaster recovery (DR) volume node reboot scenario and introduces new methods and keywords for managing volume restoration. New keywords support backup creation and restoration-start monitoring, an existing test was adjusted to reuse them, and a new test file validates DR volume node reboots during both initial and incremental restoration.

Changes

  • docs/content/manual/pre-release/node-not-ready/node-restart/dr-volume-node-rebooted.md: Deleted documentation for DR volume node reboot scenarios.
  • e2e/keywords/backup.resource: Added keyword Volume ${volume_id} backup ${backup_id} should be able to create.
  • e2e/keywords/volume.resource: Added keyword Wait for volume ${volume_id} restoration from backup ${backup_id} start.
  • e2e/libs/keywords/volume_keywords.py: Added method wait_for_volume_restoration_start(self, volume_name, backup_name).
  • e2e/libs/volume/base.py: Added abstract method wait_for_volume_restoration_start(self, volume_name, backup_name).
  • e2e/libs/volume/crd.py: Added method wait_for_volume_restoration_start(self, volume_name, backup_name, progress=0).
  • e2e/libs/volume/rest.py: Added method wait_for_volume_restoration_start(self, volume_name), stubbed as NotImplemented.
  • e2e/libs/volume/volume.py: Added method wait_for_volume_restoration_start(self, volume_name, backup_name).
  • e2e/tests/negative/test_backup_listing.robot: Removed keyword Volume ${volume_id} backup ${backup_id} should be able to create.
  • e2e/tests/negative/test_dr_volume_node_reboot.robot: Created new test file validating DR volume node reboot functionality with two test cases.

Possibly related PRs

  • test(robot): volume should not reattach after node eviction #2164: That PR's test case covers a volume not reattaching after node eviction, which is closely related to this PR's scenario of rebooting the node a DR volume is attached to. Both manage volume state during disruptive node operations.

Suggested reviewers

  • c3y1huang
  • chriscchien
  • yangchiu

🐇 In the land where the volumes play,
A reboot's dance, come what may.
With backups safe and data bright,
We hop along, all feels just right!
So let’s cheer for changes made,
In our DR world, we’re unafraid! 🌟



coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Outside diff range and nitpick comments (10)
e2e/keywords/backup.resource (1)

51-54: Consider adding a retry mechanism for backup verification

Good job encapsulating the common backup creation workflow into a single keyword! This follows good automation practice and promotes reuse.

However, in distributed systems, backup operations might have eventual consistency. Consider adding a retry mechanism for the verification steps.

Here's a suggested implementation:

 Volume ${volume_id} backup ${backup_id} should be able to create
     Create backup ${backup_id} for volume ${volume_id}
-    Verify backup list contains no error for volume ${volume_id}
-    Verify backup list contains backup ${backup_id} of volume ${volume_id}
+    Wait Until Keyword Succeeds    5x    2s    Verify backup list contains no error for volume ${volume_id}
+    Wait Until Keyword Succeeds    5x    2s    Verify backup list contains backup ${backup_id} of volume ${volume_id}

Also, consider a more concise name like Create And Verify Volume Backup while keeping the same parameters.

e2e/tests/negative/test_dr_volume_node_reboot.robot (3)

24-29: Add documentation for configuration variables.

Consider adding comments explaining:

  • The significance of the retry count and interval values
  • Why the default loop count is 5 when the PR suggests running with 3
  • The implications of these values on test duration and reliability
 *** Variables ***
+# Maximum number of retries for waiting operations
 ${RETRY_COUNT}    400
+# Number of times to repeat the test scenario
 ${LOOP_COUNT}    5
+# Interval between retries in seconds
 ${RETRY_INTERVAL}    1
+# Data engine version to use for volume creation
 ${DATA_ENGINE}    v1
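
For reference on these values: with ${RETRY_COUNT} at 400 and ${RETRY_INTERVAL} at 1 second, each wait operation is bounded at roughly 400 seconds (about 6.7 minutes), so a stalled restoration can add up to that much to the test duration per wait.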

61-91: Consider applying the same loop structure as the first test case.

While the test case properly validates the incremental restore scenario, it could benefit from:

  1. Using the same loop structure as the first test case for consistency
  2. Adding cleanup between iterations if looped

Example structure:

 DR Volume Node Reboot During Incremental Restoration
     [Documentation]    Test DR volume node reboot During Incremental Restoration
     ...
     Given Create volume 0 with    dataEngine=${DATA_ENGINE}
     And Attach volume 0
     And Wait for volume 0 healthy
     And Write data 0 to volume 0
     Then Volume 0 backup 0 should be able to create
+    FOR    ${i}    IN RANGE    ${LOOP_COUNT}
         Then Create DR volume 1 from backup 0    dataEngine=${DATA_ENGINE}
         ...
         And Check volume 1 data is backup 1
+        Then Detach volume 1
+        And Delete volume 1
+    END

1-91: Consider adding enhanced logging for better debugging.

While the test implementation successfully covers the PR objectives, consider adding:

  1. Timestamps for key operations (backup creation, restoration start/end, reboot)
  2. Node status logging before and after reboot
  3. Volume status logging at critical points

This will help in:

  • Analyzing test performance
  • Debugging test failures
  • Understanding the timing of various operations
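
As a rough sketch of what such instrumentation could look like at the Python keyword layer (log_checkpoint is a hypothetical helper name; the logging import assumes the utility.utility helper that crd.py already uses in this diff):

import time

from utility.utility import logging  # assumed: same helper used elsewhere in e2e/libs

def log_checkpoint(stage, volume_name, extra=""):
    # Emit a timestamped marker so backup creation, restoration
    # start/end, and node reboot events can be correlated when
    # analyzing test duration or debugging failures.
    timestamp = time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime())
    logging(f"[{timestamp}] {stage}: volume={volume_name} {extra}".rstrip())

A test keyword could then call this immediately before and after the reboot step, for example log_checkpoint("node reboot", volume_name, extra=f"node={node_name}").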
e2e/libs/volume/volume.py (1)

81-83: Add parameter validation and documentation.

While the implementation follows the class's delegation pattern, consider these improvements:

  1. Add docstring explaining the method's purpose, parameters, and return value
  2. Add parameter validation for volume_name and backup_name
 def wait_for_volume_restoration_start(self, volume_name, backup_name):
+    """Wait for volume restoration to start.
+
+    Args:
+        volume_name (str): Name of the volume being restored
+        backup_name (str): Name of the backup being restored from
+
+    Raises:
+        ValueError: If volume_name or backup_name is None or empty
+    """
+    if not volume_name or not backup_name:
+        raise ValueError("volume_name and backup_name must not be empty")
     self.volume.wait_for_volume_restoration_start(volume_name, backup_name)
e2e/libs/keywords/volume_keywords.py (2)

283-285: LGTM! Consider adding parameter validation.

The implementation follows the class patterns and properly integrates with the Volume class. However, consider adding parameter validation for robustness.

Consider adding parameter validation:

 def wait_for_volume_restoration_start(self, volume_name, backup_name):
+    if not volume_name or not backup_name:
+        raise ValueError("Volume name and backup name must not be empty")
     logging(f'Waiting for volume {volume_name} restoration from {backup_name} start')
     self.volume.wait_for_volume_restoration_start(volume_name, backup_name)

283-285: Well-structured addition for testing DR volume restoration phases.

This method enhances the test framework's ability to verify DR volume behavior at different stages of the restoration process, particularly during node reboots. It maintains a clean separation of concerns by delegating the actual implementation to the Volume class while providing a consistent interface for test automation.

e2e/libs/volume/crd.py (3)

360-383: Consider refactoring for improved readability and maintainability

The current implementation has nested loops and conditions that make the logic flow harder to follow. Consider extracting the engine status check into a separate helper method.

 def wait_for_volume_restoration_start(self, volume_name, backup_name,
                                       progress=0):
+    def is_restoration_started(engine):
+        # Check restore status
+        for status in engine['status']['restoreStatus'].values():
+            if status['state'] == "in_progress" and status['progress'] > progress:
+                return True
+        # Handle quick restorations
+        return engine['status']['lastRestoredBackup'] == backup_name
+
     started = False
     for _i in range(self.retry_count):
         try:
             engines = self.engine.get_engines(volume_name)
             for engine in engines:
-                for status in engine['status']['restoreStatus'].values():
-                    if status['state'] == "in_progress" and status['progress'] > progress:
-                        started = True
-                        break
-                #  Sometime the restore time is pretty short
-                #  and the test may not be able to catch the intermediate status.
-                if engine['status']['lastRestoredBackup'] == backup_name:
-                    started = True
-                if started:
+                if is_restoration_started(engine):
+                    started = True
                     break
             if started:
                 break
         except Exception as e:
             logging(f"Getting volume {volume_name} engines error: {e}")
         time.sleep(self.retry_interval)
     assert started

Also, consider adding docstring to document the method's purpose and parameters:

 def wait_for_volume_restoration_start(self, volume_name, backup_name,
                                       progress=0):
+    """Wait for volume restoration to start.
+
+    Args:
+        volume_name: Name of the volume being restored
+        backup_name: Name of the backup being restored from
+        progress: Minimum progress threshold to consider restoration started
+
+    Raises:
+        AssertionError: If restoration doesn't start within retry limit
+    """


363-363: Fix unused loop variable warning

The loop control variable i is not used within the loop body. Rename it to _i to indicate it's intentionally unused.

-    for i in range(self.retry_count):
+    for _i in range(self.retry_count):


372-373: Consider improving comment formatting

The comment about quick restorations could be more clearly formatted.

-    #  Sometime the restore time is pretty short
-    #  and the test may not be able to catch the intermediate status.
+    # Sometimes the restore time is very short and the test may not
+    # be able to catch the intermediate status.
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR, between commits 1e777ba and 548668d.

📒 Files selected for processing (10)
  • docs/content/manual/pre-release/node-not-ready/node-restart/dr-volume-node-rebooted.md (0 hunks)
  • e2e/keywords/backup.resource (1 hunks)
  • e2e/keywords/volume.resource (1 hunks)
  • e2e/libs/keywords/volume_keywords.py (1 hunks)
  • e2e/libs/volume/base.py (1 hunks)
  • e2e/libs/volume/crd.py (1 hunks)
  • e2e/libs/volume/rest.py (1 hunks)
  • e2e/libs/volume/volume.py (1 hunks)
  • e2e/tests/negative/test_backup_listing.robot (0 hunks)
  • e2e/tests/negative/test_dr_volume_node_reboot.robot (1 hunks)
💤 Files with no reviewable changes (2)
  • docs/content/manual/pre-release/node-not-ready/node-restart/dr-volume-node-rebooted.md
  • e2e/tests/negative/test_backup_listing.robot
🧰 Additional context used
🪛 Ruff
e2e/libs/volume/crd.py

363-363: Loop control variable i not used within loop body

Rename unused i to _i

(B007)

🔇 Additional comments (7)
e2e/keywords/backup.resource (1)

51-54: Verify usage in DR volume node reboot tests

The keyword implementation looks good and aligns well with the PR objectives for testing DR volume node reboot scenarios.

Let's verify its usage in the test implementation:

✅ Verification successful

Keyword is correctly used in DR volume node reboot tests

The verification confirms that the keyword is actively used in the DR volume node reboot test scenarios:

  • Used 3 times in e2e/tests/negative/test_dr_volume_node_reboot.robot for backup creation
  • Also utilized in other test files, showing its reusability across the test suite
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check if the new keyword is used in the DR volume node reboot test
# Expected: The keyword should be used to set up test preconditions

rg -l "Volume \${.*} backup \${.*} should be able to create" "test_dr_volume_node_reboot.robot"

Length of output: 235


Script:

#!/bin/bash
# Let's try a different approach to find the test file and its usage
# First, find the test file
fd "test_dr_volume_node_reboot.robot"

# Then search for any backup-related keywords in it
rg "backup" -A 2 "e2e/test_dr_volume_node_reboot.robot" || true

# Also search in the entire e2e directory for this test file
fd "test_dr_volume_node_reboot" "e2e/"

# Let's also check where this backup keyword is actually used
rg "Volume.*backup.*should be able to create" "e2e/" || true

Length of output: 1065

e2e/libs/volume/base.py (1)

95-98: LGTM! Well-structured addition for DR volume testing

The new abstract method wait_for_volume_restoration_start is a well-designed addition that:

  • Complements the existing wait_for_volume_restoration_completed method
  • Enables precise timing control for node reboot tests during DR volume restoration
  • Follows the established pattern of the abstract base class

This addition directly supports the PR's objective of implementing DR volume node reboot test cases.
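
For context, the abstract declaration under review presumably looks close to the following minimal sketch (the class name Base and the NotImplemented body convention are inferred from the rest.py stub reviewed below, not quoted from the diff):

from abc import ABC, abstractmethod

class Base(ABC):

    @abstractmethod
    def wait_for_volume_restoration_start(self, volume_name, backup_name):
        return NotImplemented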

e2e/tests/negative/test_dr_volume_node_reboot.robot (2)

31-60: Enhance reboot verification in the test case.

The test case follows a good structure with proper cleanup between iterations. However, consider adding explicit verification for the reboot operation:

  1. Verify that the node actually went down
  2. Add timeout handling for the reboot operation
  3. Log the time taken for node recovery

Let's check the implementation of the reboot keyword:

#!/bin/bash
# Description: Verify the implementation of the reboot keyword and related volume operations

# Look for reboot implementation
rg -A 10 "Reboot volume.*volume node" ../keywords/
# Look for volume attachment verification
rg -A 5 "Wait for volume.*healthy" ../keywords/
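
As a sketch of the first two suggestions (this assumes direct use of the Kubernetes Python client; wait_for_node_ready_status is a hypothetical helper, not an existing framework keyword):

import time

from kubernetes import client

def wait_for_node_ready_status(node_name, expected_status, timeout=300, interval=5):
    # Poll the node's Ready condition until it reports the expected
    # status: "False" to confirm the node actually went down, then
    # "True" to confirm recovery. Returns the elapsed seconds so the
    # test can log how long the transition took.
    api = client.CoreV1Api()
    start = time.time()
    while time.time() - start < timeout:
        node = api.read_node(node_name)
        for condition in node.status.conditions:
            if condition.type == "Ready" and condition.status == expected_status:
                return time.time() - start
        time.sleep(interval)
    raise TimeoutError(f"node {node_name} Ready != {expected_status} after {timeout}s")

The reboot step would then bracket the power cycle with wait_for_node_ready_status(node, "False") followed by wait_for_node_ready_status(node, "True"), logging both durations.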

1-23: LGTM! Well-structured test setup with proper documentation.

The test file is properly organized with clear documentation, appropriate test tags, and necessary resource imports. The manual test tag aligns with the PR objectives.

Let's verify the implementation of setup/teardown procedures:

✅ Verification successful

Setup and teardown procedures are properly implemented and consistent with test requirements

The test setup and teardown procedures are well-defined in common.resource and follow the established pattern used across all test files in the codebase. The setup includes essential initialization steps like:

  • Kubernetes API client initialization
  • Network latency configuration
  • Backup store setup
  • Data engine configuration

The teardown procedure includes proper cleanup of:

  • Node power states
  • Node scheduling
  • Network configurations
  • Node execution states
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify the implementation of test setup and teardown procedures
# Look for the implementation of these procedures in the imported resources

rg -A 5 "Set test environment" ../keywords/
rg -A 5 "Cleanup test resources" ../keywords/

Length of output: 208


Script:

#!/bin/bash
# Let's try searching from the repository root instead
# First, find the keywords directory
fd -t d "keywords"

# Then search for the setup/teardown implementations
rg -A 5 "Set test environment"
rg -A 5 "Cleanup test resources"

Length of output: 19766


Script:

#!/bin/bash
# Let's check the implementation details of setup/teardown in common.resource
rg -B 5 -A 10 "Set test environment" e2e/keywords/common.resource

Length of output: 666

e2e/libs/volume/volume.py (1)

81-83: Implementation aligns well with PR objectives.

The new method is a necessary addition for testing DR volume node reboot scenarios during restoration. It complements the existing wait_for_volume_restoration_completed method and follows the class's established patterns.

e2e/keywords/volume.resource (1)

173-176: LGTM! Verify the underlying function implementation.

The new keyword follows consistent patterns and integrates well with existing volume restoration monitoring capabilities. Let's verify the implementation of the underlying function:

Consider adding documentation.

Consider adding a comment to document the purpose and behavior of this keyword, similar to other keywords in the file. For example:

+# Waits for the restoration process to start for a volume from the specified backup
 Wait for volume ${volume_id} restoration from backup ${backup_id} start
     ${volume_name} =    generate_name_with_suffix    volume    ${volume_id}
     ${backup_name} =    get_backup_name    ${backup_id}
     wait_for_volume_restoration_start    ${volume_name}    ${backup_name}
✅ Verification successful

Implementation verified and properly integrated

The keyword's underlying implementation is well-structured and follows a clear hierarchy:

  • The Robot Framework keyword in volume.resource calls wait_for_volume_restoration_start in volume_keywords.py
  • volume_keywords.py logs the operation and delegates to the volume implementation
  • The concrete implementation in e2e/libs/volume/crd.py handles the actual monitoring logic with proper retry mechanism

The implementation aligns with the test objectives and integrates well with the existing volume restoration monitoring capabilities.
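
Condensed, the delegation chain described above looks roughly like this (an illustrative sketch with stubbed wiring, not the exact source; the method bodies match the pass-through calls quoted elsewhere in this review):

from utility.utility import logging  # assumed: e2e/libs logging helper

class CRD:
    # Concrete implementation with the retry loop (see e2e/libs/volume/crd.py)
    def wait_for_volume_restoration_start(self, volume_name, backup_name):
        ...

class Volume:
    def __init__(self, strategy):
        self.volume = strategy  # CRD (or REST) implementation

    def wait_for_volume_restoration_start(self, volume_name, backup_name):
        self.volume.wait_for_volume_restoration_start(volume_name, backup_name)

class volume_keywords:
    def __init__(self):
        self.volume = Volume(CRD())

    def wait_for_volume_restoration_start(self, volume_name, backup_name):
        logging(f'Waiting for volume {volume_name} restoration from {backup_name} start')
        self.volume.wait_for_volume_restoration_start(volume_name, backup_name)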

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify the implementation of wait_for_volume_restoration_start

# Check for the function definition in volume_keywords.py
echo "Checking volume_keywords.py for wait_for_volume_restoration_start implementation..."
rg -A 5 "def wait_for_volume_restoration_start" ../libs/keywords/volume_keywords.py

# Check for any related test cases
echo "Checking for related test cases..."
rg "Wait for volume .* restoration from backup .* start" ../suites/

Length of output: 538


Script:

#!/bin/bash
# Let's try searching in the entire repository for the implementation
echo "Searching for wait_for_volume_restoration_start implementation..."
rg -A 5 "def wait_for_volume_restoration_start"

echo "Searching for volume-related Python files..."
fd -e py volume

echo "Searching for Robot Framework resource files..."
fd -e resource

Length of output: 4137

e2e/libs/volume/rest.py (1)

83-85: ⚠️ Potential issue

Implementation needed for volume restoration monitoring

The NotImplemented placeholder needs to be replaced with an actual implementation. Please implement this method following the class's established patterns:

  1. Use retry mechanism with self.retry_count and self.retry_interval
  2. Add proper error handling and logging
  3. Assert the expected restoration state

Here's a suggested implementation based on the class's patterns:

def wait_for_volume_restoration_start(self, volume_name):
-    return NotImplemented
+    started = False
+    for i in range(self.retry_count):
+        try:
+            volume = self.get(volume_name)
+            logging(f"Waiting for volume {volume_name} restoration to start... ({i})")
+            
+            # Check if restoration has started
+            if volume.restoreStatus and volume.restoreStatus.state == "in_progress":
+                started = True
+                break
+        except Exception as e:
+            logging(f"Failed to get volume {volume_name} restore status with error: {e}")
+        time.sleep(self.retry_interval)
+    
+    assert started, f"Timeout waiting for volume {volume_name} restoration to start"

This implementation:

  • Follows the class's retry and error handling patterns
  • Provides detailed logging for monitoring
  • Checks the volume's restore status
  • Asserts the expected state

Let's verify the restore status field usage in the codebase:

✅ Verification successful

Based on the search results, I can now generate a final response since I have sufficient information about how restore status is handled across the codebase.

Implementation suggestion needs revision for restore status handling

The suggested implementation needs to be adjusted based on the codebase patterns. The restore status is consistently accessed as an array/list across the codebase:

def wait_for_volume_restoration_start(self, volume_name):
-    return NotImplemented
+    started = False
+    for i in range(self.retry_count):
+        try:
+            volume = self.get(volume_name)
+            logging(f"Waiting for volume {volume_name} restoration to start... ({i})")
+
+            # Check if restoration has started
+            for status in volume.restoreStatus:
+                if status.state == "in_progress" and status.progress > 0:
+                    started = True
+                    break
+        except Exception as e:
+            logging(f"Failed to get volume {volume_name} restore status: {e}")
+        if started:
+            break
+        time.sleep(self.retry_interval)
+
+    assert started, f"Timeout waiting for volume {volume_name} restoration to start"

This revised implementation:

  • Correctly handles restoreStatus as an array/list, matching usage in test_settings.py and test_common.py
  • Uses the same state and progress check pattern seen in common.py
  • Exits the retry loop as soon as restoration is observed
  • Maintains consistency with the error handling and logging patterns (the logging helper used in crd.py)
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify the usage of restore status field in volume objects
# Expected: Find examples of restore status field usage

# Search for restore status usage patterns
rg -A 3 'restoreStatus|restore_status'

Length of output: 3141
