test(robot): Add case DR Volume Node Reboot During Restoration #2165
base: master
Conversation
longhorn/longhorn#8425 Signed-off-by: Roger Yao <[email protected]>
Walkthrough

This pull request includes the deletion of a documentation file related to disaster recovery (DR) volume node reboot scenarios and the introduction of several new methods and keywords for managing volume restoration processes. New keywords were added to facilitate backup creation and restoration monitoring, while existing keywords were modified to streamline testing. Additionally, a new test file was created to validate the functionality of DR volume node reboots during both initial and incremental restoration scenarios.

Changes
Possibly related PRs
Suggested reviewers
Actionable comments posted: 0
🧹 Outside diff range and nitpick comments (10)
e2e/keywords/backup.resource (1)
`51-54`: **Consider adding a retry mechanism for backup verification**

Good job on encapsulating the common backup creation workflow into a single keyword! This follows good automation practices and maintains code reusability.

However, in distributed systems, backup operations may be subject to eventual consistency. Consider adding a retry mechanism for the verification steps.
Here's a suggested implementation:
```diff
 Volume ${volume_id} backup ${backup_id} should be able to create
     Create backup ${backup_id} for volume ${volume_id}
-    Verify backup list contains no error for volume ${volume_id}
-    Verify backup list contains backup ${backup_id} of volume ${volume_id}
+    Wait Until Keyword Succeeds    5x    2s    Verify backup list contains no error for volume ${volume_id}
+    Wait Until Keyword Succeeds    5x    2s    Verify backup list contains backup ${backup_id} of volume ${volume_id}
```

Also, consider a more concise name like `Create And Verify Volume Backup` while keeping the same parameters.

e2e/tests/negative/test_dr_volume_node_reboot.robot (3)
`24-29`: **Add documentation for configuration variables.**

Consider adding comments explaining:
- The significance of the retry count and interval values
- Why the default loop count is 5 when the PR suggests running with 3
- The implications of these values on test duration and reliability
```diff
 *** Variables ***
+# Maximum number of retries for waiting operations
 ${RETRY_COUNT}       400
+# Number of times to repeat the test scenario
 ${LOOP_COUNT}        5
+# Interval between retries in seconds
 ${RETRY_INTERVAL}    1
+# Data engine version to use for volume creation
 ${DATA_ENGINE}       v1
```
`61-91`: **Consider applying the same loop structure as the first test case.**

While the test case properly validates the incremental restore scenario, it could benefit from:
- Using the same loop structure as the first test case for consistency
- Adding cleanup between iterations if looped
Example structure:
```diff
 DR Volume Node Reboot During Incremental Restoration
     [Documentation]    Test DR volume node reboot During Incremental Restoration
     ...
     Given Create volume 0 with    dataEngine=${DATA_ENGINE}
     And Attach volume 0
     And Wait for volume 0 healthy
     And Write data 0 to volume 0
     Then Volume 0 backup 0 should be able to create
+    FOR    ${i}    IN RANGE    ${LOOP_COUNT}
     Then Create DR volume 1 from backup 0    dataEngine=${DATA_ENGINE}
     ...
     And Check volume 1 data is backup 1
+    Then Detach volume 1
+    And Delete volume 1
+    END
```
`1-91`: **Consider adding enhanced logging for better debugging.**

While the test implementation successfully covers the PR objectives, consider adding:
- Timestamps for key operations (backup creation, restoration start/end, reboot)
- Node status logging before and after reboot
- Volume status logging at critical points
This will help in:
- Analyzing test performance
- Debugging test failures
- Understanding the timing of various operations
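For the Python keyword layer, timestamping could be as simple as the sketch below; `log_checkpoint` and `timed_step` are hypothetical helpers, not existing suite APIs:

```python
import logging
import time
from contextlib import contextmanager

# Hypothetical helpers for timestamped diagnostics; not part of the existing
# e2e libraries.
def log_checkpoint(message):
    logging.info("[%s] %s", time.strftime("%Y-%m-%dT%H:%M:%S"), message)

@contextmanager
def timed_step(name):
    # Wrap a critical operation (backup, restore start, reboot) to record
    # when it began and how long it took.
    start = time.monotonic()
    log_checkpoint(f"{name} started")
    try:
        yield
    finally:
        log_checkpoint(f"{name} finished after {time.monotonic() - start:.1f}s")
```

A keyword implementation could then wrap the reboot or restore-wait call in `with timed_step("node reboot"):` so the elapsed time appears directly in the test log.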
e2e/libs/volume/volume.py (1)
`81-83`: **Add parameter validation and documentation.**

While the implementation follows the class's delegation pattern, consider these improvements:
- Add docstring explaining the method's purpose, parameters, and return value
- Add parameter validation for `volume_name` and `backup_name`
```diff
 def wait_for_volume_restoration_start(self, volume_name, backup_name):
+    """Wait for volume restoration to start.
+
+    Args:
+        volume_name (str): Name of the volume being restored
+        backup_name (str): Name of the backup being restored from
+
+    Raises:
+        ValueError: If volume_name or backup_name is None or empty
+    """
+    if not volume_name or not backup_name:
+        raise ValueError("volume_name and backup_name must not be empty")
     self.volume.wait_for_volume_restoration_start(volume_name, backup_name)
```

e2e/libs/keywords/volume_keywords.py (2)
`283-285`: **LGTM! Consider adding parameter validation.**

The implementation follows the class patterns and properly integrates with the Volume class. However, consider adding parameter validation for robustness.
Consider adding parameter validation:
```diff
 def wait_for_volume_restoration_start(self, volume_name, backup_name):
+    if not volume_name or not backup_name:
+        raise ValueError("Volume name and backup name must not be empty")
     logging(f'Waiting for volume {volume_name} restoration from {backup_name} start')
     self.volume.wait_for_volume_restoration_start(volume_name, backup_name)
```
`283-285`: **Well-structured addition for testing DR volume restoration phases.**

This method enhances the test framework's ability to verify DR volume behavior at different stages of the restoration process, particularly during node reboots. It maintains a clean separation of concerns by delegating the actual implementation to the Volume class while providing a consistent interface for test automation.
e2e/libs/volume/crd.py (3)
`360-383`: **Consider refactoring for improved readability and maintainability**

The current implementation has nested loops and conditions that make the logic flow harder to follow. Consider extracting the engine status check into a separate helper method.
```diff
 def wait_for_volume_restoration_start(self, volume_name, backup_name, progress=0):
+    def is_restoration_started(engine):
+        # Check restore status
+        for status in engine['status']['restoreStatus'].values():
+            if status['state'] == "in_progress" and status['progress'] > progress:
+                return True
+        # Handle quick restorations
+        return engine['status']['lastRestoredBackup'] == backup_name
+
     started = False
     for _i in range(self.retry_count):
         try:
             engines = self.engine.get_engines(volume_name)
             for engine in engines:
-                for status in engine['status']['restoreStatus'].values():
-                    if status['state'] == "in_progress" and status['progress'] > progress:
-                        started = True
-                        break
-                # Sometime the restore time is pretty short
-                # and the test may not be able to catch the intermediate status.
-                if engine['status']['lastRestoredBackup'] == backup_name:
-                    started = True
-                if started:
+                if is_restoration_started(engine):
+                    started = True
                     break
             if started:
                 break
         except Exception as e:
             logging(f"Getting volume {volume_name} engines error: {e}")
         time.sleep(self.retry_interval)
     assert started
```

Also, consider adding a docstring to document the method's purpose and parameters:
```diff
 def wait_for_volume_restoration_start(self, volume_name, backup_name, progress=0):
+    """Wait for volume restoration to start.
+
+    Args:
+        volume_name: Name of the volume being restored
+        backup_name: Name of the backup being restored from
+        progress: Minimum progress threshold to consider restoration started
+
+    Raises:
+        AssertionError: If restoration doesn't start within retry limit
+    """
```

🧰 Tools
🪛 Ruff
363-363: Loop control variable `i` not used within loop body. Rename unused `i` to `_i`. (B007)
`363-363`: **Fix unused loop variable warning**

The loop control variable `i` is not used within the loop body. Rename it to `_i` to indicate it's intentionally unused.

```diff
-        for i in range(self.retry_count):
+        for _i in range(self.retry_count):
```

🧰 Tools
🪛 Ruff
363-363: Loop control variable `i` not used within loop body. Rename unused `i` to `_i`. (B007)
`372-373`: **Consider improving comment formatting**

The comment about quick restorations could be more clearly formatted.
```diff
-    # Sometime the restore time is pretty short
-    # and the test may not be able to catch the intermediate status.
+    # Sometimes the restore time is very short and the test may not
+    # be able to catch the intermediate status.
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (10)
- docs/content/manual/pre-release/node-not-ready/node-restart/dr-volume-node-rebooted.md (0 hunks)
- e2e/keywords/backup.resource (1 hunks)
- e2e/keywords/volume.resource (1 hunks)
- e2e/libs/keywords/volume_keywords.py (1 hunks)
- e2e/libs/volume/base.py (1 hunks)
- e2e/libs/volume/crd.py (1 hunks)
- e2e/libs/volume/rest.py (1 hunks)
- e2e/libs/volume/volume.py (1 hunks)
- e2e/tests/negative/test_backup_listing.robot (0 hunks)
- e2e/tests/negative/test_dr_volume_node_reboot.robot (1 hunks)
💤 Files with no reviewable changes (2)
- docs/content/manual/pre-release/node-not-ready/node-restart/dr-volume-node-rebooted.md
- e2e/tests/negative/test_backup_listing.robot
🧰 Additional context used
🪛 Ruff
e2e/libs/volume/crd.py
363-363: Loop control variable `i` not used within loop body. Rename unused `i` to `_i`. (B007)
🔇 Additional comments (7)
e2e/keywords/backup.resource (1)
`51-54`: **Verify usage in DR volume node reboot tests**
The keyword implementation looks good and aligns well with the PR objectives for testing DR volume node reboot scenarios.
Let's verify its usage in the test implementation:
✅ Verification successful
Keyword is correctly used in DR volume node reboot tests
The verification confirms that the keyword is actively used in the DR volume node reboot test scenarios:
- Used 3 times in `e2e/tests/negative/test_dr_volume_node_reboot.robot` for backup creation
- Also utilized in other test files, showing its reusability across the test suite
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
```bash
#!/bin/bash
# Description: Check if the new keyword is used in the DR volume node reboot test
# Expected: The keyword should be used to set up test preconditions
rg -l "Volume \${.*} backup \${.*} should be able to create" "test_dr_volume_node_reboot.robot"
```
Length of output: 235
Script:
```bash
#!/bin/bash
# Let's try a different approach to find the test file and its usage
# First, find the test file
fd "test_dr_volume_node_reboot.robot"
# Then search for any backup-related keywords in it
rg "backup" -A 2 "e2e/test_dr_volume_node_reboot.robot" || true
# Also search in the entire e2e directory for this test file
fd "test_dr_volume_node_reboot" "e2e/"
# Let's also check where this backup keyword is actually used
rg "Volume.*backup.*should be able to create" "e2e/" || true
```
Length of output: 1065
e2e/libs/volume/base.py (1)
`95-98`: **LGTM! Well-structured addition for DR volume testing**

The new abstract method `wait_for_volume_restoration_start` is a well-designed addition that:

- Complements the existing `wait_for_volume_restoration_completed` method
- Enables precise timing control for node reboot tests during DR volume restoration
- Follows the established pattern of the abstract base class
This addition directly supports the PR's objective of implementing DR volume node reboot test cases.
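For context, the abstract declaration presumably looks something like this minimal sketch; the signature comes from the review above, but the class shape and docstring are assumptions, and the `NotImplemented` convention mirrors what `rest.py` does elsewhere in this PR:

```python
from abc import ABC, abstractmethod

class Base(ABC):
    # Sketch only: assumes the same shape as the other abstract waiters
    # in this class; the real base.py may differ in details.
    @abstractmethod
    def wait_for_volume_restoration_start(self, volume_name, backup_name):
        """Block until the engine reports that restoring backup_name has begun."""
        return NotImplemented
```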
e2e/tests/negative/test_dr_volume_node_reboot.robot (2)
`31-60`: **Enhance reboot verification in the test case.**
The test case follows a good structure with proper cleanup between iterations. However, consider adding explicit verification for the reboot operation:
- Verify that the node actually went down
- Add timeout handling for the reboot operation
- Log the time taken for node recovery
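One possible shape for the first two points is a poll on the node's `Ready` condition using the standard kubernetes Python client; the helper name and defaults below are illustrative, not existing suite keywords:

```python
import time
from kubernetes import client, config

# Hypothetical helper: wait until the node's Ready condition matches `ready`.
def wait_for_node_ready(node_name, ready=True, timeout=300, interval=5):
    config.load_kube_config()
    v1 = client.CoreV1Api()
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        node = v1.read_node(node_name)
        for cond in node.status.conditions or []:
            if cond.type == "Ready" and (cond.status == "True") == ready:
                return
        time.sleep(interval)
    raise TimeoutError(f"node {node_name} Ready != {ready} within {timeout}s")
```

Calling this with `ready=False` immediately after triggering the reboot verifies the node actually went down, and timing the subsequent `ready=True` call gives the recovery duration.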
Let's check the implementation of the reboot keyword:
```bash
#!/bin/bash
# Description: Verify the implementation of the reboot keyword and related volume operations
# Look for reboot implementation
rg -A 10 "Reboot volume.*volume node" ../keywords/
# Look for volume attachment verification
rg -A 5 "Wait for volume.*healthy" ../keywords/
```
`1-23`: **LGTM! Well-structured test setup with proper documentation.**
The test file is properly organized with clear documentation, appropriate test tags, and necessary resource imports. The manual test tag aligns with the PR objectives.
Let's verify the implementation of setup/teardown procedures:
✅ Verification successful
Setup and teardown procedures are properly implemented and consistent with test requirements
The test setup and teardown procedures are well-defined in `common.resource` and follow the established pattern used across all test files in the codebase. The setup includes essential initialization steps like:
- Kubernetes API client initialization
- Network latency configuration
- Backup store setup
- Data engine configuration
The teardown procedure includes proper cleanup of:
- Node power states
- Node scheduling
- Network configurations
- Node execution states
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
```bash
#!/bin/bash
# Description: Verify the implementation of test setup and teardown procedures
# Look for the implementation of these procedures in the imported resources
rg -A 5 "Set test environment" ../keywords/
rg -A 5 "Cleanup test resources" ../keywords/
```
Length of output: 208
Script:
```bash
#!/bin/bash
# Let's try searching from the repository root instead
# First, find the keywords directory
fd -t d "keywords"
# Then search for the setup/teardown implementations
rg -A 5 "Set test environment"
rg -A 5 "Cleanup test resources"
```
Length of output: 19766
Script:
```bash
#!/bin/bash
# Let's check the implementation details of setup/teardown in common.resource
rg -B 5 -A 10 "Set test environment" e2e/keywords/common.resource
```
Length of output: 666
e2e/libs/volume/volume.py (1)
`81-83`: **Implementation aligns well with PR objectives.**

The new method is a necessary addition for testing DR volume node reboot scenarios during restoration. It complements the existing `wait_for_volume_restoration_completed` method and follows the class's established patterns.
e2e/keywords/volume.resource (1)
`173-176`: **LGTM! Verify the underlying function implementation.**
The new keyword follows consistent patterns and integrates well with existing volume restoration monitoring capabilities. However, let's verify the implementation of the underlying function.
Let's check the implementation of the underlying function:
Consider adding documentation.
Consider adding a comment to document the purpose and behavior of this keyword, similar to other keywords in the file. For example:
```diff
+# Waits for the restoration process to start for a volume from the specified backup
 Wait for volume ${volume_id} restoration from backup ${backup_id} start
     ${volume_name} =    generate_name_with_suffix    volume    ${volume_id}
     ${backup_name} =    get_backup_name    ${backup_id}
     wait_for_volume_restoration_start    ${volume_name}    ${backup_name}
```
✅ Verification successful
Implementation verified and properly integrated
The keyword's underlying implementation is well-structured and follows a clear hierarchy:
- The Robot Framework keyword in `volume.resource` calls `wait_for_volume_restoration_start` in `volume_keywords.py`
- `volume_keywords.py` logs the operation and delegates to the volume implementation
- The concrete implementation in `e2e/libs/volume/crd.py` handles the actual monitoring logic with a proper retry mechanism
The implementation aligns with the test objectives and integrates well with the existing volume restoration monitoring capabilities.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
```bash
#!/bin/bash
# Description: Verify the implementation of wait_for_volume_restoration_start
# Check for the function definition in volume_keywords.py
echo "Checking volume_keywords.py for wait_for_volume_restoration_start implementation..."
rg -A 5 "def wait_for_volume_restoration_start" ../libs/keywords/volume_keywords.py
# Check for any related test cases
echo "Checking for related test cases..."
rg "Wait for volume .* restoration from backup .* start" ../suites/
```
Length of output: 538
Script:
```bash
#!/bin/bash
# Let's try searching in the entire repository for the implementation
echo "Searching for wait_for_volume_restoration_start implementation..."
rg -A 5 "def wait_for_volume_restoration_start"
echo "Searching for volume-related Python files..."
fd -e py volume
echo "Searching for Robot Framework resource files..."
fd -e resource
```
Length of output: 4137
e2e/libs/volume/rest.py (1)
`83-85`: **Implementation needed for volume restoration monitoring**

The `NotImplemented` placeholder needs to be replaced with an actual implementation. Please implement this method following the class's established patterns:

- Use a retry mechanism with `self.retry_count` and `self.retry_interval`
- Add proper error handling and logging
- Assert the expected restoration state
Here's a suggested implementation based on the class's patterns:
```diff
 def wait_for_volume_restoration_start(self, volume_name):
-    return NotImplemented
+    started = False
+    for i in range(self.retry_count):
+        try:
+            volume = self.get(volume_name)
+            logging(f"Waiting for volume {volume_name} restoration to start... ({i})")
+
+            # Check if restoration has started
+            if volume.restoreStatus and volume.restoreStatus.state == "in_progress":
+                started = True
+                break
+        except Exception as e:
+            logging(f"Failed to get volume {volume_name} restore status with error: {e}")
+        time.sleep(self.retry_interval)
+
+    assert started, f"Timeout waiting for volume {volume_name} restoration to start"
```
This implementation:
- Follows the class's retry and error handling patterns
- Provides detailed logging for monitoring
- Checks the volume's restore status
- Asserts the expected state
Let's verify the restore status field usage in the codebase:
✅ Verification successful
Based on the search results, I can now generate a final response since I have sufficient information about how restore status is handled across the codebase.
Implementation suggestion needs revision for restore status handling
The suggested implementation needs to be adjusted based on the codebase patterns. The restore status is consistently accessed as an array/list across the codebase:
```diff
 def wait_for_volume_restoration_start(self, volume_name):
-    return NotImplemented
+    started = False
+    for i in range(self.retry_count):
+        try:
+            volume = self.get(volume_name)
+            logging.info(f"Waiting for volume {volume_name} restoration to start... ({i})")
+
+            # Check if restoration has started
+            for status in volume.restoreStatus:
+                if status.state == "in_progress" and status.progress > 0:
+                    started = True
+                    break
+            # Exit the retry loop once restoration has been observed
+            if started:
+                break
+        except Exception as e:
+            logging.error(f"Failed to get volume {volume_name} restore status: {e}")
+        time.sleep(self.retry_interval)
+
+    assert started, f"Timeout waiting for volume {volume_name} restoration to start"
```
This revised implementation:
- Correctly handles `restoreStatus` as an array/list, matching usage in `test_settings.py` and `test_common.py`
- Uses the same state and progress check pattern seen in `common.py`
- Maintains consistency with error handling and logging patterns
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
```bash
#!/bin/bash
# Description: Verify the usage of restore status field in volume objects
# Expected: Find examples of restore status field usage

# Search for restore status usage patterns
rg -A 3 'restoreStatus|restore_status'
```
Length of output: 3141
Which issue(s) this PR fixes:
longhorn/longhorn#8425
What this PR does / why we need it:
Implement the manual test case "The node the DR volume attached to is rebooted" in Robot Framework.
Special notes for your reviewer:
Additional documentation or context
```bash
./run.sh -i 'longhorn-8425' -v LOOP_COUNT:3
```
Summary by CodeRabbit
New Features
Bug Fixes
Documentation