-
Notifications
You must be signed in to change notification settings - Fork 55
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
test(robot): add test case Test Longhorn components recovery #2143
test(robot): add test case Test Longhorn components recovery #2143
Conversation
cd756b4
to
7597939
Compare
WalkthroughThis pull request introduces multiple enhancements across various resource management functionalities in the Longhorn system. New keywords related to backing images, Longhorn components, and share managers have been added to facilitate operations such as deletion, waiting for operational status, and ensuring recovery. Additionally, modifications to existing methods improve their flexibility by allowing namespace specifications. A new test suite has been created to validate the resilience of Longhorn components and volumes under failure conditions, incorporating various test scenarios. Changes
Possibly related PRs
Suggested reviewers
Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media? 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
CodeRabbit Configuration File (
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 27
🧹 Outside diff range and nitpick comments (24)
e2e/libs/sharemanager/base.py (1)
21-23
: Consider adding type hints and docstring for wait_for_restart.
The wait_for_restart
method would benefit from type hints and documentation, especially for the last_creation_time
parameter, as its type and format might not be immediately obvious to implementers.
@abstractmethod
- def wait_for_restart(self, name, last_creation_time):
+ def wait_for_restart(self, name: str, last_creation_time: str) -> bool:
+ """Wait for a share manager to restart after the specified creation time.
+
+ Args:
+ name: Name of the share manager
+ last_creation_time: Previous creation timestamp to compare against
+
+ Returns:
+ bool: True if the share manager has restarted, False otherwise
+ """
return NotImplemented
e2e/libs/sharemanager/rest.py (4)
14-15
: Add type hints and docstring to get method
The get
method lacks type hints and documentation. This information is crucial for maintainability and usage understanding.
- def get(self, name):
+ def get(self, name: str) -> dict:
+ """Get share manager details by name.
+
+ Args:
+ name: Name of the share manager
+
+ Returns:
+ Share manager details
+
+ Raises:
+ ApiException: If the API request fails
+ """
17-18
: Add type hints and docstring to delete method
The delete
method lacks type hints and documentation.
- def delete(self, name):
+ def delete(self, name: str) -> None:
+ """Delete a share manager.
+
+ Args:
+ name: Name of the share manager to delete
+
+ Raises:
+ ApiException: If the deletion fails
+ """
20-21
: Add type hints and docstring to wait_for_running method
The wait_for_running
method lacks type hints and documentation.
- def wait_for_running(self, name):
+ def wait_for_running(self, name: str, timeout: int = 300) -> None:
+ """Wait for share manager to reach running state.
+
+ Args:
+ name: Name of the share manager
+ timeout: Maximum time to wait in seconds
+
+ Raises:
+ TimeoutError: If the share manager doesn't reach running state
+ ApiException: If the status check fails
+ """
23-24
: Add type hints and docstring to wait_for_restart method
The wait_for_restart
method lacks type hints and documentation.
- def wait_for_restart(self, name, last_creation_time):
+ def wait_for_restart(self, name: str, last_creation_time: str, timeout: int = 300) -> None:
+ """Wait for share manager to restart.
+
+ Args:
+ name: Name of the share manager
+ last_creation_time: Previous creation timestamp
+ timeout: Maximum time to wait in seconds
+
+ Raises:
+ TimeoutError: If the share manager doesn't restart
+ ApiException: If the status check fails
+ """
e2e/libs/sharemanager/sharemanager.py (1)
20-31
: Consider adding docstrings for better maintainability.
While the implementation is clean and follows the strategy pattern correctly, adding docstrings would improve maintainability and help other developers understand the purpose and expected behavior of each method.
Example improvement:
def delete(self, name):
+ """Delete a share manager instance.
+
+ Args:
+ name (str): Name of the share manager to delete
+
+ Returns:
+ The result from the underlying implementation
+ """
return self.sharemanager.delete(name)
e2e/keywords/backing_image.resource (1)
29-30
: Consider adding documentation for the keyword.
The keyword is well-implemented, but adding documentation would help users understand its purpose and expected behavior.
-Wait backing image managers running
+Wait backing image managers running
+ [Documentation] Waits until all backing image managers are in running state.
e2e/libs/backing_image/base.py (1)
38-40
: Consider using plural form in method name for consistency.
The method list_backing_image_manager
returns a collection of managers, so consider renaming it to list_backing_image_managers
for consistency with typical naming conventions for methods returning collections.
- def list_backing_image_manager(self):
+ def list_backing_image_managers(self):
return NotImplemented
e2e/keywords/sharemanager.resource (3)
24-27
: Add documentation for the new keyword.
The implementation looks good and aligns with the PR objectives. Consider adding documentation to describe the purpose, parameters, and expected behavior of this keyword.
Delete sharemanager of deployment ${deployment_id} and wait for recreation
+ [Documentation] Deletes the sharemanager associated with the given deployment ID and waits for it to be recreated.
+ ...
+ ... Arguments:
+ ... - deployment_id: The ID of the deployment whose sharemanager should be deleted
${deployment_name} = generate_name_with_suffix deployment ${deployment_id}
${volume_name} = get_workload_volume_name ${deployment_name}
delete_sharemanager_and_wait_for_recreation ${volume_name}
29-32
: Add documentation and consider timeout parameter.
The implementation looks good but could benefit from some enhancements:
- Add documentation to describe the keyword's purpose and parameters
- Consider adding a timeout parameter to control how long to wait
Wait for sharemanager of deployment ${deployment_id} running
+ [Documentation] Waits for the sharemanager associated with the given deployment ID to be in running state.
+ ...
+ ... Arguments:
+ ... - deployment_id: The ID of the deployment whose sharemanager should be monitored
+ [Arguments] ${deployment_id} ${timeout}=300
${deployment_name} = generate_name_with_suffix deployment ${deployment_id}
${volume_name} = get_workload_volume_name ${deployment_name}
- wait_for_share_manager_running ${volume_name}
+ wait_for_share_manager_running ${volume_name} timeout=${timeout}
24-32
: Implementation aligns well with PR objectives.
The new keywords provide essential functionality for testing sharemanager recovery scenarios:
Delete sharemanager of deployment
enables testing recovery by triggering sharemanager recreationWait for sharemanager of deployment running
allows verification of successful recovery
These additions will effectively support the automation of Longhorn components recovery test cases as outlined in the PR objectives.
Consider adding error handling keywords to handle cases where recovery fails or times out, which would make the test suite more robust.
e2e/libs/keywords/backing_image_keywords.py (1)
23-42
: Consider implementing a retry mechanism for resilience testing.
Since this code is part of component recovery testing, consider implementing a retry mechanism with exponential backoff for the wait operations. This would make the tests more resilient and better simulate real-world recovery scenarios.
Key recommendations:
- Create a common retry decorator/utility for all wait operations
- Add configurable retry parameters (max attempts, backoff factor)
- Implement detailed logging of retry attempts for test debugging
- Consider adding assertions about the time taken for recovery
Would you like me to provide an example implementation of the retry mechanism?
e2e/libs/keywords/sharemanager_keywords.py (3)
51-52
: Add documentation and error handling.
Consider adding a docstring and basic error handling to improve test maintainability and debugging:
def delete_sharemanager(self, name):
+ """Delete a share manager instance by name.
+
+ Args:
+ name (str): Name of the share manager to delete
+
+ Returns:
+ The result of the deletion operation
+
+ Raises:
+ Exception: If deletion fails
+ """
+ try:
return self.sharemanager.delete(name)
+ except Exception as e:
+ logging(f"Failed to delete share manager {name}: {str(e)}")
+ raise
60-61
: Add documentation and error handling for wait operation.
Consider adding a docstring and error handling to improve test reliability:
def wait_for_share_manager_running(self, name):
+ """Wait for a share manager to reach running state.
+
+ Args:
+ name (str): Name of the share manager
+
+ Raises:
+ TimeoutError: If the share manager doesn't reach running state
+ ValueError: If name is empty
+ """
+ if not name:
+ raise ValueError("Share manager name cannot be empty")
+
+ try:
return self.sharemanager.wait_for_running(name)
+ except Exception as e:
+ logging(f"Failed waiting for share manager {name} to run: {str(e)}")
+ raise
50-61
: Consider architectural improvements for better maintainability.
The new methods are well-integrated, but consider these improvements:
- Extract common timeout and wait logic into a base method to avoid duplication
- Add integration tests to verify the recovery scenarios
- Consider using a configuration object for timeouts and retry settings
Example of a base wait method:
def _wait_with_timeout(self, operation, timeout=300, interval=2):
"""Base method for wait operations with timeout.
Args:
operation (callable): Function to execute
timeout (int): Maximum wait time in seconds
interval (int): Sleep interval between retries
"""
start_time = time.time()
while time.time() - start_time < timeout:
try:
return operation()
except Exception as e:
if time.time() - start_time >= timeout:
raise TimeoutError(f"Operation timed out: {str(e)}")
time.sleep(interval)
e2e/libs/sharemanager/crd.py (1)
63-66
: Consider extracting timestamp comparison logic
The datetime parsing and comparison logic could be moved to a utility function for reuse across other test cases, especially since this PR involves multiple recovery test scenarios.
Consider creating a utility function like:
def is_newer_timestamp(new_time: str, old_time: str, fmt: str = "%Y-%m-%dT%H:%M:%SZ") -> bool:
return datetime.strptime(new_time, fmt) > datetime.strptime(old_time, fmt)
e2e/libs/keywords/k8s_keywords.py (1)
83-84
: Consider adding a docstring for better maintainability.
The method implementation looks good and follows the class's pattern of wrapping k8s module functions. However, adding a docstring would improve maintainability by documenting the method's purpose and parameters.
Consider adding documentation like this:
def wait_for_namespace_pods_running(self, namespace):
+ """Wait for all pods in the specified namespace to be in running state.
+
+ Args:
+ namespace (str): The namespace to check for running pods
+
+ Returns:
+ bool: True if all pods are running, False otherwise
+ """
return wait_for_namespace_pods_running(namespace)
e2e/libs/backing_image/rest.py (1)
113-113
: Fix whitespace consistency
There are extra blank lines around the new methods.
Apply this diff to maintain consistent spacing:
-
def delete_backing_image_manager(self, name):
e2e/libs/keywords/workload_keywords.py (2)
64-70
: LGTM: Enhanced pod selection capabilities.
The addition of namespace and label_selector parameters improves the flexibility of pod selection and deletion.
Consider adding docstring to document the parameters, especially the format expected for label_selector:
def delete_workload_pod_on_node(self, workload_name, node_name, namespace="default", label_selector=""):
"""Delete workload pod on specific node.
Args:
workload_name (str): Name of the workload
node_name (str): Name of the node
namespace (str, optional): Kubernetes namespace. Defaults to "default"
label_selector (str, optional): Kubernetes label selector (e.g. "app=nginx"). Defaults to ""
"""
49-51
: LGTM: Consistent namespace parameter implementation.
The addition of namespace parameters across methods follows a consistent pattern and improves the test framework's flexibility while maintaining backward compatibility.
Consider creating a base class or configuration object to store common parameters like default namespace. This would make it easier to modify defaults across all methods and reduce parameter repetition.
Example:
class WorkloadConfig:
DEFAULT_NAMESPACE = "default"
class workload_keywords:
def __init__(self):
self.config = WorkloadConfig()
# ... rest of init ...
def delete_pod(self, pod_name, namespace=None):
namespace = namespace or self.config.DEFAULT_NAMESPACE
# ... rest of method ...
Also applies to: 64-70, 71-72
e2e/keywords/workload.resource (3)
190-201
: Consider improving maintainability and consistency.
A few suggestions to enhance the code:
- Remove unnecessary empty lines for consistency with the rest of the file.
- Consider using a mapping for label selectors to improve maintainability.
Apply this diff to implement the suggestions:
Delete Longhorn ${workload_kind} ${workload_name} pod on node ${node_id}
-
${node_name} = get_node_by_index ${node_id}
-
IF '${workload_name}' == 'engine-image'
${label_selector} = Set Variable longhorn.io/component=engine-image
ELSE IF '${workload_name}' == 'instance-manager'
${label_selector} = Set Variable longhorn.io/component=instance-manager
ELSE
${label_selector} = Set Variable ${EMPTY}
END
delete_workload_pod_on_node ${workload_name} ${node_name} longhorn-system ${label_selector}
Additionally, consider creating a variable at the top of the file to map workload names to their label selectors:
*** Variables ***
&{LONGHORN_COMPONENT_LABELS} engine-image=longhorn.io/component=engine-image instance-manager=longhorn.io/component=instance-manager
Then simplify the keyword:
Delete Longhorn ${workload_kind} ${workload_name} pod on node ${node_id}
${node_name} = get_node_by_index ${node_id}
${label_selector} = Get From Dictionary ${LONGHORN_COMPONENT_LABELS} ${workload_name} ${EMPTY}
delete_workload_pod_on_node ${workload_name} ${node_name} longhorn-system ${label_selector}
202-205
: Add documentation and error handling.
The keyword would benefit from:
- Documentation explaining its purpose and usage.
- Error handling for cases where the pod doesn't exist.
- Verification that the pod was successfully deleted.
Apply this diff to implement the suggestions:
Delete Longhorn ${workload_kind} ${workload_name} pod
+ [Documentation] Deletes a Longhorn pod of specified workload kind and name from the longhorn-system namespace.
+ ... Logs the pod name before deletion and verifies successful deletion.
+ ...
+ ... Arguments:
+ ... - workload_kind: The kind of workload (e.g., deployment, statefulset)
+ ... - workload_name: The name of the workload to delete
${pod_name} = get_workload_pod_name ${workload_name} longhorn-system
+ Should Not Be Empty ${pod_name} msg=No pod found for workload ${workload_name}
Log ${pod_name}
delete_pod ${pod_name} longhorn-system
+ Wait Until Keyword Succeeds 30s 5s Should Not Exist pod ${pod_name} longhorn-system
190-205
: Consider adding more verification steps for component recovery testing.
Given that these keywords are part of automating test cases for Longhorn components recovery, consider adding more verification steps:
- Verify that all associated resources are cleaned up after pod deletion.
- Add wait conditions to ensure the system is in a known state before proceeding with recovery tests.
Would you like me to provide examples of additional verification steps that could be added?
e2e/libs/workload/workload.py (1)
Line range hint 24-45
: Consider standardizing namespace handling across all functions
While the core pod retrieval functions now support custom namespaces, several other functions in this file still hardcode the 'default' namespace (e.g., write_pod_random_data, write_pod_large_data). Consider:
- Adding namespace parameters consistently across all pod-related functions
- Creating a module-level default namespace configuration
- Updating all exec/stream operations to use the specified namespace
Example pattern to consider:
# At module level
DEFAULT_NAMESPACE = "default"
def write_pod_random_data(pod_name, size_in_mb, file_name,
data_directory="/data", namespace=DEFAULT_NAMESPACE):
# ... use namespace parameter in api calls ...
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (19)
- e2e/keywords/backing_image.resource (1 hunks)
- e2e/keywords/longhorn.resource (1 hunks)
- e2e/keywords/sharemanager.resource (1 hunks)
- e2e/keywords/workload.resource (1 hunks)
- e2e/libs/backing_image/backing_image.py (2 hunks)
- e2e/libs/backing_image/base.py (1 hunks)
- e2e/libs/backing_image/crd.py (1 hunks)
- e2e/libs/backing_image/rest.py (1 hunks)
- e2e/libs/k8s/k8s.py (3 hunks)
- e2e/libs/keywords/backing_image_keywords.py (1 hunks)
- e2e/libs/keywords/k8s_keywords.py (2 hunks)
- e2e/libs/keywords/sharemanager_keywords.py (1 hunks)
- e2e/libs/keywords/workload_keywords.py (2 hunks)
- e2e/libs/sharemanager/base.py (1 hunks)
- e2e/libs/sharemanager/crd.py (2 hunks)
- e2e/libs/sharemanager/rest.py (1 hunks)
- e2e/libs/sharemanager/sharemanager.py (1 hunks)
- e2e/libs/workload/workload.py (1 hunks)
- e2e/tests/negative/component_resilience.robot (1 hunks)
🧰 Additional context used
🪛 Ruff
e2e/libs/backing_image/crd.py
57-57: Loop control variable i
not used within loop body
Rename unused i
to _i
(B007)
69-69: Do not assert False
(python -O
removes these calls), raise AssertionError()
Replace assert False
(B011)
69-69: f-string without any placeholders
Remove extraneous f
prefix
(F541)
72-72: Loop control variable i
not used within loop body
Rename unused i
to _i
(B007)
91-91: Do not assert False
(python -O
removes these calls), raise AssertionError()
Replace assert False
(B011)
e2e/libs/k8s/k8s.py
8-8: workload.pod.wait_for_pod_status
imported but unused
Remove unused import: workload.pod.wait_for_pod_status
(F401)
9-9: workload.pod.get_pod
imported but unused
Remove unused import: workload.pod.get_pod
(F401)
178-178: Loop control variable i
not used within loop body
Rename unused i
to _i
(B007)
195-195: Do not assert False
(python -O
removes these calls), raise AssertionError()
Replace assert False
(B011)
e2e/libs/sharemanager/crd.py
44-44: Loop control variable i
not used within loop body
Rename unused i
to _i
(B007)
52-52: Do not assert False
(python -O
removes these calls), raise AssertionError()
Replace assert False
(B011)
55-55: Loop control variable i
not used within loop body
Rename unused i
to _i
(B007)
68-68: Do not assert False
(python -O
removes these calls), raise AssertionError()
Replace assert False
(B011)
🔇 Additional comments (20)
e2e/libs/sharemanager/base.py (1)
8-23
: LGTM! Well-structured abstract interface for share manager operations.
The new abstract methods provide a clean and comprehensive interface for share manager lifecycle operations, which aligns well with the PR's objective of testing component recovery. The method signatures are clear and follow consistent patterns.
e2e/libs/sharemanager/sharemanager.py (3)
21-22
: LGTM! Clean delegation to strategy implementation.
The delete method follows the strategy pattern correctly and maintains a clean interface.
24-25
: LGTM! Consistent with recovery testing requirements.
The wait_for_running method aligns well with the PR's objective of testing component recovery.
27-28
: LGTM! Simple and focused getter implementation.
The get method provides a clean interface to retrieve share manager instances.
e2e/keywords/backing_image.resource (1)
26-27
: LGTM! The keyword follows Robot Framework conventions.
The keyword is well-named and properly maps to its underlying implementation for testing backing image manager recovery.
e2e/libs/backing_image/base.py (1)
33-48
: Well-structured additions for recovery testing!
The new abstract methods form a comprehensive interface for managing backing image managers, which aligns well with the PR's objective of testing Longhorn components recovery. The methods provide the necessary operations for:
- Monitoring manager status (
wait_all_backing_image_managers_running
) - Managing lifecycle (
delete_backing_image_manager
,wait_backing_image_manager_restart
) - Retrieving state (
list_backing_image_manager
)
These additions will enable thorough testing of recovery scenarios.
e2e/keywords/sharemanager.resource (1)
23-23
: LGTM! Good spacing.
The added empty line improves readability by properly separating keyword definitions.
e2e/libs/keywords/backing_image_keywords.py (1)
36-42
:
Improve robustness of the manager deletion and recreation process.
The current implementation has several potential issues:
- No error handling for API responses
- Possible race conditions during iteration
- No timeout for the complete operation
- Lack of logging for test debugging
Here's a suggested improvement:
def delete_all_backing_image_managers_and_wait_for_recreation(self):
+ """Delete all backing image managers and ensure they are recreated.
+
+ Raises:
+ TimeoutError: If operation exceeds timeout
+ RuntimeError: If API calls fail
+ """
+ try:
backing_image_managers = self.backing_image.list_backing_image_manager()
+ if not backing_image_managers.get("items"):
+ return
+
+ # Store all managers first to avoid race conditions
+ managers_to_delete = [
+ (m["metadata"]["name"], m["metadata"]["creationTimestamp"])
+ for m in backing_image_managers["items"]
+ ]
+
- for backing_image in backing_image_managers["items"]:
- name = backing_image["metadata"]["name"]
- last_creation_time = backing_image["metadata"]["creationTimestamp"]
+ for name, last_creation_time in managers_to_delete:
+ print(f"Deleting backing image manager: {name}")
self.backing_image.delete_backing_image_manager(name)
self.backing_image.wait_backing_image_manager_restart(name, last_creation_time)
+ except Exception as e:
+ raise RuntimeError(f"Failed to manage backing image managers: {str(e)}")
Let's verify if there are any existing error handling patterns in the codebase:
✅ Verification successful
Let me gather more information about the logging patterns and timeout handling in the codebase.
Based on the analysis of the codebase, I can now provide the final response:
Add error handling and logging to backing image manager operations
The codebase shows consistent patterns of error handling and logging across other components. Let's align with these patterns:
def delete_all_backing_image_managers_and_wait_for_recreation(self):
+ """Delete all backing image managers and wait for their recreation.
+
+ Raises:
+ RuntimeError: If API calls fail
+ """
+ logging(f'Cleaning up backing image managers')
+ try:
backing_image_managers = self.backing_image.list_backing_image_manager()
+ if not backing_image_managers.get("items"):
+ return
+
for backing_image in backing_image_managers["items"]:
name = backing_image["metadata"]["name"]
last_creation_time = backing_image["metadata"]["creationTimestamp"]
+ logging(f'Deleting backing image manager: {name}')
self.backing_image.delete_backing_image_manager(name)
self.backing_image.wait_backing_image_manager_restart(name, last_creation_time)
+ except Exception as e:
+ raise RuntimeError(f"Failed to manage backing image managers: {str(e)}")
The codebase shows:
- Consistent error handling with try-except blocks raising RuntimeError
- Extensive use of logging for operations and their progress
- Null checks before processing lists
- No explicit timeout handling in similar wait operations
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Search for error handling patterns in similar test files
rg -A 3 "try:|except:|raise" "e2e/libs/keywords/"
Length of output: 1531
Script:
#!/bin/bash
# Search for logging patterns
rg -A 2 "logging\(" "e2e/libs/keywords/"
# Search for timeout patterns and wait operations
rg -A 2 "wait.*timeout|timeout.*wait" "e2e/libs/keywords/"
Length of output: 25649
e2e/libs/backing_image/backing_image.py (1)
3-3
: LGTM: Import statement is correctly placed
The CRD import is appropriately grouped with other backing_image related imports.
e2e/libs/sharemanager/crd.py (1)
2-14
: LGTM: Well-structured initialization with configurable retries
The addition of retry configuration through get_retry_count_and_interval()
is a good practice for test automation, allowing for environment-specific tuning.
e2e/libs/keywords/k8s_keywords.py (1)
12-12
: LGTM!
The import follows the established pattern and is properly placed with other k8s imports.
e2e/keywords/longhorn.resource (1)
70-84
: Verify test coverage for recovery scenarios
The new keywords align well with the PR objectives for testing Longhorn components recovery. Let's verify the test coverage:
✅ Verification successful
Test coverage for recovery scenarios is comprehensive
The implementation provides thorough test coverage for Longhorn components recovery:
- Test cases cover all key recovery scenarios:
- Component recovery (CSI plugins, managers, UI, etc.)
- Volume recovery for different volume types (RWO, RWX)
- Backing image volume recovery
- Recovery during replica rebuilding
- Share manager recovery for RWX volumes
The new keywords are effectively utilized in well-structured test cases that verify both the failure injection and recovery verification, with data integrity checks.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Check if there are corresponding test files using these keywords
# and verify the test scenarios cover all recovery cases mentioned in PR objectives
echo "Checking for test files using the new keywords..."
rg -l "Delete instance-manager of (volume|deployment)" --type robot
echo "Verifying test scenarios for different volume types..."
rg -l "Test.*recovery.*(RWO|RWX|backing image)" --type robot
Length of output: 406
Script:
#!/bin/bash
# Let's examine the actual test scenarios in the component_resilience.robot file
echo "Checking test scenarios in component_resilience.robot..."
cat e2e/tests/negative/component_resilience.robot
# Also check for any other related test files that might be using these keywords
echo -e "\nChecking for other potential test files..."
rg -l "longhorn.resource" --type robot
Length of output: 11001
e2e/libs/backing_image/rest.py (1)
114-124
:
Implement backing image manager methods following existing patterns
The newly added methods are currently returning NotImplemented
. Based on the existing patterns in the class and the PR objectives for testing Longhorn components recovery, these methods should be implemented.
Here's a suggested implementation following the class patterns:
def delete_backing_image_manager(self, name):
- return NotImplemented
+ logging(f"Deleting backing image manager {name}")
+ manager = get_longhorn_client().by_id_backing_image_manager(name)
+ get_longhorn_client().delete(manager)
def wait_all_backing_image_managers_running(self):
- return NotImplemented
+ for i in range(self.retry_count):
+ logging(f"Waiting for all backing image managers running... ({i})")
+ managers = get_longhorn_client().list_backing_image_manager()
+ all_running = True
+ for manager in managers:
+ if manager.status.state != "running":
+ all_running = False
+ break
+ if all_running:
+ return
+ time.sleep(self.retry_interval)
+ assert False, "Timeout waiting for all backing image managers running"
def wait_backing_image_manager_restart(self, name, last_creation_time):
- return NotImplemented
+ for i in range(self.retry_count):
+ logging(f"Waiting for backing image manager {name} restart... ({i})")
+ manager = get_longhorn_client().by_id_backing_image_manager(name)
+ if manager.metadata.creationTimestamp > last_creation_time:
+ return
+ time.sleep(self.retry_interval)
+ assert False, f"Timeout waiting for backing image manager {name} restart"
def list_backing_image_manager(self):
- return NotImplemented
+ return get_longhorn_client().list_backing_image_manager()
The implementation:
- Follows existing error handling and logging patterns
- Uses the retry mechanism consistently
- Maintains similar assertion patterns for timeouts
- Utilizes the Longhorn client methods for operations
Let's verify the Longhorn client API methods exist:
e2e/libs/keywords/workload_keywords.py (2)
49-51
: LGTM: Namespace parameter addition is well-implemented.
The addition of the namespace parameter with a default value maintains backward compatibility while enabling better resource isolation for tests.
71-72
: Consider handling multiple pods scenario.
While the namespace parameter addition is good, returning the first pod from the list might be problematic if there are multiple pods and the order matters.
Consider either:
- Documenting that this method should only be used with single-pod workloads, or
- Adding a parameter to specify which pod to return
Let's verify if this method is used with multi-pod workloads:
e2e/libs/workload/workload.py (3)
17-18
: LGTM: Clean namespace parameter addition
The namespace parameter addition with a default value maintains backward compatibility while enabling cross-namespace pod operations.
Line range hint 24-45
: Verify RBAC permissions for cross-namespace operations
The addition of cross-namespace support requires appropriate RBAC permissions. Please ensure:
- The test service account has necessary permissions to access pods across namespaces
- The test framework validates namespace accessibility before operations
#!/bin/bash
# Description: Check for RBAC configurations and namespace-related permissions
# Search for RBAC definitions
rg "Role|ClusterRole|ServiceAccount" --type yaml -B 5 -A 5
# Look for namespace-related test configurations
rg "namespace.*test" --type py -B 2 -A 2
Line range hint 24-45
: LGTM: Namespace support properly implemented
The namespace parameter is correctly integrated with proper error handling and logging.
Let's verify that all callers of this function handle the namespace parameter correctly:
✅ Verification successful
The previous search might have failed due to syntax. Let's try a different approach to find the callers and verify the namespace parameter usage.
Based on the shell script results, I can now provide a final response:
All callers properly handle the namespace parameter
The code analysis shows that all callers of get_workload_pods()
properly handle the namespace parameter:
get_workload_pod_names()
correctly propagates the namespace parameterwait_for_workload_pods_running()
properly passes the namespace parameterwait_for_workload_pods_stable()
correctly uses the namespace parameterwait_for_workload_pod_kept_in_state()
properly handles the namespace parameteris_workload_pods_has_annotations()
correctly uses the namespace parameter
The function's default namespace value of "default" maintains backward compatibility while allowing explicit namespace specification when needed. All callers either use the default value or explicitly pass the namespace parameter.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Check for any direct calls to get_workload_pods that might need namespace parameter updates
# Search for function calls
rg "get_workload_pods\(" --type py -B 2 -A 2
# Search for potential namespace-related issues in test files
rg "namespace.*=.*default" --type py -B 2 -A 2
Length of output: 6894
Script:
#!/bin/bash
# Let's try a more specific search for get_workload_pods calls
rg "get_workload_pods\(" --type py -l | xargs cat
# Also search for the function definition to understand its full context
rg "def get_workload_pods" --type py -B 5 -A 20
Length of output: 27502
e2e/tests/negative/component_resilience.robot (2)
121-123
: Consistent use of replica node variable
Ensure that the variable replica node
used in deleting volume replicas is consistently defined and accessible within the test scope.
Confirm that replica node
refers to a valid and correctly assigned node in the test environment.
Also applies to: 146-148, 174-176, 202-204
81-91
: Ensure node selection is appropriately defined
In the steps where Longhorn component pods are deleted (lines 81-91), the node is specified as node 1
. Verify that node 1
accurately references an existing node in your cluster.
Run the following script to list all nodes in the cluster and confirm node 1
exists:
@abstractmethod | ||
def wait_for_running(self, name): | ||
return NotImplemented |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🛠️ Refactor suggestion
Consider adding timeout parameter to wait_for_running.
For robustness in test scenarios, consider adding an optional timeout parameter to control how long the method should wait before giving up.
@abstractmethod
- def wait_for_running(self, name):
+ def wait_for_running(self, name: str, timeout: int = 300) -> bool:
+ """Wait for a share manager to reach running state.
+
+ Args:
+ name: Name of the share manager
+ timeout: Maximum time to wait in seconds (default: 300)
+
+ Returns:
+ bool: True if running state is reached, False if timeout occurs
+ """
return NotImplemented
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
@abstractmethod | |
def wait_for_running(self, name): | |
return NotImplemented | |
@abstractmethod | |
def wait_for_running(self, name: str, timeout: int = 300) -> bool: | |
"""Wait for a share manager to reach running state. | |
Args: | |
name: Name of the share manager | |
timeout: Maximum time to wait in seconds (default: 300) | |
Returns: | |
bool: True if running state is reached, False if timeout occurs | |
""" | |
return NotImplemented |
def get(self, name): | ||
return NotImplemented | ||
|
||
def delete(self, name): | ||
return NotImplemented | ||
|
||
def wait_for_running(self, name): | ||
return NotImplemented | ||
|
||
def wait_for_restart(self, name, last_creation_time): | ||
return NotImplemented |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Implementation required: Methods return NotImplemented
All newly added methods currently return NotImplemented
. This seems inconsistent with the PR objectives which state that the tests have been executed locally. These methods are crucial for testing Longhorn components recovery and should be properly implemented.
Please implement these methods using the longhorn_client
. The implementation should:
- Handle API interactions with the share manager
- Include appropriate error handling
- Follow similar patterns as the existing
list
method
Example implementation structure for get
:
def get(self, name: str) -> dict:
"""Get share manager details by name.
Args:
name: Name of the share manager
Returns:
Share manager details
Raises:
ApiException: If the API request fails
"""
return self.longhorn_client.get_share_manager(name)
💡 Codebase verification
Error handling strategy needs to be implemented in Rest class
The review comment is valid. The code inspection reveals:
- The base class defines abstract methods without error handling guidance
- The Rest class inherits from Base and uses
longhorn_client
but has unimplemented methods returningNotImplemented
- The CRD implementation shows error handling patterns with try-except blocks and logging
- No error handling exists for API calls, timeouts, or resource states in the Rest class
Key areas needing error handling:
list()
method'slonghorn_client.list_share_manager()
call- Unimplemented methods should handle API errors and resource states
- Wait operations need timeout handling and state validation
- Consistent error logging similar to CRD implementation
🔗 Analysis chain
Define error handling strategy
The methods need a clear error handling strategy for various scenarios:
- API errors from
longhorn_client
- Timeouts during wait operations
- Resource not found cases
- Invalid state transitions
Let's verify the error handling in the base class:
Consider implementing a common error handling decorator or base method that can:
- Handle common API exceptions
- Implement retries for transient failures
- Provide consistent error messages
- Log relevant details for debugging
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Check error handling patterns in the base class
rg -A 5 "raise|except" "e2e/libs/sharemanager/base.py"
Length of output: 54
Script:
#!/bin/bash
# Let's first check if the file exists and see its content
fd "base.py" "e2e/libs/sharemanager" --exec cat {}
# Also check the rest.py implementation for context
fd "rest.py" "e2e/libs/sharemanager" --exec cat {}
# Look for any error handling patterns in related files
rg -A 5 "raise|except|error" "e2e/libs/sharemanager/"
Length of output: 1736
def wait_for_restart(self, name, last_creation_time): | ||
return self.sharemanager.wait_for_restart(name, last_creation_time) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
💡 Codebase verification
Based on the results, I can now generate the final response since I have sufficient information about the restart detection implementations across different classes.
Critical: REST strategy has unimplemented restart detection
The verification revealed important implementation gaps:
- The REST strategy (
rest.py
) returnsNotImplemented
forwait_for_restart
- Only the CRD strategy (
crd.py
) has a concrete implementation with retry logic and error handling - The base class (
base.py
) defines the interface but returnsNotImplemented
This could lead to runtime failures when using the REST strategy for recovery testing. The CRD implementation looks solid with proper retry and error handling, but the REST strategy needs to be implemented to match this behavior.
🔗 Analysis chain
Verify restart detection logic in strategy implementations.
The wait_for_restart method seems critical for recovery testing. Let's verify the underlying implementation handles edge cases properly.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Check the restart detection implementation in strategy classes
# Look for wait_for_restart implementations in CRD and Rest classes
echo "Checking CRD implementation:"
ast-grep --pattern 'def wait_for_restart($_, $_) { $$$ }'
echo -e "\nChecking error handling patterns:"
rg -A 5 "wait_for_restart"
Length of output: 1552
def delete_backing_image_manager(self, name): | ||
self.backing_image.delete_backing_image_manager(name) | ||
|
||
def wait_all_backing_image_managers_running(self): | ||
self.backing_image.wait_all_backing_image_managers_running() | ||
|
||
def wait_backing_image_manager_restart(self, name, last_creation_time): | ||
self.backing_image.wait_backing_image_manager_restart(name, last_creation_time) | ||
|
||
def list_backing_image_manager(self): | ||
return self.backing_image.list_backing_image_manager() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🛠️ Refactor suggestion
Add docstrings and error handling to the new methods.
These test automation keywords would benefit from:
- Docstrings explaining the purpose, parameters, and expected behavior
- Timeout parameters for the wait methods to prevent hanging tests
- Basic input validation and error handling
Here's a suggested improvement for one of the methods as an example:
def wait_backing_image_manager_restart(self, name, last_creation_time):
+ """Wait for a backing image manager to restart after deletion.
+
+ Args:
+ name (str): Name of the backing image manager
+ last_creation_time (str): Previous creation timestamp
+
+ Raises:
+ TimeoutError: If manager doesn't restart within timeout
+ ValueError: If name or timestamp is invalid
+ """
+ if not name or not last_creation_time:
+ raise ValueError("Name and last_creation_time must be provided")
self.backing_image.wait_backing_image_manager_restart(name, last_creation_time)
Committable suggestion was skipped due to low confidence.
def delete_backing_image_manager(self, name): | ||
self.backing_image = CRD() | ||
return self.backing_image.delete_backing_image_manager(name) | ||
|
||
def wait_all_backing_image_managers_running(self): | ||
self.backing_image = CRD() | ||
return self.backing_image.wait_all_backing_image_managers_running() | ||
|
||
def wait_backing_image_manager_restart(self, name, last_creation_time): | ||
self.backing_image = CRD() | ||
self.backing_image.wait_backing_image_manager_restart(name, last_creation_time) | ||
|
||
def list_backing_image_manager(self): | ||
self.backing_image = CRD() | ||
return self.backing_image.list_backing_image_manager() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
💡 Codebase verification
The shell script results provide clear evidence about the strategy pattern implementation across the codebase. Let me generate the final response:
Strategy pattern implementation needs revision to match codebase standards
The current implementation in backing_image.py
deviates from the established pattern used consistently across other modules. All other modules:
- Initialize the strategy object only once in
__init__
- Handle both REST and CRD cases properly with if-else blocks
- Use the initialized strategy object throughout their methods
Here's the correct implementation matching the codebase standards:
def __init__(self):
if self._strategy == LonghornOperationStrategy.CRD:
self.backing_image = CRD()
else:
self.backing_image = Rest()
Remove the redundant CRD instantiation from:
- delete_backing_image_manager
- wait_all_backing_image_managers_running
- wait_backing_image_manager_restart
- list_backing_image_manager
🔗 Analysis chain
Strategy pattern implementation needs revision
The current implementation has several architectural concerns:
- The new methods bypass the strategy pattern by directly instantiating CRD, which violates the pattern's purpose.
- Creating a new CRD instance per method call is inefficient and could lead to resource management issues.
- Overwriting
self.backing_image
could affect other methods expecting the REST implementation.
Consider refactoring to:
- Respect the strategy pattern:
class BackingImage(Base):
_strategy = LonghornOperationStrategy.REST
def __init__(self):
if self._strategy == LonghornOperationStrategy.REST:
self.backing_image = Rest()
+ else:
+ self.backing_image = CRD()
def delete_backing_image_manager(self, name):
- self.backing_image = CRD()
return self.backing_image.delete_backing_image_manager(name)
- If both REST and CRD implementations are needed, consider using separate attributes:
def __init__(self):
self.rest_client = Rest()
self.crd_client = CRD()
The wait_backing_image_manager_restart
method doesn't return the result from the delegate call:
def wait_backing_image_manager_restart(self, name, last_creation_time):
self.backing_image = CRD()
- self.backing_image.wait_backing_image_manager_restart(name, last_creation_time)
+ return self.backing_image.wait_backing_image_manager_restart(name, last_creation_time)
Consider renaming list_backing_image_manager
to list_backing_image_managers
for consistency with other method names.
Let's verify the strategy pattern usage across the codebase:
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Check if other classes follow the same pattern
rg -l "LonghornOperationStrategy" | xargs rg -A 5 "_strategy\s*="
Length of output: 5660
backing_image_manager = self.obj_api.get_namespaced_custom_object( | ||
group="longhorn.io", | ||
version="v1beta2", | ||
namespace="longhorn-system", | ||
plural="backingimagemanagers", | ||
name=name | ||
) | ||
except Exception as e: | ||
logging(f"Finding backing image manager {name} failed with error {e}") | ||
continue |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🛠️ Refactor suggestion
Handle exceptions more specifically and improve logging
Catching all exceptions with a bare except
can obscure unexpected errors. It's better to catch specific exceptions. Additionally, consider logging the stack trace for better debugging.
Apply this diff to catch specific exceptions and log the stack trace:
- except Exception as e:
- logging(f"Finding backing image manager {name} failed with error {e}")
+ except client.exceptions.ApiException as e:
+ logging(f"Failed to find backing image manager {name}: {e}")
+ except Exception as e:
+ logging(f"An unexpected error occurred while finding backing image manager {name}: {e}", exc_info=True)
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
backing_image_manager = self.obj_api.get_namespaced_custom_object( | |
group="longhorn.io", | |
version="v1beta2", | |
namespace="longhorn-system", | |
plural="backingimagemanagers", | |
name=name | |
) | |
except Exception as e: | |
logging(f"Finding backing image manager {name} failed with error {e}") | |
continue | |
backing_image_manager = self.obj_api.get_namespaced_custom_object( | |
group="longhorn.io", | |
version="v1beta2", | |
namespace="longhorn-system", | |
plural="backingimagemanagers", | |
name=name | |
) | |
except client.exceptions.ApiException as e: | |
logging(f"Failed to find backing image manager {name}: {e}") | |
except Exception as e: | |
logging(f"An unexpected error occurred while finding backing image manager {name}: {e}", exc_info=True) | |
continue |
Resource ../keywords/backing_image.resource | ||
Resource ../keywords/storageclass.resource | ||
Resource ../keywords/persistentvolumeclaim.resource | ||
Resource ../keywords/k8s.resource |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove duplicate resource import
The resource file ../keywords/k8s.resource
is imported twice on lines 11 and 17. This duplication is unnecessary and could lead to confusion or maintenance issues.
Apply this diff to remove the duplicate import:
- Resource ../keywords/k8s.resource
Also applies to: 17-17
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@chriscchien This review is valid.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems like the humans are having a chat. I'll hop back into my burrow for now. If you need me again, just tag @coderabbitai in a new comment, and I'll come hopping out!
And Wait for volume 0 healthy | ||
And Check volume 0 data is intact | ||
And Wait for deployment 0 pods stable | ||
And And Check deployment 0 data in file data.txt is intact |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Correct duplicate 'And' keywords in test steps
There are typographical errors on lines 96 and 100 where the keyword And
is duplicated.
Apply this diff to fix the typos:
- And And Check deployment 0 data in file data.txt is intact
+ And Check deployment 0 data in file data.txt is intact
- And And Check deployment 1 data in file data.txt is intact
+ And Check deployment 1 data in file data.txt is intact
Also applies to: 100-100
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@chriscchien This review is valid.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems like the humans are having a chat. I'll hop back into my burrow for now. If you need me again, just tag @coderabbitai in a new comment, and I'll come hopping out!
IF '${DATA_ENGINE}' == 'v1' | ||
When Create backing image bi with url=https://longhorn-backing-image.s3-us-west-1.amazonaws.com/parrot.qcow2 | ||
And Create volume 1 with backingImage=bi dataEngine=${DATA_ENGINE} | ||
And Attach volume 1 | ||
And Wait for volume 1 healthy | ||
And Write data to volume 1 | ||
|
||
When Create storageclass longhorn-test-1 with dataEngine=${DATA_ENGINE} | ||
And Create persistentvolumeclaim 1 using RWX volume with longhorn-test-1 storageclass | ||
And Create deployment 1 with persistentvolumeclaim 1 | ||
And Write 100 MB data to file data.txt in deployment 1 | ||
END | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🛠️ Refactor suggestion
Simplify conditional checks for ${DATA_ENGINE}
The variable ${DATA_ENGINE}
is set to v1
at line 27. The conditional IF '${DATA_ENGINE}' == 'v1'
will always evaluate to true throughout the test cases.
Consider removing the conditional checks or making ${DATA_ENGINE}
configurable if different engine versions are intended to be tested. This will simplify the code and improve readability.
- IF '${DATA_ENGINE}' == 'v1'
...
- END
- IF '${DATA_ENGINE}' == 'v1'
...
- END
- IF '${DATA_ENGINE}' == 'v1'
...
- END
- IF '${DATA_ENGINE}' == 'v1'
...
- END
Also applies to: 97-102, 138-153, 167-183
And Attach volume 0 | ||
And Wait for volume 0 healthy | ||
And Write data to volume 0 | ||
Then Delete instance-manager of volume 0 and wait for recover |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Handle potential race conditions during instance manager deletion
Deleting the instance manager while a volume is rebuilding may lead to inconsistent states or test flakiness.
Consider adding a wait or verification step to ensure the rebuild process has properly initiated before deleting the instance manager. This can help in making the test more reliable.
+ And Wait for replica rebuilding progress
Also applies to: 123-123, 145-145, 148-148, 172-172, 176-176, 200-200, 204-204
Resource ../keywords/backing_image.resource | ||
Resource ../keywords/storageclass.resource | ||
Resource ../keywords/persistentvolumeclaim.resource | ||
Resource ../keywords/k8s.resource |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@chriscchien This review is valid.
And Wait for volume 0 healthy | ||
And Check volume 0 data is intact | ||
And Wait for deployment 0 pods stable | ||
And And Check deployment 0 data in file data.txt is intact |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@chriscchien This review is valid.
This pull request is now in conflict. Could you fix it @chriscchien? 🙏 |
7597939
to
5aa1885
Compare
5aa1885
to
beebd8c
Compare
Updated. cc @yangchiu , @PhanLe1010 , @shuo-wu |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Caution
Inline review comments failed to post. This is likely due to GitHub's limits when posting large numbers of comments.
Actionable comments posted: 7
🧹 Outside diff range and nitpick comments (12)
e2e/libs/sharemanager/base.py (2)
9-11
: Add type hints and docstring to get method.For better maintainability and implementation guidance, consider adding type hints and documentation.
@abstractmethod - def get(self, name): + def get(self, name: str) -> dict: + """Retrieve share manager details. + + Args: + name: Name of the share manager + + Returns: + dict: Share manager details + """ return NotImplemented
13-15
: Enhance delete method signature with type hints and force option.Consider adding return type and force delete option for test scenarios where cleanup is critical.
@abstractmethod - def delete(self, name): + def delete(self, name: str, force: bool = False) -> bool: + """Delete a share manager. + + Args: + name: Name of the share manager + force: Whether to force delete even if resources are in use + + Returns: + bool: True if deletion was successful + """ return NotImplementede2e/libs/sharemanager/crd.py (1)
Line range hint
10-68
: Consider architectural improvements for better maintainabilityTo improve the maintainability and reusability of this class, consider these architectural improvements:
- Move common CRD configurations to class attributes:
class CRD(Base): GROUP = "longhorn.io" VERSION = "v1beta2" NAMESPACE = "longhorn-system" PLURAL = "sharemanagers"
- Create a decorator for common error handling:
def handle_k8s_errors(func): def wrapper(self, *args, **kwargs): try: return func(self, *args, **kwargs) except client.rest.ApiException as e: logging(f"Operation failed: {e}") raise return wrapperThis would make the code more DRY and easier to maintain.
Would you like me to help implement these improvements?
🧰 Tools
🪛 Ruff
44-44: Loop control variable
i
not used within loop bodyRename unused
i
to_i
(B007)
52-52: Do not
assert False
(python -O
removes these calls), raiseAssertionError()
Replace
assert False
(B011)
55-55: Loop control variable
i
not used within loop bodyRename unused
i
to_i
(B007)
68-68: Do not
assert False
(python -O
removes these calls), raiseAssertionError()
Replace
assert False
(B011)
e2e/keywords/longhorn.resource (1)
70-84
: Consider implementing a more comprehensive recovery testing pattern.The current implementation provides basic building blocks for recovery testing. Consider enhancing it with:
A structured recovery testing pattern that includes:
- Pre-failure state validation
- Controlled failure injection
- Recovery monitoring
- Post-recovery state validation
Metrics collection during recovery:
- Recovery time measurements
- Component state transitions
- Resource usage during recovery
Recovery scenarios matrix:
- Single component failures
- Multiple component failures
- Cascading failures
- Network partition scenarios
This would provide more thorough validation of Longhorn's recovery capabilities.
e2e/libs/keywords/workload_keywords.py (2)
64-70
: Consider adding error handling for pod retrieval.While the implementation is correct, consider adding error handling for cases where no pods are found matching the criteria. This would improve the robustness of the test automation.
def delete_workload_pod_on_node(self, workload_name, node_name, namespace="default", label_selector=""): pods = get_workload_pods(workload_name, namespace=namespace, label_selector=label_selector) + if not pods: + logging(f'No pods found for workload {workload_name} on node {node_name}') + return for pod in pods: if pod.spec.node_name == node_name: logging(f'Deleting pod {pod.metadata.name} on node {node_name}') delete_pod(pod.metadata.name, namespace=namespace)
Line range hint
1-190
: Consider architectural improvements for better async/sync separation.The class mixes synchronous and asynchronous methods, along with different concurrency approaches (multiprocessing and asyncio). Consider:
- Separating sync and async methods into different classes or modules
- Standardizing on either multiprocessing or asyncio for concurrent operations
- Grouping related methods (e.g., pod operations, volume operations) together
This would improve maintainability and make the testing framework's behavior more predictable.
e2e/keywords/workload.resource (3)
190-190
: Remove unnecessary empty lines.These empty lines don't serve any purpose and should be removed for better code organization.
Also applies to: 192-192
193-201
: Add documentation for label selector logic.The conditional logic for label selectors is well-implemented, but would benefit from documentation explaining:
- Why different components need different label selectors
- What each label selector accomplishes
- When to use empty label selector
Add documentation above the IF block:
Delete Longhorn ${workload_kind} ${workload_name} pod on node ${node_id} ${node_name} = get_node_by_index ${node_id} + # Select appropriate label for Longhorn components: + # - engine-image: Targets engine image pods + # - instance-manager: Targets instance manager pods + # - others: No specific label selection IF '${workload_name}' == 'engine-image'
202-205
: Add documentation and improve error handling.The keyword implementation is clean but would benefit from:
- Documentation explaining its purpose and when to use it vs. the node-specific version
- Error handling for cases where the pod doesn't exist
Consider this enhancement:
+[Documentation] Deletes a Longhorn workload pod from the longhorn-system namespace. +... Use this keyword when the node location is not relevant. Delete Longhorn ${workload_kind} ${workload_name} pod ${pod_name} = get_workload_pod_name ${workload_name} longhorn-system + Should Not Be Empty ${pod_name} msg=Pod ${workload_name} not found in longhorn-system namespace Log ${pod_name} delete_pod ${pod_name} longhorn-systeme2e/tests/negative/component_resilience.robot (3)
118-118
: Use 'When' instead of 'Then' for actions in test stepsIn the test steps at lines 118, 123, 143, 175, 199, and 203, the keyword
Then
is used to perform actions (Delete instance-manager...
), whereasWhen
is more appropriate for action steps.Then
is generally used for assertions or verifying outcomes.Apply this diff to replace
Then
withWhen
:-Then Delete instance-manager of volume 0 and wait for recover +When Delete instance-manager of volume 0 and wait for recoverAlso applies to: 123-123, 143-143, 175-175, 199-199, 203-203
68-68
: Avoid hardcoding URLs; consider parameterizing the backing image URLThe backing image URL is hardcoded at lines 68 and 138. Hardcoding URLs can lead to maintenance issues if the URL changes. Consider defining the URL as a variable to improve flexibility and maintainability.
Apply this diff to define the URL as a variable:
*** Variables *** ${LOOP_COUNT} 1 ${RETRY_COUNT} 300 ${RETRY_INTERVAL} 1 ${RWX_VOLUME_FAST_FAILOVER} false ${DATA_ENGINE} v1 +${BACKING_IMAGE_URL} https://longhorn-backing-image.s3-us-west-1.amazonaws.com/parrot.qcow2 *** Test Cases *** Test Longhorn components recovery ... - When Create backing image bi with url=https://longhorn-backing-image.s3-us-west-1.amazonaws.com/parrot.qcow2 + When Create backing image bi with url=${BACKING_IMAGE_URL}Also applies to: 138-138
43-101
: Extract common test steps into reusable keywords to improve maintainabilityThe test cases from lines 43 to 203 contain several similar steps that are repeated across different tests. Extracting these common steps into reusable keywords will reduce code duplication, enhance readability, and make the tests easier to maintain.
Consider defining new keywords for recurring actions, such as:
- Setup Test Environment: Creating volumes, storage classes, and deployments.
- Write Data: Writing data to volumes or deployments.
- Delete Instance Manager and Wait for Recovery: Deleting instance managers and verifying recovery.
- Verify Data Integrity: Checking that data is intact after recovery.
Applying this structure will make your test cases cleaner and more modular.
Also applies to: 102-123, 124-152, 153-182, 183-203
🛑 Comments failed to post (7)
e2e/libs/sharemanager/base.py (1)
21-23: 🛠️ Refactor suggestion
Enhance wait_for_restart method signature.
For consistency with wait_for_running and better type safety:
- Add timeout and polling interval parameters
- Properly type the timestamp parameter
- Add comprehensive documentation
@abstractmethod - def wait_for_restart(self, name, last_creation_time): + def wait_for_restart( + self, + name: str, + last_creation_time: datetime, + timeout: int = 300, + poll_interval: int = 5 + ) -> bool: + """Wait for a share manager to restart after a specific time. + + Args: + name: Name of the share manager + last_creation_time: Timestamp to compare against for restart detection + timeout: Maximum time to wait in seconds (default: 300) + poll_interval: Time between checks in seconds (default: 5) + + Returns: + bool: True if restart is detected, False if timeout occurs + """ return NotImplementedDon't forget to add the datetime import at the top of the file:
+from datetime import datetime from abc import ABC, abstractmethod
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.@abstractmethod def wait_for_restart( self, name: str, last_creation_time: datetime, timeout: int = 300, poll_interval: int = 5 ) -> bool: """Wait for a share manager to restart after a specific time. Args: name: Name of the share manager last_creation_time: Timestamp to compare against for restart detection timeout: Maximum time to wait in seconds (default: 300) poll_interval: Time between checks in seconds (default: 5) Returns: bool: True if restart is detected, False if timeout occurs """ return NotImplemented
e2e/libs/keywords/sharemanager_keywords.py (2)
51-52: 🛠️ Refactor suggestion
Add input validation and error handling.
The method should include basic validation and error handling to ensure robustness:
def delete_sharemanager(self, name): + """Delete a share manager. + + Args: + name (str): Name of the share manager to delete + + Raises: + ValueError: If name is empty + RuntimeError: If deletion fails + """ + if not name: + raise ValueError("Share manager name cannot be empty") + try: return self.sharemanager.delete(name) + except Exception as e: + raise RuntimeError(f"Failed to delete share manager {name}: {str(e)}")📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.def delete_sharemanager(self, name): """Delete a share manager. Args: name (str): Name of the share manager to delete Raises: ValueError: If name is empty RuntimeError: If deletion fails """ if not name: raise ValueError("Share manager name cannot be empty") try: return self.sharemanager.delete(name) except Exception as e: raise RuntimeError(f"Failed to delete share manager {name}: {str(e)}")
60-61: 🛠️ Refactor suggestion
Add timeout and error handling for robustness.
The method should include timeout, error handling, and proper documentation:
-def wait_for_share_manager_running(self, name): - return self.sharemanager.wait_for_running(name) +def wait_for_share_manager_running(self, name, timeout=300): + """Wait for a share manager to reach running state. + + Args: + name (str): Name of the share manager + timeout (int): Maximum time to wait in seconds + + Raises: + ValueError: If name is empty + TimeoutError: If share manager doesn't reach running state within timeout + """ + if not name: + raise ValueError("Share manager name cannot be empty") + try: + return self.sharemanager.wait_for_running(name, timeout=timeout) + except Exception as e: + raise TimeoutError(f"Share manager {name} failed to reach running state: {str(e)}")Committable suggestion skipped: line range outside the PR's diff.
e2e/libs/backing_image/crd.py (3)
37-54: 🛠️ Refactor suggestion
Add configuration options and error handling for manager operations.
Consider the following improvements:
- Make the namespace configurable instead of hardcoding "longhorn-system"
- Add error handling for API operations
- Add logging for successful deletion
def list_backing_image_manager(self): label_selector = 'longhorn.io/component=backing-image-manager' + try: return self.obj_api.list_namespaced_custom_object( group="longhorn.io", version="v1beta2", - namespace="longhorn-system", + namespace=self.namespace, # Add to __init__ plural="backingimagemanagers", label_selector=label_selector) + except client.exceptions.ApiException as e: + logging(f"Failed to list backing image managers: {e}") + raise def delete_backing_image_manager(self, name): logging(f"deleting backing image manager {name} ...") + try: self.obj_api.delete_namespaced_custom_object( group="longhorn.io", version="v1beta2", - namespace="longhorn-system", + namespace=self.namespace, # Add to __init__ plural="backingimagemanagers", name=name ) + logging(f"Successfully deleted backing image manager {name}") + except client.exceptions.ApiException as e: + logging(f"Failed to delete backing image manager {name}: {e}") + raiseCommittable suggestion skipped: line range outside the PR's diff.
56-69: 🛠️ Refactor suggestion
Improve error handling and assertions in wait_all_backing_image_managers_running.
The method needs several improvements:
- Safe dictionary access (as noted in previous review)
- Better error handling
- Proper assertion usage
- Unused loop variable fix
- def wait_all_backing_image_managers_running(self): - for i in range(self.retry_count): + def wait_all_backing_image_managers_running(self): + for _ in range(self.retry_count): all_running = True - backing_image_managers = self.list_backing_image_manager() + try: + backing_image_managers = self.list_backing_image_manager() + except Exception as e: + logging(f"Failed to list backing image managers: {e}") + time.sleep(self.retry_interval) + continue + for backing_image_manager in backing_image_managers["items"]: - current_state = backing_image_manager["status"]["currentState"] name = backing_image_manager["metadata"]["name"] + current_state = backing_image_manager.get("status", {}).get("currentState", "unknown") logging(f"backing image manager {name} currently in {current_state} state") if current_state != "running": all_running = False if all_running is True: return time.sleep(self.retry_interval) - assert False, f"Waiting all backing image manager in running state timeout" + raise AssertionError("Timeout waiting for all backing image managers to reach running state")📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.def wait_all_backing_image_managers_running(self): for _ in range(self.retry_count): all_running = True try: backing_image_managers = self.list_backing_image_manager() except Exception as e: logging(f"Failed to list backing image managers: {e}") time.sleep(self.retry_interval) continue for backing_image_manager in backing_image_managers["items"]: name = backing_image_manager["metadata"]["name"] current_state = backing_image_manager.get("status", {}).get("currentState", "unknown") logging(f"backing image manager {name} currently in {current_state} state") if current_state != "running": all_running = False if all_running is True: return time.sleep(self.retry_interval) raise AssertionError("Timeout waiting for all backing image managers to reach running state")
🧰 Tools
🪛 Ruff
57-57: Loop control variable
i
not used within loop bodyRename unused
i
to_i
(B007)
69-69: Do not
assert False
(python -O
removes these calls), raiseAssertionError()
Replace
assert False
(B011)
69-69: f-string without any placeholders
Remove extraneous
f
prefix(F541)
71-91: 🛠️ Refactor suggestion
Improve error handling and datetime comparison in wait_backing_image_manager_restart.
The method needs several improvements:
- Better exception handling
- Proper assertion usage
- More robust datetime comparison
- Unused loop variable fix
- def wait_backing_image_manager_restart(self, name, last_creation_time): - for i in range(self.retry_count): + def wait_backing_image_manager_restart(self, name, last_creation_time): + fmt = "%Y-%m-%dT%H:%M:%SZ" + try: + last_creation_dt = datetime.strptime(last_creation_time, fmt) + except ValueError as e: + raise ValueError(f"Invalid last_creation_time format: {e}") + + for _ in range(self.retry_count): time.sleep(self.retry_interval) try: backing_image_manager = self.obj_api.get_namespaced_custom_object( group="longhorn.io", version="v1beta2", namespace="longhorn-system", plural="backingimagemanagers", name=name ) - except Exception as e: + except client.exceptions.ApiException as e: logging(f"Finding backing image manager {name} failed with error {e}") continue creation_time = backing_image_manager["metadata"]["creationTimestamp"] - fmt = "%Y-%m-%dT%H:%M:%SZ" - if datetime.strptime(creation_time, fmt) > datetime.strptime(last_creation_time, fmt): - return + try: + current_creation_dt = datetime.strptime(creation_time, fmt) + if current_creation_dt > last_creation_dt: + return + except ValueError as e: + logging(f"Invalid creation_time format from API: {e}") + continue - assert False, f"Wait backing image manager {name} restart failed ..." + raise AssertionError(f"Timeout waiting for backing image manager {name} to restart")📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.def wait_backing_image_manager_restart(self, name, last_creation_time): fmt = "%Y-%m-%dT%H:%M:%SZ" try: last_creation_dt = datetime.strptime(last_creation_time, fmt) except ValueError as e: raise ValueError(f"Invalid last_creation_time format: {e}") for _ in range(self.retry_count): time.sleep(self.retry_interval) try: backing_image_manager = self.obj_api.get_namespaced_custom_object( group="longhorn.io", version="v1beta2", namespace="longhorn-system", plural="backingimagemanagers", name=name ) except client.exceptions.ApiException as e: logging(f"Finding backing image manager {name} failed with error {e}") continue creation_time = backing_image_manager["metadata"]["creationTimestamp"] try: current_creation_dt = datetime.strptime(creation_time, fmt) if current_creation_dt > last_creation_dt: return except ValueError as e: logging(f"Invalid creation_time format from API: {e}") continue raise AssertionError(f"Timeout waiting for backing image manager {name} to restart")
🧰 Tools
🪛 Ruff
72-72: Loop control variable
i
not used within loop bodyRename unused
i
to_i
(B007)
91-91: Do not
assert False
(python -O
removes these calls), raiseAssertionError()
Replace
assert False
(B011)
e2e/libs/keywords/workload_keywords.py (1)
71-72:
⚠️ Potential issueHandle potential IndexError in get_workload_pod_name.
The method assumes at least one pod exists and may raise an IndexError if no pods are found. Consider adding error handling.
def get_workload_pod_name(self, workload_name, namespace="default"): - return get_workload_pod_names(workload_name, namespace)[0] + pod_names = get_workload_pod_names(workload_name, namespace) + if not pod_names: + raise ValueError(f"No pods found for workload {workload_name} in namespace {namespace}") + return pod_names[0]📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.def get_workload_pod_name(self, workload_name, namespace="default"): pod_names = get_workload_pod_names(workload_name, namespace) if not pod_names: raise ValueError(f"No pods found for workload {workload_name} in namespace {namespace}") return pod_names[0]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 4
🧹 Outside diff range and nitpick comments (15)
e2e/libs/sharemanager/base.py (2)
9-11
: Add type hints and documentation for the get method.Consider adding type hints and docstring to better define the interface contract.
@abstractmethod - def get(self, name): + def get(self, name: str) -> dict: + """Retrieve share manager details. + + Args: + name: Name of the share manager + + Returns: + dict: Share manager details + """ return NotImplemented
13-15
: Add type hints and documentation for the delete method.Consider adding type hints and docstring to better define the interface contract.
@abstractmethod - def delete(self, name): + def delete(self, name: str) -> bool: + """Delete a share manager. + + Args: + name: Name of the share manager + + Returns: + bool: True if deletion was successful, False otherwise + """ return NotImplementede2e/libs/sharemanager/crd.py (1)
43-52
: Consider enhancing logging and reusability.While the basic functionality is correct, consider these improvements:
- Add more detailed logging (e.g., retry attempt number, time elapsed)
- Extract the retry logic into a common utility function that can be reused by other wait methods
Example enhancement:
def wait_for_condition(self, name, condition_fn, error_msg): start_time = time.time() for attempt in range(self.retry_count): logging(f"Attempt {attempt + 1}/{self.retry_count} ({time.time() - start_time:.1f}s elapsed)") if condition_fn(): return time.sleep(self.retry_interval) raise AssertionError(error_msg)🧰 Tools
🪛 Ruff
44-44: Loop control variable
i
not used within loop bodyRename unused
i
to_i
(B007)
52-52: Do not
assert False
(python -O
removes these calls), raiseAssertionError()
Replace
assert False
(B011)
e2e/libs/backing_image/crd.py (3)
14-36
: Add docstrings to NotImplemented methods.These methods lack documentation describing their intended functionality, parameters, and return values. This documentation would be valuable for future implementation and maintenance.
Example docstring format:
def create(self, bi_name, source_type, url, expected_checksum): """Create a new backing image resource. Args: bi_name (str): Name of the backing image source_type (str): Type of the source url (str): URL of the source expected_checksum (str): Expected checksum for validation Returns: NotImplemented: Currently not implemented """ return NotImplemented
37-45
: Make namespace and API version configurable.The namespace "longhorn-system" and API version "v1beta2" are hardcoded. Consider making these configurable through class initialization or environment variables for better flexibility.
def __init__(self): self.obj_api = client.CustomObjectsApi() self.retry_count, self.retry_interval = get_retry_count_and_interval() + self.namespace = os.getenv("LONGHORN_NAMESPACE", "longhorn-system") + self.api_version = os.getenv("LONGHORN_API_VERSION", "v1beta2")
86-89
: Add buffer time to creation timestamp comparison.The direct timestamp comparison might be unreliable if the timestamps are very close. Consider adding a small buffer time to ensure the restart is detected correctly.
- if datetime.strptime(creation_time, fmt) > datetime.strptime(last_creation_time, fmt): + new_time = datetime.strptime(creation_time, fmt) + old_time = datetime.strptime(last_creation_time, fmt) + # Add a small buffer (e.g., 1 second) to avoid race conditions + if (new_time - old_time).total_seconds() > 1:e2e/keywords/workload.resource (2)
200-211
: Improve maintainability and documentation of the pod deletion keyword.The implementation looks good but could benefit from some improvements:
- Consider using a dictionary for label selectors to make it more maintainable:
COMPONENT_LABELS = { 'engine-image': 'longhorn.io/component=engine-image', 'instance-manager': 'longhorn.io/component=instance-manager' }
- Add documentation to explain the purpose and parameters:
[Documentation] Deletes a Longhorn pod of specified workload type on a given node ... ${workload_kind} - Type of the workload ... ${workload_name} - Name of the workload (engine-image/instance-manager) ... ${node_id} - Target node ID
212-215
: Add error handling and documentation to the pod deletion keyword.The implementation is good but could be enhanced with:
- Documentation to explain the purpose and parameters:
[Documentation] Deletes a Longhorn pod of specified workload type ... ${workload_kind} - Type of the workload ... ${workload_name} - Name of the workload
- Error handling and deletion verification:
Delete Longhorn ${workload_kind} ${workload_name} pod ${pod_name} = get_workload_pod_name ${workload_name} longhorn-system Should Not Be Empty ${pod_name} msg=Pod name not found for workload ${workload_name} Log Deleting pod: ${pod_name} delete_pod ${pod_name} longhorn-system Wait Until Keyword Succeeds 30s 5s Should Not Exist pod ${pod_name} longhorn-systeme2e/libs/workload/workload.py (5)
Line range hint
17-40
: LGTM! Consider enhancing error message clarity.The namespace parameter addition and pod filtering logic look good. The error handling is improved with namespace context.
Consider making the log message more explicit when no pods are found:
- logging(f"No pods found for workload {workload_name} in namespace {namespace}") + logging(f"No running pods found for workload '{workload_name}' in namespace '{namespace}'. Either the workload doesn't exist or all pods were terminated by kubelet.")
Line range hint
54-67
: Enhance error handling for PVC retrieval.While the namespace support is correctly implemented, the error handling could be more robust.
Consider these improvements:
def get_workload_persistent_volume_claim_names(workload_name, namespace="default"): claim_names = [] api = client.CoreV1Api() label_selector = f"app={workload_name}" - claim = api.list_namespaced_persistent_volume_claim( - namespace=namespace, - label_selector=label_selector - ) + try: + claim = api.list_namespaced_persistent_volume_claim( + namespace=namespace, + label_selector=label_selector + ) + except client.exceptions.ApiException as e: + if e.status == 404: + raise ValueError(f"Namespace '{namespace}' not found") + raise RuntimeError(f"Failed to list PVCs for workload '{workload_name}' in namespace '{namespace}': {e}") for item in claim.items: claim_names.append(item.metadata.name) claim_names.sort() - assert len(claim_names) > 0, f"Failed to get PVC names for workload {workload_name}" + assert len(claim_names) > 0, f"No PVCs found for workload '{workload_name}' in namespace '{namespace}' with label selector '{label_selector}'" return claim_names
Line range hint
171-187
: Improve logging for better debugging.The namespace support and retry logic are well implemented, but the logging could be more informative.
Consider these improvements:
for i in range(retry_count): pods = get_workload_pods(workload_name, namespace=namespace) + total_pods = len(pods) + running_pods = [] if len(pods) > 0: - running_pods = [] for pod in pods: if pod.status.phase == "Running": running_pods.append(pod.metadata.name) - if len(running_pods) == len(pods): + if len(running_pods) == total_pods: return - logging(f"Waiting for {workload_name} pods {running_pods} running, retry ({i}) ...") + logging(f"Waiting for workload '{workload_name}' pods in namespace '{namespace}' to be running ({len(running_pods)}/{total_pods} ready), retry {i+1}/{retry_count}") time.sleep(retry_interval) - assert False, f"Timeout waiting for {workload_name} pods running" + assert False, f"Timeout waiting for workload '{workload_name}' pods in namespace '{namespace}' to be running after {retry_count} retries"
Line range hint
232-270
: Improve code structure and readability.While the namespace support and state tracking logic are correct, the code structure could be improved.
Consider these improvements:
+ def is_pod_in_expected_state(pod, expect_state): + if expect_state == "ContainerCreating": + return pod.status.phase == "Pending" + if expect_state == "Terminating": + return hasattr(pod.metadata, "deletion_timestamp") and pod.status.phase == "Running" + if expect_state == "Running": + return pod.status.phase == "Running" + return False def count_pod_in_specifc_state_duration(count_pod_in_state_duration, pods, expect_state): for pod in pods: pod_name = pod.metadata.name if pod_name not in count_pod_in_state_duration: count_pod_in_state_duration[pod_name] = 0 - elif (expect_state == "ContainerCreating" and pod.status.phase == "Pending") or \ - ((expect_state == "Terminating" and hasattr(pod.metadata, "deletion_timestamp") and pod.status.phase == "Running")) or \ - (expect_state == "Running" and pod.status.phase == "Running"): + elif is_pod_in_expected_state(pod, expect_state): count_pod_in_state_duration[pod_name] += 1 else: count_pod_in_state_duration[pod_name] = 0
Line range hint
277-282
: Rename function and improve error handling.While the namespace and label selector support are correctly implemented, there are some improvements needed.
Consider these improvements:
-def is_workload_pods_has_annotations(workload_name, annotation_key, namespace="default", label_selector=""): +def do_workload_pods_have_annotation(workload_name, annotation_key, namespace="default", label_selector=""): pods = get_workload_pods(workload_name, namespace=namespace, label_selector=label_selector) + if not pods: + logging(f"No pods found for workload '{workload_name}' in namespace '{namespace}' with label selector '{label_selector}'") + return False + for pod in pods: if not (pod.metadata.annotations and annotation_key in pod.metadata.annotations): + logging(f"Pod '{pod.metadata.name}' in namespace '{namespace}' is missing annotation '{annotation_key}'") return False return Truee2e/tests/negative/component_resilience.robot (2)
29-41
: Refactor similar keywords to reduce duplicationThe keywords
Delete instance-manager of volume ${volume_id} and wait for recover
andDelete instance-manager of deployment ${deployment_id} volume and wait for recover
have similar steps. Refactoring common actions into a single, parameterized keyword will improve maintainability and reduce code duplication.
43-203
: Extract repeated test steps into reusable keywordsThe test cases include repeated sequences such as creating volumes, attaching them, writing data, and deleting instance managers. Consider extracting these sequences into reusable keywords to enhance readability and maintainability.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (19)
e2e/keywords/backing_image.resource
(1 hunks)e2e/keywords/longhorn.resource
(1 hunks)e2e/keywords/sharemanager.resource
(1 hunks)e2e/keywords/workload.resource
(1 hunks)e2e/libs/backing_image/backing_image.py
(2 hunks)e2e/libs/backing_image/base.py
(1 hunks)e2e/libs/backing_image/crd.py
(1 hunks)e2e/libs/backing_image/rest.py
(1 hunks)e2e/libs/k8s/k8s.py
(3 hunks)e2e/libs/keywords/backing_image_keywords.py
(1 hunks)e2e/libs/keywords/k8s_keywords.py
(2 hunks)e2e/libs/keywords/sharemanager_keywords.py
(1 hunks)e2e/libs/keywords/workload_keywords.py
(2 hunks)e2e/libs/sharemanager/base.py
(1 hunks)e2e/libs/sharemanager/crd.py
(2 hunks)e2e/libs/sharemanager/rest.py
(1 hunks)e2e/libs/sharemanager/sharemanager.py
(1 hunks)e2e/libs/workload/workload.py
(1 hunks)e2e/tests/negative/component_resilience.robot
(1 hunks)
🚧 Files skipped from review as they are similar to previous changes (11)
- e2e/keywords/backing_image.resource
- e2e/keywords/sharemanager.resource
- e2e/libs/backing_image/backing_image.py
- e2e/libs/backing_image/base.py
- e2e/libs/backing_image/rest.py
- e2e/libs/keywords/backing_image_keywords.py
- e2e/libs/keywords/k8s_keywords.py
- e2e/libs/keywords/sharemanager_keywords.py
- e2e/libs/keywords/workload_keywords.py
- e2e/libs/sharemanager/rest.py
- e2e/libs/sharemanager/sharemanager.py
🧰 Additional context used
🪛 Ruff
e2e/libs/backing_image/crd.py
57-57: Loop control variable i
not used within loop body
Rename unused i
to _i
(B007)
69-69: Do not assert False
(python -O
removes these calls), raise AssertionError()
Replace assert False
(B011)
69-69: f-string without any placeholders
Remove extraneous f
prefix
(F541)
72-72: Loop control variable i
not used within loop body
Rename unused i
to _i
(B007)
91-91: Do not assert False
(python -O
removes these calls), raise AssertionError()
Replace assert False
(B011)
e2e/libs/k8s/k8s.py
8-8: workload.pod.wait_for_pod_status
imported but unused
Remove unused import: workload.pod.wait_for_pod_status
(F401)
9-9: workload.pod.get_pod
imported but unused
Remove unused import: workload.pod.get_pod
(F401)
178-178: Loop control variable i
not used within loop body
Rename unused i
to _i
(B007)
195-195: Do not assert False
(python -O
removes these calls), raise AssertionError()
Replace assert False
(B011)
e2e/libs/sharemanager/crd.py
44-44: Loop control variable i
not used within loop body
Rename unused i
to _i
(B007)
52-52: Do not assert False
(python -O
removes these calls), raise AssertionError()
Replace assert False
(B011)
55-55: Loop control variable i
not used within loop body
Rename unused i
to _i
(B007)
68-68: Do not assert False
(python -O
removes these calls), raise AssertionError()
Replace assert False
(B011)
🔇 Additional comments (12)
e2e/libs/sharemanager/base.py (1)
17-19
: Add type hints and documentation for the wait_for_running method.
Consider adding type hints and docstring to better define the interface contract.
Note: A previous review already suggested adding a timeout parameter to this method.
e2e/libs/sharemanager/crd.py (3)
2-14
: LGTM: Clean imports and initialization.
The new imports and retry parameter initialization are well-structured and necessary for the added functionality.
24-31
: Duplicate of previous review comment.
A previous review already suggested adding error handling and making group, version, and namespace configurable.
33-41
: Duplicate of previous review comment.
A previous review already suggested adding error handling for the delete operation.
e2e/keywords/longhorn.resource (3)
70-74
: Add error handling and verification steps.
The previous review comment about error handling is still applicable. Additionally, consider verifying the volume state after pod deletion to ensure the operation's success.
#!/bin/bash
# Description: Check if there are any existing error handling patterns in the codebase
# that we should follow for consistency.
# Test: Look for similar pod deletion patterns with error handling
rg -A 5 "delete_pod.*longhorn-system"
# Test: Look for volume state verification patterns
ast-grep --pattern 'get_volume_state'
76-81
: Reduce code duplication and enhance error handling.
The previous review comment about code duplication and error handling improvements is still applicable.
83-84
: Enhance component recovery verification.
- The previous review comment about adding a configurable timeout parameter is still applicable.
- Ensure the keyword properly handles all Longhorn component types during recovery scenarios.
#!/bin/bash
# Description: Verify that all Longhorn component types are properly handled.
# Test: Check for any component types defined in the codebase that we should explicitly wait for
rg -A 5 "longhorn-system.*component.*type"
# Test: Look for any existing component recovery patterns
ast-grep --pattern 'wait_for.*component.*running'
e2e/libs/backing_image/crd.py (1)
1-13
: LGTM! Class structure and imports are well-organized.
The class follows good practices with appropriate inheritance and initialization.
e2e/libs/k8s/k8s.py (1)
8-9
: Skip comment - covered by past review.
🧰 Tools
🪛 Ruff
8-8: workload.pod.wait_for_pod_status
imported but unused
Remove unused import: workload.pod.wait_for_pod_status
(F401)
9-9: workload.pod.get_pod
imported but unused
Remove unused import: workload.pod.get_pod
(F401)
e2e/keywords/workload.resource (1)
200-215
: Verify coverage of all Longhorn component types.
The implementation supports the PR's objective of testing Longhorn components recovery. Let's verify if all component types are covered.
✅ Verification successful
Implementation correctly covers all Longhorn component types
Based on the verification results, the implementation adequately handles all major Longhorn component types:
- The code supports specific component deletion through labels for:
engine-image
instance-manager
- Other components without specific labels (CSI components, UI, etc.)
This is validated by:
- Test cases in
component_resilience.robot
showing comprehensive coverage of components: CSI plugin, attacher, provisioner, resizer, snapshotter, manager, engine-image, instance-manager, UI, and driver-deployer - Consistent label handling in the codebase for component identification (
longhorn.io/component
) - Support for both node-specific and general pod deletion scenarios
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Check if all Longhorn component types are covered in the test files
# Expected: Find references to all major Longhorn component types in the tests
echo "Checking for Longhorn component types in test files..."
rg -l "longhorn.io/component=" --type yaml
echo "Checking for component deletion test cases..."
rg "Delete Longhorn .* pod" --type robot
echo "Checking for recovery test implementations..."
rg "Wait.*recovery" --type robot
Length of output: 1645
Script:
#!/bin/bash
# Let's check the component labels in the Longhorn YAML manifests
echo "Checking component labels in Longhorn manifests..."
rg "component: " --type yaml -B 2 -A 2
# Let's also check for any other potential component types in the codebase
echo -e "\nChecking for other component references..."
rg "longhorn.io/component" -B 2 -A 2
Length of output: 3465
e2e/libs/workload/workload.py (1)
Line range hint 190-229
: LGTM! Consider improving error messages.
The async implementation and stability checking logic are solid. The namespace support is correctly implemented.
Let's verify the async implementation is used correctly throughout the codebase:
Consider improving the error messages:
- logging(f"Waiting for {workload_name} pods {wait_for_stable_pod} stable, retry ({i}) ...")
+ logging(f"Waiting for workload '{workload_name}' pods in namespace '{namespace}' to stabilize. Unstable pods: {wait_for_stable_pod}, retry {i+1}/{retry_count}")
- assert False, f"Timeout waiting for {workload_name} pods {wait_for_stable_pod} stable)"
+ assert False, f"Timeout waiting for workload '{workload_name}' pods in namespace '{namespace}' to stabilize. Remaining unstable pods: {wait_for_stable_pod}"
✅ Verification successful
Async implementation is correctly used across the codebase
The verification shows that wait_for_workload_pods_stable
is:
- Properly defined as an async function
- Correctly awaited in
workload_keywords.py
usingawait
andasyncio.create_task()
- Used in robot framework keywords (
deployment.resource
andstatefulset.resource
) which handle async/await internally
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Check if the async function is properly awaited
# Expected: All calls to wait_for_workload_pods_stable should use await
rg -l "wait_for_workload_pods_stable" | xargs rg "wait_for_workload_pods_stable" -B 2 -A 2
Length of output: 2600
e2e/tests/negative/component_resilience.robot (1)
26-26
: Consider making ${DATA_ENGINE}
configurable or remove redundant conditionals
${DATA_ENGINE}
is set to 'v1'
at line 26. All conditionals checking if ${DATA_ENGINE} == 'v1'
will always evaluate to True
. If testing different engine versions is not intended, consider removing these conditionals to simplify the code. Alternatively, make ${DATA_ENGINE}
configurable to allow testing with other engine versions.
@abstractmethod | ||
def wait_for_restart(self, name, last_creation_time): | ||
return NotImplemented |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🛠️ Refactor suggestion
Add type hints, documentation, and timeout parameter for consistency.
The method should follow similar patterns to wait_for_running
, including a timeout parameter for consistency.
@abstractmethod
- def wait_for_restart(self, name, last_creation_time):
+ def wait_for_restart(self, name: str, last_creation_time: str, timeout: int = 300) -> bool:
+ """Wait for a share manager to restart.
+
+ Args:
+ name: Name of the share manager
+ last_creation_time: Previous creation timestamp to compare against
+ timeout: Maximum time to wait in seconds (default: 300)
+
+ Returns:
+ bool: True if restart is confirmed, False if timeout occurs
+ """
return NotImplemented
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
@abstractmethod | |
def wait_for_restart(self, name, last_creation_time): | |
return NotImplemented | |
@abstractmethod | |
def wait_for_restart(self, name: str, last_creation_time: str, timeout: int = 300) -> bool: | |
"""Wait for a share manager to restart. | |
Args: | |
name: Name of the share manager | |
last_creation_time: Previous creation timestamp to compare against | |
timeout: Maximum time to wait in seconds (default: 300) | |
Returns: | |
bool: True if restart is confirmed, False if timeout occurs | |
""" | |
return NotImplemented |
def wait_for_restart(self, name, last_creation_time): | ||
for i in range(self.retry_count): | ||
time.sleep(self.retry_interval) | ||
try: | ||
sharemanager = self.get(name) | ||
except Exception as e: | ||
logging(f"Finding sharemanager {name} failed with error {e}") | ||
continue | ||
|
||
creation_time = sharemanager["metadata"]["creationTimestamp"] | ||
fmt = "%Y-%m-%dT%H:%M:%SZ" | ||
if datetime.strptime(creation_time, fmt) > datetime.strptime(last_creation_time, fmt): | ||
return | ||
|
||
assert False, f"Wait share manager {name} restart failed ..." |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🛠️ Refactor suggestion
Simplify timestamp comparison logic.
The datetime comparison logic could be simplified and made more robust:
- Parse timestamps once at the start
- Use UTC consistently
- Add validation for timestamp format
Here's a suggested improvement:
def wait_for_restart(self, name, last_creation_time):
+ fmt = "%Y-%m-%dT%H:%M:%SZ"
+ try:
+ last_time = datetime.strptime(last_creation_time, fmt)
+ except ValueError as e:
+ raise ValueError(f"Invalid last_creation_time format: {e}")
+
for i in range(self.retry_count):
try:
sharemanager = self.get(name)
creation_time = sharemanager["metadata"]["creationTimestamp"]
- fmt = "%Y-%m-%dT%H:%M:%SZ"
- if datetime.strptime(creation_time, fmt) > datetime.strptime(last_creation_time, fmt):
+ current_time = datetime.strptime(creation_time, fmt)
+ if current_time > last_time:
return
except Exception as e:
logging(f"Finding sharemanager {name} failed with error {e}")
continue
Committable suggestion skipped: line range outside the PR's diff.
🧰 Tools
🪛 Ruff
55-55: Loop control variable i
not used within loop body
Rename unused i
to _i
(B007)
68-68: Do not assert False
(python -O
removes these calls), raise AssertionError()
Replace assert False
(B011)
def delete_backing_image_manager(self, name): | ||
logging(f"deleting backing image manager {name} ...") | ||
self.obj_api.delete_namespaced_custom_object( | ||
group="longhorn.io", | ||
version="v1beta2", | ||
namespace="longhorn-system", | ||
plural="backingimagemanagers", | ||
name=name | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🛠️ Refactor suggestion
Add error handling to delete_backing_image_manager.
The method should handle potential API exceptions and verify deletion success.
def delete_backing_image_manager(self, name):
logging(f"deleting backing image manager {name} ...")
- self.obj_api.delete_namespaced_custom_object(
- group="longhorn.io",
- version="v1beta2",
- namespace="longhorn-system",
- plural="backingimagemanagers",
- name=name
- )
+ try:
+ self.obj_api.delete_namespaced_custom_object(
+ group="longhorn.io",
+ version="v1beta2",
+ namespace="longhorn-system",
+ plural="backingimagemanagers",
+ name=name
+ )
+ except client.exceptions.ApiException as e:
+ logging(f"Failed to delete backing image manager {name}: {e}")
+ raise
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
def delete_backing_image_manager(self, name): | |
logging(f"deleting backing image manager {name} ...") | |
self.obj_api.delete_namespaced_custom_object( | |
group="longhorn.io", | |
version="v1beta2", | |
namespace="longhorn-system", | |
plural="backingimagemanagers", | |
name=name | |
) | |
def delete_backing_image_manager(self, name): | |
logging(f"deleting backing image manager {name} ...") | |
try: | |
self.obj_api.delete_namespaced_custom_object( | |
group="longhorn.io", | |
version="v1beta2", | |
namespace="longhorn-system", | |
plural="backingimagemanagers", | |
name=name | |
) | |
except client.exceptions.ApiException as e: | |
logging(f"Failed to delete backing image manager {name}: {e}") | |
raise |
def wait_for_namespace_pods_running(namespace): | ||
retry_count, retry_interval = get_retry_count_and_interval() | ||
|
||
for i in range(retry_count): | ||
time.sleep(retry_interval) | ||
pod_list = list_namespace_pods(namespace) | ||
all_running = True | ||
|
||
for pod in pod_list.items: | ||
pod_name = pod.metadata.name | ||
pod_status = pod.status.phase | ||
|
||
if pod_status != "Running": | ||
logging(f"Pod {pod_name} is in {pod_status} state, waiting...") | ||
all_running = False | ||
|
||
if all_running: | ||
logging(f"All pods in namespace {namespace} are in Running state!") | ||
return | ||
|
||
assert False, f"wait all pod in namespace {namespace} running failed" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🛠️ Refactor suggestion
Add error handling and documentation to the function.
The function needs additional error handling and documentation:
- Add docstring describing purpose, parameters, and return value
- Handle empty pod list scenario
- Add safety checks for pod status
- Consider CrashLoopBackOff scenarios
Here's the suggested implementation:
def wait_for_namespace_pods_running(namespace):
+ """Wait for all pods in the specified namespace to reach Running state.
+
+ Args:
+ namespace (str): The namespace to check pods in
+
+ Raises:
+ AssertionError: If pods don't reach Running state within retry limit
+ """
retry_count, retry_interval = get_retry_count_and_interval()
for i in range(retry_count):
time.sleep(retry_interval)
pod_list = list_namespace_pods(namespace)
+ if not pod_list.items:
+ logging(f"No pods found in namespace {namespace}")
+ return
+
all_running = True
for pod in pod_list.items:
pod_name = pod.metadata.name
- pod_status = pod.status.phase
+ # Safety check for pod status
+ if not pod.status or not pod.status.phase:
+ logging(f"Pod {pod_name} status not available")
+ all_running = False
+ continue
+
+ pod_status = pod.status.phase
+ # Check for CrashLoopBackOff
+ if pod.status.container_statuses:
+ for container in pod.status.container_statuses:
+ if container.state.waiting and \
+ container.state.waiting.reason == 'CrashLoopBackOff':
+ logging(f"Pod {pod_name} container {container.name} " \
+ "is in CrashLoopBackOff state")
+ all_running = False
if pod_status != "Running":
logging(f"Pod {pod_name} is in {pod_status} state, waiting...")
all_running = False
if all_running:
logging(f"All pods in namespace {namespace} are in Running state!")
return
raise AssertionError(f"Timed out waiting for all pods in namespace {namespace} to reach Running state")
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
def wait_for_namespace_pods_running(namespace): | |
retry_count, retry_interval = get_retry_count_and_interval() | |
for i in range(retry_count): | |
time.sleep(retry_interval) | |
pod_list = list_namespace_pods(namespace) | |
all_running = True | |
for pod in pod_list.items: | |
pod_name = pod.metadata.name | |
pod_status = pod.status.phase | |
if pod_status != "Running": | |
logging(f"Pod {pod_name} is in {pod_status} state, waiting...") | |
all_running = False | |
if all_running: | |
logging(f"All pods in namespace {namespace} are in Running state!") | |
return | |
assert False, f"wait all pod in namespace {namespace} running failed" | |
def wait_for_namespace_pods_running(namespace): | |
"""Wait for all pods in the specified namespace to reach Running state. | |
Args: | |
namespace (str): The namespace to check pods in | |
Raises: | |
AssertionError: If pods don't reach Running state within retry limit | |
""" | |
retry_count, retry_interval = get_retry_count_and_interval() | |
for i in range(retry_count): | |
time.sleep(retry_interval) | |
pod_list = list_namespace_pods(namespace) | |
if not pod_list.items: | |
logging(f"No pods found in namespace {namespace}") | |
return | |
all_running = True | |
for pod in pod_list.items: | |
pod_name = pod.metadata.name | |
# Safety check for pod status | |
if not pod.status or not pod.status.phase: | |
logging(f"Pod {pod_name} status not available") | |
all_running = False | |
continue | |
pod_status = pod.status.phase | |
# Check for CrashLoopBackOff | |
if pod.status.container_statuses: | |
for container in pod.status.container_statuses: | |
if container.state.waiting and \ | |
container.state.waiting.reason == 'CrashLoopBackOff': | |
logging(f"Pod {pod_name} container {container.name} " \ | |
"is in CrashLoopBackOff state") | |
all_running = False | |
if pod_status != "Running": | |
logging(f"Pod {pod_name} is in {pod_status} state, waiting...") | |
all_running = False | |
if all_running: | |
logging(f"All pods in namespace {namespace} are in Running state!") | |
return | |
raise AssertionError(f"Timed out waiting for all pods in namespace {namespace} to reach Running state") |
🧰 Tools
🪛 Ruff
178-178: Loop control variable i
not used within loop body
Rename unused i
to _i
(B007)
195-195: Do not assert False
(python -O
removes these calls), raise AssertionError()
Replace assert False
(B011)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Which issue(s) this PR fixes:
Issue #9536
What this PR does / why we need it:
Automate manutal test case Test Longhorn components recovery into blow sub test cases
Special notes for your reviewer:
Tested on my local env
Additional documentation or context
Summary by CodeRabbit
Release Notes
New Features
Improvements
Bug Fixes