fix: always ensure endpoint creation #176

derekbit · 2024-12-08T07:27:24Z

Which issue(s) this PR fixes:

Issue longhorn/longhorn#9919

fix: always ensure endpoint creation
fix: reload device numbers while reusing linear dm device
refactor: improve log messages and api docs

Signed-off-by: Derek Su [email protected]

What this PR does / why we need it:

Special notes for your reviewer:

Additional documentation or context

coderabbitai · 2024-12-08T07:27:31Z

Warning

Rate limit exceeded

@derekbit has exceeded the limit for the number of commits or files that can be reviewed per hour. Please wait 22 minutes and 40 seconds before requesting another review.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

📥 Commits

Reviewing files that changed from the base of the PR and between 9894a92 and f329f32.

📒 Files selected for processing (6)

pkg/nvme/initiator.go (21 hunks)
pkg/nvme/nvme.go (4 hunks)
pkg/nvme/nvmecli.go (3 hunks)
pkg/spdk/client/advanced.go (4 hunks)
pkg/types/types.go (3 hunks)
pkg/util/device.go (1 hunks)

Walkthrough

The pull request introduces various modifications across multiple files in the NVMe package and related components. Key changes include updates to method signatures, improved error handling, and logging enhancements. New methods for device management are added in the Client struct, while constants and functions related to error messages are introduced in the types package. The overall aim of these changes is to enhance code readability, consistency, and error management.

Changes

File Path	Change Summary
pkg/nvme/initiator.go	Updated `Initiator` struct methods for error handling and logging; added `waitAndLoadNVMeDeviceInfoWithoutLock` and `connectTarget` methods. Renamed parameters for clarity.
pkg/nvme/nvme.go	Removed constant `ErrorMessageCannotFindValidNvmeDevice`, updated error handling in `GetDevices`, and modified logging messages.
pkg/nvme/nvmecli.go	Simplified function signatures by removing `executor` parameter and renamed `listControllers` to `listRecognizedNvmeDevices`.
pkg/spdk/client/advanced.go	Added methods for device management: `AddDevice`, `DeleteDevice`, `StartExposeBdev`, and `StopExposeBdev`.
pkg/types/types.go	Introduced new error message constants and functions for NVMe device management.
pkg/util/device.go	Updated `RemoveDevice` function to change parameter name and improve error handling logic.

Assessment against linked issues

Objective	Addressed	Explanation
Volume should auto salvage then reattach and become healthy (#9919)	❓	It is unclear if the changes directly address the bug's resolution.

Possibly related PRs

feat: decrease the reconnect delay of v2 volume nvme initiator #164: This PR modifies the defaultReconnectDelay constant in pkg/nvme/nvmecli.go, which is related to the reconnect behavior of the NVMe initiator, similar to the changes made to retry handling in the main PR.
fix: decrease ctrlr_loss_timeout_sec for base bdev #170: This PR updates the DefaultCtrlrLossTimeoutSec constant in pkg/types/types.go, which is relevant to timeout configurations that may impact the error handling and retry logic in the NVMe initiator, aligning with the changes in the main PR.

Suggested reviewers

innobead
c3y1huang
shuo-wu
DamiaSan

Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media?

❤️ Share

🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

Review comments: Directly reply to a review comment made by CodeRabbit. Example:
- I pushed a fix in commit <commit_id>, please review it.
- Generate unit testing code for this file.
- Open a follow-up GitHub issue for this discussion.
Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
- @coderabbitai generate unit testing code for this file.
- @coderabbitai modularize this function.
PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
- @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
- @coderabbitai read src/utils.ts and generate unit testing code.
- @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
- @coderabbitai help me debug CodeRabbit configuration file.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

@coderabbitai pause to pause the reviews on a PR.
@coderabbitai resume to resume the paused reviews.
@coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
@coderabbitai full review to do a full review from scratch and review all the files again.
@coderabbitai summary to regenerate the summary of the PR.
@coderabbitai resolve resolve all the CodeRabbit review comments.
@coderabbitai configuration to show the current CodeRabbit configuration for the repository.
@coderabbitai help to get help.

Other keywords and placeholders

Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
Add @coderabbitai anywhere in the PR title to generate the title automatically.

Documentation and Community

Visit our Documentation for detailed information on how to use CodeRabbit.
Join our Discord Community to get help, request features, and share feedback.
Follow us on X/Twitter for updates and announcements.

codecov · 2024-12-08T07:29:16Z

Codecov Report

Attention: Patch coverage is 40.82840% with 100 lines in your changes missing coverage. Please review.

Project coverage is 22.88%. Comparing base (1582c36) to head (f329f32).
Report is 4 commits behind head on main.

Files with missing lines	Patch %	Lines
pkg/nvme/initiator.go	38.92%	77 Missing and 14 partials ⚠️
pkg/types/types.go	0.00%	4 Missing ⚠️
pkg/nvme/nvme.go	40.00%	3 Missing ⚠️
pkg/nvme/nvmecli.go	60.00%	2 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #176      +/-   ##
==========================================
- Coverage   22.89%   22.88%   -0.02%     
==========================================
  Files          34       34              
  Lines        4813     4842      +29     
==========================================
+ Hits         1102     1108       +6     
- Misses       3537     3557      +20     
- Partials      174      177       +3

Flag	Coverage Δ
unittests	`22.88% <40.82%> (-0.02%)`	⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

coderabbitai

Actionable comments posted: 2

🧹 Outside diff range and nitpick comments (12)

pkg/nvme/initiator.go (2)
344-354: Consider adding timeout configuration.

The waitAndLoadNVMeDeviceInfoWithoutLock uses hardcoded retry values. Consider making these configurable for different environments.
+const (
+    defaultMaxWaitDeviceRetries = 60
+    defaultWaitDeviceInterval = time.Second
+)
+
+func (i *Initiator) waitAndLoadNVMeDeviceInfoWithoutLock(
+    transportAddress, 
+    transportServiceID string,
+    maxRetries int,
+    retryInterval time.Duration,
+) (err error) {
-    for r := 0; r < maxNumWaitDeviceRetries; r++ {
+    if maxRetries <= 0 {
+        maxRetries = defaultMaxWaitDeviceRetries
+    }
+    if retryInterval <= 0 {
+        retryInterval = defaultWaitDeviceInterval
+    }
+    for r := 0; r < maxRetries; r++ {
544-553: Consider using os.PathError for more specific error handling.

The isEndpointExist function could benefit from more specific error handling using os.PathError.
func (i *Initiator) isEndpointExist() (bool, error) {
    _, err := os.Stat(i.Endpoint)
-    if err != nil {
-        if os.IsNotExist(err) {
-            return false, nil
-        }
-        return false, err
+    if err != nil {
+        if pathErr, ok := err.(*os.PathError); ok {
+            if os.IsNotExist(pathErr) {
+                return false, nil
+            }
+            return false, fmt.Errorf("path error checking endpoint: %v", pathErr)
+        }
+        return false, fmt.Errorf("unexpected error checking endpoint: %v", err)
    }
    return true, nil
}
pkg/spdk/client/advanced.go (4)
Line range hint 11-39: Potential Naming Conflict with Default Names

When name is empty, the function uses filepath.Base(devicePath) as the name for both the AIO bdev and the LVS. If multiple devices have the same filename (e.g., disk.img), this could lead to name collisions in SPDK, causing unexpected behavior.

Consider generating unique names by incorporating additional identifiers like a UUID or the full device path to prevent potential conflicts.

Line range hint 15-17: Handle Existing Bdev Creation Errors Gracefully

If BdevAioCreate fails because the bdev already exists, the function returns an error. This could happen if the device was previously added and not properly cleaned up.

Before attempting to create the bdev, check if it already exists using BdevGetBdevs or handle the specific error indicating an existing bdev to avoid unnecessary failures.

Line range hint 43-54: Ensure Comprehensive Cleanup on Partial Deletion

In DeleteDevice, if BdevLvolDeleteLvstore succeeds but BdevAioDelete fails, the AIO bdev remains, leaving residual resources.

Implement additional error handling or a rollback mechanism to ensure that both the LVS and the AIO bdev are deleted successfully, or provide clear logging and recovery steps if partial deletion occurs.

Line range hint 56-71: Handle Nil Transport List Appropriately

The condition if nvmfTransportList != nil && len(nvmfTransportList) == 0 may skip transport creation if nvmfTransportList is nil. This could happen if NvmfGetTransports returns nil without an error, causing subsequent operations to fail due to missing transport.

Modify the condition to create the transport when nvmfTransportList is nil or empty.

Apply this diff:
 if err != nil {
     return err
 }
-if nvmfTransportList != nil && len(nvmfTransportList) == 0 {
+if nvmfTransportList == nil || len(nvmfTransportList) == 0 {
     if _, err := c.NvmfCreateTransport(spdktypes.NvmeTransportTypeTCP); err != nil && !jsonrpc.IsJSONRPCRespErrorTransportTypeAlreadyExists(err) {
         return err
     }
 }
pkg/types/types.go (1)
79-85: Consistent Case Handling in Error Checks

In ErrorIsDeviceOrResourceBusy, the error message is converted to lowercase, but ErrorMessageDeviceOrResourceBusy is not, which might lead to case sensitivity issues.

Convert both strings to lowercase for a reliable comparison.

Apply this diff:
 func ErrorIsDeviceOrResourceBusy(err error) bool {
-    return strings.Contains(strings.ToLower(err.Error()), ErrorMessageDeviceOrResourceBusy)
+    return strings.Contains(strings.ToLower(err.Error()), strings.ToLower(ErrorMessageDeviceOrResourceBusy))
 }
 
 func ErrorIsValidNvmeDeviceNotFound(err error) bool {
-    return strings.Contains(err.Error(), ErrorMessageCannotFindValidNvmeDevice)
+    return strings.Contains(err.Error(), ErrorMessageCannotFindValidNvmeDevice)
 }
Ensure similar consistency in ErrorIsValidNvmeDeviceNotFound.
pkg/nvme/nvme.go (3)
73-78: Enhance Error Context When Listing NVMe Devices

If listRecognizedNvmeDevices fails, the returned error lacks context about the operation being performed.

Wrap the error to provide additional context for easier troubleshooting.

Apply this diff:
 nvmeDevices, err := listRecognizedNvmeDevices(executor)
 if err != nil {
-    return nil, err
+    return nil, errors.Wrap(err, "failed to list recognized NVMe devices")
 }
80-88: Provide Detailed Error Messages for Subsystem Issues

The error messages when no subsystems are found or multiple subsystems are detected lack specific details, which can hinder debugging.

Include the device path and any relevant identifiers in the error messages for better clarity.

Apply this diff:
 if len(subsystems) == 0 {
-    return nil, fmt.Errorf("no subsystem found for NVMe device %s", d.DevicePath)
+    return nil, fmt.Errorf("no NVMe subsystem found for device %s", d.DevicePath)
 }
 if len(subsystems) > 1 {
-    return nil, fmt.Errorf("multiple subsystems found for NVMe device %s", d.DevicePath)
+    return nil, fmt.Errorf("multiple NVMe subsystems found for device %s", d.DevicePath)
 }
166-167: Enhance Error Messaging for Device Not Found

The error message uses types.ErrorMessageCannotFindValidNvmeDevice but may not include dynamic details like NQN and address, making it less informative.

Include the NQN and address in the error message to aid in diagnosing the issue.

Apply this diff:
 return nil, fmt.Errorf("%s with subsystem NQN %s and address %s:%s", 
     types.ErrorMessageCannotFindValidNvmeDevice, nqn, ip, port)
pkg/util/device.go (1)

Line range hint 56-62: Ensure removeAsync Always Sends to Channel

In the remove function, if removeAsync encounters an error before sending to the done channel, it might cause a deadlock.

Guarantee that removeAsync sends an error or nil to the channel in all code paths.

Verify that removeAsync does not exit without sending a value to done.

pkg/nvme/nvmecli.go (1)

Line range hint 1-378: Consider documenting the connection to the main issue

While these changes improve code clarity, their connection to fixing the volume detaching/attaching loop issue (longhorn/longhorn#9919) isn't immediately clear. Consider:

Adding comments explaining how these changes contribute to the fix

Ensuring that the core fixes for the issue are included in other files

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 9425f07 and d9c40b0.

📒 Files selected for processing (6)

pkg/nvme/initiator.go (15 hunks)
pkg/nvme/nvme.go (4 hunks)
pkg/nvme/nvmecli.go (3 hunks)
pkg/spdk/client/advanced.go (4 hunks)
pkg/types/types.go (3 hunks)
pkg/util/device.go (1 hunks)

🔇 Additional comments (5)

pkg/nvme/initiator.go (2)

254-256: Good addition of parameter validation.

The validation of transport parameters before proceeding with the initiator start is a good defensive programming practice.

694-704: Good practice: Reloading device numbers after device creation.

The addition of device number reloading after creating the linear device mapper is a good practice to ensure consistency.

pkg/nvme/nvmecli.go (3)

157-157: LGTM: Simplified function signature

Good improvement. The function only processes JSON data and doesn't need the executor parameter.

172-172: LGTM: Consistent with V1 implementation

Good change. The function matches the V1 implementation in processing only JSON data without needing the executor.

200-200: LGTM: Improved function naming

The rename from listControllers to listRecognizedNvmeDevices better reflects the function's purpose. However, let's verify that all callers have been updated.

✅ Verification successful

Function rename is properly implemented across the codebase

The verification confirms that:

No references to the old function name listControllers exist in the codebase
The new function name listRecognizedNvmeDevices is used consistently in both its definition (pkg/nvme/nvmecli.go) and its only caller (pkg/nvme/nvme.go)

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Search for any remaining references to the old function name
rg "listControllers" --type go

# Search for references to the new function name to verify updates
rg "listRecognizedNvmeDevices" --type go

Length of output: 248

pkg/util/device.go

pkg/nvme/nvmecli.go

coderabbitai

Actionable comments posted: 1

🧹 Outside diff range and nitpick comments (4)

pkg/nvme/initiator.go (4)
398-405: Consider enhancing error messages for device busy scenarios.

While the error handling logic is correct, consider adding more context to help diagnose issues:
 if types.ErrorIsDeviceOrResourceBusy(err) {
     if returnErrorForBusyDevice {
-        return true, err
+        return true, errors.Wrapf(err, "device %s is busy and cannot be removed", i.Name)
     }
     dmDeviceIsBusy = true
 } else {
-    return false, err
+    return false, errors.Wrapf(err, "failed to remove device %s", i.Name)
 }
🧰 Tools

🪛 GitHub Check: codecov/patch

[warning] 398-405: pkg/nvme/initiator.go#L398-L405
Added lines #L398 - L405 were not covered by tests

326-336: Consider adding validation for endpoint creation status.

While the endpoint creation logic is improved, consider adding validation after makeEndpoint:
 i.logger.Infof("Creating endpoint %v", i.Endpoint)
 exist, err := i.isEndpointExist()
 if err != nil {
     return dmDeviceIsBusy, errors.Wrapf(err, "failed to check if endpoint %v exists for NVMe initiator %s", i.Endpoint, i.Name)
 }
 if exist {
     i.logger.Infof("Skipping endpoint %v creation for NVMe initiator", i.Endpoint)
 } else {
     if err := i.makeEndpoint(); err != nil {
         return dmDeviceIsBusy, err
     }
+    // Validate endpoint creation
+    if exist, err := i.isEndpointExist(); err != nil || !exist {
+        return dmDeviceIsBusy, errors.Wrapf(err, "endpoint %v creation validation failed", i.Endpoint)
+    }
 }
🧰 Tools

🪛 GitHub Check: codecov/patch

[warning] 329-330: pkg/nvme/initiator.go#L329-L330
Added lines #L329 - L330 were not covered by tests

[warning] 332-332: pkg/nvme/initiator.go#L332
Added line #L332 was not covered by tests

[warning] 335-335: pkg/nvme/initiator.go#L335
Added line #L335 was not covered by tests

Line range hint 247-341: Critical: Improve test coverage for core initiator operations.

The Start method and its dependencies have significant test coverage gaps. Given this is a critical path for volume operations, consider adding tests for:

Transport parameter validation

Device busy scenarios

Endpoint creation/validation

Error handling paths

This is particularly important as this code fixes a volume stuck issue.

Would you like me to help create a test plan or generate test cases for these scenarios?

🧰 Tools

🪛 GitHub Check: codecov/patch

[warning] 276-277: pkg/nvme/initiator.go#L276-L277
Added lines #L276 - L277 were not covered by tests

[warning] 283-283: pkg/nvme/initiator.go#L283
Added line #L283 was not covered by tests

[warning] 290-290: pkg/nvme/initiator.go#L290
Added line #L290 was not covered by tests

[warning] 297-297: pkg/nvme/initiator.go#L297
Added line #L297 was not covered by tests

[warning] 302-302: pkg/nvme/initiator.go#L302
Added line #L302 was not covered by tests

[warning] 308-308: pkg/nvme/initiator.go#L308
Added line #L308 was not covered by tests

[warning] 310-310: pkg/nvme/initiator.go#L310
Added line #L310 was not covered by tests

[warning] 312-313: pkg/nvme/initiator.go#L312-L313
Added lines #L312 - L313 were not covered by tests

[warning] 322-322: pkg/nvme/initiator.go#L322
Added line #L322 was not covered by tests

[warning] 329-330: pkg/nvme/initiator.go#L329-L330
Added lines #L329 - L330 were not covered by tests

[warning] 332-332: pkg/nvme/initiator.go#L332
Added line #L332 was not covered by tests

[warning] 335-335: pkg/nvme/initiator.go#L335
Added line #L335 was not covered by tests

[warning] 351-351: pkg/nvme/initiator.go#L351
Added line #L351 was not covered by tests

[warning] 361-363: pkg/nvme/initiator.go#L361-L363
Added lines #L361 - L363 were not covered by tests

[warning] 368-370: pkg/nvme/initiator.go#L368-L370
Added lines #L368 - L370 were not covered by tests

Line range hint 1-709: Consider adding retry limits and deadlock prevention.

While the changes improve endpoint creation reliability, consider adding:

Maximum retry limits for the entire operation to prevent infinite loops

Deadlock detection for device busy scenarios

Metrics/monitoring for tracking device states and operation durations

This would help prevent and diagnose similar issues in the future.

🧰 Tools

🪛 GitHub Check: codecov/patch

[warning] 276-277: pkg/nvme/initiator.go#L276-L277
Added lines #L276 - L277 were not covered by tests

[warning] 283-283: pkg/nvme/initiator.go#L283
Added line #L283 was not covered by tests

[warning] 290-290: pkg/nvme/initiator.go#L290
Added line #L290 was not covered by tests

[warning] 297-297: pkg/nvme/initiator.go#L297
Added line #L297 was not covered by tests

[warning] 302-302: pkg/nvme/initiator.go#L302
Added line #L302 was not covered by tests

[warning] 308-308: pkg/nvme/initiator.go#L308
Added line #L308 was not covered by tests

[warning] 310-310: pkg/nvme/initiator.go#L310
Added line #L310 was not covered by tests

[warning] 312-313: pkg/nvme/initiator.go#L312-L313
Added lines #L312 - L313 were not covered by tests

[warning] 322-322: pkg/nvme/initiator.go#L322
Added line #L322 was not covered by tests

[warning] 329-330: pkg/nvme/initiator.go#L329-L330
Added lines #L329 - L330 were not covered by tests

[warning] 332-332: pkg/nvme/initiator.go#L332
Added line #L332 was not covered by tests

[warning] 335-335: pkg/nvme/initiator.go#L335
Added line #L335 was not covered by tests

[warning] 351-351: pkg/nvme/initiator.go#L351
Added line #L351 was not covered by tests

[warning] 361-363: pkg/nvme/initiator.go#L361-L363
Added lines #L361 - L363 were not covered by tests

[warning] 368-370: pkg/nvme/initiator.go#L368-L370
Added lines #L368 - L370 were not covered by tests

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between d9c40b0 and a6020c5.

📒 Files selected for processing (1)

pkg/nvme/initiator.go (15 hunks)

🧰 Additional context used

🪛 GitHub Check: codecov/patch

pkg/nvme/initiator.go

[warning] 168-168: pkg/nvme/initiator.go#L168
Added line #L168 was not covered by tests

[warning] 255-256: pkg/nvme/initiator.go#L255-L256
Added lines #L255 - L256 were not covered by tests

[warning] 276-277: pkg/nvme/initiator.go#L276-L277
Added lines #L276 - L277 were not covered by tests

[warning] 283-283: pkg/nvme/initiator.go#L283
Added line #L283 was not covered by tests

[warning] 290-290: pkg/nvme/initiator.go#L290
Added line #L290 was not covered by tests

[warning] 297-297: pkg/nvme/initiator.go#L297
Added line #L297 was not covered by tests

[warning] 302-302: pkg/nvme/initiator.go#L302
Added line #L302 was not covered by tests

[warning] 308-308: pkg/nvme/initiator.go#L308
Added line #L308 was not covered by tests

[warning] 310-310: pkg/nvme/initiator.go#L310
Added line #L310 was not covered by tests

[warning] 312-313: pkg/nvme/initiator.go#L312-L313
Added lines #L312 - L313 were not covered by tests

[warning] 322-322: pkg/nvme/initiator.go#L322
Added line #L322 was not covered by tests

[warning] 329-330: pkg/nvme/initiator.go#L329-L330
Added lines #L329 - L330 were not covered by tests

[warning] 332-332: pkg/nvme/initiator.go#L332
Added line #L332 was not covered by tests

[warning] 335-335: pkg/nvme/initiator.go#L335
Added line #L335 was not covered by tests

[warning] 351-351: pkg/nvme/initiator.go#L351
Added line #L351 was not covered by tests

[warning] 361-363: pkg/nvme/initiator.go#L361-L363
Added lines #L361 - L363 were not covered by tests

[warning] 368-370: pkg/nvme/initiator.go#L368-L370
Added lines #L368 - L370 were not covered by tests

[warning] 398-405: pkg/nvme/initiator.go#L398-L405
Added lines #L398 - L405 were not covered by tests

[warning] 417-417: pkg/nvme/initiator.go#L417
Added line #L417 was not covered by tests

[warning] 529-529: pkg/nvme/initiator.go#L529
Added line #L529 was not covered by tests

[warning] 550-550: pkg/nvme/initiator.go#L550
Added line #L550 was not covered by tests

[warning] 552-552: pkg/nvme/initiator.go#L552
Added line #L552 was not covered by tests

[warning] 565-566: pkg/nvme/initiator.go#L565-L566
Added lines #L565 - L566 were not covered by tests

[warning] 687-687: pkg/nvme/initiator.go#L687
Added line #L687 was not covered by tests

[warning] 689-692: pkg/nvme/initiator.go#L689-L692
Added lines #L689 - L692 were not covered by tests

[warning] 695-699: pkg/nvme/initiator.go#L695-L699
Added lines #L695 - L699 were not covered by tests

[warning] 701-705: pkg/nvme/initiator.go#L701-L705
Added lines #L701 - L705 were not covered by tests

🔇 Additional comments (1)

pkg/nvme/initiator.go (1)

254-256: LGTM: Good input validation.

The added validation for transport parameters helps prevent invalid initiator configurations.

🧰 Tools

🪛 GitHub Check: codecov/patch

[warning] 255-256: pkg/nvme/initiator.go#L255-L256
Added lines #L255 - L256 were not covered by tests

pkg/nvme/initiator.go

coderabbitai

Actionable comments posted: 1

🧹 Outside diff range and nitpick comments (2)

pkg/nvme/initiator.go (2)

30-36: Consider adding documentation for retry constants.

The retry parameters look reasonable, but adding documentation would help explain their purpose and chosen values.

 const (
+	// maxConnectTargetRetries is the maximum number of attempts to connect to a target
 	maxConnectTargetRetries    = 15
+	// retryConnectTargetInterval is the duration to wait between connection attempts
 	retryConnectTargetInterval = 1 * time.Second

+	// maxWaitDeviceRetries is the maximum number of attempts to wait for device creation
 	maxWaitDeviceRetries = 60
+	// waitDeviceInterval is the duration to wait between device checks
 	waitDeviceInterval   = 1 * time.Second
 )

697-712: Enhance error handling in reloadLinearDmDevice.

Consider wrapping errors with additional context for better debugging.

 	err = util.DmsetupReload(i.Name, table, i.executor)
 	if err != nil {
-		return err
+		return errors.Wrapf(err, "failed to reload dm device %s with table %s", i.Name, table)
 	}

 	// Reload the device numbers
 	dmDevPath := getDmDevicePath(i.Name)
 	major, minor, err := util.GetDeviceNumbers(dmDevPath, i.executor)
 	if err != nil {
-		return err
+		return errors.Wrapf(err, "failed to get device numbers for %s", dmDevPath)
 	}

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between a6020c5 and b898646.

📒 Files selected for processing (1)

pkg/nvme/initiator.go (22 hunks)

🧰 Additional context used

📓 Learnings (1)

pkg/nvme/initiator.go (1)

Learnt from: derekbit
PR: longhorn/go-spdk-helper#176
File: pkg/nvme/initiator.go:344-354
Timestamp: 2024-12-08T07:56:35.135Z
Learning: Retry parameters such as `maxNumWaitDeviceRetries` and `waitDeviceInterval` are defined in another Go file and are used in `pkg/nvme/initiator.go`. Before suggesting to extract magic numbers, check if they are already defined and used elsewhere in the codebase.

🔇 Additional comments (2)

pkg/nvme/initiator.go (2)

346-355: Add test coverage for the new waitAndLoadNVMeDeviceInfoWithoutLock method.

This method is critical for handling device info loading with retries, but it lacks test coverage according to the codecov report.

Would you like me to help generate unit tests for this method? The tests should cover:

Successful device info loading
Retry behavior with temporary failures
Permanent failure cases

716-718: Verify path handling security.

The getDmDevicePath function handles device paths. While it uses filepath.Join which is good, we should verify that the name parameter is properly sanitized to prevent path traversal attacks.

Consider adding input validation for device names:

func validateDeviceName(name string) error {
    if strings.Contains(name, "/") || strings.Contains(name, "..") {
        return fmt.Errorf("invalid device name %s: must not contain path separators", name)
    }
    return nil
}

✅ Verification successful

The previous search didn't yield results, so let's try a different approach to understand:

How getDmDevicePath is used in the codebase
If there's any validation before calling it
How device names are handled in the nvme package

This will help us determine if there are any security concerns with the path handling.

From the previous results, we see that getDmDevicePath is only used internally with i.Name. We need to understand how i.Name is validated when the Initiator is created and used. This will help us determine if there are sufficient safeguards against path traversal.

From previous results, we see the Name field is used extensively with dmsetup operations. We need to find:

How the Name is validated during Initiator creation
Any validation in dmsetup operations
Any validation in the underlying util package

This will help us determine if there are sufficient safeguards against path traversal before the name is used in getDmDevicePath.

Device name validation is sufficient for security

The code has adequate safeguards against path traversal attacks:

The Name field is validated during Initiator creation, rejecting empty values
The name comes from a controlled source (Initiator struct) and is not directly user-supplied
The device mapper (dmsetup) commands use the name as a direct argument, not as a path component
filepath.Join in getDmDevicePath is used correctly to create the final path under /dev/mapper

The current implementation is secure because:

Device mapper names are handled as identifiers, not paths
The fixed base path (/dev/mapper) cannot be escaped due to filepath.Join's proper path handling
All device operations are performed through the system's device mapper interface

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Check for path validation in the codebase
rg -A 3 "func.*validateDeviceName|func.*sanitizeDeviceName" --type go

# Check for any existing path traversal tests
rg -l "TestPathTraversal|test.*path.*traversal" --type go

Length of output: 131

Script:

#!/bin/bash
# Let's check how getDmDevicePath is used in the codebase
rg -A 5 "getDmDevicePath" --type go

# Check if there's any validation before calling getDmDevicePath
ast-grep --pattern 'getDmDevicePath($_)'

# Look for any device name validation patterns in the code
rg -A 3 "name.*=|name :=" --type go pkg/nvme/

Length of output: 1899

Script:

#!/bin/bash
# Let's check the Initiator struct and how Name is set/validated
ast-grep --pattern 'type Initiator struct {
  $$$
}'

# Check for any validation of the Name field
rg -B 5 -A 5 "i\.Name" pkg/nvme/initiator.go

# Look for where the Initiator struct is created/initialized
ast-grep --pattern 'NewInitiator($$$)'

Length of output: 10512

Script:

#!/bin/bash
# Let's find where the Initiator is created and how Name is validated
rg -B 5 -A 10 "func NewInitiator" pkg/nvme/initiator.go

# Check for any name validation in dmsetup operations
rg -A 3 "DmsetupCreate|DmsetupRemove|DmsetupReload" pkg/nvme/

# Look for any name validation in the util package
rg -A 3 "func.*Dmsetup" --type go

Length of output: 2787

pkg/nvme/initiator.go

coderabbitai

Actionable comments posted: 0

🧹 Outside diff range and nitpick comments (7)

pkg/nvme/initiator.go (3)

256-258: Improve error message clarity.

The error message could be more specific about which parameter is empty.

-		return false, fmt.Errorf("invalid transportAddress %s and transportServiceID %s for starting initiator %s", transportAddress, transportServiceID, i.Name)
+		return false, fmt.Errorf("empty transportAddress or transportServiceID (got: address=%q, service_id=%q) for starting initiator %s", transportAddress, transportServiceID, i.Name)

357-378: Enhance error logging in connectTarget.

Consider adding more context to error logs to aid debugging.

-			i.logger.WithError(err).Warn("Failed to discover target")
+			i.logger.WithError(err).Warnf("Failed to discover target (attempt %d/%d) for address %s:%s", r+1, maxRetries, transportAddress, transportServiceID)

-			i.logger.WithError(err).Warn("Failed to connect target")
+			i.logger.WithError(err).Warnf("Failed to connect target (attempt %d/%d) for address %s:%s", r+1, maxRetries, transportAddress, transportServiceID)

416-418: Consider unwrapping error before checking type.

The error from DisconnectTarget is wrapped, which might affect error type checking in the caller.

-	err = DisconnectTarget(i.SubsystemNQN, i.executor)
-	if err != nil {
-		return dmDeviceIsBusy, errors.Wrapf(err, "failed to disconnect target for NVMe-oF initiator %s", i.Name)
+	if err := DisconnectTarget(i.SubsystemNQN, i.executor); err != nil {
+		return dmDeviceIsBusy, fmt.Errorf("failed to disconnect target for NVMe-oF initiator %s: %w", i.Name, err)

pkg/spdk/client/advanced.go (4)

Line range hint 11-42: Add input validation for devicePath and clusterSize.

Consider validating input parameters to ensure robustness.

 func (c *Client) AddDevice(devicePath, name string, clusterSize uint32) (bdevAioName, lvsName, lvsUUID string, err error) {
+	if devicePath == "" {
+		return "", "", "", fmt.Errorf("devicePath cannot be empty")
+	}
+	if clusterSize == 0 {
+		return "", "", "", fmt.Errorf("clusterSize must be greater than 0")
+	}
+
 	// Use the file name as aio name and lvs name if name is not specified.

Line range hint 43-55: Add input validation and improve error handling.

Consider validating input parameters and handling errors more gracefully.

 func (c *Client) DeleteDevice(bdevAioName, lvsName string) (err error) {
+	if bdevAioName == "" || lvsName == "" {
+		return fmt.Errorf("bdevAioName and lvsName cannot be empty")
+	}
+
 	if _, err := c.BdevLvolDeleteLvstore(lvsName, ""); err != nil {
+		// If LVS doesn't exist, continue with block device deletion
+		if !jsonrpc.IsJSONRPCRespErrorLvolstoreNotFound(err) {
 			return err
+		}
 	}

Line range hint 56-82: Add input validation and improve error handling.

Consider validating input parameters and handling cleanup on partial failures.

 func (c *Client) StartExposeBdev(nqn, bdevName, nguid, ip, port string) error {
+	if nqn == "" || bdevName == "" || ip == "" || port == "" {
+		return fmt.Errorf("nqn, bdevName, ip, and port cannot be empty")
+	}
+
 	nvmfTransportList, err := c.NvmfGetTransports("", "")
 	if err != nil {
 		return err
 	}
+
+	var cleanup func()
+	defer func() {
+		if err != nil && cleanup != nil {
+			cleanup()
+		}
+	}()
+
 	if nvmfTransportList != nil && len(nvmfTransportList) == 0 {
 		if _, err := c.NvmfCreateTransport(spdktypes.NvmeTransportTypeTCP); err != nil && !jsonrpc.IsJSONRPCRespErrorTransportTypeAlreadyExists(err) {
 			return err
 		}
 	}
+
 	if _, err := c.NvmfCreateSubsystem(nqn); err != nil {
 		return err
 	}
+	cleanup = func() { c.NvmfDeleteSubsystem(nqn, "") }

Line range hint 83-117: Improve error handling for cleanup operations.

Consider continuing cleanup even if some operations fail.

 func (c *Client) StopExposeBdev(nqn string) error {
+	var errs []error
+
 	subsystemList, err := c.NvmfGetSubsystems("", "")
 	if err != nil {
 		return err
 	}
 	// ... existing code ...
 	for _, l := range listenerList {
 		if _, err := c.NvmfSubsystemRemoveListener(nqn, l.Address.Traddr, l.Address.Trsvcid, l.Address.Trtype, l.Address.Adrfam); err != nil {
-			return err
+			errs = append(errs, fmt.Errorf("failed to remove listener: %w", err))
+			continue
 		}
 	}
 	// ... similar changes for namespace removal ...
+	if len(errs) > 0 {
+		return fmt.Errorf("cleanup errors: %v", errs)
+	}

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between b898646 and 70d5aaa.

📒 Files selected for processing (6)

pkg/nvme/initiator.go (22 hunks)
pkg/nvme/nvme.go (4 hunks)
pkg/nvme/nvmecli.go (3 hunks)
pkg/spdk/client/advanced.go (4 hunks)
pkg/types/types.go (3 hunks)
pkg/util/device.go (1 hunks)

🚧 Files skipped from review as they are similar to previous changes (3)

pkg/util/device.go
pkg/types/types.go
pkg/nvme/nvmecli.go

🧰 Additional context used

📓 Learnings (1)

pkg/nvme/initiator.go (2)

Learnt from: derekbit
PR: longhorn/go-spdk-helper#176
File: pkg/nvme/initiator.go:328-339
Timestamp: 2024-12-08T15:56:31.575Z
Learning: In the NVMe-oF initiator code (`pkg/nvme/initiator.go`), additional locks within methods like `Start` are unnecessary because locking is handled in the caller side.

Learnt from: derekbit
PR: longhorn/go-spdk-helper#176
File: pkg/nvme/initiator.go:344-354
Timestamp: 2024-12-08T07:56:35.135Z
Learning: Retry parameters such as `maxNumWaitDeviceRetries` and `waitDeviceInterval` are defined in another Go file and are used in `pkg/nvme/initiator.go`. Before suggesting to extract magic numbers, check if they are already defined and used elsewhere in the codebase.

🔇 Additional comments (8)

pkg/nvme/initiator.go (6)

6-6: LGTM! Improved constant naming and imports.

The constant renames make their purpose clearer, and the addition of path/filepath ensures cross-platform path handling.

Also applies to: 30-36

57-60: LGTM! Improved error message clarity.

The error message now correctly specifies "NVMe-oF initiator" for better context.

Line range hint 248-343: LGTM! Improved robustness of Start function.

The changes enhance reliability by:

Validating transport parameters
Adding structured logging
Ensuring clean state before starting
Adding retry mechanism for target connection

346-355: LGTM! Good encapsulation of retry logic.

The function effectively encapsulates the retry logic for loading NVMe device info.

429-456: LGTM! Well-implemented getter methods.

The getter methods follow Go conventions and provide clear access to internal state.

716-717: LGTM! Improved path handling.

Using filepath.Join ensures correct path handling across different platforms.

pkg/nvme/nvme.go (2)

73-87: LGTM! Improved error handling and messages.

The changes enhance error handling with more descriptive messages and better subsystem validation.

166-166: LGTM! Improved error message consistency.

Using types.ErrorMessageCannotFindValidNvmeDevice ensures consistent error messaging across the codebase.

coderabbitai

Actionable comments posted: 1

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 70d5aaa and 9894a92.

📒 Files selected for processing (6)

pkg/nvme/initiator.go (21 hunks)
pkg/nvme/nvme.go (4 hunks)
pkg/nvme/nvmecli.go (3 hunks)
pkg/spdk/client/advanced.go (4 hunks)
pkg/types/types.go (3 hunks)
pkg/util/device.go (1 hunks)

🚧 Files skipped from review as they are similar to previous changes (5)

pkg/util/device.go
pkg/nvme/nvme.go
pkg/nvme/nvmecli.go
pkg/spdk/client/advanced.go
pkg/types/types.go

🧰 Additional context used

📓 Learnings (1)

pkg/nvme/initiator.go (2)

Learnt from: derekbit
PR: longhorn/go-spdk-helper#176
File: pkg/nvme/initiator.go:328-339
Timestamp: 2024-12-08T15:56:31.575Z
Learning: In the NVMe-oF initiator code (`pkg/nvme/initiator.go`), additional locks within methods like `Start` are unnecessary because locking is handled in the caller side.

Learnt from: derekbit
PR: longhorn/go-spdk-helper#176
File: pkg/nvme/initiator.go:344-354
Timestamp: 2024-12-08T07:56:35.135Z
Learning: Retry parameters such as `maxNumWaitDeviceRetries` and `waitDeviceInterval` are defined in another Go file and are used in `pkg/nvme/initiator.go`. Before suggesting to extract magic numbers, check if they are already defined and used elsewhere in the codebase.

🔇 Additional comments (5)

pkg/nvme/initiator.go (5)

30-36: LGTM! Improved constant naming for clarity.

The renamed constants better reflect their purpose and usage in the code.

256-258: LGTM! Added parameter validation.

Early validation of required parameters improves error handling and prevents downstream issues.

274-283: LGTM! Enhanced device initialization checks.

The code now properly handles cases where the initiator is already launched, with appropriate checks for parameter matching and endpoint loading.

702-712: LGTM! Added device numbers reload.

Properly reloading device numbers after device mapper operations ensures consistency and prevents potential device mapping issues.

716-717: LGTM! Improved path handling.

Using filepath.Join for path construction is the correct approach for cross-platform compatibility and path manipulation safety.

pkg/nvme/initiator.go

- Refactor initiator and nvme codes - Improve log messages - Update API docs Longhorn 9919 Signed-off-by: Derek Su <[email protected]>

A linear dm device can be reused if it is somehow not deleted before. The situation is usually caused by "Device or Resource Busy" error. While creating an engine and reusing the existing dm device, it still needs to load the device numbers of the dm device. Longhorn 9919 Signed-off-by: Derek Su <[email protected]>

Longhorn 9919 Signed-off-by: Derek Su <[email protected]>

derekbit requested review from innobead, PhanLe1010, c3y1huang, shuo-wu and DamiaSan December 8, 2024 07:27

derekbit self-assigned this Dec 8, 2024

coderabbitai bot reviewed Dec 8, 2024

View reviewed changes

pkg/util/device.go Show resolved Hide resolved

pkg/nvme/nvmecli.go Show resolved Hide resolved

derekbit force-pushed the issue-9919 branch from d9c40b0 to a6020c5 Compare December 8, 2024 07:34

coderabbitai bot reviewed Dec 8, 2024

View reviewed changes

pkg/nvme/initiator.go Outdated Show resolved Hide resolved

derekbit mentioned this pull request Dec 8, 2024

[BUG] Longhorn v2 volume is stuck in detaching/attaching loop forever if replica crash longhorn/longhorn#9919

Closed

coderabbitai bot reviewed Dec 8, 2024

View reviewed changes

pkg/nvme/initiator.go Show resolved Hide resolved

derekbit force-pushed the issue-9919 branch from b898646 to 70d5aaa Compare December 8, 2024 16:00

coderabbitai bot reviewed Dec 8, 2024

View reviewed changes

derekbit force-pushed the issue-9919 branch from 70d5aaa to 9894a92 Compare December 9, 2024 00:28

coderabbitai bot reviewed Dec 9, 2024

View reviewed changes

pkg/nvme/initiator.go Show resolved Hide resolved

derekbit force-pushed the issue-9919 branch from 9894a92 to 07de993 Compare December 9, 2024 00:34

derekbit added 4 commits December 9, 2024 08:34

refactor: improve log messages and api docs

9911668

- Refactor initiator and nvme codes - Improve log messages - Update API docs Longhorn 9919 Signed-off-by: Derek Su <[email protected]>

fix: always ensure endpoint creation

fa90a3a

Longhorn 9919 Signed-off-by: Derek Su <[email protected]>

chore: update api docs

f329f32

Longhorn 9919 Signed-off-by: Derek Su <[email protected]>

derekbit force-pushed the issue-9919 branch from 07de993 to f329f32 Compare December 9, 2024 00:35

c3y1huang approved these changes Dec 9, 2024

View reviewed changes

derekbit merged commit 079275c into longhorn:main Dec 9, 2024
9 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: always ensure endpoint creation #176

fix: always ensure endpoint creation #176

derekbit commented Dec 8, 2024

coderabbitai bot commented Dec 8, 2024 •

edited

Loading

Rate limit exceeded

Chat

CodeRabbit Commands (Invoked using PR comments)

Other keywords and placeholders

Documentation and Community

codecov bot commented Dec 8, 2024 •

edited

Loading

coderabbitai bot left a comment

coderabbitai bot left a comment

coderabbitai bot left a comment

coderabbitai bot left a comment

coderabbitai bot left a comment

fix: always ensure endpoint creation #176

fix: always ensure endpoint creation #176

Conversation

derekbit commented Dec 8, 2024

Which issue(s) this PR fixes:

What this PR does / why we need it:

Special notes for your reviewer:

Additional documentation or context

coderabbitai bot commented Dec 8, 2024 • edited Loading

Rate limit exceeded

Walkthrough

Changes

Assessment against linked issues

Possibly related PRs

Suggested reviewers

Chat

CodeRabbit Commands (Invoked using PR comments)

Other keywords and placeholders

Documentation and Community

codecov bot commented Dec 8, 2024 • edited Loading

Codecov Report

coderabbitai bot left a comment

Choose a reason for hiding this comment

coderabbitai bot left a comment

Choose a reason for hiding this comment

coderabbitai bot left a comment

Choose a reason for hiding this comment

coderabbitai bot left a comment

Choose a reason for hiding this comment

coderabbitai bot left a comment

Choose a reason for hiding this comment

coderabbitai bot commented Dec 8, 2024 •

edited

Loading

codecov bot commented Dec 8, 2024 •

edited

Loading