feat: decrease transport_ack_timeout #251
Walkthrough
The changes in this pull request modify two constants related to timeout settings in the SPDK package.
Actionable comments posted: 0
🧹 Outside diff range and nitpick comments (1)
pkg/spdk/types.go (1)
52-52: LGTM! Timeout reduction aligns with documented strategy
The reduction of replicaTransportAckTimeout from 14s to 10s aligns with the documented timing strategy in the comments:
- NVMe-oF initiator ctrl-loss-tmo: 30s
- Replica controller loss timeout: 15s
- Fast I/O fail timeout: 10s
- Transport ACK timeout: 10s (new value)
This creates a clear timing hierarchy that should help prevent the race condition where the initiator incorrectly determines target failure.
The timing hierarchy ensures:
- Transport issues are detected within 10s
- I/O failures are triggered within 10s
- Base bdev removal occurs after 15s
- Controller loss is determined at 30s
This layered approach should provide better failure detection and recovery.
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
⛔ Files ignored due to path filters (4)
go.mod is excluded by !go.mod
go.sum is excluded by !**/*.sum, !go.sum
vendor/github.com/longhorn/go-spdk-helper/pkg/types/types.go is excluded by !vendor/**
vendor/modules.txt is excluded by !vendor/**
📒 Files selected for processing (2)
pkg/spdk/disk/nvme/nvme.go (1 hunks)
pkg/spdk/types.go (1 hunks)
🔇 Additional comments (1)
pkg/spdk/disk/nvme/nvme.go (1)
23-23: LGTM! Verify timeout hierarchy alignment
The reduction of diskTransportAckTimeout from 14s to 10s aligns well with the timeout hierarchy:
- diskCtrlrLossTimeoutSec: 30s (main timeout)
- diskFastIOFailTimeoutSec: 15s (half of controller loss)
- diskTransportAckTimeout: 10s (new value)
This change should help fail faster in case of connection issues, potentially preventing the race condition described in issue #9874.
Let's verify if this timeout is used consistently across the codebase:
Codecov Report
All modified and coverable lines are covered by tests ✅
Additional details and impacted files
@@           Coverage Diff           @@
##            main    #251    +/-   ##
=====================================
  Coverage   8.38%   8.38%
=====================================
  Files         21      21
  Lines       7159    7159
=====================================
  Hits         600     600
  Misses      6478    6478
  Partials      81      81
@DamiaSan This is ready for review. Thank you.
Just one question: shouldn't we also modify the constants defined in
Which const?
These are used by the nvme cli initiator. Shouldn't it be affected by the same issue?
No. The change to transport_ack_timeout applies to SPDK's transport layer and has nothing to do with the three parameters you mentioned.
Longhorn 9874 Signed-off-by: Derek Su <[email protected]>
Decrease the transport_ack_timeout value to improve error detection in the transport layer. Longhorn 9874 Signed-off-by: Derek Su <[email protected]>
Which issue(s) this PR fixes:
Issue longhorn/longhorn#9874
Signed-off-by: Derek Su [email protected]
What this PR does / why we need it:
Special notes for your reviewer:
Additional documentation or context