Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(lep): auto-salvage support for v2 volumes #9486

Merged
merged 1 commit into from
Nov 7, 2024

Conversation

c3y1huang
Copy link
Contributor

Which issue(s) this PR fixes:

Issue #8430

What this PR does / why we need it:

Proposes extending Longhorn's auto-salvage feature to support v2 volumes

Special notes for your reviewer:

None

Additional documentation or context

https://suse.slack.com/archives/C02DR3N5T24/p1713926457976349

@c3y1huang c3y1huang self-assigned this Sep 19, 2024
@c3y1huang c3y1huang force-pushed the 8430-lep-feat-v2-auto-salvage branch from 97bd3bd to 4ca8b43 Compare September 19, 2024 03:05
@c3y1huang c3y1huang marked this pull request as ready for review September 19, 2024 03:06
@c3y1huang c3y1huang requested a review from a team as a code owner September 19, 2024 03:06
@c3y1huang c3y1huang force-pushed the 8430-lep-feat-v2-auto-salvage branch 2 times, most recently from 47e9a88 to effb7b7 Compare September 19, 2024 04:28
Copy link
Contributor

@DamiaSan DamiaSan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This functionality has some overlaps with the delta rebuilding, for example in the case when the node where the raid resides restart (we must elect an healty replica and compare all others replica with this one to ensure they are aligned). So I think we need an unique implementation for these 2 functionalities. WDYT?
cc @innobead

@c3y1huang
Copy link
Contributor Author

c3y1huang commented Sep 19, 2024

This functionality has some overlaps with the delta rebuilding, for example in the case when the node where the raid resides restart (we must elect an healty replica and compare all others replica with this one to ensure they are aligned). So I think we need an unique implementation for these 2 functionalities. WDYT?

Based on the Slack discussion: There is no way to guarantee that there won't be any data block missing in the head when the volume suddenly crashes. As @shuo-wu suggested, our best approach, for now, is to ensure that the replica candidates have the complete snapshot set. I don't see the overlap between this proposal and the future integration with the delta rebuilding as this proposal leans more towards the control plan selection of the reusable candidate. Could you elaborate on the potential overlap?

IMO we can enhance the handling of the scenario when the RAID node goes offline with delta rebuilding. An approach is to address this in the engine instance, similar to how v1 volume handles it, or incorporate additional information in Replica.Status.Snapshots? What do you think?

@derekbit @shuo-wu , any thoughts about this?

@c3y1huang c3y1huang force-pushed the 8430-lep-feat-v2-auto-salvage branch 5 times, most recently from bbf6865 to 99350f9 Compare September 19, 2024 07:30
@c3y1huang c3y1huang requested a review from DamiaSan September 19, 2024 07:31
@DamiaSan
Copy link
Contributor

This functionality has some overlaps with the delta rebuilding, for example in the case when the node where the raid resides restart (we must elect an healty replica and compare all others replica with this one to ensure they are aligned). So I think we need an unique implementation for these 2 functionalities. WDYT?

Based on the Slack discussion: There is no way to guarantee that there won't be any data block missing in the head when the volume suddenly crashes. As @shuo-wu suggested, our best approach, for now, is to ensure that the replica candidates have the complete snapshot set. I don't see the overlap between this proposal and the future integration with the delta rebuilding as this proposal leans more towards the control plan selection of the reusable candidate. Could you elaborate on the potential overlap?

IMO we can enhance the handling of the scenario when the RAID node goes offline with delta rebuilding. An approach is to address this in the engine instance, similar to how v1 volume handles it, or incorporate additional information in Replica.Status.Snapshots? What do you think?

@derekbit @shuo-wu , any thoughts about this?

The overlap is the selection method and the check that replicas have the complete snapshot set.

@c3y1huang c3y1huang marked this pull request as draft September 20, 2024 02:47
@c3y1huang
Copy link
Contributor Author

c3y1huang commented Sep 20, 2024

Summary from the LEP meeting:

  1. Remove Replica.Status.Snapshots from this proposal. There shouldn't be a scenario for inconsistent snapshot sets between failed reusable replicas.
  2. Filter replica candidates based on the size of the head volume.
  3. Block trim operations when the volume is degraded to maintain reliable volume head size.

@c3y1huang c3y1huang force-pushed the 8430-lep-feat-v2-auto-salvage branch 5 times, most recently from 942c8e1 to d8b3896 Compare September 23, 2024 04:52
@c3y1huang c3y1huang marked this pull request as ready for review September 23, 2024 04:53
@c3y1huang c3y1huang force-pushed the 8430-lep-feat-v2-auto-salvage branch 2 times, most recently from 87d1798 to 3d0a937 Compare September 23, 2024 06:42
@c3y1huang c3y1huang force-pushed the 8430-lep-feat-v2-auto-salvage branch from 3d0a937 to a295616 Compare September 24, 2024 00:40
@c3y1huang c3y1huang force-pushed the 8430-lep-feat-v2-auto-salvage branch 2 times, most recently from 9aa5d5a to 462f283 Compare September 24, 2024 08:37
@c3y1huang c3y1huang requested a review from DamiaSan September 24, 2024 08:39
DamiaSan
DamiaSan previously approved these changes Sep 24, 2024
@c3y1huang
Copy link
Contributor Author

c3y1huang commented Sep 27, 2024

The CI codespell check failed to build. I'm not sure if it's a temporary thing. It's working again.

@c3y1huang c3y1huang force-pushed the 8430-lep-feat-v2-auto-salvage branch 4 times, most recently from 7df79cf to e891cae Compare September 27, 2024 05:26
@c3y1huang c3y1huang force-pushed the 8430-lep-feat-v2-auto-salvage branch 3 times, most recently from e784d87 to 68aec86 Compare October 28, 2024 02:19
@c3y1huang c3y1huang requested a review from shuo-wu October 28, 2024 03:08
@c3y1huang c3y1huang force-pushed the 8430-lep-feat-v2-auto-salvage branch 2 times, most recently from 3c5488b to 3f07e2c Compare November 7, 2024 03:11
@c3y1huang c3y1huang requested a review from derekbit November 7, 2024 03:15
@c3y1huang c3y1huang force-pushed the 8430-lep-feat-v2-auto-salvage branch 2 times, most recently from fd870fb to 566f7de Compare November 7, 2024 05:05
@c3y1huang c3y1huang requested a review from derekbit November 7, 2024 05:05
longhorn/longhorn-8430

Signed-off-by: Chin-Ya Huang <[email protected]>
@c3y1huang c3y1huang force-pushed the 8430-lep-feat-v2-auto-salvage branch from 566f7de to 0355674 Compare November 7, 2024 05:06
Copy link
Member

@derekbit derekbit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. The LEP is quite straightforward.

@derekbit derekbit merged commit c4d7b43 into longhorn:master Nov 7, 2024
5 checks passed
@c3y1huang c3y1huang deleted the 8430-lep-feat-v2-auto-salvage branch November 7, 2024 06:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants