
[Feature Request] Shallow Copy Snapshot Enhancements #12023

Open
harishbhakuni opened this issue Jan 25, 2024 · 7 comments
Labels
enhancement, Storage:Snapshots

Comments

@harishbhakuni (Contributor) commented Jan 25, 2024

Is your feature request related to a problem? Please describe

With full copy snapshots, where we upload all local segment data to the snapshot repository, snapshot creation latency depends on the amount of data to be snapshotted.

With shallow copy snapshots, where we only keep a reference to the shard data stored in the remote store, we have removed that dependency to some extent. However, a snapshot still triggers a flush when there is newly ingested, uncommitted data.

Describe the solution you'd like

We can further enhance shallow copy snapshots and make them faster and more lightweight with the following enhancements:

Snapshot at refresh level:

  • When we initially designed the shallow copy feature, the remote store shard-level metadata file was per commit point and was updated in place with each refresh.
  • Also, existing full copy snapshots support shard snapshots at the commit level, so we went ahead with commit-level shallow copy snapshots.
  • However, the remote store shard-level metadata file is now immutable, which means a new metadata file is created at every refresh, and we can use it directly for the snapshot operation.
  • With that, we no longer have to flush on every shard snapshot to create a new commit point. We can rely on the remote store feature to push the latest data and simply lock the latest metadata file in the remote store, which makes the shard snapshot operation lightweight as well (a rough sketch of this flow follows the list).
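
Below is a minimal, hypothetical sketch of the refresh-level flow described above. None of these interfaces or method names are actual OpenSearch APIs; they only illustrate the idea of pinning and referencing the latest remote-store metadata file instead of forcing a flush.

```java
// Hypothetical sketch of a refresh-level shallow shard snapshot.
// Types and method names are illustrative only, not existing OpenSearch APIs.
import java.util.List;

public class RefreshLevelShallowSnapshot {

    /** Minimal view of the remote segment store for one shard (hypothetical). */
    interface RemoteSegmentStore {
        /** Metadata files are immutable; one is written per refresh, newest first. */
        List<String> listMetadataFilesNewestFirst();
        /** Places a lock so the referenced metadata is not garbage-collected. */
        void acquireLock(String metadataFile, String acquirerId);
    }

    /** Minimal view of the snapshot repository (hypothetical). */
    interface SnapshotRepository {
        void writeShallowShardSnapshot(String snapshotId, String shardId, String metadataFileRef);
    }

    /**
     * Snapshot the shard at the latest refresh point: no flush, no segment upload.
     * The snapshot only pins and references the newest remote-store metadata file.
     */
    public void snapshotShard(RemoteSegmentStore remoteStore,
                              SnapshotRepository repository,
                              String snapshotId,
                              String shardId) {
        String latestMetadata = remoteStore.listMetadataFilesNewestFirst().get(0);
        // Pin the metadata (and the segments it lists) against remote-store cleanup.
        remoteStore.acquireLock(latestMetadata, snapshotId);
        // The shard "snapshot" is just a pointer to that metadata file.
        repository.writeShallowShardSnapshot(snapshotId, shardId, latestMetadata);
    }
}
```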

Global state shallow copy:

  • Currently, as part of every snapshot operation, we also upload the cluster state to the snapshot repository.
  • Since we now have remote cluster state, we can do something similar to shard-level snapshots: capture the current remote cluster state and keep a reference to it in the snapshot repository instead of uploading the entire cluster state (see the sketch after this list).
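
A rough sketch of the shallow global state idea, under the assumption that the remote cluster state feature already writes a manifest to the remote store. The types and fields here are hypothetical, chosen only to show that the snapshot stores coordinates of the state rather than the state itself.

```java
// Hypothetical sketch of "shallow" global state in a snapshot: instead of
// serializing and uploading the whole cluster state, the snapshot records a
// reference to the manifest already written by the remote cluster state
// feature. Type and method names are illustrative only.
public class ShallowGlobalState {

    /** A pointer to cluster state that already lives in the remote store (hypothetical). */
    record RemoteClusterStateRef(String clusterUUID, long stateVersion, String manifestPath) {}

    interface SnapshotRepository {
        void writeGlobalStateRef(String snapshotId, RemoteClusterStateRef ref);
    }

    public void snapshotGlobalState(SnapshotRepository repository,
                                    String snapshotId,
                                    RemoteClusterStateRef currentRemoteState) {
        // No upload of the full cluster state: only its coordinates are stored,
        // analogous to how shallow shard snapshots reference remote segment metadata.
        repository.writeGlobalStateRef(snapshotId, currentRemoteState);
    }
}
```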

Pull based model for cluster state:

  • Currently, the snapshot protocol uses the cluster state to track transition states and to share information between nodes, and this state is broadcast by the cluster manager node to all data nodes.
  • When there are many data nodes or the cluster state is large, the snapshot operation can take a long time.
  • The remote cluster state currently mirrors the cluster state. In the future, as mentioned in [Remote State] All the nodes should download full/diff cluster state from remote store #11744, if we go ahead with having data nodes pull the cluster state directly from the remote store, it would further reduce the time taken by snapshot operations (a sketch of this model follows the list).
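
The sketch below illustrates the pull-based model described above. It assumes the cluster manager publishes only a small notification and each data node downloads the full or diff state from the remote store; all names are hypothetical, not existing OpenSearch APIs.

```java
// Hypothetical sketch of the pull-based model: the cluster manager broadcasts
// only a small "state changed" notification, and each data node fetches the
// heavy payload directly from the remote store. Names are illustrative only.
public class PullBasedClusterState {

    /** Tiny notification broadcast by the cluster manager (hypothetical). */
    record StatePublication(long term, long version, String manifestPath) {}

    interface RemoteClusterStateStore {
        /** Downloads the full state or a diff against the locally applied version. */
        byte[] download(String manifestPath, long locallyAppliedVersion);
    }

    /** What a data node does when it receives the small publication message. */
    public byte[] onPublication(RemoteClusterStateStore remoteStore,
                                StatePublication publication,
                                long locallyAppliedVersion) {
        // The heavy payload travels from the remote store to the data node, not
        // from the cluster manager, so snapshot-related state transitions scale
        // with the remote store rather than with the manager's fan-out bandwidth.
        return remoteStore.download(publication.manifestPath(), locallyAppliedVersion);
    }
}
```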

With these enhancements, shallow copy snapshots would mostly have no dependency on the amount of data to be snapshotted or the size of the cluster state. And, if we plan to use the same protocol, we would be one step closer to supporting PITR (point-in-time recovery), which I think we could already support with a granularity of maybe a few minutes.

Related component

Storage:Snapshots

Describe alternatives you've considered

No response

Additional context

No response

harishbhakuni added the enhancement and untriaged labels on Jan 25, 2024
@peternied (Member)

[Triage - attendees 1 2 3 4 5 6 7 8]
@harishbhakuni Thanks for filing, looking forward to seeing PRs to improve these areas.

@ashking94 (Member) commented Feb 8, 2024

Thanks for this @harishbhakuni.

I want to add another point here: I see we create locks for a shard's segment metadata file even if no incremental indexing has been done since the last snapshot. This adds more load to the remote store, and I feel it should be optimised. I would like to get your opinion on this. I should be able to pick this up in the coming days/weeks unless you think there is some limitation or a major overhaul required to achieve this.
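
One possible shape of the optimisation suggested here, sketched with hypothetical names (this is not an agreed design; the reply below discusses a tradeoff with per-acquirer locking): before acquiring a new lock, check whether the shard's latest metadata file is the same one an earlier snapshot already locked, and if so only record the reference.

```java
// Hypothetical sketch: skip creating another lock when no incremental indexing
// has happened since the last snapshot, i.e. the latest metadata file is already
// locked by a previous snapshot. Names are illustrative only.
import java.util.Map;
import java.util.Optional;

public class SkipRedundantLock {

    interface RemoteSegmentStore {
        String latestMetadataFile();
        void acquireLock(String metadataFile, String acquirerId);
    }

    /** metadataFileBySnapshot remembers which metadata file each earlier snapshot locked. */
    public String lockOrReuse(RemoteSegmentStore remoteStore,
                              Map<String, String> metadataFileBySnapshot,
                              String snapshotId) {
        String latest = remoteStore.latestMetadataFile();
        Optional<String> alreadyLocked = metadataFileBySnapshot.values().stream()
                .filter(latest::equals)
                .findAny();
        if (alreadyLocked.isEmpty()) {
            // New data was indexed since the last snapshot: a fresh lock is needed.
            remoteStore.acquireLock(latest, snapshotId);
        }
        // Either way, the new snapshot references the latest metadata file.
        metadataFileBySnapshot.put(snapshotId, latest);
        return latest;
    }
}
```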

@harishbhakuni (Contributor, Author) commented Feb 12, 2024

I see we create locks for a shard's segment metadata file even if no incremental indexing has been done since the last snapshot.

@ashking94, the idea here was to create one lock per file per acquirerID, so that releasing locks stays clean and there are no issues/race conditions where one acquirer releases locks held by another acquirer. A rough sketch of the naming idea is below.
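
A minimal sketch of the "one lock per metadata file per acquirerID" idea, using a hypothetical lock-file naming scheme (not the actual OpenSearch format). It shows why releases stay clean: deleting one snapshot's lock file can never affect another snapshot's lock on the same metadata file.

```java
// Hypothetical lock-file naming: <metadataFile>__<acquirerId>.lock
public class PerAcquirerLockNaming {

    static String lockFileName(String metadataFile, String acquirerId) {
        return metadataFile + "__" + acquirerId + ".lock";
    }

    public static void main(String[] args) {
        String metadata = "metadata__12__34";
        // Two snapshots lock the same metadata file under different lock-file names.
        System.out.println(lockFileName(metadata, "snapshot-A"));
        System.out.println(lockFileName(metadata, "snapshot-B"));
        // Releasing snapshot-A's lock removes only its own file; snapshot-B's
        // reference to the metadata remains intact.
    }
}
```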

@sachinpkale (Member)

@harishbhakuni As this tries to solve/optimise many tasks, can we create a meta issue out of it?

@harishbhakuni (Contributor, Author)

Sure @sachinpkale, will do.

@shourya035 (Member)

@harishbhakuni Please create a META Issue out of this so that this can be tracked better.

@harishbhakuni (Contributor, Author) commented Oct 27, 2024

@sachinpkale @shourya035 Sorry, I missed this earlier. Created this META issue to further track this: #16492.
