
[Feature Request] Shallow Copy Snapshot Enhancements #12023

Open
harishbhakuni opened this issue Jan 25, 2024 · 7 comments
Labels
enhancement, Storage:Snapshots

Comments

@harishbhakuni (Contributor) commented Jan 25, 2024

Is your feature request related to a problem? Please describe

With full copy snapshots, where we upload all local segment data to the snapshot repository, snapshot creation latency depends on the amount of data to be snapshotted.

With shallow copy snapshots, where we only keep a reference to the shard data stored in the remote store, we have removed that dependency to some extent. However, a snapshot still triggers a flush when there is newly ingested, uncommitted data.

Describe the solution you'd like

We can further enhance shallow copy snapshots and make them faster and more lightweight with the following enhancements:

Snapshot at refresh level:

  • When we initially designed the shallow copy feature, the remote store shard-level metadata file was per commit point and was updated in place with each refresh.
  • Also, existing full copy snapshots support shard snapshots at the commit level, so we went ahead with commit-level shallow copy snapshots.
  • However, the remote store shard-level metadata file is now immutable, which means a new metadata file is created at every refresh, and we can use it directly for the snapshot operation.
  • With that, we no longer have to flush on every shard snapshot to create a new commit point. We can rely on the remote store feature to push the latest data and simply lock the latest metadata file in the remote store, which makes the shard snapshot operation lightweight as well (a rough sketch of this flow follows the list).
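
Below is a minimal, hypothetical sketch of the refresh-level flow described above. None of these interfaces or method names are actual OpenSearch APIs; they only illustrate the idea of pinning and referencing the latest remote-store metadata file instead of forcing a flush.

```java
// Hypothetical sketch of a refresh-level shallow shard snapshot.
// Types and method names are illustrative only, not existing OpenSearch APIs.
import java.util.List;

public class RefreshLevelShallowSnapshot {

    /** Minimal view of the remote segment store for one shard (hypothetical). */
    interface RemoteSegmentStore {
        /** Metadata files are immutable; one is written per refresh, newest first. */
        List<String> listMetadataFilesNewestFirst();
        /** Places a lock so the referenced metadata is not garbage-collected. */
        void acquireLock(String metadataFile, String acquirerId);
    }

    /** Minimal view of the snapshot repository (hypothetical). */
    interface SnapshotRepository {
        void writeShallowShardSnapshot(String snapshotId, String shardId, String metadataFileRef);
    }

    /**
     * Snapshot the shard at the latest refresh point: no flush, no segment upload.
     * The snapshot only pins and references the newest remote-store metadata file.
     */
    public void snapshotShard(RemoteSegmentStore remoteStore,
                              SnapshotRepository repository,
                              String snapshotId,
                              String shardId) {
        String latestMetadata = remoteStore.listMetadataFilesNewestFirst().get(0);
        // Pin the metadata (and the segments it lists) against remote-store cleanup.
        remoteStore.acquireLock(latestMetadata, snapshotId);
        // The shard "snapshot" is just a pointer to that metadata file.
        repository.writeShallowShardSnapshot(snapshotId, shardId, latestMetadata);
    }
}
```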

Global state shallow copy:

  • Currently, as part of every snapshot operation, we also upload the cluster state to the snapshot repository.
  • Since we now have remote cluster state, we can do something similar to shard-level snapshots: capture the current remote cluster state and keep a reference to it in the snapshot repository instead of uploading the entire cluster state (see the sketch after this list).
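
A rough sketch of the shallow global state idea, under the assumption that the remote cluster state feature already writes a manifest to the remote store. The types and fields here are hypothetical, chosen only to show that the snapshot stores coordinates of the state rather than the state itself.

```java
// Hypothetical sketch of "shallow" global state in a snapshot: instead of
// serializing and uploading the whole cluster state, the snapshot records a
// reference to the manifest already written by the remote cluster state
// feature. Type and method names are illustrative only.
public class ShallowGlobalState {

    /** A pointer to cluster state that already lives in the remote store (hypothetical). */
    record RemoteClusterStateRef(String clusterUUID, long stateVersion, String manifestPath) {}

    interface SnapshotRepository {
        void writeGlobalStateRef(String snapshotId, RemoteClusterStateRef ref);
    }

    public void snapshotGlobalState(SnapshotRepository repository,
                                    String snapshotId,
                                    RemoteClusterStateRef currentRemoteState) {
        // No upload of the full cluster state: only its coordinates are stored,
        // analogous to how shallow shard snapshots reference remote segment metadata.
        repository.writeGlobalStateRef(snapshotId, currentRemoteState);
    }
}
```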

Pull based model for cluster state:

  • Currently, the snapshot protocol uses the cluster state to track transition states and to share information between nodes, and this state is broadcast by the cluster manager node to all data nodes.
  • When there are many data nodes or the cluster state is large, the snapshot operation can take a long time.
  • The remote cluster state currently mirrors the cluster state. In the future, as mentioned in [Remote State] All the nodes should download full/diff cluster state from remote store #11744, if we go ahead with having data nodes pull the cluster state directly from the remote store, it would further reduce the time taken by snapshot operations (a sketch of this model follows the list).
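
The sketch below illustrates the pull-based model described above. It assumes the cluster manager publishes only a small notification and each data node downloads the full or diff state from the remote store; all names are hypothetical, not existing OpenSearch APIs.

```java
// Hypothetical sketch of the pull-based model: the cluster manager broadcasts
// only a small "state changed" notification, and each data node fetches the
// heavy payload directly from the remote store. Names are illustrative only.
public class PullBasedClusterState {

    /** Tiny notification broadcast by the cluster manager (hypothetical). */
    record StatePublication(long term, long version, String manifestPath) {}

    interface RemoteClusterStateStore {
        /** Downloads the full state or a diff against the locally applied version. */
        byte[] download(String manifestPath, long locallyAppliedVersion);
    }

    /** What a data node does when it receives the small publication message. */
    public byte[] onPublication(RemoteClusterStateStore remoteStore,
                                StatePublication publication,
                                long locallyAppliedVersion) {
        // The heavy payload travels from the remote store to the data node, not
        // from the cluster manager, so snapshot-related state transitions scale
        // with the remote store rather than with the manager's fan-out bandwidth.
        return remoteStore.download(publication.manifestPath(), locallyAppliedVersion);
    }
}
```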

With these enhancements, shallow copy snapshots would mostly have no dependency on the amount of data to be snapshotted or the size of the cluster state. And, if we plan to use the same protocol, we would be one step closer to supporting PITR (point-in-time recovery), which I think we could already support with a granularity of maybe a few minutes.

Related component

Storage:Snapshots

Describe alternatives you've considered

No response

Additional context

No response

harishbhakuni added the enhancement and untriaged labels on Jan 25, 2024
@peternied (Member)

[Triage - attendees 1 2 3 4 5 6 7 8]
@harishbhakuni Thanks for filing, looking forward to seeing PRs to improve these areas.

@ashking94 (Member) commented Feb 8, 2024

Thanks for this @harishbhakuni.

I want to add another point here: I see we create locks for a shard's segment metadata file even if no incremental indexing has been done since the last snapshot. This adds more load to the remote store, and I feel it should be optimised. I would like to get your opinion on this. I should be able to pick this up in the coming days/weeks unless you think there is some limitation or a major overhaul required to achieve this.
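
One possible shape of the optimisation suggested here, sketched with hypothetical names (this is not an agreed design; the reply below discusses a tradeoff with per-acquirer locking): before acquiring a new lock, check whether the shard's latest metadata file is the same one an earlier snapshot already locked, and if so only record the reference.

```java
// Hypothetical sketch: skip creating another lock when no incremental indexing
// has happened since the last snapshot, i.e. the latest metadata file is already
// locked by a previous snapshot. Names are illustrative only.
import java.util.Map;
import java.util.Optional;

public class SkipRedundantLock {

    interface RemoteSegmentStore {
        String latestMetadataFile();
        void acquireLock(String metadataFile, String acquirerId);
    }

    /** metadataFileBySnapshot remembers which metadata file each earlier snapshot locked. */
    public String lockOrReuse(RemoteSegmentStore remoteStore,
                              Map<String, String> metadataFileBySnapshot,
                              String snapshotId) {
        String latest = remoteStore.latestMetadataFile();
        Optional<String> alreadyLocked = metadataFileBySnapshot.values().stream()
                .filter(latest::equals)
                .findAny();
        if (alreadyLocked.isEmpty()) {
            // New data was indexed since the last snapshot: a fresh lock is needed.
            remoteStore.acquireLock(latest, snapshotId);
        }
        // Either way, the new snapshot references the latest metadata file.
        metadataFileBySnapshot.put(snapshotId, latest);
        return latest;
    }
}
```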

@harishbhakuni (Contributor, Author) commented Feb 12, 2024

I see we create locks for a shard's segment metadata file even if no incremental indexing has been done since the last snapshot.

@ashking94, the idea here was to create one lock per file per acquirerID, so that releasing locks stays clean and there are no issues/race conditions where one acquirer releases locks held by another acquirer. A rough sketch of the naming idea is below.
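
A minimal sketch of the "one lock per metadata file per acquirerID" idea, using a hypothetical lock-file naming scheme (not the actual OpenSearch format). It shows why releases stay clean: deleting one snapshot's lock file can never affect another snapshot's lock on the same metadata file.

```java
// Hypothetical lock-file naming: <metadataFile>__<acquirerId>.lock
public class PerAcquirerLockNaming {

    static String lockFileName(String metadataFile, String acquirerId) {
        return metadataFile + "__" + acquirerId + ".lock";
    }

    public static void main(String[] args) {
        String metadata = "metadata__12__34";
        // Two snapshots lock the same metadata file under different lock-file names.
        System.out.println(lockFileName(metadata, "snapshot-A"));
        System.out.println(lockFileName(metadata, "snapshot-B"));
        // Releasing snapshot-A's lock removes only its own file; snapshot-B's
        // reference to the metadata remains intact.
    }
}
```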

@sachinpkale (Member)

@harishbhakuni As this tries to solve/optimise many tasks, can we create a meta issue out of it?

@harishbhakuni (Contributor, Author)

Sure @sachinpkale, will do.

@shourya035 (Member)

@harishbhakuni Please create a META Issue out of this so that this can be tracked better.

@harishbhakuni (Contributor, Author) commented Oct 27, 2024

@sachinpkale @shourya035 Sorry, I missed this earlier. Created this META issue to further track this: #16492.
