[Meta] Remote store-based warm index #8446

Open
2 of 7 tasks
andrross opened this issue Jul 5, 2023 · 6 comments
Labels
feature New feature or request Meta Meta issue, not directly linked to a PR Roadmap:Cost/Performance/Scale Project-wide roadmap label Storage Issues and PRs relating to data and metadata storage

Comments

@andrross
Member

andrross commented Jul 5, 2023

Goals

Create a proof-of-concept that shows end-to-end functionality of a remote-backed index where the data may not all reside locally and can be fetched on-demand from the remote store when necessary. This is the initial implementation of the feature described in #6528. This will build upon the design and prototype started in #7331 in order to demonstrate an end-to-end capability.

This work touches much of the same code as the remote store feature, which is nearing promotion out from behind a feature flag. To avoid complicating that effort in the short term, we’ll start development on a feature branch. Once remote store is no longer behind a feature flag, we’ll move this effort from the feature branch to behind a feature flag on main.

Non-goals

Make final decisions on naming or APIs. The term “warm” is used extensively here as that is a sort of term-of-art and is generally well understood, but one of the larger goals of these efforts is to remove the need for users to think about discrete storage tiers and allow the system to more intelligently optimize based on usage patterns.

Tasks


The above tasks are the initial priority for building the basic functionality. After that, we will implement the functionality described below to dynamically change the "warm" property on an index:

  • Implement support for adding “warm” setting to hot index (hot-to-warm)
    • All files on local disk are logically moved to the file cache (no physical movement). If the file cache is filled, then files will be deleted from local disk based on a TBD heuristic.
    • The index will remain readable and writable with no extended downtime.
  • Implement support for removing “warm” setting from warm index (warm-to-hot)
    • All block-based files will be deleted from disk and complete files will be restored from the remote store.
    • The index will remain readable and writable with no extended downtime.
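
To make the two transitions above concrete, here is a minimal planning sketch. It is not OpenSearch code: `WarmTransitionSketch`, `ShardFile`, and `TransitionPlan` are hypothetical names, the file names are made up, and the "keep files in listing order until the cache is full" rule is only a stand-in for the TBD heuristic. The sketch computes which files would be adopted, deleted, or restored and performs no I/O.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical planner for the hot-to-warm and warm-to-hot transitions. */
final class WarmTransitionSketch {

    /** Hypothetical view of one shard file on local disk. */
    static final class ShardFile {
        final String name;
        final long sizeBytes;
        final boolean blockBased; // true when only some blocks of the file are local

        ShardFile(String name, long sizeBytes, boolean blockBased) {
            this.name = name;
            this.sizeBytes = sizeBytes;
            this.blockBased = blockBased;
        }
    }

    /** Hypothetical outcome of planning a transition: file names only. */
    static final class TransitionPlan {
        final List<String> adoptIntoCache = new ArrayList<>();
        final List<String> deleteLocally = new ArrayList<>();
        final List<String> restoreFromRemote = new ArrayList<>();
    }

    /**
     * hot-to-warm: local files become file cache entries in place (no copy).
     * Files that do not fit are marked for local deletion. Assumption: the
     * real eviction heuristic is TBD, so listing order is used here.
     */
    static TransitionPlan hotToWarm(List<ShardFile> localFiles, long cacheCapacityBytes) {
        TransitionPlan plan = new TransitionPlan();
        long usedBytes = 0;
        for (ShardFile file : localFiles) {
            if (usedBytes + file.sizeBytes <= cacheCapacityBytes) {
                usedBytes += file.sizeBytes;
                plan.adoptIntoCache.add(file.name);
            } else {
                plan.deleteLocally.add(file.name);
            }
        }
        return plan;
    }

    /**
     * warm-to-hot: block-based (partial) files are deleted locally, and every
     * remote file that is not already complete on disk is restored in full.
     */
    static TransitionPlan warmToHot(List<ShardFile> localFiles, List<String> remoteFileNames) {
        TransitionPlan plan = new TransitionPlan();
        List<String> completeLocally = new ArrayList<>();
        for (ShardFile file : localFiles) {
            if (file.blockBased) {
                plan.deleteLocally.add(file.name);
            } else {
                completeLocally.add(file.name);
            }
        }
        for (String name : remoteFileNames) {
            if (!completeLocally.contains(name)) {
                plan.restoreFromRemote.add(name);
            }
        }
        return plan;
    }
}
```

Keeping planning separate from execution is one possible design choice: the plan can be inspected or tested before anything is deleted or downloaded, which matters if the index has to stay readable and writable throughout.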
@andrross andrross added Meta Meta issue, not directly linked to a PR distributed framework labels Jul 5, 2023
@andrross andrross added feature New feature or request and removed untriaged labels Jul 5, 2023
@anasalkouz
Member

All block-based files will be deleted from disk and complete files will be restored from the remote store.

Why is this required, given that we will have the hybrid directory and can read from both complete files and block-based files?

@ankitkala
Member

ankitkala commented Jan 8, 2024

Here is the ordered list of tasks for starting the work on writable warm.

  • Add initial bootstrap code for warm primary (issue1 issue2).
    • Uses file cache & composite directory.
    • Implement IndexInput for local files with file cache.
  • Support for replicas (issue).
    • Replica recovery flow (start replica with partial data on remote)
    • Supporting segment replication on warm replicas.
      • Prefetch optimizations for replica (data needed to open a reader).
      • Add support for remote store backed index inputs.
  • Primary promotion (replica to primary, primary relocation) (issue).
    • Testing write path on primary with tiered data.
  • Support encryption with writable warm (issue).
  • Prefetch optimizations (TBD).
  • Support for file cache evictions, based on TTL and overflow criteria.
    • Open question: how long to retain new files locally on the primary.
    • Overflow criteria for File Cache.
  • [Observability] Expose stats & metrics for FileCache (issue).
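
As a rough illustration of the "TTL & overflow criteria" item in the list above, the sketch below evicts an entry when it has been idle for at least a TTL, or when the cache is over capacity, in least-recently-used order. It is an assumption-laden sketch rather than the actual FileCache: `TtlFileCacheSketch` and its methods are hypothetical, LRU is just one candidate overflow criterion, and timestamps are passed in explicitly so the behavior is deterministic.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical file cache with TTL-based and overflow-based eviction. */
final class TtlFileCacheSketch {

    private static final class Entry {
        final long sizeBytes;
        long lastAccessMillis;

        Entry(long sizeBytes, long lastAccessMillis) {
            this.sizeBytes = sizeBytes;
            this.lastAccessMillis = lastAccessMillis;
        }
    }

    // accessOrder=true: iteration runs from least to most recently used.
    private final LinkedHashMap<String, Entry> entries = new LinkedHashMap<>(16, 0.75f, true);
    private final long capacityBytes;
    private final long ttlMillis;
    private long usedBytes;

    TtlFileCacheSketch(long capacityBytes, long ttlMillis) {
        this.capacityBytes = capacityBytes;
        this.ttlMillis = ttlMillis;
    }

    /** Registers a file (or replaces an existing entry) as of nowMillis. */
    void put(String name, long sizeBytes, long nowMillis) {
        Entry previous = entries.put(name, new Entry(sizeBytes, nowMillis));
        if (previous != null) {
            usedBytes -= previous.sizeBytes;
        }
        usedBytes += sizeBytes;
    }

    /** Records a read of the file; returns false if it is not cached. */
    boolean touch(String name, long nowMillis) {
        Entry entry = entries.get(name); // get() also marks it most recently used
        if (entry == null) {
            return false;
        }
        entry.lastAccessMillis = nowMillis;
        return true;
    }

    /** Evicts idle entries, plus LRU entries while over capacity; returns their names. */
    List<String> evict(long nowMillis) {
        List<String> evicted = new ArrayList<>();
        Iterator<Map.Entry<String, Entry>> it = entries.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<String, Entry> candidate = it.next();
            String name = candidate.getKey();
            Entry entry = candidate.getValue();
            boolean expired = nowMillis - entry.lastAccessMillis >= ttlMillis;
            boolean overflowing = usedBytes > capacityBytes;
            if (expired || overflowing) {
                it.remove();
                usedBytes -= entry.sizeBytes;
                evicted.add(name);
            }
        }
        return evicted;
    }

    long usedBytes() {
        return usedBytes;
    }
}
```

The sketch does not address the open question above: on a primary, a newly written file presumably cannot be dropped until it has been uploaded to the remote store, so a real implementation would need some form of pinning that is not modeled here.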

@ns-sladani

Hey @andrross,

Thank you and your team for the work on the search functionality for remote-backed indexes. Could you please provide an estimated timeline (ETA) for the release of this feature? I have reviewed several GitHub issues, including #6528 and #8446, but did not find an ETA mentioned.

We are planning to implement remote-backed indexes for our on-premises OpenSearch deployment, and this feature will significantly help us reduce local disk storage by leveraging AWS S3 object storage.

Thank you once again! Very excited for this feature.

@andrross
Member Author

andrross commented Dec 6, 2024

Hey @ns-sladani, thanks for your interest! Unfortunately I don't have an answer to your question as I am not actively working on this feature myself. Tagging some others who might be able to help: @ankitkala @rayshrey

@ns-sladani

Thanks @andrross!

Hey @ankitkala | @rayshrey,

Do you know of an estimated timeline (ETA) for the release of this feature, or of any PR we can use to track it? That would be a significant help.

Thank you!

Projects
Status: New
Development

No branches or pull requests

5 participants