[Searchable Snapshots] [Low Level Design] Block Based Storage #4033

Closed
kotwanikunal opened this issue Jul 28, 2022 · 7 comments
Labels: discuss, enhancement, feature, Indexing & Search

Comments

@kotwanikunal
Member

This document outlines the low level design proposal for implementing block-based storage. High level design document and proposal: #3869

Overview

A block-based file system will enable fetching only parts of the Lucene IndexInput files from the snapshot within the repository, instead of downloading each entire file to disk. Only the bytes accessed by the query are downloaded.

BlockedIndexInput
The solution implements a wrapper around the IndexInput class to manage block calculation, fetching, and seeking. This wrapper will work as a virtual file that Lucene uses to read index files. Internally, it will keep track of the necessary blocks and perform the calculations required to fetch other blocks as the query needs them.

VirtualFileIndexInput
Additionally, another wrapper will be used to fetch the virtual file data (https://github.com/opensearch-project/OpenSearch/blob/main/server/src/main/java/org/opensearch/repositories/blobstore/BlobStoreRepository.java#L205), which consists of the metadata for the Lucene segment files. This will work on a “fetch-when-read” basis, downloading the entire file to disk when the service needs it.

The design has been broken down into two phases:

  1. Phase 1: This phase will add the block-based fetching and virtual file mechanisms without any cache. To avoid refetching blocks and files, there will be an on-disk check to verify whether the file has already been fetched. This naive implementation does not account for disk cleanup of unused blocks and is solely focused on implementing the block-based fetch logic.
  2. Phase 2: This phase will add the caching mechanism required to keep the necessary blocks on disk. The implementation will add a wrapper with a RefCount mechanism to keep track of the files that are in use, and will also use an eviction mechanism to further optimize disk storage.

Low Level Design

Phase 1: Without Cache

As described above, Phase 1 will implement only what is necessary to enable block-based fetching of segment files. In addition to the interface definitions, we will implement the block calculation and fetch logic within the BlockedIndexInput class.

The properties and methods like getBlock*, getCurrentBlock*, blockSize, blockMask, blockSizeShift will be utilized to enable the block calculation logic for the actual segment file described by fileInfo.
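The block calculation can be illustrated with a minimal sketch. It assumes that blockSize is a power of two derived from blockSizeShift, so that a file position splits into a block index and an offset within the block using a shift and a mask. The class name BlockMath, the method signatures, and the 8 MiB block size are hypothetical and are not taken from the proposal or the POC.

```java
/** Hypothetical helper showing block arithmetic; not the actual BlockedIndexInput code. */
final class BlockMath {
    final int blockSizeShift; // log2 of the block size (assumed power of two)
    final long blockSize;     // bytes per block
    final long blockMask;     // low bits that select the offset within a block

    BlockMath(int blockSizeShift) {
        if (blockSizeShift < 0 || blockSizeShift > 30) {
            throw new IllegalArgumentException("blockSizeShift out of range: " + blockSizeShift);
        }
        this.blockSizeShift = blockSizeShift;
        this.blockSize = 1L << blockSizeShift;
        this.blockMask = blockSize - 1;
    }

    /** Index of the block that contains the given file position. */
    int getBlock(long pos) {
        return (int) (pos >>> blockSizeShift);
    }

    /** Offset of the given file position within its block. */
    int getBlockOffset(long pos) {
        return (int) (pos & blockMask);
    }

    /** File position at which the given block starts. */
    long getBlockStart(int block) {
        return (long) block << blockSizeShift;
    }

    /** Length of the given block; the last block of a file may be shorter than blockSize. */
    long getBlockLength(int block, long fileLength) {
        return Math.min(blockSize, fileLength - getBlockStart(block));
    }

    public static void main(String[] args) {
        BlockMath math = new BlockMath(23); // 8 MiB blocks, an assumed value
        long pos = 20_000_000L;
        System.out.println("block=" + math.getBlock(pos) + " offset=" + math.getBlockOffset(pos));
    }
}
```

With a power-of-two block size, a seek to any position needs only a shift and a mask to decide which block must be present on disk.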

The fetchBlock, downloadBlockAsync, and downloadTo methods will handle the fetching logic for the blocks as well as the virtual files.
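As a rough illustration of the Phase 1 behaviour (check the disk first, download only when the block is missing), the sketch below fetches one block synchronously. Everything in it is an assumption made for illustration: the RangeReader interface stands in for whatever the repository exposes for ranged reads, the "name.block" file naming is invented, and the asynchronous path suggested by downloadBlockAsync is not modelled.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

/** Hypothetical Phase 1 style fetcher: no cache, only an on-disk existence check. */
final class OnDiskBlockFetcher {
    /** Assumed abstraction over the repository: reads `length` bytes starting at `offset`. */
    interface RangeReader {
        InputStream read(String blobName, long offset, long length) throws IOException;
    }

    private final Path blockDir;
    private final RangeReader reader;
    private final long blockSize;

    OnDiskBlockFetcher(Path blockDir, RangeReader reader, long blockSize) {
        this.blockDir = blockDir;
        this.reader = reader;
        this.blockSize = blockSize;
    }

    /** Returns the local file holding the block, downloading it only if it is not on disk yet. */
    Path fetchBlock(String blobName, int block, long fileLength) throws IOException {
        Path blockFile = blockDir.resolve(blobName + "." + block); // assumed naming scheme
        if (Files.exists(blockFile)) {
            return blockFile; // already fetched previously, so skip the download
        }
        long start = (long) block * blockSize;
        if (start >= fileLength) {
            throw new IllegalArgumentException("block " + block + " is past the end of the file");
        }
        long length = Math.min(blockSize, fileLength - start);
        // Write to a temporary file first so a partial download is never mistaken for a block.
        Path tmp = Files.createTempFile(blockDir, "block", ".tmp");
        try (InputStream in = reader.read(blobName, start, length)) {
            Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING);
        }
        Files.move(tmp, blockFile, StandardCopyOption.ATOMIC_MOVE);
        return blockFile;
    }
}
```

Nothing here ever deletes a block file, which matches the stated Phase 1 limitation that disk cleanup is out of scope.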

[Figure: FileSystem-GH - Index Input without Cache]

Phase 2: With Cache

The implementation from Phase 1 will be extended to cache the blocks, with cache eviction deleting blocks to reduce storage needs.
FileCachedIndexInput will implement a RefCount mechanism to keep track of open handles and to ensure the file is cached within a BlockCache/FileCache.
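One way to picture the combination of reference counting and eviction is the sketch below. It is not the proposed FileCachedIndexInput or FileCache: the class name, the byte-based capacity, and the choice of least-recently-used eviction are all assumptions made for illustration. The idea shown is only that entries with open handles (refCount > 0) are never evicted, while unreferenced entries are removed once the cache exceeds its capacity.

```java
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical ref-counted cache index with LRU eviction of unreferenced entries. */
final class RefCountedBlockCache {
    private static final class Entry {
        final long size;
        int refCount; // number of open handles on this block

        Entry(long size) {
            this.size = size;
        }
    }

    private final long capacityBytes;
    private long usedBytes;
    // Access-ordered map: iteration goes from least to most recently used.
    private final LinkedHashMap<String, Entry> entries = new LinkedHashMap<>(16, 0.75f, true);

    RefCountedBlockCache(long capacityBytes) {
        this.capacityBytes = capacityBytes;
    }

    /** Registers a block of the given size, then evicts unreferenced blocks if over capacity. */
    synchronized void put(String key, long size) {
        if (entries.containsKey(key)) {
            return;
        }
        entries.put(key, new Entry(size));
        usedBytes += size;
        evict();
    }

    /** Increments the reference count; returns false if the block is not cached. */
    synchronized boolean acquire(String key) {
        Entry e = entries.get(key);
        if (e == null) {
            return false;
        }
        e.refCount++;
        return true;
    }

    /** Decrements the reference count when a handle is closed. */
    synchronized void release(String key) {
        Entry e = entries.get(key);
        if (e == null || e.refCount == 0) {
            throw new IllegalStateException("release without matching acquire: " + key);
        }
        e.refCount--;
    }

    synchronized boolean contains(String key) {
        return entries.containsKey(key);
    }

    private void evict() {
        Iterator<Map.Entry<String, Entry>> it = entries.entrySet().iterator();
        while (usedBytes > capacityBytes && it.hasNext()) {
            Entry e = it.next().getValue();
            if (e.refCount == 0) { // blocks still in use are skipped
                usedBytes -= e.size;
                it.remove(); // a real implementation would also delete the file on disk
            }
        }
    }
}
```

In this sketch the cache may temporarily exceed its capacity when every entry is referenced; whether the real design allows that is not stated in the proposal.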

[Figure: FileSystem-GH - Index Input with Cache]

Directory

ReadOnlyDirectory will be implemented to block writes for the Store and will complement the ReadOnlyEngine for Searchable Snapshots. It will utilize the IndexInput classes described above to open the index for reads for a remote snapshot.
[Figure: FileSystem-Directory]

@AmiStrn
Contributor

AmiStrn commented Jul 28, 2022

Quality design! Thanks @kotwanikunal

What about the snapshots themselves? Today we can create a snapshot under any name, containing any subset of the indices from the cluster. And to make matters more complex, several snapshots could contain the same index. Will this feature require the snapshot to be named a certain way, or to contain only one index?

...
Answering myself here :) see proposal for API that describes creating snapshots with "storage_type": "remote_snapshot"
see [Searchable Snapshot] Propose API

@kotwanikunal
Member Author

kotwanikunal commented Jul 28, 2022

Thanks @AmiStrn!
The mechanism that we are currently designing will restore the metadata for the indices within the snapshot when that property is passed in and will skip restoring the segment files.
It will use the same methodology as the current snapshot restore process - naming conventions/alias conventions and other restrictions.
Once the metadata restore is complete, the new file system will be utilized to perform block-based fetching of the segment files.

@andrross can chime in if I have missed anything on the API front.

@kotwanikunal added the discuss label Jul 28, 2022

@nir-logzio

I'm a bit confused by the terminology - block storage vs. object storage. Today snapshots are mainly stored on object storage. Where are the details about implementing block storage over an object storage?

@andrross
Member

andrross commented Aug 16, 2022

@nir-logzio The term "block" is being used a bit loosely here. The high level idea is that when Lucene needs to read a part of a segment file, instead of downloading the entire file onto the local disk cache, this will download only the necessary parts of the file and store them as logical "blocks" on the local disk. The terminology is perhaps a bit confusing, but this doesn't mean that we'll be using a remote block store (e.g. GCP Persistent Disk, AWS EBS, etc). The remote storage remains the object stores (all supported repository implementations).

@reta
Collaborator

reta commented Aug 17, 2022

Good one @kotwanikunal, super minor comment / suggestion to go from BlockedIndexInput to BlockIndexInput: the "blocked" prefix is a bit confusing in this context (at least to me, but that is subjective for sure). It also fits well with block device, block storage ...

From an implementation perspective, does it make sense to implement such an index reader using Lucene's BufferedIndexInput / SlicedIndexInput? It seems like "block" may fit well into the "slice" in this case, just throwing an idea out there ...

@kotwanikunal
Member Author

Thanks @reta! Sure. Matching the conventions sounds like a good plan.

I will look into the SlicedIndexInput implementation to check if there is any pre-existing logic that can be wired in. We do have a POC implementation out here (still a WIP and not production ready) - andrross#101 for the design described above.
It would be great to have any feedback or inputs.

@kotwanikunal
Member Author

Implemented as a part of #4892

Projects
Status: Done

5 participants