[Searchable Snapshots] [Low Level Design] Block Based Storage #4033

kotwanikunal · 2022-07-28T17:04:40Z

This document outlines the low level design proposal for implementing block-based storage. High level design document and proposal: #3869

Overview

A block-based file system will enable fetching parts of the Lucene IndexInput files from the snapshot in the repository instead of downloading each entire file to disk. Only the bytes accessed by the query are downloaded.

BlockedIndexInput
The solution implements a wrapper around the IndexInput class to manage the block calculation, fetching and seeking mechanisms. This wrapper will work as a virtual file which will be utilized by Lucene to read index files. Internally, it will keep track of the necessary blocks and the calculations required to fetch other blocks as the query needs them.
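To illustrate the virtual file idea, here is a minimal sketch that does not depend on Lucene. `BlockSource` and `VirtualFileReader` are hypothetical names introduced only for this example; the real BlockedIndexInput would extend Lucene's IndexInput, and its actual fields and signatures are not spelled out in this issue.

```java
import java.io.EOFException;
import java.io.IOException;

// Hypothetical stand-in for whatever supplies a block's bytes (local disk or repository).
interface BlockSource {
    byte[] block(int blockId) throws IOException;
}

// Sketch of a "virtual file": one logical file position mapped onto many physical blocks.
final class VirtualFileReader {
    private final BlockSource source;
    private final long length;
    private final int blockSizeShift;
    private final long blockMask;
    private long position;
    private int currentBlockId = -1;
    private byte[] currentBlock;

    VirtualFileReader(BlockSource source, long length, int blockSizeShift) {
        this.source = source;
        this.length = length;
        this.blockSizeShift = blockSizeShift;
        this.blockMask = (1L << blockSizeShift) - 1;
    }

    void seek(long pos) throws IOException {
        if (pos < 0 || pos > length) throw new EOFException("seek past end: " + pos);
        position = pos; // the block is resolved lazily on the next read
    }

    long getFilePointer() {
        return position;
    }

    byte readByte() throws IOException {
        if (position >= length) throw new EOFException("read past end: " + position);
        int blockId = (int) (position >>> blockSizeShift);
        if (blockId != currentBlockId) {
            // Only the block that contains the requested byte is loaded.
            currentBlock = source.block(blockId);
            currentBlockId = blockId;
        }
        int offset = (int) (position & blockMask);
        position++;
        return currentBlock[offset];
    }
}
```

In this sketch a seek is free and a block is requested only when a read crosses into it, which is the behavior the wrapper is meant to give Lucene.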

VirtualFileIndexInput
Additionally, another wrapper will be used to fetch the virtual file data (https://github.com/opensearch-project/OpenSearch/blob/main/server/src/main/java/org/opensearch/repositories/blobstore/BlobStoreRepository.java#L205), which consists of the metadata around the segment files for Lucene. This will work on a “fetch-when-read” basis to download the entire file onto disk when needed by the service.

The design has been broken down into two phases:

Phase 1: This phase will consist of adding the block-based fetching and virtual file mechanism without any cache. To avoid refetching blocks and files, there will be an on-disk check to verify whether the file has already been fetched. This naive implementation does not account for the disk cleanup of unused blocks and is solely focused on implementing the block-based fetch logic.
Phase 2: This phase will add the caching mechanism required to keep the necessary blocks on disk. The implementation will add a wrapper with a RefCount mechanism to keep track of the files which are in use and will also utilize an eviction mechanism to further optimize disk storage.

Low Level Design

Phase 1: Without Cache

As described above, phase 1 will only implement what is necessary to enable block-based fetching for segment files. In addition to the interface definitions, we will implement the block calculation and fetch logic within the BlockedIndexInput class.

The properties and methods like getBlock*, getCurrentBlock*, blockSize, blockMask, blockSizeShift will be utilized to enable the block calculation logic for the actual segment file described by fileInfo.
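As a rough sketch of the block calculation, assuming the block size is a power of two so that `blockSizeShift` and `blockMask` can replace division and modulo: the field names below mirror the properties listed above, while the `BlockMath` class and its method signatures are hypothetical and only for illustration.

```java
// Hypothetical helper; the real logic is expected to live inside BlockedIndexInput.
final class BlockMath {
    final int blockSizeShift; // log2 of the block size
    final int blockSize;      // 1 << blockSizeShift
    final int blockMask;      // blockSize - 1, selects the offset within a block

    BlockMath(int blockSizeShift) {
        if (blockSizeShift < 1 || blockSizeShift > 30) {
            throw new IllegalArgumentException("blockSizeShift must be in [1, 30]");
        }
        this.blockSizeShift = blockSizeShift;
        this.blockSize = 1 << blockSizeShift;
        this.blockMask = blockSize - 1;
    }

    // Index of the block that contains the given file position.
    int getBlock(long pos) {
        return (int) (pos >>> blockSizeShift);
    }

    // Offset of the given file position inside its block.
    int getBlockOffset(long pos) {
        return (int) (pos & blockMask);
    }

    // File position at which the given block starts.
    long getBlockStart(int blockId) {
        return (long) blockId << blockSizeShift;
    }

    // Number of bytes in the given block; the last block of a file may be shorter.
    long getBlockLength(int blockId, long fileLength) {
        return Math.min(blockSize, fileLength - getBlockStart(blockId));
    }
}
```

For example, with `blockSizeShift = 3` (8-byte blocks, far smaller than any realistic setting), position 21 falls in block 2 at offset 5, and for a 20-byte file that block holds only the final 4 bytes.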

The fetchBlock, downloadBlockAsync, and downloadTo methods will handle the fetching logic for the blocks as well as virtual files.
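The Phase 1 fetch path with its on-disk check could look roughly like the sketch below. `RangeReader` and `OnDiskBlockFetcher` are hypothetical names, the `<file>.<blockId>` naming scheme for block files is an assumption, and the download here is synchronous for brevity; only the method name `fetchBlock` comes from the design above.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical stand-in for a ranged read against the snapshot repository.
interface RangeReader {
    InputStream read(long position, long length) throws IOException;
}

final class OnDiskBlockFetcher {
    private final Path directory;
    private final String fileName;
    private final long fileLength;
    private final int blockSizeShift;
    private final RangeReader remote;

    OnDiskBlockFetcher(Path directory, String fileName, long fileLength,
                       int blockSizeShift, RangeReader remote) {
        this.directory = directory;
        this.fileName = fileName;
        this.fileLength = fileLength;
        this.blockSizeShift = blockSizeShift;
        this.remote = remote;
    }

    // Returns the local path of the block, downloading it only if it is not on disk yet.
    Path fetchBlock(int blockId) throws IOException {
        Path blockFile = directory.resolve(fileName + "." + blockId); // assumed naming scheme
        if (Files.exists(blockFile)) {
            return blockFile; // Phase 1: the on-disk check replaces a cache lookup
        }
        long start = (long) blockId << blockSizeShift;
        long length = Math.min(1L << blockSizeShift, fileLength - start);
        if (blockId < 0 || length <= 0) {
            throw new IllegalArgumentException("block out of range: " + blockId);
        }
        // Write to a temporary file first so a partial download is never mistaken for a block.
        Path tmp = Files.createTempFile(directory, fileName, ".part");
        try (InputStream in = remote.read(start, length)) {
            Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING);
        }
        Files.move(tmp, blockFile, StandardCopyOption.ATOMIC_MOVE);
        return blockFile;
    }
}
```

As noted for Phase 1, nothing in this sketch ever deletes a block file; cleanup is left to the cache in Phase 2.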

Phase 2: With Cache

The Phase 1 implementation will be extended to cache the blocks, so that blocks can be deleted on cache eviction to reduce storage needs.
FileCachedIndexInput will implement a RefCount mechanism to keep track of open handles as well as ensure the file is cached within a BlockCache/FileCache.
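One possible shape for the RefCount and eviction interaction is sketched below, assuming a simple least-recently-used policy and a byte-based capacity. `RefCountedBlockCache` and its methods are hypothetical; the actual BlockCache/FileCache interface and eviction policy are not defined in this issue.

```java
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical cache: entries with open handles (refCount > 0) are never evicted.
final class RefCountedBlockCache {
    private static final class Slot {
        final long size;
        int refCount;

        Slot(long size) {
            this.size = size;
        }
    }

    private final long capacityBytes;
    private long usedBytes;
    // Access-ordered map: iteration starts at the least recently used entry.
    private final LinkedHashMap<String, Slot> slots = new LinkedHashMap<>(16, 0.75f, true);

    RefCountedBlockCache(long capacityBytes) {
        this.capacityBytes = capacityBytes;
    }

    // Called when an input is opened or cloned: registers the block and pins it.
    synchronized void acquire(String key, long sizeBytes) {
        Slot slot = slots.get(key);
        if (slot == null) {
            slot = new Slot(sizeBytes);
            slots.put(key, slot);
            usedBytes += sizeBytes;
        }
        slot.refCount++;
        evictIfNeeded();
    }

    // Called when an input is closed: unpins the block so it becomes evictable.
    synchronized void release(String key) {
        Slot slot = slots.get(key);
        if (slot == null || slot.refCount == 0) {
            throw new IllegalStateException("release without acquire: " + key);
        }
        slot.refCount--;
        evictIfNeeded();
    }

    synchronized boolean contains(String key) {
        return slots.containsKey(key);
    }

    // Drops unreferenced entries, oldest first, until usage fits the capacity again.
    private void evictIfNeeded() {
        Iterator<Map.Entry<String, Slot>> it = slots.entrySet().iterator();
        while (usedBytes > capacityBytes && it.hasNext()) {
            Slot slot = it.next().getValue();
            if (slot.refCount == 0) {
                usedBytes -= slot.size;
                it.remove(); // a real implementation would also delete the file on disk
            }
        }
    }
}
```

In this sketch the cache may temporarily exceed its capacity while every entry is still referenced; whether the real cache should allow that or reject new entries is a design choice left open here.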

Directory

ReadOnlyDirectory will be implemented to block writes for the Store and will complement the ReadOnlyEngine for Searchable Snapshots. It will utilize the IndexInput classes described above to open the index for reads for a remote snapshot.

AmiStrn · 2022-07-28T18:45:25Z

Quality design! Thanks @kotwanikunal

What about the snapshots themselves? Today we can create a snapshot under any name, containing any subset of the indices from the cluster. And to make matters more complex, several snapshots could contain the same index. Will this feature require the snapshot to be named a certain way, or to contain only one index?

...
Answering myself here :) see proposal for API that describes creating snapshots with "storage_type": "remote_snapshot"
see [Searchable Snapshot] Propose API

kotwanikunal · 2022-07-28T19:42:05Z

Thanks @AmiStrn!
The mechanism that we are currently designing will restore the metadata for the indices within the snapshot when that property is passed in and will skip restoring the segment files.
It will use the same methodology as the current snapshot restore process - naming conventions/alias conventions and other restrictions.
Once the metadata restore is complete, the new file system will be utilized to perform block-based fetching of the segment files.

@andrross can chime in if I have missed anything on the API front.

nir-logzio · 2022-08-16T12:30:59Z

I'm a bit confused by the terminology - block storage vs. object storage. Today snapshots are mainly stored on object storage. Where are the details about implementing block storage over an object storage?

andrross · 2022-08-16T16:01:13Z

@nir-logzio The term "block" is being used a bit loosely here. The high level idea is that when Lucene needs to read a part of a segment file, instead of downloading the entire file onto the local disk cache, this will download only the necessary part of the file and store it as logical "blocks" on the local disk. The terminology is perhaps a bit confusing, but this doesn't mean that we'll be using a remote block store (e.g. GCP Persistent Disk, AWS EBS, etc). The remote storage remains the object stores (all supported repository implementations).

reta · 2022-08-17T18:28:41Z

Good one @kotwanikunal, super minor comment / suggestion to go from BlockedIndexInput to BlockIndexInput: the "blocked" prefix is a bit confusing in this context (at least to me, but that is subjective for sure). It also fits well with block device, block storage ...

From implementation perspective, does it make sense to implement such index reader using Lucene's BufferedIndexInput / SlicedIndexInput? It seems like "block" may fit well into the "slice" in this case, just throwing an idea out there ...

kotwanikunal · 2022-08-24T22:47:22Z

Thanks @reta! Sure. Matching the conventions sounds like a good plan.

I will look into the SlicedIndexInput implementation to check if there is any pre-existing logic that can be wired in. We do have a POC implementation out here (still a WIP and not production ready) - andrross#101 for the design described above.
It would be great to have any feedback or inputs.

kotwanikunal · 2022-10-28T21:51:34Z

Implemented as a part of #4892

kotwanikunal added the enhancement, untriaged, feature, and Indexing & Search labels and removed the untriaged label Jul 28, 2022

kotwanikunal mentioned this issue Jul 28, 2022

[Searchable Snapshot] Design and define the interfaces for file and directory structures for Remote Storage #3114

Closed

kotwanikunal added the discuss label Jul 28, 2022

anasalkouz added this to Searchable Snapshots Oct 14, 2022

kotwanikunal closed this as completed Oct 28, 2022

kotwanikunal moved this to Done in Searchable Snapshots Oct 28, 2022

kotwanikunal self-assigned this Nov 4, 2022

navneet1v mentioned this issue Aug 12, 2024

[FEATURE] Use IndexInput to load the graph files for Native Index opensearch-project/k-NN#1951

Closed
