-
Notifications
You must be signed in to change notification settings - Fork 1.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Feature Proposal] Add Remote Storage Options for Improved Durability #1968
Comments
Can we change the title to Let's also change "tradeoffs" to "considerations". This isn't a design document so we need to make sure we aren't capturing the technical tradeoffs here without communicating the technical design and concrete implementation (which will illuminate the tradeoffs upon further prototyping and concrete benchmarking). |
While it does improve the mean time to recovery when then there are no backing in-sync replicas, it isn't directly improving availability since(for now) we still need sufficient replica copies on disk for high availability. |
Will be able to choose cloud provider on the basis of indexes? |
Not cloud provider but the remote store. Proposal is to integrate with a remote storage to store the indexed data. There are no assumptions on how the remote store is hosted. For example, if the remote store is HDFS, it could either be self hosted or a managed service. |
In Transactional DBs the first entry goes to WAL(Write Ahead Log) and that guarantees that the data won't be lost if it is written to WAL. But here, we are first indexing and then writing to WAL, so there is still a chance of data lost between indexing and WAL append. |
Won't this storing data in remote storage (fastest being some streaming service) itself be a major latency bottleneck ? |
In OpensSearch all operations are written to the translog after being processed by the internal Lucene index to avoid any unexpected document failure upfront. The guarantee with remote translog store is no acknowledged writes would ever be lost
No this probably is not a major bottleneck since only one indexing thread would be at any point writing translog to the remote store, while other threads would still be processing indexing requests. While per request latencies might increase slightly, the throughput overall shouldn't have a major impact. I recommend you read through #1319 for more details. |
It improves the availability story as new replicas can be spun up w/ an rsync of segments on remote storage. Besides, remote storage is used as a mechanism in the segment replication implementation which is not durability so we need to loosely couple the durability story from this issue. |
Agree but we should not see remote storage as the option to increase availability. Even though overall availability may increase by using remote store, we can't guarantee availability as it depends on the availability of configured remote store as well as the cluster configuration.
There are two use cases of remote storage.
The proposal in this doc revolves around the second use case. It is independent of the replication strategy used (Even with two different use cases, implementation details can be same) If we start storing committed and uncommitted data to the remote store, wouldn't that improve durability? What else is required to make durability guarantees? |
This is a good proposal 👍 I understand the limitations of the existing system (translog, then Lucene), but if we're talking potential redesign I would at least try to step back and think bigger. The three features I think I'd really enjoy are 1) use S3 as my only store, 2) the ability to have multiple nodes use the same store, and 3) the ability to specify durability requirements per write. Longer version: I like that we are thinking about durable storage replacing the need for replicas. In that spirit, I'd want a consistent way of configuring durability intent in both global configuration (I want all writes to be committed to the majority), and writes ala MongoDB (e.g. I want write operation A to be a transaction and return only after it was committed, while another write return immediately, written optimistically, and accept eventual loss) regardless of what storage I use. Ideally I'd want the current local store and replica implementation to become just the default storage with certain promises and capabilities, thus "replica" is not "another feature", but a "feature of (durable) storage". If you do the above, then data does not need to be written to local disk, but local disk may be the default storage, or an optimization of certain durable storage (given that S3 is able to sustain hundreds of MB/s read throughput I wouldn't be so sure it's always an optimization). Ideally, I'd like to be able to run nodes without disk, and to be able to run multiple nodes on top of the same remote durable storage. All that said, I wouldn't want to feature creep this proposal, and wouldn't hold anything back as designed because it's a step forward in the right direction. |
I like the idea of this proposal and how it might lead us, as @dblock mentioned, to be able to query data directly from S3 (with caching on local disk if needed). On the other hand, when talking about only committed data being uploaded to Remote storage - i.e. just relying on the durability of replica copies and their respective translog - it's a different story all together, since here you're uploading a much bigger unmodified data, which may allow the usage of parallel uploading of same file to S3 or compatible remote storage (chunks), which may allow reaching same throughput as without it or very close to it. Same thing here I guess - if you can't keep up with upload rate, it will forever increase until you're lagging too much behind, so rate will be lower, question is: by how much. Final question - are you aware of any similar systems using Remote Storage (i.e. S3) for WAL? |
@asafm Yes there are trade-offs and we are working on clearly documenting them for eg with remote store you don't need to replicate operations to replica for segment based replication and with document based replication where a high search availability isn't needed(you can achieve high throughput with just replicating to remote store) With remote store we need to ensure it meets the availability, consistency and performance characteristics having said that today's shard local copies can lag as well(network backed disk IO etc) and hence we have a back pressure mechanism built. We will optimize cases as needed(batching, streaming etc) once we have a baseline performance. "Premature optimization is the root of all evil" :) S3 is not the only option for WAL, it could for instance have support for streaming services like Kafka |
I mean other databases that use Remote Storage for WAL |
Hi @llermaly , we keep only one copy of data in remote store, so there is no effect on using 0 or 1 replica from this feature's point of view.
Yes
No
You need to set replicas based on availability requirement. Durability will be taken care by this feature. |
@rramachand21 @sachinpkale are you still tracking this for 2.8? |
We moved this to 2.9.0 |
I have nominated and maintainers have agreed to invite Sachin Kale (@sachinpkale) to be a co-maintainer. Sachin has kindly accepted. Sachin has led the design and implementation of the remote backed storage feature in OpenSearch. This feature was introduced as experimental in OpenSearch 2.3 and is planned for general availability in 2.9. Some significant issues and PRs authored by Sachin for this effort are as follows: Feature proposal: opensearch-project#1968 Upload segments to remote store post refresh: opensearch-project#3460 Add rest endpoint for remote store restore: opensearch-project#3576 Add RemoteSegmentStoreDirectory to interact with remote segment store: opensearch-project#4020 In total, Sachin has authored 57 PRs going back to May 2022. He also frequently reviews contributions from others and has reviewed nearly 100 PRs in the same time frame. Signed-off-by: Andrew Ross <[email protected]>
I have nominated and maintainers have agreed to invite Sachin Kale (@sachinpkale) to be a co-maintainer. Sachin has kindly accepted. Sachin has led the design and implementation of the remote backed storage feature in OpenSearch. This feature was introduced as experimental in OpenSearch 2.3 and is planned for general availability in 2.9. Some significant issues and PRs authored by Sachin for this effort are as follows: Feature proposal: opensearch-project#1968 Upload segments to remote store post refresh: opensearch-project#3460 Add rest endpoint for remote store restore: opensearch-project#3576 Add RemoteSegmentStoreDirectory to interact with remote segment store: opensearch-project#4020 In total, Sachin has authored 57 PRs going back to May 2022. He also frequently reviews contributions from others and has reviewed nearly 100 PRs in the same time frame. Signed-off-by: Andrew Ross <[email protected]>
I have nominated and maintainers have agreed to invite Sachin Kale (@sachinpkale) to be a co-maintainer. Sachin has kindly accepted. Sachin has led the design and implementation of the remote backed storage feature in OpenSearch. This feature was introduced as experimental in OpenSearch 2.3 and is planned for general availability in 2.9. Some significant issues and PRs authored by Sachin for this effort are as follows: Feature proposal: opensearch-project#1968 Upload segments to remote store post refresh: opensearch-project#3460 Add rest endpoint for remote store restore: opensearch-project#3576 Add RemoteSegmentStoreDirectory to interact with remote segment store: opensearch-project#4020 In total, Sachin has authored 57 PRs going back to May 2022. He also frequently reviews contributions from others and has reviewed nearly 100 PRs in the same time frame. Signed-off-by: Andrew Ross <[email protected]>
I have nominated and maintainers have agreed to invite Sachin Kale (@sachinpkale) to be a co-maintainer. Sachin has kindly accepted. Sachin has led the design and implementation of the remote backed storage feature in OpenSearch. This feature was introduced as experimental in OpenSearch 2.3 and is planned for general availability in 2.9. Some significant issues and PRs authored by Sachin for this effort are as follows: Feature proposal: #1968 Upload segments to remote store post refresh: #3460 Add rest endpoint for remote store restore: #3576 Add RemoteSegmentStoreDirectory to interact with remote segment store: #4020 In total, Sachin has authored 57 PRs going back to May 2022. He also frequently reviews contributions from others and has reviewed nearly 100 PRs in the same time frame. Signed-off-by: Andrew Ross <[email protected]>
I have nominated and maintainers have agreed to invite Sachin Kale (@sachinpkale) to be a co-maintainer. Sachin has kindly accepted. Sachin has led the design and implementation of the remote backed storage feature in OpenSearch. This feature was introduced as experimental in OpenSearch 2.3 and is planned for general availability in 2.9. Some significant issues and PRs authored by Sachin for this effort are as follows: Feature proposal: opensearch-project#1968 Upload segments to remote store post refresh: opensearch-project#3460 Add rest endpoint for remote store restore: opensearch-project#3576 Add RemoteSegmentStoreDirectory to interact with remote segment store: opensearch-project#4020 In total, Sachin has authored 57 PRs going back to May 2022. He also frequently reviews contributions from others and has reviewed nearly 100 PRs in the same time frame. Signed-off-by: Andrew Ross <[email protected]>
I have nominated and maintainers have agreed to invite Sachin Kale (@sachinpkale) to be a co-maintainer. Sachin has kindly accepted. Sachin has led the design and implementation of the remote backed storage feature in OpenSearch. This feature was introduced as experimental in OpenSearch 2.3 and is planned for general availability in 2.9. Some significant issues and PRs authored by Sachin for this effort are as follows: Feature proposal: #1968 Upload segments to remote store post refresh: #3460 Add rest endpoint for remote store restore: #3576 Add RemoteSegmentStoreDirectory to interact with remote segment store: #4020 In total, Sachin has authored 57 PRs going back to May 2022. He also frequently reviews contributions from others and has reviewed nearly 100 PRs in the same time frame. Signed-off-by: Andrew Ross <[email protected]>
opensearch-project#8047) I have nominated and maintainers have agreed to invite Sachin Kale (@sachinpkale) to be a co-maintainer. Sachin has kindly accepted. Sachin has led the design and implementation of the remote backed storage feature in OpenSearch. This feature was introduced as experimental in OpenSearch 2.3 and is planned for general availability in 2.9. Some significant issues and PRs authored by Sachin for this effort are as follows: Feature proposal: opensearch-project#1968 Upload segments to remote store post refresh: opensearch-project#3460 Add rest endpoint for remote store restore: opensearch-project#3576 Add RemoteSegmentStoreDirectory to interact with remote segment store: opensearch-project#4020 In total, Sachin has authored 57 PRs going back to May 2022. He also frequently reviews contributions from others and has reviewed nearly 100 PRs in the same time frame. Signed-off-by: Andrew Ross <[email protected]>
I have nominated and maintainers have agreed to invite Sachin Kale (@sachinpkale) to be a co-maintainer. Sachin has kindly accepted. Sachin has led the design and implementation of the remote backed storage feature in OpenSearch. This feature was introduced as experimental in OpenSearch 2.3 and is planned for general availability in 2.9. Some significant issues and PRs authored by Sachin for this effort are as follows: Feature proposal: opensearch-project#1968 Upload segments to remote store post refresh: opensearch-project#3460 Add rest endpoint for remote store restore: opensearch-project#3576 Add RemoteSegmentStoreDirectory to interact with remote segment store: opensearch-project#4020 In total, Sachin has authored 57 PRs going back to May 2022. He also frequently reviews contributions from others and has reviewed nearly 100 PRs in the same time frame. Signed-off-by: Andrew Ross <[email protected]> Signed-off-by: Rishab Nahata <[email protected]>
@anasalkouz Should this be moved to 2.11? |
This is tracking green for 2.10. |
Closing it, since it's already released on 2.10. |
I have nominated and maintainers have agreed to invite Sachin Kale (@sachinpkale) to be a co-maintainer. Sachin has kindly accepted. Sachin has led the design and implementation of the remote backed storage feature in OpenSearch. This feature was introduced as experimental in OpenSearch 2.3 and is planned for general availability in 2.9. Some significant issues and PRs authored by Sachin for this effort are as follows: Feature proposal: opensearch-project#1968 Upload segments to remote store post refresh: opensearch-project#3460 Add rest endpoint for remote store restore: opensearch-project#3576 Add RemoteSegmentStoreDirectory to interact with remote segment store: opensearch-project#4020 In total, Sachin has authored 57 PRs going back to May 2022. He also frequently reviews contributions from others and has reviewed nearly 100 PRs in the same time frame. Signed-off-by: Andrew Ross <[email protected]> Signed-off-by: Shivansh Arora <[email protected]>
Feature Proposal : Add Remote Storage Options for Improved Durability
Overview
OpenSearch is the search and analytics suite powering popular use cases such as application search, log analytics, and more. While OpenSearch is part of critical business and application workflows, it is seldom used as a primary data store because there are no strong guarantees on data durability as the cluster is susceptible to data loss in case of hardware failure. In this document, we propose providing strong durability guarantees in OpenSearch using remote storage.
Challenges
Today, data indexed in OpenSearch clusters is stored on a local disk. To achieve durability, that is, to ensure that committed operations are not lost on hardware failure, users either back segments up using snapshots or add more replica copies. However, snapshots do not fully guarantee durability: data indexed since the last snapshot can be lost in case of failure. On the other hand, adding replicas is cost prohibitive. We propose a robust durability mechanism.
Proposed Solution
In addition to storing data on a local disk, we propose providing an option to store indexed data in a remote durable data store. These new APIs will enable building integrations with remote durable storage options like Amazon S3, OCI Object Storage, Azure Blob Storage and GCP Cloud Storage. This will give users the flexibility to choose the storage option that best fits their requirements.
Indexed data is of two types: committed or un-committed. After the indexing operation is successful, and before OpenSearch invokes commit, data is stored in translog. After commit, data becomes part of a segment on disk. To achieve the durability of indexed data, both committed and un-committed data need to be saved durably. We propose storing uncommitted data in a remote translog store and committed data in a remote segment store. This will pave the way to support PITR (Point In Time Recovery) in the future.
Remote Translog Storage
The translog (Transaction Log) contains data that is indexed successfully but yet to be committed. Each successful indexing operation makes an entry to the translog, which is a write-ahead transaction log. With remote translog storage, we keep a translog copy on a remote store as well. OpenSearch invokes a “flush” operation to perform Lucene commit. This process starts a new translog. We need to checkpoint the remote store with the committed changes in order to keep data consistent with primary. Feature Proposal - Pluggable Translog
Remote Segment Storage
Segments represent committed indexed data. Using remote segment storage, all the newly created segments in primary will be stored to the configured remote store. New segment is created either by OpenSearch “flush” operation or a “refresh” operation that creates segments on the page cache or by merging the already created segments. Segment deletions on primary will mark remote segments for deletion. These marked segments will be kept for certain configurable period.
While remote storage enables durability, there are tradeoffs for both remote translog store and remote segment store.
Considerations
Durability
Durability of indexed data depends on the durability of the configured remote store. The user can choose a remote store based on their durability requirements.
Availability
Data must be consistent between the remote and primary stores. Write availability of indexed data depends on the availability of the data node and the remote translog store. OpenSearch writes will fail if the data node or remote translog store is unavailable. Read operations can still be supported. However, in case of a failover, data availability for read will depend on the availability of remote store. The user can choose a remote data store based on their availability requirements.
Performance
We need a synchronous remote translog store call during an indexing operation to guarantee that all operations have been durably persisted to the store. This affects the average response time of the write API, however there shouldn’t be any significant throughput degradation due to this which will be covered with more details in the design doc. But as indexed data becomes durable, if extra availability provided by the replicas is not required, users can opt out of replica. This will remove the sync network call from primary to replica which will decrease the average response time of the write API. The overhead of the network call to remote store depends on the performance of the remote store.
We will have one more network call to the remote segment store when segments are created. As segment creation happens in the background, there won’t be any direct impact on the performance of search and indexing. But there are dedicated APIs to create segments (
flush, refresh
). Response time of these APIs will increase because of the extra network call to the remote store.Consistency
Data in the remote segment store will be eventually consistent with the data on the local disk. It will not impact data visibility for read operations as queries will be served from the local disk.
Cost
Any change in cost will be determined by two factors: configured remote stores and current cluster configuration. The change in cost of an OpenSearch cluster depends on different factors like change in the number of replica nodes and remote stores used for translog as well as segment.
As an illustration, consider a user who has configured replicas only for durability and who uses snapshots for periodic backup. Replacing replicas with remote storage will lower cost. The cost of storing remote segments will remain the same because the store used for snapshots be used for storing remote segments. However, the additional storage cost for the remote translog will lead to an increase in cost. But since durable storage options like Amazon S3 are not expensive, the increase in cost is minimal.
Recovery
In the current OpenSearch implementation for peer recovery, data is copied from the primary to replicas, which consumes system resources (disk, network) of the primary. By using remote store to copy the indexed data to replicas, the scalability of primary will be increased, since data sync operations will be performed on the replica; but the overall recovery time will also increase.
Next steps
Based on the feedback from this Feature Proposal, we’ll be drafting a design (link to be added). We look forward to collaborating with the community.
FAQs
1. With remote storage option, will the indexed data be stored to local disk as well?
Yes, we will continue to write data to local disk as well. This is to ensure that search performance is not impacted. Data in the remote store will be used for restore and recovery operations.
2. What would be the durability SLA?
Durability SLA would be same as that of configured remote store. If we use 2 different remote stores for translog and segment, durability of OpenSearch will be defined as minimum of these two stores.
3. Is it possible to use one remote data store for translog and segments?
Yes it is possible. While translog is an append only log, segment is an immutable file. Write throughput to translog is same as indexing throughput whereas segment creation is a background process and not affected by the indexing throughput. Due to these differences, we will have different set of APIs for translog and segment store. If one data store can be integrated with both set of APIs, it can be used as remote translog store as well as remote segment store.
4. Do I still need to use snapshots for backup and restore?
Architecture of snapshot and remote storage will be unified. If you configure remote store option, then using snapshot would be a no-op. But if durability provided by snapshot is as per your requirement, then you can disable the remote storage and continue using snapshots.
5. Can the remote data be used across different OpenSearch versions?
Version compatibility would be similar to what OpenSearch supports with snapshots.
Open Questions
The text was updated successfully, but these errors were encountered: