[Feature Request] Capability to remove _recovery_source per field #13490
Comments
Updated the flame graphs in the appendix section of the issue.
Thanks @navneet1v for sharing these insights.
Today, when the original source differs from the source being written (because specific fields are included or excluded), the recovery source is always written from the original source to ensure ops-based recovery. Won't this change diverge from the intent it was serving?
@mgodwan I have answered this question in the FAQ section of the issue above. For vector fields, we are working on a PR that will allow vector field values to be read from doc values and put in the _source response. This will ensure that recovery is possible after a crash. Also, given that _recovery_source gets deleted after a certain point in time, the recovery source was never a foolproof way to recover from crashes.
If I understand the feature correctly, I don't think this was ever the use case for it.
Is there a way I can check this? Any IT or anything else that can help me here?
Is your feature request related to a problem? Please describe
During indexing, all fields ingested into OpenSearch are stored in _source. If required, users can disable _source per field or completely for all fields. But when a user does this, _recovery_source gets added (ref) and is removed later on.
So overall, the whole payload is still written as a StoredField, which impacts indexing time. The impact is high if one of the fields is a vector field. In my experiments with a 768D 1M dataset, I see a 50% reduction in indexing latency at the p90 level when _source and _recovery_source are not stored.
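For context, here is a minimal sketch of the kind of index body this describes: a knn_vector field that is excluded from _source through the existing `_source.excludes` mapping option. The field name `my_vector` and the helper `build_index_body` are made up for illustration; the body is only built and printed, not sent to a cluster.

```python
import json

# Hypothetical field name, used for illustration only.
VECTOR_FIELD = "my_vector"


def build_index_body(dimension: int, exclude_vector_from_source: bool) -> dict:
    """Build a create-index request body with a single knn_vector field."""
    mappings = {
        "properties": {
            VECTOR_FIELD: {"type": "knn_vector", "dimension": dimension},
        }
    }
    if exclude_vector_from_source:
        # Drops the vector from the stored _source. As described in this issue,
        # _recovery_source is then written with the original document.
        mappings["_source"] = {"excludes": [VECTOR_FIELD]}
    return {"settings": {"index": {"knn": True}}, "mappings": mappings}


body = build_index_body(768, exclude_vector_from_source=True)
print(json.dumps(body, indent=2))
```

With this mapping the vector no longer appears in _source, but the full payload is still stored temporarily as _recovery_source, which is the cost this request is about.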
Benchmarking Results
Below are the benchmarking results. Refer to Appendix A for the flame graphs.
Cluster Configuration
Baseline Results
Removing _recovery_source and _source
POC code: navneet1v@6c5896a
Describe the solution you'd like
Just as _source lets us specify which fields are included/excluded, or lets us disable _source completely, I would like to have the same capability for _recovery_source. This will ensure that users can remove their fields from _recovery_source if required.
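To make the requested behavior concrete, here is a toy model in Python. It is a sketch under stated assumptions, not the POC code: `filter_fields` is a hypothetical helper that handles only top-level field names with simple wildcard patterns, and the idea that _recovery_source would accept the same excludes as _source is the proposal itself, not an existing OpenSearch option.

```python
from fnmatch import fnmatch


def filter_fields(doc: dict, includes=(), excludes=()) -> dict:
    """Simplified include/exclude filtering on top-level field names."""
    kept = {}
    for name, value in doc.items():
        # If includes are given, a field must match at least one of them.
        if includes and not any(fnmatch(name, p) for p in includes):
            continue
        # Any matching exclude pattern drops the field.
        if any(fnmatch(name, p) for p in excludes):
            continue
        kept[name] = value
    return kept


doc = {"title": "hello", "my_vector": [0.1, 0.2, 0.3]}

# Existing behavior: _source can already exclude the vector field ...
stored_source = filter_fields(doc, excludes=["my_vector"])
# ... but _recovery_source then keeps the full original document.
recovery_source_today = dict(doc)
# Proposed (hypothetical) behavior: the same excludes apply to _recovery_source.
recovery_source_proposed = filter_fields(doc, excludes=["my_vector"])

print(stored_source)
print(recovery_source_proposed)
```

In this model the large vector payload is written neither to _source nor to _recovery_source, which is the case benchmarked above.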
Related component
Indexing:Performance
Describe alternatives you've considered
There is currently no alternative way to disable _recovery_source for an index.
FAQ
Q1: If a user needs the vector field, how can they retrieve it? Also, recovery source and _source are used for other purposes such as update by query, disaster recovery, etc. How are we going to support that?
In the k-NN repo we are working on a PR which will ensure that we read the k-NN vector field from doc values. Ref: opensearch-project/k-NN#1571. For other fields that require it, such capabilities can be added in core incrementally if needed.
Appendix A
Flame graph when _source/_recovery_source is stored during indexing for vector fields.
Flame graph when _source and _recovery_source are not stored.