Enhance CV and MV idempotency via deterministic ID generation #946
LGTM. A couple of points to emphasize in the documentation:
- The `id_expression` generates a unique `_id` to be used during OpenSearch write operations with upsert semantics, guaranteeing idempotency.
- It would be beneficial to provide more examples of the common and available `id_expression` options supported.
Sure, will update the documentation. Meanwhile, I will block merging this PR until the follow-up PR for testing dashboard queries. Thanks!
I'm testing with AOSS and found that time-series collections do not seem to support create/index requests with a doc ID specified. If this is true, the approach in this PR as well as the Flint skipping index (both rely on doc IDs) won't work. Will double-check.
Using an AOSS time-series collection, I confirmed that the following exception is thrown when attempting to create or update a document with an ID:
The implications of these findings are as follows:
This PR remains functional with both AOS and AOSS search collections. However, since I couldn't find a way to determine the type of an AOSS collection, I will temporarily remove the auto-generated ID column logic for aggregate MVs. A separate PR will be submitted once an alternative approach is identified.
Signed-off-by: Chen Dai <[email protected]>
Description
This PR enhances the Covering Index (CV) and Materialized View (MV) write operations by introducing deterministic ID generation. Deterministic IDs uniquely identify each document in a CV/MV based on its data, enabling idempotency. This ensures consistent results during retries or restarts of index refreshes. By eliminating duplicates, this approach maintains data integrity and operational consistency, even in failure scenarios.
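The idea can be illustrated with a minimal sketch, not the PR's actual implementation. It assumes an ID derived by SHA-1 over the row's column values (as the Algorithm section describes for aggregated MVs) and uses a plain dictionary as a stand-in for an OpenSearch index with upsert semantics. The helper names `deterministic_id` and `upsert_batch`, the NUL separator, and the sample columns are all hypothetical.

```python
import hashlib


def deterministic_id(row, columns):
    """Return a SHA-1 hex digest over the row's values, taken in a fixed column order."""
    # Assumption: values are joined with a NUL separator and None becomes "";
    # the encoding used by the PR itself may differ.
    parts = ["" if row[c] is None else str(row[c]) for c in columns]
    return hashlib.sha1("\0".join(parts).encode("utf-8")).hexdigest()


def upsert_batch(index, rows, columns):
    """Write rows keyed by their deterministic ID; rewriting a row overwrites it."""
    for row in rows:
        index[deterministic_id(row, columns)] = row


columns = ["window_start", "status", "count"]
batch = [
    {"window_start": "2024-01-01T00:00:00", "status": "200", "count": 42},
    {"window_start": "2024-01-01T00:00:00", "status": "500", "count": 3},
]

index = {}  # stand-in for an OpenSearch index: document ID -> document
upsert_batch(index, batch, columns)
upsert_batch(index, batch, columns)  # simulated retry of the same batch
print(len(index))  # 2: the retry overwrote the same two documents
```

With randomly generated IDs, the simulated retry would have left four documents in the index; because the ID depends only on the row's data, the second write replaces the first.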
Algorithm
The ID generation logic follows the precedence below:
- Aggregated MV: If the MV query involves aggregation, IDs are generated using SHA-1 on the concatenated output columns. SHA-1 is chosen for balancing collision resistance, performance, and space efficiency, compared to other options in Spark such as `hash`, `xxhash64`, `md5` and `sha-2`.
Side Effects
This deterministic document ID approach introduces these impacts and requires further exploration to determine if a better solution is viable for long-term scalability and performance.
Documentation
https://github.com/dai-chen/opensearch-spark/blob/support-covering-index-and-mv-idempotency-rework/docs/index.md#create-index-options
Related Issues
#88
Check List
--signoff
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.