
Enhance CV and MV idempotency via deterministic ID generation #946

Open · wants to merge 10 commits into base: main

Conversation

@dai-chen dai-chen commented Nov 22, 2024

Description

This PR enhances the Covering Index (CV) and Materialized View (MV) write operations by introducing deterministic ID generation. Deterministic IDs uniquely identify each document in a CV/MV based on its data, enabling idempotency. This ensures consistent results during retries or restarts of index refreshes. By eliminating duplicates, this approach maintains data integrity and operational consistency, even in failure scenarios.

Algorithm

The ID generation logic follows the precedence below:

  1. User-Provided ID Expression
    • Enables users to define custom ID generation logic based on specific columns or expressions.
    • If empty, no ID column is generated, which is useful for testing or disabling the feature.
  2. Aggregated MV
    • If the MV query involves aggregation, IDs are generated using SHA-1 on the concatenated output columns.
    • SHA-1 is chosen to balance collision resistance, performance, and space efficiency compared to other options in Spark such as hash, xxhash64, md5, and sha2.
  3. Default Behavior
    • Otherwise, no ID column is generated and idempotency is not guaranteed.
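The precedence above can be sketched as follows. This is a minimal Python illustration, not the actual Scala/Spark implementation: in the real code the user-provided `id_expression` is a Spark SQL expression string (modeled here as a hypothetical callable), and the SHA-1 hash would be computed with Spark's `sha1`/`concat_ws` functions over the output columns.

```python
import hashlib

def generate_doc_id(row, id_expression=None, is_aggregated=False):
    """Illustrative precedence for deterministic document ID generation.

    row: dict of output column name -> value.
    id_expression: optional callable standing in for a user-defined ID
    expression; None/empty disables custom ID logic.
    """
    # 1. A user-provided ID expression takes precedence.
    if id_expression is not None:
        return id_expression(row)
    # 2. Aggregated MV: SHA-1 over the concatenated output columns.
    if is_aggregated:
        concatenated = "\0".join(str(v) for v in row.values())
        return hashlib.sha1(concatenated.encode("utf-8")).hexdigest()
    # 3. Default: no ID column; idempotency is not guaranteed.
    return None

# Identical rows always map to the same ID, so a retried write
# overwrites the existing document instead of duplicating it.
row = {"status": 200, "count": 42}
assert generate_doc_id(row, is_aggregated=True) == generate_doc_id(dict(row), is_aggregated=True)
```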

Side Effects

This deterministic document ID approach introduces the following impacts; further exploration is needed to determine whether a better solution is viable for long-term scalability and performance.

  1. Spark Computation Overhead
    • The SHA-1 hash and string concatenation operations add computational overhead on the Spark side.
  2. OpenSearch Ingestion Overhead
    • Each document write involves a document ID lookup to identify and deduplicate existing documents.
  3. OpenSearch Document Size
    • The SHA-1 hash used for document IDs (160 bits) consumes more space compared to OpenSearch's default UUID-based IDs (128 bits).
  4. Possibility of Collision
    • While the likelihood of collisions with SHA-1 is extremely low (e.g., negligible even at PB-scale data), a collision could result in data loss by overwriting an existing document.
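For context on point 4, a back-of-envelope birthday-bound estimate (my own illustration, not a figure from this PR) shows why SHA-1 collisions are negligible at any realistic document count:

```python
def collision_probability(n_docs):
    """Approximate birthday bound: P(collision) ~ n^2 / (2 * 2^160)
    for n uniformly random 160-bit IDs."""
    return n_docs ** 2 / (2 * 2 ** 160)

# Even with a trillion documents, the collision probability is
# astronomically small (on the order of 1e-25).
assert collision_probability(10 ** 12) < 1e-23
```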

Documentation

https://github.com/dai-chen/opensearch-spark/blob/support-covering-index-and-mv-idempotency-rework/docs/index.md#create-index-options

Related Issues

#88

Check List

  • Updated documentation (docs/ppl-lang/README.md)
  • Implemented unit tests
  • Implemented tests for combination with other commands
  • New added source code should include a copyright header
  • Commits are signed per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@dai-chen dai-chen added enhancement New feature or request 0.7 labels Nov 22, 2024
@dai-chen dai-chen self-assigned this Nov 22, 2024
@dai-chen dai-chen marked this pull request as ready for review November 25, 2024 16:53

@noCharger noCharger left a comment


LGTM. A couple of points to emphasize in the documentation:

  1. The id_expression generates a unique _id to be used during OpenSearch write operations with upsert semantics, guaranteeing idempotency.
  2. It would be beneficial to provide more examples about the common and available id_expression options supported.


dai-chen commented Dec 6, 2024

> LGTM. A couple of points to emphasize in the documentation:
>
>   1. The id_expression generates a unique _id to be used during OpenSearch write operations with upsert semantics, guaranteeing idempotency.
>   2. It would be beneficial to provide more examples about the common and available id_expression options supported.

Sure, I will update the documentation. Meanwhile, I will hold off merging this PR until the follow-up PR for testing the dashboard query. Thanks!

@dai-chen

I'm testing with AOSS and found that time-series collections do not seem to support create/index requests with a document ID specified. If this is true, the approach in this PR, as well as the Flint skipping index (both rely on document IDs), won't work. I will double-check.


dai-chen commented Dec 20, 2024

Using an AOSS time series collection, I confirmed that the following exception is thrown when attempting to create or update a document with an explicit ID:

24/12/19 23:58:46 WARN TaskSetManager: Lost task 39.0 in stage 3.0 (TID 3447) ([2600:1f14:38a0:a801:1a95:8524:74ed:54a8] executor 4): java.lang.RuntimeException: failure in bulk execution:
[0]: index [flint_glue_default_mv_idempotent_test_1], id [51fd2d0929ebb39c79ee9492ba064f654e56709c], message [OpenSearchException[OpenSearch exception [type=illegal_argument_exception, reason=Document ID is not supported in create/index operation request]]]
[1]: index [flint_glue_default_mv_idempotent_test_1], id [53b97544fb9831e0714a12ca243930539af06c3d], message [OpenSearchException[OpenSearch exception [type=illegal_argument_exception, reason=Document ID is not supported in create/index operation request]]]
[2]: index [flint_glue_default_mv_idempotent_test_1], id [5896fb03aa01e3962215b19eddefa07a8764fc55], message [OpenSearchException[OpenSearch exception [type=illegal_argument_exception, reason=Document ID is not supported in create/index operation request]]]
[3]: index [flint_glue_default_mv_idempotent_test_1], id [1ec78f9c430d1e7dfc682b76b0f8e27eb7e3c32d], message [OpenSearchException[OpenSearch exception [type=illegal_argument_exception, reason=Document ID is not supported in create/index operation request]]]
[4]: index [flint_glue_default_mv_idempotent_test_1], id [fcd5f0c77810f6382ff00ec942e7f1fcb529d61f], message [OpenSearchException[OpenSearch exception [type=illegal_argument_exception, reason=Document ID is not supported in create/index operation request]]]
	at org.opensearch.flint.core.storage.OpenSearchWriter.flush(OpenSearchWriter.java:64)
	at shaded.flint.com.fasterxml.jackson.core.json.WriterBasedJsonGenerator.flush(WriterBasedJsonGenerator.java:983)
	at org.apache.spark.sql.flint.json.FlintJacksonGenerator.flush(FlintJacksonGenerator.scala:257)
	at org.apache.spark.sql.flint.FlintPartitionWriter.write(FlintPartitionWriter.scala:64)
	at org.apache.spark.sql.flint.FlintPartitionWriter.write(FlintPartitionWriter.scala:24)
	at org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTask$.write(WriteToDataSourceV2Exec.scala:493)
	at org.apache.spark.sql.execution.datasources.v2.WritingSparkTask.$anonfun$run$1(WriteToDataSourceV2Exec.scala:448)
	at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1410)
	at org.apache.spark.sql.execution.datasources.v2.WritingSparkTask.run(WriteToDataSourceV2Exec.scala:486)
	at org.apache.spark.sql.execution.datasources.v2.WritingSparkTask.run$(WriteToDataSourceV2Exec.scala:425)
	at org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTask$.run(WriteToDataSourceV2Exec.scala:491)
	at org.apache.spark.sql.execution.datasources.v2.V2TableWriteExec.$anonfun$writeWithV2$2(WriteToDataSourceV2Exec.scala:388)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:93)
	at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:174)
	at org.apache.spark.scheduler.Task.run(Task.scala:152)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$4(Executor.scala:632)
	at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally(SparkErrorUtils.scala:64)
	at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally$(SparkErrorUtils.scala:61)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:96)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:635)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
	at java.base/java.lang.Thread.run(Thread.java:840)

The implications of these findings are as follows:

  1. Flint's skipping index creation will fail because it automatically uses the source file path as the document ID.
  2. The deduplication approach proposed for APPEND mode in this PR will not function as intended.
  3. A similar deduplication approach (upsert with document ID) for future UPDATE mode support will also be ineffective.

This PR remains functional with both AOS domains and AOSS search collections. However, since I couldn't find a way to determine the type of an AOSS collection, I will temporarily remove the auto-generated ID column logic for aggregated MVs. A separate PR will be submitted once an alternative approach is identified.
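To make the failure mode concrete, here is a minimal Python sketch (my own illustration, not the Flint OpenSearchWriter code) of how a bulk request carries an explicit deterministic _id in its action metadata. It is this "_id" field in the action line that AOSS time-series collections reject with the illegal_argument_exception shown above.

```python
import hashlib
import json

def bulk_index_lines(index, docs):
    """Build bulk-API NDJSON: one action line with explicit _id per document,
    followed by the document source line."""
    lines = []
    for doc in docs:
        # Deterministic ID: SHA-1 over the concatenated field values.
        doc_id = hashlib.sha1(
            "\0".join(str(v) for v in doc.values()).encode("utf-8")
        ).hexdigest()
        # AOSS time-series collections reject the "_id" key below.
        lines.append(json.dumps({"index": {"_index": index, "_id": doc_id}}))
        lines.append(json.dumps(doc))
    return "\n".join(lines) + "\n"
```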

@dai-chen dai-chen marked this pull request as draft December 20, 2024 18:07
@dai-chen dai-chen force-pushed the support-covering-index-and-mv-idempotency-rework branch from 796b45c to 15ed31b Compare December 20, 2024 21:42
@dai-chen dai-chen marked this pull request as ready for review December 20, 2024 23:21
Labels: 0.7, enhancement
2 participants