Exactly-once guarantee for covering index and MV incremental refresh #143

@dai-chen dai-chen commented Nov 9, 2023

$${\color{red} DON'T \space MERGE \space DUE \space TO \space IMPACT \space ON \space INTEGRATION \space FEATURE }$$

Description

As proposed in the issue below, generate the ID column in FlintSparkIndex.generateIdColumn() (a sketch of this decision flow follows the list):

  1. Add a new id_expression option to the create index statement: when provided, this expression is used as the ID column
  2. If not provided and the query is aggregated (applies to MV), generate the ID from all result columns (concat_ws and sha1)
  3. If the refresh is manual, or auto refresh without a checkpoint location, continue without an ID column
  4. Otherwise, fail and ask the user to provide an ID expression
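
A minimal sketch of this decision flow in Scala (illustrative only: the method signature, flag names, and option handling here are assumptions, not the actual FlintSparkIndex implementation):

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{concat_ws, expr, sha1}

// Sketch of the four cases above; real option parsing lives in the index options.
def generateIdColumn(
    df: DataFrame,
    idExpression: Option[String],        // case 1: user-provided id_expression
    isAggregated: Boolean,               // case 2: MV with aggregation
    autoRefreshWithCheckpoint: Boolean   // cases 3 and 4
): DataFrame =
  idExpression match {
    case Some(idExpr) =>
      df.withColumn("__id__", expr(idExpr))                            // case 1
    case None if isAggregated =>
      val allCols = df.columns.map(df.col)
      df.withColumn("__id__", sha1(concat_ws("\u0000", allCols: _*)))  // case 2
    case None if !autoRefreshWithCheckpoint =>
      df                                                               // case 3: no ID column
    case None =>
      throw new IllegalStateException(                                 // case 4
        "ID expression is required to avoid duplicate data when index refresh job restart")
  }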

Documentation

Documentation: https://github.com/dai-chen/opensearch-spark/blob/support-covering-index-and-mv-idempotency/docs/index.md#create-index-options

Algorithm

Specifically, the approach in case 2 is to first concat the columns using the ASCII NUL character as the separator (distinct from an empty string or whitespace) and then pass the result to SHA1:

spark-sql> SELECT concat_ws('\0', 'hello', 123, 4.56, true, TIMESTAMP '2023-11-01 10:01:00', null, DATE '2023-11-02', array(7, 8));
hello1234.56true2023-11-01 10:01:002023-11-0278

spark-sql> SELECT sha1(concat_ws('\0', 'hello', 123, 4.56, true, TIMESTAMP '2023-11-01 10:01:00', null, DATE '2023-11-02', array(7, 8)));
ce785d2ecbd1039c5028dc3478b2f5e579f3a880

spark-sql> SELECT sha1(concat_ws(' ', 'hello', 123, 4.56, true, TIMESTAMP '2023-11-01 10:01:00', null, DATE '2023-11-02', array(7, 8)));
2e44ab3226a7d8cf0f4190489d62ea34b5628d9f

spark-sql> SELECT sha1(concat_ws('', 'hello', 123, 4.56, true, TIMESTAMP '2023-11-01 10:01:00', null, DATE '2023-11-02', array(7, 8)));
06aed815b4654249db4994946c5133e372ad42ab

Testing

Case 1: Use the ID expression provided in the index options:

scala> spark.sql("""
CREATE INDEX clientip_and_status ON ds_tables.http_logs
(clientip, status)
WITH (
  auto_refresh = true,
  id_expression = 'uuid()'
)
""")

23/11/10 19:58:23 INFO FlintSparkIndex: Generated ID column based on expression Some(uuid())
23/11/10 19:58:23 INFO FlintSparkCoveringIndex: Building covering index by == Physical Plan ==
*(1) Project [clientip#86, status#88, uuid(Some(4863749536212552132)) AS __id__#101]
+- *(1) Scan ExistingRDD[@timestamp#85,clientip#86,request#87,status#88,size#89,year#90,month#91,day#92]

      {
        "_index": "flint_myglue_ds_tables_http_logs_clientip_and_status_index",
        "_id": "aae1a580-ee85-4b9b-b2cd-8f2b80255d9d",
        "_score": 1,
        "_source": {
          "clientip": "138.64.16.0",
          "status": 304
        }
      }

Case 2: Generate the ID column from all output columns for an MV with aggregation:

spark.sql("""
CREATE MATERIALIZED VIEW http_logs_metrics
AS
SELECT
  window.start AS startTime,
  COUNT(*) AS count
FROM ds_tables.http_logs
WHERE year = 1998 AND month = 6 AND day = 11 
  AND status BETWEEN 400 AND 599
GROUP BY TUMBLE(`@timestamp`, '1 Hour')
WITH (
  auto_refresh = true,
  checkpoint_location = "s3://checkpoints/",
  watermark_delay = '1 Minute',
  extra_options = '{"myglue.ds_tables.http_logs": {"maxFilesPerTrigger": "10"}}'
)
""")

23/11/10 22:54:44 INFO FlintSparkIndex:
  Generated ID column based on expression Some(sha1(concat_ws(, startTime, count)))
(The seemingly empty first argument to concat_ws in the log is the non-printable ASCII NUL separator.)

    "hits": [
            {
        "_index": "flint_myglue_default_http_logs_metrics",
        "_id": "3aad6637a22adc431b032c826df367fea1ed424c",
        "_score": 1,
        "_source": {
          "startTime": "1998-06-11T12:00:00.000000+0000",
          "count": 2
        }
      },

Case 3: An ID expression is not mandatory for manual refresh, or for auto refresh without a checkpoint location:

scala> spark.sql("CREATE INDEX clientip_and_status ON ds_tables.http_logs (clientip, status)")
scala> spark.sql("REFRESH INDEX clientip_and_status ON ds_tables.http_logs")

23/11/10 19:55:35 INFO FlintSparkIndex: Generated ID column based on expression None

    "hits": [
      {
        "_index": "flint_myglue_ds_tables_http_logs_clientip_and_status_index",
        "_id": "S4XOuosBZG4KSy0OKmxu",
        "_score": 1,
        "_source": {
          "clientip": "212.167.12.0",
          "status": 304
        }

Case 4: Otherwise, throw an exception if none of the cases above applies:

spark.sql("""
CREATE INDEX clientip_and_status ON ds_tables.http_logs
(clientip, status)
WITH (
  auto_refresh = true,
  checkpoint_location = "s3://checkpoints"
)
""")

23/11/10 20:11:44 ERROR MicroBatchExecution:
Query flint_myglue_ds_tables_http_logs_clientip_and_status_index 
  [id = 3d20e1d2-dbc6-4c93-9811-cd94ae98d184, runId = 8872253a-bdb9-47a1-88e3-f2a8169192f5]
  terminated with error
 java.lang.IllegalStateException: 
  ID expression is required to avoid duplicate data when index refresh job restart 

Issues Resolved

#88

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@dai-chen dai-chen added the enhancement New feature or request label Nov 9, 2023
@dai-chen dai-chen self-assigned this Nov 9, 2023

dai-chen commented Dec 1, 2023

Will reopen if we still want to go with this implementation.

@dai-chen dai-chen closed this Dec 1, 2023