
Add covering index based query rewriter rule #318

Merged

Conversation

@dai-chen (Collaborator) commented Apr 23, 2024

Description

This PR introduces query rewriting based on covering indexes: applicable SELECT queries are transparently rewritten to scan a covering index instead of the source table, enabling faster data retrieval.
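For orientation, here is a minimal sketch, assuming a standard Catalyst `Rule[LogicalPlan]`, of how a covering-index rewrite can be expressed. This is not this PR's actual implementation; `findCoveringIndex` and the rule name are hypothetical placeholders:

```scala
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule
import org.apache.spark.sql.execution.datasources.LogicalRelation

object CoveringIndexRewriteSketch extends Rule[LogicalPlan] {

  // Hypothetical lookup: return a relation scanning a covering index for
  // the given table scan, if such an index exists and is enabled.
  private def findCoveringIndex(relation: LogicalRelation): Option[LogicalRelation] =
    None

  override def apply(plan: LogicalPlan): LogicalPlan = plan transform {
    case relation: LogicalRelation =>
      findCoveringIndex(relation) match {
        // Substitute the index scan only when it exposes every column the
        // table scan produces, so parent operators still resolve. A real
        // rule must also remap attribute exprIds to the new relation.
        case Some(index)
            if relation.output.map(_.name).forall(index.output.map(_.name).contains) =>
          index
        case _ => relation
      }
  }
}
```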

PR Planned

  • Add covering index based query rewriter rule #318
  • Support query rewrite for Iceberg table (IT disabled temporarily)
  • Add logging or a whyNot API to explain why an index was or was not applied
  • Support SQL hints in addition to Spark conf, e.g. covering index rewrite, hybrid scan [TBD]
  • Support partial covering index rewrite [TBD]

Changes

  • Implemented query rewrite logic that detects when a covering index can be applied to a query.
  • Added new SQL test cases to validate that queries are rewritten only when appropriate indexes are available and enabled.
  • Provided a new option to enable or disable query rewrites using covering indexes (see the wiring sketch after this list).
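As a hedged illustration of where such an option plugs in, an optimizer rule can be injected through `SparkSessionExtensions` and consult the flag at apply time. The conf key below is taken from the session output in the Testing examples; the class name is hypothetical, and the rule delegates to the sketch above:

```scala
import org.apache.spark.sql.SparkSessionExtensions
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule

class FlintExtensionsSketch extends (SparkSessionExtensions => Unit) {
  override def apply(extensions: SparkSessionExtensions): Unit = {
    extensions.injectOptimizerRule { session =>
      new Rule[LogicalPlan] {
        override def apply(plan: LogicalPlan): LogicalPlan = {
          // Leave the plan untouched when the option is off; otherwise
          // delegate to the covering-index rewrite sketched above.
          val enabled = session.conf
            .get("spark.flint.optimizer.covering.enabled", "true").toBoolean
          if (enabled) CoveringIndexRewriteSketch(plan) else plan
        }
      }
    }
  }
}
```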

Testing

Added several unit and integration tests to cover:

  • UT: basic query rewrite functionality when a suitable covering index exists.
  • IT: no rewrite occurs when covering indexes are disabled via Spark configuration (see the test sketch after this list).
  • IT: correct interaction of query rewrites in the presence of both covering and skipping indexes.
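A minimal sketch of the disabled-path check, assuming a ScalaTest suite with a `spark` session available; the test name and assertion are illustrative, not the PR's actual test code:

```scala
test("covering index rewrite is skipped when disabled") {
  spark.conf.set("spark.flint.optimizer.covering.enabled", "false")
  try {
    val physicalPlan = spark
      .sql("SELECT clientip FROM ds_tables.http_logs WHERE status = 400")
      .queryExecution.executedPlan
    // With the rule disabled, the physical plan should keep scanning the
    // source table instead of a Flint index (no FlintScan node).
    assert(!physicalPlan.toString.contains("FlintScan"))
  } finally {
    spark.conf.set("spark.flint.optimizer.covering.enabled", "true")
  }
}
```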

Examples:

# Create covering index as before
spark-sql> CREATE INDEX all ON ds_tables.http_logs
         > (`@timestamp`, clientip, request, status, size);
spark-sql> REFRESH INDEX all ON ds_tables.http_logs;
Time taken: 9593.567 seconds

# Disable covering index acceleration
spark-sql> set spark.flint.optimizer.covering.enabled=false;

spark-sql> EXPLAIN SELECT clientip, request AS cnt FROM ds_tables.http_logs WHERE status = 400;
== Physical Plan ==
*(1) Project [clientip#17, request#18 AS cnt#10]
+- *(1) Filter (isnotnull(status#19) AND (status#19 = 400))
   +- FileScan json ds_tables.http_logs[clientip#17,request#18,status#19,year#21,month#22,day#23]
Batched: false, DataFilters: [isnotnull(status#19), (status#19 = 400)], Format: JSON,
Location: FlintSparkSkippingFileIndex(1 paths)[s3://.../httplogs/http_logs_partitioned_json_bz2],
PartitionFilters: [], PushedFilters: [IsNotNull(status), EqualTo(status,400)],
ReadSchema: struct

# Enable covering index acceleration
spark-sql> set spark.flint.optimizer.covering.enabled=true;

spark-sql> EXPLAIN SELECT clientip, request AS cnt FROM ds_tables.http_logs WHERE status = 400;
== Physical Plan ==
*(1) Project [clientip#17, request#18 AS cnt#39]
+- BatchScan[@timestamp#16, request#18, size#20, clientip#17, status#19]
class org.apache.spark.sql.flint.FlintScan,
PushedPredicates: [status IS NOT NULL, status = 400]
 RuntimeFilters: []

# Join query
spark-sql> EXPLAIN
         > SELECT l.clientip, l.status, s.description
         > FROM ds_tables.http_logs l
         > JOIN ds_tables.http_status s ON l.status = s.status
         > WHERE clientip = '157.12.11.0';
== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- Project [clientip#17, status#19, description#51]
   +- BroadcastHashJoin [status#19], [status#50], Inner, BuildRight, false
      :- Project [clientip#17, status#19]
      :  +- BatchScan[@timestamp#16, request#18, size#20, clientip#17, status#19]
class org.apache.spark.sql.flint.FlintScan,
PushedPredicates: [clientip IS NOT NULL, clientip = '157.12.11.0', status IS NOT NULL]
RuntimeFilters: []
      +- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, false] as bigint)),false), [plan_id=71]
         +- Filter isnotnull(status#50)
            +- FileScan parquet ds_tables.http_status[status#50,description#51] Batched: true,
DataFilters: [isnotnull(status#50)], Format: Parquet,
Location: InMemoryFileIndex(1 paths)[s3://.../httpstatus], PartitionFilters: [],
PushedFilters: [IsNotNull(status)], ReadSchema: struct

Issues Resolved

#298

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@dai-chen dai-chen added feature New feature 0.4 labels Apr 23, 2024
@dai-chen dai-chen self-assigned this Apr 23, 2024
@dai-chen dai-chen changed the title Add covering index query rewriter rule Add covering index based query rewriter rule Apr 26, 2024
@dai-chen dai-chen marked this pull request as ready for review April 26, 2024 16:39
Signed-off-by: Chen Dai <[email protected]>
@dai-chen dai-chen requested a review from seankao-az April 29, 2024 19:43
@dai-chen dai-chen merged commit a8a376f into opensearch-project:main Apr 30, 2024
4 checks passed
@dai-chen dai-chen deleted the add-covering-index-query-rewriter branch April 30, 2024 18:05