
Add skipping index recommendations for specific columns #300

Closed
wants to merge 19 commits into from

Conversation

rupal-bq
Contributor

@rupal-bq rupal-bq commented Apr 1, 2024

Description

Issues Resolved

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@rupal-bq rupal-bq marked this pull request as ready for review April 2, 2024 19:41
Signed-off-by: Rupal Mahajan <[email protected]>
@rupal-bq rupal-bq requested a review from seankao-az as a code owner April 2, 2024 21:57
   * @return
   *   skipping index recommendation dataframe
   */
- def analyzeSkippingIndex(tableName: String): Seq[Row] = {
-   new DataTypeSkippingStrategy().analyzeSkippingIndexColumns(tableName, spark)
+ def analyzeSkippingIndex(inputs: Map[String, List[String]]): Seq[Row] = {
Collaborator

Could you use some abstraction here instead of a generic Map? Perhaps DataFrame or an existing Table abstraction we can use here?

You may ignore query/function as input for now; the reason is Limitation no. 1 in #298 (comment). I'm wondering whether we can add a generic query-analyze API for all Flint indexes, e.g. ANALYZE FLINT INDEX FOR query.

Contributor Author

Raised a revision with DataFrame. Btw, should the grammar be ANALYZE FLINT INDEX or ANALYZE SKIPPING INDEX? Can the same static rules apply to any type of index?

Collaborator

Just some thoughts. I was thinking of something like ANALYZE FLINT INDEX FOR query. For example:

  1. ANALYZE FLINT INDEX FOR SELECT * FROM test: recommend covering index
  2. ANALYZE FLINT INDEX FOR SELECT ... WHERE clientip = ...: recommend skipping and covering
  3. ANALYZE FLINT INDEX FOR SELECT ... GROUP BY ...: recommend MV

   * @return
   *   skipping index recommendation dataframe
   */
- def analyzeSkippingIndex(tableName: String): Seq[Row] = {
-   new DataTypeSkippingStrategy().analyzeSkippingIndexColumns(tableName, spark)
+ def analyzeSkippingIndex(schema: StructType, data: Seq[Row]): Seq[Row] = {
Collaborator

SparkSession is available in this class. I think the input of this API can simply be tableName and columnNames? That is convenient for users who rely on the Flint API instead of the SQL layer.
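One possible shape of the suggested signature, as a sketch only: the column-list parameter of `analyzeSkippingIndexColumns` shown here is hypothetical, and the schema lookup assumes the class's `spark` session can resolve the table.

```scala
import org.apache.spark.sql.Row

// Hypothetical overload per the review comment: take a table name and an
// optional column list, falling back to the full schema when no columns
// are given. analyzeSkippingIndexColumns(tableName, columns, spark) is an
// assumed variant of the existing strategy method, not confirmed by the PR.
def analyzeSkippingIndex(
    tableName: String,
    columnNames: Option[List[String]] = None): Seq[Row] = {
  val columns = columnNames.getOrElse(
    spark.table(tableName).schema.fields.map(_.name).toList)
  new DataTypeSkippingStrategy().analyzeSkippingIndexColumns(tableName, columns, spark)
}
```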

Comment on lines 137 to 143
if (ctx.indexColumns != null) {
  ctx.indexColumns.multipartIdentifierProperty().forEach { indexColCtx =>
    data = data :+ Row(ctx.tableName().getText, indexColCtx.multipartIdentifier().getText)
  }
} else {
  data = data :+ Row(ctx.tableName().getText, null.asInstanceOf[String])
}
Collaborator

use Scala stream map() instead of forEach()?
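Applied to the snippet above, the suggestion might look like this sketch. The ANTLR parser-context accessors are replaced by plain parameters (`tableName`, `indexColumns` here are stand-ins), so the shape of the rewrite is what matters, not the exact types.

```scala
import org.apache.spark.sql.Row

// Sketch only: the real code reads tableName and the column identifiers
// from the ANTLR context (ctx.tableName().getText, etc.); here they are
// plain parameters so the map()-based construction is visible.
def collectRows(tableName: String, indexColumns: Option[List[String]]): Seq[Row] =
  indexColumns match {
    // map() builds the rows directly instead of appending to a mutable
    // `data` variable inside forEach()
    case Some(cols) => cols.map(col => Row(tableName, col))
    case None       => Seq(Row(tableName, null.asInstanceOf[String]))
  }
```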

  columns = table.schema().fields.map(field => field.name).toList
}
columns.foreach(column => {
  val field = findField(table.schema(), column).get
Collaborator

Could you refactor this method and make it more readable? I think only lines 50–62 are the core logic.

Contributor Author

Refactored this method. Can you please take another look?

/**
* Recommendation rules for skipping index column and algorithm selection.
*/
object RecommendationRules {
Collaborator

Just a thought: would a Rule abstraction be more useful than static util methods?
We could also think about how to extend this for recommendations on WHERE clauses.
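A minimal sketch of what such a Rule abstraction could look like. Everything here is hypothetical: `Field`, `Recommendation`, and the example rule are stand-in types invented for illustration, not part of the PR.

```scala
// Stand-in types for illustration only.
case class Field(name: String, dataType: String)
case class Recommendation(column: String, skippingType: String, reason: String)

// Hypothetical Rule abstraction: each rule decides whether it applies to a
// column (today by data type; later possibly by the functions used on it in
// a WHERE clause) and produces a recommendation.
trait RecommendationRule {
  def applies(field: Field): Boolean
  def recommend(field: Field): Recommendation
}

// Example data-type rule mirroring a static config entry.
object StringValueSetRule extends RecommendationRule {
  def applies(field: Field): Boolean = field.dataType == "string"
  def recommend(field: Field): Recommendation =
    Recommendation(field.name, "VALUE_SET", "string column")
}
```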

Contributor Author

Sure, that can be useful if we have separate implementations for data-type-based and function-based rules, but I was thinking of keeping all static rules in one place, e.g. https://github.com/rupal-bq/opensearch_spark/blob/query-recommendations/flint-spark-integration/src/main/resources/skipping_index_recommendation.conf#L50

Do you see any problem with this approach for recommendations on WHERE clauses?

@dai-chen dai-chen added the "enhancement" (New feature or request) and "0.4" labels Apr 17, 2024
Signed-off-by: Rupal Mahajan <[email protected]>
@dai-chen
Collaborator

dai-chen commented Aug 5, 2024

Closing this PR due to prolonged inactivity. Please rebase if you wish to reopen it.

@dai-chen dai-chen closed this Aug 5, 2024