Implement analyze skipping index statement #284
New file: `AnalyzeSkippingStrategy.scala` (+24 lines)

```scala
/*
 * Copyright OpenSearch Contributors
 * SPDX-License-Identifier: Apache-2.0
 */
package org.opensearch.flint.spark.skipping.recommendations

import org.apache.spark.sql.{Row, SparkSession}

/**
 * Automate skipping index column and algorithm selection.
 */
trait AnalyzeSkippingStrategy {

  /**
   * Recommend skipping index columns and algorithms.
   *
   * @param tableName
   *   table name
   * @return
   *   skipping index recommendation rows
   */
  def analyzeSkippingIndexColumns(tableName: String, spark: SparkSession): Seq[Row]
}
```
New file: `DataTypeSkippingStrategy.scala` (+63 lines)

```scala
/*
 * Copyright OpenSearch Contributors
 * SPDX-License-Identifier: Apache-2.0
 */
package org.opensearch.flint.spark.skipping.recommendations

import scala.collection.mutable.ArrayBuffer

import org.opensearch.flint.spark.skipping.FlintSparkSkippingStrategy.SkippingKind.{BLOOM_FILTER, MIN_MAX, PARTITION, VALUE_SET}

import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.flint.{loadTable, parseTableName}

class DataTypeSkippingStrategy extends AnalyzeSkippingStrategy {

  val rules = Map(
    "PARTITION" -> (PARTITION.toString, "PARTITION data structure is recommended for partition columns"),
    "BooleanType" -> (VALUE_SET.toString, "VALUE_SET data structure is recommended for BooleanType columns"),
    "IntegerType" -> (MIN_MAX.toString, "MIN_MAX data structure is recommended for IntegerType columns"),
    "LongType" -> (MIN_MAX.toString, "MIN_MAX data structure is recommended for LongType columns"),
    "ShortType" -> (MIN_MAX.toString, "MIN_MAX data structure is recommended for ShortType columns"),
    "DateType" -> (BLOOM_FILTER.toString, "BLOOM_FILTER data structure is recommended for DateType columns"),
    "TimestampType" -> (BLOOM_FILTER.toString, "BLOOM_FILTER data structure is recommended for TimestampType columns"),
    "StringType" -> (BLOOM_FILTER.toString, "BLOOM_FILTER data structure is recommended for StringType columns"),
    "VarcharType" -> (BLOOM_FILTER.toString, "BLOOM_FILTER data structure is recommended for VarcharType columns"),
    "CharType" -> (BLOOM_FILTER.toString, "BLOOM_FILTER data structure is recommended for CharType columns"),
    "StructType" -> (BLOOM_FILTER.toString, "BLOOM_FILTER data structure is recommended for StructType columns"))

  override def analyzeSkippingIndexColumns(tableName: String, spark: SparkSession): Seq[Row] = {
    val (catalog, ident) = parseTableName(spark, tableName)
    val table = loadTable(catalog, ident).getOrElse(
      throw new IllegalStateException(s"Table $tableName is not found"))

    // Collect the names of all partition columns of the table.
    val partitionFields = table.partitioning().flatMap { transform =>
      transform
        .references()
        .collect({ case reference =>
          reference.fieldNames()
        })
        .flatten
        .toSet
    }

    // Emit one recommendation row per column that matches a rule;
    // partition columns take precedence over data-type rules.
    val result = ArrayBuffer[Row]()
    table.schema().fields.foreach { field =>
      if (partitionFields.contains(field.name)) {
        result += Row(
          field.name,
          field.dataType.typeName,
          rules("PARTITION")._1,
          rules("PARTITION")._2)
      } else if (rules.contains(field.dataType.toString)) {
        result += Row(
          field.name,
          field.dataType.typeName,
          rules(field.dataType.toString)._1,
          rules(field.dataType.toString)._2)
      }
    }
    result
  }
}
```

Review discussion on the `rules` map:

Reviewer: I'm wondering whether it would be more flexible to move this static mapping to a config file. Or maybe that is not necessary for this P0 solution?

Author: Good idea. I added it here thinking it is specific to data-type-based recommendation and won't be used by other strategies (e.g., recommendation based on table stats).

Author: I will take this up as a fast follow-up, because finalizing the grammar before the 2.13 release will unblock the SQL plugin.

Review discussion on the partition-field extraction (lines +35 to +43):

Reviewer: I'm not sure if this is the right API, because I've only used …

Author: Sure, will do. Thanks!
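The per-column rule lookup above can be sketched without a Spark dependency. The following is a simplified, self-contained model of the same logic: `Column` and `recommend` are hypothetical stand-ins for Spark's `StructField` and the loop in `analyzeSkippingIndexColumns`, and only a few of the rules are reproduced.

```scala
// Simplified, Spark-free sketch of the rule lookup in DataTypeSkippingStrategy.
// Column and recommend are hypothetical stand-ins, not part of the PR.
object SkippingRuleSketch {
  final case class Column(name: String, dataType: String, isPartition: Boolean = false)

  // (skipping kind, reason) keyed by Spark data type name, mirroring the PR's rules map
  val rules: Map[String, (String, String)] = Map(
    "PARTITION" -> ("PARTITION", "PARTITION data structure is recommended for partition columns"),
    "BooleanType" -> ("VALUE_SET", "VALUE_SET data structure is recommended for BooleanType columns"),
    "IntegerType" -> ("MIN_MAX", "MIN_MAX data structure is recommended for IntegerType columns"),
    "StringType" -> ("BLOOM_FILTER", "BLOOM_FILTER data structure is recommended for StringType columns"))

  // Returns (column name, data type, recommended skipping kind) per matching column.
  // Partition columns win over data-type rules; columns with no rule are dropped.
  def recommend(columns: Seq[Column]): Seq[(String, String, String)] =
    columns.flatMap { col =>
      if (col.isPartition) {
        Some((col.name, col.dataType, rules("PARTITION")._1))
      } else {
        rules.get(col.dataType).map { case (kind, _) => (col.name, col.dataType, kind) }
      }
    }
}
```

For example, a partitioned `year` column maps to PARTITION, a `status` string column to BLOOM_FILTER, and a column with no rule (e.g. `BinaryType`) produces no recommendation, which matches the silent skip in the real implementation.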
Review discussion on the proposed grammar:

Reviewer: Is this grammar finalized? What is the semantic meaning?

Author: This is the proposed grammar; please comment if you have other suggestions. "Analyze" refers to examining data to get insights. This command returns recommendations for creating a skipping index (skipping index columns with suggested data structures) based on table data.

Reviewer: Any reference / compatibility analysis with the mainstream syntax? Just brainstorming …
Or …
The assumption is we may want to do more things other than the recommendation.
ref: https://docs.oracle.com/en/database/oracle/oracle-database/19/sqlrf/ANALYZE.html#GUID-535CE98E-2359-4147-839F-DCB3772C1B0E