
Add skipping index recommendations for specific columns #300

Closed
wants to merge 19 commits into from

Conversation

rupal-bq
Contributor

@rupal-bq rupal-bq commented Apr 1, 2024

Description

Issues Resolved

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@rupal-bq rupal-bq marked this pull request as ready for review April 2, 2024 19:41
Signed-off-by: Rupal Mahajan <[email protected]>
@rupal-bq rupal-bq requested a review from seankao-az as a code owner April 2, 2024 21:57
   * @return
   *   skipping index recommendation dataframe
   */
- def analyzeSkippingIndex(tableName: String): Seq[Row] = {
-   new DataTypeSkippingStrategy().analyzeSkippingIndexColumns(tableName, spark)
+ def analyzeSkippingIndex(inputs: Map[String, List[String]]): Seq[Row] = {
Collaborator

Could you use some abstraction here instead of a generic Map? Perhaps DataFrame or an existing Table abstraction we can use here?

You may ignore query/function as input for now; the reason is Limitation no. 1 in #298 (comment). I'm wondering whether we can add a generic query-analyze API for all Flint indexes, e.g. ANALYZE FLINT INDEX FOR query.

Contributor Author

Raised a revision with DataFrame. Btw, should the grammar be ANALYZE FLINT INDEX or ANALYZE SKIPPING INDEX? Can the same static rules apply to any type of index?

Collaborator

Just some thoughts. I was thinking of something like ANALYZE FLINT INDEX FOR query. For example:

  1. ANALYZE FLINT INDEX FOR SELECT * FROM test: recommend covering index
  2. ANALYZE FLINT INDEX FOR SELECT ... WHERE clientip = ...: recommend skipping and covering
  3. ANALYZE FLINT INDEX FOR SELECT ... GROUP BY ...: recommend MV

   * @return
   *   skipping index recommendation dataframe
   */
- def analyzeSkippingIndex(tableName: String): Seq[Row] = {
-   new DataTypeSkippingStrategy().analyzeSkippingIndexColumns(tableName, spark)
+ def analyzeSkippingIndex(schema: StructType, data: Seq[Row]): Seq[Row] = {
Collaborator

SparkSession is available in this class. I think the input of this API can simply be tableName and columnNames? That is convenient for users who rely on the Flint API instead of the SQL layer.
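One possible shape of the suggested signature, as a sketch only: the column-list parameter of `analyzeSkippingIndexColumns` shown here is hypothetical, and the schema lookup assumes the class's `spark` session can resolve the table.

```scala
import org.apache.spark.sql.Row

// Hypothetical overload per the review comment: take a table name and an
// optional column list, falling back to the full schema when no columns
// are given. analyzeSkippingIndexColumns(tableName, columns, spark) is an
// assumed variant of the existing strategy method, not confirmed by the PR.
def analyzeSkippingIndex(
    tableName: String,
    columnNames: Option[List[String]] = None): Seq[Row] = {
  val columns = columnNames.getOrElse(
    spark.table(tableName).schema.fields.map(_.name).toList)
  new DataTypeSkippingStrategy().analyzeSkippingIndexColumns(tableName, columns, spark)
}
```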

Comment on lines 137 to 143
if (ctx.indexColumns != null) {
  ctx.indexColumns.multipartIdentifierProperty().forEach { indexColCtx =>
    data = data :+ Row(ctx.tableName().getText, indexColCtx.multipartIdentifier().getText)
  }
} else {
  data = data :+ Row(ctx.tableName().getText, null.asInstanceOf[String])
}
Collaborator

use Scala stream map() instead of forEach()?
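Applied to the snippet above, the suggestion might look like this sketch. The ANTLR parser-context accessors are replaced by plain parameters (`tableName`, `indexColumns` here are stand-ins), so the shape of the rewrite is what matters, not the exact types.

```scala
import org.apache.spark.sql.Row

// Sketch only: the real code reads tableName and the column identifiers
// from the ANTLR context (ctx.tableName().getText, etc.); here they are
// plain parameters so the map()-based construction is visible.
def collectRows(tableName: String, indexColumns: Option[List[String]]): Seq[Row] =
  indexColumns match {
    // map() builds the rows directly instead of appending to a mutable
    // `data` variable inside forEach()
    case Some(cols) => cols.map(col => Row(tableName, col))
    case None       => Seq(Row(tableName, null.asInstanceOf[String]))
  }
```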

  columns = table.schema().fields.map(field => field.name).toList
}
columns.foreach(column => {
  val field = findField(table.schema(), column).get
Collaborator

Could you refactor this method and make it more readable? I think only lines 50–62 are the core logic.

Contributor Author

Refactored this method. Can you please take another look?

/**
* Recommendation rules for skipping index column and algorithm selection.
*/
object RecommendationRules {
Collaborator

Just a thought: would a Rule abstraction be more useful than static util methods?
We could also think about how to extend this for recommendations on WHERE clauses.
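A minimal sketch of what such a Rule abstraction could look like. Everything here is hypothetical: `Field`, `Recommendation`, and the example rule are stand-in types invented for illustration, not part of the PR.

```scala
// Stand-in types for illustration only.
case class Field(name: String, dataType: String)
case class Recommendation(column: String, skippingType: String, reason: String)

// Hypothetical Rule abstraction: each rule decides whether it applies to a
// column (today by data type; later possibly by the functions used on it in
// a WHERE clause) and produces a recommendation.
trait RecommendationRule {
  def applies(field: Field): Boolean
  def recommend(field: Field): Recommendation
}

// Example data-type rule mirroring a static config entry.
object StringValueSetRule extends RecommendationRule {
  def applies(field: Field): Boolean = field.dataType == "string"
  def recommend(field: Field): Recommendation =
    Recommendation(field.name, "VALUE_SET", "string column")
}
```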

Contributor Author

Sure, that can be useful if we have separate implementations for data-type-based and function-based rules, but I was thinking of keeping all static rules in one place, e.g. https://github.com/rupal-bq/opensearch_spark/blob/query-recommendations/flint-spark-integration/src/main/resources/skipping_index_recommendation.conf#L50

Do you see any problem with this approach for recommendations on WHERE clauses?

@dai-chen dai-chen added the "enhancement" (New feature or request) and "0.4" labels Apr 17, 2024
Signed-off-by: Rupal Mahajan <[email protected]>
@dai-chen
Collaborator

dai-chen commented Aug 5, 2024

Closing this PR due to prolonged inactivity. Please rebase if you wish to reopen it.

@dai-chen dai-chen closed this Aug 5, 2024