[SPARK-50061][SQL] Enable analyze table for collated columns
### What changes were proposed in this pull request?
This PR enables the `analyze table` command for collated strings. The current implementation already collects stats through the collation-aware `Aggregate` expression, so this PR only needs to enable that aggregation for collated string types.

### Why are the changes needed?
To enable the `analyze table` command for collated strings.

### Does this PR introduce _any_ user-facing change?
Yes. Currently, running:
```sql
ANALYZE TABLE test_table COMPUTE STATISTICS FOR COLUMNS c
```
where `c` is a collated string column, fails because the data type is reported as unsupported. This PR fixes the failure and enables the command.
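The failure comes down to how the type check is written. The diff below changes the pattern `StringType` to `_: StringType`. A minimal sketch of why that matters, using hypothetical stand-in types (`DataType`, `TextType`, `IntType` here are illustrative, not Spark's classes) and assuming a collated string type is a separate instance that does not compare equal to the default singleton:

```scala
object CollatedTypeMatchSketch {
  // Hypothetical stand-ins for illustration only; these are NOT Spark's classes.
  sealed trait DataType
  case object BinaryType extends DataType
  case object IntType extends DataType

  // A type class carrying a collation id, plus a singleton for the default collation.
  class TextType(val collationId: Int) extends DataType
  case object TextType extends TextType(0)

  // Old style: `TextType` is a stable identifier pattern, so it is matched by
  // equality against the default singleton only.
  def supportedBefore(dt: DataType): Boolean = dt match {
    case BinaryType | TextType => true
    case _ => false
  }

  // New style: `_: TextType` is a type pattern, so any instance matches,
  // whatever its collation id.
  def supportedAfter(dt: DataType): Boolean = dt match {
    case BinaryType | _: TextType => true
    case _ => false
  }

  def main(args: Array[String]): Unit = {
    val collated = new TextType(1) // assumed non-default collation
    println(supportedBefore(TextType)) // true
    println(supportedBefore(collated)) // false
    println(supportedAfter(collated))  // true
  }
}
```

In this sketch the non-default instance falls through to the `case _` branch under the old pattern, which mirrors the "unsupported datatype" outcome described above.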

### How was this patch tested?
New test in this PR.
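The test inserts `'a'` and `'A'` into a column collated with `sr_CI` and asserts that the first field of the resulting `CatalogColumnStat`, which appears to be the distinct count, is `Some(1)`. A rough sketch of that expectation in plain Scala, assuming a case-insensitive collation treats the two values as equal; the lowercasing key `ciKey` is a hypothetical stand-in and not how Spark implements collations:

```scala
import java.util.Locale

object CollatedDistinctCountSketch {
  // Hypothetical stand-in for a case-insensitive collation key.
  def ciKey(s: String): String = s.toLowerCase(Locale.ROOT)

  // Count distinct values after mapping each one through the given key function.
  def distinctCount(values: Seq[String], key: String => String): Int =
    values.map(key).distinct.size

  def main(args: Array[String]): Unit = {
    val values = Seq("a", "A")
    println(distinctCount(values, s => s)) // 2 under a binary comparison
    println(distinctCount(values, ciKey))  // 1 under the case-insensitive key
  }
}
```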

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #48586 from stevomitric/stevomitric/analyze-fix.

Authored-by: Stevo Mitric <[email protected]>
Signed-off-by: Max Gekk <[email protected]>
stevomitric authored and MaxGekk committed Oct 25, 2024
1 parent c976c80 commit 413a65b
Showing 3 changed files with 17 additions and 2 deletions.
@@ -140,7 +140,7 @@ case class AnalyzeColumnCommand(
case DoubleType | FloatType => true
case BooleanType => true
case _: DatetimeType => true
- case BinaryType | StringType => true
+ case BinaryType | _: StringType => true
case _ => false
}
}
@@ -411,7 +411,7 @@ object CommandUtils extends Logging {
case DoubleType | FloatType => fixedLenTypeStruct
case BooleanType => fixedLenTypeStruct
case _: DatetimeType => fixedLenTypeStruct
- case BinaryType | StringType =>
+ case BinaryType | _: StringType =>
// For string and binary type, we don't compute min, max or histogram
val nullLit = Literal(null, col.dataType)
struct(
@@ -678,6 +678,21 @@ class StatisticsCollectionSuite extends StatisticsCollectionTestBase with Shared
}
}

+ test("analyze stats for collated strings") {
+ val tableName = "collated_strings"
+ Seq[String]("sr_CI").foreach { collation =>
+ withTable(tableName) {
+ sql(s"CREATE TABLE $tableName (c STRING COLLATE $collation) USING PARQUET")
+ sql(s"INSERT INTO $tableName VALUES ('a'), ('A')")
+ sql(s"ANALYZE TABLE $tableName COMPUTE STATISTICS FOR COLUMNS c")
+
+ val table = getCatalogTable(tableName)
+ assert(table.stats.get.colStats("c") ==
+ CatalogColumnStat(Some(1), None, None, Some(0), Some(1), Some(1)))
+ }
+ }
+ }
+
test("analyzes table statistics in cached catalog view") {
def getTableStats(tableName: String): Statistics = {
spark.table(tableName).queryExecution.optimizedPlan.stats