[BUG] Incorrect flint index name during query rewrite #319
Root cause of the issue is https://github.com/opensearch-project/opensearch-spark/blob/main/spark-sql-applica[…]main/java/org/opensearch/sql/FlintDelegatingSessionCatalog.java. We should be returning the actual name of the datasource rather than spark_catalog. This will be a breaking change.
We did this intentionally. That is because Spark compares the catalog name against a static string (spark_catalog) and takes a different code path for it: https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/connector/catalog/CatalogV2Util.scala#L386C1-L388C4
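For context, the linked check is essentially the following (paraphrased; the real constant lives in Spark's CatalogManager.SESSION_CATALOG_NAME):

```scala
import org.apache.spark.sql.connector.catalog.CatalogPlugin

// Paraphrase of the linked CatalogV2Util logic: any catalog whose name()
// matches the static string "spark_catalog" is treated as the session
// catalog, and Spark takes different actions for it.
def isSessionCatalog(catalog: CatalogPlugin): Boolean =
  catalog.name().equalsIgnoreCase("spark_catalog")
```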
You shouldn't need to do that. Because we are delegating to the default Spark catalog, the session will still work correctly if you set the catalog name properly. Take a look at how Iceberg does it: https://github.com/apache/iceberg/blob/main/spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/SparkSessionCatalog.java Let's assume we fix
Test create csv table with Iceberg catalog
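As an illustration of that Iceberg-style pattern, here is a minimal sketch against Spark's public TableCatalog API. This is not the actual Flint or Iceberg code; the class name and the manual setDelegate are hypothetical (with the real CatalogExtension API, Spark injects the delegate via setDelegateCatalog):

```scala
import java.util

import org.apache.spark.sql.connector.catalog.{Identifier, Table, TableCatalog, TableChange}
import org.apache.spark.sql.connector.expressions.Transform
import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.util.CaseInsensitiveStringMap

// Sketch: keep the name the catalog was registered under and
// forward everything else to the real session catalog.
class NamedDelegatingCatalog extends TableCatalog {
  private var registeredName: String = _ // e.g. "my_glue" from spark.sql.catalog.my_glue
  private var delegate: TableCatalog = _ // the real session catalog to forward to

  override def initialize(name: String, options: CaseInsensitiveStringMap): Unit =
    registeredName = name // remember the configured name

  def setDelegate(d: TableCatalog): Unit = delegate = d

  // The key difference: report the configured name, not "spark_catalog".
  override def name(): String = registeredName

  override def listTables(namespace: Array[String]): Array[Identifier] =
    delegate.listTables(namespace)
  override def loadTable(ident: Identifier): Table = delegate.loadTable(ident)
  override def createTable(ident: Identifier, schema: StructType,
      partitions: Array[Transform], properties: util.Map[String, String]): Table =
    delegate.createTable(ident, schema, partitions, properties)
  override def alterTable(ident: Identifier, changes: TableChange*): Table =
    delegate.alterTable(ident, changes: _*)
  override def dropTable(ident: Identifier): Boolean = delegate.dropTable(ident)
  override def renameTable(from: Identifier, to: Identifier): Unit =
    delegate.renameTable(from, to)
}
```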
Spent a few hours researching this as well today and had the same findings. Per your question on Iceberg, only the default. So I think it should work if you remove this logic:
The easiest path forward will be to leave the existing behavior of using
How to resolve catalog name in Flint

1. Current Status

The flint index name is composed as flint_{catalog_name}_{database_name}_{table_name}_{index_type}. The flint index builder and the flint optimizer resolve catalog_name from table.qualifiedName.
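For illustration, the composition rule amounts to something like this (the helper name is hypothetical, not the actual Flint code):

```scala
// Illustrative helper showing the naming scheme described above.
def flintIndexName(catalogName: String, databaseName: String,
    tableName: String, indexType: String): String =
  s"flint_${catalogName}_${databaseName}_${tableName}_${indexType}"

// flintIndexName("spark_catalog", "default", "t", "skipping_index")
//   => "flint_spark_catalog_default_t_skipping_index"  (the buggy name)
// flintIndexName("my_glue", "default", "t", "skipping_index")
//   => "flint_my_glue_default_t_skipping_index"        ("my_glue" is a placeholder datasource name)
```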
Why the solution does not work since Spark 3.4

2. Proposed Solution

The proposed solution is to change the catalog resolution logic of step 3 and step 4 as below. The idea is to reuse Spark's current logic to resolve the CatalogPlugin, but customize the catalogName resolution logic: spark.sql.defaultCatalog is used if Catalog.name is spark_catalog. It is a workaround to solve #319 (comment).
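A minimal sketch of that fallback rule, assuming the name has already been obtained from the resolved CatalogPlugin (the function name is hypothetical):

```scala
import org.apache.spark.sql.SparkSession

// Sketch of the proposed resolution rule: if the resolved catalog reports
// the static session-catalog name, fall back to spark.sql.defaultCatalog,
// which is configured to the datasource name.
def resolveCatalogName(spark: SparkSession, pluginName: String): String =
  if (pluginName.equalsIgnoreCase("spark_catalog"))
    spark.conf.get("spark.sql.defaultCatalog", "spark_catalog")
  else
    pluginName
```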
Preconditions
Limitations
This approach looks good to me. Even today, I think only one Glue catalog can be supported at a time. There may be some weird behavior when changing defaultCatalog to point to
What is the bug?
After upgrading Spark to 3.4.1, the skipping index won't be applied to rewrite applicable queries, because during query rewrite an incorrect flint index name is constructed for the queried table.
How can one reproduce the bug?
Steps to reproduce the behavior:
INFO FlintSpark: Describing index name flint_spark_catalog_default_{table}_skipping_index
What is the expected behavior?
INFO FlintSpark: Describing index name flint_{datasource}_default_{table}_skipping_index
What is your host/environment?
Do you have any screenshots?
If applicable, add screenshots to help explain your problem.
Do you have any additional context?
Using the emr-6.13.0 release, EXPLAIN EXTENDED for the query shows

+- Relation spark_catalog.default.{table}

in the Analyzed Logical Plan, while for the emr-6.11.0 release it reads

+- Relation default.{table}
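For reference, the plan difference can be observed like this ("mytable" stands in for the {table} placeholder):

```scala
// Print the analyzed/optimized plans for a simple scan.
val spark = org.apache.spark.sql.SparkSession.builder().getOrCreate()
spark.sql("SELECT * FROM default.mytable").explain(extended = true)
// emr-6.13.0 prints: +- Relation spark_catalog.default.mytable ...
// emr-6.11.0 prints: +- Relation default.mytable ...
```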
Change in TableIdentifiers interface: