[FEATURE] Tumble function doesn't support expression #626

dai-chen · 2024-09-06T21:52:32Z

Is your feature request related to a problem?

A ClassCastException is thrown when the TUMBLE function is given an expression, rather than a plain column, in a CREATE MATERIALIZED VIEW statement.

For example:

CREATE MATERIALIZED VIEW test_day AS
SELECT
  COUNT(1),
  window.start
FROM
  test
GROUP BY
  TUMBLE(CAST(FROM_UNIXTIME(time) AS TIMESTAMP), '1 Hour')
ORDER BY
  window.start;
...

java.lang.ClassCastException: class org.apache.spark.sql.catalyst.expressions.Cast cannot be cast to class org.apache.spark.sql.catalyst.expressions.Attribute (org.apache.spark.sql.catalyst.expressions.Cast and org.apache.spark.sql.catalyst.expressions.Attribute are in unnamed module of loader 'app')
    at org.opensearch.flint.spark.mv.FlintSparkMaterializedView$WindowingAggregate$.unapply(FlintSparkMaterializedView.scala:132)
    at org.opensearch.flint.spark.mv.FlintSparkMaterializedView$$anonfun$1.applyOrElse(FlintSparkMaterializedView.scala:87)
    at org.opensearch.flint.spark.mv.FlintSparkMaterializedView$$anonfun$1.applyOrElse(FlintSparkMaterializedView.scala:86)
...

What solution would you like?

Support expressions as the time argument of the TUMBLE function. This is especially useful when the time column in the source dataset is not of timestamp type.
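To illustrate the kind of conversion involved, here is a minimal sketch, assuming the `time` column holds Unix epoch seconds (which the FROM_UNIXTIME call in the failing query suggests). The literal value and the output column names are hypothetical stand-ins, and TIMESTAMP_SECONDS is only available in Spark 3.1 and later.

```sql
-- Hypothetical epoch-seconds literal standing in for the `time` column.
SELECT
  -- Conversion used in this issue: epoch seconds -> formatted string -> timestamp
  CAST(FROM_UNIXTIME(1725659552) AS TIMESTAMP) AS via_from_unixtime,
  -- Direct conversion to a timestamp (Spark 3.1+)
  TIMESTAMP_SECONDS(1725659552) AS via_timestamp_seconds;
```

Either form is an expression rather than a column reference, so both would presumably hit the same limitation when passed directly to TUMBLE.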

What alternatives have you considered?

As a workaround, the expression can be moved into a subquery so that TUMBLE receives a plain column:

CREATE MATERIALIZED VIEW test_day AS
SELECT
  COUNT(1),
  window.start
FROM (
    SELECT CAST(FROM_UNIXTIME(time) AS TIMESTAMP) AS startTime
    FROM test
)
GROUP BY
  TUMBLE(startTime, '1 Hour')
ORDER BY
  window.start
...

Do you have any additional context?

The first step is to confirm whether Spark supports an event time defined by an expression.


dai-chen · 2024-10-30T20:47:21Z

Actually, the EventTimeWatermark operator only accepts a column, not an arbitrary expression. In this case, the subquery workaround above seems to be the right way to do this. I verified its correctness by inspecting the query plan:

Aggregate [window#132-T1000ms], [window#132-T1000ms.start AS startTime#107, count(1) AS count#108L]
+- Project [named_struct(...) AS window#132-T1000ms]
   +- Filter isnotnull(timestamp2#106-T1000ms)
      +- EventTimeWatermark timestamp2#106: timestamp, 1 seconds
         +- Project [cast(timestamp#130 as timestamp) AS timestamp2#106]
            +- StreamingRelation DataSource(org.apache.spark.sql.test.TestSparkSession@4bf9f44b,CSV,List(),
Some(StructType(StructField(id,IntegerType,true),StructField(status_code,IntegerType,true),
StructField(request_path,StringType,true),StructField(timestamp,StringType,true))),List(),None,
Map(header -> false, delimiter -> 	, path -> file:/...),Some(CatalogTable(...

dai-chen added the enhancement and untriaged labels, then removed the untriaged label Sep 6, 2024

dai-chen added the Core:MV label Oct 16, 2024

dai-chen added the bug label and removed the enhancement label Oct 29, 2024

dai-chen mentioned this issue Oct 31, 2024: Add validation for time column in tumble function #858 (merged)

dai-chen closed this as completed Nov 7, 2024
