
[DOP-18631] - add partial support for ArrayType #8

Merged
7 commits merged into develop from feature/DOP-18631
Aug 5, 2024

Conversation

maxim-lixakov
Contributor

@maxim-lixakov maxim-lixakov commented Aug 1, 2024

Change Summary

Add partial support:

Array(T) -> ArrayType(T):

| ClickHouse Type (Read) | Spark Type | ClickHouse Type (Write) | ClickHouse Type (Create) |
|---|---|---|---|
| Array(String) | ArrayType(StringType) | Array(String) | Array(String) |
| unsupported | ArrayType(ByteType) | Array(Int8) | Array(Int8) |
| unsupported | ArrayType(ShortType) | unsupported | unsupported |
| unsupported | ArrayType(LongType) | Array(Int64) | Array(Int64) |
| Array(Decimal(M, N)) | ArrayType(DecimalType(M, N)) | Array(Decimal(M, N)) | Array(Decimal(M, N)) |
| unsupported | ArrayType(TimestampType) | unsupported | unsupported |
| unsupported | ArrayType(Date) | Array(Date) | Array(Date) |
| unsupported | ArrayType(FloatType) | Array(Float32) | Array(Float32) |
| unsupported | ArrayType(DoubleType) | unsupported | unsupported |
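The read-side half of this table can be restated as a tiny standalone function. This is purely an illustrative sketch of the mapping above, not code from the extension itself; all names here are made up:

```scala
// Toy sketch of the read-side mapping from the table above.
// NOT the extension's real code; it only restates the supported "Read" rows.
object ArrayReadMapping {
  // ClickHouse element type -> Spark Catalyst element type, supported reads only
  private val elementMapping: Map[String, String] = Map(
    "String"        -> "StringType",
    "Decimal(M, N)" -> "DecimalType(M, N)"
  )

  /** Returns Some(sparkTypeName) for supported Array(T) reads, None otherwise. */
  def readType(clickHouseType: String): Option[String] = {
    val ArrayRe = """Array\((.+)\)""".r
    clickHouseType match {
      case ArrayRe(inner) => elementMapping.get(inner).map(t => s"ArrayType($t)")
      case _              => None
    }
  }
}
```

Everything not listed in the "Read" column (e.g. Array(Int16)) falls through to None, matching the "unsupported" cells.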

The problem with ShortType and DoubleType is the following.

In native Spark, the typeName for the element type nested in an Array is obtained via getJDBCType. The default getJDBCType contains:

case ShortType => Option(JdbcType("INTEGER", java.sql.Types.SMALLINT))

This converts ShortType to JdbcType("INTEGER"), so ArrayType(ShortType) -> Array(INTEGER). But because we customize the ShortType mapping to Int16 for ClickHouse, it instead tries to convert ArrayType(ShortType) -> Array(Int16); however in Java there is no Int16 type, therefore it fails with the error:

java.lang.IllegalArgumentException: Unknown data type: int16

So for the current Spark implementation there is a choice:

  1. keep the custom ShortType -> Int16 mapping, without write support for ArrayType(ShortType)
  2. remove the custom ShortType -> Int16 mapping and keep INTEGER, gaining write support for ArrayType(ShortType)
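The trade-off between the two options can be modelled with a small standalone sketch. This is a toy model of the observed behaviour only (the accepted-name set and all identifiers are invented for illustration; it does not claim to reproduce the real ClickHouse driver logic):

```scala
// Toy model of the write path for ArrayType(ShortType), mirroring the
// observed behaviour: the custom Short -> Int16 mapping produces an element
// type name that reaches the driver as "int16" and is rejected (see the
// "Unknown data type: int16" stacktrace below), while Spark's default
// INTEGER mapping goes through.
object ArrayWriteSketch {
  // Invented stand-in for "names the driver accepts in this scenario".
  private val acceptedByDriver = Set("INTEGER", "String", "Int8", "Int64")

  /** Element type name sent for ArrayType(ShortType) under each option. */
  def elementName(customShortMapping: Boolean): String =
    if (customShortMapping) "int16" else "INTEGER"

  /** Right(name) if the toy driver accepts it, Left(error) otherwise. */
  def tryCreateArray(customShortMapping: Boolean): Either[String, String] = {
    val name = elementName(customShortMapping)
    if (acceptedByDriver.contains(name)) Right(name)
    else Left(s"Unknown data type: $name")
  }
}
```

Option 1 keeps the scalar Short -> Int16 mapping and accepts the Left branch for array writes; option 2 takes the Right branch at the cost of the custom scalar mapping.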

Related issue number

Checklist

  • Commit message and PR title are comprehensive
  • Keep the change as small as possible
  • Unit and integration tests for the changes exist
  • Tests pass on CI and coverage does not decrease
  • Documentation reflects the changes where applicable
  • My PR is ready to review.


@maxim-lixakov maxim-lixakov requested a review from dolfinus August 1, 2024 12:41
@maxim-lixakov maxim-lixakov self-assigned this Aug 1, 2024

codecov bot commented Aug 1, 2024

Codecov Report

Attention: Patch coverage is 65.27778% with 25 lines in your changes missing coverage. Please review.

Project coverage is 71.26%. Comparing base (2320a27) to head (35f8e8e).

Files Patch % Lines
...dialectextensions/ClickhouseDialectExtension.scala 65.27% 25 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##           develop       #8       +/-   ##
============================================
- Coverage   100.00%   71.26%   -28.74%     
============================================
  Files            1        1               
  Lines           22       87       +65     
  Branches         4       27       +23     
============================================
+ Hits            22       62       +40     
- Misses           0       25       +25     


@dolfinus
Member

dolfinus commented Aug 1, 2024

however in Java there is no Int16 type, therefore it fails with error:

But Array(Int16) is a type name sent directly to ClickHouse, and not used by Java at all. Could you provide a full stacktrace for this error?

@maxim-lixakov
Contributor Author

however in Java there is no Int16 type, therefore it fails with error:

But Array(Int16) is a type name sent directly to ClickHouse, and not used by Java at all. Could you provide a full stacktrace for this error?

no, it first converts in Scala for arrays: https://github.com/apache/spark/blob/c3bb64e367aca5b49c691a9b64bc3e2965b7c80e/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala#L712

@dolfinus
Member

dolfinus commented Aug 1, 2024

no, it first converts in scala for arrays:

It is important to have a full stacktrace. conn.createArrayOf(typeName, elements) expects typeName to be a SQL database-specific type name, not a Java type:
https://docs.oracle.com/javase/7/docs/api/java/sql/Connection.html#createArrayOf(java.lang.String,%20java.lang.Object[])
So currently I don't understand where this exception is being raised.
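The contract dolfinus describes can be sketched with stubs. This is a simplified, self-contained model of the hand-off (the trait and class names are invented; only createArrayOf follows the real JDBC API shape):

```scala
// Simplified sketch of how Spark hands the dialect's type name to JDBC.
// java.sql.Connection.createArrayOf(typeName, elements) expects a
// database-specific SQL type name (e.g. "Int64" for ClickHouse),
// not a Java/Scala type.
trait MiniConnection {
  def createArrayOf(typeName: String, elements: Array[AnyRef]): String
}

// Stub "driver" that just records which type name it was asked to create.
class RecordingConnection extends MiniConnection {
  def createArrayOf(typeName: String, elements: Array[AnyRef]): String =
    s"ARRAY<$typeName>(${elements.length} elems)"
}

object JdbcArraySketch {
  /** Mimics the setter: pass the dialect's databaseTypeName straight through. */
  def setArray(conn: MiniConnection,
               databaseTypeName: String,
               elements: Seq[AnyRef]): String =
    conn.createArrayOf(databaseTypeName, elements.toArray)
}
```

Whatever string the dialect produces for the element type is forwarded verbatim, which is why the choice of databaseTypeName decides whether the driver accepts the write.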

@maxim-lixakov
Contributor Author

no, it first converts in scala for arrays:

It is important to have a full stacktrace. conn.createArrayOf(typeName, elements) expects typeName to be a SQL database-specific type name, not a Java type: https://docs.oracle.com/javase/7/docs/api/java/sql/Connection.html#createArrayOf(java.lang.String,%20java.lang.Object[]) So currently I don't understand where this exception is being raised.

org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 43.0 failed 1 times, most recent failure: Lost task 0.0 in stage 43.0 (TID 34) (178.248.237.216 executor driver): java.lang.IllegalArgumentException: Unknown data type: int16
[info]  at com.clickhouse.data.ClickHouseDataType.of(ClickHouseDataType.java:238)
[info]  at com.clickhouse.data.ClickHouseColumn.readColumn(ClickHouseColumn.java:488)
[info]  at com.clickhouse.data.ClickHouseColumn.of(ClickHouseColumn.java:537)
[info]  at com.clickhouse.jdbc.ClickHouseConnection.createArrayOf(ClickHouseConnection.java:46)
[info]  at com.clickhouse.jdbc.ClickHouseConnection.createArrayOf(ClickHouseConnection.java:27)
[info]  at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.$anonfun$makeSetter$16(JdbcUtils.scala:643)
[info]  at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.$anonfun$makeSetter$16$adapted(JdbcUtils.scala:640)
[info]  at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.savePartition(JdbcUtils.scala:738)
[info]  at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.$anonfun$saveTable$1(JdbcUtils.scala:902)
[info]  at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.$anonfun$saveTable$1$adapted(JdbcUtils.scala:901)
[info]  at org.apache.spark.rdd.RDD.$anonfun$foreachPartition$2(RDD.scala:1039)
[info]  at org.apache.spark.rdd.RDD.$anonfun$foreachPartition$2$adapted(RDD.scala:1039)
[info]  at org.apache.spark.SparkContext.$anonfun$runJob$5(SparkContext.scala:2438)
[info]  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:93)
[info]  at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:166)
[info]  at org.apache.spark.scheduler.Task.run(Task.scala:141)
[info]  at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$4(Executor.scala:620)
[info]  at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally(SparkErrorUtils.scala:64)
[info]  at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally$(SparkErrorUtils.scala:61)
[info]  at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:94)
[info]  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:623)
[info]  at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
[info]  at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
[info]  at java.base/java.lang.Thread.run(Thread.java:829)
[info] 
[info] Driver stacktrace:
[info]   at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2856)
[info]   at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2792)
[info]   at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2791)
[info]   at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
[info]   at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
[info]   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
[info]   at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2791)
[info]   at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1247)
[info]   at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1247)
[info]   at scala.Option.foreach(Option.scala:407)
[info]   ...
[info]   Cause: java.lang.IllegalArgumentException: Unknown data type: int16
[info]   at com.clickhouse.data.ClickHouseDataType.of(ClickHouseDataType.java:238)
[info]   at com.clickhouse.data.ClickHouseColumn.readColumn(ClickHouseColumn.java:488)
[info]   at com.clickhouse.data.ClickHouseColumn.of(ClickHouseColumn.java:537)
[info]   at com.clickhouse.jdbc.ClickHouseConnection.createArrayOf(ClickHouseConnection.java:46)
[info]   at com.clickhouse.jdbc.ClickHouseConnection.createArrayOf(ClickHouseConnection.java:27)
[info]   at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.$anonfun$makeSetter$16(JdbcUtils.scala:643)
[info]   at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.$anonfun$makeSetter$16$adapted(JdbcUtils.scala:640)
[info]   at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.savePartition(JdbcUtils.scala:738)
[info]   at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.$anonfun$saveTable$1(JdbcUtils.scala:902)
[info]   at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.$anonfun$saveTable$1$adapted(JdbcUtils.scala:901)

@maxim-lixakov maxim-lixakov force-pushed the feature/DOP-18631 branch 2 times, most recently from 1a55bb6 to 35f8e8e Compare August 5, 2024 12:04
Member

@dolfinus dolfinus left a comment


New type mapping should be mentioned in documentation

@maxim-lixakov maxim-lixakov merged commit 22e74eb into develop Aug 5, 2024
2 checks passed
@maxim-lixakov maxim-lixakov deleted the feature/DOP-18631 branch August 5, 2024 13:39