
[DOP-18631] - add partial support for ArrayType #8

Merged
7 commits merged into develop from feature/DOP-18631
Aug 5, 2024

Conversation

maxim-lixakov
Contributor

@maxim-lixakov maxim-lixakov commented Aug 1, 2024

Change Summary

Add partial support:

Array(T) -> ArrayType(T):

| ClickHouse Type (Read) | Spark Type | ClickHouse Type (Write) | ClickHouse Type (Create) |
|---|---|---|---|
| Array(String) | ArrayType(StringType) | Array(String) | Array(String) |
| unsupported | ArrayType(ByteType) | Array(Int8) | Array(Int8) |
| unsupported | ArrayType(ShortType) | unsupported | unsupported |
| unsupported | ArrayType(LongType) | Array(Int64) | Array(Int64) |
| Array(Decimal(M, N)) | ArrayType(DecimalType(M, N)) | Array(Decimal(M, N)) | Array(Decimal(M, N)) |
| unsupported | ArrayType(TimestampType) | unsupported | unsupported |
| unsupported | ArrayType(Date) | Array(Date) | Array(Date) |
| unsupported | ArrayType(FloatType) | Array(Float32) | Array(Float32) |
| unsupported | ArrayType(DoubleType) | unsupported | unsupported |
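The read-side half of this table can be restated as a tiny standalone function. This is purely an illustrative sketch of the mapping above, not code from the extension itself; all names here are made up:

```scala
// Toy sketch of the read-side mapping from the table above.
// NOT the extension's real code; it only restates the supported "Read" rows.
object ArrayReadMapping {
  // ClickHouse element type -> Spark Catalyst element type, supported reads only
  private val elementMapping: Map[String, String] = Map(
    "String"        -> "StringType",
    "Decimal(M, N)" -> "DecimalType(M, N)"
  )

  /** Returns Some(sparkTypeName) for supported Array(T) reads, None otherwise. */
  def readType(clickHouseType: String): Option[String] = {
    val ArrayRe = """Array\((.+)\)""".r
    clickHouseType match {
      case ArrayRe(inner) => elementMapping.get(inner).map(t => s"ArrayType($t)")
      case _              => None
    }
  }
}
```

Everything not listed in the "Read" column (e.g. Array(Int16)) falls through to None, matching the "unsupported" cells.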

The problem with ShortType and DoubleType is the following.

In native Spark, the typeName for the element type nested in an Array is obtained via getJDBCType. The default getJDBCType contains:

case ShortType => Option(JdbcType("INTEGER", java.sql.Types.SMALLINT))

This converts ShortType to JdbcType("INTEGER"), so ArrayType(ShortType) -> Array(INTEGER). But because we customize the ShortType mapping to Int16 for ClickHouse, it instead tries to convert ArrayType(ShortType) -> Array(Int16); however in Java there is no Int16 type, therefore it fails with the error:

java.lang.IllegalArgumentException: Unknown data type: int16

So for the current Spark implementation there is a choice:

  1. keep the custom ShortType -> Int16 mapping, without write support for ArrayType(ShortType)
  2. remove the custom ShortType -> Int16 mapping and keep INTEGER, gaining write support for ArrayType(ShortType)
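The trade-off between the two options can be modelled with a small standalone sketch. This is a toy model of the observed behaviour only (the accepted-name set and all identifiers are invented for illustration; it does not claim to reproduce the real ClickHouse driver logic):

```scala
// Toy model of the write path for ArrayType(ShortType), mirroring the
// observed behaviour: the custom Short -> Int16 mapping produces an element
// type name that reaches the driver as "int16" and is rejected (see the
// "Unknown data type: int16" stacktrace below), while Spark's default
// INTEGER mapping goes through.
object ArrayWriteSketch {
  // Invented stand-in for "names the driver accepts in this scenario".
  private val acceptedByDriver = Set("INTEGER", "String", "Int8", "Int64")

  /** Element type name sent for ArrayType(ShortType) under each option. */
  def elementName(customShortMapping: Boolean): String =
    if (customShortMapping) "int16" else "INTEGER"

  /** Right(name) if the toy driver accepts it, Left(error) otherwise. */
  def tryCreateArray(customShortMapping: Boolean): Either[String, String] = {
    val name = elementName(customShortMapping)
    if (acceptedByDriver.contains(name)) Right(name)
    else Left(s"Unknown data type: $name")
  }
}
```

Option 1 keeps the scalar Short -> Int16 mapping and accepts the Left branch for array writes; option 2 takes the Right branch at the cost of the custom scalar mapping.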

Related issue number

Checklist

  • Commit message and PR title are comprehensive
  • Keep the change as small as possible
  • Unit and integration tests for the changes exist
  • Tests pass on CI and coverage does not decrease
  • Documentation reflects the changes where applicable
  • My PR is ready to review.


@maxim-lixakov maxim-lixakov requested a review from dolfinus August 1, 2024 12:41
@maxim-lixakov maxim-lixakov self-assigned this Aug 1, 2024

codecov bot commented Aug 1, 2024

Codecov Report

Attention: Patch coverage is 65.27778% with 25 lines in your changes missing coverage. Please review.

Project coverage is 71.26%. Comparing base (2320a27) to head (35f8e8e).

Files Patch % Lines
...dialectextensions/ClickhouseDialectExtension.scala 65.27% 25 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##           develop       #8       +/-   ##
============================================
- Coverage   100.00%   71.26%   -28.74%     
============================================
  Files            1        1               
  Lines           22       87       +65     
  Branches         4       27       +23     
============================================
+ Hits            22       62       +40     
- Misses           0       25       +25     


@dolfinus
Member

dolfinus commented Aug 1, 2024

however in Java there is no Int16 type, therefore it fails with error:

But Array(Int16) is a type name sent directly to ClickHouse, and not used by Java at all. Could you provide a full stacktrace for this error?

@maxim-lixakov
Contributor Author

however in Java there is no Int16 type, therefore it fails with error:

But Array(Int16) is a type name sent directly to ClickHouse, and not used by Java at all. Could you provide a full stacktrace for this error?

no, it first converts in Scala for arrays: https://github.com/apache/spark/blob/c3bb64e367aca5b49c691a9b64bc3e2965b7c80e/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala#L712

@dolfinus
Member

dolfinus commented Aug 1, 2024

no, it first converts in scala for arrays:

It is important to have a full stacktrace. conn.createArrayOf(typeName, elements) expects typeName to be a SQL database-specific type name, not a Java type:
https://docs.oracle.com/javase/7/docs/api/java/sql/Connection.html#createArrayOf(java.lang.String,%20java.lang.Object[])
So currently I don't understand where this exception is being raised.
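The contract dolfinus describes can be sketched with stubs. This is a simplified, self-contained model of the hand-off (the trait and class names are invented; only createArrayOf follows the real JDBC API shape):

```scala
// Simplified sketch of how Spark hands the dialect's type name to JDBC.
// java.sql.Connection.createArrayOf(typeName, elements) expects a
// database-specific SQL type name (e.g. "Int64" for ClickHouse),
// not a Java/Scala type.
trait MiniConnection {
  def createArrayOf(typeName: String, elements: Array[AnyRef]): String
}

// Stub "driver" that just records which type name it was asked to create.
class RecordingConnection extends MiniConnection {
  def createArrayOf(typeName: String, elements: Array[AnyRef]): String =
    s"ARRAY<$typeName>(${elements.length} elems)"
}

object JdbcArraySketch {
  /** Mimics the setter: pass the dialect's databaseTypeName straight through. */
  def setArray(conn: MiniConnection,
               databaseTypeName: String,
               elements: Seq[AnyRef]): String =
    conn.createArrayOf(databaseTypeName, elements.toArray)
}
```

Whatever string the dialect produces for the element type is forwarded verbatim, which is why the choice of databaseTypeName decides whether the driver accepts the write.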

@maxim-lixakov
Contributor Author

no, it first converts in scala for arrays:

It is important to have a full stacktrace. conn.createArrayOf(typeName, elements) expects typeName to be a SQL database-specific type name, not a Java type: https://docs.oracle.com/javase/7/docs/api/java/sql/Connection.html#createArrayOf(java.lang.String,%20java.lang.Object[]) So currently I don't understand where this exception is being raised.

org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 43.0 failed 1 times, most recent failure: Lost task 0.0 in stage 43.0 (TID 34) (178.248.237.216 executor driver): java.lang.IllegalArgumentException: Unknown data type: int16
[info]  at com.clickhouse.data.ClickHouseDataType.of(ClickHouseDataType.java:238)
[info]  at com.clickhouse.data.ClickHouseColumn.readColumn(ClickHouseColumn.java:488)
[info]  at com.clickhouse.data.ClickHouseColumn.of(ClickHouseColumn.java:537)
[info]  at com.clickhouse.jdbc.ClickHouseConnection.createArrayOf(ClickHouseConnection.java:46)
[info]  at com.clickhouse.jdbc.ClickHouseConnection.createArrayOf(ClickHouseConnection.java:27)
[info]  at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.$anonfun$makeSetter$16(JdbcUtils.scala:643)
[info]  at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.$anonfun$makeSetter$16$adapted(JdbcUtils.scala:640)
[info]  at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.savePartition(JdbcUtils.scala:738)
[info]  at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.$anonfun$saveTable$1(JdbcUtils.scala:902)
[info]  at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.$anonfun$saveTable$1$adapted(JdbcUtils.scala:901)
[info]  at org.apache.spark.rdd.RDD.$anonfun$foreachPartition$2(RDD.scala:1039)
[info]  at org.apache.spark.rdd.RDD.$anonfun$foreachPartition$2$adapted(RDD.scala:1039)
[info]  at org.apache.spark.SparkContext.$anonfun$runJob$5(SparkContext.scala:2438)
[info]  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:93)
[info]  at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:166)
[info]  at org.apache.spark.scheduler.Task.run(Task.scala:141)
[info]  at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$4(Executor.scala:620)
[info]  at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally(SparkErrorUtils.scala:64)
[info]  at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally$(SparkErrorUtils.scala:61)
[info]  at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:94)
[info]  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:623)
[info]  at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
[info]  at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
[info]  at java.base/java.lang.Thread.run(Thread.java:829)
[info] 
[info] Driver stacktrace:
[info]   at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2856)
[info]   at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2792)
[info]   at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2791)
[info]   at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
[info]   at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
[info]   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
[info]   at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2791)
[info]   at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1247)
[info]   at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1247)
[info]   at scala.Option.foreach(Option.scala:407)
[info]   ...
[info]   Cause: java.lang.IllegalArgumentException: Unknown data type: int16
[info]   at com.clickhouse.data.ClickHouseDataType.of(ClickHouseDataType.java:238)
[info]   at com.clickhouse.data.ClickHouseColumn.readColumn(ClickHouseColumn.java:488)
[info]   at com.clickhouse.data.ClickHouseColumn.of(ClickHouseColumn.java:537)
[info]   at com.clickhouse.jdbc.ClickHouseConnection.createArrayOf(ClickHouseConnection.java:46)
[info]   at com.clickhouse.jdbc.ClickHouseConnection.createArrayOf(ClickHouseConnection.java:27)
[info]   at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.$anonfun$makeSetter$16(JdbcUtils.scala:643)
[info]   at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.$anonfun$makeSetter$16$adapted(JdbcUtils.scala:640)
[info]   at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.savePartition(JdbcUtils.scala:738)
[info]   at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.$anonfun$saveTable$1(JdbcUtils.scala:902)
[info]   at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.$anonfun$saveTable$1$adapted(JdbcUtils.scala:901)

@maxim-lixakov maxim-lixakov force-pushed the feature/DOP-18631 branch 2 times, most recently from 1a55bb6 to 35f8e8e Compare August 5, 2024 12:04
Member

@dolfinus dolfinus left a comment


New type mapping should be mentioned in documentation

@maxim-lixakov maxim-lixakov merged commit 22e74eb into develop Aug 5, 2024
2 checks passed
@maxim-lixakov maxim-lixakov deleted the feature/DOP-18631 branch August 5, 2024 13:39