Databricks read from redshift first selected column returns wrong value if quotes used #467

HansHerrlich · 2022-08-02T11:18:51Z

Hi folks,
when issuing a query against redshift from databricks there seems to be an issue with how quotes inside an sql string
are handled.

The following set up on Redshift:

CREATE TABLE IF NOT EXISTS tmp_test
(
	my_new_varchar VARCHAR(128)   ENCODE lzo
	,some_text VARCHAR(256)   ENCODE lzo
	,some_decimal NUMERIC(12,6)   ENCODE az64
	,int_col INTEGER   ENCODE az64
	,real_col REAL   ENCODE RAW
	,double_col DOUBLE PRECISION   ENCODE RAW
	,smal_int_col SMALLINT   ENCODE az64
	,t_with_timezone TIMESTAMP WITH TIME ZONE   ENCODE az64
	,t_without_timezone TIMESTAMP WITHOUT TIME ZONE   ENCODE az64
	,dwh_created_at TIMESTAMP WITHOUT TIME ZONE  DEFAULT now() ENCODE az64
)
DISTSTYLE AUTO;


INSERT INTO tmp_test
(my_new_varchar, some_text, some_decimal, int_col, real_col, double_col, smal_int_col, t_with_timezone, t_without_timezone, dwh_created_at)
VALUES('first_val', 'second_val', 123.123, 123456, 12.12, 0.233234, 123, null, null, null);

commit;

on databricks I run the following command afterwards:

val query = """
 Select max('my_new_varchar') as column_name, max(coalesce(length(my_new_varchar),0)) as length_or_numeric_precision, max(0) as numeric_scale FROM tmp_test  union 
 Select max('some_text') as column_name, max(coalesce(length(some_text),0)) as length_or_numeric_precision, max(0) as numeric_scale FROM tmp_test  union 
 Select max('some_decimal') as column_name ,max(length(cast(some_decimal as varchar))) as length_or_numeric_precision, max(position('.' in some_decimal))  as numeric_scale FROM tmp_test  union 
 Select max('int_col') as column_name ,max(length(cast(int_col as varchar))) as length_or_numeric_precision, max(position('.' in int_col))  as numeric_scale FROM tmp_test  union 
 Select max('real_col') as column_name ,max(length(cast(real_col as varchar))) as length_or_numeric_precision, max(position('.' in real_col))  as numeric_scale FROM tmp_test  union 
 Select max('double_col') as column_name ,max(length(cast(double_col as varchar))) as length_or_numeric_precision, max(position('.' in double_col))  as numeric_scale FROM tmp_test  union 
 Select max('smal_int_col') as column_name ,max(length(cast(smal_int_col as varchar))) as length_or_numeric_precision, max(position('.' in smal_int_col))  as numeric_scale FROM tmp_test
""".stripMargin

val maxRedDataDF = spark.read
       .format("com.databricks.spark.redshift")
       .option("url", redUrl)
       .option("tempdir", tempdir)
       .option("forward_spark_s3_credentials", "true")
       .option("autoenablessl", "false")
       .option("query", query)
       .load()

display(maxRedDataDF)

As you can see for the column my_new_varchar it does not return the column_name as for the rest of the columns in the query but instead the max values inside this column e.g. first_val

A dirty quick fix is to modify the query by prepanding the following line:

Select max('''tester''') as column_name, max(0) as length_or_numeric_precision, max(0) as numeric_scale  union

Now the result for the column my_new_varchar is returned as expected. But column_name tester
now is returned as tester' (see single quotation mark at string ending) if you do not use triple quotation
marks around tester but single once it gives error of column not found (Redshift) case of double quotation marks it gives sql exception (databricks).

Hope the explanation is detailed enough, if not please let me know.
Regards,
Hans

The text was updated successfully, but these errors were encountered:

HansHerrlich changed the title ~~Databricks read from redshift first row wrong value~~ Databricks read from redshift first selected column returns wrong value if quotes used Aug 2, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Databricks read from redshift first selected column returns wrong value if quotes used #467

Databricks read from redshift first selected column returns wrong value if quotes used #467

HansHerrlich commented Aug 2, 2022 •

edited

Loading

Databricks read from redshift first selected column returns wrong value if quotes used #467

Databricks read from redshift first selected column returns wrong value if quotes used #467

Comments

HansHerrlich commented Aug 2, 2022 • edited Loading

HansHerrlich commented Aug 2, 2022 •

edited

Loading