Hi folks,
When issuing a query against Redshift from Databricks, there seems to be an issue with how quotes inside a SQL string are handled.
The setup on Redshift is the following:
CREATE TABLE IF NOT EXISTS tmp_test
(
my_new_varchar VARCHAR(128) ENCODE lzo
,some_text VARCHAR(256) ENCODE lzo
,some_decimal NUMERIC(12,6) ENCODE az64
,int_col INTEGER ENCODE az64
,real_col REAL ENCODE RAW
,double_col DOUBLE PRECISION ENCODE RAW
,smal_int_col SMALLINT ENCODE az64
,t_with_timezone TIMESTAMP WITH TIME ZONE ENCODE az64
,t_without_timezone TIMESTAMP WITHOUT TIME ZONE ENCODE az64
,dwh_created_at TIMESTAMP WITHOUT TIME ZONE DEFAULT now() ENCODE az64
)
DISTSTYLE AUTO;
INSERT INTO tmp_test
(my_new_varchar, some_text, some_decimal, int_col, real_col, double_col, smal_int_col, t_with_timezone, t_without_timezone, dwh_created_at)
VALUES('first_val', 'second_val', 123.123, 123456, 12.12, 0.233234, 123, null, null, null);
commit;
On Databricks I then run the following command:
val query = """
Select max('my_new_varchar') as column_name, max(coalesce(length(my_new_varchar),0)) as length_or_numeric_precision, max(0) as numeric_scale FROM tmp_test union
Select max('some_text') as column_name, max(coalesce(length(some_text),0)) as length_or_numeric_precision, max(0) as numeric_scale FROM tmp_test union
Select max('some_decimal') as column_name ,max(length(cast(some_decimal as varchar))) as length_or_numeric_precision, max(position('.' in some_decimal)) as numeric_scale FROM tmp_test union
Select max('int_col') as column_name ,max(length(cast(int_col as varchar))) as length_or_numeric_precision, max(position('.' in int_col)) as numeric_scale FROM tmp_test union
Select max('real_col') as column_name ,max(length(cast(real_col as varchar))) as length_or_numeric_precision, max(position('.' in real_col)) as numeric_scale FROM tmp_test union
Select max('double_col') as column_name ,max(length(cast(double_col as varchar))) as length_or_numeric_precision, max(position('.' in double_col)) as numeric_scale FROM tmp_test union
Select max('smal_int_col') as column_name ,max(length(cast(smal_int_col as varchar))) as length_or_numeric_precision, max(position('.' in smal_int_col)) as numeric_scale FROM tmp_test
""".stripMargin
val maxRedDataDF = spark.read
.format("com.databricks.spark.redshift")
.option("url", redUrl)
.option("tempdir", tempdir)
.option("forward_spark_s3_credentials", "true")
.option("autoenablessl", "false")
.option("query", query)
.load()
display(maxRedDataDF)
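For context, my assumption (not confirmed from the connector's source) is that spark-redshift reads data by embedding the user query as a string literal inside a Redshift UNLOAD ('...') statement, so every single quote in the query must be escaped by doubling it, and a missed or double-applied escape at that step would explain the behavior below. A minimal sketch of that escaping, with `UnloadEscapeSketch`, `escape`, and `buildUnload` being hypothetical names of mine:

```scala
// Sketch (assumption about the connector's internals): a query containing
// single quotes must have them doubled before it can sit inside the
// single-quoted literal of Redshift's UNLOAD ('...') statement.
object UnloadEscapeSketch {
  // Double every single quote, the standard SQL way to escape a quote
  // inside a single-quoted string literal.
  def escape(q: String): String = q.replace("'", "''")

  // Build an UNLOAD statement around the escaped query
  // (remaining UNLOAD options elided).
  def buildUnload(q: String, s3Path: String): String =
    s"UNLOAD ('${escape(q)}') TO '$s3Path' ..."

  def main(args: Array[String]): Unit = {
    val q = "Select max('my_new_varchar') as column_name FROM tmp_test"
    println(buildUnload(q, "s3://bucket/prefix/"))
  }
}
```

If the connector applies this doubling inconsistently across the first and subsequent literals, the symptoms described next would follow.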
As you can see, for the column my_new_varchar the query does not return the column_name literal as it does for the rest of the columns, but instead the max value inside that column, e.g. first_val.
A dirty quick fix is to modify the query by prepending the following line:
Select max('''tester''') as column_name, max(0) as length_or_numeric_precision, max(0) as numeric_scale union
Now the result for the column my_new_varchar is returned as expected, but the column_name tester is now returned as tester' (note the single quotation mark at the end of the string). If you use single quotation marks around tester instead of triple ones, Redshift reports a column-not-found error; with double quotation marks, Databricks throws a SQL exception.
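For clarity on what those triple quotes mean: SQL escapes a single quote inside a string literal by doubling it, so the literal '''tester''' actually denotes the value 'tester' (with surrounding quotes). The trailing tester' in the result therefore suggests one level of quoting is being stripped inconsistently somewhere. A tiny runnable sketch of the standard doubling rule (`SqlQuoteSketch` and its methods are hypothetical names of mine, not connector APIs):

```scala
// Sketch: the standard SQL rule, where a quote inside a single-quoted
// literal is written as two quotes. Decoding the body of '''tester'''
// (i.e. ''tester'' once the outer quotes are removed) yields 'tester'.
object SqlQuoteSketch {
  // Decode the body of a single-quoted SQL literal (outer quotes removed).
  def decodeBody(body: String): String = body.replace("''", "'")

  // Encode a value as the body of a single-quoted SQL literal.
  def encodeBody(value: String): String = value.replace("'", "''")
}
```

Under this rule Redshift should return 'tester' (quotes included) for that column, so the observed tester' points at the quote handling rather than at the query itself.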
I hope the explanation is detailed enough; if not, please let me know.
Regards,
Hans
HansHerrlich changed the title from "Databricks read from redshift first row wrong value" to "Databricks read from redshift first selected column returns wrong value if quotes used" on Aug 2, 2022.