Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Databricks read from redshift first selected column returns wrong value if quotes used #467

Open
HansHerrlich opened this issue Aug 2, 2022 · 0 comments

Comments

@HansHerrlich
Copy link

HansHerrlich commented Aug 2, 2022

Hi folks,
when issuing a query against redshift from databricks there seems to be an issue with how quotes inside an sql string
are handled.

The following set up on Redshift:

CREATE TABLE IF NOT EXISTS tmp_test
(
	my_new_varchar VARCHAR(128)   ENCODE lzo
	,some_text VARCHAR(256)   ENCODE lzo
	,some_decimal NUMERIC(12,6)   ENCODE az64
	,int_col INTEGER   ENCODE az64
	,real_col REAL   ENCODE RAW
	,double_col DOUBLE PRECISION   ENCODE RAW
	,smal_int_col SMALLINT   ENCODE az64
	,t_with_timezone TIMESTAMP WITH TIME ZONE   ENCODE az64
	,t_without_timezone TIMESTAMP WITHOUT TIME ZONE   ENCODE az64
	,dwh_created_at TIMESTAMP WITHOUT TIME ZONE  DEFAULT now() ENCODE az64
)
DISTSTYLE AUTO;


INSERT INTO tmp_test
(my_new_varchar, some_text, some_decimal, int_col, real_col, double_col, smal_int_col, t_with_timezone, t_without_timezone, dwh_created_at)
VALUES('first_val', 'second_val', 123.123, 123456, 12.12, 0.233234, 123, null, null, null);

commit;

on databricks I run the following command afterwards:

val query = """
 Select max('my_new_varchar') as column_name, max(coalesce(length(my_new_varchar),0)) as length_or_numeric_precision, max(0) as numeric_scale FROM tmp_test  union 
 Select max('some_text') as column_name, max(coalesce(length(some_text),0)) as length_or_numeric_precision, max(0) as numeric_scale FROM tmp_test  union 
 Select max('some_decimal') as column_name ,max(length(cast(some_decimal as varchar))) as length_or_numeric_precision, max(position('.' in some_decimal))  as numeric_scale FROM tmp_test  union 
 Select max('int_col') as column_name ,max(length(cast(int_col as varchar))) as length_or_numeric_precision, max(position('.' in int_col))  as numeric_scale FROM tmp_test  union 
 Select max('real_col') as column_name ,max(length(cast(real_col as varchar))) as length_or_numeric_precision, max(position('.' in real_col))  as numeric_scale FROM tmp_test  union 
 Select max('double_col') as column_name ,max(length(cast(double_col as varchar))) as length_or_numeric_precision, max(position('.' in double_col))  as numeric_scale FROM tmp_test  union 
 Select max('smal_int_col') as column_name ,max(length(cast(smal_int_col as varchar))) as length_or_numeric_precision, max(position('.' in smal_int_col))  as numeric_scale FROM tmp_test
""".stripMargin

val maxRedDataDF = spark.read
       .format("com.databricks.spark.redshift")
       .option("url", redUrl)
       .option("tempdir", tempdir)
       .option("forward_spark_s3_credentials", "true")
       .option("autoenablessl", "false")
       .option("query", query)
       .load()

display(maxRedDataDF)

As you can see for the column my_new_varchar it does not return the column_name as for the rest of the columns in the query but instead the max values inside this column e.g. first_val

image

A dirty quick fix is to modify the query by prepanding the following line:

Select max('''tester''') as column_name, max(0) as length_or_numeric_precision, max(0) as numeric_scale  union

Now the result for the column my_new_varchar is returned as expected. But column_name tester
now is returned as tester' (see single quotation mark at string ending) if you do not use triple quotation
marks around tester but single once it gives error of column not found (Redshift) case of double quotation marks it gives sql exception (databricks).

Hope the explanation is detailed enough, if not please let me know.
Regards,
Hans

@HansHerrlich HansHerrlich changed the title Databricks read from redshift first row wrong value Databricks read from redshift first selected column returns wrong value if quotes used Aug 2, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant