What is the bug?
The fillnull command throws an AMBIGUOUS_REFERENCE exception when the datatype of null_replacement differs from that of null_fields (even though they are compatible and Spark's Analyzer would coerce them to the same datatype).
This is actually a latent bug for all cases, but most of them are currently masked by another Spark bug: https://issues.apache.org/jira/browse/SPARK-49782. After upgrading to a Spark version that includes that fix (likely Spark 3.5.4 or later), the AMBIGUOUS_REFERENCE exception will be thrown consistently across a wide range of scenarios, not just the specific cases mentioned above.
How can one reproduce the bug?
Steps to reproduce the behavior:
Create a table with a column of LONG type
create table test (id INT, longV LONG) using CSV OPTIONS (header 'false', delimiter '\t');
Insert a row with null in column longV into that table
insert into test values (1, null);
Run fillnull with null_replacement set to 0 (parsed as an integer-typed Literal)
source=test | fields longV | eval originalLongV = longV | fillnull with 0 in longV;
It throws the exception:
[AMBIGUOUS_REFERENCE] Reference `longV` is ambiguous, could be: [`longV`, `spark_catalog`.`default`.`test`.`longV`].
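For comparison, the same coalesce over a mismatched literal runs fine through plain Spark SQL, because the Analyzer widens the integer literal to the column's BIGINT type. A minimal spark-shell check (illustrative; the exact output is what I would expect on Spark 3.5):
// Illustrative check that the Analyzer coerces the integer literal 0 to the
// column's LONG type; plain Spark SQL handles the expression itself fine,
// so only the PPL-generated plan shape is at fault.
spark.sql("SELECT coalesce(longV, 0) AS longV FROM test").printSchema()
// expected: longV resolves to LongType (the literal 0 is widened to BIGINT)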
What is the expected behavior?
It should run successfully.
The root cause is that we convert the PPL into an ambiguous plan at 'DataFrameDropColumns ['longV]. Spark cannot resolve longV because there are two longV attributes: one from its child Project and another from its grandchild, which derives from the table.
The reason most cases work is a bug in the Spark rule ResolveDataFrameDropColumns: it resolves DataFrameDropColumns's expressions against its grandchildren instead of its children (which is incorrect), so there is only one longV among the grandchildren and no ambiguity arises. In the specific case of a datatype mismatch, however, the plan first goes through the type-coercion rules to align the datatypes, and then through the rule ResolveReferences, which does not have a similar bug.
So it's actually a case of two wrongs making a right.
plan
== Parsed Logical Plan ==
'Project [*]
+- 'DataFrameDropColumns ['longV]
+- 'Project [*, 'coalesce('longV, 0) AS longV#1]
+- 'Project [*, 'longV AS originalLong#0]
+- 'Project ['longV]
+- 'UnresolvedRelation [test4], [], false
== Analyzed Logical Plan ==
org.apache.spark.sql.AnalysisException: [AMBIGUOUS_REFERENCE] Reference `longV` is ambiguous, could be: [`longV`, `spark_catalog`.`default`.`test4`.`longV`].
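For reference, a rough Catalyst-level sketch of the plan shape described above, built by hand. This is for illustration only: alias names are taken from the PPL query, and constructor details may differ slightly between Spark versions.
import org.apache.spark.sql.catalyst.analysis.{UnresolvedAttribute, UnresolvedRelation, UnresolvedStar}
import org.apache.spark.sql.catalyst.expressions.{Alias, Coalesce, Literal}
import org.apache.spark.sql.catalyst.plans.logical.{DataFrameDropColumns, LogicalPlan, Project}

val relation  = UnresolvedRelation(Seq("test"))
// fields longV
val fieldsProj = Project(Seq(UnresolvedAttribute("longV")), relation)
// eval originalLongV = longV
val evalProj = Project(Seq(UnresolvedStar(None),
  Alias(UnresolvedAttribute("longV"), "originalLongV")()), fieldsProj)
// fillnull with 0 in longV  (integer literal vs LONG column)
val fillnullProj = Project(Seq(UnresolvedStar(None),
  Alias(Coalesce(Seq(UnresolvedAttribute("longV"), Literal(0))), "longV")()), evalProj)
// `fillnullProj` outputs two attributes named `longV` (the original via `*` plus the alias),
// so dropping the column by its unresolved name is ambiguous once the drop list is resolved
// against the direct child, as described in the root-cause section above.
val dropped: LogicalPlan = DataFrameDropColumns(Seq(UnresolvedAttribute("longV")), fillnullProj)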
What is your host/environment?
OS: mac
Version: opensearch-spark 0.7.0, spark 3.5.3
Plugins
Do you have any screenshots?
If applicable, add screenshots to help explain your problem.
Do you have any additional context?
Add any other context about the problem.