[SPARK-49618][SQL] Union and UnionExec node equality does not account for reordered branches, causing cache misses and preventing exchange reuse #48094
What changes were proposed in this pull request?
A trait, UnionEquality, is introduced and implemented by the Union and UnionExec nodes. It checks equality of the Union node's legs in an order-agnostic manner and computes a hashCode that is independent of the order of the legs. Equality still requires the output attributes of the head nodes to match in name, datatype, metadata, nullability, etc. (but not in exprIDs).
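The idea of order-agnostic equality can be illustrated with a minimal, Spark-free sketch. `Leg` and `ToyUnion` below are hypothetical stand-ins, not the classes touched by this PR, and comparing legs as a multiset (leg plus occurrence count) is an assumption about one reasonable way to implement the check, not necessarily how UnionEquality does it.

```scala
object UnionEqualitySketch {
  // Hypothetical stand-in for a Union leg; here a leg is identified only by its name.
  final case class Leg(name: String)

  // Hypothetical stand-in for a Union node holding its legs in order.
  final class ToyUnion(val children: Seq[Leg]) {
    // Count how often each leg occurs, so duplicate legs are respected.
    private def legCounts: Map[Leg, Int] =
      children.groupBy(identity).map { case (leg, occurrences) => leg -> occurrences.size }

    // Two unions are equal when they have the same number of legs
    // and the same legs with the same multiplicities, in any order.
    override def equals(other: Any): Boolean = other match {
      case that: ToyUnion =>
        children.length == that.children.length && legCounts == that.legCounts
      case _ => false
    }

    // Hash over the set of legs: independent of leg order and consistent with equals.
    override def hashCode(): Int = children.toSet.hashCode()
  }
}
```

With this sketch, `new ToyUnion(Seq(Leg("a"), Leg("b")))` and `new ToyUnion(Seq(Leg("b"), Leg("a")))` compare equal and share a hashCode, while a union with a duplicated leg is not equal to either.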
Converting the sequence of legs into a Set to obtain an order-agnostic hashCode does mean that
Seq(leg1, leg2) and Seq(leg1, leg2, leg2) produce the same hashCode, but that does not cause a logical problem because equality also checks the number of legs.
If we want to avoid the hash collision in that situation, the hashCode can instead be computed over the sorted hash codes of the legs:
java.util.Objects.hash(Seq(leg1, leg2).map(_.hashCode).sorted.map(Int.box): _*)
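The difference between the two hashing strategies can be seen in a small sketch. Legs are modelled as plain strings here, which is purely an assumption for illustration, and the helper names `setBasedHash` and `sortedHash` are hypothetical. The sketch uses `java.util.Objects.hash`, the varargs method, and boxes each Int so that the sequence can be passed as `Object...`.

```scala
import java.util.Objects

object LegHashSketch {
  // Order-agnostic hash via Set: duplicate legs collapse, so the leg count is lost.
  def setBasedHash(legs: Seq[String]): Int = legs.toSet.hashCode()

  // Order-agnostic hash via sorted per-leg hash codes: duplicates are preserved.
  def sortedHash(legs: Seq[String]): Int =
    Objects.hash(legs.map(_.hashCode).sorted.map(h => Int.box(h)): _*)
}
```

Both functions ignore leg order. Only `setBasedHash` is guaranteed to give `Seq("leg1", "leg2")` and `Seq("leg1", "leg2", "leg2")` the same value; `sortedHash` keeps the duplicate in the input and so will normally separate them, at the cost of sorting on every call.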
Why are the changes needed?
With the current equality of Union nodes, changing the order of the legs causes cache misses and prevents exchange reuse, because the canonicalized plans no longer match.
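The effect on lookups can be mimicked with an ordinary map, as a rough analogy only. The caches below are hypothetical and keyed by leg names rather than by canonicalized plans, and sorting the key is a swapped-in normalization for illustration, not the mechanism this PR uses.

```scala
object CacheLookupSketch {
  // Hypothetical cache keyed by the sequence of leg names; Seq equality is order-sensitive.
  val orderSensitiveCache: Map[Seq[String], String] =
    Map(Seq("leg1", "leg2") -> "cached result")

  // Normalize a key by sorting the legs, so that leg order no longer matters.
  def normalize(legs: Seq[String]): Seq[String] = legs.sorted

  // The same entry stored under a normalized key.
  val orderAgnosticCache: Map[Seq[String], String] =
    Map(normalize(Seq("leg1", "leg2")) -> "cached result")

  // Looking up the same legs in a different order misses in the first cache...
  val miss: Option[String] = orderSensitiveCache.get(Seq("leg2", "leg1"))

  // ...and hits in the second one.
  val hit: Option[String] = orderAgnosticCache.get(normalize(Seq("leg2", "leg1")))
}
```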
Does this PR introduce any user-facing change?
No
How was this patch tested?
Added tests to check the equality of Union and UnionExec nodes with unaligned order of the legs.
Added test to verify cache lookup of InMemoryRelation and reuse of exchange.
Was this patch authored or co-authored using generative AI tooling?
No