
[SPARK-50372][CONNECT][SQL] Make all DF execution path collect observed metrics #48920

Open · wants to merge 4 commits into master

Conversation

Contributor

@xupefei xupefei commented Nov 21, 2024

What changes were proposed in this pull request?

This PR fixes an issue where some DataFrame execution paths would not process ObservedMetrics. The fix injects lazy processing logic into the result iterator.
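The idea of "lazy processing injected into the result iterator" can be sketched with a toy model: wrap the row iterator so that metrics handling runs exactly once, when the stream is drained, regardless of which execution path consumes it. All names here are illustrative, not the actual Spark Connect client API.

```python
class ObservedMetricsIterator:
    """Toy wrapper: process observed metrics lazily, once, on exhaustion."""

    def __init__(self, rows, metrics_batches, on_metrics):
        self._rows = iter(rows)
        self._metrics_batches = metrics_batches  # metrics interleaved in the response stream
        self._on_metrics = on_metrics            # callback that fulfils Observation objects
        self._processed = False

    def __iter__(self):
        return self

    def __next__(self):
        try:
            return next(self._rows)
        except StopIteration:
            self._process_metrics()  # runs once, when the stream is drained
            raise

    def _process_metrics(self):
        if not self._processed:
            self._processed = True
            for batch in self._metrics_batches:
                self._on_metrics(batch)


collected = []
it = ObservedMetricsIterator([1, 2, 3], [{"max": 3}], collected.append)
rows = list(it)  # draining the iterator triggers metrics processing
```

Because the processing hangs off iterator exhaustion rather than one specific entry point, every path that consumes results picks it up for free.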

This PR also fixes a server-side issue where two observed metrics could be assigned to the same Plan ID when they appear in the same plan (e.g., one observation is used as the input of another). The fix traverses the plan and finds all observations with their correct IDs.
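The traversal part of the fix can be sketched as a tree walk that pairs every CollectMetrics node with its own plan ID, instead of assuming one observation per plan. The node classes below are stand-ins, not Spark's actual Catalyst/Connect classes.

```python
class Node:
    """Minimal plan node with children."""

    def __init__(self, *children):
        self.children = list(children)


class CollectMetrics(Node):
    """Stand-in for a plan node that carries an observation."""

    def __init__(self, name, plan_id, *children):
        super().__init__(*children)
        self.name = name
        self.plan_id = plan_id


def collect_observations(plan):
    """Return {plan_id: metric_name} for every CollectMetrics in the tree."""
    found = {}
    stack = [plan]
    while stack:
        node = stack.pop()
        if isinstance(node, CollectMetrics):
            found[node.plan_id] = node.name
        stack.extend(node.children)
    return found


# One observation feeding another: a CollectMetrics over a CollectMetrics.
inner = CollectMetrics("inner_obs", 1, Node())
outer = CollectMetrics("outer_obs", 2, inner)
```

Walking the whole tree means nested observations each keep their own plan ID rather than colliding on one.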

Another bug is discovered as a byproduct of introducing a new test case. Copying the PR comment here from SparkConnectPlanner.scala:

This fixes a bug where the input of a CollectMetrics can be processed two times, once in Line 1190 and once here/below.

When the input contains another CollectMetrics, transforming it twice will cause two Observation objects (in the input) to be initialised and registered twice with the system. Since only one of them will be fulfilled when the query finishes, the one we end up reading may not have any data.

This issue is highlighted in the test case Observation.get is blocked until the query is finished ..., where we specifically execute observedObservedDf, which is a CollectMetrics that has another CollectMetrics as its input.
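The double-transform bug described above can be reproduced with a toy planner: if the input subtree is transformed twice, each pass registers a fresh Observation for the same CollectMetrics, and only one copy will ever be fulfilled. The registry and transform function are illustrative stand-ins for the server-side logic, not Spark's real code.

```python
registry = []  # stand-in for the system-wide Observation registry


def transform(rel):
    """Simplified planner: registers one Observation per CollectMetrics visited."""
    if rel["type"] == "collect_metrics":
        transform(rel["input"])    # transform the child subtree
        registry.append(object())  # register a new Observation for this node
    return rel


# A CollectMetrics whose input is another CollectMetrics (observedObservedDf shape).
observed_observed = {
    "type": "collect_metrics",
    "input": {"type": "collect_metrics", "input": {"type": "scan"}},
}


def buggy_transform(rel):
    transform(rel["input"])  # input subtree processed once here...
    return transform(rel)    # ...and again as part of the full transform


buggy_transform(observed_observed)
# Two observations in the plan, but three registrations: the inner
# CollectMetrics was registered twice, and only one copy gets fulfilled.
```

The fix in this PR amounts to transforming the input exactly once and reusing the result, so each observation is registered a single time.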

Why are the changes needed?

To fix a bug.

Does this PR introduce any user-facing change?

Yes, this bug is user-facing.

How was this patch tested?

New tests.

Was this patch authored or co-authored using generative AI tooling?

No.

- CollectMetrics(name, metrics.map(_.named), transformRelation(rel.getInput), planId)
+ CollectMetrics(name, metrics.map(_.named), input, planId)
Contributor Author

@xupefei xupefei Nov 21, 2024


This fixes a bug where the input of a CollectMetrics can be processed two times, once in Line 1190 and once here/below.

When the input contains another CollectMetrics, transforming it twice will cause two Observation objects (in the input) to be initialised and registered twice with the system. Since only one of them will be fulfilled when the query finishes, the one we end up reading may not have any data.

This issue is highlighted in the test case Observation.get is blocked until the query is finished ..., where we specifically execute observedObservedDf, which is a CollectMetrics that has another CollectMetrics as its input.

Contributor

@vicennial vicennial left a comment


Thanks for fixing! Mostly LGTM, minor ask

@HyukjinKwon HyukjinKwon changed the title [SPARK-50372][Connect][SQL] Make all DF execution path collect observed metrics [SPARK-50372][CONNECT][SQL] Make all DF execution path collect observed metrics Nov 21, 2024
Member

@HyukjinKwon HyukjinKwon left a comment


Looks good but mind listing up affected API in the PR description?
