Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix inconsistent shuffle write time sum results in Profiler output #1450

Open
wants to merge 1 commit into
base: dev
Choose a base branch
from

Conversation

cindyyuanjiang
Copy link
Collaborator

@cindyyuanjiang cindyyuanjiang commented Dec 5, 2024

Fixes #1408

Changes

  • In TaskModel class, keep using nanoseconds for shuffle write time
  • Convert these into milliseconds when generating output

This improves shuffle write time metrics output to avoid potential precision lost from converting nanoseconds to milliseconds and then taking the sum of the converted numbers. It also separates TaskModel and output reporting so that we know all metrics are in their original values before output generation.

Testing

  • Existing unit tests
  • Manually confirm the shuffle write time value is consistent in all places in Profiler output

Before/After Values (shuffle write time sum)

core/src/test/resources/ProfilingExpectations/rapids_join_eventlog_jobmetricsaggmulti_expectation.csv
944 --> 1001
849 --> 901

core/src/test/resources/ProfilingExpectations/rapids_join_eventlog_sqlmetricsaggmulti_expectation.csv
944 --> 1001
849 --> 901

core/src/test/resources/ProfilingExpectations/rapids_join_eventlog_stagemetricsaggmulti_expectation.csv
397 --> 400
505 --> 508
42 --> 93
373 --> 376
473 --> 475
3 --> 50

@cindyyuanjiang cindyyuanjiang self-assigned this Dec 5, 2024
@cindyyuanjiang cindyyuanjiang added bug Something isn't working core_tools Scope the core module (scala) labels Dec 5, 2024
Copy link
Collaborator

@nartal1 nartal1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks @cindyyuanjiang for the fix!
Nit: It would be nice to include the before and after values in the description. I understand that we can confirm the fix from the expected_files.

parthosa

This comment was marked as duplicate.

Copy link
Collaborator

@parthosa parthosa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @cindyyuanjiang for this change.

@amahussein Unrelated but should we have similar approach for executorCpuTime and executorDeserializeCpuTime?

@cindyyuanjiang
Copy link
Collaborator Author

Thanks @nartal1! Updated the before/after values in PR description.

@cindyyuanjiang
Copy link
Collaborator Author

cindyyuanjiang commented Dec 9, 2024

Thanks @parthosa! Agree we should discuss the requirements for executorCpuTime and executorDeserializeCpuTime.

@amahussein
Copy link
Collaborator

Thanks @cindyyuanjiang for this change.

@amahussein Unrelated but should we have similar approach for executorCpuTime and executorDeserializeCpuTime?

Thanks @parthosa. Yes, it would have been better to fix the inconsistency for other metrics within the is very PR since the change is not big compared to the overhead we would have to go through filing another bug then dealing with the a new PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working core_tools Scope the core module (scala)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[BUG] Profiler output shows inconsistent shuffleWriteTime results
4 participants