[FEA] Add stage/task level diagnostic output for GPU slowness in Profiler tool #1375

cindyyuanjiang · 2024-10-04T22:13:32Z

Contributes to #1374

Changes

- Add a stage-level diagnostic view to the Profiler output: stage_level_diagnostic_metrics.csv
  - Add the diagnostic metrics results under AggRawMetricsResult
  - Update AppSQLPlanAnalyzer to store mappings from stage IDs to node names and to GPU semaphore wait time
- Add a unit test in core/src/test/scala/com/nvidia/spark/rapids/tool/profiling/AnalysisSuite.scala to verify the diagnostic CSV file output
- Clean up redundant function definitions in core/src/main/scala/com/nvidia/spark/rapids/tool/profiling/ProfileClassWarehouse.scala

Testing CMD

spark_rapids profiling -e <my_event_log> -t <my_tools_jar>

Output Example

+--------+-----------+-------------------+-------+---------------+--------+-----------------------+--------------------------+-----------------------+-------------------------+---------------------+------------------------+---------------------+-----------------------+-----------------+--------------------+-----------------+-------------------+---------------------+------------------------+---------------------+-----------------------+-------------------+----------------------+-------------------+---------------------+--------------------+-----------------------+--------------------+----------------------+---------------------------+------------------------------+---------------------------+-----------------------------+------------------------+---------------------------+------------------------+--------------------------+-------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|appIndex|appName    |appId              |stageId|stageDurationMs|numTasks|memoryBytesSpilledMBMin|memoryBytesSpilledMBMedian|memoryBytesSpilledMBMax|memoryBytesSpilledMBTotal|diskBytesSpilledMBMin|diskBytesSpilledMBMedian|diskBytesSpilledMBMax|diskBytesSpilledMBTotal|inputBytesReadMin|inputBytesReadMedian|inputBytesReadMax|inputBytesReadTotal|outputBytesWrittenMin|outputBytesWrittenMedian|outputBytesWrittenMax|outputBytesWrittenTotal|shuffleReadBytesMin|shuffleReadBytesMedian|shuffleReadBytesMax|shuffleReadBytesTotal|shuffleWriteBytesMin|shuffleWriteBytesMedian|shuffleWriteBytesMax|shuffleWriteBytesTotal|shuffleReadFetchWaitTimeMin|shuffleReadFetchWaitTimeMedian|shuffleReadFetchWaitTimeMax|shuffleReadFetchWaitTimeTotal|shuffleWriteWriteTimeMin|shuffleWriteWriteTimeMedian|shuffleWriteWriteTimeMax|shuffleWriteWriteTimeTotal|gpuSemaphoreWaitTimeTotal|SQL Nodes(IDs)                                                                                                                                                                                    |
+--------+-----------+-------------------+-------+---------------+--------+-----------------------+--------------------------+-----------------------+-------------------------+---------------------+------------------------+---------------------+-----------------------+-----------------+--------------------+-----------------+-------------------+---------------------+------------------------+---------------------+-----------------------+-------------------+----------------------+-------------------+---------------------+--------------------+-----------------------+--------------------+----------------------+---------------------------+------------------------------+---------------------------+-----------------------------+------------------------+---------------------------+------------------------+--------------------------+-------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|1       |Spark shell|local-1622814619968|0      |1743           |6       |0                      |0                         |0                      |0                        |0                    |0                       |0                    |0                      |0                |0                   |0                |0                  |0                    |0                       |0                    |0                      |0                  |0                     |0                  |0                    |6688608             |6688702                |6688825             |40132250              |0                          |0                             |0                          |0                            |41                      |60                         |100                     |400                       |0                        |GpuColumnarExchange(16),GpuProject(17),GpuRowToColumnar(18),WholeStageCodegen (2)(19),Scan(21)                                                                                                    |
|1       |Spark shell|local-1622814619968|1      |1631           |6       |0                      |0                         |0                      |0                        |0                    |0                       |0                    |0                      |0                |0                   |0                |0                  |0                    |0                       |0                    |0                      |0                  |0                     |0                  |0                    |6688602             |6688708                |6688833             |40132258              |0                          |0                             |0                          |0                            |37                      |92                         |108                     |508                       |0                        |GpuColumnarExchange(8),GpuProject(9),GpuRowToColumnar(10),WholeStageCodegen (1)(11),Scan(13)                                                                                                      |
|1       |Spark shell|local-1622814619968|2      |688            |200     |0                      |0                         |0                      |0                        |0                    |0                       |0                    |0                      |0                |0                   |0                |0                  |0                    |0                       |0                    |0                      |397220             |401479                |405854             |80264508             |77                  |77                     |77                  |15400                 |0                          |0                             |0                          |0                            |0                       |0                          |9                       |93                        |0                        |GpuColumnarExchange(3),GpuHashAggregate(4),GpuProject(5),GpuShuffledHashJoin(6),GpuShuffleCoalesce(7),GpuColumnarExchange(8),GpuCoalesceBatches(14),GpuShuffleCoalesce(15),GpuColumnarExchange(16)|
|1       |Spark shell|local-1622814619968|3      |83             |1       |0                      |0                         |0                      |0                        |0                    |0                       |0                    |0                      |0                |0                   |0                |0                  |0                    |0                       |0                    |0                      |15400              |15400                 |15400              |15400                |0                   |0                      |0                   |0                     |0                          |0                             |0                          |0                            |0                       |0                          |0                       |0                         |0                        |GpuColumnarToRow(0),GpuHashAggregate(1),GpuShuffleCoalesce(2),GpuColumnarExchange(3)                                                                                                              |
+--------+-----------+-------------------+-------+---------------+--------+-----------------------+--------------------------+-----------------------+-------------------------+---------------------+------------------------+---------------------+-----------------------+-----------------+--------------------+-----------------+-------------------+---------------------+------------------------+---------------------+-----------------------+-------------------+----------------------+-------------------+---------------------+--------------------+-----------------------+--------------------+----------------------+---------------------------+------------------------------+---------------------------+-----------------------------+------------------------+---------------------------+------------------------+--------------------------+-------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

Signed-off-by: cindyyuanjiang <[email protected]>

kuhushukla · 2024-10-07T19:12:48Z

On second thought, we should combine the two tables shown as an example here. My original intent was to keep the first view simple, but the latter table is not too bad for that.

kuhushukla · 2024-10-29T14:41:05Z

core/src/main/scala/com/nvidia/spark/rapids/tool/analysis/AppSparkMetricsAnalyzer.scala

+        swWriteTimeMax,
+        swWriteTimeSum,
+        gpuSemaphoreWaitSum,
+        nodeNames)


Can we make an encapsulating object for this, so that we don't have a large arg list and there is a single place to hold the metrics we care about? That would make it easier to update.

Thanks @kuhushukla! Can you elaborate a bit more on this? I thought StageDiagnosticResult is the encapsulating object. It has a similar structure to other profiler results, for example: https://github.com/NVIDIA/spark-rapids-tools/blob/dev/core/src/main/scala/com/nvidia/spark/rapids/tool/profiling/ProfileClassWarehouse.scala#L417.

The arg list is very large and that on its own would be nice to abstract away in a case class etc.

thanks @kuhushukla! I experimented with a few things like encapsulating part of the arg list into a separate case class, but overall I think this presentation has the best readability. It also aligns with other classes in this file and current unit tests. We can discuss more offline if there is something else we should try.
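
For illustration only, here is a minimal sketch of what the reviewer's suggestion could look like: the four values of one metric are grouped into a small case class, so the result takes one argument per metric instead of four. `MetricStats` and `StageDiagnosticSketch` are hypothetical names that do not appear in this PR, and the merged code kept the flat argument list.

```scala
object EncapsulationSketch {
  // Hypothetical holder for the min/median/max/total of one task-level metric.
  final case class MetricStats(min: Long, median: Long, max: Long, total: Long) {
    def toSeq: Seq[String] = Seq(min, median, max, total).map(_.toString)
  }

  // Hypothetical, trimmed-down stand-in for StageDiagnosticResult.
  final case class StageDiagnosticSketch(
      stageId: Int,
      numTasks: Int,
      shuffleWriteTime: MetricStats,
      gpuSemaphoreWaitTotal: Long,
      nodeNames: Seq[String]) {
    // Flatten the grouped metrics back into one CSV-style row.
    def toRow: Seq[String] =
      Seq(stageId.toString, numTasks.toString) ++ shuffleWriteTime.toSeq ++
        Seq(gpuSemaphoreWaitTotal.toString, nodeNames.mkString(","))
  }

  def main(args: Array[String]): Unit = {
    // Values copied from the stage 0 row of the example output in the PR description.
    val row = StageDiagnosticSketch(0, 6, MetricStats(41L, 60L, 100L, 400L), 0L,
      Seq("GpuColumnarExchange(16)", "GpuProject(17)"))
    println(row.toRow.mkString("|"))
  }
}
```

The trade-off raised in the thread still applies: grouping shortens the constructor, but it makes this class look different from the other result classes in the same file.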

kuhushukla · 2024-10-29T14:44:19Z

core/src/main/scala/com/nvidia/spark/rapids/tool/analysis/AppSparkMetricsAnalyzer.scala

+  /**
+   * Given an input iterable, returns its min, median, max and sum.
+   */
+  def getStatistics(arr: Iterable[Long]): (Long, Long, Long, Long) = {


Do we need this? I thought this information is already available and can simply be pulled. Please correct me if I am wrong. For example, in the existing profiler output, where does the median value come from?

I updated the implementation to reuse/pull existing metrics results from ProfStageMetricView. I cannot do this for shuffle read total bytes because in ProfStageMetricView there are 2 metrics associated with this: internal.metrics.shuffle.read.localBytesRead and internal.metrics.shuffle.read.remoteBytesRead. I cannot get the min/med/max of shuffle read total bytes by adding the min/med/max of the 2 metrics. I am keeping this function for now, but if it looks too unnecessary I can remove it.
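
A minimal sketch of why the per-metric statistics cannot simply be added, assuming a `getStatistics` with the signature shown in the diff above. The body here is an illustration rather than the PR's implementation: the median convention (lower middle value for an even count) and the all-zero result for empty input are assumptions, and the task values are made up.

```scala
object StatsSketch {
  /** Returns (min, median, max, sum) of the input; all zeros for an empty input. */
  def getStatistics(values: Iterable[Long]): (Long, Long, Long, Long) = {
    if (values.isEmpty) {
      (0L, 0L, 0L, 0L)
    } else {
      val sorted = values.toArray.sorted
      // Assumption: for an even count, take the lower of the two middle values.
      val median = sorted((sorted.length - 1) / 2)
      (sorted.head, median, sorted.last, sorted.sum)
    }
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical per-task byte counts for three tasks of one stage.
    val localBytes = Seq(10L, 0L, 1L)
    val remoteBytes = Seq(0L, 20L, 1L)
    // Total shuffle read bytes must be formed per task before aggregating.
    val totalBytes = localBytes.zip(remoteBytes).map { case (l, r) => l + r }

    println(getStatistics(localBytes))  // (0,1,10,11)
    println(getStatistics(remoteBytes)) // (0,1,20,21)
    println(getStatistics(totalBytes))  // (2,10,20,32)
  }
}
```

Only the sums add up (11 + 21 = 32). Adding the two minimums, medians, or maximums gives 0, 2, and 30, while the per-task totals give 2, 10, and 20.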

kuhushukla · 2024-10-29T14:45:29Z

core/src/main/scala/com/nvidia/spark/rapids/tool/profiling/ProfileClassWarehouse.scala

@@ -608,6 +572,183 @@ case class StageAggTaskMetricsProfileResult(
  override def idHeader = "stageId"
 }

+case class StageDiagnosticMetricsProfileResult(


Nit: rename to a shorter name. DiagnosticResult, StageDiagnosticResult, etc. are some options.

Short and concise are good as long as the name doesn't lose context. For instance, I think DiagnosticResult may be too generic, as it doesn't tell you what it's applied to or what it is a diagnostic of, so I would lean towards StageDiagnosticResult.

thanks, changed to StageDiagnosticResult

...sources/ProfilingExpectations/rapids_join_eventlog_stagediagnosticmetricsagg_expectation.csv

core/src/main/scala/com/nvidia/spark/rapids/tool/analysis/AppSparkMetricsAnalyzer.scala

tgravescs · 2024-10-30T15:54:24Z

is the example output real here? The first stage that took the longest has no input data but has a Scan with it

parthosa

Thanks @cindyyuanjiang. Minor nits and questions.

core/src/main/scala/com/nvidia/spark/rapids/tool/views/package.scala

core/src/main/scala/com/nvidia/spark/rapids/tool/analysis/AppSparkMetricsAggTrait.scala

core/src/main/scala/com/nvidia/spark/rapids/tool/analysis/AppSparkMetricsAnalyzer.scala

cindyyuanjiang · 2024-10-30T21:20:56Z

is the example output real here? The first stage that took the longest has no input data but has a Scan with it

thanks @tgravescs! This is the example output from existing testing event log: https://github.com/NVIDIA/spark-rapids-tools/blob/dev/core/src/test/resources/spark-events-profiling/rapids_join_eventlog.zstd.

amahussein · 2024-11-08T16:19:39Z

core/src/main/scala/com/nvidia/spark/rapids/tool/profiling/ApplicationSummaryInfo.scala

@@ -47,7 +47,8 @@ case class ApplicationSummaryInfo(
    ioMetrics: Seq[IOAnalysisProfileResult],
    sysProps: Seq[RapidsPropertyProfileResult],
    sqlCleanedAlignedIds: Seq[SQLCleanAndAlignIdsProfileResult],
-    sparkRapidsBuildInfo: Seq[SparkRapidsBuildInfoEvent])
+    sparkRapidsBuildInfo: Seq[SparkRapidsBuildInfoEvent],
+    stageDiagnostics: Seq[StageDiagnosticResult])


I mean that we can generate those records for the sake of generating the report, and we do not have to store them in the ApplicationSummaryInfo.
ApplicationSummaryInfo fields are the ones that provide some information about the application and then it can be consumed by other modules inside scala (i.e., AutoTuner).

core/src/main/scala/com/nvidia/spark/rapids/tool/profiling/ProfileClassWarehouse.scala

core/src/main/scala/com/nvidia/spark/rapids/tool/analysis/AppSQLPlanAnalyzer.scala

core/src/main/scala/com/nvidia/spark/rapids/tool/analysis/AppSparkMetricsAnalyzer.scala

amahussein

Thanks @cindyyuanjiang
I tried to run this branch with the qualification CMD and I do not see the diagnostics CSV file generated in raw_metrics.
I remember that you mentioned that the diagnostics output will be generated from Qualification as well. Is there a change in the requirement?

cindyyuanjiang · 2024-11-15T21:09:19Z

Thanks @cindyyuanjiang I tried to run this branch with the qualification CMD and I do not see the diagnostics CSV file generated in raw_metrics. I remember that you mentioned that the diagnostics output will be generated from Qualification as well. Is there a change in the requirement?

Thanks @amahussein! Yes, I updated the PR, we do want diagnostic output in Qualification as well. PTAL.

core/src/main/scala/com/nvidia/spark/rapids/tool/analysis/AppSQLPlanAnalyzer.scala

core/src/main/scala/com/nvidia/spark/rapids/tool/analysis/AppSparkMetricsAnalyzer.scala

amahussein

Is there a benchmarkSuite for the Profiler similar to what we have for SingleThreadedQualToolBenchmark?

core/src/main/scala/com/nvidia/spark/rapids/tool/analysis/AppSQLPlanAnalyzer.scala

cindyyuanjiang · 2024-11-15T22:30:31Z

Is there a benchmarkSuite for the Profiler similar to what we have for SingleThreadedQualToolBenchmark?

Thanks @amahussein. I have a local copy of Profile Benchmark class, do we want to include that in this PR as well?

I am planning to run some Profiler benchmarks later based on our offline discussion.

amahussein · 2024-11-20T17:01:02Z

Is there a benchmarkSuite for the Profiler similar to what we have for SingleThreadedQualToolBenchmark?

Thanks @amahussein. I have a local copy of Profile Benchmark class, do we want to include that in this PR as well?

I am planning to run some Profiler benchmarks later based on our offline discussion.

Thanks @cindyyuanjiang !
Let's add the new Profiler benchmark class in this PR as well so that everyone has the same view of what you have profiled for your PR.

cindyyuanjiang · 2024-11-20T22:35:19Z

Let's add the new Profiler benchmark class in this PR as well so that everyone has the same view of what you have profiled for your PR.

Thank you @amahussein! I added the Profile benchmark class in this PR.

amahussein · 2024-11-21T14:48:32Z

.../main/scala/org/apache/spark/rapids/tool/benchmarks/SingleThreadedProfileToolBenchmark.scala

+    // Currently the input arguments are assumed to be common across cases
+    // This will be improved in a follow up PR to enable passing as a config
+    // file with argument support for different cases
+    runBenchmark("Benchmark_Per_SQL_Arg_Profiling") {


There is no "PER_SQL" argument for Profiling. That prefix was used in the Qualification benchmark because we were running it with the perSql argument enabled/disabled.
Suggestions:

- Enable CSV and call it something like Benchmark_Profiling_CSV.
- If this is supposed to run single-threaded, then the number of threads should be specified. Otherwise, the benchmark will have non-deterministic behavior for multiple eventlogs.

amahussein

LGTME.
Thanks @cindyyuanjiang

cindyyuanjiang added 2 commits October 2, 2024 16:44

initial implementation

1265e47

Signed-off-by: cindyyuanjiang <[email protected]>

updated output schema based on offline discussion

6aefee9

Signed-off-by: cindyyuanjiang <[email protected]>

cindyyuanjiang changed the title ~~WIP: [FEA] Add diagnostic output for GPU slowness in Profiler tool~~ WIP: [FEA] Add stage/task level diagnostic output for GPU slowness in Profiler tool Oct 4, 2024

cindyyuanjiang added 6 commits October 8, 2024 17:27

address feedback to merge two tables together

5a56d30

Signed-off-by: cindyyuanjiang <[email protected]>

update order of columns

be13fa8

Signed-off-by: cindyyuanjiang <[email protected]>

get gpu semaphore time

7b895c5

Signed-off-by: cindyyuanjiang <[email protected]>

add benchmark

68c34f6

Signed-off-by: cindyyuanjiang <[email protected]>

clean up code

20dd4c8

Signed-off-by: cindyyuanjiang <[email protected]>

resolve merge conflict

41f6ff0

Signed-off-by: cindyyuanjiang <[email protected]>

cindyyuanjiang self-assigned this Oct 23, 2024

cindyyuanjiang added feature request New feature or request core_tools Scope the core module (scala) labels Oct 23, 2024

cindyyuanjiang added 3 commits October 24, 2024 18:09

add unit test

7cd25e9

Signed-off-by: cindyyuanjiang <[email protected]>

new expectation file

f740192

Signed-off-by: cindyyuanjiang <[email protected]>

resolve merge conflict

6931086

Signed-off-by: cindyyuanjiang <[email protected]>

cindyyuanjiang marked this pull request as ready for review October 25, 2024 20:41

cindyyuanjiang changed the title ~~WIP: [FEA] Add stage/task level diagnostic output for GPU slowness in Profiler tool~~ [FEA] Add stage/task level diagnostic output for GPU slowness in Profiler tool Oct 25, 2024

cindyyuanjiang requested review from amahussein, parthosa, nartal1, tgravescs and kuhushukla October 25, 2024 20:55

kuhushukla reviewed Oct 29, 2024

nartal1 reviewed Oct 29, 2024

core/src/main/scala/com/nvidia/spark/rapids/tool/analysis/AppSparkMetricsAnalyzer.scala (outdated, resolved)

parthosa reviewed Oct 30, 2024

address review feedback

2dcdb9b

Signed-off-by: cindyyuanjiang <[email protected]>

address review feedback

ac162f0

Signed-off-by: cindyyuanjiang <[email protected]>

cindyyuanjiang requested a review from tgravescs November 6, 2024 18:17

amahussein reviewed Nov 8, 2024

cindyyuanjiang added 5 commits November 8, 2024 16:48

refactor stageDiagnosticResults

bdd2292

Signed-off-by: cindyyuanjiang <[email protected]>

change num attemps to tasks

3a8cf9e

Signed-off-by: cindyyuanjiang <[email protected]>

remove diagnostic from applicationsummaryinfo

1361d53

Signed-off-by: cindyyuanjiang <[email protected]>

remove unused import

c54a4b7

Signed-off-by: cindyyuanjiang <[email protected]>

new file

882d403

Signed-off-by: cindyyuanjiang <[email protected]>

cindyyuanjiang requested a review from amahussein November 13, 2024 00:15

Merge branch 'dev' into profiler-diagnostic

f985b44

amahussein reviewed Nov 15, 2024

add diagnostic view in qual tool output

f3b78ff

Signed-off-by: cindyyuanjiang <[email protected]>

remove diagnostic vire from qual tool profile.log file

8b317e6

Signed-off-by: cindyyuanjiang <[email protected]>

parthosa reviewed Nov 15, 2024

core/src/main/scala/com/nvidia/spark/rapids/tool/analysis/AppSQLPlanAnalyzer.scala (outdated, resolved)

core/src/main/scala/com/nvidia/spark/rapids/tool/analysis/AppSparkMetricsAnalyzer.scala (resolved)

cindyyuanjiang requested a review from amahussein November 15, 2024 22:02

address review feedback

ebfc6e3

Signed-off-by: cindyyuanjiang <[email protected]>

cindyyuanjiang requested a review from parthosa November 15, 2024 22:11

amahussein reviewed Nov 15, 2024

core/src/main/scala/com/nvidia/spark/rapids/tool/analysis/AppSQLPlanAnalyzer.scala (outdated, resolved)

cindyyuanjiang requested a review from amahussein November 18, 2024 21:36

Merge branch 'dev' into profiler-diagnostic

b88119a

add profile benchmark class

056d4b2

Signed-off-by: cindyyuanjiang <[email protected]>

amahussein requested changes Nov 21, 2024

fix profiler benchmark

de47ef4

Signed-off-by: cindyyuanjiang <[email protected]>

cindyyuanjiang requested a review from amahussein November 22, 2024 00:10

amahussein approved these changes Nov 22, 2024

cindyyuanjiang merged commit de40e8d into NVIDIA:dev Nov 22, 2024
14 checks passed

cindyyuanjiang deleted the profiler-diagnostic branch November 22, 2024 19:55
