Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FEA] Add stage/task level diagnostic output for GPU slowness in Profiler tool #1375

Merged
merged 30 commits into from
Nov 22, 2024

Conversation

cindyyuanjiang
Copy link
Collaborator

@cindyyuanjiang cindyyuanjiang commented Oct 4, 2024

Contributes to #1374

Changes

  • Add a stage-level diagnostic view in Profiler output: stage_level_diagnostic_metrics.csv
    • Add the diagnostic metrics results under AggRawMetricsResult
    • Update AppSQLPlanAnalyzer to store mappings between stage IDs to node names and GPU semaphore wait time
  • Add a unit test in core/src/test/scala/com/nvidia/spark/rapids/tool/profiling/AnalysisSuite.scala to verify diagnostic csv file output
  • Clean up redundant function definitions in core/src/main/scala/com/nvidia/spark/rapids/tool/profiling/ProfileClassWarehouse.scala

Testing CMD

spark_rapids profiling -e <my_event_log> -t <my_tools_jar>

Output Example

+--------+-----------+-------------------+-------+---------------+--------+-----------------------+--------------------------+-----------------------+-------------------------+---------------------+------------------------+---------------------+-----------------------+-----------------+--------------------+-----------------+-------------------+---------------------+------------------------+---------------------+-----------------------+-------------------+----------------------+-------------------+---------------------+--------------------+-----------------------+--------------------+----------------------+---------------------------+------------------------------+---------------------------+-----------------------------+------------------------+---------------------------+------------------------+--------------------------+-------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|appIndex|appName    |appId              |stageId|stageDurationMs|numTasks|memoryBytesSpilledMBMin|memoryBytesSpilledMBMedian|memoryBytesSpilledMBMax|memoryBytesSpilledMBTotal|diskBytesSpilledMBMin|diskBytesSpilledMBMedian|diskBytesSpilledMBMax|diskBytesSpilledMBTotal|inputBytesReadMin|inputBytesReadMedian|inputBytesReadMax|inputBytesReadTotal|outputBytesWrittenMin|outputBytesWrittenMedian|outputBytesWrittenMax|outputBytesWrittenTotal|shuffleReadBytesMin|shuffleReadBytesMedian|shuffleReadBytesMax|shuffleReadBytesTotal|shuffleWriteBytesMin|shuffleWriteBytesMedian|shuffleWriteBytesMax|shuffleWriteBytesTotal|shuffleReadFetchWaitTimeMin|shuffleReadFetchWaitTimeMedian|shuffleReadFetchWaitTimeMax|shuffleReadFetchWaitTimeTotal|shuffleWriteWriteTimeMin|shuffleWriteWriteTimeMedian|shuffleWriteWriteTimeMax|shuffleWriteWriteTimeTotal|gpuSemaphoreWaitTimeTotal|SQL Nodes(IDs)                                                                                                                                                                                    |
+--------+-----------+-------------------+-------+---------------+--------+-----------------------+--------------------------+-----------------------+-------------------------+---------------------+------------------------+---------------------+-----------------------+-----------------+--------------------+-----------------+-------------------+---------------------+------------------------+---------------------+-----------------------+-------------------+----------------------+-------------------+---------------------+--------------------+-----------------------+--------------------+----------------------+---------------------------+------------------------------+---------------------------+-----------------------------+------------------------+---------------------------+------------------------+--------------------------+-------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|1       |Spark shell|local-1622814619968|0      |1743           |6       |0                      |0                         |0                      |0                        |0                    |0                       |0                    |0                      |0                |0                   |0                |0                  |0                    |0                       |0                    |0                      |0                  |0                     |0                  |0                    |6688608             |6688702                |6688825             |40132250              |0                          |0                             |0                          |0                            |41                      |60                         |100                     |400                       |0                        |GpuColumnarExchange(16),GpuProject(17),GpuRowToColumnar(18),WholeStageCodegen (2)(19),Scan(21)                                                                                                    |
|1       |Spark shell|local-1622814619968|1      |1631           |6       |0                      |0                         |0                      |0                        |0                    |0                       |0                    |0                      |0                |0                   |0                |0                  |0                    |0                       |0                    |0                      |0                  |0                     |0                  |0                    |6688602             |6688708                |6688833             |40132258              |0                          |0                             |0                          |0                            |37                      |92                         |108                     |508                       |0                        |GpuColumnarExchange(8),GpuProject(9),GpuRowToColumnar(10),WholeStageCodegen (1)(11),Scan(13)                                                                                                      |
|1       |Spark shell|local-1622814619968|2      |688            |200     |0                      |0                         |0                      |0                        |0                    |0                       |0                    |0                      |0                |0                   |0                |0                  |0                    |0                       |0                    |0                      |397220             |401479                |405854             |80264508             |77                  |77                     |77                  |15400                 |0                          |0                             |0                          |0                            |0                       |0                          |9                       |93                        |0                        |GpuColumnarExchange(3),GpuHashAggregate(4),GpuProject(5),GpuShuffledHashJoin(6),GpuShuffleCoalesce(7),GpuColumnarExchange(8),GpuCoalesceBatches(14),GpuShuffleCoalesce(15),GpuColumnarExchange(16)|
|1       |Spark shell|local-1622814619968|3      |83             |1       |0                      |0                         |0                      |0                        |0                    |0                       |0                    |0                      |0                |0                   |0                |0                  |0                    |0                       |0                    |0                      |15400              |15400                 |15400              |15400                |0                   |0                      |0                   |0                     |0                          |0                             |0                          |0                            |0                       |0                          |0                       |0                         |0                        |GpuColumnarToRow(0),GpuHashAggregate(1),GpuShuffleCoalesce(2),GpuColumnarExchange(3)                                                                                                              |
+--------+-----------+-------------------+-------+---------------+--------+-----------------------+--------------------------+-----------------------+-------------------------+---------------------+------------------------+---------------------+-----------------------+-----------------+--------------------+-----------------+-------------------+---------------------+------------------------+---------------------+-----------------------+-------------------+----------------------+-------------------+---------------------+--------------------+-----------------------+--------------------+----------------------+---------------------------+------------------------------+---------------------------+-----------------------------+------------------------+---------------------------+------------------------+--------------------------+-------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

@cindyyuanjiang cindyyuanjiang changed the title WIP: [FEA] Add diagnostic output for GPU slowness in Profiler tool WIP: [FEA] Add stage/task level diagnostic output for GPU slowness in Profiler tool Oct 4, 2024
@kuhushukla
Copy link
Collaborator

On a second thought, we should combine the two tables shown as an example here. My original intent was to keep the first view simple but the latter table is not too bad for that

Signed-off-by: cindyyuanjiang <[email protected]>
Signed-off-by: cindyyuanjiang <[email protected]>
Signed-off-by: cindyyuanjiang <[email protected]>
Signed-off-by: cindyyuanjiang <[email protected]>
Signed-off-by: cindyyuanjiang <[email protected]>
@cindyyuanjiang cindyyuanjiang self-assigned this Oct 23, 2024
@cindyyuanjiang cindyyuanjiang added feature request New feature or request core_tools Scope the core module (scala) labels Oct 23, 2024
Signed-off-by: cindyyuanjiang <[email protected]>
Signed-off-by: cindyyuanjiang <[email protected]>
Signed-off-by: cindyyuanjiang <[email protected]>
@cindyyuanjiang cindyyuanjiang marked this pull request as ready for review October 25, 2024 20:41
@cindyyuanjiang cindyyuanjiang changed the title WIP: [FEA] Add stage/task level diagnostic output for GPU slowness in Profiler tool [FEA] Add stage/task level diagnostic output for GPU slowness in Profiler tool Oct 25, 2024
swWriteTimeMax,
swWriteTimeSum,
gpuSemaphoreWaitSum,
nodeNames)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we make an encapsulating object for this so that we dont have large arg list as well as a single place to hold the metrics we care about -- easier to update it.

Copy link
Collaborator Author

@cindyyuanjiang cindyyuanjiang Oct 30, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @kuhushukla! Can you elaborate a bit more on this? I thought StageDiagnosticResult is the encapsulating object. It has similar structure as other profiler results, for example - https://github.com/NVIDIA/spark-rapids-tools/blob/dev/core/src/main/scala/com/nvidia/spark/rapids/tool/profiling/ProfileClassWarehouse.scala#L417.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The arg list is very large and that on its own would be nice to abstract away in a case class etc.

Copy link
Collaborator Author

@cindyyuanjiang cindyyuanjiang Nov 6, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks @kuhushukla! I experimented with a few things like encapsulating part of the arg list into a separate case class, but overall I think this presentation has the best readability. It also aligns with other classes in this file and current unit tests. We can discuss more offline if there is something else we should try.

/**
* Given an input iterable, returns its min, median, max and sum.
*/
def getStatistics(arr: Iterable[Long]): (Long, Long, Long, Long) = {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need this? I thought this information is available and can be simply pulled? Please correct me if I am wrong -- for eg, in the existing profiler o/p where does the median value come from?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I updated the implementation to reuse/pull existing metrics results from ProfStageMetricView. I cannot do this for shuffle read total bytes because in ProfStageMetricView there are 2 metrics associated with this: internal.metrics.shuffle.read.localBytesRead and internal.metrics.shuffle.read.remoteBytesRead. I cannot get the min/med/max of shuffle read total bytes by adding the min/med/max of the 2 metrics. I am keeping this function for now, but if it looks too unnecessary I can remove it.

@@ -608,6 +572,183 @@ case class StageAggTaskMetricsProfileResult(
override def idHeader = "stageId"
}

case class StageDiagnosticMetricsProfileResult(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: rename to a smaller string : DiagnosticResult , StageDiagnosticResult etc. are some options

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Short and concise are good as long as it doesn't lose context. For instance I think DiagnosticResult may be to generic as it doesn't tell you what its applied or diagnostic of. so I would lean towards StageDiagnosticResult

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks, changed to StageDiagnosticResult

@tgravescs
Copy link
Collaborator

is the example output real here? The first stage that took the longest has no input data but has a Scan with it

Copy link
Collaborator

@parthosa parthosa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @cindyyuanjiang. Minor nits and questions.

Signed-off-by: cindyyuanjiang <[email protected]>
@cindyyuanjiang
Copy link
Collaborator Author

is the example output real here? The first stage that took the longest has no input data but has a Scan with it

thanks @tgravescs! This is the example output from existing testing event log: https://github.com/NVIDIA/spark-rapids-tools/blob/dev/core/src/test/resources/spark-events-profiling/rapids_join_eventlog.zstd.

Signed-off-by: cindyyuanjiang <[email protected]>
@@ -47,7 +47,8 @@ case class ApplicationSummaryInfo(
ioMetrics: Seq[IOAnalysisProfileResult],
sysProps: Seq[RapidsPropertyProfileResult],
sqlCleanedAlignedIds: Seq[SQLCleanAndAlignIdsProfileResult],
sparkRapidsBuildInfo: Seq[SparkRapidsBuildInfoEvent])
sparkRapidsBuildInfo: Seq[SparkRapidsBuildInfoEvent],
stageDiagnostics: Seq[StageDiagnosticResult])
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mean that we can generate those records for the sake of generating the report, and we do not have to store them in the ApplicationSummaryInfo.
ApplicationSummaryInfo fields are the ones that provide some information about the application and then it can be consumed by other modules inside scala (i.e., AutoTuner).

Signed-off-by: cindyyuanjiang <[email protected]>
Signed-off-by: cindyyuanjiang <[email protected]>
Signed-off-by: cindyyuanjiang <[email protected]>
Signed-off-by: cindyyuanjiang <[email protected]>
Copy link
Collaborator

@amahussein amahussein left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @cindyyuanjiang
I tried to run this branch as qualification CMd and I do not see diagnostics CSV file generated in the raw_metrics.
I remember that you mentioned that the diagnsotics output will be generated from Qualification as well. Is there a change in the requirement?

@cindyyuanjiang
Copy link
Collaborator Author

cindyyuanjiang commented Nov 15, 2024

Thanks @cindyyuanjiang I tried to run this branch as qualification CMd and I do not see diagnostics CSV file generated in the raw_metrics. I remember that you mentioned that the diagnsotics output will be generated from Qualification as well. Is there a change in the requirement?

Thanks @amahussein! Yes, I updated the PR, we do want diagnostic output in Qualification as well. PTAL.

Signed-off-by: cindyyuanjiang <[email protected]>
Copy link
Collaborator

@amahussein amahussein left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a benchmarkSuite for the Profiler similar to what we have for SingleThreadedQualToolBenchmark?

@cindyyuanjiang
Copy link
Collaborator Author

cindyyuanjiang commented Nov 15, 2024

Is there a benchmarkSuite for the Profiler similar to what we have for SingleThreadedQualToolBenchmark?

Thanks @amahussein. I have a local copy of Profile Benchmark class, do we want to include that in this PR as well?

I am planning to run some Profiler benchmarks later based on our offline discussion.

@amahussein
Copy link
Collaborator

Is there a benchmarkSuite for the Profiler similar to what we have for SingleThreadedQualToolBenchmark?

Thanks @amahussein. I have a local copy of Profile Benchmark class, do we want to include that in this PR as well?

I am planning to run some Profiler benchmarks later based on our offline discussion.

Thanks @cindyyuanjiang !
Lets add the new Profiler benchmark class in this PR as well so that everyone has the same view of what you have profiled for your PR.

Signed-off-by: cindyyuanjiang <[email protected]>
@cindyyuanjiang
Copy link
Collaborator Author

Lets add the new Profiler benchmark class in this PR as well so that everyone has the same view of what you have profiled for your PR.

Thank you @amahussein! I added the Profile benchmark class in this PR.

// Currently the input arguments are assumed to be common across cases
// This will be improved in a follow up PR to enable passing as a config
// file with argument support for different cases
runBenchmark("Benchmark_Per_SQL_Arg_Profiling") {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is no "PER_SQL" argument for Profiling. that prefix was used in the qualification's benchmark because we were running the benchmark with the perSql argument enabled/disabled.
Suggestion is:

  • to enable CSV and call it something like Benchmark_Profiling_CSV
  • if this is supposed to run a single thread, then the number of threads should be specified. Otherwise, the benchmark will have non-deterministic behavior for multiple eventlogs.

Signed-off-by: cindyyuanjiang <[email protected]>
Copy link
Collaborator

@amahussein amahussein left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTME.
Thanks @cindyyuanjiang

@cindyyuanjiang cindyyuanjiang merged commit de40e8d into NVIDIA:dev Nov 22, 2024
14 checks passed
@cindyyuanjiang cindyyuanjiang deleted the profiler-diagnostic branch November 22, 2024 19:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
affect-output A change that modifies the output (add/remove/rename files, add/remove/rename columns) core_tools Scope the core module (scala) feature request New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants