Report all operators in the output file #1431

nartal1 · 2024-11-20T19:21:19Z

This fixes #1325. This is a follow-on PR to capture the expressions and save them to the output file.

This PR prints all the operators per app and per sqlID in a new file, which gives the count of each operator in an application. The counts cover both supported and unsupported operators. Arguments were added to ExecInfo to store references to execName and expressionNames. The ExecInfo analysis is done at the end to avoid the memory overhead of storing the information.
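The counting described here could be sketched roughly as follows. This is only an illustration under stated assumptions, not the code in this PR: `OpRecord`, `OpRow`, and `summarize` are hypothetical names, the input is assumed to be one record per operator occurrence, and stage IDs are sorted before being joined with `:` purely to keep the output deterministic.

```scala
object OperatorCountSketch {
  // Hypothetical input: one record per operator occurrence in a SQL plan.
  final case class OpRecord(
      appId: String,
      sqlId: Long,
      opType: String, // e.g. "Exec", "Expr", "ReadRDD"
      opName: String,
      supported: Boolean,
      stages: Set[Int])

  // Hypothetical output row, mirroring the columns of the sample CSV.
  final case class OpRow(
      appId: String,
      sqlId: Long,
      opType: String,
      opName: String,
      count: Int,
      supported: Boolean,
      stages: Set[Int]) {
    // Stages are joined with ':'; sorting them is an assumption of this sketch.
    def toCsv: String = {
      val stageStr = stages.toSeq.sorted.mkString(":")
      s""""$appId",$sqlId,$opType,$opName,$count,$supported,$stageStr"""
    }
  }

  val Header = "App ID,SQL ID,Operator Type,Operator Name,Count,Supported,Stages"

  // Group occurrences by (app, SQL, type, name, supported), count them,
  // and union the stage IDs seen for each group.
  def summarize(records: Seq[OpRecord]): Seq[OpRow] = {
    records
      .groupBy(r => (r.appId, r.sqlId, r.opType, r.opName, r.supported))
      .map { case ((app, sql, tpe, name, sup), group) =>
        OpRow(app, sql, tpe, name, group.size, sup, group.flatMap(_.stages).toSet)
      }
      .toSeq
  }
}
```

Calling `OperatorCountSketch.summarize(records).map(_.toCsv)` after writing `Header` would produce rows in the same shape as the sample output; the row order is unspecified in this sketch.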

Sample Output:

App ID,SQL ID,Operator Type,Operator Name,Count,Supported,Stages
"test-app-1",146,ReadRDD,Scan ExistingRDD,1,false,219
"test-app-1",146,Exec,GenerateExec,2,true,219
"test-app-1",146,Expr,explode,2,true,219
"test-app-1",146,Exec,FilterExec,1,true,219
"test-app-1",146,Expr,lower,1,true,219
"test-app-1",104,ReadRDD,Scan ExistingRDD,1,false,133
"test-app-1",104,Exec,ObjectHashAggregateExec,2,false,136:133
"test-app-1",104,Expr,collect_set,2,false,136:133
"test-app-1",104,Expr,count,2,false,136:133
"test-app-1",104,Expr,sum,2,false,136:133
"test-app-1",104,Expr,last,2,false,136:133
"test-app-1",104,Expr,coalesce,2,false,136:133
"test-app-1",104,Exec,ProjectExec,1,true,133
"test-app-1",104,Exec,ShuffleExchangeExec,1,true,133:136

Follow-on work:

Refactor ExecInfo to minimize number of arguments in the case class. Try to store the expr and execInfo in their respective case classes.

This pull request introduces significant updates to the ExecInfoAnalyzer class and various execution parsers in the RAPIDS tool for Spark. The changes mainly focus on adding references for execution and expression names, improving the organization and analysis of execution information.

Key Changes:

New Features:

core/src/main/scala/com/nvidia/spark/rapids/tool/analysis/ExecInfoAnalyzer.scala: Introduced ExecInfoAnalyzer class for analyzing execution information, including new internal data structures and methods for traversing and aggregating execution data.

Execution Parser Updates:

Added execNameRef to various execution parsers to reference execution names:
- BatchScanExecParser
- BroadcastExchangeExecParser
- BroadcastHashJoinExecParser
- BroadcastNestedLoopJoinExecParserBase
- DataWritingCommandExecParser
- DLWriteWithFormatAndSchemaParser
- FileSourceScanExecParser
- GenericExecParser

New Utility Classes:

core/src/main/scala/com/nvidia/spark/rapids/tool/planparser/OperatorNameRef.scala: Added ExecRef and ExprRef classes to manage execution and expression name references, including methods to create or retrieve references from a concurrent hash map.
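The "create or retrieve from a concurrent hash map" idea can be illustrated with a small interning sketch. `NameRef` below is a hypothetical stand-in for `ExecRef`/`ExprRef`, and the code is an assumption about the general pattern rather than a copy of `OperatorNameRef.scala`. It relies only on `java.util.concurrent.ConcurrentHashMap.computeIfAbsent`, which atomically stores the value on first use so every caller gets the same instance for a given name.

```scala
import java.util.concurrent.ConcurrentHashMap
import java.util.function.{Function => JFunction}

object NameRefSketch {
  // Hypothetical stand-in for ExecRef/ExprRef: a thin wrapper around an operator name.
  final case class NameRef(value: String)

  object NameRef {
    // Shared table so that each distinct name maps to exactly one NameRef instance.
    private val namesTable = new ConcurrentHashMap[String, NameRef]()

    // Explicit java.util.function.Function to avoid depending on SAM conversion.
    private val create: JFunction[String, NameRef] = new JFunction[String, NameRef] {
      override def apply(name: String): NameRef = new NameRef(name)
    }

    // Returns the existing reference for `name`, or creates and stores a new one.
    def getOrCreate(name: String): NameRef = namesTable.computeIfAbsent(name, create)

    // Shared reference for the empty name; must be defined after namesTable.
    val EMPTY: NameRef = getOrCreate("")
  }
}
```

Because repeated operator names share one object, storing a reference per `ExecInfo` costs a pointer rather than a fresh string, which seems to be the motivation for the reference classes.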

These changes enhance the capability to track and analyze execution and expression references, improving the overall functionality and maintainability of the RAPIDS tool for Spark.

This PR is to print all the operators per app and per sqlID in a new file. This helps to get the count of operators in an application. It has count of both supported and unsupported operators. Added arguments in ExecInfo to store references of execName and expressionNames. ExecInfo analysis is done at the end to save memory overhead of storing the information. Fixed a bug where the parseAggregateExpressions was

Signed-off-by: Niranjan Artal <[email protected]>

amahussein

Thanks @nartal1
I left some questions and comments.

amahussein · 2024-11-22T16:24:34Z

core/src/main/scala/com/nvidia/spark/rapids/tool/planparser/OperatorNameRef.scala

+  def getOrCreate(name: String): ExecRef = {
+    namesTable.computeIfAbsent(name, ExecRef.apply)
+  }
+  val Empty: ExecRef = getOrCreate("")


nit: the name Empty might confuse readers because it is very similar to reserved words in Scala. EMPTY would signify that this is a defined constant.

amahussein · 2024-11-22T16:25:08Z

core/src/main/scala/com/nvidia/spark/rapids/tool/planparser/OperatorNameRef.scala

+    namesTable.computeIfAbsent(name, ExprRef.apply)
+  }
+
+  val Empty: ExprRef = getOrCreate("")


Same as the comment above.

amahussein · 2024-11-22T16:25:31Z

core/src/main/scala/com/nvidia/spark/rapids/tool/planparser/OperatorNameRef.scala

+
+import org.apache.spark.sql.rapids.tool.util.StringUtils
+
+case class ExecRef(value: String) {


Can we add a comment to explain what the case class is going to be used for?

amahussein · 2024-11-22T16:26:06Z

core/src/main/scala/com/nvidia/spark/rapids/tool/planparser/OperatorNameRef.scala

+  val Empty: ExecRef = getOrCreate("")
+}
+
+case class ExprRef(value: String) {


Can we add a comment to explain what the case class is going to be used for?

amahussein · 2024-11-22T16:27:15Z

core/src/main/scala/com/nvidia/spark/rapids/tool/planparser/SQLPlanParser.scala

@@ -107,7 +107,9 @@ case class ExecInfo(
    unsupportedExprs: Seq[UnsupportedExpr],
    dataSet: Boolean,
    udf: Boolean,
-    shouldIgnore: Boolean) {
+    shouldIgnore: Boolean,
+    execsRef: ExecRef,


nit: execsRef is plural. Perhaps execRef?

amahussein · 2024-11-22T17:04:03Z

core/src/main/scala/com/nvidia/spark/rapids/tool/qualification/QualOutputWriter.scala

+  def writeAllOpsSummaryCSVreport(
+      sums: Seq[QualificationSummaryInfo]): Unit = {
+    val csvFileWriter = new ToolTextFileWriter(outputDir,
+      s"${QualOutputWriter.LOGFILE_NAME}_allOperators.csv",


Shall we name it _operatorsStats, since it is mostly about counting?

amahussein · 2024-11-22T17:09:24Z

core/src/main/scala/com/nvidia/spark/rapids/tool/qualification/QualOutputWriter.scala

@@ -1020,6 +1057,50 @@ object QualOutputWriter {
    }.flatten.toSet
  }

+  private def constructAllOperatorsInfo(


Something bugs me about the operator files.
Is it possible to make this file at least sorted?

If we sort by (count, sql, operatorName, supported), then we will be able to see the top operators at the top, but it will be noisy once we start counting expressions per exec, because expressions will certainly have higher counts than execs.

If we sort by (sql, count, ...), then the report will be more readable because the SQLs appear in order.

nartal1 · 2024-11-25

Thanks for the suggestion @amahussein. I have yet to address this, as it might cause memory/runtime overhead. I am thinking of a better way to update this.

amahussein · 2024-11-25

If the sort is done within a single app, then the final report should not incur any overhead because it has been sorted locally within a single app.
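As a rough illustration of the per-app sort discussed in this thread, the rows of a single app could be ordered by SQL ID and then by count before being written. `OpRow` and `sortForReport` are hypothetical names, and sorting counts in descending order with the operator name as a tie-breaker is an assumption; the thread only says "(sql, count..)".

```scala
object ReportSortSketch {
  // Hypothetical row for one app's report; only the fields needed for sorting.
  final case class OpRow(sqlId: Long, opName: String, count: Int, supported: Boolean)

  // Sort by SQL ID ascending, then count descending (assumed direction),
  // then operator name so that ties come out in a stable, readable order.
  def sortForReport(rows: Seq[OpRow]): Seq[OpRow] =
    rows.sortBy(r => (r.sqlId, -r.count, r.opName))
}
```

Since each app's rows are sorted on their own, the cost is bounded by the number of operators in that app, and the final file only needs the already-sorted per-app blocks appended one after another.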

amahussein · 2024-11-22T17:12:22Z

core/src/main/scala/com/nvidia/spark/rapids/tool/analysis/ExecInfoAnalyzer.scala

+   *
+   * @param execInfo The execution information node to process
+   */
+  private def traverse(execInfo: ExecInfo): Unit = {


Is this to support a future change where we count expressions per exec? For now, we only have a single exprRef, so there cannot be more than 1.

nartal1 · 2024-11-25

That's correct. I was planning to do it as a follow-on. I will file an issue to refactor the code further.
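For context, the recursive walk over `children` shown in the diff could look something like the sketch below. `Node` is a hypothetical, simplified stand-in for `ExecInfo`, and holding several expression names per node is an assumption that reflects the follow-on idea of counting expressions per exec, not the current single `exprRef`.

```scala
import scala.collection.mutable

object TraverseSketch {
  // Hypothetical, simplified stand-in for ExecInfo; children are optional, as in the diff.
  final case class Node(
      execName: String,
      exprNames: Seq[String],
      children: Option[Seq[Node]] = None)

  final class Counter {
    // Running totals keyed by operator name; missing keys read as 0.
    val execCounts: mutable.Map[String, Int] =
      mutable.Map.empty[String, Int].withDefaultValue(0)
    val exprCounts: mutable.Map[String, Int] =
      mutable.Map.empty[String, Int].withDefaultValue(0)

    // Count this node's exec and expressions, then recurse into any children.
    def traverse(node: Node): Unit = {
      execCounts(node.execName) += 1
      node.exprNames.foreach(e => exprCounts(e) += 1)
      node.children.foreach(_.foreach(traverse))
    }
  }
}
```

Keeping execs and expressions in separate maps keeps the two kinds of counts from being mixed, which is one way to avoid the noise mentioned in the sorting discussion.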

amahussein · 2024-11-22T17:12:52Z

core/src/main/scala/com/nvidia/spark/rapids/tool/analysis/ExecInfoAnalyzer.scala

+    execInfo.children.foreach(_.foreach(traverse))
+  }
+
+  case class ExpressionResult(


Please add a comment describing what this case class is going to be used for, e.g., scope, output, etc.

amahussein · 2024-11-22T17:12:57Z

core/src/main/scala/com/nvidia/spark/rapids/tool/analysis/ExecInfoAnalyzer.scala

+      stages: Set[Int]
+  )
+
+  case class ExecResult(


Please add a comment describing what this case class is going to be used for, e.g., scope, output, etc.

nartal1 · 2024-11-25T04:19:55Z

Thanks @amahussein for the review. I have addressed review comments except for sorting the output file.


amahussein

There is a merge conflict

nartal1 · 2024-12-03T22:05:16Z

Closing this as it is replaced by #1444

nartal1 self-assigned this Nov 20, 2024

nartal1 marked this pull request as draft November 20, 2024 19:21

nartal1 requested a review from amahussein November 20, 2024 19:21

nartal1 added the feature request, affect-output, and core_tools labels Nov 20, 2024

nartal1 added 3 commits November 21, 2024 06:59

update class name

8f1187b

revert bug fix

8b9066f

add comments

be0c59d

nartal1 marked this pull request as ready for review November 22, 2024 06:23

amahussein requested changes Nov 22, 2024


nartal1 mentioned this pull request Nov 23, 2024

Include expression parsers for HashAggregate and ObjectHashAggregate #1432

Merged

addressed review comments

cf74266

nartal1 and others added 4 commits November 25, 2024 15:06

addressed review comments

4dd4c0d

Signed-off-by: Niranjan Artal <[email protected]>

make report sorted by sqlID-count

72bf892

Signed-off-by: Ahmed Hussein (amahussein) <[email protected]>

Merge pull request #1 from amahussein/issue-1325-part2-report-improve

627168b

make report sorted by sqlID-count

Merge branch 'dev' of github.com:NVIDIA/spark-rapids-tools into issue…

1331534

…-1325-part2

amahussein requested changes Nov 26, 2024


capture execs and expressions separately

b980be0

nartal1 mentioned this pull request Dec 3, 2024

Report all operators in the output file #1444

Merged

nartal1 closed this Dec 3, 2024


