Report all operators in the output file #1431
Conversation
This PR prints all the operators per app and per sqlID in a new file. This helps to get the count of operators in an application; it includes counts of both supported and unsupported operators. Added arguments in ExecInfo to store references to execName and expressionNames. ExecInfo analysis is done at the end to save the memory overhead of storing the information. Fixed a bug where the parseAggregateExpressions was
Signed-off-by: Niranjan Artal <[email protected]>
Thanks @nartal1
I left some questions and comments.
def getOrCreate(name: String): ExecRef = {
  namesTable.computeIfAbsent(name, ExecRef.apply)
}
val Empty: ExecRef = getOrCreate("")
nit: the name Empty might confuse someone else because it is very similar to the reserved words in Scala. EMPTY could signify that this is a defined constant.
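A minimal sketch of the suggested rename (the uppercase name is only the reviewer's suggestion, not what was merged):
// Uppercase signals a defined constant and avoids confusion with Scala's
// built-in empty idioms (e.g. Option.empty, Seq.empty).
val EMPTY: ExecRef = getOrCreate("")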
  namesTable.computeIfAbsent(name, ExprRef.apply)
}

val Empty: ExprRef = getOrCreate("")
same as above comment
import org.apache.spark.sql.rapids.tool.util.StringUtils

case class ExecRef(value: String) {
Can we add a comment to explain what this case class is going to be used for?
  val Empty: ExecRef = getOrCreate("")
}

case class ExprRef(value: String) {
Can we add a comment to explain what this case class is going to be used for?
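For illustration, a hedged sketch of what such doc comments could look like on the two reference holders, assuming the ConcurrentHashMap-backed interning shown in the snippets above (this is a sketch, not the merged code):
import java.util.concurrent.ConcurrentHashMap

// Interned reference to an exec (operator) name. One instance is shared per
// distinct name, so ExecInfo nodes hold a cheap reference instead of repeating
// the string for every occurrence of the operator.
case class ExecRef(value: String)

object ExecRef {
  private val namesTable = new ConcurrentHashMap[String, ExecRef]()
  // Reuse an existing reference for the name, or create it exactly once.
  def getOrCreate(name: String): ExecRef = namesTable.computeIfAbsent(name, ExecRef.apply)
  val Empty: ExecRef = getOrCreate("")
}

// Same pattern for expression names referenced from ExecInfo.
case class ExprRef(value: String)

object ExprRef {
  private val namesTable = new ConcurrentHashMap[String, ExprRef]()
  def getOrCreate(name: String): ExprRef = namesTable.computeIfAbsent(name, ExprRef.apply)
  val Empty: ExprRef = getOrCreate("")
}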
@@ -107,7 +107,9 @@ case class ExecInfo(
     unsupportedExprs: Seq[UnsupportedExpr],
     dataSet: Boolean,
     udf: Boolean,
-    shouldIgnore: Boolean) {
+    shouldIgnore: Boolean,
+    execsRef: ExecRef,
nit: execsRef is plural. Perhaps execRef?
def writeAllOpsSummaryCSVreport(
    sums: Seq[QualificationSummaryInfo]): Unit = {
  val csvFileWriter = new ToolTextFileWriter(outputDir,
    s"${QualOutputWriter.LOGFILE_NAME}_allOperators.csv",
Shall we name it _operatorsStats since it is mostly about counting?
@@ -1020,6 +1057,50 @@ object QualOutputWriter {
    }.flatten.toSet
  }

  private def constructAllOperatorsInfo(
Something bugs me about the operators file.
Is it possible to make this file at least sorted?
- If we sort by (count, sql, operatorName, supported), then we will be able to see the top operators at the top, but it will be noisy once we start to count expressions per exec, because expressions will certainly have higher counts than execs.
- If we sort by (sql, count, ...), then the report will be more readable as SQLs appear in order.
Thanks for the suggestion @amahussein. I have yet to address this, as it might cause memory/runtime overhead. Thinking of a better way to update this.
If the sort is done within a single app, then the final report should not incur any overhead because it has been sorted locally within a single app.
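A minimal sketch of that per-app sort, assuming a hypothetical row type with sqlID, operatorName, count, and supported fields (not the actual report code):
// Hypothetical per-app operator row; field names are illustrative only.
case class OperatorRow(sqlID: Long, operatorName: String, count: Int, supported: Boolean)

// Sort locally within one app: SQL IDs in order, then highest counts first.
// Sorting each app's rows before appending them to the report keeps the output
// readable without buffering all apps in memory.
def sortAppRows(rows: Seq[OperatorRow]): Seq[OperatorRow] =
  rows.sortBy(r => (r.sqlID, -r.count, r.operatorName))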
 *
 * @param execInfo The execution information node to process
 */
private def traverse(execInfo: ExecInfo): Unit = {
Is this to support the future case when we count expressions per exec? Because for now, we only have a single exprRef, so the count cannot be more than 1.
That's correct. Was planning to do it as a follow-on. Will file an issue to refactor the code further.
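To make the follow-on idea concrete, a rough sketch of how a traversal could aggregate per-exec expression counts once ExecInfo carries multiple expression references; the counting structure and the exprsRefs field are assumptions about future work, not the current implementation:
import scala.collection.mutable

// Hypothetical aggregation: (sqlID, exec name, expression name) -> occurrence count.
val exprCountsPerExec = mutable.Map.empty[(Long, String, String), Int].withDefaultValue(0)

def traverse(execInfo: ExecInfo): Unit = {
  // Assumes a future exprsRefs: Seq[ExprRef] instead of today's single exprRef.
  execInfo.exprsRefs.foreach { exprRef =>
    val key = (execInfo.sqlID, execInfo.execsRef.value, exprRef.value)
    exprCountsPerExec(key) += 1
  }
  // Recurse into child plans, as in the snippet below.
  execInfo.children.foreach(_.foreach(traverse))
}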
  execInfo.children.foreach(_.foreach(traverse))
}

case class ExpressionResult(
Comment on what this case class is going to be used for, i.e., scope, output, etc.
  stages: Set[Int]
)

case class ExecResult(
Comment on what this case class is going to be used for, i.e., scope, output, etc.
Thanks @amahussein for the review. I have addressed the review comments except for sorting the output file.
Signed-off-by: Niranjan Artal <[email protected]>
Signed-off-by: Ahmed Hussein (amahussein) <[email protected]>
make report sorted by sqlID-count
There is a merge conflict
Closing this as it is replaced by #1444
This fixes #1325. This is a follow-on PR to capture the expressions and save them to the output file.
Sample Output:
Follow-on work:
This pull request introduces significant updates to the ExecInfoAnalyzer class and various execution parsers in the RAPIDS tool for Spark. The changes mainly focus on adding references for execution and expression names, improving the organization and analysis of execution information.

Key Changes:

New Features:
- core/src/main/scala/com/nvidia/spark/rapids/tool/analysis/ExecInfoAnalyzer.scala: Introduced the ExecInfoAnalyzer class for analyzing execution information, including new internal data structures and methods for traversing and aggregating execution data.

Execution Parser Updates:
- Added execNameRef to various execution parsers to reference execution names: BatchScanExecParser [1] [2], BroadcastExchangeExecParser [1] [2], BroadcastHashJoinExecParser [1] [2], BroadcastNestedLoopJoinExecParserBase [1] [2] [3], DataWritingCommandExecParser [1] [2], DLWriteWithFormatAndSchemaParser [1] [2], FileSourceScanExecParser [1] [2] [3], GenericExecParser [1] [2] [3] [4] [5]

New Utility Classes:
- core/src/main/scala/com/nvidia/spark/rapids/tool/planparser/OperatorNameRef.scala: Added ExecRef and ExprRef classes to manage execution and expression name references, including methods to create or retrieve references from a concurrent hash map.

These changes enhance the capability to track and analyze execution and expression references, improving the overall functionality and maintainability of the RAPIDS tool for Spark.
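As a small, hypothetical usage note on why the interning saves memory (identifiers follow the snippets above; the operator name is just an example):
// Both lookups return the same interned instance from the concurrent map, so
// repeated operators cost one shared ExecRef rather than a new string per node.
val a = ExecRef.getOrCreate("ProjectExec")
val b = ExecRef.getOrCreate("ProjectExec")
assert(a eq b)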