Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Report all operators in the output file #1431

Closed
wants to merge 10 commits into from

Conversation

nartal1
Copy link
Collaborator

@nartal1 nartal1 commented Nov 20, 2024

This fixes #1325 . This is a follow-on PR to capture the expressions and save it to output file.

This PR is to print all the operators per app and per sqlID in a new file. This helps to get the count of operators in an application. It has count of both supported and unsupported operators. Added arguments in ExecInfo to store references of execName and expressionNames. ExecInfo analysis is done at the end to save memory overhead of storing the information.

Sample Output:

App ID,SQL ID,Operator Type,Operator Name,Count,Supported,Stages
"test-app-1",146,ReadRDD,Scan ExistingRDD,1,false,219
"test-app-1",146,Exec,GenerateExec,2,true,219
"test-app-1",146,Expr,explode,2,true,219
"test-app-1",146,Exec,FilterExec,1,true,219
"test-app-1",146,Expr,lower,1,true,219
"test-app-1",104,ReadRDD,Scan ExistingRDD,1,false,133
"test-app-1",104,Exec,ObjectHashAggregateExec,2,false,136:133
"test-app-1",104,Expr,collect_set,2,false,136:133
"test-app-1",104,Expr,count,2,false,136:133
"test-app-1",104,Expr,sum,2,false,136:133
test-app-1",104,Expr,last,2,false,136:133
"test-app-1",104,Expr,coalesce,2,false,136:133
"test-app-1",104,Exec,ProjectExec,1,true,133
"test-app-1",104,Exec,ShuffleExchangeExec,1,true,133:136

Followon work:

  1. Refactor ExecInfo to minimize number of arguments in the case class. Try to store the expr and execInfo in their respective case classes.

This pull request introduces significant updates to the ExecInfoAnalyzer class and various execution parsers in the RAPIDS tool for Spark. The changes mainly focus on adding references for execution and expression names, improving the organization and analysis of execution information.

Key Changes:

New Features:

Execution Parser Updates:

  • Added execNameRef to various execution parsers to reference execution names:

New Utility Classes:

These changes enhance the capability to track and analyze execution and expression references, improving the overall functionality and maintainability of the RAPIDS tool for Spark.This PR is to print all the operators per app and per sqlID in a new file.
This helps to get the count of operators in an application. It has count of both supported and unsupported operators. Added arguments in ExecInfo to store references of execName and expressionNames.
ExecInfo analysis is done at the end to save memory overhead of storing the information.
Fixed a bug where the parseAggregateExpressions was

This PR is to print all the operators per app and per sqlID in a new
file.
This helps to get the count of operators in an application.
It has count of both supported and unsupported operators.
Added arguments in ExecInfo to store references of execName and
expressionNames.
ExecInfo analysis is done at the end to save memory overhead of storing
the information.
Fixed a bug where the parseAggregateExpressions was

Signed-off-by: Niranjan Artal <[email protected]>
@nartal1 nartal1 self-assigned this Nov 20, 2024
@nartal1 nartal1 marked this pull request as draft November 20, 2024 19:21
@nartal1 nartal1 added feature request New feature or request affect-output A change that modifies the output (add/remove/rename files, add/remove/rename columns) core_tools Scope the core module (scala) labels Nov 20, 2024
@nartal1 nartal1 marked this pull request as ready for review November 22, 2024 06:23
Copy link
Collaborator

@amahussein amahussein left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @nartal1
I put some questions and comments.

def getOrCreate(name: String): ExecRef = {
namesTable.computeIfAbsent(name, ExecRef.apply)
}
val Empty: ExecRef = getOrCreate("")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: the name empty might confuse someone else because it is very similar to the reserved words in scala. EMPTY, could signify that this a defined constant.

namesTable.computeIfAbsent(name, ExprRef.apply)
}

val Empty: ExprRef = getOrCreate("")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same as above comment


import org.apache.spark.sql.rapids.tool.util.StringUtils

case class ExecRef(value: String) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we add a comment to explain what is the case class going to be used for?

val Empty: ExecRef = getOrCreate("")
}

case class ExprRef(value: String) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we add a comment to explain what is the case class going to be used for?

@@ -107,7 +107,9 @@ case class ExecInfo(
unsupportedExprs: Seq[UnsupportedExpr],
dataSet: Boolean,
udf: Boolean,
shouldIgnore: Boolean) {
shouldIgnore: Boolean,
execsRef: ExecRef,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: execsRef is plural. perhaps execRef?

def writeAllOpsSummaryCSVreport(
sums: Seq[QualificationSummaryInfo]): Unit = {
val csvFileWriter = new ToolTextFileWriter(outputDir,
s"${QualOutputWriter.LOGFILE_NAME}_allOperators.csv",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shall we name it _operatorsStats since it most about counting?

@@ -1020,6 +1057,50 @@ object QualOutputWriter {
}.flatten.toSet
}

private def constructAllOperatorsInfo(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Something bugs me about the operator files.
Is it possible to make this file at least sorted.

  • if we sort by (count, sql, operatorName, supported), then we will be able to see the to operators at the top but at the same time it will be noisy when we start to count expressions per exec. Because for sure, expressions will be more count than execs.
  • if we sort by (sql, count..) then the report will be more readable as SQLs appear in order.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the suggestion @amahussein . I have yet to address this as this might cause memory/runtime overhead. Thinking of a better way to update this.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the sort is done within a single app, then the final report should not incur any overhead because it has been sorted locally within a single app.

*
* @param execInfo The execution information node to process
*/
private def traverse(execInfo: ExecInfo): Unit = {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IS this to support future when we count expressions per exec? because for now, we only have a single exprRef that cannot be more than 1.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's correct. Was planning to do it as a follow-on. Will file an issue to refactor the code further.

execInfo.children.foreach(_.foreach(traverse))
}

case class ExpressionResult(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

comment what is this case class gonna be used for. i.e., scope output,..etc?

stages: Set[Int]
)

case class ExecResult(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

comment what is this case class gonna be used for. i.e., scope output,..etc?

@nartal1
Copy link
Collaborator Author

nartal1 commented Nov 25, 2024

Thanks @amahussein for the review. I have addressed review comments except for sorting the output file.

Copy link
Collaborator

@amahussein amahussein left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a merge conflict

@nartal1
Copy link
Collaborator Author

nartal1 commented Dec 3, 2024

Closing this as it is replaced by #1444

@nartal1 nartal1 closed this Dec 3, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
affect-output A change that modifies the output (add/remove/rename files, add/remove/rename columns) core_tools Scope the core module (scala) feature request New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[FEA] Qualification tool: Report supported expressions in the output file
2 participants