Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix inconsistent shuffle write time sum results in Profiler output #1450

Merged
merged 8 commits into from
Jan 2, 2025
Merged
Original file line number Diff line number Diff line change
Expand Up @@ -438,7 +438,7 @@ class AppSparkMetricsAnalyzer(app: AppBase) extends AppAnalysisBase(app) {
val peakMemoryValues = tasksInStage.map(_.peakExecutionMemory)
val shuffleWriteTime = tasksInStage.map(_.sw_writeTime)
(AppSparkMetricsAnalyzer.maxWithEmptyHandling(peakMemoryValues),
shuffleWriteTime.sum)
TimeUnit.NANOSECONDS.toMillis(shuffleWriteTime.sum))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This still does not fix the problem because the conversion is done on the stage-level.
The correct way, is to convert after the metrics are aggregated on each level.
For example, perStage/perSql/perJob.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The per SQL and per job results are computed based on cached per stage results. Please correct me if I am wrong.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct!
But when we are aggregating perSql, this PR is actually aggregating the stages per SQL after the time is converted to milliseconds.
If we want to be more accurate, then the cached-per-stage-results should still be in nano-seconds; then per-sql value is the sum in nano-seconds; and finally it gets converted to milliseonds.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

understood, thanks @amahussein! I will address this now.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Discussed offline. We will keep the current implementation to avoid potential overflow if we aggregate at SQL/job level.

}

val (durSum, durMax, durMin, durAvg) = AppSparkMetricsAnalyzer.getDurations(tasksInStage)
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -304,7 +304,7 @@ object GenerateTimeline {
tc.sr_fetchWaitTime
val opTimeMs = opMetrics.flatMap(_.taskUpdatesMap.get(taskId)).sum / 1000000
val writeTimeMs = writeMetrics.flatMap(_.taskUpdatesMap.get(taskId)).sum / 1000000 +
tc.sw_writeTime
tc.sw_writeTime / 1000000
val taskInfo = new TimelineTaskInfo(stageId, taskId, launchTime, finishTime, duration,
tc.executorDeserializeTime, readTimeMs, semTimeMs, opTimeMs, writeTimeMs)
val execHost = s"$execId/$host"
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -109,7 +109,7 @@ object TaskModel {
event.taskMetrics.shuffleReadMetrics.localBytesRead,
event.taskMetrics.shuffleReadMetrics.totalBytesRead,
event.taskMetrics.shuffleWriteMetrics.bytesWritten,
TimeUnit.NANOSECONDS.toMillis(event.taskMetrics.shuffleWriteMetrics.writeTime),
event.taskMetrics.shuffleWriteMetrics.writeTime,
event.taskMetrics.shuffleWriteMetrics.recordsWritten,
event.taskMetrics.inputMetrics.bytesRead,
event.taskMetrics.inputMetrics.recordsRead,
Expand Down
Original file line number Diff line number Diff line change
@@ -1,2 +1,2 @@
appIndex,jobId,numTasks,Duration,diskBytesSpilled_sum,duration_sum,duration_max,duration_min,duration_avg,executorCPUTime_sum,executorDeserializeCPUTime_sum,executorDeserializeTime_sum,executorRunTime_sum,input_bytesRead_sum,input_recordsRead_sum,jvmGCTime_sum,memoryBytesSpilled_sum,output_bytesWritten_sum,output_recordsWritten_sum,peakExecutionMemory_max,resultSerializationTime_sum,resultSize_max,sr_fetchWaitTime_sum,sr_localBlocksFetched_sum,sr_localBytesRead_sum,sr_remoteBlocksFetched_sum,sr_remoteBytesRead_sum,sr_remoteBytesReadToDisk_sum,sr_totalBytesRead_sum,sw_bytesWritten_sum,sw_recordsWritten_sum,sw_writeTime_sum
1,0,213,2515,0,25761,1624,9,120.9,7045,3021,11178,13522,0,0,424,0,0,0,0,10,8075,0,2600,80279920,0,0,0,80279920,80279920,2600,849
1,0,213,2515,0,25761,1624,9,120.9,7045,3021,11178,13522,0,0,424,0,0,0,0,10,8075,0,2600,80279920,0,0,0,80279920,80279920,2600,901
Original file line number Diff line number Diff line change
@@ -1,2 +1,2 @@
appIndex,jobId,numTasks,Duration,diskBytesSpilled_sum,duration_sum,duration_max,duration_min,duration_avg,executorCPUTime_sum,executorDeserializeCPUTime_sum,executorDeserializeTime_sum,executorRunTime_sum,input_bytesRead_sum,input_recordsRead_sum,jvmGCTime_sum,memoryBytesSpilled_sum,output_bytesWritten_sum,output_recordsWritten_sum,peakExecutionMemory_max,resultSerializationTime_sum,resultSize_max,sr_fetchWaitTime_sum,sr_localBlocksFetched_sum,sr_localBytesRead_sum,sr_remoteBlocksFetched_sum,sr_remoteBytesRead_sum,sr_remoteBytesReadToDisk_sum,sr_totalBytesRead_sum,sw_bytesWritten_sum,sw_recordsWritten_sum,sw_writeTime_sum
1,0,213,2569,0,26735,1598,10,125.5,6500,3433,12095,13414,0,0,336,0,0,0,0,8,8075,0,2600,80279908,0,0,0,80279908,80279908,2600,944
1,0,213,2569,0,26735,1598,10,125.5,6500,3433,12095,13414,0,0,336,0,0,0,0,8,8075,0,2600,80279908,0,0,0,80279908,80279908,2600,1001
Original file line number Diff line number Diff line change
@@ -1,3 +1,3 @@
appIndex,jobId,numTasks,Duration,diskBytesSpilled_sum,duration_sum,duration_max,duration_min,duration_avg,executorCPUTime_sum,executorDeserializeCPUTime_sum,executorDeserializeTime_sum,executorRunTime_sum,input_bytesRead_sum,input_recordsRead_sum,jvmGCTime_sum,memoryBytesSpilled_sum,output_bytesWritten_sum,output_recordsWritten_sum,peakExecutionMemory_max,resultSerializationTime_sum,resultSize_max,sr_fetchWaitTime_sum,sr_localBlocksFetched_sum,sr_localBytesRead_sum,sr_remoteBlocksFetched_sum,sr_remoteBytesRead_sum,sr_remoteBytesReadToDisk_sum,sr_totalBytesRead_sum,sw_bytesWritten_sum,sw_recordsWritten_sum,sw_writeTime_sum
1,0,213,2569,0,26735,1598,10,125.5,6500,3433,12095,13414,0,0,336,0,0,0,0,8,8075,0,2600,80279908,0,0,0,80279908,80279908,2600,944
2,0,213,2515,0,25761,1624,9,120.9,7045,3021,11178,13522,0,0,424,0,0,0,0,10,8075,0,2600,80279920,0,0,0,80279920,80279920,2600,849
1,0,213,2569,0,26735,1598,10,125.5,6500,3433,12095,13414,0,0,336,0,0,0,0,8,8075,0,2600,80279908,0,0,0,80279908,80279908,2600,1001
2,0,213,2515,0,25761,1624,9,120.9,7045,3021,11178,13522,0,0,424,0,0,0,0,10,8075,0,2600,80279920,0,0,0,80279920,80279920,2600,901
Original file line number Diff line number Diff line change
@@ -1,2 +1,2 @@
appIndex,appID,sqlID,description,numTasks,Duration,executorCPUTime,executorRunTime,executorCPURatio,diskBytesSpilled_sum,duration_sum,duration_max,duration_min,duration_avg,executorCPUTime_sum,executorDeserializeCPUTime_sum,executorDeserializeTime_sum,executorRunTime_sum,input_bytesRead_sum,input_recordsRead_sum,jvmGCTime_sum,memoryBytesSpilled_sum,output_bytesWritten_sum,output_recordsWritten_sum,peakExecutionMemory_max,resultSerializationTime_sum,resultSize_max,sr_fetchWaitTime_sum,sr_localBlocksFetched_sum,sr_localBytesRead_sum,sr_remoteBlocksFetched_sum,sr_remoteBytesRead_sum,sr_remoteBytesReadToDisk_sum,sr_totalBytesRead_sum,sw_bytesWritten_sum,sw_recordsWritten_sum,sw_writeTime_sum
1,"local-1622821994212",0,"count at <console>:28",213,3041,7045,13522,52.1,0,25761,1624,9,120.9,7045,3021,11178,13522,0,0,424,0,0,0,0,10,8075,0,2600,80279920,0,0,0,80279920,80279920,2600,849
1,"local-1622821994212",0,"count at <console>:28",213,3041,7045,13522,52.1,0,25761,1624,9,120.9,7045,3021,11178,13522,0,0,424,0,0,0,0,10,8075,0,2600,80279920,0,0,0,80279920,80279920,2600,901
Original file line number Diff line number Diff line change
@@ -1,2 +1,2 @@
appIndex,appID,sqlID,description,numTasks,Duration,executorCPUTime,executorRunTime,executorCPURatio,diskBytesSpilled_sum,duration_sum,duration_max,duration_min,duration_avg,executorCPUTime_sum,executorDeserializeCPUTime_sum,executorDeserializeTime_sum,executorRunTime_sum,input_bytesRead_sum,input_recordsRead_sum,jvmGCTime_sum,memoryBytesSpilled_sum,output_bytesWritten_sum,output_recordsWritten_sum,peakExecutionMemory_max,resultSerializationTime_sum,resultSize_max,sr_fetchWaitTime_sum,sr_localBlocksFetched_sum,sr_localBytesRead_sum,sr_remoteBlocksFetched_sum,sr_remoteBytesRead_sum,sr_remoteBytesReadToDisk_sum,sr_totalBytesRead_sum,sw_bytesWritten_sum,sw_recordsWritten_sum,sw_writeTime_sum
1,"local-1622814619968",0,"count at <console>:28",213,3087,6500,13414,48.46,0,26735,1598,10,125.5,6500,3433,12095,13414,0,0,336,0,0,0,0,8,8075,0,2600,80279908,0,0,0,80279908,80279908,2600,944
1,"local-1622814619968",0,"count at <console>:28",213,3087,6500,13414,48.46,0,26735,1598,10,125.5,6500,3433,12095,13414,0,0,336,0,0,0,0,8,8075,0,2600,80279908,0,0,0,80279908,80279908,2600,1001
Original file line number Diff line number Diff line change
@@ -1,3 +1,3 @@
appIndex,appID,sqlID,description,numTasks,Duration,executorCPUTime,executorRunTime,executorCPURatio,diskBytesSpilled_sum,duration_sum,duration_max,duration_min,duration_avg,executorCPUTime_sum,executorDeserializeCPUTime_sum,executorDeserializeTime_sum,executorRunTime_sum,input_bytesRead_sum,input_recordsRead_sum,jvmGCTime_sum,memoryBytesSpilled_sum,output_bytesWritten_sum,output_recordsWritten_sum,peakExecutionMemory_max,resultSerializationTime_sum,resultSize_max,sr_fetchWaitTime_sum,sr_localBlocksFetched_sum,sr_localBytesRead_sum,sr_remoteBlocksFetched_sum,sr_remoteBytesRead_sum,sr_remoteBytesReadToDisk_sum,sr_totalBytesRead_sum,sw_bytesWritten_sum,sw_recordsWritten_sum,sw_writeTime_sum
1,"local-1622814619968",0,"count at <console>:28",213,3087,6500,13414,48.46,0,26735,1598,10,125.5,6500,3433,12095,13414,0,0,336,0,0,0,0,8,8075,0,2600,80279908,0,0,0,80279908,80279908,2600,944
2,"local-1622821994212",0,"count at <console>:28",213,3041,7045,13522,52.1,0,25761,1624,9,120.9,7045,3021,11178,13522,0,0,424,0,0,0,0,10,8075,0,2600,80279920,0,0,0,80279920,80279920,2600,849
1,"local-1622814619968",0,"count at <console>:28",213,3087,6500,13414,48.46,0,26735,1598,10,125.5,6500,3433,12095,13414,0,0,336,0,0,0,0,8,8075,0,2600,80279908,0,0,0,80279908,80279908,2600,1001
2,"local-1622821994212",0,"count at <console>:28",213,3041,7045,13522,52.1,0,25761,1624,9,120.9,7045,3021,11178,13522,0,0,424,0,0,0,0,10,8075,0,2600,80279920,0,0,0,80279920,80279920,2600,901
Original file line number Diff line number Diff line change
@@ -1,5 +1,5 @@
appIndex,stageId,numTasks,Duration,diskBytesSpilled_sum,duration_sum,duration_max,duration_min,duration_avg,executorCPUTime_sum,executorDeserializeCPUTime_sum,executorDeserializeTime_sum,executorRunTime_sum,input_bytesRead_sum,input_recordsRead_sum,jvmGCTime_sum,memoryBytesSpilled_sum,output_bytesWritten_sum,output_recordsWritten_sum,peakExecutionMemory_max,resultSerializationTime_sum,resultSize_max,sr_fetchWaitTime_sum,sr_localBlocksFetched_sum,sr_localBytesRead_sum,sr_remoteBlocksFetched_sum,sr_remoteBytesRead_sum,sr_remoteBytesReadToDisk_sum,sr_totalBytesRead_sum,sw_bytesWritten_sum,sw_recordsWritten_sum,sw_writeTime_sum
1,0,6,1761,0,9455,1624,1540,1575.8,2914,1283,5056,4248,0,0,228,0,0,0,0,3,2951,0,0,0,0,0,0,0,40132263,1200,373
1,1,6,1666,0,9274,1621,1528,1545.7,2568,1004,5016,4099,0,0,196,0,0,0,0,4,2951,0,0,0,0,0,0,0,40132257,1200,473
1,2,200,592,0,6937,221,9,34.7,1518,696,1065,5125,0,0,0,0,0,0,0,3,7402,0,2400,80264520,0,0,0,80264520,15400,200,3
1,0,6,1761,0,9455,1624,1540,1575.8,2914,1283,5056,4248,0,0,228,0,0,0,0,3,2951,0,0,0,0,0,0,0,40132263,1200,376
1,1,6,1666,0,9274,1621,1528,1545.7,2568,1004,5016,4099,0,0,196,0,0,0,0,4,2951,0,0,0,0,0,0,0,40132257,1200,475
1,2,200,592,0,6937,221,9,34.7,1518,696,1065,5125,0,0,0,0,0,0,0,3,7402,0,2400,80264520,0,0,0,80264520,15400,200,50
1,3,1,101,0,95,95,95,95.0,45,38,41,50,0,0,0,0,0,0,0,0,8075,0,200,15400,0,0,0,15400,0,0,0
Original file line number Diff line number Diff line change
@@ -1,5 +1,5 @@
appIndex,stageId,numTasks,Duration,diskBytesSpilled_sum,duration_sum,duration_max,duration_min,duration_avg,executorCPUTime_sum,executorDeserializeCPUTime_sum,executorDeserializeTime_sum,executorRunTime_sum,input_bytesRead_sum,input_recordsRead_sum,jvmGCTime_sum,memoryBytesSpilled_sum,output_bytesWritten_sum,output_recordsWritten_sum,peakExecutionMemory_max,resultSerializationTime_sum,resultSize_max,sr_fetchWaitTime_sum,sr_localBlocksFetched_sum,sr_localBytesRead_sum,sr_remoteBlocksFetched_sum,sr_remoteBytesRead_sum,sr_remoteBytesReadToDisk_sum,sr_totalBytesRead_sum,sw_bytesWritten_sum,sw_recordsWritten_sum,sw_writeTime_sum
1,0,6,1743,0,9518,1598,1580,1586.3,2509,1391,5309,4043,0,0,168,0,0,0,0,3,2951,0,0,0,0,0,0,0,40132250,1200,397
1,1,6,1631,0,9434,1582,1568,1572.3,2405,1065,5273,3998,0,0,168,0,0,0,0,5,2951,0,0,0,0,0,0,0,40132258,1200,505
1,2,200,688,0,7705,237,10,38.5,1556,940,1474,5337,0,0,0,0,0,0,0,0,7359,0,2400,80264508,0,0,0,80264508,15400,200,42
1,0,6,1743,0,9518,1598,1580,1586.3,2509,1391,5309,4043,0,0,168,0,0,0,0,3,2951,0,0,0,0,0,0,0,40132250,1200,400
1,1,6,1631,0,9434,1582,1568,1572.3,2405,1065,5273,3998,0,0,168,0,0,0,0,5,2951,0,0,0,0,0,0,0,40132258,1200,508
1,2,200,688,0,7705,237,10,38.5,1556,940,1474,5337,0,0,0,0,0,0,0,0,7359,0,2400,80264508,0,0,0,80264508,15400,200,93
1,3,1,83,0,78,78,78,78.0,30,37,39,36,0,0,0,0,0,0,0,0,8075,0,200,15400,0,0,0,15400,0,0,0
Original file line number Diff line number Diff line change
@@ -1,9 +1,9 @@
appIndex,stageId,numTasks,Duration,diskBytesSpilled_sum,duration_sum,duration_max,duration_min,duration_avg,executorCPUTime_sum,executorDeserializeCPUTime_sum,executorDeserializeTime_sum,executorRunTime_sum,input_bytesRead_sum,input_recordsRead_sum,jvmGCTime_sum,memoryBytesSpilled_sum,output_bytesWritten_sum,output_recordsWritten_sum,peakExecutionMemory_max,resultSerializationTime_sum,resultSize_max,sr_fetchWaitTime_sum,sr_localBlocksFetched_sum,sr_localBytesRead_sum,sr_remoteBlocksFetched_sum,sr_remoteBytesRead_sum,sr_remoteBytesReadToDisk_sum,sr_totalBytesRead_sum,sw_bytesWritten_sum,sw_recordsWritten_sum,sw_writeTime_sum
1,0,6,1743,0,9518,1598,1580,1586.3,2509,1391,5309,4043,0,0,168,0,0,0,0,3,2951,0,0,0,0,0,0,0,40132250,1200,397
1,1,6,1631,0,9434,1582,1568,1572.3,2405,1065,5273,3998,0,0,168,0,0,0,0,5,2951,0,0,0,0,0,0,0,40132258,1200,505
1,2,200,688,0,7705,237,10,38.5,1556,940,1474,5337,0,0,0,0,0,0,0,0,7359,0,2400,80264508,0,0,0,80264508,15400,200,42
1,0,6,1743,0,9518,1598,1580,1586.3,2509,1391,5309,4043,0,0,168,0,0,0,0,3,2951,0,0,0,0,0,0,0,40132250,1200,400
1,1,6,1631,0,9434,1582,1568,1572.3,2405,1065,5273,3998,0,0,168,0,0,0,0,5,2951,0,0,0,0,0,0,0,40132258,1200,508
1,2,200,688,0,7705,237,10,38.5,1556,940,1474,5337,0,0,0,0,0,0,0,0,7359,0,2400,80264508,0,0,0,80264508,15400,200,93
1,3,1,83,0,78,78,78,78.0,30,37,39,36,0,0,0,0,0,0,0,0,8075,0,200,15400,0,0,0,15400,0,0,0
2,0,6,1761,0,9455,1624,1540,1575.8,2914,1283,5056,4248,0,0,228,0,0,0,0,3,2951,0,0,0,0,0,0,0,40132263,1200,373
2,1,6,1666,0,9274,1621,1528,1545.7,2568,1004,5016,4099,0,0,196,0,0,0,0,4,2951,0,0,0,0,0,0,0,40132257,1200,473
2,2,200,592,0,6937,221,9,34.7,1518,696,1065,5125,0,0,0,0,0,0,0,3,7402,0,2400,80264520,0,0,0,80264520,15400,200,3
2,0,6,1761,0,9455,1624,1540,1575.8,2914,1283,5056,4248,0,0,228,0,0,0,0,3,2951,0,0,0,0,0,0,0,40132263,1200,376
2,1,6,1666,0,9274,1621,1528,1545.7,2568,1004,5016,4099,0,0,196,0,0,0,0,4,2951,0,0,0,0,0,0,0,40132257,1200,475
2,2,200,592,0,6937,221,9,34.7,1518,696,1065,5125,0,0,0,0,0,0,0,3,7402,0,2400,80264520,0,0,0,80264520,15400,200,50
2,3,1,101,0,95,95,95,95.0,45,38,41,50,0,0,0,0,0,0,0,0,8075,0,200,15400,0,0,0,15400,0,0,0
Loading