-
Notifications
You must be signed in to change notification settings - Fork 89
resnext , a case study demonstrating the importance of library tuning
MIGraphX allows one to take pre-trained models in ONNX format and deploy them on the AMD ROCm software platform.
As capabilities are added to MIGraphX, additional models are enabled. The ResNext model was one example recently enabled to work with MIGraphX. Our ONNX file example was created using PyTorch 0.4.0 from Remi Cadene's pretrained model collection.
We weren't certain what to expect, but initially the performance of ResNext appeared as it might be slow. This wiki page describes how the analysis was done and how performance of ResNext significantly improved. Along the way, this also provides a case study of the importance of MIOpen library tuning for particular models.
On initial runs on our test system showed ResNext processing 6 images per second. Actual performance is going to depend on hardware architecture, clock speeds as well as the model itself - however, this was also a much lower rate than other models. We weren't quite certain if this meant the model was more complex and involved, or if something wasn't quite tuned for the hardware.
The first tool to use was the HCC_PROFILE mode of the ROCm platform together with the "rpt" tool in the "hcc" package. When one runs a ROCm program with environment variable HCC_PROFILE=2, a detailed activity log is emitted. A short exerpt is included below:
profile: kernel; _ZN12_GLOBAL__N_110hip_fill_nILj256EPjmjEEvT0_T1_T2_; 184.3 us; 1291568592072; 1291568776381; #0.0.14;
profile: barrier; depcnt=0,acq=acc,rel=acc; 2.1 us; 1291569091218; 1291569093292; #0.0.17;
profile: barrier; depcnt=0,acq=acc,rel=acc; 1.6 us; 1291568779493; 1291568781122; #0.0.15;
profile: kernel; sp3AsmConvRxSU; 1194.2 us; 1291569612568; 1291570806725; #0.0.18;
profile: kernel; gcnAsmConv1x1U; 770.4 us; 1291574646888; 1291575417312; #0.0.19;
profile: kernel; MIOpenConv1x1; 209.6 us; 1291576427278; 1291576636922; #0.0.20;
profile: barrier; depcnt=0,acq=sys,rel=sys; 7.7 us; 1291576712952; 1291576720656; #0.0.21;
profile: copyslo; HostToDevice_sync_slow; 9007.9 us; 1291633572234; 1291642580120; #0.0.0; 51380224 bytes; 49.
0 MB; 5.4 GB/s;
profile: copyslo; HostToDevice_sync_slow; 37.9 us; 1291645338014; 1291645375925; #0.0.0; 36864 bytes; 0.0
MB; 0.9 GB/s;
profile: barrier; depcnt=0,acq=acc,rel=acc; 4.1 us; 1291653503154; 1291653507303; #0.0.22;
profile: kernel; Im2Col; 93.0 us; 1291660242434; 1291660335478; #0.0.23;
This detailed log can be summarized in a more understandable form using
rpt logfile > report.txt
An example of the report produced
ROI_START: DATA 0.000000: +0.00 copyslo #0.0.0 3: HostToDevice_sync_slow_9633792_bytes
ROI_STOP : GPU0 280057.649620: +0.00 barrier #0.0.50205 50624: depcnt=0,acq=sys,rel=sys
ROI_TIME= 280.058 secs
Resource=GPU0 Showing 20/41 records 97.04% busy
Total(%) Time(us) Calls Avg(us) Min(us) Max(us) Name
93.44% 261694931.5 979 267308.4 218.1 726547.5 gcnAsmConv1x1U
1.94% 5440843.5 4740 1147.9 356.2 1601.4 Cijk_Ailk_Bljk_SB_MT128x064x08_APM01_AF0EM01_AF1EM01_ASEM01_BL1_DTL0_EPS1_FL00_GRVW04_GSU01_ISA906_IU
01_K1_KLA_LPA00_LPB00_LDL01_MGWVW01_NLCA01_NLCB01_PGR1_PLR1_SNLL1_TT08_04_USFGRO00_VAW01_VW04_WG16_16_01_WGM01
1.65% 4618636.5 162 28510.1 10394.3 85382.0 gap 10000us-100000us
0.61% 1695046.1 4 423761.5 128938.8 875528.8 gap >=100000us
0.55% 1550620.5 3366 460.7 248.8 2617.9 MIOpenGroupConvUni
...
The first few lines starting after "GPU0" give several cues that something may be amiss:
- In total the GPU0 resource was marked "busy" some 98% of the time. When run with other models, it was more typically in the 50-60% range. (A common reason for a GPU that is not-busy is when intermediate results are copied from GPU memory between kernel execution, MIGraphX can reduce those copies through fusion optimizations, but 98% busy is unusual).
- One single type of routine "gcnAsmConv1x1U" dominates the time accounting for 94% of the total. There seems to be a wide range of times of those routines.
With the clues above, one of the first things tried was to update the MIOpen performance database. The MIOpen library comes pre-installed with a database with pretuned values for many models, but tuning was to check possibility that particular kernels used in ResNext might be missing.
There is both a global and a local version of the tuning database. Before tuning, the last lines of the local database (in ~/.config/miopen/gfx906_64.cd.updb.txt) looked like this:
2048-8-8-1x1-384-8-8-32-0x0-1x1-1x1-0-NCHW-FP32-F=ConvAsm1x1U:4,8,4,16,1,4,4,2;ConvOclDirectFwd1x1:1,64,1,1,1,4,16,2048,0
2048-8-8-1x1-448-8-8-32-0x0-1x1-1x1-0-NCHW-FP32-F=ConvAsm1x1U:2,8,2,32,2,4,1,4;ConvOclDirectFwd1x1:1,64,1,1,1,1,16,256,0
2048-8-8-1x1-192-8-8-32-0x0-1x1-1x1-0-NCHW-FP32-F=ConvAsm1x1U:1,8,1,64,2,4,2,2;ConvOclDirectFwd1x1:1,64,1,1,1,0,16,256,0
3-224-224-3x3-64-224-224-16-1x1-1x1-1x1-0-NCHW-FP32-F=ConvOclDirectFwd:8,32,32,32,4,1,4,1,2;ConvOclDirectFwdFused:8,32,32,32,4,1,4,1,2
Multiple options for tuning are described at MIOpen performance database. For our exercise, I ran MIGraphX on the model using the environment variable
MIOPEN_FIND_ENFORCE=3
With this variable, MIOpen searches for an updates the performance database with the fastest kernels found. This can take a while, but MIOpen provides progress updates of the search that give indications the database is updated. These look a bit like this
MIOpen(HIP): Warning [Monitor] 1080/0/9375 0.815999, best within recent 1081: 0.815999 #913 2,16,2,64,3,2,1,1, ETA:23.0684 sec.
MIOpen(HIP): Warning [Monitor] 2151/0/9375 0.815999, best within recent 1071: 0.855104 #2013 2,16,2,64,2,2,2,1, ETA:20.1665 sec.
MIOpen(HIP): Warning [Monitor] 3152/0/9375 0.815999, best within recent 1001: 0.87024 #3085 2,16,4,64,1,2,4,1, ETA:17.7788 sec.
MIOpen(HIP): Warning [Monitor] 4190/0/9375 0.815999, best within recent 1038: 0.830816 #3936 3,16,3,64,2,1,1,2, ETA:14.8569 sec.
MIOpen(HIP): Warning [Monitor] 5274/0/9375 0.747264, best within recent 1084: 0.747264 #4419 3,16,3,64,2,2,1,2, ETA:11.6694 sec.
MIOpen(HIP): Warning [Monitor] 6295/0/9375 0.721632, best within recent 1021: 0.721632 #5777 1,16,1,64,4,2,2,2, ETA:8.81193 sec.
MIOpen(HIP): Warning [Monitor] 7291/0/9375 0.721632, best within recent 996: 0.78912 #7163 3,8,3,64,4,1,1,4, ETA:6.00544 sec.
MIOpen(HIP): Warning [Monitor] 8245/0/9375 0.637984, best within recent 954: 0.637984 #7592 1,16,1,64,4,2,1,4, ETA:3.2907 sec.
MIOpen(HIP): Warning [Monitor] 9107/0/9375 0.637984, best within recent 862: 0.72544 #9060 2,16,2,64,2,2,1,8, ETA:0.794937 sec.
MIOpen(HIP): Warning [GenericSearch] Done: 9375/0/9375, best #7592 0.637984 1,16,1,64,4,2,1,4
A side effect of this exercise was that six additional entries had been added to local performance database.
256-56-56-1x1-256-56-56-16-0x0-1x1-1x1-0-NCHW-FP32-F=ConvAsm1x1U:3,16,3,64,2,2,1,1;ConvOclDirectFwd1x1:1,64,1,1,1,1,64,2048,0;ConvActivAsm1x1U:3,16,5,64,2,1
256-56-56-1x1-512-56-56-16-0x0-1x1-1x1-0-NCHW-FP32-F=ConvAsm1x1U:2,16,2,32,3,2,1,4;ConvOclDirectFwd1x1:1,64,1,1,1,4,64,2048,0;ConvActivAsm1x1U:3,16,3,64,4,1
512-28-28-1x1-512-28-28-16-0x0-1x1-1x1-0-NCHW-FP32-F=ConvAsm1x1U:3,16,5,32,2,2,2,1;ConvOclDirectFwd1x1:1,64,1,1,1,4,16,2048,0;ConvActivAsm1x1U:4,16,5,32,2,2
512-28-28-1x1-1024-28-28-16-0x0-1x1-1x1-0-NCHW-FP32-F=ConvAsm1x1U:3,16,5,16,1,2,2,2;ConvOclDirectFwd1x1:1,64,1,1,1,4,16,2048,0;ConvActivAsm1x1U:3,16,5,32,2,1
1024-14-14-1x1-1024-14-14-16-0x0-1x1-1x1-0-NCHW-FP32-F=ConvAsm1x1U:2,32,7,32,1,1,1,1;ConvOclDirectFwd1x1:1,64,1,1,1,4,64,2048,0
1024-14-14-1x1-2048-14-14-16-0x0-1x1-1x1-0-NCHW-FP32-F=ConvAsm1x1U:4,16,7,16,1,2,2,1;ConvOclDirectFwd1x1:1,64,1,1,1,4,16,2048,0
2048-7-7-1x1-2048-7-7-16-0x0-1x1-1x1-0-NCHW-FP32-F=ConvAsm1x1U:4,16,4,16,1,2,1,2;ConvOclDirectFwd1x1:1,64,1,1,1,4,64,2048,0
Once the performance database was updated, it was time to try the experiment again. The performance now showed 148 images per second, a factor of almost 25x improvement.
In addition, if one looks at the HCC_PROFILE=2 output using rpt, one also sees some overall differences:
ROI_START: DATA 0.000000: +0.00 copyslo #0.0.0 3: HostToDevice_sync_slow_9633792_bytes
ROI_STOP : GPU0 17744.207163: +0.00 barrier #0.0.40105 40524: depcnt=0,acq=sys,rel=sys
ROI_TIME= 17.744 secs
Resource=GPU0 Showing 20/41 records 60.84% busy
Total(%) Time(us) Calls Avg(us) Min(us) Max(us) Name
39.11% 6938999.5 6029 1150.9 218.1 2969.3 gcnAsmConv1x1U
19.75% 3504600.8 106 33062.3 11560.1 83246.2 gap 10000us-100000us
9.47% 1679962.4 3366 499.1 248.8 2599.8 MIOpenGroupConvUni
9.14% 1621510.2 4 405377.5 140392.1 710962.0 gap >=100000us
3.00% 531562.3 7776 68.4 50.0 100.0 gap 50us-100us
...
These changes include:
- GPU0 is now only 61% busy.
- The gcnAsmConv1x1U routines still take the most time, but now account for only 39% of the total. In addition, the maximum time taken by one of those calls has gone down to 2969 us.
This example highlights the importance tuning MIOpen can have on performance. It highlights tools such as HCC_PROFILE mode that can help in investigation and describes symptoms to look for in a mis-tuned application.