Unrolling the scalar dot operations in the OCL kernel yields higher GFLOPS #113

Open
chillingche opened this issue Jun 12, 2022 · 2 comments
chillingche commented Jun 12, 2022

Before unrolling:

#define DOT_A4B16C4(a, b, c)                                        \
    {                                                               \
        c.x += (a.x * b.s0 + a.y * b.s1 + a.z * b.s2 + a.w * b.s3); \
        c.y += (a.x * b.s4 + a.y * b.s5 + a.z * b.s6 + a.w * b.s7); \
        c.z += (a.x * b.s8 + a.y * b.s9 + a.z * b.sa + a.w * b.sb); \
        c.w += (a.x * b.sc + a.y * b.sd + a.z * b.se + a.w * b.sf); \
    }
./test_convolution_ocl 32 128 128 32 3 3 1 1 0
[DEBUG] thread 15285 OCLContext 0x589b080390 constructor start
[DEBUG] thread 15285 try to dlopen libQUALCOMM_Adreno_650_map.so failed, dlopen failed: library "libQUALCOMM_Adreno_650_map.so" not found, create kernel from source code
[DEBUG] thread 15285 gcl_kernel_source 0xb4000074402203c0 constructor
[DEBUG] thread 15285 OCLContext 0x589b080390 constructor end
[DEBUG] thread 15285 get forward run info from cache fail, try to find best forward run info
[DEBUG] thread 15285 KERNEL>>> unknow_conv_direct_sh1_qc_iom_3311 runInfo: ls <0 0 0> executeTime = 2797.056000 us
[DEBUG] thread 15285 KERNEL>>> unknow_conv_direct_sh1_qc_iom_3321 runInfo: ls <0 0 0> executeTime = 1689.088000 us
[DEBUG] thread 15285 KERNEL>>> unknow_conv_direct_sh1_qc_iom_3331 runInfo: ls <0 0 0> executeTime = 1257.984000 us
[DEBUG] thread 15285 KERNEL>>> unknow_conv_direct_sh1_qc_iom_3341 runInfo: ls <0 0 0> executeTime = 1140.992000 us
[DEBUG] thread 15285 KERNEL>>> unknow_conv_direct_sh1_qc_iom_3351 runInfo: ls <0 0 0> executeTime = 1051.136000 us
[DEBUG] thread 15285 KERNEL>>> unknow_conv_direct_sh1_qc_iom_3361 runInfo: ls <0 0 0> executeTime = 1120.000000 us
[DEBUG] thread 15285 KERNEL>>> unknow_conv_direct_sh1_qc_iom_3371 runInfo: ls <0 0 0> executeTime = 1175.040000 us
[DEBUG] thread 15285 KERNEL>>> unknow_conv_direct_sh1_qc_iom_3381 runInfo: ls <0 0 0> executeTime = 1026.048000 us
[DEBUG] thread 15285 KERNEL>>> unknow_conv_direct_sh1_qc_iom_3312 runInfo: ls <0 0 0> executeTime = 2488.832000 us
[DEBUG] thread 15285 KERNEL>>> unknow_conv_direct_sh1_qc_iom_3322 runInfo: ls <0 0 0> executeTime = 1725.952000 us
[DEBUG] thread 15285 KERNEL>>> unknow_conv_direct_sh1_qc_iom_3332 runInfo: ls <0 0 0> executeTime = 1430.016000 us
[DEBUG] thread 15285 KERNEL>>> unknow_conv_direct_sh1_qc_iom_3342 runInfo: ls <0 0 0> executeTime = 1312.000000 us
[DEBUG] thread 15285 KERNEL>>> unknow_conv_direct_sh1_qc_iom_3314 runInfo: ls <0 0 0> executeTime = 5136.896000 us
[DEBUG] thread 15285 KERNEL>>> unknow_conv_direct_sh1_qc_iom_3324 runInfo: ls <0 0 0> executeTime = 3611.136000 us
[DEBUG] thread 15285 KERNEL>>> unknow_conv_direct_sh1_qc_iom_3334 runInfo: ls <0 0 0> executeTime = 3038.976000 us
[DEBUG] thread 15285 enqueue_fill_image runInfo: executeTime = 17.920000 us
[DEBUG] thread 15285 KERNEL>>> unknow_conv_direct_trans_flt_hw_44 runInfo: executeTime = 13.056000 us
[DEBUG] thread 15285 DATATRANS>>> enqueue_write_buffer runInfo: executeTime = 77.056000 us
[DEBUG] thread 15285 KERNEL>>> unknow_mem_trans_om_nchw_to_nchwc4 runInfo: executeTime = 80.128000 us
[DEBUG] thread 15285 Get memory val without allocated, the capacitySize is 0
[DEBUG] thread 15285 Get memory val without allocated, the capacitySize is 0
[DEBUG] thread 15285 KERNEL>>> unknow_conv_direct_sh1_qc_iom_3381 runInfo: ls <0 0 0> executeTime = 1022.976000 us
[DEBUG] thread 15285 KERNEL>>> unknow_conv_direct_sh1_qc_iom_3381 runInfo: ls <0 0 0> executeTime = 1000.960000 us
[DEBUG] thread 15285 SELECT LS KERNEL>>> unknow_conv_direct_sh1_qc_iom_3381 runInfo: best ls = 8 1 8 executeTime = 860.928000 us
[DEBUG] thread 15285 KERNEL>>> unknow_conv_direct_sh1_qc_iom_3381 runInfo: ls <8 1 8> executeTime = 860.160000 us
[INFO] thread 15285 min_time = 0.860160
[INFO] thread 15285 max_time = 0.860160
[INFO] thread 15285 avg_time = -0.000000
[DEBUG] thread 15285 KERNEL>>> unknow_mem_trans_im_nchwc4_to_nchw runInfo: executeTime = 140.032000 us
[DEBUG] thread 15285 DATATRANS>>> enqueue_read_buffer runInfo: executeTime = 79.104000 us
[INFO] thread 15285 16bit,          Convolution,            (1 32 1 128 128)+(32 32 1 3 3)/(1 1 1 1 0 0 1 1 1 1)=(1 32 1 128 128),     TIME    0.860ms,        GFLOPS  351.695
abs(diff) >= 1.000000e+00f, number = 0
abs(diff) >= 1.000000e-01f, number = 0
abs(diff) >= 1.000000e-02f, number = 13129
abs(diff) >= 1.000000e-03f, number = 339363
abs(diff) >= 1.000000e-04f, number = 123968
abs(diff) >= 1.000000e-05f, number = 681
abs(diff) >= 0.000000e+00f, number = 47147
maxabs = 0.046875, a = 4.781250, b = 4.828125 @ 357254
maxrel = 10498.046875, a = 0.002625, b = -0.002625 @ 278147
[DEBUG] thread 15285 OCLContext 0x589b080390 deconstructor start
[DEBUG] thread 15285 gcl_kernel_source 0xb4000074402203c0 constructor
[DEBUG] thread 15285 OCLContext 0x589b080390 deconstructor end

After unrolling:

#define DOT_A4B16C4(a, b, c) \
    {                        \
        c.x += (a.x * b.s0); \
        c.x += (a.y * b.s1); \
        c.x += (a.z * b.s2); \
        c.x += (a.w * b.s3); \
        c.y += (a.x * b.s4); \
        c.y += (a.y * b.s5); \
        c.y += (a.z * b.s6); \
        c.y += (a.w * b.s7); \
        c.z += (a.x * b.s8); \
        c.z += (a.y * b.s9); \
        c.z += (a.z * b.sa); \
        c.z += (a.w * b.sb); \
        c.w += (a.x * b.sc); \
        c.w += (a.y * b.sd); \
        c.w += (a.z * b.se); \
        c.w += (a.w * b.sf); \
    }
./test_convolution_ocl 32 128 128 32 3 3 1 1 0
[DEBUG] thread 17343 OCLContext 0x5e124b4390 constructor start
[DEBUG] thread 17343 try to dlopen libQUALCOMM_Adreno_650_map.so failed, dlopen failed: library "libQUALCOMM_Adreno_650_map.so" not found, create kernel from source code
[DEBUG] thread 17343 gcl_kernel_source 0xb400007ab98203c0 constructor
[DEBUG] thread 17343 OCLContext 0x5e124b4390 constructor end
[DEBUG] thread 17343 get forward run info from cache fail, try to find best forward run info
[DEBUG] thread 17343 KERNEL>>> unknow_conv_direct_sh1_qc_iom_3311 runInfo: ls <0 0 0> executeTime = 2744.832000 us
[DEBUG] thread 17343 KERNEL>>> unknow_conv_direct_sh1_qc_iom_3321 runInfo: ls <0 0 0> executeTime = 1667.072000 us
[DEBUG] thread 17343 KERNEL>>> unknow_conv_direct_sh1_qc_iom_3331 runInfo: ls <0 0 0> executeTime = 1198.080000 us
[DEBUG] thread 17343 KERNEL>>> unknow_conv_direct_sh1_qc_iom_3341 runInfo: ls <0 0 0> executeTime = 1105.920000 us
[DEBUG] thread 17343 KERNEL>>> unknow_conv_direct_sh1_qc_iom_3351 runInfo: ls <0 0 0> executeTime = 1036.032000 us
[DEBUG] thread 17343 KERNEL>>> unknow_conv_direct_sh1_qc_iom_3361 runInfo: ls <0 0 0> executeTime = 944.896000 us
[DEBUG] thread 17343 KERNEL>>> unknow_conv_direct_sh1_qc_iom_3371 runInfo: ls <0 0 0> executeTime = 958.976000 us
[DEBUG] thread 17343 KERNEL>>> unknow_conv_direct_sh1_qc_iom_3381 runInfo: ls <0 0 0> executeTime = 907.008000 us
[DEBUG] thread 17343 KERNEL>>> unknow_conv_direct_sh1_qc_iom_3312 runInfo: ls <0 0 0> executeTime = 2529.024000 us
[DEBUG] thread 17343 KERNEL>>> unknow_conv_direct_sh1_qc_iom_3322 runInfo: ls <0 0 0> executeTime = 1652.992000 us
[DEBUG] thread 17343 KERNEL>>> unknow_conv_direct_sh1_qc_iom_3332 runInfo: ls <0 0 0> executeTime = 1390.848000 us
[DEBUG] thread 17343 KERNEL>>> unknow_conv_direct_sh1_qc_iom_3342 runInfo: ls <0 0 0> executeTime = 1227.008000 us
[DEBUG] thread 17343 KERNEL>>> unknow_conv_direct_sh1_qc_iom_3314 runInfo: ls <0 0 0> executeTime = 5095.936000 us
[DEBUG] thread 17343 KERNEL>>> unknow_conv_direct_sh1_qc_iom_3324 runInfo: ls <0 0 0> executeTime = 3202.048000 us
[DEBUG] thread 17343 KERNEL>>> unknow_conv_direct_sh1_qc_iom_3334 runInfo: ls <0 0 0> executeTime = 2576.896000 us
[DEBUG] thread 17343 enqueue_fill_image runInfo: executeTime = 17.920000 us
[DEBUG] thread 17343 KERNEL>>> unknow_conv_direct_trans_flt_hw_44 runInfo: executeTime = 12.800000 us
[DEBUG] thread 17343 DATATRANS>>> enqueue_write_buffer runInfo: executeTime = 68.864000 us
[DEBUG] thread 17343 KERNEL>>> unknow_mem_trans_om_nchw_to_nchwc4 runInfo: executeTime = 78.080000 us
[DEBUG] thread 17343 Get memory val without allocated, the capacitySize is 0
[DEBUG] thread 17343 Get memory val without allocated, the capacitySize is 0
[DEBUG] thread 17343 KERNEL>>> unknow_conv_direct_sh1_qc_iom_3381 runInfo: ls <0 0 0> executeTime = 914.944000 us
[DEBUG] thread 17343 KERNEL>>> unknow_conv_direct_sh1_qc_iom_3381 runInfo: ls <0 0 0> executeTime = 895.232000 us
[DEBUG] thread 17343 SELECT LS KERNEL>>> unknow_conv_direct_sh1_qc_iom_3381 runInfo: best ls = 8 1 8 executeTime = 760.064000 us
[DEBUG] thread 17343 KERNEL>>> unknow_conv_direct_sh1_qc_iom_3381 runInfo: ls <8 1 8> executeTime = 768.000000 us
[INFO] thread 17343 min_time = 0.768000
[INFO] thread 17343 max_time = 0.768000
[INFO] thread 17343 avg_time = -0.000000
[DEBUG] thread 17343 KERNEL>>> unknow_mem_trans_im_nchwc4_to_nchw runInfo: executeTime = 139.008000 us
[DEBUG] thread 17343 DATATRANS>>> enqueue_read_buffer runInfo: executeTime = 77.056000 us
[INFO] thread 17343 16bit,          Convolution,            (1 32 1 128 128)+(32 32 1 3 3)/(1 1 1 1 0 0 1 1 1 1)=(1 32 1 128 128),     TIME    0.768ms,        GFLOPS  393.899
abs(diff) >= 1.000000e+00f, number = 0
abs(diff) >= 1.000000e-01f, number = 0
abs(diff) >= 1.000000e-02f, number = 7769
abs(diff) >= 1.000000e-03f, number = 349884
abs(diff) >= 1.000000e-04f, number = 118162
abs(diff) >= 1.000000e-05f, number = 814
abs(diff) >= 0.000000e+00f, number = 47659
maxabs = 0.039062, a = -3.292969, b = -3.253906 @ 68999
maxrel = 11718.750000, a = 0.002930, b = -0.002930 @ 386530
[DEBUG] thread 17343 OCLContext 0x5e124b4390 deconstructor start
[DEBUG] thread 17343 gcl_kernel_source 0xb400007ab98203c0 constructor
[DEBUG] thread 17343 OCLContext 0x5e124b4390 deconstructor end
@chillingche
Author

./test_convolution_ocl 64 256 256 32 3 3 1 1 0

@yuxianzhi
Contributor

image
