Enhance combineLoadingLiterals #67

Open
nomaddo opened this issue Mar 28, 2018 · 13 comments
Assignees
Labels
enhancement optimization related to an optimization step

Comments

@nomaddo
Collaborator

nomaddo commented Mar 28, 2018

The output of deepCL/backpropweights.cl can be improved by reducing the number of ldi instructions.
For example, the constant 256 is loaded 69 times.
If we can combine these loads into one instruction, a significant speedup can be achieved.
https://gist.github.com/nomaddo/220b867143eff68f2c0f83f9188ab382

As a smaller example, the following code can be improved:

kernel void f(global float * a) {
  for (int i = 0; i < 4; i++){
    a[i] += 129;
  }
}
// Module with 1 kernels, global data with 0 words (64-bit each), starting at offset 1 words and 0 words of stack-frame
// Kernel 'f' with 37 instructions, offset 2, with following parameters: __global out float* a (4 B, 1 items)
// label: %start_of_function
or ra0, unif, unif
// label: %tmp.0
nop.never 
or tmu0s, ra0, ra0
nop.load_tmu0.never 
ldi r0, 1124139008                       // should be fused
fadd r0, r4, r0
or -, mutex_acq, mutex_acq
ldi vpw_setup, vpm_setup(size: 16 words, stride: 1 rows, address: h32(0))
or vpm, r0, r0
ldi vpw_setup, vdw_setup(rows: 4, elements: 1 words, address: h32(0))
ldi vpw_setup, vdw_setup(stride: 0)
add tmu0s, ra0, 4 (4)
nop.load_tmu0.never 
ldi r1, 1124139008                       // should be fused
fadd r0, r4, r1
or vpm, r0, r0
add tmu0s, ra0, 8 (8)
nop.load_tmu0.never 
fadd r0, r4, r1
or vpm, r0, r0
add tmu0s, ra0, 12 (12)
nop.load_tmu0.never 
ldi r0, 1124139008                       // should be fused
fadd r0, r4, r0
or vpm, r0, r0
or vpw_addr, ra0, ra0
or mutex_rel, 1 (1), 1 (1)
// label: %end_of_function
or r0, unif, unif
or.setf -, elem_num, r0
brr.ifallzc (pc+4) + -33 // to %start_of_function
nop.never 
nop.never 
nop.never 
not irq, qpu_num
nop.thrend.never 
nop.never 
nop.never 
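
The immediate in the `ldi` lines above, 1124139008, is the IEEE 754 single-precision bit pattern of the constant 129 from the kernel, so every marked load fetches the same value. A small Python sketch to check this (the helper names are illustrative and not part of VC4C):

```python
import struct

def float_to_ldi_immediate(value: float) -> int:
    """Return the 32-bit pattern of a single-precision float as an unsigned int."""
    return struct.unpack("<I", struct.pack("<f", value))[0]

def ldi_immediate_to_float(bits: int) -> float:
    """Interpret an unsigned 32-bit immediate as a single-precision float."""
    return struct.unpack("<f", struct.pack("<I", bits))[0]

print(float_to_ldi_immediate(129.0))       # 1124139008
print(hex(float_to_ldi_immediate(129.0)))  # 0x43010000
print(ldi_immediate_to_float(1124139008))  # 129.0
```
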
@nomaddo
Collaborator Author

nomaddo commented Apr 3, 2018

Currently, combineLoadingLiterals is limited: a load is only replaced if a matching load occurs within the previous 6 instructions, as specified by ACCUMULATOR_THRESHOLD_HINT in canReplaceLiteralLoad.
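
To illustrate the effect of such a window, here is a minimal Python sketch of threshold-limited load combining. It is not the VC4C implementation: the tuple-based instruction model, the function name, and the rule that the window is measured from the surviving load are all assumptions made for the example, and every local is assumed to be written exactly once.

```python
from typing import Dict, List, Tuple

Instr = Tuple[str, str, tuple]  # (opcode, destination, operands)

def combine_literal_loads(block: List[Instr], threshold: int) -> List[Instr]:
    """Drop an `ldi` when the same literal was loaded at most `threshold`
    instructions earlier, and redirect its readers to the earlier load.

    Assumes SSA-like locals: every destination is written exactly once.
    """
    out: List[Instr] = []
    last_load: Dict[int, Tuple[int, str]] = {}  # literal -> (index in out, local)
    renamed: Dict[str, str] = {}                # dropped local -> surviving local
    for op, dest, args in block:
        # Readers of a dropped load now read the surviving local instead.
        args = tuple(renamed.get(a, a) if isinstance(a, str) else a for a in args)
        if op == "ldi":
            literal = args[0]
            prev = last_load.get(literal)
            if prev is not None and len(out) - prev[0] <= threshold:
                renamed[dest] = prev[1]
                continue  # the repeated load is not emitted
            last_load[literal] = (len(out), dest)
        out.append((op, dest, args))
    return out

# Two loads of the same literal, nine instructions apart.
block = [("ldi", "k0", (1000,)), ("add", "x", ("k0", "b"))]
block += [("nop", "-", ())] * 7
block += [("ldi", "k1", (1000,)), ("add", "y", ("k1", "e"))]

print(len(combine_literal_loads(block, threshold=6)))     # 11, window too small
print(len(combine_literal_loads(block, threshold=1000)))  # 10, second ldi removed
```

With the small window the second load stays, with the large one it is dropped and `y` reads `k0`. The sketch ignores register pressure entirely, which is the problem the rest of this thread deals with.
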

I think the following scheme would be better:

  • combine all load instructions within a basic block where possible
  • estimate the number of registers needed (the register-allocation functions can probably be reused?)
  • split the lifetimes of variables by inserting load instructions if that number is exceeded

@nomaddo nomaddo self-assigned this Apr 3, 2018
@nomaddo
Collaborator Author

nomaddo commented Apr 3, 2018

Tested in https://github.com/nomaddo/VC4C/tree/addOption.
With --Xthreshold=1000, the number of output lines for deepCL/backpropweights.cl drops from 2302 to 1940.

With a larger threshold, register allocation fails because there are not enough registers.

@doe300
Owner

doe300 commented Apr 3, 2018

That's impressive. Do you have any numbers on compilation time slow-down?

@nomaddo
Collaborator Author

nomaddo commented Apr 3, 2018

It takes longer, but I don't think that's a problem...
I will measure the other test cases in deepCL.

time      time (opt)
4.26 s    6.48 s
nomaddo@nomaddo-AS:~/VC4C$ time ./build/VC4C -Dcl_clang_storage_class_specifiers -DSIGMOID=1 -DgInPerBlock=16 -DgOutPerBlock=16 -DgNumFilters=16 -DgFilterSize=16 -DgHalfFilterSize=8 -DgFilterSizeSquared=256 -DgPadZeros=0 -DgPoolingSize=16 -DgNumPlanes=16 -DgInputPlanes=16 -DgNumInputPlanes=16 -DgMargin=8 -DgNumOutputPlanes=16 -DinputRow=0 -DoutputRow=0 -DgInputSize=32 -DgInputSizeSquared=1024 -DgOutputSize=32 -DgOutputSizeSquared=1024 -DgNumStripes=2 -DgInputStripeOuterSize=2 -DgInputStripeInnerSize=2 -DgInputStripeMarginSize=0 -DgOutputStripeSize=16 -DgOutputStripeNumRows=12 -DgWorkgroupSize=12  -DgEven=0 -DgPixelsPerThread=128 --Xthreshold=1000 --quiet -o /tmp/hoge testing/deepCL/backpropweights.cl
threshold=1000

real	0m5.961s
user	0m6.480s
sys	0m2.300s
nomaddo@nomaddo-AS:~/VC4C$ time ./build/VC4C -Dcl_clang_storage_class_specifiers -DSIGMOID=1 -DgInPerBlock=16 -DgOutPerBlock=16 -DgNumFilters=16 -DgFilterSize=16 -DgHalfFilterSize=8 -DgFilterSizeSquared=256 -DgPadZeros=0 -DgPoolingSize=16 -DgNumPlanes=16 -DgInputPlanes=16 -DgNumInputPlanes=16 -DgMargin=8 -DgNumOutputPlanes=16 -DinputRow=0 -DoutputRow=0 -DgInputSize=32 -DgInputSizeSquared=1024 -DgOutputSize=32 -DgOutputSizeSquared=1024 -DgNumStripes=2 -DgInputStripeOuterSize=2 -DgInputStripeInnerSize=2 -DgInputStripeMarginSize=0 -DgOutputStripeSize=16 -DgOutputStripeNumRows=12 -DgWorkgroupSize=12  -DgEven=0 -DgPixelsPerThread=128 --quiet -o /tmp/hoge testing/deepCL/backpropweights.cl

real	0m3.875s
user	0m4.264s
sys	0m2.432s

@nomaddo
Collaborator Author

nomaddo commented Apr 3, 2018

Comparison between --Xthreshold=6 (same as the default value) and --Xthreshold=1000 (the "opt" columns).

filename                              line-num  line-num (opt)  time     time (opt)
BackpropWeightsScratch.cl             1767      1703            4.38s    4.46s
BackpropWeightsScratchLarge.cl        2550      2344            6.908s   8.128s
PoolingBackwardGpuNaive.cl            ---       failed          ---      ---
SGD.cl                                80        80              3.516s   3.568s
activate.cl                           340       338             3.552s   3.676s
addscalar.cl                          57        57              3.440s   3.576s
applyActivationDeriv.cl               127       127             3.520s   3.464s
backpropweights.cl                    2301      1939            4.316s   7.672s
backpropweights_byrow.cl              966       954             3.968s   4.040s
backward.cl                           2288      1925            4.312s   6.144s
backward_cached.cl                    1752      1688            4.364s   4.440s
copy.cl                               224       224             3.552s   3.552s
dropout.cl                            161       161             3.492s   3.500s
forward1.cl                           3024      2627            5.656s   8.572s
forward_byinputplane.cl               3031      2844            5.512s   8.016s
forward_fc.cl                         1043      930             3.876s   3.912s
forwardfc_workgroupperfilterplane.cl  13        13              3.536s   3.516s
inv.cl                                75        75              3.492s   3.456s
per_element_add.cl                    ---       failed          ---      ---
per_element_mult.cl                   61        61              3.504s   3.496s
pooling.cl                            2279      1915            4.268s   6.280s
reduce_segments.cl                    124       123             3.576s   3.580s
sqrt.cl                               61        61              3.608s   3.544s
squared.cl                            56        56              3.464s   3.560s

@doe300
Owner

doe300 commented Apr 3, 2018

Thanks. That is a difference I think we can well live with.
From the table it looks like the duration of the optimization is directly linked to the number of instructions saved. So there should be no case where the optimization runs long without any reduced execution time.
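
To put numbers on that observation, the following Python sketch recomputes the lines saved and the extra compile time for a hand-picked subset of rows from the table above. The values are copied from the table; reading the "opt" columns as the --Xthreshold=1000 run is an assumption based on the timing output earlier in the thread.

```python
# name: (lines, lines with opt, compile time in s, compile time with opt in s)
rows = {
    "backpropweights.cl": (2301, 1939, 4.316, 7.672),
    "backward.cl":        (2288, 1925, 4.312, 6.144),
    "forward1.cl":        (3024, 2627, 5.656, 8.572),
    "pooling.cl":         (2279, 1915, 4.268, 6.280),
    "forward_fc.cl":      (1043, 930, 3.876, 3.912),
    "copy.cl":            (224, 224, 3.552, 3.552),
    "SGD.cl":             (80, 80, 3.516, 3.568),
}

def summarize(table):
    """Map each file to (lines saved, extra compile time in seconds)."""
    return {
        name: (lines - lines_opt, round(t_opt - t, 3))
        for name, (lines, lines_opt, t, t_opt) in table.items()
    }

for name, (saved, extra) in summarize(rows).items():
    print(f"{name:20} saved {saved:4d} lines, {extra:+.3f} s compile time")
```

In this subset, the files that save no lines also show almost no change in compile time, while the files with the largest savings take the longest extra time.
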

@nomaddo
Collaborator Author

nomaddo commented Apr 3, 2018

I think we can change the default value (from 6 to 100, for example?).
To optimize further, I am wondering how best to improve it:

  • Tune by specifying the parameter (i.e. optimizing by hand).
  • Implement a smarter feature (automatic adjustment based on an estimate of register pressure).

@doe300 Any suggestions?

@doe300
Owner

doe300 commented Apr 3, 2018

> 6 to 100, for example

I doubt we can do this in general, since the parameter is used in several places, some of which would fail to compile a lot sooner.
I came up with 6 by manually testing a few values until my sample code compiled correctly. If anyone has an idea how to auto-detect the "perfect" value, that would be great. Otherwise, I think we'll need to re-test a few values and set a new hint manually.
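
The manual re-testing described here can be sketched as a brute-force search. This is only an illustration: `compiles` is a hypothetical callback (for example, one that runs the compiler over the sample kernels with a given threshold and reports success), and nothing below detects a "perfect" value automatically.

```python
from typing import Callable, Iterable, Optional

def largest_working_threshold(
    candidates: Iterable[int],
    compiles: Callable[[int], bool],
) -> Optional[int]:
    """Return the largest candidate for which `compiles` succeeds, or None.

    Every candidate is tried from largest to smallest, so the search does not
    assume that success is monotonic in the threshold.
    """
    for threshold in sorted(candidates, reverse=True):
        if compiles(threshold):
            return threshold
    return None

# Stand-in predicate for demonstration only: pretend anything above 100 fails.
fake_compiles = lambda threshold: threshold <= 100
print(largest_working_threshold([6, 50, 100, 500, 1000], fake_compiles))  # 100
```
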

@nomaddo
Collaborator Author

nomaddo commented Apr 3, 2018

Thanks. For now, it seems better to just add an option that lets users control the threshold.

@doe300 doe300 added the optimization related to an optimization step label Apr 7, 2018
@doe300
Owner

doe300 commented Apr 7, 2018

@nomaddo , is this resolved?

@nomaddo
Collaborator Author

nomaddo commented Apr 9, 2018

Not yet. As this issue has a huge impact on performance, I will look for a better way to improve it.

@nomaddo
Collaborator Author

nomaddo commented Apr 11, 2018

I think this can be done with live-range analysis. After fusing the ldi instructions, if the number of registers in use exceeds the number of available registers, we "forget" the variables marked as ldi-loaded by re-inserting an ldi before their next use.

Example

To keep the example simple, we assume that only 5 registers are available.

example code:

ldi a, 1000
iadd a, a, b
iadd c, a, d
imul24 e, f, c
iadd g, a, c
...
... 
ldi a, 1000
iadd a, a, e
...
...  instructions using b, c, d, e
  1. Rename variables so that they become read-only where possible

Now a is read-only after its first assignment:

ldi a, 1000
iadd a2, a, b
iadd c, a2, d
imul24 e, f, c
iadd g, a2, c
... 
... // a replaced by a2
...  instructions using b, c, d, e
ldi a, 1000
iadd a2, a, e
...
... // a replaced by a2
...  instructions using b, c, d, e, f 
  2. Fuse all possible ldi
ldi a, 1000
iadd a2, a, b
iadd c, a2, d
imul24 e, f, c
iadd g, a2, c
... 
... 
// ldi a, 1000
iadd a2, a, e
...
...  instructions using b, c, d, e
  3. Run live-range analysis to count register usage, labelling the variables defined by ldi (a hint for the next step about which variables can be forgotten).
ldi a, 1000                 // b, d, f
iadd a2, a, b               // a(ldi), b, d, f
iadd c, a2, d               // a(ldi), a2, b, d, f
imul24 e, f, c              // a(ldi), b, c, d, f
iadd g, a2, c               // a(ldi), a2, b, c, d, e       !!! exceeds the number of registers !!!
... 
... 
iadd a2, a, e               // a(ldi), b, c, d, e
...
...  instructions using b, c, d, e
  4. If register usage exceeds the maximum number of registers, insert an ldi to shorten the live range of the variable.
ldi a, 1000                 // b, d, f
iadd a2, a, b               // b, d, f
iadd c, a2, d               // a2, b, d, f
imul24 e, f, c              // b, c, d, f
iadd g, a2, c               // a2, b, c, d, e       !!! no longer exceeds the number of registers !!!
... 
... 
ldi a, 1000
iadd a2, a, e               // a(ldi), b, c, d, e
...
...  instructions using b, c, d, e
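
Steps 2 to 4 can be sketched in Python for a single straight-line block. This is a rough illustration under stated assumptions, not VC4C code: instructions are modelled as (opcode, destination, operands) tuples, step 1 is assumed to have been done already (every local is written once), ldi-defined locals are assumed not to be live out of the block, and register pressure is simply the number of locals live before each instruction. All function names and the `_r<index>` naming scheme are made up for the example.

```python
from typing import List, Optional, Sequence, Set, Tuple

Instr = Tuple[str, str, tuple]  # (opcode, destination, operands)

def live_before(block: Sequence[Instr], live_out: Sequence[str]) -> List[Set[str]]:
    """Backward liveness: result[i] holds the locals live just before block[i]."""
    live, result = set(live_out), []
    for op, dest, srcs in reversed(block):
        live.discard(dest)
        live.update(s for s in srcs if isinstance(s, str))
        result.append(set(live))
    result.reverse()
    return result

def fuse_literal_loads(block: Sequence[Instr]) -> List[Instr]:
    """Step 2: keep the first ldi of each literal and rename readers of repeats."""
    first, renamed, out = {}, {}, []
    for op, dest, srcs in block:
        srcs = tuple(renamed.get(s, s) if isinstance(s, str) else s for s in srcs)
        if op == "ldi":
            if srcs[0] in first:
                renamed[dest] = first[srcs[0]]
                continue
            first[srcs[0]] = dest
        out.append((op, dest, srcs))
    return out

def split_one(block: List[Instr], limit: int, live_out) -> Optional[List[Instr]]:
    """Steps 3 and 4: find one ldi local that is live across an over-limit point
    between two of its uses and re-load it just before the later use."""
    live = live_before(block, live_out)
    for d, (op, var, literal) in enumerate(block):
        if op != "ldi":
            continue
        prev = d
        for u in (i for i in range(d + 1, len(block)) if var in block[i][2]):
            if any(len(live[p]) > limit for p in range(prev + 1, u)):
                fresh = f"{var}_r{u}"  # hypothetical name for the reloaded copy
                tail = [(o, dst, tuple(fresh if s == var else s for s in ss))
                        for o, dst, ss in block[u:]]
                return block[:u] + [("ldi", fresh, literal)] + tail
            prev = u
    return None

def limit_register_pressure(block, limit: int, live_out=()) -> List[Instr]:
    block = fuse_literal_loads(block)
    while True:  # liveness is recomputed after every split
        split = split_one(block, limit, live_out)
        if split is None:
            return block
        block = split

# The example from this comment; the two loads are named k0 and k1.
demo = [
    ("ldi", "k0", (1000,)), ("iadd", "a2", ("k0", "b")), ("iadd", "c", ("a2", "d")),
    ("imul24", "e", ("f", "c")), ("iadd", "g", ("a2", "c")),
    ("ldi", "k1", (1000,)), ("iadd", "h", ("k1", "e")),
]
later_uses = ("b", "c", "d", "e")
print(len(limit_register_pressure(demo, 6, later_uses)))  # 6: loads stay fused
print(len(limit_register_pressure(demo, 5, later_uses)))  # 7: ldi re-inserted
```

With a limit of 5, the fused block has six live locals before `iadd g` (the loaded constant, a2, b, c, d, e), so the constant is reloaded before its last use and the maximum drops back to 5. A split cannot help when the over-limit point is directly at a use of the constant, and the sketch leaves such cases untouched.
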

@doe300
Owner

doe300 commented Apr 11, 2018

This is a great idea.
One note though: Since we have 2 register-banks, we should set the limit for this pass not to the actual number of registers, but to some smaller value to prevent more register-association failures.
