Enhance combineLoadingLiterals
#67
Comments
Currently, I think the following scheme is better
|
Tested in https://github.com/nomaddo/VC4C/tree/addOption. With a larger threshold, register allocation fails because there are not enough registers. |
That's impressive. Do you have any numbers on compilation time slow-down? |
It takes longer, but I don't think it's a problem...
nomaddo@nomaddo-AS:~/VC4C$ time ./build/VC4C -Dcl_clang_storage_class_specifiers -DSIGMOID=1 -DgInPerBlock=16 -DgOutPerBlock=16 -DgNumFilters=16 -DgFilterSize=16 -DgHalfFilterSize=8 -DgFilterSizeSquared=256 -DgPadZeros=0 -DgPoolingSize=16 -DgNumPlanes=16 -DgInputPlanes=16 -DgNumInputPlanes=16 -DgMargin=8 -DgNumOutputPlanes=16 -DinputRow=0 -DoutputRow=0 -DgInputSize=32 -DgInputSizeSquared=1024 -DgOutputSize=32 -DgOutputSizeSquared=1024 -DgNumStripes=2 -DgInputStripeOuterSize=2 -DgInputStripeInnerSize=2 -DgInputStripeMarginSize=0 -DgOutputStripeSize=16 -DgOutputStripeNumRows=12 -DgWorkgroupSize=12 -DgEven=0 -DgPixelsPerThread=128 --Xthreshold=1000 --quiet -o /tmp/hoge testing/deepCL/backpropweights.cl
threshold=1000
real 0m5.961s
user 0m6.480s
sys 0m2.300s
nomaddo@nomaddo-AS:~/VC4C$ time ./build/VC4C -Dcl_clang_storage_class_specifiers -DSIGMOID=1 -DgInPerBlock=16 -DgOutPerBlock=16 -DgNumFilters=16 -DgFilterSize=16 -DgHalfFilterSize=8 -DgFilterSizeSquared=256 -DgPadZeros=0 -DgPoolingSize=16 -DgNumPlanes=16 -DgInputPlanes=16 -DgNumInputPlanes=16 -DgMargin=8 -DgNumOutputPlanes=16 -DinputRow=0 -DoutputRow=0 -DgInputSize=32 -DgInputSizeSquared=1024 -DgOutputSize=32 -DgOutputSizeSquared=1024 -DgNumStripes=2 -DgInputStripeOuterSize=2 -DgInputStripeInnerSize=2 -DgInputStripeMarginSize=0 -DgOutputStripeSize=16 -DgOutputStripeNumRows=12 -DgWorkgroupSize=12 -DgEven=0 -DgPixelsPerThread=128 --quiet -o /tmp/hoge testing/deepCL/backpropweights.cl
real 0m3.875s
user 0m4.264s
sys 0m2.432s |
Comparison between
|
Thanks. That is a difference I think we can well live with. |
I think we can change the default value (from 6 to 100, for example?).
@doe300 Any suggestion? |
I doubt we can do this in general, since the parameter is used in several places, and some of them would fail to compile a lot sooner. |
Thanks. For now, it seems better to only add an option that lets users control the threshold. |
@nomaddo , is this resolved? |
Not yet. Since this issue has a huge impact on performance, I will improve it in a better way. |
I think this can be done with live-range analysis. After fusion of
Example
In this example, we assume we have only 5 registers to keep things simple. Example code:
ldi a, 1000
iadd a, a, b
iadd c, a, d
imul24 e, f, c
iadd g, a, c
...
...
ldi a, 1000
iadd a, a, e
...
... instructions using b, c, d, e
Now,
ldi a, 1000
iadd a2, a, b
iadd c, a2, d
imul24 e, f, c
iadd g, a2, c
...
... // replace a with a2
... instructions using b, c, d, e
ldi a, 1000
iadd a2, a, e
...
... // replace a with a2
... instructions using b, c, d, e, f
ldi a, 1000
iadd a2, a, b
iadd c, a2, d
imul24 e, f, c
iadd g, a2, c
...
...
// ldi a, 1000
iadd a2, a, e
...
... instructions using b, c, d, e
ldi a, 1000 // b, d, f
iadd a2, a, b // a(ldi), b, d, f
iadd c, a2, d // a(ldi), a2, b, d, f
imul24 e, f, c // a(ldi), b, c, d, f
iadd g, a2, c // a(ldi), a2, b, c, d, e !!! exceeds the number of registers !!!
...
...
iadd a2, a, e // a(ldi), b, c, d, e
...
... instructions using b, c, d, e
ldi a, 1000 // b, d, f
iadd a2, a, b // b, d, f
iadd c, a2, d // a2, b, d, f
imul24 e, f, c // b, c, d, f
iadd g, a2, c // a2, b, c, d, e !!! does not exceed the number of registers !!!
...
...
ldi a, 1000
iadd a2, a, e // a(ldi), b, c, d, e
...
... instructions using b, c, d, e |
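To make the live-range argument concrete, here is a minimal Python sketch that measures register pressure for a straight-line instruction list before and after fusing two loads of the same constant. It is not VC4C code: the tuple-based instruction format, the function name `max_pressure`, and the register names `k1`, `k2`, `t1` are all assumptions made for illustration, and the sample program is a simplified variant of the example above with every local written only once.

```python
def max_pressure(instrs, live_out=()):
    """Largest number of registers live at any point of a straight-line block.

    instrs is a list of (opcode, dest, *sources); integer sources are
    immediates and do not occupy a register. (Hypothetical format.)
    """
    live = set(live_out)
    peak = len(live)
    # Walk backwards: a register is live from its definition to its last use.
    for _op, dest, *srcs in reversed(instrs):
        live.discard(dest)                                  # defined here
        live.update(s for s in srcs if isinstance(s, str))  # needed here
        peak = max(peak, len(live))
    return peak


# Two separate loads of 1000: each constant register has a short live range.
unfused = [
    ("ldi", "k1", 1000),
    ("iadd", "t1", "k1", "b"),
    ("iadd", "c", "t1", "d"),
    ("imul24", "e", "f", "c"),
    ("iadd", "g", "t1", "c"),
    ("ldi", "k2", 1000),
    ("iadd", "h", "k2", "e"),
    ("iadd", "i", "b", "c"),
    ("iadd", "j", "d", "e"),
]

# Fused: the second load is dropped and its use reads k1 instead, so k1
# stays live across the whole middle of the block.
fused = [ins for ins in unfused if ins[1] != "k2"]
fused = [tuple("k1" if x == "k2" else x for x in ins) for ins in fused]

NUM_REGISTERS = 5  # same simplification as in the example above
print(max_pressure(unfused), max_pressure(fused))  # 5 6
print("fusion is safe:", max_pressure(fused) <= NUM_REGISTERS)
```

Under these assumptions the unfused block fits into 5 registers while the fused one needs 6, so a check like this could reject that particular fusion and still allow others where the pressure stays within the limit.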
This is a great idea. |
The output of deepCL/backpropweights.cl can be improved by reducing the number of ldi instructions. For example, the constant 256 is loaded 69 times. If we can combine these loads into one instruction, a significant speedup can be achieved.
https://gist.github.com/nomaddo/220b867143eff68f2c0f83f9188ab382
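The basic idea of combining repeated constant loads within a distance threshold can be sketched as follows. This is a simplified Python illustration, not the actual combineLoadingLiterals implementation in VC4C: the tuple-based instruction format, the function name `combine_literal_loads`, and the register names are assumptions, and the sketch assumes each register is written only once.

```python
def combine_literal_loads(instrs, threshold):
    """Drop an ldi when the same constant was loaded at most `threshold`
    instructions earlier, and redirect its uses to the earlier register.

    instrs is a list of (opcode, dest, *sources). (Hypothetical format.)
    """
    kept = {}    # constant -> (register holding it, index of that load)
    rename = {}  # register of a dropped load -> register that replaces it
    out = []
    for idx, (op, dest, *srcs) in enumerate(instrs):
        if op == "ldi":
            const = srcs[0]
            if const in kept and idx - kept[const][1] <= threshold:
                rename[dest] = kept[const][0]  # reuse the earlier load
                continue
            kept[const] = (dest, idx)
            out.append((op, dest, const))
        else:
            # Rewrite sources that referred to a dropped load.
            out.append((op, dest, *[rename.get(s, s) for s in srcs]))
    return out


program = [
    ("ldi", "c0", 256),
    ("iadd", "x", "a", "c0"),
    ("ldi", "c1", 256),
    ("iadd", "y", "b", "c1"),
    ("ldi", "c2", 256),
    ("iadd", "z", "x", "c2"),
]

for threshold in (1, 2, 6):
    result = combine_literal_loads(program, threshold)
    loads = sum(1 for ins in result if ins[0] == "ldi")
    print(f"threshold={threshold}: {loads} ldi remaining")
```

A larger threshold removes more loads, but it also keeps the constant's register occupied over a longer stretch of code, which is the register-pressure trade-off discussed in this thread.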
As a smaller example, the following code can be improved: