Unfortunately, I don't know much about low-level programming and hardware (and I really don't understand OpenCL, the Pi's GPU architecture, or what the work size actually means, sorry), so my question may be a bit dumb:
Would it be possible to change the work size?
I have been looking for the source of the magic number here in the repository and found this comment
* "The work-items in a given work-group execute concurrently on the processing elements of a single compute
* unit." (page 24) Since there is no limitation, that work-groups need to be executed in parallel, we set 1
* compute unit with all 12 QPUs, allowing us to run 12 work-items in a single work-group in parallel and run
* the work-groups sequentially.
If work items can in part be executed sequentially – could I be taught to set some of the work size limits to 48 (the lcm of 12 and 16) for a small performance hit, or is that number embedded too deeply in the code and would require a lot of changes in other places? Like
I think you misunderstood the comment. Work-groups can be run sequentially, but work-items (single executions within a work-group) must be run in parallel.
The 12 for work-group size (number of work-items in a single work-group) is a hardware/implementation limitation, since we only have 12 cores.
I am currently working on a very experimental optimization to merge work-items, which would then allow for a work-group of more than 12. But whether this can be applied depends on the kernels being executed...
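To make the work-group versus work-item distinction concrete, here is a minimal host-side sketch in plain C++, with no OpenCL calls. It assumes the OpenCL 1.x rule that the global size must be evenly divisible by the work-group size, and a limit of 12 work-items per work-group as described above. The names `largestValidLocalSize`, `splitIntoWorkGroups`, and `WorkGroup` are hypothetical and exist in neither VC4CL nor BEAGLE.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not part of VC4CL or BEAGLE): returns the largest
// work-group size that is <= maxLocal and divides globalSize evenly.
std::size_t largestValidLocalSize(std::size_t globalSize, std::size_t maxLocal)
{
    assert(globalSize > 0 && maxLocal > 0);
    std::size_t start = maxLocal < globalSize ? maxLocal : globalSize;
    for (std::size_t local = start; local > 1; --local)
    {
        if (globalSize % local == 0)
            return local;
    }
    return 1; // a work-group of one work-item always divides evenly
}

// One work-group: the global id of its first work-item and its item count.
struct WorkGroup
{
    std::size_t firstGlobalId;
    std::size_t numItems;
};

// Splits a 1-D global range into equally sized work-groups. The groups may
// run one after another; the items inside each group run together.
std::vector<WorkGroup> splitIntoWorkGroups(std::size_t globalSize, std::size_t localSize)
{
    assert(localSize > 0 && globalSize % localSize == 0);
    std::vector<WorkGroup> groups;
    for (std::size_t first = 0; first < globalSize; first += localSize)
        groups.push_back({first, localSize});
    return groups;
}
```

With a limit of 12, a global size of 48 splits into four work-groups of 12 work-items each, while a global size of 16 would fall back to two work-groups of 8. Whether a given kernel tolerates a smaller work-group size depends on how it uses local memory and local ids, so this only illustrates the size arithmetic.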
Out of a toy interest, I am trying to run OpenCL and the tree likelihood computation library BEAGLE on a Pi. BEAGLE assumes that work sizes are divisible by 16, because that's handy for nucleotide substitution matrices, and it fails to run on the 12×12×12 work size limit of VC4CL on the Pi.
Unfortunately, I don't know much about low-level programming and hardware (and I really don't understand OpenCL, the Pi's GPU architecture, or what the work size actually means, sorry), so my question may be a bit dumb:
Would it be possible to change the work size?
I have been looking for the source of the magic number here in the repository and found this comment
VC4CL/src/vc4cl_config.h
Lines 140 to 143 in 842d444
If work items can in part be executed sequentially – could I be taught to set some of the work size limits to 48 (the lcm of 12 and 16) for a small performance hit, or is that number embedded too deeply in the code and would require a lot of changes in other places? Like
VC4CL/src/Kernel.cpp
Line 339 in a00572f
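For reference, the arithmetic behind the 48 can be checked with a small sketch. The function names `gcdOf`, `lcmOf`, and `passesNeeded` are made up for illustration, and the code only verifies the numbers; it says nothing about whether VC4CL can actually schedule work-items this way.

```cpp
#include <cassert>

// Greatest common divisor via Euclid's algorithm.
constexpr unsigned gcdOf(unsigned a, unsigned b)
{
    return b == 0 ? a : gcdOf(b, a % b);
}

// Least common multiple, dividing first to keep intermediate values small.
constexpr unsigned lcmOf(unsigned a, unsigned b)
{
    return a / gcdOf(a, b) * b;
}

// Number of sequential rounds needed if `cores` work-items can run at once
// (rounded up). Purely illustrative; assumes such batching were possible.
unsigned passesNeeded(unsigned workItems, unsigned cores)
{
    assert(cores > 0);
    return (workItems + cores - 1) / cores;
}

static_assert(lcmOf(12, 16) == 48, "48 is the smallest multiple of both 12 and 16");
```

So 48 is three blocks of 16 (what BEAGLE expects) and also four blocks of 12 (what the 12 QPUs provide), which is why it looked like a natural candidate.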