clEnqueueNDRangeKernel times out on large buffers #103
The different interfaces to access the QPUs have some "quirks": e.g. the mailbox interface does not allow kernel executions taking longer than 1 s, and the VCHIQ GPU service does not wait for the kernel to actually finish. Can you rerun your program (with and without "sudo") with the …
Without sudo:
With sudo:
Ok. So the performance difference can be explained by different memory allocation types.
I turned on VCHI and now sudo is slow too, so this is consistent with the "no sudo" case. It looks like VCHI is causing this slowness. After some number of executions, it started calculating very slowly. But the output is still valid (so it is not a timeout):
Interesting results. I never got to the actual performance comparison of the different memory allocation methods, since non-root execution does not work reliably (see the tickets I linked above). If you are interested in playing around a bit more, you could rerun the program with the … Generally, I would recommend running any VC4CL client as root (I know, not so great), allocating memory via the Mailbox and executing kernels via the V3D registers (the defaults for root), since this is by far the most-tested combination.
It gets weirder with more tests.
So it looks like:
The performance gap between sudo/no-sudo is very important here, because executing a very similar filter on the CPU (with NEON) gives "average time: 0.070378", which lies somewhere between the sudo and non-sudo results. There are a number of things I can try to optimise performance, but first of all I would like to make the program stable :) Some optimisations:
Performance-wise, having a quick look at your code, you could try some of these:
E.g. this code has only half the memory accesses and takes (without scheduling overhead) about 25% fewer cycles to execute:
This code could be twice as fast (again ignoring scheduling overhead), but it requires the input and taps buffers to be padded to 8 floats and a constant tap size of 3:
Ok. I will go with sudo-enabled access and start optimising the code. This is not related to this issue, but I switched to float8 and got a 5x performance boost:
So it looks like it is possible to beat the CPU on this task.
I have been extensively testing the timeout issue for the last several days:
Normally the computation takes ~16k us and a large number of execution cycles:
But after some time it returns earlier:
With much smaller execution cycles. However, due to the sleep(1) I can submit the same kernel again. It returns earlier, but the execution cycles (which are read from the GPU) stay relatively normal.
So it looks like the GPU computes the kernel, but for some reason the mailbox interface returns earlier.
A couple more observations:
Tried running code similar to add.py using py-videocore and got the same timeout:

```python
for x in range(100):
    start = time.time()
    drv.execute(
        n_threads=n_threads,
        program=code,
        uniforms=uniforms
    )
    elapsed_gpu = time.time() - start
    print('GPU: {:.4f} sec'.format(elapsed_gpu))
```
Tried bullseye and got a timeout after the very first execution when executing via MAILBOX.
Here is firmware version:
Isn't this also related to raspberrypi/linux#4321? Can you check how long it actually took to time out (from the start of that particular Mailbox call to the timeout error)?
Unlikely. I'm executing exactly the same code all the time and it takes ~16264us to execute. On "buster" I got a timeout after several executions, while on "bullseye" it instantly fails. I think it relates to raspberrypi/firmware#1582, or at least looks very similar. Another observation (not sure if related): the GPU firmware becomes corrupted. For example, other modules responsible for changing the frequency stop working.
I'm seeing very strange behaviour when running relatively large buffers (87968 bytes). I have a test FIR filter:
https://github.com/dernasherbrezon/clDsp/blob/main/fir_filter.cl
It takes the input buffer, multiplies it by the filter taps (constant) buffer and writes the result to the output.
The simple test ( https://github.com/dernasherbrezon/clDsp/blob/main/test/test_fir_filter.c ) works fine.
The performance test is not. When I execute the loop 10 times ( https://github.com/dernasherbrezon/clDsp/blob/main/test/perf_fir_filter.c#L47 ) it might hang on reading the data. Any subsequent execution of any program on the GPU will also hang, so only a reboot helps.
When I execute the performance loop only once, everything is fine and it produces valid results.
Another observation: running with "sudo" never hangs.
Another observation: running with "sudo" is ~10 times slower than under the normal user ("pi").
How can I troubleshoot the slowness? Can it be related to some memory constraints or some user-specific limits while working with /dev/mem?