
GPU Acceleration


Current State

PhotonVision currently supports GPU acceleration on one niche but popular platform - Pi3 (B+) / CM3(+) / Pi Zero 2 W, with Pi Cameras (V1/V2).
The acceleration handles capturing the image via MMAL, thresholding it with an OpenGL shader, and then preparing the result for its trip back to the CPU. (More detail/links/references needed)
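For a sense of what the thresholding stage looks like, the shader amounts to a per-pixel HSV in-range test. The following is a minimal illustrative sketch only, not PhotonVision's actual shader; the uniform names and the RGB-to-HSV helper are assumptions made for the example, and writing the result into the alpha channel mirrors the RGB + A split described later on this page.

```cpp
// Illustrative only: a minimal HSV in-range fragment shader, stored as a C++
// string constant. Uniform names and the rgb-to-hsv helper are assumptions.
constexpr const char* kHsvThresholdFrag = R"(
#version 100
precision mediump float;
varying vec2 texcoord;
uniform sampler2D tex;
uniform vec3 lowerHsv;   // lower HSV bound, each channel in [0, 1]
uniform vec3 upperHsv;   // upper HSV bound, each channel in [0, 1]

vec3 rgb2hsv(vec3 c) {
    vec4 K = vec4(0.0, -1.0 / 3.0, 2.0 / 3.0, -1.0);
    vec4 p = mix(vec4(c.bg, K.wz), vec4(c.gb, K.xy), step(c.b, c.g));
    vec4 q = mix(vec4(p.xyw, c.r), vec4(c.r, p.yzx), step(p.x, c.r));
    float d = q.x - min(q.w, q.y);
    float e = 1.0e-10;
    return vec3(abs(q.z + (q.w - q.y) / (6.0 * d + e)), d / (q.x + e), q.x);
}

void main() {
    vec3 rgb = texture2D(tex, texcoord).rgb;
    vec3 hsv = rgb2hsv(rgb);
    bool inRange = all(greaterThanEqual(hsv, lowerHsv)) &&
                   all(lessThanEqual(hsv, upperHsv));
    // Pass the colour through and write the threshold result into alpha.
    gl_FragColor = vec4(rgb, inRange ? 1.0 : 0.0);
}
)";
```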

This only works on Pi3-based platforms, as the OpenGL drivers on the Pi4 do not support the features needed for the zero-copy transfer back to the CPU.

Pi3 Details

Declan pls fill in here

Summer 2023 Goal

To make more platforms as performant as the Pi3+PiCam. The current targets are Pi4 (via libcamera) and Nvidia Jetson (via v4l + CUDA shared memory).

Pi 3 / Pi 4 Libcamera

Since Bullseye, the Pi camera stack has transitioned from the legacy MMAL stack to the more modern, open-source libcamera stack, and the GPU acceleration code is being transitioned accordingly. The new libcamera-based GPU acceleration captures images with libcamera, uses zero-copy DMAbufs to pass the data to an OpenGL shader that does the thresholding, and then performs a zero-copy transfer to bring the result back to the CPU.

More technically, raw buffers flow in a loop between the libcamera CameraGrabber and the OpenGL GlHsvThresholder, with an OpenCV Mat being formed from a copy of the data after it reaches the CPU but before the buffer is returned.

```
+---------------+  ------>  +-------------+
|   libcamera   |           |   OpenGL    |
| CameraGrabber |           | Thresholder |
+---------------+           +-------------+
        ^                          |
        |                          |
        |                          v
        |          +------------------------+
        +--------- | PhotonVision CPU Code  |
                   +------------------------+
```

These transfers are backed by concurrent queues, allowing all 3 components to run in their own threads and process data at their own rate. This could cause problems if one of the components backs up, but in practice that should not be an issue.
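As a rough sketch of that hand-off (the type below is an assumption for illustration, not the actual PhotonVision class), each link in the loop can be thought of as a mutex-protected blocking queue that one thread pushes into and the next thread waits on:

```cpp
#include <condition_variable>
#include <mutex>
#include <queue>

// Minimal blocking queue, standing in for the concurrent queues that connect
// the camera, thresholder, and CPU threads. Not the actual PhotonVision type.
template <typename T>
class ConcurrentBlockingQueue {
 public:
  void push(T item) {
    {
      std::lock_guard<std::mutex> lock(m_mutex);
      m_queue.push(std::move(item));
    }
    m_cond.notify_one();
  }

  // Blocks until an item is available, then removes and returns it.
  T pop() {
    std::unique_lock<std::mutex> lock(m_mutex);
    m_cond.wait(lock, [this] { return !m_queue.empty(); });
    T item = std::move(m_queue.front());
    m_queue.pop();
    return item;
  }

 private:
  std::mutex m_mutex;
  std::condition_variable m_cond;
  std::queue<T> m_queue;
};
```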

Now for a bit more detail on all 3 of the components:

libcamera uses completion-callback semantics, so data is pushed directly into the concurrent queue to keep libcamera from spending too much time in the callback. libcamera also handles all of the camera's settings and shutdown, but several mutexes are required to ensure an orderly shutdown. (Shutdown across the stack is still relatively untested.)
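In sketch form, the grabber's completion callback does little more than hand the finished request off to the next stage. The class layout and member names below are assumptions (reusing the blocking-queue sketch above); requestCompleted and its connect call are the real libcamera signal API.

```cpp
#include <memory>

#include <libcamera/libcamera.h>

// Sketch of the callback side of the grabber; not the actual PhotonVision
// CameraGrabber, but it shows the shape of the libcamera integration.
class CameraGrabber {
 public:
  explicit CameraGrabber(std::shared_ptr<libcamera::Camera> camera)
      : m_camera(std::move(camera)) {
    // Register the request-completed callback with libcamera.
    m_camera->requestCompleted.connect(this, &CameraGrabber::OnRequestComplete);
  }

 private:
  void OnRequestComplete(libcamera::Request* request) {
    // Do as little work as possible here: push the completed request into the
    // concurrent queue for the thresholder thread and return immediately, so
    // libcamera's internal thread is never blocked.
    m_completedRequests.push(request);
  }

  std::shared_ptr<libcamera::Camera> m_camera;
  ConcurrentBlockingQueue<libcamera::Request*> m_completedRequests;
};
```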

The OpenGL thresholder is a near-direct port of the legacy implementation, but uses the EGL_EXT_image_dma_buf_import and EGL_EXT_image_dma_buf_import_modifiers EGL extensions to reinterpret the raw libcamera buffers as an EGLImageKHR, using the zero-copy DMAbuf infrastructure.
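A rough sketch of that import for a single-plane RGBA buffer is shown below. The function name, the hard-coded DRM format, and the omission of error handling are assumptions made for illustration; real code queries the format, stride, and (via the _modifiers extension) the DRM modifier from libcamera. The FourCC constant comes from the libdrm headers.

```cpp
#include <EGL/egl.h>
#include <EGL/eglext.h>
#include <GLES2/gl2.h>
#include <GLES2/gl2ext.h>

#include <drm_fourcc.h>  // from libdrm

// Sketch: wrap a single-plane dmabuf (e.g. an RGBA buffer from libcamera) in
// an EGLImage and bind it to a GL texture without copying the pixel data.
GLuint ImportDmaBufAsTexture(EGLDisplay display, int dmaBufFd, int width,
                             int height, int strideBytes) {
  // Attribute list from EGL_EXT_image_dma_buf_import, describing plane 0.
  const EGLAttrib attribs[] = {
      EGL_WIDTH,                     width,
      EGL_HEIGHT,                    height,
      EGL_LINUX_DRM_FOURCC_EXT,      DRM_FORMAT_ABGR8888,
      EGL_DMA_BUF_PLANE0_FD_EXT,     dmaBufFd,
      EGL_DMA_BUF_PLANE0_OFFSET_EXT, 0,
      EGL_DMA_BUF_PLANE0_PITCH_EXT,  strideBytes,
      EGL_NONE,
  };

  // Wrap the dmabuf in an EGLImage (eglCreateImageKHR on pre-EGL-1.5 stacks).
  EGLImage image = eglCreateImage(display, EGL_NO_CONTEXT,
                                  EGL_LINUX_DMA_BUF_EXT, nullptr, attribs);

  // Bind the image to a GL texture via GL_OES_EGL_image; the entry point is
  // an extension function, so it is looked up at runtime.
  auto glEGLImageTargetTexture2DOES =
      reinterpret_cast<PFNGLEGLIMAGETARGETTEXTURE2DOESPROC>(
          eglGetProcAddress("glEGLImageTargetTexture2DOES"));

  GLuint texture = 0;
  glGenTextures(1, &texture);
  glBindTexture(GL_TEXTURE_2D, texture);
  glEGLImageTargetTexture2DOES(GL_TEXTURE_2D, image);
  return texture;
}
```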

The CPU code simply memcpys the mmapped DMAbufs into a newly allocated buffer, splitting the RGBA image into RGB + A components, where A holds the result of the thresholding. This is also copied wholesale from the previous implementation, and relies on autovectorization to meet frame times. After the memcpy completes, the buffers are returned to libcamera via the thread-safe requeueBuffer method.
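A simplified version of that split is shown below. The function name, the cv::Mat-based interface, and the assumption of tightly packed rows (no stride handling) are illustrative only, but it captures the per-pixel de-interleave that the autovectorizer has to handle.

```cpp
#include <cstdint>

#include <opencv2/core.hpp>

// Sketch: split an interleaved RGBA frame (colour in RGB, threshold result in
// A) into a 3-channel colour Mat and a 1-channel threshold Mat. Written as a
// plain scalar loop so the compiler's autovectorizer can do the heavy lifting.
void SplitRgbaFrame(const uint8_t* rgba, int width, int height,
                    cv::Mat& colorOut, cv::Mat& thresholdOut) {
  colorOut.create(height, width, CV_8UC3);
  thresholdOut.create(height, width, CV_8UC1);

  const int pixels = width * height;
  uint8_t* color = colorOut.data;
  uint8_t* threshold = thresholdOut.data;

  for (int i = 0; i < pixels; i++) {
    color[3 * i + 0] = rgba[4 * i + 0];  // R
    color[3 * i + 1] = rgba[4 * i + 1];  // G
    color[3 * i + 2] = rgba[4 * i + 2];  // B
    threshold[i] = rgba[4 * i + 3];      // A = threshold result
  }
}
```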

Jetson (CUDA)

@bankst
