Runtime Scheduling and Memory Management of Arbitrarily Large Datasets With the Ability to Distribute Across Many GPUs
Computation speed is often limited by data I/O. To mitigate this effect, data should be located as close to the processor as possible, ideally in registers \cite{Hennessy2011}. However, GPUs have limited memory compared to that available to the CPU. State-of-the-art microscopes can easily produce datasets one to two orders of magnitude larger than GPU memory (VRAM). HIP hides this limitation from the end user while simultaneously making it easy for programmers to write new image processing algorithms. Keeping data close to the GPU is accomplished through a technique called "chunking."
Chunking is the method that partitions data based on the input data dimensions, kernel size, and GPU resources available at runtime.
Because resources are queried at runtime, HIP is able to run on a diverse set of hardware, spanning from laptops with a discrete GPU to servers with many GPUs.
The optimal chunk size must balance memory transfer speed against redundant work while ensuring access to the data needed for a given operation.
As shown in the top panel of Figure \ref{fig:hip_part}, naïvely chunking across only the $Z$ dimension reprocesses much of the volume.
The memory available for processing is more limited than one would expect.
To process data on the device, there must be at least two buffers, one for input and one for output, reducing the "usable" size to half of the available memory.
For complex operations, it may be necessary to have intermediate buffers.
The more buffers needed, the smaller the "usable" size will be.
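The effect of the buffer count on usable memory can be made concrete with a small helper. This is a hypothetical illustration of the argument above (`usable_chunk_bytes` is not part of HIP's API):

```python
def usable_chunk_bytes(free_vram_bytes, num_buffers):
    """Bytes available to a single chunk when the device must hold
    num_buffers copies (input, output, intermediates) at once.

    Hypothetical helper illustrating the buffer-count argument;
    not HIP's actual implementation.
    """
    if num_buffers < 2:
        raise ValueError("at least an input and an output buffer are needed")
    return free_vram_bytes // num_buffers

# With 8 GiB free and only an input and output buffer, a chunk may use
# at most half the device memory; a third intermediate buffer cuts the
# usable size to a third, and so on.
print(usable_chunk_bytes(8 * 1024**3, 2))  # 4294967296
print(usable_chunk_bytes(8 * 1024**3, 3))  # 2863311530
```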
Currently, chunking always occurs across the fourth dimension first.
Variables ending in $X$, $Y$, or $Z$ below refer to that dimension of the corresponding image, kernel, overlap, or buffer.
\State $vol \gets imageX \times imageY \times imageZ$
\State $overlapX \gets kernelX \times imageY \times imageZ$
\State $overlapY \gets imageX \times kernelY \times imageZ$
\State $overlapZ \gets imageX \times imageY \times kernelZ$
\If {$overlapX > overlapY \wedge overlapX > overlapZ$}\\
\Comment{Overlapping in the X dimension would produce the biggest overlapping volume}
\State $bufferX \gets imageX$
\State $leftoverVol \gets vol / imageX$
\State $sqrDim \gets \sqrt{leftoverVol}$
\If {$overlapY > overlapZ$}
\If {$sqrDim > overlapY$}
\State $bufferY \gets sqrDim$
\Else
\State $bufferY \gets imageY$
\EndIf
\State $bufferZ \gets leftoverVol / bufferY$
\Else
\If {$sqrDim > overlapZ$}
\State $bufferZ \gets sqrDim$
\Else
\State $bufferZ \gets imageZ$
\EndIf
\State $bufferY \gets leftoverVol / bufferZ$
\EndIf
\ElsIf {$overlapY > overlapZ$}\\
\Comment{Similar logic to above starting from the Y dimension instead of X}
\Else\\
\Comment{Similar logic to above starting from the Z dimension instead of X or Y}
\EndIf
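The pseudocode above can be sketched in Python. Because the three branches repeat the same logic starting from a different dimension, they collapse into one by sorting the dimensions by overlap volume. This is a hypothetical transcription for illustration, not HIP's actual implementation:

```python
import math

def buffer_dims(image, kernel):
    """Choose chunk (buffer) dimensions following the pseudocode above.

    image and kernel are (X, Y, Z) extents. The dimension whose split
    would create the largest overlap volume is kept whole; the remaining
    two are balanced toward a square cross-section. A hypothetical
    transcription of the published pseudocode, not HIP's implementation.
    """
    vol = image[0] * image[1] * image[2]
    # Overlap volume produced by splitting along each dimension,
    # e.g. overlaps[0] = kernelX * imageY * imageZ.
    overlaps = [kernel[d] * vol // image[d] for d in range(3)]
    keep = max(range(3), key=lambda d: overlaps[d])
    buf = [0, 0, 0]
    buf[keep] = image[keep]        # keep the costliest dimension whole
    leftover = vol // image[keep]  # leftoverVol in the pseudocode
    sqr_dim = math.sqrt(leftover)
    # Remaining dimensions, larger overlap first (mirrors the nested \If).
    big, small = sorted((d for d in range(3) if d != keep),
                        key=lambda d: overlaps[d], reverse=True)
    # As in the pseudocode, sqr_dim is compared against the overlap volume.
    buf[big] = sqr_dim if sqr_dim > overlaps[big] else image[big]
    buf[small] = leftover / buf[big]
    return tuple(buf)

# An anisotropic stack with a 5x5x3 kernel: splitting Z would be most
# wasteful, so Z is kept whole.
print(buffer_dims((512, 512, 64), (5, 5, 3)))  # (512, 512.0, 64)
```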
Colored boxes are distinct memory partitions to be processed independently. White areas are processed only once. Gray areas are processed more than once; darker gray areas are processed more times than lighter gray areas. Panel A shows a partitioning that preserves the continuity of the $(X,Y)$ plane by partitioning only across $Z$. This scheme will reprocess much of the volume. The partitioning scheme in panel B balances memory continuity while reducing the amount of redundant data in overlapping sections. The large overlapping sections are processed at most twice. Some sections will be processed up to eight times; however, these are small cubes in the most interior sections and should not be as numerous as the other overlapping sections.
After the buffer dimensionality is calculated, the Image Chunk class partitions the data into explicit chunks.
A chunk contains the start and end coordinates of its section within the full image.
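The start/end bookkeeping can be illustrated in one dimension. The helper below is a hypothetical sketch of how overlapping chunks might be laid out so every output sample sees its full kernel support; it is not the actual Image Chunk class:

```python
def make_chunks(image_len, buffer_len, kernel_len):
    """Split one dimension of length image_len into chunks of at most
    buffer_len samples whose interiors tile the image, each extended by
    the kernel radius so every output sample has full support.

    Returns a list of (start, end) pairs with end exclusive.
    Hypothetical illustration, not HIP's Image Chunk class.
    """
    radius = kernel_len // 2
    step = buffer_len - 2 * radius  # interior samples produced per chunk
    assert step > 0, "buffer too small for this kernel"
    chunks = []
    pos = 0
    while pos < image_len:
        start = max(0, pos - radius)
        end = min(image_len, pos + step + radius)
        chunks.append((start, end))
        pos += step
    return chunks

# 100 samples, 40-sample buffers, 5-wide kernel: neighboring chunks
# overlap by twice the kernel radius.
print(make_chunks(100, 40, 5))  # [(0, 38), (34, 74), (70, 100)]
```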
There are three common ways that image processing software deals with image boundaries: zero padding, last padding, and mirroring. Zero padding adds zeros to the border of the image, as in the zero-padding figure. Last padding repeats the outermost value of the image outward, as in the last-padding figure. Mirroring reflects the values at the boundary outward, as in the mirror-padding figure. HIP instead uses the size of the available support to re-normalize the operation. Many operations are trivial to re-normalize, such as the mean, median, minimum, and maximum filters. The Gaussian filter can be re-normalized easily enough by ensuring that the Gaussian kernel (or convolution matrix) sums to one. While applying the kernel to the input values, a running sum of the kernel weights actually used is kept; after all of the available input values have been visited, the result need only be divided by this kernel accumulator. This works out the same as if a new Gaussian kernel had been calculated to be unitary over the same support. A simple version of this appears in panel D of each of the zero, copy, and mirror figures. The single-precision and integer-precision Gaussian figures show a practical example. A large Gaussian smoothing is oftentimes how background or out-of-focus signal is estimated; this exemplifies how mistreating the edge of an image can be detrimental to downstream operations. Other operations have also been re-normalized, such as the Laplacian of Gaussian (LoG) filter. However, those results have only been verified empirically; the proof is still a work in progress.
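The re-normalization idea can be sketched in one dimension. The function below divides each output by the sum of the kernel weights that actually landed on valid samples instead of padding the border; it is a hypothetical illustration of the technique, and HIP's GPU implementation differs in detail:

```python
def renormalized_filter_1d(signal, kernel):
    """Apply kernel to signal, re-normalizing at the boundaries by the
    sum of the kernel weights that fell on valid samples.

    Hypothetical sketch of the re-normalization described above, not
    HIP's actual implementation.
    """
    n, k = len(signal), len(kernel)
    r = k // 2
    out = []
    for i in range(n):
        acc = 0.0    # weighted sum over the valid support
        wsum = 0.0   # accumulator of the kernel weights actually used
        for j in range(max(0, i - r), min(n, i + r + 1)):
            w = kernel[j - i + r]
            acc += signal[j] * w
            wsum += w
        out.append(acc / wsum)
    return out

# A 5-tap Gaussian-like kernel on a constant signal stays constant
# right up to the edge, because each output is divided by exactly the
# weight that was applied.
g = [1.0, 4.0, 6.0, 4.0, 1.0]
print(renormalized_filter_1d([10.0] * 8, g))
```

By contrast, zero padding the same constant signal would darken the edge sample to $110/16 = 6.875$, since the missing neighbors are counted as zeros while the divisor stays at the full kernel sum of 16.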
A is a simulated image that has been padded with zeros for a $5\times5$ neighborhood support. B is a 1-D sample of the image represented by the green box in A. C is the same sample without the padding. The left side of D is a mean filter using B. The right side of D is a mean filter with the kernel re-normalized. E is like D, but the kernel is now a Gaussian filter. The resulting values in D and E have been rounded to the nearest integer, as would be done with integer-valued data.
A is a simulated image that has been padded with the last value for a $5\times5$ neighborhood support. B is a 1-D sample of the image represented by the green box in A. C is the same sample without the padding. The left side of D is a mean filter using B. The right side of D is a mean filter with the kernel re-normalized. E is like D, but the kernel is now a Gaussian filter. The resulting values in D and E have been rounded to the nearest integer, as would be done with integer-valued data.
A is a simulated image that has been padded with a mirror image for a $5\times5$ neighborhood support. B is a 1-D sample of the image represented by the green box in A. C is the same sample without the padding. The left side of D is a mean filter using B. The right side of D is a mean filter with the kernel re-normalized. E is like D, but the kernel is now a Gaussian filter. The resulting values in D and E have been rounded to the nearest integer, as would be done with integer-valued data.
The leftmost column contains a raw image of fluorescently marked cells in three-dimensional space. The second column from the left is the resulting image after being smoothed with a large Gaussian kernel using HIP. The Gaussian kernel is smaller in the $Z$ direction due to the anisotropic voxel dimensions of the image. The second column from the right was smoothed using a MATLAB Gaussian filter of the same size. The rightmost column is the difference between the HIP result and the MATLAB result. Positive values are in magenta, negative values are in green, and zero values are in black. All images have been normalized for display, with actual ranges listed on the sides. The two middle columns look almost identical; however, the rightmost column shows the structured difference between the two. Note that MATLAB loses energy at image edges with low intensity (first $Z$ slice) and gains energy where the intensities are higher (last $Z$ slice). Also note the difference in execution time between HIP and MATLAB.
The layout is similar to that of the single-precision figure. In these integer-valued images, one can start to see that the MATLAB version is losing the asymmetry of the underlying data. These differences can significantly change downstream analysis.