
Chunking

TODO: Fix citations; fix references to other parts of the website.

Runtime Scheduling and Memory Management of Arbitrarily Large Datasets With the Ability to Distribute Across Many GPUs

Computation speed is often hindered by data I/O. To mitigate this effect, data should be located as close to the processor as possible, ideally in registers \cite{Hennessy2011}. However, GPUs have limited memory compared to what is available to the CPU. State-of-the-art microscopes can easily produce datasets one to two orders of magnitude larger than GPU memory (VRAM). HIP hides this limitation from the end user while making it easy for programmers to write new image processing algorithms. Keeping data close to the GPU is accomplished through a technique called "chunking."

Chunking partitions data based on the input data dimensions, kernel size, and GPU resources available at runtime. By querying the available resources at runtime, HIP is able to run on a diverse set of hardware, spanning from laptops with a discrete GPU to servers with many GPUs. The optimal chunk size must balance memory transfer speed against redundant work while ensuring access to the data needed for a given operation. As shown in the top panel of Figure \ref{fig:hip_part}, naïvely chunking in the $Z$ dimension gives the largest contiguous memory copies (the fastest way to copy memory) at the expense of more redundant work. The bottom panel shows a more nuanced partition scheme.

The memory available for processing is more limited than one would expect. Processing data on the device requires a minimum of two buffers, one for input and one for output, reducing the "usable" size to half of the available memory. Complex operations may also need intermediate buffers; the more buffers needed, the smaller the "usable" size becomes. Currently, chunking always occurs across the fourth ($\lambda$) and fifth ($t$) dimensions (see Section \ref{sec:hip_conclusion} for future work in this area). If the spatial ($X,Y,Z$) dimensions are too big to fit in the usable memory, additional chunking is necessary. The pseudocode below summarizes the logic behind determining chunk size. Note that this technique biases against large overlaps and favors keeping the first dimension ($X$) intact.
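As a rough illustration of how the "usable" buffer size shrinks with the number of buffers, the following sketch simply divides the free device memory evenly among the buffers an operation needs. The helper name, the 8 GiB figure, and the three-buffer example are hypothetical and are not taken from HIP's sources.

```cpp
#include <cstddef>
#include <cstdio>

// Hypothetical helper (not part of HIP): divide the free device memory
// evenly among the buffers an operation needs, giving the largest number
// of voxels a single buffer may hold.
std::size_t usableVoxelsPerBuffer(std::size_t freeDeviceBytes,
                                  std::size_t numBuffers,
                                  std::size_t bytesPerVoxel)
{
    return freeDeviceBytes / (numBuffers * bytesPerVoxel);
}

int main()
{
    // Example: 8 GiB free, three buffers (input, intermediate, output),
    // single-precision voxels.
    std::size_t voxels = usableVoxelsPerBuffer(8ull << 30, 3, sizeof(float));
    std::printf("usable voxels per buffer: %zu\n", voxels);
    return 0;
}
```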

Calculating Buffer Size for Image Chunking

Variables ending in $X$, $Y$, or $Z$ should be interpreted as sizes in that dimension. For brevity, only the case that preserves the $X$ dimension has been fully described. The other two cases are exactly the same except that a different dimension is fully preserved. See the discussion in Section \ref{sec:hip_conclusion} on possible improvements.

    vol      ← imageX × imageY × imageZ
    overlapX ← kernelX × imageY × imageZ
    overlapY ← imageX × kernelY × imageZ
    overlapZ ← imageX × imageY × kernelZ

    if overlapX > overlapY and overlapX > overlapZ then
        // Overlapping in the X dimension would produce the biggest overlapping volume
        bufferX     ← imageX
        leftoverVol ← vol / imageX
        sqrDim      ← sqrt(leftoverVol)

        if overlapY > overlapZ then
            if sqrDim > overlapY then
                bufferY ← sqrDim
            else
                bufferY ← imageY
            end if
            bufferZ ← leftoverVol / imageY
        else
            if sqrDim > overlapZ then
                bufferZ ← sqrDim
            else
                bufferZ ← imageZ
            end if
            bufferY ← leftoverVol / imageZ
        end if
    else if overlapY > overlapZ then
        // Similar logic to above, starting from the Y dimension instead of X
    else
        // Similar logic to above, starting from the Z dimension instead of X or Y
    end if
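For concreteness, here is a minimal C++ sketch of the X-preserving branch of the pseudocode above. It assumes the target volume has already been capped at the usable buffer size from the previous section, and it clamps the square-ish dimension against the image extents rather than the overlap volumes; that clamping, along with the function and parameter names, is an interpretation of the pseudocode rather than HIP's exact implementation.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>

struct Dims { std::size_t x, y, z; };

// Sketch of the X-preserving branch of the chunk-sizing pseudocode.
// `usableVoxels` is the per-buffer voxel budget (assumed >= image.x).
Dims chunkDimsPreserveX(Dims image, Dims kernel, std::size_t usableVoxels)
{
    Dims buffer = image;
    if (image.x * image.y * image.z <= usableVoxels)
        return buffer; // the whole volume fits; no spatial chunking needed

    buffer.x = image.x;                                  // keep X contiguous
    std::size_t leftoverVol = usableVoxels / buffer.x;   // voxels left for Y*Z
    std::size_t sqrDim = static_cast<std::size_t>(
        std::sqrt(static_cast<double>(leftoverVol)));    // square-ish Y/Z face

    std::size_t overlapY = image.x * kernel.y * image.z; // cost of cutting in Y
    std::size_t overlapZ = image.x * image.y * kernel.z; // cost of cutting in Z

    if (overlapY > overlapZ)
    {
        buffer.y = std::max<std::size_t>(std::min(sqrDim, image.y), 1);
        buffer.z = std::min(leftoverVol / buffer.y, image.z);
    }
    else
    {
        buffer.z = std::max<std::size_t>(std::min(sqrDim, image.z), 1);
        buffer.y = std::min(leftoverVol / buffer.z, image.y);
    }
    return buffer;
}
```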

Two possible image partitioning schemes

[Figure: HIP overlap]

Colored boxes are distinct memory partitions to be processed independently. White areas are processed only once. Gray areas are processed more than once; darker gray areas are processed more times than light gray areas. Panel A shows a partitioning that preserves the continuity of the ($X,Y$) plane by partitioning only across $Z$. This scheme will reprocess much of the volume. The partitioning scheme in panel B balances memory continuity while reducing the amount of redundant data in overlapping sections. The large overlapping sections are processed at most twice. There are sections that will be processed up to eight times; however, these volumes are small cubes in the most interior sections and should not be as numerous as the other overlapping sections.

After the buffer dimensionality is calculated, the Image Chunk class partitions the data into explicit chunks. A chunk contains the start and end ($X,Y,Z,\lambda,t$) locations to be loaded into the input buffer. In addition, an output ROI is also stored; it is smaller than the input, which ensures that every output voxel has the largest support necessary for the current operation. This full-data-support partitioning scheme is another novel contribution provided by HIP. As explained in the next section (Section \ref{sec:hip_bounaries}), HIP eliminates edge artifacts by using exact methods when the data support is not available. However, when partitioning data, the full support \emph{is} available and \emph{should} be used, as is presented here. An additional benefit of this output scheme is that it is completely lock-free, meaning memory for many chunks can be written in parallel without worry of collision. See Sections \ref{sec:hip_speedup} and \ref{sec:hip_conclusion} for further discussion and the implications for speedup that this provides.
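A minimal sketch of the bookkeeping described above is shown below. The struct names, the exclusive end convention, and the dimension ordering are assumptions for illustration, not HIP's actual Image Chunk class layout.

```cpp
#include <cstddef>

// Region of interest over the five image dimensions.
struct ROI
{
    std::size_t start[5]; // X, Y, Z, lambda (channel), t (frame)
    std::size_t end[5];   // exclusive end in each dimension
};

// One chunk of work: the region copied into the device input buffer and the
// smaller region written back. Because output regions never overlap, chunks
// can be written in parallel without locks, and every output voxel has the
// full kernel support available in its input region.
struct ImageChunk
{
    ROI input;  // includes the overlap needed for kernel support
    ROI output; // shrunk by the kernel radius on interior faces
};
```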

Exact Methods for Image Boundaries Eliminating Processing Artifacts

There are three common ways that image processing software deals with image boundaries: zero padding, last-value padding, and mirroring. Zero padding adds zeros to the border of the image, as in the zero-padding figure below. Last-value padding repeats the last value in the image outward, as in the copy-padding figure below. Mirroring reflects the values outward, as in the mirroring figure below.

HIP instead uses the size of the available support to re-normalize the operation. Many operations are trivial to re-normalize, such as the mean, median, minimum, and maximum filters. The Gaussian filter can be re-normalized easily enough by ensuring that the Gaussian kernel (or convolution matrix) sums to one. While applying the kernel to the input values, a running sum of the kernel values actually used is kept. After all of the available input values have been visited, the result need only be divided by this kernel accumulator. This works out the same as if a new Gaussian kernel had been calculated to be unitary over the same support. See a simple version of this in panel D of each of the zero, copy, and mirror figures.

The single-precision and integer-precision Gaussian figures show a practical example. A large Gaussian smoothing is often how background or out-of-focus signal is estimated; this exemplifies how not treating the edge of an image can be detrimental to downstream operations. Other operations have also been re-normalized, such as the Laplacian of Gaussian (LoG) filter. However, those results have only been verified empirically; the proof is still a work in progress.
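The following 1-D sketch illustrates the re-normalization described above: kernel taps that fall outside the image are simply skipped, a running sum of the taps actually used is kept, and the result is divided by that sum. The function name and signature are illustrative and not HIP's API; with a unit-sum Gaussian kernel the division changes nothing in the interior and only rescales values near the edges.

```cpp
#include <cstddef>
#include <vector>

// Re-normalized 1-D convolution: no padding is added. Only kernel taps with
// data support contribute, and the result is divided by the sum of the taps
// actually used. Assumes an odd-length kernel.
std::vector<double> renormalizedFilter(const std::vector<double>& img,
                                       const std::vector<double>& kernel)
{
    const std::ptrdiff_t n = static_cast<std::ptrdiff_t>(img.size());
    const std::ptrdiff_t r = static_cast<std::ptrdiff_t>(kernel.size()) / 2;

    std::vector<double> out(img.size(), 0.0);
    for (std::ptrdiff_t i = 0; i < n; ++i)
    {
        double acc = 0.0;       // weighted sum of the visited input values
        double kernelSum = 0.0; // sum of the kernel values actually used
        for (std::ptrdiff_t j = -r; j <= r; ++j)
        {
            const std::ptrdiff_t idx = i + j;
            if (idx < 0 || idx >= n)
                continue; // outside the image: skip instead of padding
            acc += img[idx] * kernel[j + r];
            kernelSum += kernel[j + r];
        }
        out[i] = acc / kernelSum; // re-normalize by the support that was used
    }
    return out;
}
```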

Image processing using a zero padding scheme

[Figure: zeroPad]

A is a simulated image that has been padded with zeros for a 5×5 neighborhood support. B is a 1-D sample of the image, represented by the green box in A. C is the same sample without the padding. The left side of D is a mean filter using B; the right side of D is a mean filter with the kernel re-normalized. E is like D, but the kernel is now a Gaussian filter. The resulting values in D and E have been rounded to the next integer value, as would be done with integer-valued data.

Image processing using a last value copying scheme

[Figure: copyPad]

A is a simulated image that has been padded with the last value for a 5×5 neighborhood support. B is a 1-D sample of the image, represented by the green box in A. C is the same sample without the padding. The left side of D is a mean filter using B; the right side of D is a mean filter with the kernel re-normalized. E is like D, but the kernel is now a Gaussian filter. The resulting values in D and E have been rounded to the next integer value, as would be done with integer-valued data.

Image processing using a mirroring scheme

[Figure: mirrorPad]

A is a simulated image that has been padded with a mirror image for a 5×5 neighborhood support. B is a 1-D sample of the image, represented by the green box in A. C is the same sample without the padding. The left side of D is a mean filter using B; the right side of D is a mean filter with the kernel re-normalized. E is like D, but the kernel is now a Gaussian filter. The resulting values in D and E have been rounded to the next integer value, as would be done with integer-valued data.

A comparison of Gaussian smoothing performed on an image with single-precision floating-point values

[Figure: gauss75single]

The leftmost column contains a raw image of fluorescently marked cells in a three-dimensional space. The second column from the left is the resulting image after being smoothed with a large Gaussian kernel using HIP. The Gaussian kernel is smaller in the $Z$ direction due to the anisotropic voxel dimensions of the image. The second column from the right was smoothed using a MATLAB Gaussian filter of the same size. The rightmost column is the difference between the HIP result and the MATLAB result. Positive values are in magenta, negative values are in green, and zero values are in black. All images have been normalized for display, with actual ranges listed on the sides. The two middle columns look almost identical; however, the rightmost column shows the structured difference between the two. Note that MATLAB loses energy at image edges with low intensity (first $Z$ slice) and gains energy where the intensities are higher (last $Z$ slice). Also note the difference in execution time between HIP and MATLAB.

A comparison of Gaussian smoothing performed on an image with unsigned 8-bit integer values

[Figure: gauss75int]

The layout is similar to that of the single-precision figure. In these integer-valued images, one can start to see that the MATLAB version is losing the asymmetry of the underlying data. These differences can significantly change downstream analysis.
