Note: at the time of publishing, GitHub did not render the formulas below well. This markdown renders correctly in VS Code, but I have not been able to sort out the formatting in GitHub markdown - sorry.
Having written a vectorised version of gradient descent in Python (numpy) to 'solve' the Digit Recognition problem, I wanted to understand the relative performance of this implementation against alternatives written in C++ and CUDA. Given my limited programming ability there is no way I have come close to an optimal implementation, but the work was still interesting. I opted to use efficient libraries where I could (and knew about them), because getting stuck in the weeds of fast matrix multiplication was not something I really wanted to get into.
- Python 3.10.4 for the benchmark implementation
- Microsoft (R) C/C++ Optimizing Compiler Version 19.32.31329
- vcpkg package management program version 2022-07-21-a0e87e227afb536c62188c11ad029954f28fdb22
- Package: Inih to read initialisation files
- Package: Eigen3 3.4.0 for matrix manipulation
- Package: oneTBB for running loops in parallel
- Package: Catch2 for testing
- CUDA: Build cuda_11.7.r11.7/compiler.31442593_0
- cuBLAS: 11.10.3 - Note my comment about avoiding the weeds of matrix multiplication but I was happy to stop here and not use cuDNN for this exercise
- Some terrible CMake - my fault because I am a noob. In particular I really struggled getting the appropriate inih debug or release library linked because it appears that inih does not have the necessary CMake support files. In the end I require the user to create an Environment Variable VCPKG_ROOT (to the folder that contains vcpkg.exe) from which I force CMake to find the appropriate version of inih.lib
- VSCode
- Keras: While not really comparable, because the library is designed to be far more general and fast to implement with, and because I didn't spend all the time necessary to understand the framework, I was nevertheless keen to understand how fast a model would run using a purpose-built framework. Keras had access to the GPU.
- The Python code in the Python folder only has a training algorithm. My benchmark test uses all the Kaggle input data for training (i.e. there is no training / testing split) because I am only interested in the algorithm's execution speed.
- The final C++ vectorised version is found in `/src/neuralnetwork.cpp`
- The final C++ loop version (both single and multithreaded) is found in `/src/neuralnetworkloop.cpp`
- Experimental C++ implementations can be found in `/src/neuralNetworkMethods.cpp`. See `train_loop_base`, the naive implementation, and `train_loop_faster`, which uses more memory because it keeps copies of the weights and the derivatives of the weights in both their base and transposed forms to reduce the number of calculations.
- CUDA code is in `/src_gpu/neuralnetwork_gpu.cu`
- Keras is in `/python/keras.ipynb`
Absolute performance is obviously dependent on the tin on which this runs. I will focus on relative performance and try to keep the machine specs out of the equation wherever possible. One spec which is relevant, however, is that the machine I used for this benchmark exercise has 8 cores and my NVIDIA RTX 3060 has 3584 cores.
My benchmark test consists of 200 epochs through a neural network with 784 input neurons, one hidden layer of 50 neurons and an output layer of 10 neurons, i.e. the architecture is (784, 50, 10). All 42,000 examples in the input file are used as the training set for performance benchmarking (I am not trying to enter the Kaggle competition, so there is no need to reserve anything for testing).
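With this architecture the model holds $50 \times 784 + 50 = 39{,}250$ weights and biases between the input and hidden layers and $10 \times 50 + 10 = 510$ between the hidden and output layers, so each epoch updates $39{,}760$ parameters using all $42{,}000$ examples.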
- Python base: speed = 1.00 (the normalised base). It uses numpy, which is multithreaded, and double precision arithmetic.
- Fastest C++: speed = 0.54 (a 46% improvement in execution time). It uses Eigen for vectorisation, which uses OpenMP for multithreading. The speedup was almost entirely due to moving from double to single precision (see the `/src/typedefs.h` file to make this change).
- Loop (single thread, single precision): speed = 2.80 (2.8x slower than the Python base).
- Loop (multithreaded, single precision): speed = 0.83 (a 16.5% improvement), using Intel's oneTBB for multithreading.
- GPU (single precision): speed = 0.06 (a 17x improvement).
For reference, the execution times, in seconds, on my machine are as follows:
| Run | Python | Eigen3 | Eigen3 | Loop (1 thread) | Loop (tbb) | GPU | Keras |
|---|---|---|---|---|---|---|---|
| Precision | double | single | double | single | single | single | single |
| 1 | 37.6 | 20.23 | 36.75 | 107.11 | 31.91 | 2.23 | 7.44 |
| 2 | 37.9 | 20.11 | 38.28 | 106.45 | 31.62 | 2.67 | 7.40 |
| 3 | 38.9 | 22.50 | 37.95 | 106.51 | 32.04 | 1.99 | 7.35 |
| 4 | 37.2 | 20.26 | 36.46 | 106.51 | 31.53 | 1.96 | 7.53 |
| 5 | 39.2 | 20.13 | 36.74 | 107.60 | 32.04 | 2.31 | 7.36 |
| Average | 38.16 | 20.65 | 37.24 | 106.84 | 31.83 | 2.23 | 7.42 |
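The normalised speeds quoted above are simply ratios of these averages: for example $20.65 / 38.16 \approx 0.54$ for the fastest C++ build and $2.23 / 38.16 \approx 0.06$ for the GPU.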
- $l$ : the index of the output layer. The input is layer $0$, so the network has $l+1$ layers including input and output.
- $n^{[i]}$ for $i=0,\dots,l$ : the number of nodes in layer $i$.
- $m$ : the number of examples in the training set.
- $x^{(i)}$ for $i=1,\dots,m$ : the $i$-th example, represented as a column vector of size $(n^{[0]} \times 1)$.
- $X=\begin{bmatrix} |&|&|&| \\ x^{(1)}&x^{(2)}&...&x^{(m)} \\ |&|&|&| \end{bmatrix}$ : all the training data, where each column is one example. Note the input data is not stored like this, so it needs to be transformed before being used.
- $y^{(i)}$ : the label of the $i$-th example, represented as a column vector of size $(n^{[l]} \times 1)$, which in this example is $(10 \times 1)$:
  $0 = \begin{bmatrix} 1\\ 0\\ 0\\ \vdots\\ 0 \end{bmatrix}$, $1 = \begin{bmatrix} 0\\ 1\\ 0\\ \vdots\\ 0 \end{bmatrix}$, $2 = \begin{bmatrix} 0\\ 0\\ 1\\ \vdots\\ 0 \end{bmatrix}$, $\dots$, $9 = \begin{bmatrix} 0\\ 0\\ 0\\ \vdots\\ 1 \end{bmatrix}$
- $Y=\begin{bmatrix} |&|&|&| \\ y^{(1)}&y^{(2)}&...&y^{(m)} \\ |&|&|&| \end{bmatrix}$ : all the labels, one column per example.
- $\hat{y}^{(i)}$ : the model-generated output when running $x^{(i)}$ through one pass of forward propagation.
- $\hat{Y} = \begin{bmatrix} |&|&|&| \\ \hat{y}^{(1)}&\hat{y}^{(2)}&...&\hat{y}^{(m)} \\ |&|&|&| \end{bmatrix}$ : all the model outputs, one column per example.
To run an example $x^{(i)}$ through the neural network we apply one pass of forward propagation:

$$
\begin{aligned}
z^{[1](i)} &= W^{[0]}x^{(i)} + b^{[0]} \\
a^{[1](i)} &= \sigma \left( z^{[1](i)} \right) \\
z^{[2](i)} &= W^{[1]}a^{[1](i)} + b^{[1]} \\
a^{[2](i)} &= \sigma \left( z^{[2](i)} \right) \\
&\dots \\
z^{[l](i)} &= W^{[l-1]}a^{[l-1](i)} + b^{[l-1]} \\
\hat{y}^{(i)} &= a^{[l](i)} = \sigma \left( z^{[l](i)} \right)
\end{aligned}
$$

where $W^{[j]}$ is the weight matrix of size $(n^{[j+1]} \times n^{[j]})$ and $b^{[j]}$ is the bias vector of size $(n^{[j+1]} \times 1)$ connecting layer $j$ to layer $j+1$, for $j=0,\dots,l-1$, and $\sigma$ is the activation function (the sigmoid, defined below) applied element-wise.
We can use matrix notation to perform the forward propagation of all the input examples more compactly.
$$
\begin{aligned}
Z^{[1]} &= W^{[0]}X + b^{[0]} \\
A^{[1]} &= \sigma \left( Z^{[1]} \right) \\
Z^{[2]} &= W^{[1]}A^{[1]} + b^{[1]} \\
A^{[2]} &= \sigma \left( Z^{[2]} \right) \\
&\dots \\
Z^{[l]} &= W^{[l-1]}A^{[l-1]} + b^{[l-1]} \\
\hat{Y} &= A^{[l]} = \sigma \left( Z^{[l]} \right)
\end{aligned}
$$
where

$A^{[j]}=\begin{bmatrix} |&|&|&| \\ a^{[j](1)}&a^{[j](2)}&...&a^{[j](m)} \\ |&|&|&| \end{bmatrix}$ and $Z^{[j]}=\begin{bmatrix} |&|&|&| \\ z^{[j](1)}&z^{[j](2)}&...&z^{[j](m)} \\ |&|&|&| \end{bmatrix}$, for $j=1,\dots,l$, and each bias vector $b^{[j-1]}$ is added to (broadcast across) every column.
In the code, I call the $Z^{[j]}$ matrices `unactivated_values` and the $A^{[j]}$ matrices `neuron_values`.
The estimators $\hat{Y}$ are therefore a nested composition of the weight matrices, bias vectors and activation function:

$$
\hat{Y} = \sigma \left(W^{[l-1]} \sigma \left( W^{[l-2]} \sigma \left( \dots \right) + b^{[l-2]} \right) + b^{[l-1]} \right)
$$
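To make the vectorised pass concrete, here is a minimal numpy sketch (an illustration only, not the repository code; the function names, shapes and random initialisation are my own) that stores the $Z^{[j]}$ and $A^{[j]}$ matrices under the names used in the C++ code:

```python
import numpy as np

def sigmoid(z):
    # element-wise sigmoid activation
    return 1.0 / (1.0 + np.exp(-z))

def forward(X, weights, biases):
    """Vectorised forward propagation.

    X       : (n[0] x m) matrix, one training example per column
    weights : [W0, W1, ...] with W[j] of shape (n[j+1] x n[j])
    biases  : [b0, b1, ...] with b[j] of shape (n[j+1] x 1)
    Returns (unactivated_values, neuron_values); neuron_values[-1] is Y_hat.
    """
    unactivated_values = []      # the Z^[j] matrices, j = 1..l
    neuron_values = [X]          # the A^[j] matrices, with A^[0] = X
    A = X
    for W, b in zip(weights, biases):
        Z = W @ A + b            # b is broadcast across the m columns
        A = sigmoid(Z)
        unactivated_values.append(Z)
        neuron_values.append(A)
    return unactivated_values, neuron_values

# example with the benchmark architecture (784, 50, 10) and a few fake examples
rng = np.random.default_rng(0)
X = rng.random((784, 5))
weights = [0.01 * rng.standard_normal((50, 784)),
           0.01 * rng.standard_normal((10, 50))]
biases = [np.zeros((50, 1)), np.zeros((10, 1))]
Zs, As = forward(X, weights, biases)
Y_hat = As[-1]                   # (10 x 5) matrix of predictions
```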
Once we have the values $\hat{Y}$ we can measure how well the model fits the training data using the cross-entropy loss

$$
\mathscr{L} (A^{[l]}, Y) = \mathscr{L} (\hat{Y}, Y) = \frac{1}{m} \sum \limits _{i=1} ^m \left[ -y^{(i)} \log(\hat{y}^{(i)}) - (1-y^{(i)}) \log(1-\hat{y}^{(i)}) \right]
$$

which is a column vector of size $(n^{[l]} \times 1)$, i.e. $(10 \times 1)$ here, with the logs and products applied element-wise.
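A direct numpy translation of this loss (again only a sketch, intended to be used with the `Y_hat` produced by the forward-propagation snippet above and a one-hot label matrix `Y` of the same shape):

```python
import numpy as np

def cross_entropy(Y_hat, Y):
    """Cross-entropy loss of the predictions Y_hat against the labels Y.

    Both arguments are (10 x m); the result is a (10 x 1) column vector,
    one entry per output node, averaged over the m examples (columns).
    """
    m = Y.shape[1]
    L = -(Y * np.log(Y_hat) + (1.0 - Y) * np.log(1.0 - Y_hat))
    return L.sum(axis=1, keepdims=True) / m

# summing the entries gives the scalar cost used by gradient descent below:
# J = cross_entropy(Y_hat, Y).sum()
```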
If we define the scalar cost $J$ as the sum of the entries of $\mathscr{L}(\hat{Y}, Y)$, each iteration of gradient descent performs the following steps:

- Use forward propagation to compute the predictions $\hat{Y}$
- Define $dW^{[i]}=\frac{\partial J}{\partial W^{[i]}}$
- Define $db^{[i]}=\frac{\partial J}{\partial b^{[i]}}$
- Update $W^{[i]} := W^{[i]} - \alpha \, dW^{[i]}$
- Update $b^{[i]} := b^{[i]} - \alpha \, db^{[i]}$

where $\alpha$ is the learning rate.
The calculation of the partial derivatives, which is implemented in the 'back propagation' part of the code, can be shown to be

$$
\begin{aligned}
dA^{[l]} &= -\frac{Y}{\hat{Y}} + \frac{1-Y}{1-\hat{Y}} \\
dZ^{[i]} &= dA^{[i]} \odot \sigma^{'} \left( Z^{[i]} \right) \\
dW^{[i-1]} &= \frac{1}{m} \, dZ^{[i]} \left( A^{[i-1]} \right)^{T} \\
db^{[i-1]} &= \frac{1}{m} \sum_{j=1}^{m} dZ^{[i](j)} \quad \text{(add up the columns of } dZ^{[i]} \text{)} \\
dA^{[i-1]} &= \left( W^{[i-1]} \right)^{T} dZ^{[i]}
\end{aligned}
$$

applied for $i = l, l-1, \dots, 1$, where $\odot$ denotes element-wise multiplication, $dZ^{[i](j)}$ is the $j$-th column of $dZ^{[i]}$, $A^{[0]} = X$, and the divisions in the first line are also element-wise.
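Putting the formulas together, one gradient-descent iteration could look like the following numpy sketch. It continues the forward-propagation snippet above (reusing the `sigmoid` and `forward` helpers defined there) and is, again, only an illustration of the maths, not the repository code; `alpha` is the learning rate:

```python
def backward(X, Y, weights, unactivated_values, neuron_values):
    """Back propagation: returns lists dW, db matching weights and biases."""
    m = X.shape[1]
    dW = [None] * len(weights)
    db = [None] * len(weights)
    Y_hat = neuron_values[-1]
    dA = -Y / Y_hat + (1.0 - Y) / (1.0 - Y_hat)     # dA^[l]
    for i in reversed(range(len(weights))):         # weight layers l-1, ..., 0
        Z = unactivated_values[i]                   # Z^[i+1]
        A_prev = neuron_values[i]                   # A^[i] (A^[0] = X)
        dZ = dA * sigmoid(Z) * (1.0 - sigmoid(Z))   # dZ = dA * sigma'(Z), element-wise
        dW[i] = dZ @ A_prev.T / m
        db[i] = dZ.sum(axis=1, keepdims=True) / m   # add up the columns of dZ
        dA = weights[i].T @ dZ                      # dA for the layer below
    return dW, db

def gradient_descent_step(X, Y, weights, biases, alpha):
    # one iteration over the full training set: forward, backward, update
    Zs, As = forward(X, weights, biases)
    dW, db = backward(X, Y, weights, Zs, As)
    for i in range(len(weights)):
        weights[i] -= alpha * dW[i]
        biases[i] -= alpha * db[i]
```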
We are using the Kaggle Digit Recognizer data (see https://www.kaggle.com/competitions/digit-recognizer). In this data a single digit is scanned and partitioned into a 28x28 matrix of greyscale values (between 0 and 255, representing the colour intensity). The matrix is indexed as
$$
\begin{bmatrix}
000&001&002&003&...&026&027 \\
028&029&030&031&...&054&055 \\
056&057&058&059&...&082&083 \\
|&|&|&|&...&|&| \\
728&729&730&731&...&754&755 \\
756&757&758&759&...&782&783 \\
\end{bmatrix}
$$
and each scanned image is then unrolled into a row vector, giving the input matrix
$$
X_{input} = \begin{bmatrix}
-(x^{(1)})^T- \\
-(x^{(2)})^T- \\
\vdots \\
-(x^{(m)})^T- \\
\end{bmatrix}
$$
where $m = 42{,}000$. A column of labels, with values from 0 to 9, is prefixed to $X_{input}$ in the input file; these labels are converted into the one-hot column vectors $y^{(i)}$ described above to form $Y$.
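For illustration, here is a numpy sketch of turning that file into the matrices used above. It assumes the Kaggle `train.csv` layout (a header row, then one label column followed by 784 pixel columns); the scaling to $[0, 1]$ is my own choice and not necessarily what the repository loaders do:

```python
import numpy as np

# one row per example: column 0 is the label (0-9), columns 1..784 the pixels
data = np.loadtxt("train.csv", delimiter=",", skiprows=1)

labels = data[:, 0].astype(int)     # (m,) vector of digit labels
X = data[:, 1:].T / 255.0           # (784 x m), transposed and scaled to [0, 1]
m = X.shape[1]                      # 42,000 examples

# one-hot encode the labels into the (10 x m) matrix Y described above
Y = np.zeros((10, m))
Y[labels, np.arange(m)] = 1.0
```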
In the benchmark test I use:

- $l=2$ : the input is layer $0$, the hidden layer is layer $1$ and the output is layer $2$
- $n^{[0]}=784=28 \times 28$, $n^{[1]}=50$ and $n^{[2]}=10$
- $m=42{,}000$
- $\sigma(x) = \frac{1}{1+e^{-x}}$ : the sigmoid function
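A convenient property of the sigmoid, used in the back propagation formulas above, is that its derivative can be written in terms of its own value: $\sigma^{'}(x) = \sigma(x)\left(1-\sigma(x)\right)$.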