To obtain current versions of all dependencies, git clone (or git pull to update) the following repositories:
To save time on installation, you can use a pre-built Docker image to run the package:
docker run -it ghcr.io/kamalsaleh/machine-learning-for-cap-docker:latest
This will install the computer algebra system GAP and all other dependencies in an isolated container environment.
This package is an implementation of the ideas presented in the paper Deep Learning with Parametric Lenses using the categorical programming language offered by the CAP project. The following is a brief overview of the operations offered by this repository.
Let $X$ and $Y$ be Euclidean spaces. A parametrized map from $X$ to $Y$ consists of a parameter space $P$ together with a map $f \colon P \times X \to Y$.
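In the notation used throughout this README (a sketch; the symbols $f$, $P$ and $\theta$ are generic and not fixed by the package):

$$
f \colon P \times X \longrightarrow Y,
\qquad
f_{\theta} := f(\theta, -) \colon X \longrightarrow Y
\quad\text{for a fixed parameter-vector } \theta \in P .
$$

Fixing a parameter-vector $\theta$ turns a parametrized map into an ordinary map $X \to Y$; this is essentially what reparametrising along a constant map does in the examples below.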
In machine learning, both prediction maps and loss maps can be seen as parametrized maps, and they play distinct but complementary roles in the model training process. Let us break down how each of these fits into the concept of parametrized maps.
Let $D = \{ (x^{[i]}, y^{[i]}) \}_{i=1}^{N}$ be a training set where:

- $x^{[i]}$ is the input-vector (feature vector) and it belongs to the input-space $X$.
- $y^{[i]}$ is the label-vector of $x^{[i]}$ and it belongs to the output-space $Y$.
A prediction map can be defined as a parametrized map $f \colon P \times X \to Y$ that, for a given parameter-vector $\theta \in P$, assigns to each input $x \in X$ a prediction $f_\theta(x) \in Y$.
For example, let $f \colon \mathbb{R}^2 \times \mathbb{R} \to \mathbb{R}$ be given by $f((\theta_1, \theta_2), x) = \theta_1 x + \theta_2$, where $\theta_1$ plays the role of a weight and $\theta_2$ the role of a bias; this is the prediction map of the simple linear model used in the first example below.
A loss map can be defined as a parametrized map $\ell \colon P \times (X \times Y) \to \mathbb{R}$, where $\ell_\theta(x, y)$ measures how far the prediction $f_\theta(x)$ is from the true label $y$.
For a linear regression model, the loss map can be defined as the quadratic loss sketched below, where $\theta = (\theta_1, \theta_2)$ is the parameter-vector of the line and $(x, y)$ is a labeled example.
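An explicit form that matches the loss morphism displayed later in this example is

$$
\ell\big((\theta_1, \theta_2), (x, y)\big) \;=\; \big(\theta_1 x + \theta_2 - y\big)^{2},
$$

divided by the output dimension of the model (which is $1$ in the linear regression example below).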
Given a training set $D$, the total loss of a parameter-vector $\theta$ is the average of $\ell_\theta(x^{[i]}, y^{[i]})$ over all examples in $D$; training the model means finding a parameter-vector that (approximately) minimizes this total loss.
In this example, we consider a training dataset consisting of the three points $(1, 2.9)$, $(2, 5.1)$ and $(3, 7.05)$. We aim to compute a line that fits these three points.
gap> LoadPackage( "MachineLearningForCAP" );
true
gap> Para := CategoryOfParametrisedMorphisms( SkeletalSmoothMaps );
CategoryOfParametrisedMorphisms( SkeletalSmoothMaps )
gap> D := [ [ 1, 2.9 ], [ 2, 5.1 ], [ 3, 7.05 ] ];;
Let us create a neural network with the following architecture:
where the activation map applied on the output layer is the identity function IdFunc. Its input dimension is 1, its output dimension is 1, and it has no hidden layers.
gap> input_dim := 1;; hidden_dims := [ ];; output_dim := 1;;
gap> f := PredictionMorphismOfNeuralNetwork( Para, input_dim, hidden_dims, output_dim, "IdFunc" );;
As a parametrized map, this neural network is given by $f \colon \mathbb{R}^2 \times \mathbb{R}^1 \to \mathbb{R}^1$, $f((\theta_1, \theta_2), x) = \theta_1 x + \theta_2$. Note that the parameter space is $\mathbb{R}^2$: $\theta_1$ is the weight and $\theta_2$ is the bias of the single output neuron.
gap> input := ConvertToExpressions( [ "theta_1", "theta_2", "x" ] );
[ theta_1, theta_2, x ]
gap> Display( f : dummy_input := input );
ℝ^1 -> ℝ^1 defined by:
Underlying Object:
-----------------
ℝ^2
Underlying Morphism:
-------------------
ℝ^3 -> ℝ^1
‣ theta_1 * x + theta_2
Let us now evaluate this prediction map on a random parameters-vector in $\mathbb{R}^2$ and an input-vector in $\mathbb{R}^1$:
gap> theta := [ 2, 1 ];; x := [ 2 ];;
gap> Eval( f, [ theta, x ] );
[ 5 ]
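As a quick sanity check (plain arithmetic, matching the formula $\theta_1 x + \theta_2$ displayed above):

$$
f_{(2,\,1)}(2) \;=\; 2 \cdot 2 + 1 \;=\; 5 .
$$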
To train the neural network, we need to specify a loss map that will be used to learn the weights by minimizing the total loss. Since the activation map applied on the output layer is IdFunc, we use the Quadratic-Loss map:
Note that the quadratic loss is averaged over the output dimension; since the output dimension is $1$ here, this simply divides the squared difference by $1$, as visible in the display below.
In the following we construct the aforementioned loss-map:
gap> ell := LossMorphismOfNeuralNetwork( Para, input_dim, hidden_dims, output_dim, "IdFunc" );;
gap> input := ConvertToExpressions( [ "theta_1", "theta_2", "x", "y" ] );
[ theta_1, theta_2, x, y ]
gap> Display( ell : dummy_input := input );
ℝ^2 -> ℝ^1 defined by:
Underlying Object:
-----------------
ℝ^2
Underlying Morphism:
-------------------
ℝ^4 -> ℝ^1
‣ (theta_1 * x + theta_2 - y) ^ 2 / 1
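The loss map can be evaluated in the same way as the prediction map; the following call is a sketch that assumes Eval accepts the pair $(x, y)$ flattened into a single input list, and the commented value is plain arithmetic rather than verified output:

gap> Eval( ell, [ [ 2, 1 ], [ 1, 2.9 ] ] ); # (2*1 + 1 - 2.9)^2 / 1 = 0.01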
In order to learn the parameters, we need to specify an optimization procedure. In this example, we will use the Gradient-Descent-Optimizer. Starting with initial values, it computes the gradient of the loss function and updates the parameters in the opposite direction of the gradient, scaled by a learning rate. This process continues until the loss function converges to a minimum.
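In symbols, this is the familiar gradient-descent update rule (stated here only for orientation; $\alpha$ denotes the learning rate, set to $0.01$ below):

$$
\theta \;\leftarrow\; \theta \;-\; \alpha \, \nabla_{\theta}\, \ell_{\theta} .
$$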
gap> Lenses := CategoryOfLenses( SkeletalSmoothMaps );
CategoryOfLenses( SkeletalSmoothMaps )
gap> optimizer := Lenses.GradientDescentOptimizer( : learning_rate := 0.01 );;
Now we compute the One-Epoch-Update-Lens using the batch size = 1:
gap> batch_size := 1;;
gap> one_epoch_update := OneEpochUpdateLens( ell, optimizer, D, batch_size );
(ℝ^2, ℝ^2) -> (ℝ^1, ℝ^0) defined by:
Get Morphism:
----------
ℝ^2 -> ℝ^1
Put Morphism:
----------
ℝ^2 -> ℝ^2
The Get Morphism computes the total loss associated to a parameter-vector $\theta \in \mathbb{R}^2$, and the Put Morphism performs one epoch of updates on $\theta$.
Let us initialize a parameter-vector:
gap> theta := [ 0.1, -0.1 ];;
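Before training, we can compute the total loss of this initial parameter-vector via the Get Morphism of the lens; the following call is a sketch using the same conventions as above, and its value should agree with the Epoch 0 loss reported below:

gap> Eval( GetMorphism( one_epoch_update ), theta ); # expected: approximately [ 26.7775 ]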
To perform nr_epochs = 15 updates on $\theta$, we use the Fit operation:
gap> nr_epochs := 15;;
gap> theta := Fit( one_epoch_update, nr_epochs, theta );
Epoch 0/15 - loss = 26.777499999999993
Epoch 1/15 - loss = 13.002145872163197
Epoch 2/15 - loss = 6.3171942035316935
Epoch 3/15 - loss = 3.0722513061917534
Epoch 4/15 - loss = 1.4965356389126505
Epoch 5/15 - loss = 0.73097379078374358
Epoch 6/15 - loss = 0.35874171019291579
Epoch 7/15 - loss = 0.1775574969062125
Epoch 8/15 - loss = 0.089228700384937534
Epoch 9/15 - loss = 0.046072054531129378
Epoch 10/15 - loss = 0.024919378473509772
Epoch 11/15 - loss = 0.014504998499450883
Epoch 12/15 - loss = 0.0093448161379050161
Epoch 13/15 - loss = 0.0067649700132868147
Epoch 14/15 - loss = 0.0054588596501628835
Epoch 15/15 - loss = 0.0047859930295160499
[ 2.08995, 0.802632 ]
The parameter-vector after 15 epochs is $\theta = [\,2.08995,\ 0.802632\,]$. Let us use it to reparametrise the prediction map $f$:
gap> theta := SkeletalSmoothMaps.Constant( theta );
ℝ^0 -> ℝ^2
gap> f_theta := ReparametriseMorphism( f, theta );
ℝ^1 -> ℝ^1 defined by:
Underlying Object:
-----------------
ℝ^0
Underlying Morphism:
-------------------
ℝ^1 -> ℝ^1
gap> f_theta := UnderlyingMorphism( f_theta );;
gap> Display( f_theta );
ℝ^1 -> ℝ^1
‣ 2.08995 * x1 + 0.802632
Let us compute the predicted values associated to the inputs $x = 1, 2, 3$:
gap> Eval( f_theta, [ 1 ] );
[ 2.89259 ]
gap> Eval( f_theta, [ 2 ] );
[ 4.98254 ]
gap> Eval( f_theta, [ 3 ] );
[ 7.07249 ]
The following image illustrates the lines defined by the parameter-vectors over the course of the 15 epochs.
In this example, we consider a training dataset consisting of points in the two-dimensional Euclidean space $\mathbb{R}^2$, each labeled with one of three classes, and we aim to train a network that classifies such points.
To facilitate this classification, we use a one-hot encoding scheme for the labels. There are three possible classes for the data points. We denote these classes as class-1 (red), class-2 (green), and class-3 (blue). The labels are encoded using one-hot vectors. A one-hot vector is a binary vector with a length equal to the number of classes, where only the element corresponding to the true class is 1, and all other elements are 0. That is
- class-1 : [1, 0, 0]
- class-2 : [0, 1, 0]
- class-3 : [0, 0, 1]
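For example, the labeled point $(1, -1)$ belonging to class-1 (red) is stored in the training set constructed below as the row

$$
[\ 1,\ -1,\ 1,\ 0,\ 0\ ],
$$

i.e. the two input coordinates followed by the one-hot encoded label.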
That is, the training set (the set of labeled examples) is a finite subset of $\mathbb{R}^2 \times \mathbb{R}^3$.
gap> LoadPackage( "MachineLearningForCAP" );
true
gap> Para := CategoryOfParametrisedMorphisms( SkeletalSmoothMaps );
CategoryOfParametrisedMorphisms( SkeletalSmoothMaps )
gap> class_1 := Concatenation( List( [ -2 .. 3 ], i -> [ [ i, i - 1, 1, 0, 0 ], [ i + 1, i - 1, 1, 0, 0 ] ] ) );;
gap> class_2 := Concatenation( List( [ -3 .. -1 ], i -> List( [ i + 1 .. - i - 1 ], j -> [ i, j, 0, 1, 0 ] ) ) );;
gap> class_3 := Concatenation( List( [ 1 .. 3 ], i -> List( [ - i + 1 .. i - 1 ], j -> [ j, i, 0, 0, 1 ] ) ) );;
gap> D := Concatenation( class_1, class_2, class_3 );
[ [ -2, -3, 1, 0, 0 ], [ -1, -3, 1, 0, 0 ], [ -1, -2, 1, 0, 0 ], [ 0, -2, 1, 0, 0 ], [ 0, -1, 1, 0, 0 ],
[ 1, -1, 1, 0, 0 ], [ 1, 0, 1, 0, 0 ], [ 2, 0, 1, 0, 0 ], [ 2, 1, 1, 0, 0 ], [ 3, 1, 1, 0, 0 ],
[ 3, 2, 1, 0, 0 ], [ 4, 2, 1, 0, 0 ], [ -3, -2, 0, 1, 0 ], [ -3, -1, 0, 1, 0 ], [ -3, 0, 0, 1, 0 ],
[ -3, 1, 0, 1, 0 ], [ -3, 2, 0, 1, 0 ], [ -2, -1, 0, 1, 0 ], [ -2, 0, 0, 1, 0 ], [ -2, 1, 0, 1, 0 ],
[ -1, 0, 0, 1, 0 ], [ 0, 1, 0, 0, 1 ], [ -1, 2, 0, 0, 1 ], [ 0, 2, 0, 0, 1 ], [ 1, 2, 0, 0, 1 ],
[ -2, 3, 0, 0, 1 ], [ -1, 3, 0, 0, 1 ], [ 0, 3, 0, 0, 1 ], [ 1, 3, 0, 0, 1 ], [ 2, 3, 0, 0, 1 ] ]
Let us create a neural network with the following architecture:
where the activation map applied on the output layer is Softmax. Its input dimension is 2, its output dimension is 3, and it has no hidden layers.
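Recall that for a pre-activation vector $z = (z_1, z_2, z_3)$ the Softmax activation returns a probability vector; this is exactly the pattern visible in the morphism displayed below:

$$
\mathrm{Softmax}(z)_j \;=\; \frac{e^{z_j}}{\sum_{k=1}^{3} e^{z_k}}, \qquad j = 1, 2, 3 .
$$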
gap> input_dim := 2;; hidden_dims := [ ];; output_dim := 3;;
gap> f := PredictionMorphismOfNeuralNetwork( Para, input_dim, hidden_dims, output_dim, "Softmax" );;
As a parametrized map, this neural network is the composite of an affine map $\mathbb{R}^2 \to \mathbb{R}^3$ with the Softmax activation. Note that the parameter space is $\mathbb{R}^9$: each of the three output neurons has two weights and one bias (e.g. $\theta_1, \theta_2$ are the weights and $\theta_3$ is the bias of the first neuron).
gap> input := ConvertToExpressions( [ "theta_1", "theta_2", "theta_3", "theta_4", "theta_5", "theta_6", "theta_7", "theta_8", "theta_9", "x1", "x2" ] );
[ theta_1, theta_2, theta_3, theta_4, theta_5, theta_6, theta_7, theta_8, theta_9, x1, x2 ]
gap> Display( f : dummy_input := input );
ℝ^2 -> ℝ^3 defined by:
Underlying Object:
-----------------
ℝ^9
Underlying Morphism:
-------------------
ℝ^11 -> ℝ^3
‣ Exp( (theta_1 * x1 + theta_2 * x2 + theta_3) )
/ (Exp( theta_1 * x1 + theta_2 * x2 + theta_3 ) + Exp( (theta_4 * x1 + theta_5 * x2 + theta_6) ) + Exp( (theta_7 * x1 + theta_8 * x2 + theta_9) ))
‣ Exp( (theta_4 * x1 + theta_5 * x2 + theta_6) )
/ (Exp( theta_1 * x1 + theta_2 * x2 + theta_3 ) + Exp( (theta_4 * x1 + theta_5 * x2 + theta_6) ) + Exp( (theta_7 * x1 + theta_8 * x2 + theta_9) ))
‣ Exp( (theta_7 * x1 + theta_8 * x2 + theta_9) )
/ (Exp( theta_1 * x1 + theta_2 * x2 + theta_3 ) + Exp( (theta_4 * x1 + theta_5 * x2 + theta_6) ) + Exp( (theta_7 * x1 + theta_8 * x2 + theta_9) ))
Let us now evaluate this prediction map on a random parameters-vector in $\mathbb{R}^9$ and an input-vector in $\mathbb{R}^2$:
gap> theta := [ 0.1, -0.1, 0, 0.1, 0.2, 0, -0.2, 0.3, 0 ];;
gap> input_vec := [ 1, 2 ];;
gap> prediction_vec := Eval( f, [ theta, input_vec ] );
[ 0.223672, 0.407556, 0.368772 ]
gap> PositionMaximum( prediction_vec );
2
That is, the input-vector $(1, 2)$ is predicted to belong to class-2 (green).
To train the neural network, we need to specify a loss map that will be used to learn the weights by minimizing the total loss. Since the activation map applied on the output layer is Softmax, we use the Cross-Entropy loss map:
Note that the Cross-Entropy loss is averaged over the output dimension, which explains the division by $3$ in the morphism displayed below.
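More explicitly, writing $z_j := \theta_{3j-2}\, x_1 + \theta_{3j-1}\, x_2 + \theta_{3j}$ for the pre-activations and $y = (y_1, y_2, y_3)$ for the one-hot label, the loss can be read off the morphism displayed below as

$$
\ell_{\theta}\big((x_1, x_2), y\big)
\;=\; \frac{1}{3} \sum_{j=1}^{3} y_j \Big( \log \sum_{k=1}^{3} e^{z_k} \;-\; z_j \Big)
\;=\; -\frac{1}{3} \sum_{j=1}^{3} y_j \log \big( \mathrm{Softmax}(z)_j \big) .
$$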
In the following we construct the aforementioned loss-map:
gap> ell := LossMorphismOfNeuralNetwork( Para, input_dim, hidden_dims, output_dim, "Softmax" );;
gap> input := ConvertToExpressions( [ "theta_1", "theta_2", "theta_3", "theta_4", "theta_5", "theta_6", "theta_7", "theta_8", "theta_9", "x1", "x2", "y1", "y2", "y3" ] );
gap> Display( ell : dummy_input := input );
ℝ^5 -> ℝ^1 defined by:
Underlying Object:
-----------------
ℝ^9
Underlying Morphism:
-------------------
ℝ^14 -> ℝ^1
‣ ((Log( Exp( theta_1 * x1 + theta_2 * x2 + theta_3 ) + Exp( (theta_4 * x1 + theta_5 * x2 + theta_6) ) + Exp( (theta_7 * x1 + theta_8 * x2 + theta_9) ) ) - (theta_1 * x1 + theta_2 * x2 + theta_3)) * y1
+ (Log( Exp( theta_1 * x1 + theta_2 * x2 + theta_3 ) + Exp( (theta_4 * x1 + theta_5 * x2 + theta_6) ) + Exp( (theta_7 * x1 + theta_8 * x2 + theta_9) ) ) - (theta_4 * x1 + theta_5 * x2 + theta_6)) * y2
+ (Log( Exp( theta_1 * x1 + theta_2 * x2 + theta_3 ) + Exp( (theta_4 * x1 + theta_5 * x2 + theta_6) ) + Exp( (theta_7 * x1 + theta_8 * x2 + theta_9) ) ) - (theta_7 * x1 + theta_8 * x2 + theta_9)) * y3) / 3
Until now, we have the training set $D$, the prediction map $f$ and the loss map $\ell$. To train the network, we still need to choose an optimization procedure.
In this example, we want to use the Adam-Optimizer procedure. The Adam-Optimizer is an extension of stochastic gradient descent that computes adaptive learning rates for each parameter. When using the Adam-Optimizer, additional auxiliary weights are maintained for each parameter to store moment estimates. Specifically, for a model with $n$ parameters, the Adam-Optimizer keeps track of:

- Time Step ($\mathbf{t}$): one auxiliary weight.
- First Moment Estimates ($\mathbf{m}$): $n$ auxiliary weights.
- Second Moment Estimates ($\mathbf{v}$): $n$ auxiliary weights.
- Original Model Weights: $n$ parameters.

So, in total, the Adam-Optimizer operates on an extended parameter-vector with $3n + 1$ entries; for our model, $n = 9$, which gives $3 \cdot 9 + 1 = 28$ entries.
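For orientation, the textbook Adam update rule reads as follows (this is the standard formulation, stated here only to explain the roles of $t$, $m$ and $v$; $\alpha$ is the learning rate, $\beta_1, \beta_2$ are the decay rates passed to the optimizer below, $g$ is the gradient of the loss, and all operations are componentwise):

$$
\begin{aligned}
t &\leftarrow t + 1, \\
m &\leftarrow \beta_1\, m + (1 - \beta_1)\, g, \\
v &\leftarrow \beta_2\, v + (1 - \beta_2)\, g^{2}, \\
\theta &\leftarrow \theta \;-\; \alpha\, \frac{m / (1 - \beta_1^{\,t})}{\sqrt{v / (1 - \beta_2^{\,t})} + \varepsilon} .
\end{aligned}
$$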
gap> Lenses := CategoryOfLenses( SkeletalSmoothMaps );
CategoryOfLenses( SkeletalSmoothMaps )
gap> optimizer := Lenses.AdamOptimizer( : learning_rate := 0.01, beta_1 := 0.9, beta_2 := 0.999 );;
gap> optimizer( 9 );
(ℝ^28, ℝ^28) -> (ℝ^9, ℝ^9) defined by:
Get Morphism:
----------
ℝ^28 -> ℝ^9
Put Morphism:
----------
ℝ^37 -> ℝ^28
Now we compute the One-Epoch-Update-Lens using the batch size = 1:
gap> batch_size := 1;;
gap> one_epoch_update := OneEpochUpdateLens( ell, optimizer, D, batch_size );
(ℝ^28, ℝ^28) -> (ℝ^1, ℝ^0) defined by:
Get Morphism:
----------
ℝ^28 -> ℝ^1
Put Morphism:
----------
ℝ^28 -> ℝ^28
The Get Morphism computes the total loss associated to the extended parameter-vector and the Put Morphism updates the extended parameter-vector.
Let us initialize an extended parameter-vector:
gap> t := [ 1 ];; # one entry
gap> m := [ 0, 0, 0, 0, 0, 0, 0, 0, 0 ];; # 9 entries
gap> v := [ 0, 0, 0, 0, 0, 0, 0, 0, 0 ];; # 9 entries
gap> theta := [ 0.1, -0.1, 0, 0.1, 0.2, 0, -0.2, 0.3, 0 ];; # 9 entries
gap> w := Concatenation( t, m, v, theta );;
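As a quick check (Length is the standard GAP list-length operation), the extended parameter-vector has $1 + 9 + 9 + 9 = 28$ entries, matching the $\mathbb{R}^{28}$ appearing in the lens above:

gap> Length( w );
28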
The total loss associated to the extended parameter-vector $w$ can be computed using the Get Morphism:
gap> current_loss := Eval( GetMorphism( one_epoch_update ), w );
[ 0.345836 ]
And to update the extended parameter-vector (one-epoch update) we use the "Put Morphism":
gap> new_w := Eval( PutMorphism( one_epoch_update ), w );
[ 31, 0.104642, -0.355463, -0.197135, -0.109428, -0.147082, 0.00992963,
0.00478679, 0.502546, 0.187206, 0.0105493, 0.00642903, 0.00211548,
0.00660062, 0.00274907, 0.00110985, 0.00278786, 0.0065483, 0.00112838,
5.45195, -1.26208, 3.82578, -5.40639, -0.952146, -3.42835, -2.79496, 3.09008, -6.80672 ]
To perform nr_epochs = 4 updates on $w$, we use the Fit operation:
gap> nr_epochs := 4;;
gap> w := Fit( one_epoch_update, nr_epochs, w );
Epoch 0/4 - loss = 0.34583622811001763
Epoch 1/4 - loss = 0.6449437167091393
Epoch 2/4 - loss = 0.023811108587716449
Epoch 3/4 - loss = 0.0036371652708073405
Epoch 4/4 - loss = 0.0030655216725219204
[ 121, -4.57215e-05, -0.00190417, -0.0014116, -0.00181528, 0.00108949, 0.00065518, 0.001861, 0.000814679,
0.000756424, 0.0104885, 0.00846858, 0.0022682, 0.00784643, 0.00551702, 0.0014626, 0.00351408, 0.00640225,
0.00115053, 5.09137, -4.83379, 3.06257, -5.70976, 0.837175, -4.23622, -1.71171, 5.54301, -4.80856 ]
Now let us use the updated theta (the last 9 entries of $w$) to reparametrise the prediction map $f$:
gap> theta := SplitDenseList( w, [ 19, 9 ] )[2];
[ 5.09137, -4.83379, 3.06257, -5.70976, 0.837175, -4.23622, -1.71171, 5.54301, -4.80856 ]
gap> theta := SkeletalSmoothMaps.Constant( theta );
ℝ^0 -> ℝ^9
gap> f_theta := ReparametriseMorphism( f, theta );
ℝ^2 -> ℝ^3 defined by:
Underlying Object:
-----------------
ℝ^0
Underlying Morphism:
-------------------
ℝ^2 -> ℝ^3
gap> f_theta := UnderlyingMorphism( f_theta );;
gap> Display( f_theta );
ℝ^2 -> ℝ^3
‣ Exp( (5.09137 * x1 + (- 4.83379) * x2 + 3.06257) ) /
(Exp( 5.09137 * x1 + (- 4.83379) * x2 + 3.06257 ) + Exp( ((- 5.70976) * x1 + 0.837175 * x2 + (- 4.23622)) ) + Exp( ((- 1.71171) * x1 + 5.54301 * x2 + (- 4.80856)) ))
‣ Exp( ((- 5.70976) * x1 + 0.837175 * x2 + (- 4.23622)) ) /
(Exp( 5.09137 * x1 + (- 4.83379) * x2 + 3.06257 ) + Exp( ((- 5.70976) * x1 + 0.837175 * x2 + (- 4.23622)) ) + Exp( ((- 1.71171) * x1 + 5.54301 * x2 + (- 4.80856)) ))
‣ Exp( ((- 1.71171) * x1 + 5.54301 * x2 + (- 4.80856)) ) /
(Exp( 5.09137 * x1 + (- 4.83379) * x2 + 3.06257 ) + Exp( ((- 5.70976) * x1 + 0.837175 * x2 + (- 4.23622)) ) + Exp( ((- 1.71171) * x1 + 5.54301 * x2 + (- 4.80856)) ))
gap> input_vec := [ 1, -1 ];;
gap> predictions_vec := Eval( f_theta, input_vec );
[ 1., 4.74723e-11, 1.31974e-11 ]
gap> PositionMaximum( predictions_vec );
1
That is, the predicted label of the input-vector $(1, -1)$ is class-1 (red).
gap> predictions_vec := Eval( f_theta, [ 1, 3 ] );
[ 7.13122e-08, 2.40484e-08, 1. ]
gap> PositionMaximum( predictions_vec );
3
That is, the predicted label of the input-vector $(1, 3)$ is class-3 (blue).
The following image illustrates the predictions of the points $\{\,(0.5i, 0.5j) \mid i, j \in \{-6, -5, \dots, 5, 6\}\,\}$ over the course of the 4 epochs.