Understanding reverse mode automatic differentiation
Gradients are computed using reverse-mode autodiff. Every computation is represented as a graph of tensors, with each tensor holding a reference to a function that can compute the local gradient of that tensor. The partial derivative for each tensor is complete once the entire graph has been traversed.
The fundamental idea behind autodiff is that it calculates the local derivative of each variable rather than its partial derivative. This makes traversing the computational graph simple and modular, i.e., we can calculate the partial derivative of any variable with respect to the output in just one traversal, with a complexity of $O(n)$ for a graph with $n$ nodes.
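To make this concrete, here is a minimal sketch of that idea in plain Python. It is an illustrative toy, not the actual implementation in this repository, and the class name `Node` and its methods are invented for the example: each node records the local gradient with respect to each of its direct inputs, and a single reverse traversal of the graph accumulates the partial derivative of the output with respect to every node.

```python
# Illustrative sketch (not this repository's implementation): every node stores
# the local gradient of its value w.r.t. each direct input, and one reverse
# traversal accumulates the partial derivative of the output w.r.t. every node.
class Node:
    def __init__(self, value, parents=()):
        self.value = value
        self.parents = list(parents)  # list of (parent_node, local_gradient)
        self.grad = 0.0

    def __add__(self, other):
        # local derivative of (self + other) w.r.t. each input is 1
        return Node(self.value + other.value, [(self, 1.0), (other, 1.0)])

    def __mul__(self, other):
        # local derivative of (self * other) w.r.t. self is other.value, and vice versa
        return Node(self.value * other.value, [(self, other.value), (other, self.value)])

    def backward(self):
        # order the graph topologically so every node's gradient is fully
        # accumulated before it is propagated to its parents
        order, visited = [], set()

        def visit(node):
            if node not in visited:
                visited.add(node)
                for parent, _ in node.parents:
                    visit(parent)
                order.append(node)

        visit(self)
        self.grad = 1.0  # d(output)/d(output) = 1
        for node in reversed(order):
            for parent, local_grad in node.parents:
                parent.grad += node.grad * local_grad  # multiply edges, sum paths


a, b = Node(2.0), Node(3.0)
c = a * b
d = a + c + b
d.backward()
print(a.grad, b.grad)  # 4.0 3.0, i.e. 1 + b and 1 + a
```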
The difference between the partial and the local derivative is the way each variable is treated in the expression. When calculating the partial derivative of a function, the expression is broken down into its underlying variables: for example, with $c = a \cdot b$ and $d = a + c + b$, instead of using $c$ we write $a \cdot b$ in the expression for $d$. On the other hand, when calculating the local derivative, each element of the expression is treated as a variable in its own right. If this is not yet clear, refer to the following example:
```python
# Let's consider two variables `a` and `b`
a = 2
b = 3
# `c` is `a` multiplied by `b`
c = a * b
# `d` looks like a function of `a`, `b` and `c`, but since `c = a*b`
# it is in fact just a function of the two variables `a` and `b`
d = a + c + b
```
If we work out the partial derivative of $d$ with respect to $a$, we substitute $c = a \cdot b$ and differentiate: $\frac{\partial d}{\partial a} = \frac{\partial}{\partial a}(a + a \cdot b + b) = 1 + b$.
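As a quick sanity check of this substitution, the same result can be reproduced symbolically (this snippet assumes `sympy` is available; it is not a dependency of the project):

```python
# Symbolic check of the substitution c = a*b (assumes sympy is installed).
import sympy as sp

a, b = sp.symbols("a b")
d = a + a * b + b                        # d with c written out as a*b
print(sp.diff(d, a))                     # b + 1
print(sp.diff(d, a).subs({a: 2, b: 3}))  # 4
```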
The important part to note is that `c` is written as `a*b` when calculating the derivative. Now, let's try the same computation using the local derivative method.
The local derivative of $d$ w.r.t. $a$ is expressed as $\frac{\partial \bar{d}}{\partial a}$; note the bar over the $d$.
The local derivative treats $c$ as a variable instead of a function of $a$ and $b$. Therefore, when calculating $\frac{\partial \bar{d}}{\partial a}$, we treat $c$ as a constant rather than a function of $a$ and $b$. This gives us $\frac{\partial \bar{d}}{\partial a} = 1$.
Similarly, $\frac{\partial \bar{d}}{\partial b} = 1$, $\frac{\partial \bar{d}}{\partial c} = 1$, $\frac{\partial \bar{c}}{\partial a} = b$ and $\frac{\partial \bar{c}}{\partial b} = a$.
Therefore, to calculate the partial derivative $\frac{\partial d}{\partial a}$ we multiply the local derivatives along the edges of each path from $a$ to $d$ and add up the contributions of the different paths into the node: $\frac{\partial d}{\partial a} = \frac{\partial \bar{d}}{\partial a} + \frac{\partial \bar{d}}{\partial c} \cdot \frac{\partial \bar{c}}{\partial a} = 1 + b$.
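For the values used above ($a = 2$, $b = 3$), this path rule can be checked with a few lines of plain Python (a small illustration; the variable names are chosen here, not taken from the project):

```python
# Check of the "multiply along a path, sum over paths" rule for a = 2, b = 3.
b_val = 3

d_bar_a = 1      # local derivative of d w.r.t. a (c held constant)
d_bar_c = 1      # local derivative of d w.r.t. c
c_bar_a = b_val  # local derivative of c = a*b w.r.t. a

# Path 1: a -> d directly.  Path 2: a -> c -> d (multiply the edge values).
partial_d_a = d_bar_a + d_bar_c * c_bar_a
print(partial_d_a)  # 4, which matches 1 + b
```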