diff --git a/Gemfile b/Gemfile index 3aec09b..92e3497 100644 --- a/Gemfile +++ b/Gemfile @@ -5,3 +5,5 @@ gem "jekyll", "~> 4.3.3" # installed by `gem jekyll` gem "just-the-docs", "0.7.0" # pinned to the current release # gem "just-the-docs" # always download the latest release + +gem 'jekyll-sitemap' \ No newline at end of file diff --git a/Gemfile.lock b/Gemfile.lock index a9f3857..a02d9d1 100644 --- a/Gemfile.lock +++ b/Gemfile.lock @@ -39,6 +39,8 @@ GEM sass-embedded (~> 1.54) jekyll-seo-tag (2.8.0) jekyll (>= 3.8, < 5.0) + jekyll-sitemap (1.4.0) + jekyll (>= 3.7, < 5.0) jekyll-watch (2.2.1) listen (~> 3.0) just-the-docs (0.7.0) @@ -83,6 +85,7 @@ PLATFORMS DEPENDENCIES jekyll (~> 4.3.3) + jekyll-sitemap just-the-docs (= 0.7.0) BUNDLED WITH diff --git a/_config.yml b/_config.yml index 4b03bdc..32b1d03 100644 --- a/_config.yml +++ b/_config.yml @@ -9,8 +9,8 @@ aux_links: color_scheme: custom baseurl: / - - +plugins: + - jekyll-sitemap callouts_level: loud # or loud callouts: note-title: diff --git a/beep boop/foundation/back-propagation/index.md b/beep boop/foundation/back-propagation/index.md index 7d9a219..ea9b4ba 100644 --- a/beep boop/foundation/back-propagation/index.md +++ b/beep boop/foundation/back-propagation/index.md @@ -8,3 +8,107 @@ grand_parent: beep boop # Back Propagation Back propagation is the process of taking a series of nodes (equations), starting at the end, and calculating the effect each node has on the outcome of the equations. We do this by calculating the gradient ([derivative](../derivatives/)) of each node. + +## Code + +Here's an example of an individual value node that would exist inside a chain of nodes, along with the functions it needs for back propagation. It calls `math.exp`, so it needs `import math` to run. 
+ +```python +class Value: + + def __init__(self, data, _children=(), _op='', label=''): + self.data = data + self.grad = 0.0 + self._backward = lambda: None + self._prev = set(_children) + self._op = _op + self.label = label + + def __repr__(self): + return f"Value(data={self.data})" + + def __add__(self, other): + other = other if isinstance(other, Value) else Value(other) + out = Value(self.data + other.data, (self, other), '+') + + def _backward(): + self.grad += 1.0 * out.grad + other.grad += 1.0 * out.grad + out._backward = _backward + + return out + + def __mul__(self, other): + other = other if isinstance(other, Value) else Value(other) + out = Value(self.data * other.data, (self, other), '*') + + def _backward(): + self.grad += other.data * out.grad + other.grad += self.data * out.grad + out._backward = _backward + + return out + + def __pow__(self, other): + assert isinstance(other, (int, float)), "only supporting int/float powers for now" + out = Value(self.data**other, (self,), f'**{other}') + + def _backward(): + self.grad += other * (self.data ** (other - 1)) * out.grad + out._backward = _backward + + return out + + def __rmul__(self, other): # other * self + return self * other + + def __truediv__(self, other): # self / other + return self * other**-1 + + def __neg__(self): # -self + return self * -1 + + def __sub__(self, other): # self - other + return self + (-other) + + def __radd__(self, other): # other + self + return self + other + + def tanh(self): + x = self.data + t = (math.exp(2*x) - 1)/(math.exp(2*x) + 1) + out = Value(t, (self, ), 'tanh') + + def _backward(): + self.grad += (1 - t**2) * out.grad + out._backward = _backward + + return out + + def exp(self): + x = self.data + out = Value(math.exp(x), (self, ), 'exp') + + def _backward(): + self.grad += out.data * out.grad # NOTE: in the video I incorrectly used = instead of +=. Fixed here. 
+ out._backward = _backward + + return out + + + def backward(self): + + topo = [] + visited = set() + def build_topo(v): + if v not in visited: + visited.add(v) + for child in v._prev: + build_topo(child) + topo.append(v) + build_topo(self) + + self.grad = 1.0 + for node in reversed(topo): + node._backward() +``` diff --git a/beep boop/foundation/gradient-descent/index.md b/beep boop/foundation/gradient-descent/index.md new file mode 100644 index 0000000..a639256 --- /dev/null +++ b/beep boop/foundation/gradient-descent/index.md @@ -0,0 +1,72 @@ +--- +title: Gradient Descent +parent: Foundation +grand_parent: beep boop +layout: default +math: katex +--- + +# Gradient Descent + +Gradient descent is the process of fine-tuning the weights and biases of a neural network to minimize our [loss function](../loss/). + +## Example + +We start with a forward pass through something like a [multi-layer perceptron](../multi-layer-perceptron/) and compare the actual outputs against the expected outputs using a [loss function](../loss/). We then perform [back propagation](../back-propagation/) from the loss to [calculate](../derivatives/) the gradient of every weight and bias in each [neuron](../neuron/). Gradient descent is then the process of adjusting those weights and biases to get our loss function as low as possible. The gradient of each weight or bias tells us whether to change it in a positive or negative direction to achieve the output we want. 
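
Before wiring this into an MLP, the same idea can be sketched on a single parameter. This is only a toy illustration: the loss `(w - 3)^2`, the starting value, and the names `loss`, `grad`, and `learning_rate` are all assumptions made up for this sketch, not part of the implementation in these notes. The `0.1` step size just mirrors the one used on this page.

```python
def loss(w):
    # Hypothetical one-parameter loss with its minimum at w = 3.
    return (w - 3.0) ** 2


def grad(w):
    # Analytic derivative of the loss above: d/dw (w - 3)^2 = 2 * (w - 3).
    return 2.0 * (w - 3.0)


w = 0.0              # arbitrary starting value (assumption)
learning_rate = 0.1  # step size (assumption)

for step in range(50):
    # Step against the gradient: a positive gradient lowers w,
    # a negative gradient raises it.
    w += -learning_rate * grad(w)

print(w, loss(w))
```

Each step shrinks the distance to the minimum by the same factor, so `w` drifts toward `3` and the loss toward `0` without us ever solving for the minimum directly.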
+ +Building off the [multi-layer perceptron](../multi-layer-perceptron/) implementation, we can perform gradient descent with the following: + +```python +n = MLP(3, [4, 4, 1]) +xs = [ + [2.0, 3.0, -1.0], + [3.0, -1.0, 0.5], + [0.5, 1.0, 1.0], + [1.0, 1.0, -1.0], +] +ys = [1.0, -1.0, -1.0, 1.0] + +for k in range(20): + + # forward pass + ypred = [n(x) for x in xs] + loss = sum((yout - ygt)**2 for ygt, yout in zip(ys, ypred)) + + # backward pass + for p in n.parameters(): + p.grad = 0.0 + loss.backward() + + # update + for p in n.parameters(): + p.data += -0.1 * p.grad + + print(k, loss.data) +``` + +The reasoning behind `-0.1` here is actually super important. We have to remember the goal of gradient descent is to lower the value of the loss function as much as possible. So when tuning our weights, we want to tune them such that they _decrease_ the loss function. Luckily, the gradient tells us how much the loss changes when we change that value. Let's look at some examples: + +$$ +p.grad = 0.41\newline +p.data = 0.88 +$$ + +In this case, if we want to decrease the loss function, we want to decrease $$p.data$$, because $$p.grad$$ tells us that for every small $$n$$ we add to $$p.data$$, the loss function increases by roughly $$n \cdot 0.41$$. So it makes sense to add $$-0.1 \cdot p.grad$$ here, which lowers $$p.data$$. + +But what if the signs are different? + +$$ +p.grad = -0.41\newline +p.data = 0.88 +$$ + +In this case, increasing $$p.data$$ decreases the loss function. If we do $$-0.1 \cdot (-0.41)$$ we get $$0.041$$, which will increase $$p.data$$ and further decrease the loss function. + +One more: + +$$ +p.grad = -0.41\newline +p.data = -0.88 +$$ + +If we increase $$p.data$$, that will lower the loss function. And just like the previous example, $$-0.1 \cdot (-0.41) = 0.041$$, which will end up increasing $$p.data$$ and lowering the resulting loss function. The sign of $$p.data$$ actually has no effect here; it's only the sign of $$p.grad$$ that matters. 
 And we handle that by flipping its sign when we multiply by $$-0.1$$. If we were instead looking to maximize the loss function, we'd multiply by $$+0.1$$. diff --git a/beep boop/foundation/loss/LossFunction.drawio b/beep boop/foundation/loss/LossFunction.drawio new file mode 100644 index 0000000..217fea6 --- /dev/null +++ b/beep boop/foundation/loss/LossFunction.drawio @@ -0,0 +1,22 @@ + + + + + + + + + + + + + + + + + + + + + + diff --git a/beep boop/foundation/loss/index.md b/beep boop/foundation/loss/index.md new file mode 100644 index 0000000..59b4f93 --- /dev/null +++ b/beep boop/foundation/loss/index.md @@ -0,0 +1,41 @@ +--- +title: Loss +layout: default +parent: Foundation +grand_parent: beep boop +--- + +

+| ||
+|| |_
+

+ +The loss is a single number that helps us understand the performance of the neural network. The loss function is how we calculate that number. Much of the time spent training a neural network goes into minimizing this loss. + +## Mean-squared error loss + +You calculate this by subtracting the expected output from the neural network's actual output, squaring the difference, and then taking the mean over all the examples you tested. I _think_ squaring helps exaggerate errors that are far from correct and shrink errors that are closer to correct. But it also has the primary benefit of getting rid of the sign of the values, similar to $$abs$$. + +The curious thing to me is that we don't actually take the mean of the summed squared losses, at least not in anything I've seen so far. So I'm hoping to figure that out. It seems like the division by $$N$$ doesn't really matter; it's the squaring of the differences that actually gives us our metric. Everything else is just syntactic sugar. + +![Mathematical expression of mean squared loss](./mean-squared-loss.png) + +## Example + +If we use our [multi-layer perceptron](../multi-layer-perceptron/) we can provide it with our initial inputs `xs` and our expected outputs `ys` for 4 examples, feed those through the MLP, and then calculate the loss. 
+ +```python +n = MLP(3, [4, 4, 1]) +xs = [ + [2.0, 3.0, -1.0], + [3.0, -1.0, 0.5], + [0.5, 1.0, 1.0], + [1.0, 1.0, -1.0], +] +ys = [1.0, -1.0, -1.0, 1.0] +ypred = [n(x) for x in xs] + +loss = sum((yout - ygt)**2 for ygt, yout in zip(ys, ypred)) + +# 7.817821598365237 +``` diff --git a/beep boop/foundation/loss/mean-squared-loss.png b/beep boop/foundation/loss/mean-squared-loss.png new file mode 100644 index 0000000..7f84d81 Binary files /dev/null and b/beep boop/foundation/loss/mean-squared-loss.png differ diff --git a/beep boop/foundation/multi-layer-perceptron/index.md b/beep boop/foundation/multi-layer-perceptron/index.md index 7728897..eebf1e3 100644 --- a/beep boop/foundation/multi-layer-perceptron/index.md +++ b/beep boop/foundation/multi-layer-perceptron/index.md @@ -7,4 +7,57 @@ grand_parent: beep boop # Multi-layer Perceptron (MLP) -A MLP consists of many [neurons](../neuron/) lined up in order and feeding values between each other. +An MLP consists of many layers of [neurons](../neuron/) lined up in order and feeding values between each other. 
+ +Since I'm very code inclined, here's the Python that implements the following image: + +![a multilayer perceptron](./mlp.jpeg) + +The following code also uses the `Value` class from [Back Propagation](../back-propagation/) and needs `import random` to run. + +```python +class Neuron: + + def __init__(self, nin): + self.w = [Value(random.uniform(-1,1)) for _ in range(nin)] + self.b = Value(random.uniform(-1,1)) + + def __call__(self, x): + # w * x + b + act = sum((wi*xi for wi, xi in zip(self.w, x)), self.b) + out = act.tanh() + return out + + def parameters(self): + return self.w + [self.b] + +class Layer: + + def __init__(self, nin, nout): + self.neurons = [Neuron(nin) for _ in range(nout)] + + def __call__(self, x): + outs = [n(x) for n in self.neurons] + return outs[0] if len(outs) == 1 else outs + + def parameters(self): + return [p for neuron in self.neurons for p in neuron.parameters()] + +class MLP: + + def __init__(self, nin, nouts): + sz = [nin] + nouts + self.layers = [Layer(sz[i], sz[i+1]) for i in range(len(nouts))] + + def __call__(self, x): + for layer in self.layers: + x = layer(x) + return x + + def parameters(self): + return [p for layer in self.layers for p in layer.parameters()] + +x = [2.0, 3.0, -1.0] +n = MLP(3, [4, 4, 1]) +n(x) +``` diff --git a/beep boop/foundation/multi-layer-perceptron/mlp.jpeg b/beep boop/foundation/multi-layer-perceptron/mlp.jpeg new file mode 100644 index 0000000..0a495cf Binary files /dev/null and b/beep boop/foundation/multi-layer-perceptron/mlp.jpeg differ diff --git a/beep boop/foundation/neuron/index.md b/beep boop/foundation/neuron/index.md index cc114e9..a60f52d 100644 --- a/beep boop/foundation/neuron/index.md +++ b/beep boop/foundation/neuron/index.md @@ -11,3 +11,11 @@ Neurons are exactly what they sound like, the things in our brain! ![Diagram of a neuron](./neuron_model.jpeg) In machine learning, we model these in neural networks to simulate how the brain works. 
 Neurons take in a series of values (x) and weights (w), which are individually multiplied and then added. The neuron fires by taking these and adding the bias of the neuron (how trigger happy it is) and passing it through an activation function which helps squash the values to something like -1 to 1. This is usually `tanh` or a `sigmoid` function. + +## Weights + +The weights for each input of a neuron start out arbitrarily chosen. There's probably a whole field of mathematics that goes into determining the best starting weights, but at this point for me, it's random. Then, through the process of training, these weights get adjusted to try and minimize our loss function. + +## Biases + +Much like [weights](#weights), biases also start out randomly chosen and are updated throughout the training process, adjusting the activation of that neuron to try and minimize our loss function. diff --git a/beep boop/frameworks/index.md b/beep boop/frameworks/index.md new file mode 100644 index 0000000..f8a61ac --- /dev/null +++ b/beep boop/frameworks/index.md @@ -0,0 +1,8 @@ +--- +title: Frameworks +has_children: true +layout: default +parent: beep boop +--- + +Notes on various frameworks available for machine learning. diff --git a/beep boop/frameworks/pytorch/equation.png b/beep boop/frameworks/pytorch/equation.png new file mode 100644 index 0000000..dbc1b5e Binary files /dev/null and b/beep boop/frameworks/pytorch/equation.png differ diff --git a/beep boop/frameworks/pytorch/index.md b/beep boop/frameworks/pytorch/index.md new file mode 100644 index 0000000..c828af1 --- /dev/null +++ b/beep boop/frameworks/pytorch/index.md @@ -0,0 +1,47 @@ +--- +title: PyTorch +parent: Frameworks +layout: default +grand_parent: beep boop +--- + +# 🥧🔥 PyTorch + +If you read the word pytorch over and over, it really starts to lose its meaning. Freaks me out when words do that. 
[PyTorch](https://pytorch.org/) is a production-grade machine learning framework for Python that aids you in building and training neural networks. + +Here's a simple tree showing the equation and back propagation we're doing, alongside the torch code to calculate the same. + +![DAG of nodes for a math equation](./equation.png) + +```python +import torch +x1 = torch.Tensor([2.0]).double() ; x1.requires_grad = True +x2 = torch.Tensor([0.0]).double() ; x2.requires_grad = True +w1 = torch.Tensor([-3.0]).double() ; w1.requires_grad = True +w2 = torch.Tensor([1.0]).double() ; w2.requires_grad = True +b = torch.Tensor([6.8813735870195432]).double() ; b.requires_grad = True +n = x1*w1 + x2*w2 + b +o = torch.tanh(n) + +print(o.data.item()) +o.backward() + +print('---') +print('x2', x2.grad.item()) +print('w2', w2.grad.item()) +print('x1', x1.grad.item()) +print('w1', w1.grad.item()) + +# 0.7071066904050358 +# --- +# x2 0.5000001283844369 +# w2 0.0 +# x1 -1.5000003851533106 +# w1 1.0000002567688737 +``` + +You have to set `requires_grad = True` on tensors like `x1` because they're leaf nodes, and by default torch doesn't calculate gradients for them. My unconfirmed assumption is that this is because you usually aren't trying to change the inputs of the NN; you're trying to change the weights and biases of the [neurons](../../foundation/neuron/) inside of it. + +## Tensors + +Torch and other frameworks use the concept of a Tensor. A Tensor is an n-dimensional array of scalar values. Packing values into arrays lets the framework take advantage of hardware parallelism to speed up calculations. diff --git a/beep boop/index.md b/beep boop/index.md index 8bf1e6c..82d7178 100644 --- a/beep boop/index.md +++ b/beep boop/index.md @@ -13,6 +13,4 @@ This is my working area for my current machine learning research. I'm starting b Random working thoughts go here. 
-Tensors are basically scalar values from [micrograd](https://github.com/karpathy/micrograd), but put into arrays to take advantage of parallelism in computing. - Loss functions help us identify the gap between what we expect from a function and what we actually got from it. The lower the loss, the more accurate the function is
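
To make "the lower the loss, the more accurate the function" concrete, here's a minimal sketch. The prediction lists `good` and `bad` and the helper names `sum_squared_error` and `mean_squared_error` are made up for illustration; only the `ys` targets and the summed-squares expression come from the notes in this diff.

```python
def sum_squared_error(ys, ypred):
    # Same expression used in the loss notes: sum of squared differences.
    return sum((yout - ygt) ** 2 for ygt, yout in zip(ys, ypred))


def mean_squared_error(ys, ypred):
    # The "mean" version just divides the sum by the number of examples.
    return sum_squared_error(ys, ypred) / len(ys)


ys = [1.0, -1.0, -1.0, 1.0]     # expected outputs
good = [0.9, -0.8, -1.0, 0.7]   # hypothetical predictions close to ys
bad = [-0.5, 0.6, 0.2, -0.9]    # hypothetical predictions far from ys

print(sum_squared_error(ys, good), mean_squared_error(ys, good))
print(sum_squared_error(ys, bad), mean_squared_error(ys, bad))
```

The closer predictions get the smaller loss under both versions. Dividing by the number of examples only rescales the number by a constant, so it doesn't change which set of predictions counts as better.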