ml and seo

JacobReynolds · Feb 17, 2024 · fe0ef92 · fe0ef92
1 parent f4ad963
commit fe0ef92
Showing 15 changed files with 363 additions and 5 deletions.
diff --git a/Gemfile b/Gemfile
@@ -5,3 +5,5 @@ gem "jekyll", "~> 4.3.3" # installed by `gem jekyll`
 
 gem "just-the-docs", "0.7.0" # pinned to the current release
 # gem "just-the-docs"        # always download the latest release
+
+gem 'jekyll-sitemap'
diff --git a/Gemfile.lock b/Gemfile.lock
@@ -39,6 +39,8 @@ GEM
       sass-embedded (~> 1.54)
     jekyll-seo-tag (2.8.0)
       jekyll (>= 3.8, < 5.0)
+    jekyll-sitemap (1.4.0)
+      jekyll (>= 3.7, < 5.0)
     jekyll-watch (2.2.1)
       listen (~> 3.0)
     just-the-docs (0.7.0)
@@ -83,6 +85,7 @@ PLATFORMS
 
 DEPENDENCIES
   jekyll (~> 4.3.3)
+  jekyll-sitemap
   just-the-docs (= 0.7.0)
 
 BUNDLED WITH

diff --git a/_config.yml b/_config.yml
@@ -9,8 +9,8 @@ aux_links:
 color_scheme: custom
 
 baseurl: /
-
-
+plugins:
+  - jekyll-sitemap
 callouts_level: loud # or loud
 callouts:
   note-title:

diff --git a/beep boop/foundation/back-propagation/index.md b/beep boop/foundation/back-propagation/index.md
@@ -8,3 +8,107 @@ grand_parent: beep boop
 # Back Propagation
 
 Back propagation is the process of taking a series of nodes (equations), starting at the end, and calculating the effect each node has on the outcome of the equations. We do this by calculating the gradient ([derivative](../derivatives/)) of each node.
+
+## Code
+
+Here's an example of an individual value node that would exist inside of a chain of nodes and the functions it needs for back propagation.
+
+```python
+import math
+
+class Value:
+
+  def __init__(self, data, _children=(), _op='', label=''):
+    self.data = data
+    self.grad = 0.0
+    self._backward = lambda: None
+    self._prev = set(_children)
+    self._op = _op
+    self.label = label
+
+  def __repr__(self):
+    return f"Value(data={self.data})"
+
+  def __add__(self, other):
+    other = other if isinstance(other, Value) else Value(other)
+    out = Value(self.data + other.data, (self, other), '+')
+
+    def _backward():
+      self.grad += 1.0 * out.grad
+      other.grad += 1.0 * out.grad
+    out._backward = _backward
+
+    return out
+
+  def __mul__(self, other):
+    other = other if isinstance(other, Value) else Value(other)
+    out = Value(self.data * other.data, (self, other), '*')
+
+    def _backward():
+      self.grad += other.data * out.grad
+      other.grad += self.data * out.grad
+    out._backward = _backward
+
+    return out
+
+  def __pow__(self, other):
+    assert isinstance(other, (int, float)), "only supporting int/float powers for now"
+    out = Value(self.data**other, (self,), f'**{other}')
+
+    def _backward():
+      self.grad += other * (self.data ** (other - 1)) * out.grad
+    out._backward = _backward
+
+    return out
+
+  def __rmul__(self, other): # other * self
+    return self * other
+
+  def __truediv__(self, other): # self / other
+    return self * other**-1
+
+  def __neg__(self): # -self
+    return self * -1
+
+  def __sub__(self, other): # self - other
+    return self + (-other)
+
+  def __radd__(self, other): # other + self
+    return self + other
+
+  def tanh(self):
+    x = self.data
+    t = (math.exp(2*x) - 1)/(math.exp(2*x) + 1)
+    out = Value(t, (self, ), 'tanh')
+
+    def _backward():
+      self.grad += (1 - t**2) * out.grad
+    out._backward = _backward
+
+    return out
+
+  def exp(self):
+    x = self.data
+    out = Value(math.exp(x), (self, ), 'exp')
+
+    def _backward():
+      self.grad += out.data * out.grad # NOTE: in the video I incorrectly used = instead of +=. Fixed here.
+    out._backward = _backward
+
+    return out
+
+
+  def backward(self):
+
+    topo = []
+    visited = set()
+    def build_topo(v):
+      if v not in visited:
+        visited.add(v)
+        for child in v._prev:
+          build_topo(child)
+        topo.append(v)
+    build_topo(self)
+
+    self.grad = 1.0
+    for node in reversed(topo):
+      node._backward()
+```
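Note on the `Value` class above: a quick way to sanity-check it is to run `backward()` on a tiny expression whose derivatives are easy to work out by hand. The sketch below uses a trimmed-down stand-in (`TinyValue`, a hypothetical name that keeps only `+` and `*`) so that it runs on its own; the expression and numbers are made up for illustration.

```python
class TinyValue:
    """Trimmed-down stand-in for the Value class above (only + and *)."""

    def __init__(self, data, _children=()):
        self.data = data
        self.grad = 0.0
        self._backward = lambda: None
        self._prev = set(_children)

    def __add__(self, other):
        out = TinyValue(self.data + other.data, (self, other))

        def _backward():
            # d(a + b)/da = 1 and d(a + b)/db = 1
            self.grad += out.grad
            other.grad += out.grad
        out._backward = _backward
        return out

    def __mul__(self, other):
        out = TinyValue(self.data * other.data, (self, other))

        def _backward():
            # d(a * b)/da = b and d(a * b)/db = a
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward = _backward
        return out

    def backward(self):
        # Topologically sort the graph so every node is processed
        # only after all the nodes that consume it.
        topo, visited = [], set()

        def build_topo(v):
            if v not in visited:
                visited.add(v)
                for child in v._prev:
                    build_topo(child)
                topo.append(v)
        build_topo(self)

        self.grad = 1.0
        for node in reversed(topo):
            node._backward()


a = TinyValue(2.0)
b = TinyValue(-3.0)
c = a * b + a  # by hand: dc/da = b + 1 = -2 and dc/db = a = 2
c.backward()
print(c.data, a.grad, b.grad)  # -4.0 -2.0 2.0
```

Because `a` is used twice in the expression, its gradient is the sum of two contributions, which is why every `_backward` accumulates with `+=` rather than assigning with `=`.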
diff --git a/beep boop/foundation/gradient-descent/index.md b/beep boop/foundation/gradient-descent/index.md
@@ -0,0 +1,72 @@
+---
+title: Gradient Descent
+parent: Foundation
+grand_parent: beep boop
+layout: default
+math: katex
+---
+
+# Gradient Descent
+
+Gradient descent is the process of fine-tuning the weights and biases of a neural network to minimize our [loss function](../loss/).
+
+## Example
+
+Each training step starts with a forward pass through something like a [multi-layer perceptron](../multi-layer-perceptron/), then compares the actual outputs against the expected outputs using a [loss function](../loss/). We then perform [back propagation](../back-propagation/) from the loss to [calculate](../derivatives/) the gradient of every weight and bias in each [neuron](../neuron/). Gradient descent is the process of adjusting those weights and biases to get our loss as low as possible. The gradient of each parameter tells us whether to nudge it in a positive or negative direction to lower the loss.
+
+Building off the [multi-layer perceptron](../multi-layer-perceptron/) implementation, we can perform gradient descent with the following:
+
+```python
+n = MLP(3, [4, 4, 1])
+xs = [
+  [2.0, 3.0, -1.0],
+  [3.0, -1.0, 0.5],
+  [0.5, 1.0, 1.0],
+  [1.0, 1.0, -1.0],
+]
+ys = [1.0, -1.0, -1.0, 1.0]
+
+for k in range(20):
+
+  # forward pass
+  ypred = [n(x) for x in xs]
+  loss = sum((yout - ygt)**2 for ygt, yout in zip(ys, ypred))
+
+  # backward pass
+  for p in n.parameters():
+    p.grad = 0.0
+  loss.backward()
+
+  # update
+  for p in n.parameters():
+    p.data += -0.1 * p.grad
+
+  print(k, loss.data)
+```
+
+The reasoning for `-0.1` here is actually super important. The `0.1` is the step size (usually called the learning rate) and the minus sign sets the direction. We have to remember the goal of this gradient descent is to lower the value of the loss function as much as possible. So when tuning our weights, we want to tune them such that they _decrease_ the loss function. Luckily, the gradient tells us how much the loss changes when we nudge that value. Let's look at some examples:
+
+$$
+p.grad = 0.41\newline
+p.data = 0.88
+$$
+
+In this case, if we want to decrease the loss function, we want to decrease $$p.data$$, because $$p.grad$$ tells us that for every small amount $$n$$ we add to $$p.data$$, the loss function changes by roughly $$n \cdot 0.41$$. So it makes sense to add $$-0.1 \cdot p.grad = -0.041$$ here, which decreases $$p.data$$.
+
+But what if the signs are different?
+
+$$
+p.grad = -0.41\newline
+p.data = 0.88
+$$
+
+In this case, increasing $$p.data$$ decreases the loss function. If we do $$-0.1 \cdot -0.41$$ we get $$0.041$$ which will increase $$p.data$$ and further decrease the loss function.
+
+One more:
+
+$$
+p.grad = -0.41\newline
+p.data = -0.88
+$$
+
+If we increase $$p.data$$, that will lower the loss function. And just like the previous example, $$-0.1 \cdot -0.41 = 0.041$$, which will end up increasing $$p.data$$ and lowering the resulting loss function. The sign of $$p.data$$ actually has no effect here; it's only the sign of $$p.grad$$ that matters. And we manage that by inverting it when we multiply by $$-0.1$$. If we were instead looking to maximize the loss function, we'd multiply by $$+0.1$$.
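Note on the sign rule above: the three cases can be checked with plain floats. The sketch below is not part of the MLP code; `update`, `toy_loss`, and `toy_grad` are hypothetical helpers, and the quadratic toy loss is an assumption chosen only because its gradient is easy to write down.

```python
def update(p_data, p_grad, learning_rate=0.1):
    """One gradient descent step: move p_data against the sign of its gradient."""
    return p_data + (-learning_rate * p_grad)


# The three (p.data, p.grad) pairs discussed above.
for p_data, p_grad in [(0.88, 0.41), (0.88, -0.41), (-0.88, -0.41)]:
    new_data = update(p_data, p_grad)
    direction = "decreases" if new_data < p_data else "increases"
    print(f"grad={p_grad:+.2f}: p.data {direction} from {p_data} to {new_data:.3f}")


# Toy loss (an assumption for illustration): loss(p) = (p - 3)^2.
def toy_loss(p):
    return (p - 3.0) ** 2


def toy_grad(p):
    # Derivative of (p - 3)^2 with respect to p.
    return 2.0 * (p - 3.0)


p = 0.88
losses = []
for _ in range(20):
    losses.append(toy_loss(p))
    p = update(p, toy_grad(p))

print(losses[0], losses[-1], p)  # the loss shrinks and p moves toward 3
```

Only the sign of the gradient picks the direction of each step; the learning rate only controls how far the step goes.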
diff --git a/beep boop/foundation/loss/LossFunction.drawio b/beep boop/foundation/loss/LossFunction.drawio
@@ -0,0 +1,22 @@
+<mxfile host="app.diagrams.net" modified="2024-02-17T16:01:54.058Z" agent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/121.0.0.0 Safari/537.36" etag="hwVI0leB8-ZAtJqbm0MF" version="23.1.5" type="device">
+  <diagram name="Page-1" id="_gvUFp_ucttC2tlLnHjO">
+    <mxGraphModel dx="855" dy="570" grid="0" gridSize="10" guides="1" tooltips="1" connect="1" arrows="1" fold="1" page="1" pageScale="1" pageWidth="850" pageHeight="1100" math="1" shadow="0">
+      <root>
+        <mxCell id="0" />
+        <mxCell id="1" parent="0" />
+        <mxCell id="TGnNtdPKK5nN1VZu4Hb8-1" value="$$&#xa;\frac{1}{N} \cdot \sum_{i=0}^{N} (actual_i - expected_i)^2&#xa;$$" style="text;strokeColor=none;align=center;fillColor=none;verticalAlign=middle;whiteSpace=wrap;rounded=0;fontSize=25;fontColor=default;" vertex="1" parent="1">
+          <mxGeometry x="215" y="210" width="420" height="210" as="geometry" />
+        </mxCell>
+        <mxCell id="TGnNtdPKK5nN1VZu4Hb8-2" value="&lt;font style=&quot;font-size: 24px;&quot;&gt;Loss&lt;/font&gt;" style="shape=curlyBracket;whiteSpace=wrap;html=1;rounded=1;labelPosition=left;verticalLabelPosition=middle;align=right;verticalAlign=middle;rotation=-90;" vertex="1" parent="1">
+          <mxGeometry x="460" y="230" width="20" height="260" as="geometry" />
+        </mxCell>
+        <mxCell id="TGnNtdPKK5nN1VZu4Hb8-3" value="&lt;font style=&quot;font-size: 24px;&quot;&gt;Squared&lt;/font&gt;" style="shape=curlyBracket;whiteSpace=wrap;html=1;rounded=1;labelPosition=left;verticalLabelPosition=middle;align=right;verticalAlign=middle;rotation=90;" vertex="1" parent="1">
+          <mxGeometry x="602.5" y="257.5" width="20" height="45" as="geometry" />
+        </mxCell>
+        <mxCell id="TGnNtdPKK5nN1VZu4Hb8-8" value="&lt;font style=&quot;font-size: 24px;&quot;&gt;Mean&lt;/font&gt;" style="shape=curlyBracket;whiteSpace=wrap;html=1;rounded=1;labelPosition=left;verticalLabelPosition=middle;align=right;verticalAlign=middle;rotation=-90;" vertex="1" parent="1">
+          <mxGeometry x="425" y="230" width="20" height="415" as="geometry" />
+        </mxCell>
+      </root>
+    </mxGraphModel>
+  </diagram>
+</mxfile>
diff --git a/beep boop/foundation/loss/index.md b/beep boop/foundation/loss/index.md
@@ -0,0 +1,41 @@
+---
+title: Loss
+layout: default
+parent: Foundation
+grand_parent: beep boop
+---
+
+<h1><pre>
+| ||
+|| |_
+</pre></h1>
+
+The loss is a single number that helps us understand the performance of the neural network. The loss function is how we calculate that number. A lot of the time in training a neural network is spent optimizing this loss function.
+
+## Mean-squared error loss
+
+You calculate this by subtracting the expected output from the actual output of the neural network, squaring the difference, and then taking the mean over all the examples you tested. I _think_ squaring exaggerates errors that are far from correct and shrinks errors that are close to correct (any difference smaller than 1 gets smaller when squared). It also has the primary benefit of getting rid of the sign of the values, similar to $$abs$$.
+
+The curious thing to me is that we don't actually take the mean of the summed squared losses, at least not in anything I've seen so far. So I'm hoping to figure that out. It seems like the division by $$N$$ doesn't really matter: it only scales the loss by a constant, so it's the squaring of the errors that actually gives us our metric.
+
+![Mathematical expression of mean squared loss](./mean-squared-loss.png)
+
+## Example
+
+If we use our [multi-layer perceptron](../multi-layer-perceptron/) we can provide it with four example inputs `xs` and their expected outputs `ys`, feed the inputs through the MLP, and then calculate the loss.
+
+```python
+n = MLP(3, [4, 4, 1])
+xs = [
+  [2.0, 3.0, -1.0],
+  [3.0, -1.0, 0.5],
+  [0.5, 1.0, 1.0],
+  [1.0, 1.0, -1.0],
+]
+ys = [1.0, -1.0, -1.0, 1.0]
+ypred = [n(x) for x in xs]
+
+loss = sum((yout - ygt)**2 for ygt, yout in zip(ys, ypred))
+
+# loss.data, e.g. 7.817821598365237 (varies with the random initial weights)
+```
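Note on the open question above about dividing by $$N$$: one way to see what the division does is to compute both versions with plain floats. In the sketch below the predictions are made up (an assumption, not real MLP output), and the gradients are written out by hand instead of coming from `backward()`.

```python
ys = [1.0, -1.0, -1.0, 1.0]    # expected outputs, as in the example above
ypred = [0.5, -0.5, 0.0, 1.0]  # made-up predictions (assumption, not real MLP output)
n = len(ys)

squared_errors = [(yout - ygt) ** 2 for ygt, yout in zip(ys, ypred)]
sum_loss = sum(squared_errors)  # what the example above computes
mean_loss = sum_loss / n        # the "mean" in mean-squared error

# Gradient of each loss with respect to each prediction.
sum_grads = [2 * (yout - ygt) for ygt, yout in zip(ys, ypred)]
mean_grads = [g / n for g in sum_grads]

print(sum_loss, mean_loss)  # 1.5 0.375
print(sum_grads)            # [-1.0, 1.0, 2.0, 0.0]
print(mean_grads)           # [-0.25, 0.25, 0.5, 0.0]
```

Dividing by $$N$$ keeps the sign of every gradient and shrinks each one by the same factor, so for a fixed number of examples it behaves like using a smaller learning rate.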
diff --git a/beep boop/foundation/loss/mean-squared-loss.png b/beep boop/foundation/loss/mean-squared-loss.png
diff --git a/beep boop/foundation/multi-layer-perceptron/index.md b/beep boop/foundation/multi-layer-perceptron/index.md
@@ -7,4 +7,57 @@ grand_parent: beep boop
 
 # Multi-layer Perceptron (MLP)
 
-A MLP consists of many [neurons](../neuron/) lined up in order and feeding values between each other.
+An MLP consists of many layers of [neurons](../neuron/) lined up in order and feeding values between each other.
+
+Since I'm very code inclined, here's the Python that implements the following image:
+
+![a multilayer perceptron](./mlp.jpeg)
+
+The following code also uses the `Value` class from [Back Propagation](../back-propagation/).
+
+```python
+import random
+
+class Neuron:
+
+  def __init__(self, nin):
+    self.w = [Value(random.uniform(-1,1)) for _ in range(nin)]
+    self.b = Value(random.uniform(-1,1))
+
+  def __call__(self, x):
+    # w * x + b
+    act = sum((wi*xi for wi, xi in zip(self.w, x)), self.b)
+    out = act.tanh()
+    return out
+
+  def parameters(self):
+    return self.w + [self.b]
+
+class Layer:
+
+  def __init__(self, nin, nout):
+    self.neurons = [Neuron(nin) for _ in range(nout)]
+
+  def __call__(self, x):
+    outs = [n(x) for n in self.neurons]
+    return outs[0] if len(outs) == 1 else outs
+
+  def parameters(self):
+    return [p for neuron in self.neurons for p in neuron.parameters()]
+
+class MLP:
+
+  def __init__(self, nin, nouts):
+    sz = [nin] + nouts
+    self.layers = [Layer(sz[i], sz[i+1]) for i in range(len(nouts))]
+
+  def __call__(self, x):
+    for layer in self.layers:
+      x = layer(x)
+    return x
+
+  def parameters(self):
+    return [p for layer in self.layers for p in layer.parameters()]
+
+x = [2.0, 3.0, -1.0]
+n = MLP(3, [4, 4, 1])
+n(x)
+```
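Note on the `MLP` above: a rough way to check what `parameters()` should return is to count the weights and biases straight from the layer sizes. `count_parameters` is a hypothetical helper that mirrors the `sz` list built in `MLP.__init__`; it does not construct any neurons.

```python
def count_parameters(nin, nouts):
    """Count the weights and biases of an MLP with the given layer sizes."""
    sz = [nin] + nouts
    # Each neuron in layer i has sz[i] weights plus one bias,
    # and layer i contains sz[i + 1] neurons.
    return sum((sz[i] + 1) * sz[i + 1] for i in range(len(nouts)))


print(count_parameters(3, [4, 4, 1]))  # 41
```

For `MLP(3, [4, 4, 1])` that is 16 + 20 + 5 parameters, which should match `len(n.parameters())`.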
diff --git a/beep boop/foundation/multi-layer-perceptron/mlp.jpeg b/beep boop/foundation/multi-layer-perceptron/mlp.jpeg
diff --git a/beep boop/foundation/neuron/index.md b/beep boop/foundation/neuron/index.md
@@ -11,3 +11,11 @@ Neurons are exactly what they sound like, the things in our brain!
 ![Diagram of a neuron](./neuron_model.jpeg)
 
 In machine learning, we model these in neural networks to simulate how the brain works. Neurons take in a series of values (x) and weights (w), which are individually multiplied and then added. The neuron fires by taking these and adding the bias of the neuron (how trigger happy it is) and passing it through an activation function which helps squash the values to something like -1 to 1. This is usually `tanh` or a `sigmoid` function.
+
+## Weights
+
+The weights for each input of a neuron start out randomly chosen. There's probably a whole field of mathematics that goes into determining the best starting weights, but at this point for me, it's random. Then through the process of training, these weights get adjusted to try and minimize our loss function.
+
+## Biases
+
+Much like [weights](#weights), biases are also randomly chosen and updated throughout the training process to adjust the activation of that neuron and minimize our loss function.
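Note on weights and biases: the forward pass of a single neuron can be sketched with plain floats. `neuron_forward` is a hypothetical helper, and the fixed seed is only there so reruns give the same random starting values; none of this comes from the notes themselves.

```python
import math
import random

random.seed(0)  # assumption: fixed seed only so reruns match


def neuron_forward(x, w, b):
    """Multiply inputs by weights, add the bias, then squash with tanh."""
    act = sum(wi * xi for wi, xi in zip(w, x)) + b
    return math.tanh(act)


x = [2.0, 3.0, -1.0]
w = [random.uniform(-1, 1) for _ in x]  # random starting weights
b = random.uniform(-1, 1)               # random starting bias
out = neuron_forward(x, w, b)
print(out)  # always lands between -1 and 1
```

Training would then nudge `w` and `b` using their gradients, as described in the gradient descent notes.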
diff --git a/beep boop/frameworks/index.md b/beep boop/frameworks/index.md
@@ -0,0 +1,8 @@
+---
+title: Frameworks
+has_children: true
+layout: default
+parent: beep boop
+---
+
+Notes on various frameworks available for machine learning.
diff --git a/beep boop/frameworks/pytorch/equation.png b/beep boop/frameworks/pytorch/equation.png