adding podcasts
JacobReynolds committed Mar 21, 2024
1 parent 27426b3 commit 5bdd076
Showing 4 changed files with 131 additions and 1 deletion.
9 changes: 9 additions & 0 deletions neural-networks/splash-pad/index.md
@@ -0,0 +1,9 @@
---
title: splash pad
icon: square
description: thoughts to be categorized
---

## Named Entity Recognition (NER)

NER is a common task in Natural Language Processing (NLP) that extracts named entities (people, organizations, locations, and so on) from bodies of text.
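
The note doesn't name a library, but as a minimal sketch, spaCy can do this out of the box (assuming the `en_core_web_sm` model has been downloaded):

```python
# Minimal NER sketch with spaCy (library choice is an assumption, not from the note).
# Setup: pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying a U.K. startup for $1 billion.")

# Each recognized entity span carries its text and a label like ORG, GPE, or MONEY
for ent in doc.ents:
    print(ent.text, ent.label_)
```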
97 changes: 96 additions & 1 deletion neural-networks/transformers/attention.md
@@ -5,7 +5,9 @@ description: listen up, class

Attention, or self-attention, is the mechanism that makes current transformer models so powerful. It lets a neural network take earlier tokens in the context into account when computing the values for the current token.

## Elementary, my dear Watson
## Simple averaging

### Batched average

A simple example is looking at your current batch and averaging the embedding values of all tokens that precede the current one.

@@ -40,6 +42,8 @@ for time in range(T):
xbow[time] = x
```
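
That loop is mostly collapsed in the diff; a self-contained sketch of the same idea (my own reconstruction, using the `T, C = 8, 2` setup from the later snippets) looks roughly like this:

```python
import torch

T, C = 8, 2  # time steps, embedding channels
torch.manual_seed(1)
x = torch.randn(T, C)

# "bag of words" average: row t becomes the mean of rows 0..t
xbow = torch.zeros((T, C))
for time in range(T):
    xprev = x[: time + 1]                  # everything up to and including this token
    xbow[time] = torch.mean(xprev, dim=0)  # average over the time dimension
```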

### Using `torch.tril`

===
And because math is math, we can actually express this as a simple matrix multiplication using `torch.tril`. Building a lower-triangular matrix with `tril` and normalizing its rows ensures that, when we perform the matrix multiplication, each row only averages the values up to and including that position.

@@ -87,3 +91,94 @@ print(torch.allclose(xbow, c))
```

===
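
For reference, a minimal standalone sketch of the matrix-multiplication version the collapsed block builds up to (same toy shapes; variable names are my own):

```python
import torch

T, C = 8, 2
torch.manual_seed(1)
x = torch.randn(T, C)

# lower-triangular ones, then normalize each row so it sums to 1
tril = torch.tril(torch.ones(T, T))
wei = tril / tril.sum(dim=1, keepdim=True)

c = wei @ x  # (T, T) @ (T, C) -> (T, C): the same running averages, no Python loop
```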

### Using `softmax`

One more alternative is to use `softmax` instead: fill the future positions with `-inf`, and `softmax` turns each row into the same uniform averaging weights.

==- Code example

```python
T, C = 8, 2
torch.manual_seed(1)

x = torch.randn(T, C)
tril = torch.tril(torch.ones(T, T))

wei = torch.zeros((T,T))
wei = wei.masked_fill(tril == 0, -torch.inf)
print(wei)
# tensor([[0., -inf, -inf, -inf, -inf, -inf, -inf, -inf],
# [0., 0., -inf, -inf, -inf, -inf, -inf, -inf],
# [0., 0., 0., -inf, -inf, -inf, -inf, -inf],
# [0., 0., 0., 0., -inf, -inf, -inf, -inf],
# [0., 0., 0., 0., 0., -inf, -inf, -inf],
# [0., 0., 0., 0., 0., 0., -inf, -inf],
# [0., 0., 0., 0., 0., 0., 0., -inf],
# [0., 0., 0., 0., 0., 0., 0., 0.]])

wei = torch.softmax(wei, dim=1)
print(wei)
# tensor([[1.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
# [0.5000, 0.5000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
# [0.3333, 0.3333, 0.3333, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
# [0.2500, 0.2500, 0.2500, 0.2500, 0.0000, 0.0000, 0.0000, 0.0000],
# [0.2000, 0.2000, 0.2000, 0.2000, 0.2000, 0.0000, 0.0000, 0.0000],
# [0.1667, 0.1667, 0.1667, 0.1667, 0.1667, 0.1667, 0.0000, 0.0000],
# [0.1429, 0.1429, 0.1429, 0.1429, 0.1429, 0.1429, 0.1429, 0.0000],
# [0.1250, 0.1250, 0.1250, 0.1250, 0.1250, 0.1250, 0.1250, 0.1250]])

xbow = wei @ x
print(xbow)
# tensor([[-1.5256, -0.7502],
# [-1.0898, -1.1799],
# [-0.7599, -0.9896],
# [-0.8149, -1.1445],
# [-0.7943, -0.8549],
# [-0.7915, -0.7543],
# [-0.7102, -0.4055],
# [-0.5929, -0.2964]])
```

===

## Self-attention

I really don't understand this yet, but here's some of the code and specific notes from Andrej's video.

==- Code example

```python
B, T, C = 4, 8, 32
torch.manual_seed(1)
x = torch.randn(B, T, C)

head_size = 16
key = torch.nn.Linear(C, head_size, bias=False)
query = torch.nn.Linear(C, head_size, bias=False)
value = torch.nn.Linear(C, head_size, bias=False)
k = key(x) # (B, T, 16)
q = query(x) # (B, T, 16)
wei = q @ k.transpose(-2, -1) # (B, T, 16) @ (B, 16, T) ---> (B, T, T)

tril = torch.tril(torch.ones((T, T)))
wei = wei.masked_fill(tril == 0, float(-torch.inf))
wei = torch.nn.functional.softmax(wei, dim=-1)
v = value(x)
out = wei @ v
```

===

- Attention is a communication mechanism. Can be seen as nodes in a directed graph looking at each other and aggregating information with a weighted sum from all nodes that point to them, with data-dependent weights.
- There is no notion of space. Attention simply acts over a set of vectors. This is why we need to positionally encode tokens.
- Each example across the batch dimension is of course processed completely independently; the examples never "talk" to each other.
- In an "encoder" attention block, just delete the single line that does masking with `tril`, allowing all tokens to communicate. The block here is called a "decoder" attention block because it has triangular masking, and is usually used in autoregressive settings, like language modeling.
- "Self-attention" just means that the keys and values are produced from the same source as the queries. In "cross-attention", the queries still get produced from `x`, but the keys and values come from some other, external source (e.g. an encoder module).
- "Scaled" attention additionally scales `wei` by `1/sqrt(head_size)`. This makes it so that when the inputs Q and K are unit variance, `wei` will be unit variance too, and softmax will stay diffuse and not saturate too much. See the snippets below.

The scaling from the final note can be done with:

```python
wei = q @ k.transpose(-2, -1) * head_size**-0.5  # scale by 1/sqrt(head_size)
```
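
A quick standalone way to see the variance claim from that last bullet (fresh unit-variance `q` and `k`, not the tensors from the snippet above):

```python
import torch

torch.manual_seed(1)
B, T, head_size = 4, 8, 16
q = torch.randn(B, T, head_size)  # unit variance
k = torch.randn(B, T, head_size)  # unit variance

wei = q @ k.transpose(-2, -1)
print(wei.var())                      # roughly head_size (~16)
print((wei * head_size**-0.5).var())  # roughly 1 after scaling
```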
22 changes: 22 additions & 0 deletions podcast.md
@@ -0,0 +1,22 @@
---
title: podcast
description: notes from different podcasts
icon: unmute
order: 899
---

# Podcast

Various snippets and notes from podcasts I've listened to that I'd like to remember.

## Business Made Simple with Donald Miller

### #142: Marlee Joseph - The Secret to Making Better Marketing Decisions

Marlee talks about the ICE framework:

- I | Impact
- C | Confidence
- E | Ease

She also mentioned the HPPO framework (highest paid person's opinion).
4 changes: 4 additions & 0 deletions start-ups/snitch/index.md
@@ -14,3 +14,7 @@ A [recent SEC ruling](https://www.sec.gov/news/press-release/2023-139) now requi
Luckily it's the government, and in _some_(ish) cases their data is public. You can search all public SEC filings via [EDGAR](https://www.sec.gov/edgar/searchedgar/companysearch), the Electronic Data Gathering, Analysis, and Retrieval system. You can go even deeper by using their [full-text search](https://www.sec.gov/edgar/search/#) to dive into all the filings.

Unfortunately, an 8-K is a common form used to notify investors and the SEC about a variety of things, only one of which is a data breach. Searching EDGAR for [8-K filings](https://www.sec.gov/edgar/search/#/category=custom&forms=8-K) returns a ton of documents, mostly about changes in executive staffing or compensation. But thanks to their (likely [Lucene](https://lucene.apache.org)) backend, we can add a full-text match for the phrase "Item 1.05" and get [more specific results](https://www.sec.gov/edgar/search/#/q=%2522item%25201.05%2522&category=custom&forms=8-K). At the time of writing there are 5 results across 3 companies. When I originally performed this research there was only 1 filing, so I wasn't sure how often new filings would come out or how their formats would differ, and it wasn't an appropriate time to start writing automation around them. However, as more filings get created and we can start to measure their velocity, this could be a fun data feed or subscription service for those who like others' dirty cyber laundry.
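
If that ever becomes worth automating, a rough sketch might poll EDGAR's full-text search. Note that the JSON endpoint (`efts.sec.gov/LATEST/search-index`), its parameters, and the response shape are my assumptions based on what the search UI appears to call, not a documented API:

```python
# Rough sketch only: the efts.sec.gov endpoint, its parameters, and the response
# fields are assumptions inferred from the EDGAR search UI, not a documented API.
import requests

resp = requests.get(
    "https://efts.sec.gov/LATEST/search-index",
    params={"q": '"Item 1.05"', "forms": "8-K"},
    headers={"User-Agent": "research script your-email@example.com"},  # SEC asks for a descriptive UA
    timeout=30,
)
resp.raise_for_status()
data = resp.json()

# Assumed Elasticsearch-style response: total hit count plus per-filing hits
print(data.get("hits", {}).get("total", {}).get("value"))
```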

## Further thoughts

Reading the SEC press release more closely, I now realize it also requires all publicly traded companies to include a statement about their cybersecurity processes in their annual filings. Reading some of them, though, they all seem to be written by lawyers and lack any substantial information.
