Built site for gh-pages
alexchen4ai committed Feb 22, 2024
1 parent 0ad974b commit e83e386
Showing 6 changed files with 6 additions and 4 deletions.
2 changes: 1 addition & 1 deletion .nojekyll
@@ -1 +1 @@
-878a7746
+8136becc
2 changes: 1 addition & 1 deletion notes.html
@@ -224,7 +224,7 @@ <h1 class="title">Research notes</h1>

<div class="quarto-listing quarto-listing-container-default" id="listing-listing">
<div class="list quarto-listing-default">
<div class="quarto-post image-right" data-index="0" data-categories="Large Language Models" data-listing-date-sort="1708502400000" data-listing-file-modified-sort="1708635048651" data-listing-date-modified-sort="NaN" data-listing-reading-time-sort="4" data-listing-word-count-sort="731">
<div class="quarto-post image-right" data-index="0" data-categories="Large Language Models" data-listing-date-sort="1708502400000" data-listing-file-modified-sort="1708635208669" data-listing-date-modified-sort="NaN" data-listing-reading-time-sort="4" data-listing-word-count-sort="737">
<div class="thumbnail">
<p><a href="./notes/Large Language Model/moe.html" class="no-external"></a></p><a href="./notes/Large Language Model/moe.html" class="no-external">
<p class="card-img-top"><img src="images/llama2.png" class="thumbnail-image card-img"/></p>
1 change: 1 addition & 0 deletions notes.xml
@@ -204,6 +204,7 @@ Tip
<h2 class="anchored" data-anchor-id="load-balancing-loss">Load balancing loss</h2>
<p>Since different portions of the total tokens are routed to different experts, as in the unbalanced dataset problem, we need to add a load balancing loss. Given \(N\) experts indexed by \(i = 1\) to \(N\) and a batch \(\mathcal{B}\) with \(T\) tokens, the auxiliary loss is computed as the scaled dot product between the vectors \(f\) and \(P\), \[
\text{loss} = \alpha \cdot N \cdot \sum_{i=1}^{N} f_i \cdot P_i
\] where \(f_i\) is the fraction of tokens dispatched to expert \(i\), \[
f_i = \frac{1}{T} \sum_{x \in \mathcal{B}} \mathbb{1}\{\operatorname{argmax} p(x) = i\}
\] and \(P_i\) is the fraction of the router probability allocated to expert \(i\), \[
P_i = \frac{1}{T} \sum_{x \in \mathcal{B}} p_i(x)
\]</p>
<p>We add this loss to encourage uniform routing, since the loss is minimized when \[
f_i = P_i = \frac{1}{N}.
\]</p>
+<p>You can prove this with the <a href="https://en.wikipedia.org/wiki/Cauchy%E2%80%93Schwarz_inequality">Cauchy–Schwarz inequality</a>.</p>
</section>
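To make the hunk above concrete, here is a minimal Python (PyTorch) sketch of this auxiliary loss. The function name, the [T tokens, N experts] shape convention for router_logits, and the default alpha are illustrative assumptions, not code from this commit.

import torch
import torch.nn.functional as F

def load_balancing_loss(router_logits: torch.Tensor, alpha: float = 0.01) -> torch.Tensor:
    # Assumed shape: router_logits is [T, N], T tokens routed over N experts.
    T, N = router_logits.shape
    probs = F.softmax(router_logits, dim=-1)          # p(x) for every token
    # f_i: fraction of tokens whose argmax (top-1) expert is i; non-differentiable.
    top1 = probs.argmax(dim=-1)
    f = F.one_hot(top1, num_classes=N).float().mean(dim=0)
    # P_i: fraction of the total router probability allocated to expert i.
    P = probs.mean(dim=0)
    # loss = alpha * N * sum_i f_i * P_i; gradients flow only through P.
    return alpha * N * torch.sum(f * P)

# Sanity check: a uniform router (P_i = 1/N) gives loss = alpha regardless of f.
logits = torch.zeros(128, 8)                          # 128 tokens, 8 experts
print(load_balancing_loss(logits))                    # tensor(0.0100)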
1 change: 1 addition & 0 deletions notes/Large Language Model/moe.html
@@ -409,6 +409,7 @@ <h2 class="anchored" data-anchor-id="load-balancing-loss">Load balancing loss</h2>
<p>We add this loss to encourage uniform routing, since the loss is minimized when \[
f_i = P_i = \frac{1}{N}.
\]</p>
+<p>You can prove this with the <a href="https://en.wikipedia.org/wiki/Cauchy%E2%80%93Schwarz_inequality">Cauchy–Schwarz inequality</a>.</p>


</section>
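The Cauchy–Schwarz claim added in this commit can be spelled out. A minimal sketch of the argument, under the assumption that the differentiable probability fractions \(P_i\) track the hard dispatch fractions \(f_i\): both vectors sum to 1 over the \(N\) experts, and Cauchy–Schwarz gives
\[
1 = \Big(\sum_{i=1}^{N} P_i \cdot 1\Big)^{2} \le \Big(\sum_{i=1}^{N} P_i^{2}\Big)\Big(\sum_{i=1}^{N} 1^{2}\Big) = N \sum_{i=1}^{N} P_i^{2},
\]
so with \(f_i \approx P_i\),
\[
\text{loss} = \alpha \, N \sum_{i=1}^{N} f_i P_i \approx \alpha \, N \sum_{i=1}^{N} P_i^{2} \ge \alpha,
\]
with equality exactly when \(P_1 = \dots = P_N = \frac{1}{N}\), i.e. when \(f_i = P_i = \frac{1}{N}\).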
2 changes: 1 addition & 1 deletion search.json
@@ -47,7 +47,7 @@
"href": "notes/Large Language Model/moe.html#load-balancing-loss",
"title": "Mixture of expert",
"section": "Load balancing loss",
"text": "Load balancing loss\nSince different portion of total tokens will enter different experts, like the unbalanced dataset problem, we need to add a load balancing loss. Given \\(N\\) experts indexed by \\(i=1\\) to \\(N\\) and a batch \\(\\mathcal{B}\\) with \\(T\\) tokens, the auxiliary loss is computed as the scaled dot-product between vectors \\(f\\) and \\(P\\), \\[\n\\text { loss }=\\alpha \\cdot N \\cdot \\sum_{i=1}^N f_i \\cdot P_i\n\\] where \\(f_i\\) is the fraction of tokens dispatched to expert \\(i\\), \\[\nf_i=\\frac{1}{T} \\sum_{x \\in \\mathcal{B}} \\mathbb{1}\\{\\operatorname{argmax} p(x)=i\\}\n\\] and \\(P_i\\) is the fraction of the router probability allocated for expert \\(i,{ }^2\\) \\[\nP_i=\\frac{1}{T} \\sum_{x \\in \\mathcal{B}} p_i(x)\n\\]\nWe add this loss since we want to encourages uniform routing since the loss is minimized when \\[\nf_i = P_i = \\frac{1}{N}.\n\\]",
"text": "Load balancing loss\nSince different portion of total tokens will enter different experts, like the unbalanced dataset problem, we need to add a load balancing loss. Given \\(N\\) experts indexed by \\(i=1\\) to \\(N\\) and a batch \\(\\mathcal{B}\\) with \\(T\\) tokens, the auxiliary loss is computed as the scaled dot-product between vectors \\(f\\) and \\(P\\), \\[\n\\text { loss }=\\alpha \\cdot N \\cdot \\sum_{i=1}^N f_i \\cdot P_i\n\\] where \\(f_i\\) is the fraction of tokens dispatched to expert \\(i\\), \\[\nf_i=\\frac{1}{T} \\sum_{x \\in \\mathcal{B}} \\mathbb{1}\\{\\operatorname{argmax} p(x)=i\\}\n\\] and \\(P_i\\) is the fraction of the router probability allocated for expert \\(i,{ }^2\\) \\[\nP_i=\\frac{1}{T} \\sum_{x \\in \\mathcal{B}} p_i(x)\n\\]\nWe add this loss since we want to encourages uniform routing since the loss is minimized when \\[\nf_i = P_i = \\frac{1}{N}.\n\\]\nYou can prove it by Cauchy-Schwarz inequality.",
"crumbs": [
"Home",
"🗣️ **Large language models**",
2 changes: 1 addition & 1 deletion sitemap.xml
@@ -6,7 +6,7 @@
</url>
<url>
<loc>https://alexchen4ai.github.io/blog/notes/Large Language Model/moe.html</loc>
-<lastmod>2024-02-22T20:50:48.651Z</lastmod>
+<lastmod>2024-02-22T20:53:28.669Z</lastmod>
</url>
<url>
<loc>https://alexchen4ai.github.io/blog/about.html</loc>
