A collection of classical ML equations in LaTeX. Some of them are provided with simple notes and paper links. Hope it helps with writing papers and blogs.
Better viewed at https://blmoistawinde.github.io/ml_equations_latex/
encoder hidden state:

h_t = RNN_{enc}(x_t, h_{t-1})

decoder hidden state:

s_t = RNN_{dec}(y_t, s_{t-1})
The $RNN_{enc}$ and $RNN_{dec}$ are usually either

- LSTM (paper: Long short-term memory)
- GRU (paper: Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation).
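A minimal NumPy sketch of the recurrence above, using a plain (Elman) RNN cell in place of LSTM/GRU just for illustration; the weight names and toy sizes are made up.

```python
import numpy as np

def rnn_cell(x_t, h_prev, W_x, W_h, b):
    """One vanilla RNN step: h_t = tanh(W_x x_t + W_h h_{t-1} + b)."""
    return np.tanh(W_x @ x_t + W_h @ h_prev + b)

# toy sizes: input dim 4, hidden dim 3, sequence length 5
rng = np.random.default_rng(0)
W_x, W_h, b = rng.normal(size=(3, 4)), rng.normal(size=(3, 3)), np.zeros(3)

h = np.zeros(3)                       # initial encoder hidden state
for x_t in rng.normal(size=(5, 4)):   # h_t = RNN_enc(x_t, h_{t-1})
    h = rnn_cell(x_t, h, W_x, W_h, b)
print(h)                              # final encoder state, e.g. passed to the decoder
```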
The attention weight $\alpha_{ij}$ measures how much the $i$-th output attends to the $j$-th input; the context vector $c_i$ is the weighted sum of the encoder states:
c_i = \sum_{j=1}^{T_x} \alpha_{ij}h_j
\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{T_x} \exp(e_{ik})}
e_{ij} = a(s_{i-1}, h_j)
Paper: Neural Machine Translation by Jointly Learning to Align and Translate
e_{ij} = v^T tanh(W[s_{i-1}; h_j])
Paper: Effective Approaches to Attention-based Neural Machine Translation
If $s_{i-1}$ and $h_j$ have the same dimension, the dot-product score can be used:

e_{ij} = s_{i-1}^T h_j

otherwise, a weight matrix $W$ maps them into the same space (the "general" score):

e_{ij} = s_{i-1}^T W h_j
Finally, the decoder state $s_t$ and output $o_t$ are computed as:
s_t = tanh(W[s_{t-1};y_t;c_t])
o_t = softmax(Vs_t)
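A NumPy sketch of one decoder step of the additive attention above, just to make the shapes concrete; `W`, `v` and the toy dimensions are illustrative, not from any paper or library.

```python
import numpy as np

def additive_attention(s_prev, H, W, v):
    """e_ij = v^T tanh(W [s_{i-1}; h_j]);  alpha = softmax(e);  c = sum_j alpha_j h_j."""
    e = np.array([v @ np.tanh(W @ np.concatenate([s_prev, h_j])) for h_j in H])
    alpha = np.exp(e - e.max())
    alpha /= alpha.sum()
    return alpha @ H, alpha            # context vector c_i and weights alpha_ij

rng = np.random.default_rng(0)
H = rng.normal(size=(6, 8))            # T_x = 6 encoder states h_j of size 8
s_prev = rng.normal(size=8)            # previous decoder state s_{i-1}
W, v = rng.normal(size=(8, 16)), rng.normal(size=8)
c, alpha = additive_attention(s_prev, H, W, v)
print(c.shape, alpha.sum())            # (8,) and weights summing to 1
```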
Paper: Attention Is All You Need
Attention(Q, K, V) = softmax(\frac{QK^T}{\sqrt{d_k}})V
where $d_k$ is the dimension of the key and query vectors.
MultiHead(Q, K, V) = Concat(head_1, ..., head_h)W^O

where

head_i = Attention(Q W^Q_i, K W^K_i, V W^V_i)
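A NumPy sketch of scaled dot-product attention with toy shapes; a real multi-head version would additionally project Q, K, V with the learned $W^Q_i, W^K_i, W^V_i$ and concatenate the heads.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                     # (n_queries, n_keys)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)      # row-wise softmax
    return weights @ V

rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(2, 4)), rng.normal(size=(3, 4)), rng.normal(size=(3, 5))
print(scaled_dot_product_attention(Q, K, V).shape)      # (2, 5)
```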
Paper: Generative Adversarial Networks
\min_{G}\max_{D}\mathbb{E}_{x\sim p_{\text{data}}(x)}[\log{D(x)}] + \mathbb{E}_{z\sim p_{\text{z}}(z)}[\log{(1 - D(G(z)))}]
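A small NumPy sketch that only evaluates the two expectation terms of the minimax objective on made-up discriminator outputs; it is not a training loop.

```python
import numpy as np

def gan_value(d_real, d_fake):
    """Monte-Carlo estimate of E[log D(x)] + E[log(1 - D(G(z)))]."""
    return np.mean(np.log(d_real)) + np.mean(np.log(1.0 - d_fake))

rng = np.random.default_rng(0)
d_real = rng.uniform(0.6, 0.99, size=1000)   # D(x) on real samples
d_fake = rng.uniform(0.01, 0.4, size=1000)   # D(G(z)) on generated samples
print(gan_value(d_real, d_fake))             # D tries to maximize this, G to minimize it
```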
Paper: Auto-Encoding Variational Bayes
To produce a latent variable $z$ such that

z \sim q_{\mu, \sigma}(z) = \mathcal{N}(\mu, \sigma^2)

we first sample

\epsilon \sim \mathcal{N}(0,1)

and then compute

z = \mu + \epsilon \cdot \sigma

(the reparameterization trick, which keeps the sampling step differentiable with respect to $\mu$ and $\sigma$).
The above is the 1-D case. For the multi-dimensional (vector) case we use:
\epsilon \sim \mathcal{N}(0, \textbf{I})
\vec{z} \sim \mathcal{N}(\vec{\mu}, \sigma^2 \textbf{I})
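A NumPy sketch of the reparameterization trick for the vector case; `mu` and `sigma` are made-up toy values standing in for the encoder outputs.

```python
import numpy as np

def reparameterize(mu, sigma, rng):
    """Sample z ~ N(mu, diag(sigma^2)) as z = mu + eps * sigma with eps ~ N(0, I)."""
    eps = rng.standard_normal(mu.shape)
    return mu + eps * sigma

rng = np.random.default_rng(0)
mu, sigma = np.array([0.0, 1.0, -2.0]), np.array([1.0, 0.5, 0.1])
print(reparameterize(mu, sigma, rng))   # sampling stays differentiable w.r.t. mu and sigma
```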
Related to Logistic Regression. Used for single-label or multi-label binary classification.
\sigma(z) = \frac{1} {1 + e^{-z}}
tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}} = \frac{1 - e^{-2x}}{1 + e^{-2x}}
For multi-class, single-label classification.
\sigma(z_i) = \frac{e^{z_{i}}}{\sum_{j=1}^K e^{z_{j}}} \quad \text{for } i=1,2,\dots,K
Relu(z) = max(0, z)
Gelu(x) = x\Phi(x)

where $\Phi(x)$ is the cumulative distribution function of the standard Gaussian distribution.
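A NumPy sketch of the activations above; the GELU uses the exact Gaussian CDF via `math.erf` (many frameworks use a tanh approximation instead).

```python
import numpy as np
from math import erf, sqrt

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def tanh(x):
    return np.tanh(x)                       # (e^x - e^-x) / (e^x + e^-x)

def softmax(z):
    e = np.exp(z - np.max(z))               # subtract max for numerical stability
    return e / e.sum()

def relu(z):
    return np.maximum(0.0, z)

def gelu(x):
    phi = np.vectorize(lambda t: 0.5 * (1.0 + erf(t / sqrt(2.0))))   # standard normal CDF
    return x * phi(x)

z = np.array([-2.0, 0.0, 3.0])
print(sigmoid(z), tanh(z), softmax(z), relu(z), gelu(z))
```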
Below, $x$ and $y$ are $D$-dimensional vectors, and $x_i$ denotes the value on the $i$-th dimension of $x$.
\sum_{i=1}^{D}|x_i-y_i|
\sum_{i=1}^{D}(x_i-y_i)^2
It is less sensitive to outliers than MSE because it only squares the error inside an interval ($|y - \hat{y}| < \delta$) and grows linearly outside it.
L_{\delta}=
\begin{cases}
\frac{1}{2}(y - \hat{y})^{2} & \text{if } \left| y - \hat{y} \right| < \delta \\
\delta \left( \left| y - \hat{y} \right| - \frac{1}{2}\delta \right) & \text{otherwise}
\end{cases}
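A NumPy sketch of the three losses above, kept as sums/elementwise values exactly as written (no averaging); the toy vectors are made up.

```python
import numpy as np

def mae(x, y):
    return np.sum(np.abs(x - y))            # sum_i |x_i - y_i|

def mse(x, y):
    return np.sum((x - y) ** 2)             # sum_i (x_i - y_i)^2

def huber(y_true, y_pred, delta=1.0):
    r = np.abs(y_true - y_pred)
    return np.where(r < delta, 0.5 * r ** 2, delta * (r - 0.5 * delta))

x, y = np.array([1.0, 2.0, 3.0]), np.array([1.5, 1.0, 10.0])
print(mae(x, y), mse(x, y), huber(x, y))
```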
- In binary classification, where the number of classes $M$ equals 2, Binary Cross-Entropy (BCE) can be calculated as:

-{(y\log(p) + (1 - y)\log(1 - p))}

- If $M > 2$ (i.e. multiclass classification), we calculate a separate loss for each class label per observation and sum the result:

-\sum_{c=1}^My_{o,c}\log(p_{o,c})
M - number of classes
log - the natural log
y - binary indicator (0 or 1) if class label c is the correct classification for observation o
p - predicted probability observation o is of class c
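A NumPy sketch of both cases; here the per-observation losses are averaged, and the rows of `P` are assumed to already be valid probability distributions (e.g. softmax outputs).

```python
import numpy as np

def binary_cross_entropy(y, p, eps=1e-12):
    """-(y log p + (1 - y) log(1 - p)), averaged over observations."""
    p = np.clip(p, eps, 1.0 - eps)
    return -np.mean(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))

def cross_entropy(Y, P, eps=1e-12):
    """-sum_c y_{o,c} log p_{o,c} with one-hot Y, averaged over observations."""
    return -np.mean(np.sum(Y * np.log(np.clip(P, eps, 1.0)), axis=1))

y, p = np.array([1, 0, 1]), np.array([0.9, 0.2, 0.6])          # binary case
Y = np.array([[1, 0, 0], [0, 0, 1]])                           # one-hot labels, M = 3
P = np.array([[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]])               # predicted distributions
print(binary_cross_entropy(y, p), cross_entropy(Y, P))
```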
Minimizing the negative log-likelihood (NLL) is equivalent to Maximum Likelihood Estimation (MLE).
Here $p(y)$ is the predicted probability of the ground-truth label $y$:
NLL(y) = -{\log(p(y))}
\min_{\theta} \sum_y {-\log(p(y;\theta))}
\max_{\theta} \prod_y p(y;\theta)
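A tiny NumPy sketch: given the probability the model assigns to each ground-truth label, the summed NLL is the quantity being minimized.

```python
import numpy as np

def nll(p_of_true):
    """NLL(y) = -log p(y), the negative log of the probability of the true label."""
    return -np.log(p_of_true)

p_true = np.array([0.9, 0.4, 0.75])   # made-up p(y; theta) for three observations
print(nll(p_true).sum())              # minimizing this sum == maximizing the likelihood product
```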
Used in Support Vector Machines (SVM).
max(0, 1 - y \cdot \hat{y})
KL(\hat{y} || y) = \sum_{c=1}^{M}\hat{y}_c \log{\frac{\hat{y}_c}{y_c}}
JS(\hat{y} || y) = \frac{1}{2}(KL(y||\frac{y+\hat{y}}{2}) + KL(\hat{y}||\frac{y+\hat{y}}{2}))
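NumPy sketches of hinge loss, KL and JS divergence; labels for the hinge loss are assumed to be in {-1, +1}, and the distributions are small made-up vectors.

```python
import numpy as np

def hinge(y, y_hat):
    """max(0, 1 - y * y_hat), with y in {-1, +1}."""
    return np.maximum(0.0, 1.0 - y * y_hat)

def kl(p, q, eps=1e-12):
    """KL(p || q) = sum_c p_c log(p_c / q_c) for discrete distributions."""
    p, q = np.clip(p, eps, 1.0), np.clip(q, eps, 1.0)
    return np.sum(p * np.log(p / q))

def js(p, q):
    m = 0.5 * (p + q)
    return 0.5 * (kl(p, m) + kl(q, m))

p, q = np.array([0.7, 0.2, 0.1]), np.array([0.5, 0.3, 0.2])
print(hinge(np.array([1, -1]), np.array([0.8, 0.3])), kl(p, q), js(p, q))
```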
The $Error$ term below can be any of the losses above.
A regression model that uses the L1 regularization technique is called Lasso Regression.

Loss = Error(Y - \widehat{Y}) + \lambda \sum_{i=1}^n |w_i|
A regression model that uses the L2 regularization technique is called Ridge Regression.

Loss = Error(Y - \widehat{Y}) + \lambda \sum_{i=1}^n w_i^{2}
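A NumPy sketch of adding the two penalties to an already-computed error term; `base_error` and `lam` are placeholder values.

```python
import numpy as np

def lasso_loss(error, w, lam):
    """Error + lambda * sum_i |w_i|   (L1 / Lasso penalty)."""
    return error + lam * np.sum(np.abs(w))

def ridge_loss(error, w, lam):
    """Error + lambda * sum_i w_i^2   (L2 / Ridge penalty)."""
    return error + lam * np.sum(w ** 2)

w = np.array([0.5, -2.0, 0.0, 1.5])
base_error = 3.2                       # e.g. an MSE computed elsewhere
print(lasso_loss(base_error, w, 0.1), ridge_loss(base_error, w, 0.1))
```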
Some of them overlap with losses, like MAE and KL-divergence.
Accuracy = \frac{TP+TN}{TP+TN+FP+FN}
Precision = \frac{TP}{TP+FP}
Recall = \frac{TP}{TP+FN}
F1 = \frac{2*Precision*Recall}{Precision+Recall} = \frac{2*TP}{2*TP+FP+FN}
Sensitivity = Recall = \frac{TP}{TP+FN}
Specificity = \frac{TN}{FP+TN}
AUC is calculated as the Area Under the ROC Curve, i.e. the curve of $Sensitivity$ (TPR) against $1 - Specificity$ (FPR).
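A NumPy sketch computing the confusion-matrix based metrics above from binary labels and predictions; the example arrays are made up (AUC is omitted since it needs scores and thresholds rather than hard predictions).

```python
import numpy as np

def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall (= sensitivity), specificity and F1 from 0/1 labels."""
    tp = np.sum((y_true == 1) & (y_pred == 1))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    specificity = tn / (fp + tn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, specificity, f1

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])
print(classification_metrics(y_true, y_pred))
```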
MAE, MSE: see the equations in the Loss section above.
The Mutual Information is a measure of the similarity between two labels of the same data. Where $|U_i|$ is the number of samples in cluster $U_i$ and $|V_j|$ is the number of samples in cluster $V_j$, the Mutual Information between clusterings $U$ and $V$ is given as:
MI(U,V)=\sum_{i=1}^{|U|} \sum_{j=1}^{|V|} \frac{|U_i\cap V_j|}{N}\log\frac{N|U_i \cap V_j|}{|U_i||V_j|}
Normalized Mutual Information (NMI) is a normalization of the Mutual Information (MI) score that scales the result between 0 (no mutual information) and 1 (perfect correlation). The mutual information is normalized by some generalized mean of H(labels_true) and H(labels_pred). See wiki.
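A NumPy sketch of MI and one possible NMI, normalized here by the arithmetic mean of the two label entropies (one of the generalized means mentioned above); the label arrays are toy data.

```python
import numpy as np

def mutual_info(labels_u, labels_v):
    """MI(U, V) = sum_ij |Ui ∩ Vj| / N * log(N |Ui ∩ Vj| / (|Ui| |Vj|))."""
    u, v = np.asarray(labels_u), np.asarray(labels_v)
    n, mi = len(u), 0.0
    for ui in np.unique(u):
        for vj in np.unique(v):
            n_uv = np.sum((u == ui) & (v == vj))        # |Ui ∩ Vj|
            if n_uv > 0:
                mi += n_uv / n * np.log(n * n_uv / (np.sum(u == ui) * np.sum(v == vj)))
    return mi

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log(p))

def nmi(labels_u, labels_v):
    return mutual_info(labels_u, labels_v) / (0.5 * (entropy(labels_u) + entropy(labels_v)))

u, v = [0, 0, 1, 1, 2, 2], [0, 0, 1, 1, 1, 2]
print(mutual_info(u, v), nmi(u, v))
```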
Skip RI (Rand Index) and ARI (Adjusted Rand Index) for their complexity.
Also skip metrics for related tasks (e.g. modularity for community detection[graph clustering], coherence score for topic modeling[soft clustering]).
Skip nDCG (Normalized Discounted Cumulative Gain) for its complexity.
Average Precision is calculated as:
\text{AP} = \sum_n (R_n - R_{n-1}) P_n
where $P_n$ and $R_n$ are the precision and recall at the $n$-th threshold.
AP can also be regarded as the area under the precision-recall curve.
MAP is the mean of AP over all the queries.
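A NumPy sketch of AP from relevance labels and ranking scores, computing precision and recall at every cutoff of the score-sorted list; the toy labels and scores are made up, and MAP would simply average this over queries.

```python
import numpy as np

def average_precision(y_true, scores):
    """AP = sum_n (R_n - R_{n-1}) * P_n over cutoffs sorted by decreasing score."""
    order = np.argsort(-np.asarray(scores, dtype=float))
    y = np.asarray(y_true)[order]
    tp_cum = np.cumsum(y)                               # true positives at each cutoff
    precision = tp_cum / np.arange(1, len(y) + 1)       # P_n
    recall = tp_cum / y.sum()                           # R_n
    prev_recall = np.concatenate([[0.0], recall[:-1]])  # R_{n-1}
    return np.sum((recall - prev_recall) * precision)

y_true, scores = [1, 0, 1, 1, 0], [0.9, 0.8, 0.7, 0.3, 0.2]
print(average_precision(y_true, scores))
```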
Cosine(x,y) = \frac{x \cdot y}{|x||y|}
Similarity of two sets
Jaccard(U,V) = \frac{|U \cap V|}{|U \cup V|}
Relevance of two events
PMI(x;y) = \log{\frac{p(x,y)}{p(x)p(y)}}
For example, in NLP, $p(x)$ and $p(y)$ can be the probabilities of words $x$ and $y$ appearing in a corpus, and $p(x,y)$ the probability that they co-occur; a high PMI then indicates a strong association between the two words.
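NumPy sketches of the three similarity measures above, with made-up inputs; for PMI the probabilities would typically be estimated from corpus counts.

```python
import numpy as np

def cosine(x, y):
    """Cosine(x, y) = x·y / (|x| |y|)."""
    return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

def jaccard(u, v):
    """Jaccard(U, V) = |U ∩ V| / |U ∪ V| for two sets."""
    u, v = set(u), set(v)
    return len(u & v) / len(u | v)

def pmi(p_xy, p_x, p_y):
    """PMI(x; y) = log(p(x, y) / (p(x) p(y)))."""
    return np.log(p_xy / (p_x * p_y))

print(cosine(np.array([1.0, 2.0]), np.array([2.0, 3.0])))
print(jaccard({"a", "b", "c"}, {"b", "c", "d"}))
print(pmi(0.02, 0.1, 0.05))   # > 0: the two events co-occur more often than by chance
```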
This repository currently contains only simple equations for ML. They are mainly about deep learning and NLP, due to personal research interests.
Due to time constraints, elegant equations in traditional ML approaches like SVM, SVD, PCA, and LDA are not included yet.
Moreover, there is a trend towards more complex metrics, which have to be calculated with complicated programs (e.g. BLEU, ROUGE, METEOR), iterative algorithms (e.g. PageRank), optimization (e.g. Earth Mover Distance), or even learning-based methods (e.g. BERTScore). They thus cannot be described by simple equations.
https://blog.floydhub.com/gans-story-so-far/
https://ermongroup.github.io/cs228-notes/extras/vae/
Thanks to a-rodin's solution for showing LaTeX in GitHub markdown, which I have wrapped into latex2pic.py.