scores_metrics

Prediction scores and metrics

Scale-dependent-errors
Percentage errors
Scaled errors
Goodness of Fit
Probabilistic Model Selection (AIC, BIC)
Correlation and Synchrony
Information/Entropy measures
Cross-validation

Spatial Data

Spatial-temporal data

Cluster data

Probability Distributions

Time series data

Scale dependent errors

The errors are on the same scale as the data, i.e. different data sets cannot be compared

Mean absolute error (MAE)

$MAE = mean(|y_i - \hat{y}_i|) = \frac{1}{N} \sum_{i=1}^{N} |y_i - \hat{y}_i|$

Minimizing MAE leads to prediction of median.

Mean square error (MSE)

$MSE = mean(|y - \hat{y}_i|^2)$

Strongly penalizes large wrong predictions

Root mean square error (RMSE)

$RMSE = \sqrt{mean(|y_i - \hat{y}_i|^2)}$

Minimizing RMSE leads to prediction of mean

Percentage errors

Percentage errors are unit free and therefore allow comparison of different data sets.

Mean absolute percentage error (MAPE)

$MAPE = mean(\frac{|y_i - \hat{y}_i|}{y_i})$

Problems:

cannot be used for zero values
puts more weight on negative than positive errors

Symmetric absolute percentage error (sMAPE)

Used to overcome the problems of MAPE

$sMAPE = mean(\frac{|y_i - \hat{y}_i|}{( |y_i| |\hat{y}_i| )/2})$

Scaled errors

Alternative to percentage errors when comparing different datasets

Mean absolute scaled error (MASE)

$MASE = \frac{mean(|y_i - \hat{y}_i|)}{\frac{1}{N-1} \sum_{t=2}^{N} |y_t - y_{t-1}|}$

scale invariance
symmetric
less than one if it arises from a better forecast than the average naïve forecast and conversely it is greater than one if the forecast is worse than the average naïve forecast

https://otexts.com/fpp2/accuracy.html

https://scikit-learn.org/stable/modules/model_evaluation.html#

Goodness of Fit

Check if a hypothesis is correct. Often referred to as explained variance scores.

Coefficient of determination (R^2)

Describes the variance (of y) which is explained by the model prediction.

$R^2 = 1 - \frac{\sum_{i=1}^{N}(y_i - \hat{y}_i)^2}{\sum_{i=1}^{N}(y_i - \bar{y})^2}$

Chi-square test

Statistical test that measures if a hypothesis significantly rejects the null hypothesis. Chi-square is obtained by

$\chi^2 = \sum_{i}^{N} \frac{(\hat{y}_i - y_i)^2}{y_i}$

The sum of square errors follows the chi-square distribution if the errors are independent and normally distributed. The probability of the calculated $\chi^2$ , $p(\chi^2)$ is called "p-value". If the p-value is smaller than the threshold $\alpha$ , the $\chi^2$ lies within the tail of the distribution. Thus, the null hypothesis can be significantly rejected.

F-test

The F-value expresses how much of the model has improved compared to the mean (null hypothesis) given the variance of the model and data.

For regression, it can be generalized in order to compare the fit quality of a complex model as compared to a simpler version of the same model.

Here, we assume model 1 with k_1 number of parameters and model 2 with k_2 number of parameters, where k_1 > k_2. The F-test is obtained as follows:

Compute the sum of squares of residual errors $RSS = \sum_{i} (y_i - \hat{y}_i)^2$ for each model
The F-statistics of the two regression models is obtained by
$F = \frac{\frac{RSS_1 - RSS_2}{k_1 - k_2}}{\frac{RSS_2}{n-k_2}}$
where n is the number of data points.
The RSS are random variables described by a probability distribution. The sum of independent and standard normal variables follow the pdf of the chi-square distribution. This happens to be the case in the equation above. The ratio of the two scaled chi-square distributions is described by the F-distribution.
We can evaluate the probability of occurrence p and compare it to our threshold value $\alpha$ (some small number we choose). If $p < \alpha$ we can conclude that model 1 is able to explain the variance in the data better than model 2.

https://towardsdatascience.com/fisher-test-for-regression-analysis-1e1687867259

Probabilistic Model Selection (AIC, BIC)

From a probabilistic perspective, the model selection problem is solved by computing the posterior over models,

$p(m|D) = \frac{p(D|m)p(m)}{\sum_{m} p(m,D) }$

Thus, the Bayesian model selection requires to maximize the p(m|D). For a uniform prior over the models this is equivalent to maximizing the evidence or marginal likelihood $p(D|m) = \int p(D|\Theta) p(\Theta|m)d\Theta$ . Computing this integral is often intractable and therefore approximated.

Bayesian information criterion (BIC)

One popular approximation is the BIC, which is defined as

$\text{BIC} = \log p(D|\hat{\Theta}) - \frac{d}{2} \logN$

where $\hat{\Theta}$ is the maximum likelihood estimation (MLE) of the model and d is the degree of freedom in the model.

Akaike information criterion (AIC)

The AIC is a special form of the minimum description length or MDL principle, which characterizes the score for a model in terms of how well it fits the data, minus how complex the model is to define. The AIC is defined as,

$AIC=\log p(D|\hat{\Theta}) - d$

The AIC tends to pick more complex models than the BIC.

[Murphy - Chapter 5]

Correlation and Synchrony

Pearson correlation

Pearson correlation measures how two continuous signals co-vary over time. The linear relationship between these signals are given from -1 (anticorrelated) to 0 (incorrelated) to 1 (perfecly correlated).

The Pearson correlation coefficient for two random variables X_1 and X_2 is:

$\rho_{X_1, X_2} = \frac{cov(X, Y)}{\sigma_X \sigma_Y} = \frac{\mathbb{E}\left[ (X - \mu_x)(Y - \mu_Y)\right]}{\sigma_X \sigma_Y}$

For time-series on can calculate a

global correlation coefficient: a single value
local correlation coefficient: determine correlation in a rolling window over time

Caution:

outliers can skew the correlation
assuming the data is homoscedatic, i.e. constant variances

Anomaly correlation coefficient (ACC)

In climate science and meteorology correlating forecasts directly with observations may give misleadingly high values because of the seasonal variations. It is therefore established practice to subtract the climate average from both the forecast and the verification. The anomaly correlation coefficient is obtained by

$ACC = \frac{ \sum_{i} (\hat{y}_i - \bar{c}_i)(y_i - \bar{c}_i)}{\sqrt{\sum_{i}(\hat{y}_i - \bar{c}_i)^2 \sum_{i}(y_i - \bar{c}_i)^2 }}$

where $\bar{c}_i$ is the climate average over time.

https://www.jma.go.jp/jma/jma-eng/jma-center/nwp/outline2013-nwp/pdf/outline2013_Appendix_A.pdf

Time Lagged Cross Correlation (TLCC)

TLCC is a measure of similarity of two series as a function of displacement. It captures directionality between two signals, i.e. leader-follower relationship.

Idea: Similar to convolution of two signals, i.e. shifting one signal with respect to the other while repeatedly calculating the correlation.

$(f \star g)(\tau)\ \triangleq \int_{-\infty}^{\infty} f^*(t) g(t \tau)\,dt$

Windowed time lagged cross correlations (WTLCC) are an extension of TLCC where local correlations coefficients are computed for each lag-time which is then plotted as a matrix.

Dynamic Time Wrapping (DTW)

DTW computes the path between two signals that minimize the distance between the two signals. DTW computes the euclidean distance at each frame across every other frames to compute the minimum path that will match the two signals.

Properties:

deal with signals of different length
requires interpolation of missing data
DTW can be used for event based time-series as well

Instantaneous phase synchrony

For time series with oscillating properties the instantaneous phase synchrony measures the phase similarities between signals at each timepoint. The phase between signals is refered to as angle which is obtained by a Hilbert transformation of the signals. Phase coherence can be quantified by subtracting the angular difference from 1.

http://jinhyuncheong.com/jekyll/update/2017/12/10/Timeseries_synchrony_tutorial_and_simulations.html

https://towardsdatascience.com/four-ways-to-quantify-synchrony-between-time-series-data-b99136c4a9c9

Granger causality

Statistical hypothesis test to determine whether one time series is useful in forecasting another.

Information/Entropy measures

Information measures on time series amounts to constructing empirical distributions and apply Shanon Information Measures on them.

Mutual Information (MI)

MI is a measure of the mutual dependence of two random variable. Applied to two time series $Y = \{ y_i | i = 1,...,N_y\}$ and $X = \{ x_i | i = 1, ..., N_x\}$ , we first construct the empirical distributions > $p(x_i), p(y_i)$ and the joint $p(x_i, y_i)$ . The mutual information is defined as

$I (X, Y) = \sum_{x_i, y_i} p(x_i, y_i) \log \left(\frac{p(x_i, y_i)}{p(x_i) p(y_i)}\right)$

which is essential the KL-divergence of the distributions.

Estimating the empirical distribution is difficult since empirical distributions can be strongly biased if there is little data. One could also assume a distribution which however needs to be verified.

https://elife-asu.github.io/PyInform/timeseries.html

Cross-validation

Leave-one out error (LOOE)

Bootstrap

Spatial data

Euclidian distance (ACC)
correlation maps
density metrics

Categorical Performance Diagram

Representation of 2x2 verification tables whose axes are 1–FAR (false alarm ration) and H (hit rate). Where

hit rate is the ratio of correct forecasts to the number of times this event occurred. Equivalently this statistic can be regarded as the fraction of those occasions when the forecast event occurred on which it was also forecast, and so is also called the probability of detection (POD).
FAR is the fraction of “yes” forecasts that turn out to be wrong, or that proportion of the forecast events that fail to materialize. The FAR has a negative orientation, so that smaller values of FAR are to be preferred. The best possible FAR is zero, and the worst possible FAR is one. [Wilks2020]

Categorical Performance Diagram. Each combinations of line represents a performance of one NN. Best performance is defined to be a diagonal line, i.e. having equal tendency to overestimate and underestimate. [Uphof2020]

Heat Maps

Saliency maps

Saliency maps quantify the influence of changes of each input value x (i.e., each predictor at each grid point) on changes of the activation of some part p of the NN (could be neurons or final prediction). Thus, we get a map of gradients $\frac{\partial p}{\partial x_{i,j,k}}$ for on predictor p.

saliency maps have the same dimension as the input
no importance measure

Gradient-weighted class-activation maps

Grad-CAM quantifies the influence of each grid point on the predicted probability of a give class p_k at a given convolutional layer in the network. In other words, at a given depth in the network, Grad-CAM indicates which spatial locations support the pre-diction of the kth class.

only for classification
only for CNNs

Layer-wise relevance propagation (LRP)

LRP uses the network weights and the neural activations created by the forward-pass to propagate the output back through the network up until the input layer. There we visualize which pixels really contributed to the output.

The contribution of an input to a particular prediction c is called relevance $R$. The relevance map is obtained by propagating the prediction back to the input space. The basic LRP rule is

$R_j = \sum_{k} \frac{a_j w_{jk}}{\sum_{j} a_j w_{jk} } R_k$

where a is the activation of neuron j and $w_{jk}$ is the weight between neuron j and k.

https://towardsdatascience.com/indepth-layer-wise-relevance-propagation-340f95deb1ea

[Uphoff2020]

Spatial temporal data

Latitude weighted RMSE

Skill metric used in climate science to compare time-series from different spatial points:

$RMSE = \frac{1}{N} \sum_{i}^{N} \sqrt{\frac{1}{N_\text{lat}} \frac{1}{N_\text{lon}} \sum_{j}^{N_\text{lat}} \sum_{k}^{N_\text{lon}} L(j)(\hat{y}_{i,j,k} - y_{i,j,k})^2}$

with latitude weighting factor

$L(j) = \frac{\cos(\text{lat}(j))}{\frac{1}{N_\text{lat}} \sum_{j}^{N_\text{lat}} \cos(\text{lat}(j))}$

Cluster data

rang-index
PIT diagram
reliability diagram

Probability Distributions

KL-divergence

Wasserstein metric

The Wasserstein metric is a metric from Optimal Transport, being able to compare the similarity of two arbitrary probability distributions. The core concept here is transportation: how costly is it to change one probability distribution, so it becomes the other distribution? The more costly this transformation is, the less similar the two distributions are.

https://victorzhaoblog.wordpress.com/2019/04/13/wasserstein-distance-vs-dynamic-time-warping/

Wasserstein metric have recently be applied to time-series. https://arxiv.org/pdf/1912.05509.pdf
in statistics WM is called earth mover's distance (EMD)
computationally expensive method
computational methods: minimum cost flow problem, e.g. the network simplex algorithm

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

scores_metrics

Prediction scores and metrics

Time series data

Scale dependent errors

Percentage errors

Scaled errors

Goodness of Fit

Probabilistic Model Selection (AIC, BIC)

Correlation and Synchrony

Information/Entropy measures

Cross-validation

Spatial data

Categorical Performance Diagram

Heat Maps

Spatial temporal data

Cluster data

Probability Distributions

Table of contents

Clone this wiki locally