Feature request: Sample weights #17
Yes, this feature is useful in many applications. Handling this issue via likelihoods, with a few commonly used ones predefined in operon, is more generally applicable; weights for a loss function are a special case.
Hmmm, I think these are different things. The likelihood method that you are mentioning applies to data that have uncertainties associated with the target values. What I suggest is to apply these weights at a statistical sampling level. It is common to have sampled data which need to be statistically re-weighted. A good example is social data, where sampling is difficult and, starting from a random sample, one then needs to re-weight the sample to better approximate known global statistics (apparently this is called post-stratification). Another example happens a lot in physics, when sampling data using an MCMC biased towards the tails of the distribution; the end sample is then re-weighted back to the "true" distribution. In both cases, one uses weights to derive the statistics of the dataset through weighted averages, standard deviations, etc. As such, I believe that the first case (your suggestion) is a different type of loss function, say "chi-squared MSE" (or non-constant-std MSE), whereas the second case (my suggestion) is a statistical re-weighting of virtually any loss function.
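To make the distinction concrete, here is a minimal numpy sketch (not operon code; all names and values are illustrative): case 1 folds per-point target uncertainties into the loss itself, while case 2 re-weights an ordinary loss by sampling weights.

```python
import numpy as np

rng = np.random.default_rng(0)
y_true = rng.normal(size=100)
y_pred = y_true + rng.normal(scale=0.1, size=100)
residuals = y_true - y_pred

# Case 1 ("chi-squared MSE"): each target y_i has a measurement
# uncertainty sigma_i; a Gaussian likelihood leads to a loss that
# divides each residual by its own sigma.
sigma = rng.uniform(0.05, 0.5, size=100)
chi2_loss = np.mean((residuals / sigma) ** 2)

# Case 2 (statistical re-weighting): each sample carries a sampling
# weight w_i (e.g. from post-stratification or MCMC re-weighting);
# the same plain loss is averaged with weights w_i.
w = rng.uniform(0.1, 1.0, size=100)
weighted_mse = np.sum(w * residuals**2) / np.sum(w)

print(chi2_loss, weighted_mse)
```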
I was referring to maximum likelihood estimation of parameters.
Oh, completely. There are many loss functions in the ML literature that do not have a likelihood interpretation. My request, however, is at the level of how the loss is computed to train the ML model over the data sample, i.e. at the distributional level of the data points. For example, when computing the MSE one takes a plain average over the samples, which implies that all data points contribute equally. Now, I might have a data generating process, i.e. a sampling procedure, that does not produce data points with the same probability, but instead assigns each point a sampling weight. This functionality exists out-of-the-box for many scikit-learn estimators.
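For reference, this is what the scikit-learn interface mentioned above looks like; a small sketch with synthetic data (the weights here are arbitrary placeholders):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)
w = rng.uniform(0.1, 1.0, size=200)  # per-sample weights from the sampling procedure

# Many scikit-learn estimators accept per-sample weights when fitting...
model = LinearRegression().fit(X, y, sample_weight=w)

# ...and most metrics accept them as well, returning a weighted average.
mse = mean_squared_error(y, model.predict(X), sample_weight=w)
print(mse)
```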
There's already a plan in place to factor out the metrics to a different library. Do the metrics in there satisfy your use case?
Yes! This is exactly the requested functionality!
In this work (preprint), we had to resample the dataset so that the target variable had a flat distribution, to prevent `pyoperon` from only learning about the most common values. A better alternative, which `pyoperon` does not currently support, is to use `sample_weights` when computing the loss function. For example, we would have

$$loss_{total} = \frac{1}{\sum_i w_i} \sum_i w_i \times loss(X_i, y_i)$$

If one provides `X`, `y`, and `w` (sample weights), then the loss function is computed as the weighted average of the losses of each data instance; this is already supported by many `scikit-learn` estimators.

The use cases are varied: from specifying the focus of a regression task, to using datasets that have been sampled in a way that the data come with sampling weights (this is common practice in the social sciences and in simulation-based inference in STEM).
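A minimal sketch of the requested behavior, assuming per-sample losses can be evaluated elementwise (`weighted_loss` is a hypothetical helper for illustration, not existing pyoperon API):

```python
import numpy as np

def weighted_loss(loss_fn, y_true, y_pred, w):
    """Weighted average of per-sample losses:
    loss_total = sum_i(w_i * loss(y_i, yhat_i)) / sum_i(w_i)."""
    per_sample = loss_fn(y_true, y_pred)  # elementwise losses, shape (N,)
    return np.sum(w * per_sample) / np.sum(w)

# Example with a squared-error loss; values are illustrative only.
squared_error = lambda y, yhat: (y - yhat) ** 2
y = np.array([1.0, 2.0, 3.0])
yhat = np.array([1.1, 1.9, 3.5])
w = np.array([0.2, 0.3, 0.5])
print(weighted_loss(squared_error, y, yhat, w))  # weighted MSE
```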