Feature request: Sample weights #17
Yes, this feature is useful in many applications. Handling this issue via likelihoods, with a few commonly used ones predefined in operon, is more generally applicable; weights for a loss function are a special case.
Hmmm, I think these are different things. The likelihood method that you are mentioning applies to data that have uncertainties associated with the target values. What I suggest is to apply these weights at a statistical sampling level. It is common to have sampled data which need to be statistically re-weighted. A good example is social data, where sampling is difficult and, starting from a random sample, one then needs to re-weight the sample to better approximate known global statistics (apparently this is called post-stratification). Another example happens a lot in physics, when sampling data using an MCMC biased towards the tails of the distribution; the end sample is then re-weighted back to the "true" distribution. In both cases, one uses weights to derive the statistics of the dataset through weighted averages, standard deviations, etc. As such, I believe that the first case (your suggestion) is a different type of loss function, say "chi-squared MSE" (or non-constant-std MSE), whereas the second case (my suggestion) is a statistical re-weighting of virtually any loss function.
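To make the distinction concrete, here is a minimal numpy sketch (not operon code; all names and values are illustrative): case 1 folds per-point target uncertainties into the loss itself, while case 2 re-weights an ordinary loss by sampling weights.

```python
import numpy as np

rng = np.random.default_rng(0)
y_true = rng.normal(size=100)
y_pred = y_true + rng.normal(scale=0.1, size=100)
residuals = y_true - y_pred

# Case 1 ("chi-squared MSE"): each target y_i has a measurement
# uncertainty sigma_i; a Gaussian likelihood leads to a loss that
# divides each residual by its own sigma.
sigma = rng.uniform(0.05, 0.5, size=100)
chi2_loss = np.mean((residuals / sigma) ** 2)

# Case 2 (statistical re-weighting): each sample carries a sampling
# weight w_i (e.g. from post-stratification or MCMC re-weighting);
# the same plain loss is averaged with weights w_i.
w = rng.uniform(0.1, 1.0, size=100)
weighted_mse = np.sum(w * residuals**2) / np.sum(w)

print(chi2_loss, weighted_mse)
```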
I was referring to maximum likelihood estimation of parameters.
Oh, completely. There are many loss functions in the ML literature that do not have a likelihood interpretation. My request, however, is at the level of how the loss is computed to train the ML model over the data sample, i.e. at the distributional level of the data points. For example, when computing the MSE one takes a plain average over the samples, which implies that all data points contribute equally. Now, I might have a data generating process, i.e. a sampling procedure, that does not produce data points with the same probability, but instead assigns each point a sampling weight. This functionality exists out-of-the-box for many scikit-learn estimators.
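For reference, this is what the scikit-learn interface mentioned above looks like; a small sketch with synthetic data (the weights here are arbitrary placeholders):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)
w = rng.uniform(0.1, 1.0, size=200)  # per-sample weights from the sampling procedure

# Many scikit-learn estimators accept per-sample weights when fitting...
model = LinearRegression().fit(X, y, sample_weight=w)

# ...and most metrics accept them as well, returning a weighted average.
mse = mean_squared_error(y, model.predict(X), sample_weight=w)
print(mse)
```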
There's already a plan in place to factor out the metrics to a different library. Do the metrics in there satisfy your use case?
Yes! This is exactly the requested functionality!
In this work (preprint), we had to resample the dataset so that the target variable had a flat distribution, to prevent `pyoperon` from only learning about the most common values. A better alternative, which `pyoperon` does not currently support, is to use `sample_weights` when computing the loss function. For example, we would have

$$loss_{total} = \frac{1}{\sum_i w_i} \sum_i w_i \times loss(X_i, y_i)$$

If one provides `X`, `y`, and `w` (sample weights), then the loss function is computed as the weighted average of the losses of each data instance; this is already supported by many `scikit-learn` estimators.

The use cases are varied: from specifying the focus of a regression task, to using datasets that have been sampled in a way that the data come with sampling weights (this is common practice in the social sciences and in simulation-based inference in STEM).
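A minimal sketch of the requested behavior, assuming per-sample losses can be evaluated elementwise (`weighted_loss` is a hypothetical helper for illustration, not existing pyoperon API):

```python
import numpy as np

def weighted_loss(loss_fn, y_true, y_pred, w):
    """Weighted average of per-sample losses:
    loss_total = sum_i(w_i * loss(y_i, yhat_i)) / sum_i(w_i)."""
    per_sample = loss_fn(y_true, y_pred)  # elementwise losses, shape (N,)
    return np.sum(w * per_sample) / np.sum(w)

# Example with a squared-error loss; values are illustrative only.
squared_error = lambda y, yhat: (y - yhat) ** 2
y = np.array([1.0, 2.0, 3.0])
yhat = np.array([1.1, 1.9, 3.5])
w = np.array([0.2, 0.3, 0.5])
print(weighted_loss(squared_error, y, yhat, w))  # weighted MSE
```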