Draft implementation of point estimation #281

Draft
wants to merge 6 commits into base: dev
Conversation

@han-ol (Contributor) commented Dec 12, 2024

This (draft) pull request is meant to hold discussion and development of point estimation within BayesFlow.

The functionality itself was also discussed in issue #121.

The implementation should make it easy to

  • use standard point estimators,
  • extend an existing collection with more complex point estimation,
  • and use custom research estimators.

Commit 093785d contains a first example, covering only quantile estimation.

Writing down draft specifications and guiding thoughts

Names for everything

What is "inference", does it include point estimation? Are we calling networks discussed below PointInferenceNetworks, PointRegressors, or something else?
For now I stick with ContinuousPointApproximator and PointInferenceNetwork to make their roles in relation to the existing codebase obvious.

Components

A ContinuousPointApproximator parallels the ContinuousApproximator and bundles a feed-forward architecture with an optional summary network, suitable for learning point estimates that minimize some Bayes risk.
Thus it serves the same roles (including Adapter calls, summary network, etc.), but instead of a sample method it provides an estimate method.

PointInferenceNetwork parallels InferenceNetwork: where InferenceNetwork is a base class to inherit from for generative model classes that approximate a conditional distribution, PointInferenceNetwork is a base class for feed-forward model classes suitable for point estimation.
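To make the parallel concrete, here is a minimal skeleton of the two roles (only the method names follow this draft; everything else is an illustrative assumption):

import keras


class PointInferenceNetwork(keras.layers.Layer):
    """Base class for feed-forward networks that output point estimates."""

    def call(self, conditions):
        raise NotImplementedError


class ContinuousPointApproximator:
    """Bundles a point inference network with an optional summary network."""

    def __init__(self, inference_network, summary_network=None):
        self.inference_network = inference_network
        self.summary_network = summary_network

    def estimate(self, conditions):
        # Adapter handling omitted; mirrors ContinuousApproximator.sample().
        if self.summary_network is not None:
            conditions = self.summary_network(conditions)
        return self.inference_network(conditions)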

Convenient default estimator

The API for functionality that covers the needs of most users could be something like

point_inference_net = NameForStandardCompositePointInferenceNetwork()
point_approximator = ContinuousPointApproximator(
	summary_network = ...,
	inference_network = point_inference_net,
)
# define optimizer, compile(), fit() as usual

with optional constructor arguments to tweak it a bit:

NameForStandardCompositePointInferenceNetwork.__init__(
	self,
	mean: bool = True,
	std: bool = True,
	quantile_levels: Sequence[float] = [0.1, 0.5, 0.9],
	subnet: str | type = "mlp",
	**kwargs,  # e.g. subnet_kwargs = dict(widths=[64, 32], dropout=0.1)
)

Choices in the output of estimate

For inference, the method estimate(data_dict) produces the point estimates for a given input/condition/observation.

The default PointInferenceNetwork would produce five point estimates (mean, std, and three quantiles) of the marginal distribution for each inference variable.

These estimates need more explanation than samples from generative networks typically do. We need to communicate to the user, diagnostics code, or other downstream tasks which estimate lands where. It seems to me that it would be helpful if such estimate output were structured as a dictionary with the point estimates' names as keys, like
dict(mean=tensor, std=tensor, ...)
rather than as a tensor concatenating all individual estimates.
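For illustration, with a batch of 32 and four inference variables, such a dictionary could look like this (placeholder tensors, hypothetical key names):

import keras

estimates = {
    "mean": keras.ops.zeros((32, 4)),          # (batch, variables)
    "std": keras.ops.zeros((32, 4)),           # (batch, variables)
    "quantiles": keras.ops.zeros((32, 3, 4)),  # (batch, levels, variables)
}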

Architecture

The architecture has one shared "body" network as well as separate "head" networks for each scoring rule.

The subnet keyword argument defaults to "mlp"; in general, the argument is resolved by find_network from bayesflow.utils, which accepts a string for predefined networks or a user-defined custom class.
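Schematically, the resolution could look like this (a sketch assuming find_network forwards keyword arguments to the constructed network; the kwargs mirror the subnet_kwargs example above):

from bayesflow.utils import find_network

# Resolve a predefined network by name, forwarding subnet kwargs:
subnet = find_network("mlp", widths=[64, 32], dropout=0.1)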

  • Future extension direction: I have had good experiences with regression forests and boosted regression trees as point estimators for cases where a learned summary network is unnecessary. They are attractive for their robust performance and near-instant fitting when applicable. sklearn and xgboost have mature implementations, which are naturally not in pure keras3. Can we still somehow make it possible to use a preexisting random forest / tree boosting implementation?
  • Currently the non-shared networks are just linear layers. We could relax this and provide optional subnet_heads that can replace the default linear heads. Since I can't currently see anyone having a preference against weight sharing, I am avoiding the complexity for now.
  • For some point estimates (scoring rules) it might be beneficial to specify a specialized activation function, for example exploiting that quantiles ordered by quantile level are always sorted, since the CDF increases monotonically (see the sketch after this list).
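For instance, ordered quantile outputs can be enforced by predicting the lowest quantile directly and adding non-negative increments for the higher levels; a sketch of such a non-parametric activation (not part of the draft code):

import keras


def ordered_quantile_activation(raw):
    """Map raw head outputs of shape (batch, num_levels, num_variables)
    to quantile predictions sorted along the level axis."""
    first = raw[:, :1, :]
    # softplus makes the remaining entries non-negative increments
    increments = keras.ops.softplus(raw[:, 1:, :])
    return keras.ops.concatenate(
        [first, first + keras.ops.cumsum(increments, axis=1)], axis=1
    )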

Extendable design

The first draft only includes a PointInferenceNetwork subclass for quantile estimation, currently called QuantileRegressor and found in bayesflow/networks/regressors/quantile_regressor.py.
This is just the first step; we want to support different loss functions (typically called scoring rules in this context) that result in other point estimates.

A more flexible API including custom scoring rules/losses could use a PointInferenceNetwork that accepts a single ScoringRule or a sequence of them.
This network can generate the appropriate number of heads in the build() method.
It can also pass the respective outputs to the corresponding scoring rules, which compute their individual loss contributions and sum them up.

PointInferenceNetwork.__init__(
	self,
	loss_func: ScoringRule = None,
	loss_funcs: Sequence[ScoringRule] = None,
	subnet: str | type = "mlp",
	**kwargs,  # e.g. subnet_kwargs = dict(widths=[64, 32], dropout=0.1)
)
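A rough sketch of how such a network could generate heads in build() and sum up the scores (the class name and all internals are hypothetical illustrations, not the draft's actual implementation):

import math

import keras


class MultiRulePointNetwork(keras.layers.Layer):
    def __init__(self, scoring_rules, num_variables, **kwargs):
        super().__init__(**kwargs)
        # In practice, num_variables would be inferred from shapes in build().
        self.scoring_rules = scoring_rules
        self.num_variables = num_variables
        self.heads = {}

    def build(self, input_shape):
        # One linear head per scoring rule, sized from its prediction_shape.
        for rule in self.scoring_rules:
            units = math.prod(rule.prediction_shape) * self.num_variables
            self.heads[rule.name] = keras.layers.Dense(units)

    def compute_loss(self, embedding, realized_value):
        # Each rule scores the output of its own head; contributions are summed.
        total = 0.0
        for rule in self.scoring_rules:
            prediction = keras.ops.reshape(
                self.heads[rule.name](embedding),
                (-1, *rule.prediction_shape, self.num_variables),
            )
            total += rule.score(prediction, realized_value)
        return total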

A ScoringRule / ScoringLoss has a name (we want to distinguish multiple of them later) and a score / compute method.

  • In the method name, score is meant as a verb.
  • It seems the literature has not converged on whether a score (noun) is always maximized (i.e., the opposite of a loss) or can be either -> terminology question.

A ScoringRule could also compute a prediction_shape in the constructor to be accessed by generic code that generates a corresponding head for the scoring rule.

Choice we have here

  • prediction_shape could also be a prediction_size, an int rather than a tuple
  • it is possible to score for example a whole covariance matrix or other multi-dimensional predictions, so allowing an arbitrary shape might be a good idea.

The head would have output_shape=(*batch_shape, *prediction_shape, num_variables)

  • e.g.:
    • a batch_size of 32 -> batch_shape of (32,) and
    • a one-dimensional point estimator -> prediction_shape of (1,)
    • and 4 inference variables
    • -> (32, 1, 4) shall be head output_shape
    • build() creates a Dense layer with 1*4 = 4 units followed by a reshape operation, which together form a head (see the sketch after this list)
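A head along these lines, as a sketch with the shapes from the example:

import math

import keras

prediction_shape = (1,)
num_variables = 4

head = keras.Sequential(
    [
        keras.layers.Dense(math.prod(prediction_shape) * num_variables),
        keras.layers.Reshape((*prediction_shape, num_variables)),
    ]
)

embedding = keras.ops.zeros((32, 16))  # (batch_size, embedding_dim)
print(head(embedding).shape)           # (32, 1, 4)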

Interaction with the choice of output of the estimate method: if we choose to allow multidimensional prediction_shapes rather than only prediction_sizes, we cannot concatenate the outputs of different heads, since each can have an individual prediction_shape.
Thus this decision interacts with whether the estimate method should return a tensor or a dict of tensors (see above).

Below are some notes on how ScoringRule definitions could look.

import keras
from collections.abc import Sequence


class ScoringRule:
    def __init__(
        self,
        name: str = None,
        **kwargs,
    ):
        self.name = name

    def score(self, prediction, realized_value):
        raise NotImplementedError


class NormedDifferenceLoss(ScoringRule):
    def __init__(
        self,
        name: str = "normed_difference",
        k: int = 2,  # k=2 results in an estimator for the mean
        **kwargs,
    ):
        super().__init__(name=name, **kwargs)
        self.k = k
        self.prediction_shape = (1,)

    def score(self, prediction, realized_value):
        # broadcast realized values over the prediction axis: (batch, 1, variables)
        pointwise_difference = prediction - realized_value[:, None, :]
        score = keras.ops.absolute(pointwise_difference) ** self.k
        return keras.ops.mean(score)


class QuantileLoss(ScoringRule):
    def __init__(
        self,
        name: str = "quantile",
        quantile_levels: Sequence[float] = (0.1, 0.5, 0.9),
        **kwargs,
    ):
        super().__init__(name=name, **kwargs)
        self.quantile_levels = keras.ops.convert_to_tensor(quantile_levels)
        self.prediction_shape = (len(quantile_levels),)

    def score(self, prediction, realized_value):
        # pinball loss over (batch, quantile levels, variables)
        pointwise_difference = prediction - realized_value[:, None, :]
        score = pointwise_difference * (
            keras.ops.cast(pointwise_difference > 0, "float32")
            - self.quantile_levels[None, :, None]
        )
        return keras.ops.mean(score)
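A quick sanity check of the QuantileLoss sketch on placeholder tensors (the expected value follows from the pinball loss by hand):

import keras

rule = QuantileLoss()
prediction = keras.ops.zeros((32, 3, 4))  # (batch, quantile levels, variables)
realized = keras.ops.ones((32, 4))        # (batch, variables)

# the difference is -1 everywhere, so the score is mean(tau) = 0.5
print(rule.score(prediction, realized))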

If we want to support specific activation functions for different scoring rules, we might add a non-parametric activation function to the ScoringRule's definition.
We could also go all the way and have the ScoringRule contain the head itself, taking the last joint embedding and mapping it to estimates. This would naturally include an optional nonlinearity at the end and simultaneously give users the option to tweak the architecture of the non-shared weights.

Other notes:

  • For a PointInferenceNetwork that returns mean and std, we could provide sample() and log_prob() under a Gaussian assumption (a sketch follows this list). Maybe someone is interested in implementing things like a LaplaceApproximator.
  • later: is there anything to consider for point estimation to fit neatly with the bf.Workflow?
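Regarding the first bullet, a minimal sketch of Gaussian sample() and log_prob() helpers (function names and signatures are assumptions for illustration):

import math

import keras


def gaussian_sample(mean, std, num_samples):
    """Draw samples from N(mean, std^2); mean and std have shape (batch, variables)."""
    eps = keras.random.normal((num_samples, *mean.shape))
    return mean[None] + std[None] * eps


def gaussian_log_prob(x, mean, std):
    """Elementwise Gaussian log-density."""
    z = (x - mean) / std
    return -0.5 * z**2 - keras.ops.log(std) - 0.5 * math.log(2.0 * math.pi)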

@han-ol marked this pull request as draft on December 12, 2024, 14:02
@han-ol (Contributor, Author) commented Dec 12, 2024

Find a notebook to look at training and inference for such a point estimator here: https://github.com/han-ol/bayesflow/blob/point-estimation/examples/draft-point-est.ipynb

@han-ol (Contributor, Author) commented Dec 12, 2024

After writing this up, in my opinion the best structure is probably to have:

  • objects that specify both a loss function and a head (as proposed at the end of the first comment above),
  • as well as an object constructed from a sequence of those that bundles them together as one PointInferenceNetwork.

For class names I would propose:
The heads could be named after the estimator they implement, like MeanEstimator, QuantileEstimator, CovarianceEstimator, ..., with a base class called Estimator.

If we want to support convenient creation of heads by just specifying a loss function, this can be done by subclassing a BasicEstimator with a one-dimensional prediction_shape and a linear (or small MLP) architecture, or even more succinctly by providing a corresponding macro.

Additionally, I would prefer an implementation where estimate() returns a dictionary with named estimates corresponding to the individual heads.

After collecting some of your thoughts I would proceed with implementing whatever we converge on. What do you think?

@stefanradev93 (Contributor)
Hans, thanks for the great ideas and the very mature first draft! Here are some initial thoughts from my side. More to come later:

  • What do you think about the possibility of inheriting from ContinuousApproximator instead of Approximator? As far as I can tell, the only conceptual differences are in compute_metrics and estimate vs. sample.
  • I wouldn't go for sample methods on the point estimators in order to keep their interface consistent. Even though it may make sense for certain kinds of parametric point estimators (e.g., heteroskedastic regression), one rarely needs anything beyond the distribution parameters in these cases.
  • I would definitely go for named entities in the output of estimate. Perhaps it's time to start thinking in terms of "data classes".
  • I like the idea of an abstract ScoringRule class. I would call the arguments targets and references in keeping with the rest of the lib (e.g., diagnostics).
  • The WorkFlow object would need to know whether it's dealing with a point estimator or a fully Bayesian approximator. It can check whether an estimate or a sample method is present, or check for a type (the latter may invalidate my point about inheriting from ContinuousApproximator).
  • The "non-parametric activation function" for scoring rules is a good idea, because it lets users specify heads with linear output layers (@paul-buerkner - we may call these "link functions"?).
  • I would go for a single argument in the PointInferenceNetwork that can be either a single scoring rule or a sequence thereof.

@paul-buerkner (Contributor)
This looks really cool already! Thank you for this PR!

There is a lot of content in this thread already, and I may benefit from a call where you show me the current state. That would help me give reasonable feedback. I will contact you offline about it.

@han-ol (Contributor, Author) commented Dec 17, 2024

Ok cool, Paul!

Stefan, thanks for your takes already! Some notes to some of them:

  • What do you think about the possibility of inheriting from ContinuousApproximator instead of Approximator? As far as I can tell, the only conceptual differences are in compute_metrics and estimate vs. sample.

I am not sure about this. Tagging @LarsKue for this question; I think you mentioned a preference for parallel implementation rather than inheritance?

  • I would definitely go for named entities in the output of estimate. Perhaps it's time to start thinking in terms of "data classes".

Agreed that some form of naming is necessary. How do the "data classes" you imagine differ from dictionaries of the type:

# assuming a batch_size of two, two quantile levels, and three inference variables:
dict(
    mean=[
        [1, 2, 3],
        [2, 1, 3]
    ],                               # shape=(2,3)
    quantiles=[
        [[-1, 0, 1], [3, 4, 5]],
        [[1, -1, -2], [3, 2, 5]]
    ],                               # shape=(2,2,3)
    ...
)

For one thing, it might be good to make the quantile levels accessible somewhere close to the estimated quantiles. This could be part of a data class.
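For instance, a small data class could carry the levels right next to the estimates (a sketch of the idea, not a proposed final API):

from dataclasses import dataclass
from typing import Any


@dataclass
class QuantileEstimate:
    quantiles: Any  # tensor of shape (batch, levels, variables)
    quantile_levels: tuple[float, ...]


estimate = QuantileEstimate(
    quantiles=[[[-1, 0, 1], [3, 4, 5]], [[1, -1, -2], [3, 2, 5]]],
    quantile_levels=(0.1, 0.9),
)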

  • I like the idea of an abstract ScoringRule class. I would call the arguments targets and references in keeping with the rest of the lib (e.g., diagnostics).

👍
We need to decide on which class holds the heads' neural networks. It seems reasonable to bundle them with the ScoringRule directly, because there is a one-to-one relationship between head and ScoringRule.
If you prefer to keep the head out of an abstract ScoringRule, I would propose to have a Head object which owns both an abstract ScoringRule and a keras.Layer.

  • The WorkFlow object would need to know whether it's dealing with a point estimator or a fully Bayesian approximator. It can check whether an estimate or a sample method is present, or check for a type (the latter may invalidate my point about inheriting from ContinuousApproximator).

Ok! Just in their defence, I'd say point estimators are also fully Bayesian ;) They are functionals of the proper Bayesian posterior distribution.

  • I would go for a single argument in the PointInferenceNetwork that can be either a single scoring rule or a sequence thereof.

Good point! Using both loss_func and loss_funcs is a bit convoluted for no good reason.

@stefanradev93 (Contributor)
Regarding the output, I think we can go with dictionaries for now. I believe some custom data classes will come in handy rather soon.

Just a thought: Can the heads simply be determined automatically assuming the scoring rules know their dims?

@han-ol (Contributor, Author) commented Dec 21, 2024

Yes, they can mostly be inferred, and I'd suggest linear layers followed by a reshape as an overridable default.

However, some scoring rules benefit from a specific architecture (e.g., quantiles, which are monotonically increasing) or require one (e.g., a covariance matrix, which must be positive semidefinite); a sketch of the covariance case follows.
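For the covariance case, one standard construction is to predict a raw matrix and form a positive semidefinite one from it; a sketch of the general technique (not code from this PR):

import keras


def psd_from_raw(raw, dim, eps=1e-6):
    """Map raw head outputs of shape (batch, dim * dim) to symmetric
    positive definite matrices via A @ A^T + eps * I."""
    a = keras.ops.reshape(raw, (-1, dim, dim))
    at = keras.ops.transpose(a, axes=(0, 2, 1))
    return keras.ops.matmul(a, at) + eps * keras.ops.eye(dim)


raw = keras.random.normal((32, 16))  # head output for dim = 4
cov = psd_from_raw(raw, dim=4)       # (32, 4, 4), PSD by construction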
