
PROTEUS grid-search or forward model #204

Open
timlichtenberg opened this issue Oct 7, 2024 · 7 comments
Labels
ensembles Relating to grids or forward models Priority 4: tbd Priority level 4: nice to have features and/or has some time

Comments

@timlichtenberg
Collaborator

timlichtenberg commented Oct 7, 2024

For moving towards an inverse method of PROTEUS sometime down the road, we need to consider a computationally feasible approach to run many models to fit a given set of observations.

To give an example of the problem: let's assume a given exoplanet has the following known/observed parameters with uncertainties: stellar age, orbital distance, planet radius, planet mass, and a transmission/emission spectrum. Given these parameters, we would like to compute the best-fitting PROTEUS models over a set of input parameters, and then compute a goodness-of-fit metric. This is essentially the description of an atmospheric retrieval, except that PROTEUS simulations are far too computationally expensive to perform 100k+ of them.

I am not yet certain what the best strategy is to approach this problem. Here are a few options, each with opportunities and drawbacks:

  • A modified chi-squared or r-squared algorithm to compute some measure of goodness-of-fit for an arbitrary grid. E.g. Madhusudhan & Seager (2009).
  • Train a machine learning model on the simulation data and use the machine learning model for the retrieval, e.g., Ardevol Martinez et al. (2024).
  • Brute-force retrieval approach, e.g., nested sampling, MCMC, or some other random sampler, ignoring the computational cost and likely yielding a low-confidence result.
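The first option in the list above is straightforward to sketch. Here is a minimal, hypothetical example of a chi-squared goodness-of-fit over a pre-computed model grid, in the spirit of Madhusudhan & Seager (2009); all data and variable names are illustrative stand-ins, not part of PROTEUS:

```python
import numpy as np

rng = np.random.default_rng(0)

# Observed spectrum: 10 wavelength bins with 1-sigma uncertainties (mock data).
obs = np.array([1.0, 1.2, 0.9, 1.1, 1.3, 0.8, 1.0, 1.05, 0.95, 1.15])
sigma = np.full_like(obs, 0.05)

# A grid of 100 forward-model spectra (rows), e.g. one per parameter combination.
# In practice each row would come from a PROTEUS run.
grid = obs + rng.normal(0.0, 0.2, size=(100, obs.size))

# Reduced chi-squared for each grid point (no fitted parameters subtracted here).
chi2 = np.sum(((grid - obs) / sigma) ** 2, axis=1)
chi2_red = chi2 / obs.size

best = np.argmin(chi2_red)
print(f"best-fit grid index: {best}, reduced chi2: {chi2_red[best]:.2f}")
```

The same `chi2_red` array can then feed a goodness-of-fit map over the grid's input parameters, or a crude confidence region via delta-chi-squared thresholds.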
@timlichtenberg timlichtenberg converted this from a draft issue Oct 7, 2024
@nichollsh
Contributor

I agree that this would be incredibly powerful. I can imagine that running an MCMC (or similar method) would be tricky because of the slow runtimes. When we are ready to look into this, maybe we could involve someone who has experience doing retrievals with large models?

@nichollsh
Contributor

The ML paper you cited is interesting - they ran 50k simulations to train the model. I am finding that a grid of 22 simulations takes about 14 hours to run (on 22 threads, i.e. one simulation per thread). Scaling this to 50k simulations on 256 threads would take 50000 × 14 / 256 ≈ 2734 hours ≈ 114 days. We could of course speed this up by reducing the resolution, etc.

@timlichtenberg
Collaborator Author

timlichtenberg commented Oct 8, 2024

I believe they need fewer simulations than a "normal" Bayesian model, which is one of their selling points. Even so, 100k simulations are not out of reach on a large-scale computing facility, and we can and should build a large simulation grid sometime in the next year, once the current plans with aragog and zephyrus are done. Cosmology solves this problem by running updated large-scale forward models every few years with high-performance codes (e.g. the TNG project) and then training machine-learning models on them. That is one way to go, but if we can find an algorithm that runs highly specialised simulations to compute the Bayesian evidence directly for a single planet on a ~week(s) timescale, I think that would be preferable.

@nichollsh
Contributor

A simpler option might be to run a grid of models (>~2000 points) and use a clustering algorithm on their binned spectra (or bulk density) to identify groups. This could pick out features that let us infer the parameters of an observed planet by identifying the group into which it best fits.

https://hdbscan.readthedocs.io/en/stable/advanced_hdbscan.html

@timlichtenberg
Collaborator Author

This is pretty cool! I like it a lot; it seems like a good solution for analysing our outputs. However, it still requires us to set the parameter space by hand, which means we run many models that are not particularly useful/necessary and do not add valuable information. Bayesian model selection or something similar would save computation time until a statistically robust answer is achieved.

@lsoucasse lsoucasse added the ensembles Relating to grids or forward models label Nov 11, 2024
@timlichtenberg timlichtenberg added the Priority 4: tbd Priority level 4: nice to have features and/or has some time label Nov 19, 2024
@stefsmeets
Contributor

This tool came up today: https://wandb.ai/

It's meant for hyperparameter optimization in machine learning. Via a YAML file you define the parameters to search; you just have to write the interface in Python.
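For reference, a W&B sweep config for a PROTEUS-like search might look roughly like the following; the program name and parameter names here are hypothetical placeholders, not an existing interface:

```yaml
program: run_proteus_point.py   # hypothetical wrapper that runs one model
method: bayes                   # or: grid, random
metric:
  name: chi2_red                # hypothetical metric logged by the wrapper
  goal: minimize
parameters:
  orbital_distance_au:
    min: 0.01
    max: 0.5
  planet_mass_mearth:
    values: [0.5, 1.0, 2.0, 5.0]
```

The `bayes` method would let the sweep concentrate runs in promising regions rather than covering the whole grid, which speaks to the hand-set-parameter-space concern above.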

@nichollsh nichollsh moved this from TBD to JOSS Publication in PROTEUS Development Roadmap Nov 27, 2024
@nichollsh nichollsh changed the title from PROTEUS grid-search to PROTEUS grid-search or forward model Nov 27, 2024
@nichollsh
Contributor

Also on the theme of optimisation, I think we should consider emcee, since it's well established within the astronomy community.

https://emcee.readthedocs.io/en/stable/
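What emcee's `EnsembleSampler` would automate (with affine-invariant proposals and parallel walkers) can be illustrated with a dependency-free, single-chain Metropolis sketch. The "forward model" below is a deliberately cheap stand-in; in practice each likelihood call would be a full PROTEUS run, which is exactly why the runtime concern above matters:

```python
import numpy as np

rng = np.random.default_rng(2)

obs, sigma = 1.5, 0.1                 # one mock observable, e.g. planet radius

def log_likelihood(theta):
    model = theta                     # stand-in forward model: identity
    return -0.5 * ((model - obs) / sigma) ** 2

theta, logl = 0.0, log_likelihood(0.0)
samples = []
for _ in range(5000):
    prop = theta + rng.normal(0.0, 0.3)           # random-walk proposal
    logl_prop = log_likelihood(prop)
    if np.log(rng.uniform()) < logl_prop - logl:  # Metropolis acceptance rule
        theta, logl = prop, logl_prop
    samples.append(theta)

posterior = np.array(samples[1000:])              # drop burn-in
print(f"posterior mean: {posterior.mean():.2f} +/- {posterior.std():.2f}")
```

With emcee the loop would be replaced by `EnsembleSampler(nwalkers, ndim, log_prob_fn).run_mcmc(...)`, but the cost accounting is the same: total runtime scales with the number of likelihood evaluations.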
