diff --git a/paper/paper.md b/paper/paper.md index c82d3eee..c5b1f7a4 100644 --- a/paper/paper.md +++ b/paper/paper.md @@ -42,15 +42,19 @@ date: 28 November 2024 bibliography: paper.bib --- -# Summary +# Summary Simulations are ubiquitous in research and application, but are often too slow and computationally expensive to deeply explore the underlying system. One solution is to create efficient emulators (also surrogate- or meta-models) to approximate simulations, but this requires substantial expertise. Here, we present AutoEmulate, a low-code, AutoML-style python package for emulation. AutoEmulate makes it easy to fit and compare emulators, abstracting away the need for extensive machine learning (ML) experimentation. The package includes a range of emulators, from Gaussian Processes, Support Vector Machines and Gradient Boosting Models to novel, experimental deep learning emulators such as Neural Processes [@garnelo_conditional_2018]. It also implements global sensitivity analysis as a common emulator application, which quantifies the relative contribution of different inputs to the output variance. In the future, with user feedback and contributions, we aim to organically grow AutoEmulate into an end-to-end tool for most emulation problems. # Statement of need -To understand complex real-world systems, researchers and engineers often construct computer simulations. These can be computationally expensive and take minutes, hours or even days to run. For tasks like optimisation, sensitivity analysis or uncertainty quantification where thousands or even millions of runs are needed, a solution has long been to approximate simulations with efficient emulators, which can be orders of magnitudes faster [@forrester_recent_2009; @kudela_recent_2022]. Emulation is becoming increasingly widespread, ranging from engineering [@yondo_review_2018], architecture [@westermann_surrogate_2019], biomedical [@strocchi_cell_2023] and climate science [@bounceur_global_2015], to agent-based models [@angione_using_2022]. A typical emulation pipeline involves three steps: 1. Evaluating the simulation at a small, strategically chosen set of inputs using techniques such as Latin Hypercube Sampling [@mckay_comparison_1979] to create a representative dataset, 2. constructing a high-accuracy emulator using that dataset, which involves model selection, hyperparameter optimisation and evaluation and 3. applying the emulator to tasks such as prediction, sensitivity analysis, or optimisation. Building an emulator in particular requires substantial machine learning experience and knowledge of an ever increasing ecosystem of models and packages. This puts a substantial burden on practitioners whose main focus is to explore the underlying system, not building the emulator. +To understand complex real-world systems, researchers and engineers often construct computer simulations. These can be computationally expensive and take minutes, hours or even days to run. For tasks like optimisation, sensitivity analysis or uncertainty quantification where thousands or even millions of runs are needed, a solution has long been to approximate simulations with efficient emulators, which can be orders of magnitudes faster [@forrester_recent_2009; @kudela_recent_2022]. Emulation is becoming increasingly widespread, ranging from engineering [@yondo_review_2018], architecture [@westermann_surrogate_2019], biomedical [@strocchi_cell_2023] and climate science [@bounceur_global_2015], to agent-based models [@angione_using_2022]. -AutoEmulate automates emulator building, with the goal to eventually streamline the whole emulation pipeline. For people new to ML, AutoEmulate compares, optimises and evaluates a range of models to create an efficient emulator for their simulation in just a few lines of code. For experienced surrogate modellers, AutoEmulate provides a reference set of cutting-edge emulators to quickly benchmark new models against. The package includes classic emulators such as Radial Basis Functions and Gaussian Processes, established ML models like Gradient Boosting and Support Vector Machines, as well as experimental deep learning emulators such as [Conditional Neural Processes](https://yanndubs.github.io/Neural-Process-Family/text/Intro.html) [@garnelo_conditional_2018]. AutoEmulate is built to be extensible. Emulators follow the popular [scikit-learn estimator template](https://scikit-learn.org/1.5/developers/develop.html#rolling-your-own-estimator) and deep learning models are supported with little overhead using PyTorch [@paszke_pytorch_2019] with a skorch [@tietz_skorch_2017] interface. AutoEmulate fills a gap in the current landscape of surrogate modeling tools as it’s both highly accessible for newcomers while providing cutting-edge emulators for experienced surrogate modelers. In contrast, existing libraries either focus on lower level implementations of specific models, like GPflow [@matthews_gpflow_2017] and GPytorch [@gardner_gpytorch_2018], or provide multiple emulators and applications but require to manually pre-process data, compare emulators and optimise hyperparameters like SMT in Python [@saves_smt_2024] or [Surrogates.jl](https://docs.sciml.ai/Surrogates/latest/) in Julia. +A typical emulation pipeline involves three steps: 1. Evaluating the simulation at a small, strategically chosen set of inputs using techniques such as Latin Hypercube Sampling [@mckay_comparison_1979] to create a representative dataset, 2. constructing a high-accuracy emulator using that dataset, which involves model selection, hyperparameter optimisation and evaluation and 3. applying the emulator to tasks such as prediction, sensitivity analysis, or optimisation. Building an emulator in particular requires substantial machine learning experience and knowledge of an ever increasing ecosystem of models and packages. This puts a substantial burden on practitioners whose main focus is to explore the underlying system, not building the emulator. + +AutoEmulate automates emulator building, with the goal to eventually streamline the whole emulation pipeline. For people new to ML, AutoEmulate compares, optimises and evaluates a range of models to create an efficient emulator for their simulation in just a few lines of code. For experienced surrogate modellers, AutoEmulate provides a reference set of cutting-edge emulators to quickly benchmark new models against. The package includes classic emulators such as Radial Basis Functions and Gaussian Processes, established ML models like Gradient Boosting and Support Vector Machines, as well as experimental deep learning emulators such as [Conditional Neural Processes](https://yanndubs.github.io/Neural-Process-Family/text/Intro.html) [@garnelo_conditional_2018]. AutoEmulate is built to be extensible. Emulators follow the popular [scikit-learn estimator template](https://scikit-learn.org/1.5/developers/develop.html#rolling-your-own-estimator) and deep learning models are supported with little overhead using PyTorch [@paszke_pytorch_2019] with a skorch [@tietz_skorch_2017] interface. + +AutoEmulate fills a gap in the current landscape of surrogate modeling tools as it’s both highly accessible for newcomers while providing cutting-edge emulators for experienced surrogate modelers. In contrast, existing libraries either focus on lower level implementations of specific models, like GPflow [@matthews_gpflow_2017] and GPytorch [@gardner_gpytorch_2018], or provide multiple emulators and applications but require to manually pre-process data, compare emulators and optimise hyperparameters like SMT in Python [@saves_smt_2024] or [Surrogates.jl](https://docs.sciml.ai/Surrogates/latest/) in Julia. # Pipeline