Skip to content

Commit

Permalink
PR #15 Update CW 48
Browse files Browse the repository at this point in the history
Update Calender Week 48 - Custom Sigma
  • Loading branch information
davidkowalk authored Dec 1, 2024
2 parents a91934f + 7ec6866 commit 1c70611
Show file tree
Hide file tree
Showing 5 changed files with 91 additions and 22 deletions.
15 changes: 12 additions & 3 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -9,7 +9,7 @@
# Fitting Toolkit
This toolkit aims at providing flexible and powerful tools for data analysis and modelling, but remain easy to use.

Here I aim to walk the line between the extremes in this field. On the one side lie toolkits like Kafe2 which are easy and comfortable, however the resulting graphics are highly specified, cluttered and usually unfit for publication. On the other lie data analysis systems like Cern's ROOT, which are fast and highly capable, however have a steep learning curve and overshoot the requirements for most experiments.
Here, I aim to strike a balance between the two extremes in this field. On one side are toolkits such as Kafe2, which prioritize ease of use and convenience but limit user control over the output, often resulting in highly specialized graphics that frequently do not meet standards required for publication without considerable effort. On the other side are data analysis systems like CERN's ROOT, which offer exceptional speed and capability but come with a steep learning curve and often exceed the requirements of most experiments.

This toolkit is aimed primarily at my peers, students of physics at the university of bonn, and to a degree at professionals within my field. I am optimizing this toolkit to be used on the scale typical of lab courses and homework assignments but if possible it should be powerful enough to run decently sized datasets on an average laptop.

Expand All @@ -20,13 +20,18 @@ Check out the `docs` folder for documentation and tutorials.
## Quick Introduction

### Requirements
This project requires the following modules:
This project requires the following modules along with their dependencies:
- numpy
- matplotlib
- scipy

It is highly recommended that the user familiarizes themselves with the functionality of these modules first. A rudimentary understanding of `numpy` and `matplotlib.pyplot` is required.

To install the dependencies, first a [virtual environment](https://docs.python.org/3/library/venv.html) should be created. `requirements.txt` lists all necessary packages. Run:
```
pip install -r requirements.txt
```

### Getting Started

To get started find the `fitting_toolkit.py` in the `src` folder and copy it into your project.
Expand Down Expand Up @@ -65,4 +70,8 @@ Note that the fitted function is not automatically displayed. Instead the figure

![Example Graph](./docs/img/example_fit.png)

For a deeper explanation and tutorials please reference the [documentation](./docs/manual.md/).
For a deeper explanation and tutorials please reference the [documentation](./docs/manual.md/).

## Literature:
[1] Vugrin, K. W., L. P. Swiler, R. M. Roberts, N. J. Stucky-Mack, and S. P. Sullivan (2007), Confidence region estimation techniques for nonlinear regression in groundwater flow: Three case studies, Water Resour. Res., 43, W03423, https://doi.org/10.1029/2005WR004804. \
[2] Dennis D. Boos. "Introduction to the Bootstrap World." Statist. Sci. 18 (2) 168 - 174, May 2003. https://doi.org/10.1214/ss/1063994971
19 changes: 17 additions & 2 deletions docs/functions.md
Original file line number Diff line number Diff line change
Expand Up @@ -7,7 +7,7 @@ By separating the fitting functionality from the display options, a user can uti

To fit a dataset, call:
```python
curve_fit(model, xdata: np.array, ydata: np.array, yerror = None, resamples = 5000, **kwargs)
curve_fit(model, xdata: np.array, ydata: np.array, yerror = None, resamples = 5000, confidence_resolution: int = None, nsigma:float = 1, **kwargs)
```

| Parameters | | |
Expand Down Expand Up @@ -107,7 +107,7 @@ The matplotlib objects used are returned:

## Calculate Confidence Interval for an Existing Fit

Given already fitted parameters and a covariance matrix, a confidence interval can be calculated using `confidence_interval(model, xdata: np.array, params: np.array, cov: np.array, resamples: int)`
Given already fitted parameters and a covariance matrix, a confidence interval can be calculated using `confidence_interval(model, xdata: np.array, params: np.array, cov: np.array, resamples: int, nsigma: float = 1)`

| Parameters | | |
|----------|----------|-----------------|
Expand Down Expand Up @@ -141,6 +141,21 @@ So that the fitted value `x[i]` are
print(f"f({x[i]:.2e}) = {f(x[i], *params):.2e} (+{sigma_pos[i]:.2e}/-{sigma_neg[i]:.2e})")
```

## Generate Probability for Sigma Interval

To get probability to fall into n-sigma interval call `get_sigma_probability(n: float = 1)`

| Parameters | | |
|----------|----------|-----------------|
| **Name** | **Type** | **Description** |
| n | float | Number of sigmas in interval

| Returns | | |
|----------|----------|-----------------|
| **Name** | **Type** | **Description** |
| p | float | Probability of falling into sigma interval. $P(\mu - n*\sigma < X < \mu + n*\sigma) $


## Generating Thresholds

Given a bootstrapped distribution, generate a custom confidence interval.
Expand Down
32 changes: 27 additions & 5 deletions docs/manual.md
Original file line number Diff line number Diff line change
Expand Up @@ -48,7 +48,7 @@ Note that the fitted function is not automatically displayed. Instead the figure

Understanding how the tools we use work with and calculate errors and uncertainties is a vital part of judging and quantifying the quality of our work. The toolkit's `curve_fit()` function uses scipy's `curve_fit()` function which handles the fitting process as well as the calculation of errors and correlation of the fitted parameters. You can reference the modules [documentation](https://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.curve_fit.html) for a detailed explanation.

### Covariance Matrix and Sigma
### Covariance Matrix and Standard Deviation

SciPy returns a [covariance matrix](https://en.wikipedia.org/wiki/Covariance) along with the optimal parameters of the fitted function. In [error propagation](https://en.wikipedia.org/wiki/Propagation_of_uncertainty) according to gauß we assume that the measured values are uncorrelated. In the real world this is usually not the case.

Expand All @@ -60,7 +60,7 @@ The closer the absolute value of the covariance between two values is to the pro
s1, s2, ..., sn = np.sqrt(np.diagonal(cov))
```

> Note that the relationship between cov and parameter error estimates is derived based on a linear approximation to the model function around the optimum. When this approximation becomes inaccurate, cov may not provide an accurate measure of uncertainty.
> Note that the relationship between cov and parameter error estimates is derived based on a linear approximation to the model function around the optimum. When this approximation becomes inaccurate, cov may not provide an accurate measure of uncertainty. [[1]](https://doi.org/10.1029/2005WR004804)
Further the documentation notes:

Expand All @@ -82,7 +82,7 @@ To estimate the confidence interval the fitted parameters are then resampled usi

For each point on the x-axis, the fitted function is calculated using all the resampled parameter sets. Then, the `numpy.percentile` function is used to estimate the upper and lower bounds of the confidence interval. By default, these bounds are set so that 1/6 of the resampled values are above the interval and 1/6 are below it. This means that 2/3 of the resampled values fall within the interval, which corresponds to a 1-sigma confidence level.

This method is often referred to as ["bootstrapping"](https://en.wikipedia.org/wiki/Bootstrapping_(statistics)) and presents a simple example of this method.
This method is referred to as ["parametric bootstrapping"](https://en.wikipedia.org/wiki/Bootstrapping_(statistics)) [[2]](https://doi.org/10.1214/ss/1063994971). The resampling method assumes normally distributed parameters, when errors on the fitted parameters turn out to be asymmetric the numerical approximation may prove out to be inaccurate.

By default the confidence interval is estimated at each x-position of the data, however this may cause issues of resolution when data points are sparse or non-uniformly distributed along the x-axis, or it may be computationally expensive for large datasets. When a `confidence_resolution: int` parameter is set in `curve_fit`, that number of points is generated between the highest and lowest point on the x-axis and used as the x-axis instead.

Expand All @@ -92,21 +92,43 @@ resampled_x_axis = np.linspace(min(xdata), max(xdata), confidence_resolution)

Note that the `confidence_resolution` must be provided to both `curve_fit` and `plot_fit`

### Specifying the Number of Standard Deviations

To specify the number of standard deviations to be calculated provide the `nsigma` parameter to either the `curve_fit()` function, or directly to `confidence_interval()`. By default one standard deviation is used. The upper and lower thresholds of the resulting gaussian curve are calculated by evaluating the gaussian integral between $\mu - n\cdot \sigma$ and $\mu + n\cdot \sigma$. By substituting

$$
z = \frac{x - \mu} {\sigma}
$$

the probability to fall within an n- $\sigma$ interval can be calculated as

$$
P(\mu - n\sigma < X < \mu + n\sigma) = \frac{1}{2}(\text{erf}(n/\sqrt{2}) - \text{erf}(-n/\sqrt{2})) := P_n
$$

The cutoff for the resampled points above the upper threshold ($u$) and the lower threshold ($l$) then is

$u_n = \frac{1}{2} + \frac{P_n}{2}$; $l_n = \frac{1}{2} - \frac{P_n}{2}$

Be aware, that
- For $n\gtrapprox 5.5$ the error function incorrectly evaluates to $\pm 1$. Larger standard deviations will all evaluate to the same distance from the fit.
- For large standard deviations ($\gtrapprox 3$) the approximation by resampling may require larger resampling sizes to be accurate.

## Using Plot-Fit

This section instructs on the basics of using the plot fit function.
For function documentation consult `./functions.md`.

### Plotting a Fitted Function

`fitting_toolkit.plot_fit()` displays the fitted function generated by `fitting_toolkit.curve_fit()`. The following parameters are required:
`fitting_toolkit.plot_fit()` displays the fitted function generated by `fitting_toolkit.curve_fit()` via non linear regression. The following parameters are required:

| Parameters | | |
|----------|------------|-----------------|
| **Name** | **Type** | **Description** |
| xdata | np.ndarray | The x-values of the data points.
| ydata | np.ndarray | The y-values of the data points.
| model | np.ndarray | The model function that takes `xdata` and model parameters as inputs.
| model | function | The model function that takes `xdata` and model parameters as inputs.
| params | np.ndarray | The parameters for the fitted model.
| lower | np.ndarray | The lower bounds of the confidence intervals for the model predictions.
| upper | np.ndarray | The upper bounds of the confidence intervals for the model predictions.
Expand Down
Binary file added requirements.txt
Binary file not shown.
47 changes: 35 additions & 12 deletions src/fitting_toolkit.py
Original file line number Diff line number Diff line change
@@ -1,8 +1,9 @@
from scipy.optimize import curve_fit as sc_curve_fit
from scipy.special import erf
import numpy as np
from matplotlib import pyplot as plt

def generate_thresholds(data, lower_frac=1/6, upper_frac=5/6):
def generate_thresholds(data, lower_frac=0.15865, upper_frac=0.84135):
"""
Generates two thresholds such that:
- A fraction (lower_frac) of the data is below the lower threshold.
Expand All @@ -16,11 +17,25 @@ def generate_thresholds(data, lower_frac=1/6, upper_frac=5/6):
Returns:
tuple: (lower_threshold, upper_threshold)
"""
lower_threshold = np.percentile(data, lower_frac * 100) # 16.67th percentile
upper_threshold = np.percentile(data, upper_frac * 100) # 83.33rd percentile
lower_threshold = np.percentile(data, lower_frac * 100)
upper_threshold = np.percentile(data, upper_frac * 100)
return lower_threshold, upper_threshold

def confidence_interval(model, xdata: np.array, params: np.array, cov: np.array, resamples: int) -> tuple[np.array, np.array]:
def get_sigma_probability(n: float = 1):
"""
Returns probability for event to fall into n-sigma interval assuming a gaussian distribution:
P(mu - n*sigma < X < mu + n*sigma)
Args:
n (float): Number of sigmas in interval
Returns:
p (float): Probability of falling into sigma interval.
"""

return 1/2 * (erf(n / 1.4142135623730951) - erf(-n / 1.4142135623730951))

def confidence_interval(model, xdata: np.array, params: np.array, cov: np.array, resamples: int, nsigma: float = 1) -> tuple[np.array, np.array]:
"""
Computes the confidence intervals for the predictions of a model based on a set of input data.
Expand All @@ -33,6 +48,7 @@ def confidence_interval(model, xdata: np.array, params: np.array, cov: np.array,
params (numpy.ndarray): The initial model parameters.
cov (numpy.ndarray): The covariance matrix of the model parameters, used to generate resamples.
resamples (int): The number of resampling iterations to generate for bootstrapping.
nsigma (float): Number of standard deviation in interval.
Returns:
tuple: A tuple containing:
Expand All @@ -43,18 +59,22 @@ def confidence_interval(model, xdata: np.array, params: np.array, cov: np.array,

params_resamples = random.transpose()

P = get_sigma_probability(nsigma)
upper_threshold = 0.5 + P/2
lower_threshold = 0.5 - P/2

lower_conf = list()
upper_conf = list()

for x in xdata:
distr = model(x, *params_resamples)
interval = generate_thresholds(distr)
interval = generate_thresholds(distr, lower_frac = lower_threshold, upper_frac = upper_threshold)
lower_conf.append(interval[0])
upper_conf.append(interval[1])

return np.array(lower_conf), np.array(upper_conf)

def curve_fit(model, xdata: np.array, ydata: np.array, yerror = None, resamples = 5000, confidence_resolution: int = None, **kwargs) -> tuple[np.array, np.array, np.array, np.array]:
def curve_fit(model, xdata: np.array, ydata: np.array, yerror = None, resamples = 5000, confidence_resolution: int = None, nsigma:float = 1, **kwargs) -> tuple[np.array, np.array, np.array, np.array]:
"""
Fits a model to data and calculates confidence intervals for the fitted parameters and predictions.
Expand All @@ -69,6 +89,7 @@ def curve_fit(model, xdata: np.array, ydata: np.array, yerror = None, resamples
yerror (numpy.ndarray, optional): The uncertainties in the observed data `ydata`. Default is None.
resamples (int, optional): The number of resampling iterations for bootstrapping confidence intervals. Default is 5000.
confidence_resolution (int, optional): If specified the confidence interval will be calculated at linearly spaced points along x-axis. Otherwise xdata is used.
nsigma (float): Number of standard deviation passed to confidence_interval()
**kwargs: Additional arguments passed to SciPy's `curve_fit` function.
Returns:
Expand All @@ -86,7 +107,7 @@ def curve_fit(model, xdata: np.array, ydata: np.array, yerror = None, resamples
else:
raise ValueError("Unable to specify confidence points")

lower_conf, upper_conf = confidence_interval(model, resampled_points, params, cov, resamples)
lower_conf, upper_conf = confidence_interval(model, resampled_points, params, cov, resamples, nsigma)

return params, cov, lower_conf, upper_conf

Expand Down Expand Up @@ -138,8 +159,14 @@ def plot_fit(xdata, ydata, model, params, lower, upper, xerror = None, yerror =

if fig is None and ax is None:
fig, ax = plt.subplots(**kwargs)

ax.spines[["top", "right"]].set_visible(False)
ax.grid("both")
ax.set_axisbelow(True)

elif ax is None:
ax = fig.axes
ax = fig.axes[0] #Choose first axes object in Figure

elif fig is None:
fig = ax.get_figure()

Expand All @@ -148,9 +175,5 @@ def plot_fit(xdata, ydata, model, params, lower, upper, xerror = None, yerror =
ax.plot(resampled_points, upper, color = fit_color, linewidth = 0.75, linestyle = "--", label = confidence_label)
ax.plot(resampled_points, lower, color = fit_color, linewidth = 0.75, linestyle = "--")

ax.spines[["top", "right"]].set_visible(False)
ax.grid("both")
ax.set_axisbelow(True)

return fig, ax

0 comments on commit 1c70611

Please sign in to comment.