PR #15 Update CW 48

Update Calender Week 48 - Custom Sigma
davidkowalk · Dec 1, 2024 · 1c70611 · 1c70611
2 parents a91934f + 7ec6866
commit 1c70611
Show file tree

Hide file tree

Showing 5 changed files with 91 additions and 22 deletions.
diff --git a/README.md b/README.md
@@ -9,7 +9,7 @@
 # Fitting Toolkit
 This toolkit aims at providing flexible and powerful tools for data analysis and modelling, but remain easy to use.
 
-Here I aim to walk the line between the extremes in this field. On the one side lie toolkits like Kafe2 which are easy and comfortable, however the resulting graphics are highly specified, cluttered and usually unfit for publication. On the other lie data analysis systems like Cern's ROOT, which are fast and highly capable, however have a steep learning curve and overshoot the requirements for most experiments.
+Here, I aim to strike a balance between the two extremes in this field. On one side are toolkits such as Kafe2, which prioritize ease of use and convenience but limit user control over the output, often resulting in highly specialized graphics that frequently do not meet standards required for publication without considerable effort. On the other side are data analysis systems like CERN's ROOT, which offer exceptional speed and capability but come with a steep learning curve and often exceed the requirements of most experiments.
 
 This toolkit is aimed primarily at my peers, students of physics at the university of bonn, and to a degree at professionals within my field. I am optimizing this toolkit to be used on the scale typical of lab courses and homework assignments but if possible it should be powerful enough to run decently sized datasets on an average laptop.
 
@@ -20,13 +20,18 @@ Check out the `docs` folder for documentation and tutorials.
 ## Quick Introduction
 
 ### Requirements
-This project requires the following modules:
+This project requires the following modules along with their dependencies:
 - numpy
 - matplotlib
 - scipy
 
 It is highly recommended that the user familiarizes themselves with the functionality of these modules first. A rudimentary understanding of `numpy` and `matplotlib.pyplot` is required.
 
+To install the dependencies, first a [virtual environment](https://docs.python.org/3/library/venv.html) should be created. `requirements.txt` lists all necessary packages. Run:
+```
+pip install -r requirements.txt
+```
+
 ### Getting Started
 
 To get started find the `fitting_toolkit.py` in the `src` folder and copy it into your project.
@@ -65,4 +70,8 @@ Note that the fitted function is not automatically displayed. Instead the figure
 
 ![Example Graph](./docs/img/example_fit.png)
 
-For a deeper explanation and tutorials please reference the [documentation](./docs/manual.md/).
+For a deeper explanation and tutorials please reference the [documentation](./docs/manual.md/).
+
+## Literature:
+[1] Vugrin, K. W., L. P. Swiler, R. M. Roberts, N. J. Stucky-Mack, and S. P. Sullivan (2007), Confidence region estimation techniques for nonlinear regression in groundwater flow: Three case studies, Water Resour. Res., 43, W03423, https://doi.org/10.1029/2005WR004804. \
+[2] Dennis D. Boos. "Introduction to the Bootstrap World." Statist. Sci. 18 (2) 168 - 174, May 2003. https://doi.org/10.1214/ss/1063994971
diff --git a/docs/functions.md b/docs/functions.md
@@ -7,7 +7,7 @@ By separating the fitting functionality from the display options, a user can uti
 
 To fit a dataset, call:
 ```python
-curve_fit(model, xdata: np.array, ydata: np.array, yerror = None, resamples = 5000, **kwargs)
+curve_fit(model, xdata: np.array, ydata: np.array, yerror = None, resamples = 5000, confidence_resolution: int = None, nsigma:float = 1, **kwargs)
 ```
 
 | Parameters | | |
@@ -107,7 +107,7 @@ The matplotlib objects used are returned:
 
 ## Calculate Confidence Interval for an Existing Fit
 
-Given already fitted parameters and a covariance matrix, a confidence interval can be calculated using `confidence_interval(model, xdata: np.array, params: np.array, cov: np.array, resamples: int)`
+Given already fitted parameters and a covariance matrix, a confidence interval can be calculated using `confidence_interval(model, xdata: np.array, params: np.array, cov: np.array, resamples: int, nsigma: float = 1)`
 
 | Parameters | | |
 |----------|----------|-----------------|
@@ -141,6 +141,21 @@ So that the fitted value `x[i]` are
 print(f"f({x[i]:.2e}) = {f(x[i], *params):.2e} (+{sigma_pos[i]:.2e}/-{sigma_neg[i]:.2e})")
 ```
 
+## Generate Probability for Sigma Interval
+
+To get probability to fall into n-sigma interval call `get_sigma_probability(n: float = 1)`
+
+| Parameters | | |
+|----------|----------|-----------------|
+| **Name** | **Type** | **Description** |
+| n        | float    | Number of sigmas in interval
+
+| Returns | | |
+|----------|----------|-----------------|
+| **Name** | **Type** | **Description** |
+| p        | float    | Probability of falling into sigma interval. $P(\mu - n*\sigma < X < \mu + n*\sigma) $
+
+
 ## Generating Thresholds
 
 Given a bootstrapped distribution, generate a custom confidence interval.

diff --git a/docs/manual.md b/docs/manual.md
@@ -48,7 +48,7 @@ Note that the fitted function is not automatically displayed. Instead the figure
 
 Understanding how the tools we use work with and calculate errors and uncertainties is a vital part of judging and quantifying the quality of our work. The toolkit's `curve_fit()` function uses scipy's `curve_fit()` function which handles the fitting process as well as the calculation of errors and correlation of the fitted parameters. You can reference the modules [documentation](https://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.curve_fit.html) for a detailed explanation.
 
-### Covariance Matrix and Sigma
+### Covariance Matrix and Standard Deviation
 
 SciPy returns a [covariance matrix](https://en.wikipedia.org/wiki/Covariance) along with the optimal parameters of the fitted function. In [error propagation](https://en.wikipedia.org/wiki/Propagation_of_uncertainty) according to gauß we assume that the measured values are uncorrelated. In the real world this is usually not the case.
 
@@ -60,7 +60,7 @@ The closer the absolute value of the covariance between two values is to the pro
 s1, s2, ..., sn = np.sqrt(np.diagonal(cov))
 ```
 
-> Note that the relationship between cov and parameter error estimates is derived based on a linear approximation to the model function around the optimum. When this approximation becomes inaccurate, cov may not provide an accurate measure of uncertainty.
+> Note that the relationship between cov and parameter error estimates is derived based on a linear approximation to the model function around the optimum. When this approximation becomes inaccurate, cov may not provide an accurate measure of uncertainty. [[1]](https://doi.org/10.1029/2005WR004804)
 
 Further the documentation notes:
 
@@ -82,7 +82,7 @@ To estimate the confidence interval the fitted parameters are then resampled usi
 
 For each point on the x-axis, the fitted function is calculated using all the resampled parameter sets. Then, the `numpy.percentile` function is used to estimate the upper and lower bounds of the confidence interval. By default, these bounds are set so that 1/6 of the resampled values are above the interval and 1/6 are below it. This means that 2/3 of the resampled values fall within the interval, which corresponds to a 1-sigma confidence level.
 
-This method is often referred to as ["bootstrapping"](https://en.wikipedia.org/wiki/Bootstrapping_(statistics)) and presents a simple example of this method.
+This method is referred to as ["parametric bootstrapping"](https://en.wikipedia.org/wiki/Bootstrapping_(statistics)) [[2]](https://doi.org/10.1214/ss/1063994971). The resampling method assumes normally distributed parameters, when errors on the fitted parameters turn out to be asymmetric the numerical approximation may prove out to be inaccurate.
 
 By default the confidence interval is estimated at each x-position of the data, however this may cause issues of resolution when data points are sparse or non-uniformly distributed along the x-axis, or it may be computationally expensive for large datasets. When a `confidence_resolution: int` parameter is set in `curve_fit`, that number of points is generated between the highest and lowest point on the x-axis and used as the x-axis instead.
 
@@ -92,21 +92,43 @@ resampled_x_axis = np.linspace(min(xdata), max(xdata), confidence_resolution)
 
 Note that the `confidence_resolution` must be provided to both `curve_fit` and `plot_fit`  
 
+### Specifying the Number of Standard Deviations
+
+To specify the number of standard deviations to be calculated provide the `nsigma` parameter to either the `curve_fit()` function, or directly to `confidence_interval()`. By default one standard deviation is used. The upper and lower thresholds of the resulting gaussian curve are calculated by evaluating the gaussian integral between $\mu - n\cdot \sigma$ and $\mu + n\cdot \sigma$. By substituting
+
+$$
+    z = \frac{x - \mu} {\sigma}
+$$
+
+the probability to fall within an n- $\sigma$ interval can be calculated as
+
+$$
+    P(\mu - n\sigma < X < \mu + n\sigma) = \frac{1}{2}(\text{erf}(n/\sqrt{2}) - \text{erf}(-n/\sqrt{2})) := P_n
+$$
+
+The cutoff for the resampled points above the upper threshold ($u$) and the lower threshold ($l$) then is
+
+$u_n = \frac{1}{2} + \frac{P_n}{2}$; $l_n = \frac{1}{2} - \frac{P_n}{2}$
+
+Be aware, that
+- For $n\gtrapprox 5.5$ the error function incorrectly evaluates to $\pm 1$. Larger standard deviations will all evaluate to the same distance from the fit.
+- For large standard deviations ($\gtrapprox 3$) the approximation by resampling may require larger resampling sizes to be accurate.
+
 ## Using Plot-Fit
 
 This section instructs on the basics of using the plot fit function.
 For function documentation consult `./functions.md`.
 
 ### Plotting a Fitted Function
 
-`fitting_toolkit.plot_fit()` displays the fitted function generated by `fitting_toolkit.curve_fit()`. The following parameters are required:
+`fitting_toolkit.plot_fit()` displays the fitted function generated by `fitting_toolkit.curve_fit()` via non linear regression. The following parameters are required:
 
 | Parameters | | |
 |----------|------------|-----------------|
 | **Name** | **Type**   | **Description** |
 | xdata    | np.ndarray | The x-values of the data points.
 | ydata    | np.ndarray | The y-values of the data points.
-| model    | np.ndarray | The model function that takes `xdata` and model parameters as inputs.
+| model    | function   | The model function that takes `xdata` and model parameters as inputs.
 | params   | np.ndarray | The parameters for the fitted model.
 | lower    | np.ndarray | The lower bounds of the confidence intervals for the model predictions. 
 | upper    | np.ndarray | The upper bounds of the confidence intervals for the model predictions.

diff --git a/requirements.txt b/requirements.txt
diff --git a/src/fitting_toolkit.py b/src/fitting_toolkit.py
@@ -1,8 +1,9 @@
 from scipy.optimize import curve_fit as sc_curve_fit
+from scipy.special import erf
 import numpy as np
 from matplotlib import pyplot as plt
 
-def generate_thresholds(data, lower_frac=1/6, upper_frac=5/6):
+def generate_thresholds(data, lower_frac=0.15865, upper_frac=0.84135):
     """
     Generates two thresholds such that:
     - A fraction (lower_frac) of the data is below the lower threshold.
@@ -16,11 +17,25 @@ def generate_thresholds(data, lower_frac=1/6, upper_frac=5/6):
     Returns:
         tuple: (lower_threshold, upper_threshold)
     """
-    lower_threshold = np.percentile(data, lower_frac * 100)  # 16.67th percentile
-    upper_threshold = np.percentile(data, upper_frac * 100)  # 83.33rd percentile
+    lower_threshold = np.percentile(data, lower_frac * 100)
+    upper_threshold = np.percentile(data, upper_frac * 100)
     return lower_threshold, upper_threshold
 
-def confidence_interval(model, xdata: np.array, params: np.array, cov: np.array, resamples: int) -> tuple[np.array, np.array]:
+def get_sigma_probability(n: float = 1):
+    """
+    Returns probability for event to fall into n-sigma interval assuming a gaussian distribution:
+    P(mu - n*sigma < X < mu + n*sigma)  
+
+    Args:
+        n (float): Number of sigmas in interval
+
+    Returns:
+        p (float): Probability of falling into sigma interval.
+    """
+
+    return 1/2 * (erf(n / 1.4142135623730951) - erf(-n / 1.4142135623730951))
+
+def confidence_interval(model, xdata: np.array, params: np.array, cov: np.array, resamples: int, nsigma: float = 1) -> tuple[np.array, np.array]:
     """
     Computes the confidence intervals for the predictions of a model based on a set of input data.
 
@@ -33,6 +48,7 @@ def confidence_interval(model, xdata: np.array, params: np.array, cov: np.array,
         params (numpy.ndarray): The initial model parameters.
         cov (numpy.ndarray): The covariance matrix of the model parameters, used to generate resamples.
         resamples (int): The number of resampling iterations to generate for bootstrapping.
+        nsigma (float): Number of standard deviation in interval.
 
     Returns:
         tuple: A tuple containing:
@@ -43,18 +59,22 @@ def confidence_interval(model, xdata: np.array, params: np.array, cov: np.array,
 
     params_resamples = random.transpose()
 
+    P = get_sigma_probability(nsigma)
+    upper_threshold = 0.5 + P/2
+    lower_threshold = 0.5 - P/2
+
     lower_conf = list()
     upper_conf = list()
 
     for x in xdata:
         distr = model(x, *params_resamples)
-        interval = generate_thresholds(distr)
+        interval = generate_thresholds(distr, lower_frac = lower_threshold, upper_frac = upper_threshold)
         lower_conf.append(interval[0])
         upper_conf.append(interval[1])
 
     return np.array(lower_conf), np.array(upper_conf)
 
-def curve_fit(model, xdata: np.array, ydata: np.array, yerror = None, resamples = 5000, confidence_resolution: int = None, **kwargs) -> tuple[np.array, np.array, np.array, np.array]:
+def curve_fit(model, xdata: np.array, ydata: np.array, yerror = None, resamples = 5000, confidence_resolution: int = None, nsigma:float = 1, **kwargs) -> tuple[np.array, np.array, np.array, np.array]:
     """
     Fits a model to data and calculates confidence intervals for the fitted parameters and predictions.
 
@@ -69,6 +89,7 @@ def curve_fit(model, xdata: np.array, ydata: np.array, yerror = None, resamples
         yerror (numpy.ndarray, optional): The uncertainties in the observed data `ydata`. Default is None.
         resamples (int, optional): The number of resampling iterations for bootstrapping confidence intervals. Default is 5000.
         confidence_resolution (int, optional): If specified the confidence interval will be calculated at linearly spaced points along x-axis. Otherwise xdata is used.
+        nsigma (float): Number of standard deviation passed to confidence_interval()
         **kwargs: Additional arguments passed to SciPy's `curve_fit` function.
 
     Returns:
@@ -86,7 +107,7 @@ def curve_fit(model, xdata: np.array, ydata: np.array, yerror = None, resamples
     else:
         raise ValueError("Unable to specify confidence points")
 
-    lower_conf, upper_conf = confidence_interval(model, resampled_points, params, cov, resamples)
+    lower_conf, upper_conf = confidence_interval(model, resampled_points, params, cov, resamples, nsigma)
 
     return params, cov, lower_conf, upper_conf
 
@@ -138,8 +159,14 @@ def plot_fit(xdata, ydata, model, params, lower, upper, xerror = None, yerror =
 
     if fig is None and ax is None:
         fig, ax = plt.subplots(**kwargs)
+
+        ax.spines[["top", "right"]].set_visible(False)
+        ax.grid("both")
+        ax.set_axisbelow(True)
+
     elif ax is None:
-        ax = fig.axes
+        ax = fig.axes[0] #Choose first axes object in Figure
+
     elif fig is None:
         fig = ax.get_figure()
 
@@ -148,9 +175,5 @@ def plot_fit(xdata, ydata, model, params, lower, upper, xerror = None, yerror =
     ax.plot(resampled_points, upper, color = fit_color, linewidth = 0.75, linestyle = "--", label = confidence_label)
     ax.plot(resampled_points, lower, color = fit_color, linewidth = 0.75, linestyle = "--")
 
-    ax.spines[["top", "right"]].set_visible(False)
-    ax.grid("both")
-    ax.set_axisbelow(True)
-
     return fig, ax