Skip to content

Latest commit

 

History

History
185 lines (128 loc) · 10.8 KB

manual.md

File metadata and controls

185 lines (128 loc) · 10.8 KB

Manual for the Fitting Toolkit

This toolkit provides all necessary functions for fitting and displaying data in a scatter plot with error bars, but we also maintain access to the basic functionalities.

These instructions provide a basic introduction on how to use the toolkit and later explains how to use its parts independently of each other.

Getting Started

This section covers the use of the toolkit's functions as intended, which will cover most use cases.

To get started install the package via

pip install fitting-toolkit

or copy fitting_toolkit.py into your working directory. You can now import the relevant functions into your code:

from fitting_toolkit import curve_fit, plot_fit 
import numpy as np

The curve_fit requires numpy-arrays. Therefore numpy has to be imported as well.

We can now start by simply defining our data.

x = np.array((1, 2, 3, 4, 5))
y = np.array((1, 2, 1.75, 2.25, 3))
dy = 0.1*y+0.05
dx = 0.1

For a model we chose a simple linear model:

def f(x, a, b):
    return a * x + b

We can now fit the model to the data:

params, cov, lower_conf, upper_conf = curve_fit(f, x, y, yerror=dy)

This function returns 4 arrays. First the parameters of the model, the covariance matrix of those parameters and then the lower and upper limits of the confidence interval around the fit. Note that the confidence interval is absolute. To get the error in relation to the fitted function you would need to find the difference at each point.

The resulting fit can now be plotted. This toolkit provides a premade function to generate plots:

from matplotlib import pyplot as plt
fig, ax = plot_fit(x, y, f, params, lower_conf, upper_conf, xerror=dx, yerror=dy)
plt.show()

Note that the fitted function is not automatically displayed. Instead the figure and axis objects are returned.

Example Graph

Errors and Uncertainties

Understanding how the tools we use work with and calculate errors and uncertainties is a vital part of judging and quantifying the quality of our work. The toolkit's curve_fit() function uses scipy's curve_fit() function which handles the fitting process as well as the calculation of errors and correlation of the fitted parameters. You can reference the modules documentation for a detailed explanation.

Covariance Matrix and Standard Deviation

SciPy returns a covariance matrix along with the optimal parameters of the fitted function. In error propagation according to gauß we assume that the measured values are uncorrelated. In the real world this is usually not the case.

Covariance in probability theory and statistics is a measure of the joint variability of two random variables.

The closer the absolute value of the covariance between two values is to the product of their standard deviations the more correlated they are. SciPy automatically finds a numerical approximation for the covariance between fit parameters. To extract the standard deviation of the fitted parameters take the root of the matrix diagonal.

s1, s2, ..., sn = np.sqrt(np.diagonal(cov))

Note that the relationship between cov and parameter error estimates is derived based on a linear approximation to the model function around the optimum. When this approximation becomes inaccurate, cov may not provide an accurate measure of uncertainty. [1]

Further the documentation notes:

How the sigma parameter affects the estimated covariance depends on absolute_sigma argument, as described above.

The absolute_sigma determines the interpretation of and effect the data errors have on the variance and covariance of the fitted parameters. By default the absolute_sigma = False which means that the errors are only used as weights, i.e., as hints how important a point should be in the least-squares computation. To compute the covariance matrix, scipy.optimize.curve_fit then estimates the actual error from the average deviation from the optimized curve. (Read More Here) When fitting data with absolute error bars as can be seen in the examples provided the absolute_sigma parameter should be set to True via the **kwargs.

params, cov, lower_conf, upper_conf = curve_fit(model, x, y, yerror=dy, absolute_sigma = True)

Also note that only the y-error affects the uncertainty

Calculation of the Confidence Interval

To estimate the confidence interval the fitted parameters are then resampled using their estimated means and covariance matrix. The number of resamples can be set via the resamples = 5000 parameter.

PD of Example Graph

For each point on the x-axis, the fitted function is calculated using all the resampled parameter sets. Then, the numpy.percentile function is used to estimate the upper and lower bounds of the confidence interval. By default, these bounds are set so that 1/6 of the resampled values are above the interval and 1/6 are below it. This means that 2/3 of the resampled values fall within the interval, which corresponds to a 1-sigma confidence level.

This method is referred to as "parametric bootstrapping" [2]. The resampling method assumes normally distributed parameters, when errors on the fitted parameters turn out to be asymmetric the numerical approximation may prove out to be inaccurate.

By default the confidence interval is estimated at each x-position of the data, however this may cause issues of resolution when data points are sparse or non-uniformly distributed along the x-axis, or it may be computationally expensive for large datasets. When a model_resolution: int parameter is set in curve_fit, that number of points is generated between the highest and lowest point on the x-axis and used as the x-axis instead.

resampled_x_axis = np.linspace(min(xdata), max(xdata), model_resolution) 

Note that the model_resolution must be provided to both curve_fit and plot_fit

Specifying the Number of Standard Deviations

To specify the number of standard deviations to be calculated provide the nsigma parameter to either the curve_fit() function, or directly to confidence_interval(). By default one standard deviation is used. The upper and lower thresholds of the resulting gaussian curve are calculated by evaluating the gaussian integral between $\mu - n\cdot \sigma$ and $\mu + n\cdot \sigma$. By substituting

$$ z = \frac{x - \mu} {\sigma} $$

the probability to fall within an n- $\sigma$ interval can be calculated as

$$ P(\mu - n\sigma < X < \mu + n\sigma) = \frac{1}{2}(\text{erf}(n/\sqrt{2}) - \text{erf}(-n/\sqrt{2})) := P_n $$

The cutoff for the resampled points above the upper threshold ($u$) and the lower threshold ($l$) then is

$u_n = \frac{1}{2} + \frac{P_n}{2}$; $l_n = \frac{1}{2} - \frac{P_n}{2}$

Be aware, that

  • For $n\gtrapprox 5.5$ the error function incorrectly evaluates to $\pm 1$. Larger standard deviations will all evaluate to the same distance from the fit.
  • For large standard deviations ($\gtrapprox 3$) the approximation by resampling may require larger resampling sizes to be accurate.

Using Plot-Fit

This section instructs on the basics of using the plot fit function. For function documentation consult ./functions.md.

Plotting a Fitted Function

fitting_toolkit.plot_fit() displays the fitted function generated by fitting_toolkit.curve_fit() via non linear regression. The following parameters are required:

Parameters
Name Type Description
xdata np.ndarray The x-values of the data points.
ydata np.ndarray The y-values of the data points.
model function The model function that takes xdata and model parameters as inputs.
params np.ndarray The parameters for the fitted model.
lower np.ndarray The lower bounds of the confidence intervals for the model predictions.
upper np.ndarray The upper bounds of the confidence intervals for the model predictions.

If the upper and lower bounds were generated with a custom resolution, the same resolution must be provided in the model_resolution parameter.

For customization the function also provides the keywords markersize, capsize, fit_label, and confidence_label

You may also pass keyword arguments to matplotlib.pyplot.subplots() via **kwargs. For comprehensive documentation please consult subplots(), figure() and add_plot(). Please note that it's assumed that subplots() returns a figure object and a single axes object.

Common parameters include:

Parameters
Name Type Description
figsize (float, float) Width, height in inches.
facecolor color The background color.
layout str The layout mechanism for positioning of plot elements to avoid overlapping Axes decorations (labels, ticks, etc).
aspect str or float {'auto', 'equal'} or float, aspect ratio of axes.
alpha float Set the alpha value used for blending - not supported on all backends.

Plotting Multiple Fits In the Same Graph

To add a new fit to an existing graph provide the figure and the axes object returned when calling the function for the first time.

# fit two datasets to the same model separately
params1, cov1, lower_conf1, upper_conf1 = curve_fit(f, x1, y1)
params2, cov2, lower_conf2, upper_conf2 = curve_fit(f, x2, y2)

# display separately fitted models in the same graph
fig, ax = plot_fit(x1, y1, f, params1, lower_conf1, upper_conf1)
fig, ax = plot_fit(x2, y2, f, params2, lower_conf2, upper_conf2, fig = fig, ax = ax)
# fig and ax are returned; however, reassigning them is not required

In this example both datasets are fitted to the same model f, however this is an arbitrary choice.

Example of multiple fits

It may be required to change the color of the fitted function for clarity, as can be seen in the example above.

fig, ax = plot_fit(x, y, model, params, lower, upper, fit_color = "crimson", fit_label="data label", confidence_label="label confidence interval")

You may set the color of the graph with the fit_color attribute. To label your data pass a fit_label and a confidence_label or pass Noneto remove them.