diff --git a/README.md b/README.md index b5453a6e..00e7bfc1 100644 --- a/README.md +++ b/README.md @@ -26,6 +26,7 @@ or navigate to any of the documents listed below and download it individually. 9. [Tutorial: Plotting Fractals](content/tutorial-plotting-fractals.ipynb) 10. [Tutorial: NumPy natural language processing from scratch with a focus on ethics](content/tutorial-nlp-from-scratch.md) 11. [Tutorial: Analysing the impact of the lockdown on air quality in Delhi, India](content/tutorial-air-quality-analysis.md) +12. [Tutorial: Stochastic Simulations](content/tutorial-stochastic-simulations.ipynb) ## Contributing diff --git a/content/tutorial-stochastic-simulations.md b/content/tutorial-stochastic-simulations.md new file mode 100644 index 00000000..ae5fa508 --- /dev/null +++ b/content/tutorial-stochastic-simulations.md @@ -0,0 +1,483 @@ +--- +jupytext: + text_representation: + extension: .md + format_name: myst + format_version: 0.13 + jupytext_version: 1.14.7 +kernelspec: + display_name: Python 3 (ipykernel) + language: python + name: python3 +--- + +# Simulating Using Random Variables + +Randomness, or apparent randomness, is a major part of the world we live from +the movement of gas molecules in the air we breath to the outcome of a coin toss. +While many events are not truely random, they are often close enough to random +or seemingly random enough to be simulated as such. +Despite many processes being random, their outcomes often conform to a +statistical pattern if enough of these random events occur. +For example, if we roll a six-sided die over a large number of rolls, we expect +to roll a 5 about $\frac{1}{6}$ of the time. +As a general rule, the larger the number of random events observed, the closer +the distribution of these events will *likely* be to the expected disbribution. +In this tutorial, we will use a random number generator provided in NumPy to +solve a number of problems through stochastic simulations. +While the problems demonstrated below all have know solutions or equations, you +may face other problems that do not have an analytical model or known solution. +The methodology demonstrated below can be used to solve problems when no known +solution or model exists. + +## What you'll do +- Generate random values using NumPy to power stochastic simulations +- Perform stochastic simulations to determine the + - Likelyhood of multiple children in a classroom of 25 having the same birthday + - Distribution of marbles in a Galton board + - Value of $\pi$ + - Amount of radioactive material left versus time +- Compare the above simulations to known values or an analytical model + +## What you'll learn: +- Generating large quantites of random values in a desired distribution +- Using random values to simulation a range of scenarios and solve problems + +## What you'll need: +- [NumPy](https://numpy.org/) +- [Matplotlib](https://matplotlib.org/) +- Python's `math` module + +The above two packages are imported using the following commands: + +```{code-cell} ipython3 +import numpy as np +import matplotlib.pyplot as plt +import math +``` + +## Generating random variables + +NumPy can generate random variables using a Generator from the `np.random` module. +While these values are not truely random, they are close enough for the following +applications. +One of the advantages of using NumPy to generate variables over the standard +`random` Python module is that NumPy can generate a large number of random +variables efficiently without the use of a `for` loop. +These variables can be generated in a variety of probability distribution +including normal, poisson, and binomial; but for the following simulations, we +will mostly use even distributions where all numbers in the range have an equal +likelyhood of being generated. +This is accomplished by first creating a Generator using the +`np.random.default_rng()` function and then using `random(n)` or +`np.integers(low, high=, size=n)` methods with the Generator to produce random +values where `n` is the number of random values returned. +Below are some common methods for generating random values. + +| Method | Description | +|-----------|-------------| +|`random(n)`| Generates `n` float values in the [0,1) range with an even distribution | +|`integers(low, high=, size=)`| Generates `size` integers in the requested range [low, high) in a uniform distribution| +|`normal(loc, scale=, size=)`| Generates `size` floats in a normal distribution centered on `loc` with a `scale` standard deviation| +|`binomial(t, p, size=)`| Generates `size` integers in a bionimial distribution with `t` trials for each value with a probability `p`| +|`choice(a, size=)`| Selects `size` elements from `a` list, tuple or ndarray with an even probability | +|`shuffle(a)`| Shuffles items in list, tubple, or ndarray `a` | + +```{code-cell} ipython3 +rng = np.random.default_rng(seed=18) +rng.random(10) +``` + +You may have guessed by looking at the returned values that `random()` produces +values in the 0 $\rightarrow$ 1 range, but what happens if we need values in a +different range? +We can modify these values by mutiplying them by a coefficient to increase the +range and using addition or substration to shift the center of the range. +For example, if we need random values from -10 $\rightarrow$ +10, we can +accomplish this by the following. + +```{code-cell} ipython3 +20 * rng.random(10) - 10 +``` + +It's worth noting that the `random()` method returns values from the [0,1) range +which includes the lower end and excludes the upper end. +This is close enough to [0,1] for our purposes. + ++++ + +## Solving the birthday problem + +The goal of the [birthday problem](https://en.wikipedia.org/wiki/Birthday_problem) +is to determine the probability of multiple children in a classroom of 25 +students having the same birthday. +For simiplicity, we will ignore leap years (i.e., February 29 birthdays) and +assume that the students are equilly likely to be born any day of the year. +One method of determining the answer is to repeately simulate a classroom of 25 +student birthdays by using a random number generator and see how often all +students in the class have unique birthdays. +The general steps for this processes are outlined below. + +1. Simulate 25 random birthdays with a random integer generator +2. Check to see if any of the above birthdays match +3. Record the outcome if all birthdays are unique +4. Repeat setps 1-3 a large number of times + +To make things easier, we will just generate integers 0 $\rightarrow$ 364 +representating days of the year instead of actual months and days. + +```{code-cell} ipython3 +def all_unique(class_size, n_classrooms=100): + '''Randomly generates birthdays for groups of kids of a given group size and + outputs the percent of time all students have uniquie birthdays. All input + and output values are ints or floats. + + (classroom size, number of classrooms) -> percentage classrooms with all unique birthdays + ''' + + all_unique_class = 0 # number of classrooms with students NOT sharing birthdays + + for classroom in range(n_classrooms): + birthdays = rng.integers(0, high=365, size=class_size) + if class_size == np.unique(birthdays).size: + all_unique_class += 1 + + return 100 * all_unique_class / n_classrooms +``` + +Below we simulate 50000 classrooms of 25 students and find that only around 43% +of the time every student in a classroom has a unique birthday from all the other students. + +```{code-cell} ipython3 +all_unique(25, n_classrooms=50000) +``` + +This problem also has an equation for the probability given below where **p** +is the probability of all students having a unique birthday and **n** is the +classroom size. +Performing this calculation below, we find that 43.13% of the time, students +should have all unique birthdays which is quite close to what was found in the +simulation above. + +$$ p = \frac {365!}{365^n(365-n)!} $$ + +```{code-cell} ipython3 +n = 25 +math.factorial(365)/(365**n * math.factorial(365-n)) +``` + +## Simulating a Galton board + +Next, we will simulate a [Galton board](https://en.wikipedia.org/wiki/Galton_board) +which allows mables to fall down levels of staggered pegs. +Each marble starts in the horizontal center of the board, and at each level, +the marble hits a peg. To continue falling to the next level, the marble must +move around the peg, and it has an equal likeliness of moving to the left as it +does to the right. +At the bottom of the board, the marbles collect in bins with the height of +marbles yielding a normal distribution. + ++++ + +![](tutorial-stochastic/galton_board.png) + ++++ + +**Figure 1.** An example of a Galton board with three marbles with three +different potential paths (dotted lines) shown. + +The key operational detail of a Galton board is that every marble moves one +increment to the left or right at each level, so our simulation centers around +a series of randomly generated +1 and -1 to determine this movement by the +following methodolog. +1. Generate a random series of +1 and -1 in an even distribution - one value for + every level on the Galton board +2. Sum up the values in step 1 to determine the final location of the mable +3. Repeat steps 1-2 for each simulated marble +4. Visualize the result + +Below is the simulation for a single marble moving down a 10-layered Galton board. +Note how the integers 0 and 1 generated by `integers()` and converted to a +series of -1 and +1 values, respectively. + +```{code-cell} ipython3 +moves = 2 * rng.integers(0, high=2, size=10) - 1 +final_position = moves.sum() +final_position +``` + +To simulate additional marbles, we could use a `for` loop, but because we need +the same number of integers for each marble, a more efficient way is to generate +the random integers in a two-dimensional array with a row for every level on the +Galton board and column for every marble simulated. +We then sum up the values in each column to obtain the final positions of all marbles. + +```{code-cell} ipython3 +def galton(marbles, levels=10): + '''integer -> ndarray of integers + + This function takes in an integer number of marbles and number of levels and outputs + the horizontal position of these marbles at the bottom of a Galton board. + ''' + moves = 2 * rng.integers(0, high=2, size=(levels, marbles)) - 1 + final_position = moves.sum(axis=0) + + return final_position +``` + +```{code-cell} ipython3 +sim = galton(1200) +sim[:20] +``` + +We visualize the results below using a histogram plot. +Being that all values are integers, we set `align='left'` so that each bar +resides directly over the integer it represents. + +```{code-cell} ipython3 +bins = np.arange(-10,12) +plt.hist(sim, bins=bins, align='left') +plt.xticks(bins[::2]) +plt.xlabel('Horizontal Marble Position') +plt.ylabel('Number of Marbles'); +``` + +One interesting trend that appears in the above histogram is that all locations +are integers. +This is because we used whole numbers for our left and right movements of +marbles (i.e., +1 and -1) and used ten layers in the above simulation. +There is no way to produce an odd integer by totaling a series of ten +1 or -1 +integers. +If we changed the number of layers to an odd number, we'd only get odd positions +in the result, and if we used +1/2 and -1/2 for our horizontal movement, we'd get +both even and odd integers. + ++++ + +## Calculating pi + +As a second appliation of random number generaters, we will calclate a numerical +value for $\pi$. +Imagine a circle with radius = $r$ inscribed inside a square of side length $2r$. + ++++ + +![](tutorial-stochastic/circle_square.png) + ++++ {"editable": true, "slideshow": {"slide_type": ""}} + +**Figure 2.** The circle radius r inscribed in a square with side length of 2r. + +The areas of the circle and square are described by $A +_{circle} = \pi r^2$ and $A_{square} = 4r^2$, respectively. +If we divide the area of the circle by the area of the square, we get $\pi / 4$. + +$$ \frac { A_{circle} }{ A_{square} } = \frac {\pi r^2} {(2r)^2} = \frac {\pi r^2} {4r^2} = \frac {\pi}{4} $$ + +This means that four times the ratio of the area of the circle and square equals $\pi$. + +$$ \frac { A_{circle} }{ A_{square}} \times 4 = \pi $$ + +The challenge is now finding the ratio between the areas of the two shapes which +can be done through a variety of creative means. +Our method will be counting randomly placed dots. +Imagine we painted the circle inscribed in a square on pavement outside and +count the water droplets that fall in the circle and square during a rain storm. + ++++ + +![](tutorial-stochastic/circle_square_dots.png) + ++++ + +**Figure 3.** Random dots inside the circle and square used to estimate a value of $\pi$. + +Because the rain drops fall randomly, the probability of a drop falling in one +of the shapes is proportional to the area of that shape, so dividing the number +of drops in the circle by the number of drop in the square provides a estimate +of $\frac { A_{circle} }{ A_{square} }$ which then can be used to calculate $\pi$. +Counting rain is a bit tedious and challenging, so we can simulate this using a +random number generator to produce random ($x, y$) coordinates of dots inside +the square. +We then total the number of these that fall inside the circle versus the square +and complete the calculation to obtain a value of $\pi$. + +For smilicity, we will center our circle and square around the origin and give +the circle a radius = 1. +This requires the coordinates to fall in the [-1, 1) ranges along both the $x$- +and $y$-axes. +We have no random number generator that produces values in this range, but we +can modify the `random()` method which generates floats in the [0, 1) range by +multiplying by 2 and substracting 1 as demonstrated below. + +~~~python +rng = np.random.default_rng() +x = 2 * rng.random() - 1 +~~~ + ++++ + +Below is a function that performs this simulation and calculation of $\pi$ with +the only function argument `n_samples` as the number of dots generated. +The general proceedure is ouline here. +1. Generate `n_samples` of ($x, y$) coordinates using the `random()` method +2. Calculate the distance from the origin using the `np.hypot()` function to + determine if the dot is inside the circle (all dots are inside the square) +3. Calculate $\pi$ + +```{code-cell} ipython3 +def estimate_pi(n_samples): + '''(integer) -> pi (float) + + Estimates the value of pi by finding the ratio of the area of a unit circle/area + of a unit square and multiplying by 4. + ''' + + n_samples = int(n_samples) + + # Step 1 + rng = np.random.default_rng() + coords = rng.random(size=(n_samples, 2)) + + # Step 2 + dist_from_origin = np.hypot(coords[:,0], coords[:,1]) # distance from the origion + in_circle = dist_from_origin[dist_from_origin <= 1] # if <= 1, the point is inside the circle + + # Step 3 + pi = 4 * (in_circle.size / n_samples) + + return pi +``` + +We can then call this function with 100 dots generated. +Because we are using a random number generator, every time we call this function, +we likley get a different value for $\pi$. + +```{code-cell} ipython3 +estimate_pi(100) +``` + +```{code-cell} ipython3 +estimate_pi(100) +``` + +We can also generate a value for $\pi$ using the `np.pi` to compare our answer to. + +```{code-cell} ipython3 +np.pi +``` + +Odds are that the values calculated above for $\pi$ above is close but not +exactly 3.14. +If we increase the number of dots, the value for $\pi$ will on average become +closer to the true value. +While it may be tempting to just call the function with an extremely large number +of dots, this can take substatially longer to computer. + +```{code-cell} ipython3 +estimate_pi(1E8) +``` + +## Simulating radioactive decay + +As a final example, we will simulate the radioactive decay of the radioactive +nuclide $^{137m}$Ba to determine the quantity of undecayed material versus time. +We will then compare it to an analytical model to see how well the two agree. +Radioactive nuclei decay by first-order kinetics, which means that each radioactive +nucleus has a fixed probability of decaying at any given time. +This is analogous to how each six-sided die in a bucket has a fixed probability +of rolling a one each time they are dumped out. +This analogy is so good that we can simulate the decay process by generating a +large number of random integers with fixed probabilities. + +The first thing we need to is to know the probability of a nucleus decaying in +a single second. +Because this is a first-order process, the rate (-dA/dt) is described by + +$$ \frac{-dA}{dt} = kA $$ + +where A is the quantity of $^{137m}$Ba, $t$ is time in seconds, and $k$ is the +rate constant for this process. +We also know that the half-life (t$_{1/2}$) of $^{137m}$Ba is about 151 seconds +which is related to the rate constant by the following equation. + +$$ t_{1/2} = \frac{ln(2)}{k} $$ + +This means that $k$ = 4.59e$^{-3}$ s$^{-1}$. +For first-order kinetics, this means that 4.59e$^{-3}$ fraction of the nuclei +decays, so this is the probability we need to simulate this process using random +numbers. +The function below calculates the fraction of radioactive material remaining +based on a rate constant `k` and allows the users to choose the number of +simulated nuclei `n` and duraction of the simulation `n_final`. + +```{code-cell} ipython3 +def decay(k, n=1000, t_final=1000): + '''floats and integers -> ndarray of integers + + Calculates the fraction of nuclei remaining at each second in the simulation + given n initial nuclei with a rate constant k of decaying each second and + a final simulation time t_final. + ''' + + n_undecayed = n + undecayed_array = np.full(t_final + 1, n) + + for second in range(1, t_final + 1): + decays = rng.binomial(1, p=k, size=n_undecayed).sum() + n_undecayed -= decays + undecayed_array[second] = n_undecayed + + fraction = undecayed_array / n + + + return fraction +``` + +Below we simulate the decay of $^{137m}$Ba using a hundred simulated nuclei. + +```{code-cell} ipython3 +decay(4.59e-3, n=100) +``` + +We can plot and compare this simulation to the analytical model for this +simulation knowing that the fraction of undecayed nuclei ($\frac{A}{A_0}$) is +described by the following first-order equation where $A_0$ and $A_t$ are the +initial quantities and quantities of nuclei at time $t$. + +$$ \frac{A_t}{A_0} = e^{-kt} $$ + +```{code-cell} ipython3 +seconds = np.arange(1001) +fraction = np.exp(-4.59e-3 * seconds) + +plt.plot(seconds, fraction, linestyle='-', label='Theoretical') +plt.plot(seconds, decay(4.59e-3, n=500), linestyle='--', label='Simulation') +plt.xlabel('Time, s') +plt.ylabel('Fraction of Undecayed Nuclei') +plt.legend(); +``` + +The simulation is in good agreement with the theoretical model, but this is some +discrepency. +If we increase the number of simulated nuclei, the simulation should align +better with the theoretical values. + +```{code-cell} ipython3 +seconds = np.arange(1001) +fraction = np.exp(-4.59e-3 * seconds) + +plt.plot(seconds, fraction, linestyle='-', label='Theoretical') +plt.plot(seconds, decay(4.59e-3, n=5000), linestyle='--', label='Simulation') +plt.xlabel('Time, s') +plt.ylabel('Fraction of Undecayed Nuclei') +plt.legend(); +``` + +## Further reading + +- To learn more about NumPy random number generators, see [Random Generator](https://numpy.org/doc/stable/reference/random/index.html) +- To learn more about random number distributions and their generation, see + [Monte Carlo Simulation Part 5: Randomness & Random Number Generation](https://towardsdatascience.com/monte-carlo-simulation-421110b678c5?gi=f86d4ec9f452) +- For examples of using random variables for chemical simulations, see chapter + 9 of [Scientific Computing for Chemists](https://weisscharlesj.github.io/SciCompforChemists/intro.html) diff --git a/content/tutorial-stochastic/circle_square.png b/content/tutorial-stochastic/circle_square.png new file mode 100644 index 00000000..2d508db8 Binary files /dev/null and b/content/tutorial-stochastic/circle_square.png differ diff --git a/content/tutorial-stochastic/circle_square_dots.png b/content/tutorial-stochastic/circle_square_dots.png new file mode 100644 index 00000000..88cc936c Binary files /dev/null and b/content/tutorial-stochastic/circle_square_dots.png differ diff --git a/content/tutorial-stochastic/galton_board.png b/content/tutorial-stochastic/galton_board.png new file mode 100644 index 00000000..0b049856 Binary files /dev/null and b/content/tutorial-stochastic/galton_board.png differ diff --git a/site/features.md b/site/features.md index fb966a8a..e98581c6 100644 --- a/site/features.md +++ b/site/features.md @@ -10,4 +10,5 @@ maxdepth: 1 content/tutorial-svd content/save-load-arrays content/tutorial-ma +content/tutorial-stochastic-simulations ```