diff --git a/notebooks/05-Probability.ipynb b/notebooks/05-Probability.ipynb index 0ed71ff..20a76b7 100644 --- a/notebooks/05-Probability.ipynb +++ b/notebooks/05-Probability.ipynb @@ -1,259 +1,314 @@ { - "cells": [ - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Probability\n", - "In this chapter we will go over how to perform probability computations in Python.\n", - "\n", - "## Basic probability calculations\n", - "\n", - "Let's create a vector of outcomes from one to 6, using the `np.arange()` function to create such a sequence. This function takes the minimum and maximum values as its inputs, but note that the maximum is not included in the sequence; that is, the sequence goes up to but not including the maximum. Thus, we would have to give 1 and 7 as the minimum and maximum in order to get a sequence of numbers from 1 to 6:" - ] + "cells": [ + { + "cell_type": "markdown", + "metadata": { + "id": "_FaOYtAF8lp3" + }, + "source": [ + "# Probability\n", + "In this chapter we will go over how to perform probability computations in Python.\n", + "\n", + "## Basic probability calculations\n", + "\n", + "Let's create a vector of outcomes from one to 6, using the `np.arange()` function to create such a sequence. This function takes the minimum and maximum values as its inputs, but note that the maximum is not included in the sequence; that is, the sequence goes up to but not including the maximum. Thus, we would have to give 1 and 7 as the minimum and maximum in order to get a sequence of numbers from 1 to 6:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "prt56R7U8lp5" + }, + "outputs": [], + "source": [ + "import numpy as np\n", + "outcomes = np.arange(1, 7)\n", + "print(outcomes)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "KNWTWGLL8lp5" + }, + "source": [ + "Now let's create a vector of logical values based on whether the outcome in each position is equal to 1. Remember that `==` tests for equality of each element in a vector:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "vDryGcal8lp5" + }, + "outputs": [], + "source": [ + "outcome1isTrue = outcomes == 1\n", + "print(outcome1isTrue)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "M3B1-MPD8lp6" + }, + "source": [ + "Remember that the simple probability of an outcome is number of occurrences of the outcome divided by the total number of events. To compute a probability, we can take advantage of the fact that TRUE/FALSE are equivalent to 1/0 in Python. The formula for the mean (sum of values divided by the number of values) is thus exactly the same as the formula for the simple probability! So, we can compute the probability of the event by simply taking the mean of the logical vector." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "u3_F8OpQ8lp6" + }, + "outputs": [], + "source": [ + "p1isTrue = np.mean(outcome1isTrue)\n", + "print(p1isTrue)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "lAZCaTvd8lp6" + }, + "source": [ + "## Empirical frequency\n", + "Let's walk through how [we computed empirical frequency of rain in San Francisco](https://statsthinking21.github.io/statsthinking21-core-site/probability.html#empirical-frequency).\n", + "\n", + "First we load the data:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "Jsi4szsT8lp6" + }, + "outputs": [], + "source": [ + "#+\n", + "import pandas as pd\n", + "SFrain = pd.read_csv('https://raw.githubusercontent.com/statsthinking21/statsthinking21-python/master/notebooks/data/SanFranciscoRain.csv')\n", + "\n", + "# we will remove the STATION and NAME variables\n", + "# since they are identical for all rows\n", + "\n", + "SFrain = SFrain.drop(columns=['STATION', 'NAME'])\n", + "print(SFrain)\n", + "#-" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "_MlEaKt38lp6" + }, + "source": [ + "We see that the data frame contains a variable called `PRCP` which denotes the amount of rain each day. Let's create a new variable called `rainToday` that denotes whether the amount of precipitation was above zero:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "-RzbJtFp8lp6" + }, + "outputs": [], + "source": [ + "SFrain['rainToday'] = SFrain['PRCP'] > 0\n", + "print(SFrain)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "xX9JA59D8lp6" + }, + "source": [ + "Now we will summarize the data to compute the probability of rain:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "z0nMoJCU8lp7" + }, + "outputs": [], + "source": [ + "pRainInSF = SFrain['rainToday'].mean()\n", + "print(pRainInSF)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "XIrFx09f8lp7" + }, + "source": [ + "## Conditional probability\n", + "Let's determine the conditional probability of someone having hearing problems, given that they are over 70 years of age, using the NHANES dataset. First, let's create a new variable called `Over70` that denotes whether each individual is over 70 or not." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "nScHP-x28lp7" + }, + "outputs": [], + "source": [ + "! pip install nhanes\n", + "from nhanes.load import load_NHANES_data\n", + "nhanes_data = load_NHANES_data()\n", + "\n", + "nhanes_data['Over70'] = nhanes_data['AgeInYearsAtScreening'] > 70" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "jSFI9IYJ8lp7" + }, + "source": [ + "Now let's create a cleaned-up dataset that only includes the over70 variable along with the variable called `HaveSeriousDifficultyHearing` that denotes whether a person reports having serious hearing difficulty (coded as 1 for \"yes\" and 0 for \"no\")." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "k_qa4BMJ8lp7" + }, + "outputs": [], + "source": [ + "hearing_data = nhanes_data[['Over70', 'HaveSeriousDifficultyHearing']].dropna()\n", + "print(hearing_data)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "9_RsFSm68lp7" + }, + "source": [ + "First, what's the probability of being over 70?" 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "Db3h91eL8lp7" + }, + "outputs": [], + "source": [ + "p_over_70 = hearing_data['Over70'].mean()\n", + "print(p_over_70)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "gTGdXcnZ8lp7" + }, + "source": [ + "Second, what's the probability of having hearing problems?" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "bHwHewwI8lp7" + }, + "outputs": [], + "source": [ + "p_hearing_problem = hearing_data['HaveSeriousDifficultyHearing'].mean()\n", + "print(p_hearing_problem)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "d6uHpfiB8lp8" + }, + "source": [ + "What's the probability for each combination of hearing problems/no problems and over 70/ not? We can create a table that finds the joint probability for each combination, using the `pd.crosstab()` function:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "lines_to_next_cell": 2, + "id": "CZsdLo-V8lp8" + }, + "outputs": [], + "source": [ + "joint_table = pd.crosstab(hearing_data['Over70'], hearing_data['HaveSeriousDifficultyHearing'], normalize=True)\n", + "print(joint_table)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Pa8eEh0L8lp8" + }, + "source": [ + "Finally, what's the probability of someone having hearing problems, given that they are over 70 years of age? To do this, we limit the computation of the probability of having hearing problems to only include those people who are over 70:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "1KMwvBok8lp8" + }, + "outputs": [], + "source": [ + "p_hearingproblem_given_over_70 = hearing_data.query('Over70 == True')['HaveSeriousDifficultyHearing'].mean()\n", + "print(p_hearingproblem_given_over_70)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "lCAN0XG38lp8" + }, + "source": [ + "Now compute the opposite: What is the probability of being over 70 given that one has a hearing problem?" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "8y6YdAqQ8lp9" + }, + "outputs": [], + "source": [ + "p_over_70_given_hearingproblem = hearing_data.query('HaveSeriousDifficultyHearing == True')['Over70'].mean()\n", + "print(p_over_70_given_hearingproblem)" + ] + } + ], + "metadata": { + "jupytext": { + "formats": "ipynb,py:percent" + }, + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "colab": { + "provenance": [] + } }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "import numpy as np\n", - "outcomes = np.arange(1, 7)\n", - "outcomes" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Now let's create a vector of logical values based on whether the outcome in each position is equal to 1. Remember that `==` tests for equality of each element in a vector:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "outcome1isTrue = outcomes == 1 \n", - "outcome1isTrue" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Remember that the simple probability of an outcome is number of occurrences of the outcome divided by the total number of events. To compute a probability, we can take advantage of the fact that TRUE/FALSE are equivalent to 1/0 in Python. 
The formula for the mean (sum of values divided by the number of values) is thus exactly the same as the formula for the simple probability! So, we can compute the probability of the event by simply taking the mean of the logical vector." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "p1isTrue = np.mean(outcome1isTrue)\n", - "p1isTrue" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Empirical frequency\n", - "Let's walk through how [we computed empirical frequency of rain in San Francisco](https://statsthinking21.github.io/statsthinking21-core-site/probability.html#empirical-frequency).\n", - "\n", - "First we load the data:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "#+\n", - "import pandas as pd\n", - "SFrain = pd.read_csv('https://raw.githubusercontent.com/statsthinking21/statsthinking21-python/master/notebooks/data/SanFranciscoRain.csv')\n", - "\n", - "# we will remove the STATION and NAME variables \n", - "# since they are identical for all rows\n", - "\n", - "SFrain = SFrain.drop(columns=['STATION', 'NAME'])\n", - "SFrain\n", - "#-" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "We see that the data frame contains a variable called `PRCP` which denotes the amount of rain each day. Let's create a new variable called `rainToday` that denotes whether the amount of precipitation was above zero:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "SFrain['rainToday'] = SFrain['PRCP'] > 0\n", - "SFrain" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Now we will summarize the data to compute the probability of rain:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "pRainInSF = SFrain['rainToday'].mean()\n", - "pRainInSF" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Conditional probability\n", - "Let's determine the conditional probability of someone having hearing problems, given that they are over 70 years of age, using the NHANES dataset. First, let's create a new variable called `Over70` that denotes whether each individual is over 70 or not." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "from nhanes.load import load_NHANES_data\n", - "nhanes_data = load_NHANES_data()\n", - "\n", - "nhanes_data['Over70'] = nhanes_data['AgeInYearsAtScreening'] > 70" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Now let's create a cleaned-up dataset that only includes the over70 variable along with the variable called `HaveSeriousDifficultyHearing` that denotes whether a person reports having serious hearing difficulty (coded as 1 for \"yes\" and 0 for \"no\")." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "hearing_data = nhanes_data[['Over70', 'HaveSeriousDifficultyHearing']].dropna()\n", - "hearing_data" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "First, what's the probability of being over 70?" 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "p_over_70 = hearing_data['Over70'].mean()\n", - "p_over_70" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Second, what's the probability of having hearing problems?" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "p_hearing_problem = hearing_data['HaveSeriousDifficultyHearing'].mean()\n", - "p_hearing_problem" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "What's the probability for each combination of hearing problems/no problems and over 70/ not? We can create a table that finds the joint probability for each combination, using the `pd.crosstab()` function:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "lines_to_next_cell": 2 - }, - "outputs": [], - "source": [ - "joint_table = pd.crosstab(hearing_data.Over70, hearing_data['HaveSeriousDifficultyHearing'], normalize=True)\n", - "joint_table" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Finally, what's the probability of someone having hearing problems, given that they are over 70 years of age? To do this, we limit the computation of the probability of having hearing problems to only include those people who are over 70:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "p_hearingproblem_given_over_70 = hearing_data.query('Over70 == True')['HaveSeriousDifficultyHearing'].mean()\n", - "p_hearingproblem_given_over_70" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Now compute the opposite: What is the probability of being over 70 given that one has a hearing problem?" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "p_over_70_given_hearingproblem = hearing_data.query('HaveSeriousDifficultyHearing == True')['Over70'].mean()\n", - "p_over_70_given_hearingproblem" - ] - } - ], - "metadata": { - "jupytext": { - "formats": "ipynb,py:percent" - }, - "kernelspec": { - "display_name": "Python 3", - "language": "python", - "name": "python3" - } - }, - "nbformat": 4, - "nbformat_minor": 4 -} + "nbformat": 4, + "nbformat_minor": 0 +} \ No newline at end of file diff --git a/notebooks/06-Sampling.ipynb b/notebooks/06-Sampling.ipynb index 7223dd8..d0697f3 100644 --- a/notebooks/06-Sampling.ipynb +++ b/notebooks/06-Sampling.ipynb @@ -1,301 +1,386 @@ { - "cells": [ - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Sampling\n", - "In this chapter we will learn how to use Python to understand sampling and sampling error.\n", - "\n", - "## Sampling error\n", - "Here we will repeatedly sample from the NHANES Height variable in order to obtain the sampling distribution of the mean. First let's load the data and clean them up." - ] + "cells": [ + { + "cell_type": "markdown", + "metadata": { + "id": "MOg8ZuPL7SjT" + }, + "source": [ + "# Sampling\n", + "In this chapter we will learn how to use Python to understand sampling and sampling error.\n", + "\n", + "## Sampling error\n", + "Here we will repeatedly sample from the NHANES Height variable in order to obtain the sampling distribution of the mean. First let's load the data and clean them up." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "lines_to_next_cell": 2, + "id": "FEFht0qo7SjV" + }, + "outputs": [], + "source": [ + "! pip install nhanes\n", + "from nhanes.load import load_NHANES_data\n", + "nhanes_data = load_NHANES_data()\n", + "adult_nhanes_data = nhanes_data.query('AgeInYearsAtScreening > 17')\n", + "adult_nhanes_data = adult_nhanes_data.dropna(subset=['StandingHeightCm']).rename(columns={'StandingHeightCm': 'Height'})" + ] + }, + { + "cell_type": "markdown", + "source": [ + "Now let's draw a sample of 50 individuals from the dataset, and calculate its mean.\n", + "Try executing the next cell repeatedly. What do you see?" + ], + "metadata": { + "id": "t_pKb6uq7qsX" + } + }, + { + "cell_type": "code", + "source": [ + "sample_size = 50\n", + "sample = adult_nhanes_data.sample(sample_size)\n", + "print('Sample mean:', sample['Height'].mean())\n", + "print('Sample standard deviation:', sample['Height'].std())" + ], + "metadata": { + "id": "FN_DN2Lo7qCb" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Rak5pDws7SjW" + }, + "source": [ + "Now let's repeatedly sample 50 individuals from the dataset, compute the mean, and store the resulting values. For this we are going to use a *for loop*, which allows us to repeatedly perform a particular set of actions." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "s_gDIauW7SjW" + }, + "outputs": [], + "source": [ + "#+\n", + "sample_size = 50\n", + "num_samples = 5000\n", + "\n", + "import pandas as pd\n", + "import numpy as np\n", + "\n", + "# set up a variable to store the result\n", + "sampling_results = pd.DataFrame({'mean': np.zeros(num_samples)})\n", + "print('An empty data frame to be filled with sampling means:')\n", + "print(sampling_results)\n", + "for sample_num in range(num_samples):\n", + " sample = adult_nhanes_data.sample(sample_size)\n", + " sampling_results.loc[sample_num, 'mean'] = sample['Height'].mean()\n", + "#-\n", + "print('Means of 5000 samples:')\n", + "print(sampling_results)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "kNLpg7uL7SjW" + }, + "source": [ + "Now let's plot the sampling distribution. We will also overlay the sampling distribution of the mean predicted on the basis of the population mean and standard deviation, to show that it properly describes the actual sampling distribution. We also place a vertical line at the population mean."
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "cfWIzi1j7SjW" + }, + "outputs": [], + "source": [ + "#+\n", + "import matplotlib.pyplot as plt\n", + "import numpy as np\n", + "import scipy.stats\n", + "import seaborn as sns\n", + "\n", + "hist = plt.hist(sampling_results['mean'], 100, density=True)\n", + "# hist[0] contains the histogram data\n", + "# we need to use the maximum of those data to set\n", + "# the height of the vertical line that shows the mean\n", + "plt.axvline(x=adult_nhanes_data['Height'].mean(),\n", + " ymax=1, color='k')\n", + "\n", + "# draw the normal distribution with same mean and standard deviation\n", + "# as the sampling distribution\n", + "hist_bin_min = np.min(hist[1])\n", + "hist_bin_max = np.max(hist[1])\n", + "step_size = 0.01\n", + "x_values = np.arange(hist_bin_min, hist_bin_max, step_size)\n", + "normal_values = scipy.stats.norm.pdf(\n", + " x_values,\n", + " loc=sampling_results['mean'].mean(),\n", + " scale=sampling_results['mean'].std())\n", + "plt.plot(x_values, normal_values, color='r')\n", + "#+\n", + "print('standard deviation of the sample means:', sampling_results['mean'].std())" + ] + }, + { + "cell_type": "markdown", + "source": [ + "Now, can you redo the simulation of sampling above, but make the following changes each time?\n", + "\n", + "- Changing the sample size to 5 or 500. What difference do you observe in the distribution of sample means?\n", + "\n", + "- Changing the number of times to draw the samples to 50,000. Does the histogram appear closer to a normal distribution?" + ], + "metadata": { + "id": "p5J5iklPDqhu" + } + }, + { + "cell_type": "markdown", + "metadata": { + "id": "NCHSLEpH7SjW" + }, + "source": [ + "## Central limit theorem\n", + "The central limit theorem tells us that the sampling distribution of the mean becomes normal as the sample size grows. Let's test this by sampling a clearly non-normal variable and look at the normality of the results using a Q-Q plot. For example, let's look at the variable that represents annual family income. This variable is oddly distributed:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "lines_to_next_cell": 2, + "id": "IWfLWmCo7SjX" + }, + "outputs": [], + "source": [ + "plt.hist(adult_nhanes_data['AnnualFamilyIncome'])\n", + "plt.show()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "lines_to_next_cell": 2, + "id": "MpwtW2qq7SjX" + }, + "source": [ + "This odd distribution comes in part from the how the variable is coded, as shown [here](https://wwwn.cdc.gov/Nchs/Nhanes/2017-2018/DEMO_J.htm#INDFMIN2). Let's resample this variable 5000 times, compute the mean, and examine the distribution. 
To do this, we will create a function that resamples and returns the mean:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "lines_to_next_cell": 2, + "id": "TsMEGZm07SjX" + }, + "outputs": [], + "source": [ + "def sample_and_return_mean(df, variable_name,\n", + " sample_size=250, num_samples=5000):\n", + " \"\"\"\n", + " repeatedly take samples from a particular variable in a data frame\n", + " and compute the mean\n", + "\n", + " Parameters:\n", + " -----------\n", + " df: data frame containing the data\n", + " variable_name: the name of the variable to be analyzed\n", + " sample_size: the number of observations to sample each time\n", + " num_samples: the number of samples to take\n", + "\n", + " Returns:\n", + " --------\n", + " sampling_distribution: data frame containing the means\n", + " \"\"\"\n", + " sampling_distribution = pd.DataFrame({'mean': np.zeros(num_samples)})\n", + " for sample_number in range(num_samples):\n", + " sample_df = df.sample(sample_size)\n", + " sampling_distribution.loc[sample_number, 'mean'] = sample_df[variable_name].mean()\n", + " return(sampling_distribution)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "lines_to_next_cell": 2, + "id": "os---mao7SjX" + }, + "source": [ + "Now, using this function, let's compute the sampling distribution for the annual family income variable and plot its histogram." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "lines_to_next_cell": 2, + "id": "VYUT9Zo97SjX" + }, + "outputs": [], + "source": [ + "adult_income_data = adult_nhanes_data.dropna(subset=['AnnualFamilyIncome'])\n", + "family_income_sampling_dist = sample_and_return_mean(adult_income_data, 'AnnualFamilyIncome')\n", + "_ = plt.hist(family_income_sampling_dist['mean'], 100)\n", + "plt.show()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "O3FH7bGx7SjX" + }, + "source": [ + "This distribution looks nearly normal. We can also use a quantile-quantile, or \"Q-Q\" plot, to examine this. \n", + "\n", + "Quantile means the value below which certain percentage of all the scores are distributed. 5 percentile means 5% of the score is below this value. If two distributions are of the same shape, then their corresponding percentiles should form a linear relationship.\n", + "\n", + "We will plot two Q-Q plots; on the left we plot one for the original data, and on the right we plot one for the sampling distribution of the mean." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "UtXksTcP7SjX" + }, + "outputs": [], + "source": [ + "\n", + "plt.figure(figsize=(12, 6))\n", + "plt.subplot(1, 2, 1)\n", + "scipy.stats.probplot(adult_income_data['AnnualFamilyIncome'], plot=sns.mpl.pyplot)\n", + "plt.title('Original data')\n", + "\n", + "plt.subplot(1, 2, 2)\n", + "scipy.stats.probplot(family_income_sampling_dist['mean'], plot=sns.mpl.pyplot)\n", + "plt.title('Sampling distribution')" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "lines_to_next_cell": 2, + "id": "3A8vxnHh7SjX" + }, + "source": [ + "We see that the raw data are highly non-normal, evidenced by the fact that the data values diverge greatly from the unit line. On the other hand, the sampling distribution looks much more normally distributed.\n", + "\n", + "## Confidence intervals\n", + "\n", + "Remember that confidence intervals are intervals that will contain the population parameter in a certain proportion of samples from the population. 
In this example we will walk through [the simulation that was presented in the book](https://statsthinking21.github.io/statsthinking21-core-site/sampling.html#confidence-intervals) to show that this actually works properly. To do this, let's create a function that takes a sample from the NHANES population and returns the confidence interval for the mean of the `Height` variable within that sample. We will use the t distribution to obtain our confidence intervals." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "7JI7ZczG7SjY" + }, + "outputs": [], + "source": [ + "def get_confidence_interval(df, variable_name,\n", + " ci_percent=95,\n", + " sample_size=50):\n", + " sample_df = df.sample(sample_size)\n", + " mean = sample_df[variable_name].mean()\n", + " std = sample_df[variable_name].std()\n", + " sem = std / np.sqrt(sample_size)\n", + " t_tail_proportion = 1 - ((100 - ci_percent) / 100) / 2\n", + " t_cutoff = scipy.stats.t.ppf(t_tail_proportion, sample_size - 1)\n", + " upper_ci = mean + sem * t_cutoff\n", + " lower_ci = mean - sem * t_cutoff\n", + " return([lower_ci, upper_ci])" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "tUwfOvxi7SjY" + }, + "source": [ + "Using this function, let's resample the data 1000 times and look at how often the resulting interval contains the population mean." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "PsmQDGkJ7SjY" + }, + "outputs": [], + "source": [ + "\n", + "num_runs = 1000\n", + "\n", + "ci_df = pd.DataFrame({'lower': np.zeros(num_runs),\n", + " 'upper': np.zeros(num_runs)})\n", + "\n", + "for i in range(num_runs):\n", + " ci_df.iloc[i, :] = get_confidence_interval(\n", + " adult_nhanes_data,\n", + " 'Height'\n", + " )" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "0KH1rOfq7SjY" + }, + "source": [ + "Now we need to compute the proportion of confidence intervals that capture the population mean (which we know because we are treating the entire NHANES dataset as our population). Here we will use a trick that relies upon the fact that Python treats `True`/`False` identically to one and zero, respectively. We will test for each of the confidence limits (upper and lower) whether it captures the population mean, and then we will multiply those two series of values together. This will create a new variable that is True only if both limits capture the population mean. We then simply take the mean of those truth values to compute the proportion of confidence intervals that capture the mean." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "Xei96nBr7SjY" + }, + "outputs": [], + "source": [ + "ci_df['captures_mean'] = (ci_df['lower'] < adult_nhanes_data['Height'].mean()) * (ci_df['upper'] > adult_nhanes_data['Height'].mean())\n", + "\n", + "ci_df['captures_mean'].mean()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "WQQ_1csO7SjY" + }, + "source": [ + "This number should be very close to 0.95."
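+ ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "As an optional extra check (an added sketch, not part of the original notebook), we can reuse the `get_confidence_interval` function defined above to see how the width of the interval shrinks as the sample size grows. The exact values will vary from run to run, since each call draws a new random sample." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# added sketch: confidence interval width as a function of sample size\n", + "for n in [25, 100, 400]:\n", + "    lower, upper = get_confidence_interval(adult_nhanes_data, 'Height', sample_size=n)\n", + "    print(f'n = {n}: 95% CI width = {upper - lower:.2f} cm')"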
+ ] + } + ], + "metadata": { + "jupytext": { + "formats": "ipynb,py:percent" + }, + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "colab": { + "provenance": [] + } }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "lines_to_next_cell": 2 - }, - "outputs": [], - "source": [ - "\n", - "from nhanes.load import load_NHANES_data\n", - "nhanes_data = load_NHANES_data()\n", - "adult_nhanes_data = nhanes_data.query('AgeInYearsAtScreening > 17')\n", - "adult_nhanes_data = adult_nhanes_data.dropna(subset=['StandingHeightCm']).rename(columns={'StandingHeightCm': 'Height'})" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Now let's repeatedly sample 50 individuals from the dataset, compute the mean, and store the resulting values. For this we are going to use a *for loop*, which allows us to repeatedly perform a particular set of actions." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "#+\n", - "sample_size = 50\n", - "num_samples = 5000\n", - "\n", - "import pandas as pd\n", - "import numpy as np\n", - "\n", - "# set up a variable to store the result\n", - "sampling_results = pd.DataFrame({'mean': np.zeros(num_samples)})\n", - "\n", - "for sample_num in range(num_samples):\n", - " sample = adult_nhanes_data.sample(sample_size)\n", - " sampling_results.loc[sample_num, 'mean'] = sample['Height'].mean()\n", - "#-" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Now let's plot the sampling distribution. We will also overlay the sampling distribution of the mean predicted on the basis of the population mean and standard deviation, to show that it properly describes the actual sampling distribution. We also place a vertical line at the population mean." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "#+\n", - "import matplotlib.pyplot as plt\n", - "import numpy as np\n", - "import scipy.stats\n", - "import seaborn as sns\n", - "\n", - "hist = plt.hist(sampling_results['mean'], 100, density=True)\n", - "# hist[0] contains the histogram data\n", - "# we need to use the maximum of those data to set\n", - "# the height of the vertical line that shows the mean\n", - "plt.axvline(x=adult_nhanes_data['Height'].mean(),\n", - " ymax=1, color='k')\n", - "\n", - "# draw the normal distribution with same mean and standard deviation\n", - "# as the sampling distribution\n", - "hist_bin_min = np.min(hist[1])\n", - "hist_bin_max = np.max(hist[1])\n", - "step_size = 0.01\n", - "x_values = np.arange(hist_bin_min, hist_bin_max, step_size)\n", - "normal_values = scipy.stats.norm.pdf(\n", - " x_values,\n", - " loc=sampling_results['mean'].mean(),\n", - " scale=sampling_results['mean'].std())\n", - "plt.plot(x_values, normal_values, color='r')\n", - "#+" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Central limit theorem\n", - "The central limit theorem tells us that the sampling distribution of the mean becomes normal as the sample size grows. Let's test this by sampling a clearly non-normal variable and look at the normality of the results using a Q-Q plot. For example, let's look at the variable that represents annual family income. 
This variable is oddly distributed:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "lines_to_next_cell": 2 - }, - "outputs": [], - "source": [ - "plt.hist(adult_nhanes_data['AnnualFamilyIncome'])" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "lines_to_next_cell": 2 - }, - "source": [ - "This odd distribution comes in part from the how the variable is coded, as shown [here](https://wwwn.cdc.gov/Nchs/Nhanes/2017-2018/DEMO_J.htm#INDFMIN2). Let's resample this variable 5000 times, compute the mean, and examine the distribution. To do this, we will create a function that resamples and returns the mean:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "lines_to_next_cell": 2 - }, - "outputs": [], - "source": [ - "def sample_and_return_mean(df, variable_name, \n", - " sample_size=250, num_samples=5000):\n", - " \"\"\"\n", - " repeatedly take samples from a particular variable in a data frame\n", - " and compute the mean\n", - "\n", - " Parameters:\n", - " -----------\n", - " df: data frame containing the data\n", - " variable_name: the name of the variable to be analyzed\n", - " sample_size: the number of observations to sample each time\n", - " num_samples: the number of samples to take\n", - "\n", - " Returns:\n", - " --------\n", - " sampling_distribution: data frame containing the means\n", - " \"\"\"\n", - " sampling_distribution = pd.DataFrame({'mean': np.zeros(num_samples)})\n", - " for sample_number in range(num_samples):\n", - " sample_df = df.sample(sample_size)\n", - " sampling_distribution.loc[sample_number, 'mean'] = sample_df[variable_name].mean()\n", - " return(sampling_distribution)" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "lines_to_next_cell": 2 - }, - "source": [ - "Now, using this function, let's compute the sampling distribution for the annual family income variable and plot its histogram. " - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "lines_to_next_cell": 2 - }, - "outputs": [], - "source": [ - "adult_income_data = adult_nhanes_data.dropna(subset=['AnnualFamilyIncome'])\n", - "family_income_sampling_dist = sample_and_return_mean(adult_income_data, 'AnnualFamilyIncome')\n", - "_ = plt.hist(family_income_sampling_dist['mean'], 100)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "This distribution looks nearly normal. We can also use a quantile-quantile, or \"Q-Q\" plot, to examine this. We will plot two Q-Q plots; on the left we plot one for the original data, and on the right we plot one for the sampling distribution of the mean." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "\n", - "plt.figure(figsize=(12, 6))\n", - "plt.subplot(1, 2, 1)\n", - "scipy.stats.probplot(adult_income_data['AnnualFamilyIncome'], plot=sns.mpl.pyplot)\n", - "plt.title('Original data')\n", - "\n", - "plt.subplot(1, 2, 2)\n", - "scipy.stats.probplot(family_income_sampling_dist['mean'], plot=sns.mpl.pyplot)\n", - "plt.title('Sampling distribution')" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "lines_to_next_cell": 2 - }, - "source": [ - "We see that the raw data are highly non-normal, evidenced by the fact that the data values diverge greatly from the unit line. 
On the other hand, the sampling distribution looks much more normally distributed.\n", - "\n", - "## Confidence intervals\n", - "\n", - "Remember that confidence intervals are intervals that will contain the population parameter in a certain proportion of samples from the population. In this example we will walk through [the simulation that was presented in the book](https://statsthinking21.github.io/statsthinking21-core-site/sampling.html#confidence-intervals) to show that this actually works properly. To do this, let's create a function that takes a sample from the NHANES population and returns the confidence interval for the mean of the `Height` variable within that sample. We will use the t distribution to obtain our confidence intervals." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "def get_confidence_interval(df, variable_name, \n", - " ci_percent=95,\n", - " sample_size=50):\n", - " sample_df = df.sample(sample_size)\n", - " mean = sample_df[variable_name].mean()\n", - " std = sample_df[variable_name].std()\n", - " sem = std / np.sqrt(sample_size)\n", - " t_tail_proportion = 1 - ((100 - ci_percent) / 100) / 2\n", - " t_cutoff = scipy.stats.t.ppf(t_tail_proportion, sample_size - 1)\n", - " upper_ci = mean + sem * t_cutoff\n", - " lower_ci = mean - sem * t_cutoff\n", - " return([lower_ci, upper_ci])" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Using this function, let's resample the data 1000 times and look how often the resulting interval contains the population mean." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "\n", - "num_runs = 1000\n", - "\n", - "ci_df = pd.DataFrame({'lower': np.zeros(num_runs),\n", - " 'upper': np.zeros(num_runs)})\n", - "\n", - "for i in range(num_runs):\n", - " ci_df.iloc[i, :] = get_confidence_interval(\n", - " adult_nhanes_data,\n", - " 'Height'\n", - " )" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Now we need to compute the proportion of confidence intervals that capture the population mean (which we know because we are treating the entire NHANES dataset as our population). Here we will use a trick that relies upon the fact that Python treat `True`/`False` identically to one and zero respectively. We will test for each of the confidence limits (upper and lower) whether it captures the population mean, and then we will multiply those two series of values together. This will create a new variable that is True only if both limits capture the population mean. We then simply take the mean of those truth values to compute the poportion of confidence intervals that capture the mean." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "ci_df['captures_mean'] = (ci_df['lower'] < adult_nhanes_data['Height'].mean()) * (ci_df['upper'] > adult_nhanes_data['Height'].mean())\n", - "\n", - "ci_df['captures_mean'].mean()" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "This number should be very close to 0.95." 
- ] - } - ], - "metadata": { - "jupytext": { - "formats": "ipynb,py:percent" - }, - "kernelspec": { - "display_name": "Python 3", - "language": "python", - "name": "python3" - } - }, - "nbformat": 4, - "nbformat_minor": 4 -} + "nbformat": 4, + "nbformat_minor": 0 +} \ No newline at end of file diff --git a/notebooks/07-ResamplingAndSimulation.ipynb b/notebooks/07-ResamplingAndSimulation.ipynb index 9571ae6..012db7c 100644 --- a/notebooks/07-ResamplingAndSimulation.ipynb +++ b/notebooks/07-ResamplingAndSimulation.ipynb @@ -1,226 +1,254 @@ { - "cells": [ - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Resampling and simulation\n", - "\n", - "## Generating random samples\n", - "Here we will generate random samples from a number of different distributions and plot their histograms. We could write out separate commands to plot each of our functions of interest, but that would involve repeating a lot of code, so instead we will take advantage of the fact that Python allows us to treat modules as variables. We will specify the module that creates each distribution, and then loop through them, each time incrementing the panel number. Some distributions also take specific parameters; for example, the Chi-squared distribution requires specifying the degrees of freedom. We will store those in a separate dictionary and use them as needed." - ] + "cells": [ + { + "cell_type": "markdown", + "metadata": { + "id": "iYBJesVTuILl" + }, + "source": [ + "# Resampling and simulation\n", + "\n", + "## Generating random samples\n", + "Here we will generate random samples from a number of different distributions and plot their histograms. We could write out separate commands to plot each of our functions of interest, but that would involve repeating a lot of code, so instead we will take advantage of the fact that Python allows us to treat modules as variables. We will specify the module that creates each distribution, and then loop through them, each time incrementing the panel number. Some distributions also take specific parameters; for example, the Chi-squared distribution requires specifying the degrees of freedom. We will store those in a separate dictionary and use them as needed." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "lines_to_next_cell": 2, + "id": "SFat-Ll-uILm" + }, + "outputs": [], + "source": [ + "import scipy.stats\n", + "import matplotlib.pyplot as plt\n", + "\n", + "num_samples = 20000\n", + "\n", + "plt.figure(figsize=(8, 8))\n", + "\n", + "generators = {'Uniform': scipy.stats.uniform,\n", + " 'Normal': scipy.stats.norm,\n", + " 'Exponential': scipy.stats.expon,\n", + " 'Chi-squared': scipy.stats.chi2}\n", + "\n", + "generator_parameters = {'Chi-squared': 10}\n", + "panel_num = 1\n", + "for distribution in generators:\n", + " plt.subplot(2, 2, panel_num)\n", + " if distribution in generator_parameters:\n", + " sample = generators[distribution].rvs(\n", + " generator_parameters[distribution], size=num_samples)\n", + " else:\n", + " sample = generators[distribution].rvs(size=num_samples)\n", + " plt.hist(sample, bins=100, density=True)\n", + " plt.title(distribution)\n", + " plt.xlabel('Value')\n", + " plt.ylabel('Density')\n", + " # the following function prevents the labels from overlapping\n", + " plt.tight_layout()\n", + " panel_num += 1\n", + "plt.show()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "sXRGnsxvuILm" + }, + "source": [ + "## Simulating the maximum finishing time\n", + "Let's simulate 5000 samples of 150 observations, collecting the maximum value from each sample, and then plotting the distribution of maxima." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "lines_to_next_cell": 2, + "id": "ovEu_Lm_uILn" + }, + "outputs": [], + "source": [ + "import numpy as np\n", + "import pandas as pd\n", + "\n", + "num_runs = 5000\n", + "sample_size = 150\n", + "\n", + "\n", + "def sample_and_return_max(sample_size,\n", + " distribution=None):\n", + " \"\"\"\n", + " function to sample from a distribution and return maximum\n", + " \"\"\"\n", + "\n", + " # if distribution is not specified, then use the normal\n", + " if distribution is None:\n", + " distribution = scipy.stats.norm\n", + "\n", + " sample = distribution.rvs(size=sample_size)\n", + " return(np.max(sample))\n", + "\n", + "\n", + "sample_max_df = pd.DataFrame({'max': np.zeros(num_runs)})\n", + "\n", + "for i in range(num_runs):\n", + " sample_max_df.loc[i, 'max'] = sample_and_return_max(sample_size, distribution=scipy.stats.norm(loc=5,scale=1))" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "lines_to_next_cell": 2, + "id": "Lxa9SHX7uILn" + }, + "source": [ + "Now let's find the 99th percentile of the maximum distribution. There is a built-in function in the `scipy.stats` module, called `scoreatpercentile`, that will do this for us:\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "lines_to_next_cell": 2, + "id": "YsMZgfXCuILn" + }, + "outputs": [], + "source": [ + "cutoff = scipy.stats.scoreatpercentile(sample_max_df['max'], 99)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "MD96LFTWuILn" + }, + "source": [ + "Plot the histogram of the maximum values, along with a vertical line at the 99th percentile cutoff that we just computed."
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "lines_to_next_cell": 2, + "id": "ekKmMab-uILo" + }, + "outputs": [], + "source": [ + "hist = plt.hist(sample_max_df['max'], bins=100)\n", + "plt.ylabel('Count')\n", + "plt.xlabel('Maximum value')\n", + "_ = plt.axvline(x=cutoff, ymax=np.max(hist[0]), color='k')\n", + "plt.show()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "TO7rg7DWuILo" + }, + "source": [ + "## The bootstrap\n", + "The bootstrap is useful for creating confidence intervals in cases where we don't have a parametric sampling distribution; one example is the median. Here we will implement it by hand to see more closely how it works. We will start by collecting a sample of individuals from the NHANES dataset, and then use the bootstrap to obtain confidence intervals on the mean of the Height variable, which lets us compare the bootstrap result to the parametric sampling distribution of the mean." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "AzTS0T5ruILo" + }, + "outputs": [], + "source": [ + "#+\n", + "! pip install nhanes\n", + "from nhanes.load import load_NHANES_data\n", + "nhanes_data = load_NHANES_data()\n", + "adult_nhanes_data = nhanes_data.query('AgeInYearsAtScreening > 17')\n", + "adult_nhanes_data = adult_nhanes_data.dropna(subset=['StandingHeightCm']).rename(columns={'StandingHeightCm': 'Height'})\n", + "\n", + "num_runs = 5000\n", + "sample_size = 100\n", + "\n", + "# Take a sample for which we will perform the bootstrap\n", + "\n", + "nhanes_sample = adult_nhanes_data.sample(sample_size)\n", + "\n", + "# Perform the resampling\n", + "\n", + "bootstrap_df = pd.DataFrame({'mean': np.zeros(num_runs)})\n", + "for sampling_run in range(num_runs):\n", + " bootstrap_sample = nhanes_sample.sample(sample_size, replace=True)\n", + " bootstrap_df.loc[sampling_run, 'mean'] = bootstrap_sample['Height'].mean()\n", + "\n", + "# Compute the 2.5% and 97.5% percentiles of the distribution\n", + "\n", + "\n", + "bootstrap_ci = [scipy.stats.scoreatpercentile(bootstrap_df['mean'], 2.5),\n", + " scipy.stats.scoreatpercentile(bootstrap_df['mean'], 97.5)]\n", + "\n", + "#-" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "MwlWGg6quILo" + }, + "source": [ + "Let's compare the bootstrap distribution to the sampling distribution that we would expect given the sample mean and standard deviation:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "5LE9i0KyuILp" + }, + "outputs": [], + "source": [ + "hist = plt.hist(bootstrap_df['mean'], 100, density=True)\n", + "\n", + "hist_bin_min = np.min(hist[1])\n", + "hist_bin_max = np.max(hist[1])\n", + "step_size = 0.01\n", + "x_values = np.arange(hist_bin_min, hist_bin_max, step_size)\n", + "normal_values = scipy.stats.norm.pdf(\n", + " x_values,\n", + " loc=nhanes_sample['Height'].mean(),\n", + " scale=nhanes_sample['Height'].std()/np.sqrt(sample_size))\n", + "plt.plot(x_values, normal_values, color='r')\n", + "plt.legend([' Normal distribution based on sample mean and SEM','Means of bootstrap samples'])\n", + "plt.xlabel('Height (cm)')\n", + "plt.show()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "AsrX_9NhuILp" + }, + "source": [ + "This shows that the bootstrap sampling distribution does a good job of recapitulating the theoretical sampling distribution in this case."
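+ ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "As an added illustration (a sketch that is not part of the original notebook), we can also print the percentile-based bootstrap confidence interval computed above (`bootstrap_ci`) and compare it to a normal-theory interval based on the sample mean and its standard error; for a variable like height the two should be quite similar." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# added sketch: compare the percentile bootstrap CI computed above with a\n", + "# normal-theory 95% CI based on the sample mean and its standard error\n", + "sem = nhanes_sample['Height'].std() / np.sqrt(sample_size)\n", + "normal_ci = scipy.stats.norm.interval(0.95, loc=nhanes_sample['Height'].mean(), scale=sem)\n", + "print('Bootstrap 95% CI:', bootstrap_ci)\n", + "print('Normal-theory 95% CI:', normal_ci)"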
+ ] + } + ], + "metadata": { + "jupytext": { + "formats": "ipynb,py:percent" + }, + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "colab": { + "provenance": [] + } }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "lines_to_next_cell": 2 - }, - "outputs": [], - "source": [ - "import scipy.stats\n", - "import matplotlib.pyplot as plt\n", - "\n", - "num_samples = 10000\n", - "\n", - "plt.figure(figsize=(8, 8))\n", - "\n", - "generators = {'Uniform': scipy.stats.uniform,\n", - " 'Normal': scipy.stats.norm, \n", - " 'Exponential': scipy.stats.expon,\n", - " 'Chi-squared': scipy.stats.chi2}\n", - "\n", - "generator_parameters = {'Chi-squared': 10}\n", - "panel_num = 1\n", - "for distribution in generators:\n", - " plt.subplot(2, 2, panel_num)\n", - " if distribution in generator_parameters:\n", - " sample = generators[distribution].rvs(\n", - " generator_parameters[distribution], size=num_samples)\n", - " else:\n", - " sample = generators[distribution].rvs(size=num_samples)\n", - " plt.hist(sample, bins=100)\n", - " plt.title(distribution)\n", - " plt.xlabel('Value')\n", - " plt.ylabel('Density')\n", - " # the following function prevents the labels from overlapping\n", - " plt.tight_layout()\n", - " panel_num += 1" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Simulating the maximum finishing time\n", - "Let's simulate 5000 samples of 150 observations, collecting the maximum value from each sample, and then plotting the distribution of maxima." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "lines_to_next_cell": 2 - }, - "outputs": [], - "source": [ - "import numpy as np\n", - "import pandas as pd\n", - "\n", - "num_runs = 5000\n", - "sample_size = 150\n", - "\n", - "\n", - "def sample_and_return_max(sample_size, \n", - " distribution=None):\n", - " \"\"\"\n", - " function to sample from a distribution and return maximum\n", - " \"\"\"\n", - "\n", - " # if distribution is not specified, then use the normal\n", - " if distribution is None:\n", - " distribution = scipy.stats.norm\n", - " \n", - " sample = distribution.rvs(size=sample_size)\n", - " return(np.max(sample))\n", - "\n", - "\n", - "sample_max_df = pd.DataFrame({'max': np.zeros(num_runs)})\n", - "\n", - "for i in range(num_runs):\n", - " sample_max_df.loc[i, 'max'] = sample_and_return_max(sample_size)" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "lines_to_next_cell": 2 - }, - "source": [ - "Now let's find the 95th percentile of the maximum distriibution. There is a built-in function in the `scipy.stats` module, called `scoreatpercentile` that will do this for us:\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "lines_to_next_cell": 2 - }, - "outputs": [], - "source": [ - "cutoff = scipy.stats.scoreatpercentile(sample_max_df['max'], 95)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Plot the histogram of the maximum values, along with a vertical line at the 95th percentile." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "lines_to_next_cell": 2 - }, - "outputs": [], - "source": [ - "hist = plt.hist(sample_max_df['max'], bins=100)\n", - "plt.ylabel('Count')\n", - "plt.xlabel('Maximum value')\n", - "_ = plt.axvline(x=cutoff, ymax=np.max(hist[0]), color='k')" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## The bootstrap\n", - "The bootstrap is useful for creating confidence intervals in cases where we don't have a parametric distribution. One example is for the median; let's look at how that works. We will start by implementing it by hand, to see more closely how it works. We will start by collecting a sample of individuals from the NHANES dataset, and the using the bootstrap to obtain confidence intervals on the median for the Height variable." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "#+\n", - "from nhanes.load import load_NHANES_data\n", - "nhanes_data = load_NHANES_data()\n", - "adult_nhanes_data = nhanes_data.query('AgeInYearsAtScreening > 17')\n", - "adult_nhanes_data = adult_nhanes_data.dropna(subset=['StandingHeightCm']).rename(columns={'StandingHeightCm': 'Height'})\n", - "\n", - "num_runs = 5000\n", - "sample_size = 100\n", - "\n", - "# Take a sample for which we will perform the bootstrap\n", - "\n", - "nhanes_sample = adult_nhanes_data.sample(sample_size)\n", - "\n", - "# Perform the resampling\n", - "\n", - "bootstrap_df = pd.DataFrame({'mean': np.zeros(num_runs)})\n", - "for sampling_run in range(num_runs):\n", - " bootstrap_sample = nhanes_sample.sample(sample_size, replace=True)\n", - " bootstrap_df.loc[sampling_run, 'mean'] = bootstrap_sample['Height'].mean()\n", - "\n", - "# Compute the 2.5% and 97.5% percentiles of the distribution\n", - "\n", - "\n", - "bootstrap_ci = [scipy.stats.scoreatpercentile(bootstrap_df['mean'], 2.5),\n", - " scipy.stats.scoreatpercentile(bootstrap_df['mean'], 97.5)]\n", - "\n", - "#-" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Let's compare the bootstrap distribution to the sampling distribution that we would expect given the sample mean and standard deviation:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# hist = plt.hist(bootstrap_df['mean'], 100, density=True)\n", - "#\n", - "# hist_bin_min = np.min(hist[1])\n", - "# hist_bin_max = np.max(hist[1])\n", - "# step_size = 0.01\n", - "# x_values = np.arange(hist_bin_min, hist_bin_max, step_size)\n", - "# normal_values = scipy.stats.norm.pdf(\n", - "# x_values,\n", - "# loc=nhanes_sample['Height'].mean(),\n", - "# scale=nhanes_sample['Height'].std()/np.sqrt(sample_size))\n", - "# plt.plot(x_values, normal_values, color='r')\n", - "#\n", - "#" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "This shows that the bootstrap sampling distrbution does a good job of recapitulating the theoretical sampling distribution in this case." 
- ] - } - ], - "metadata": { - "jupytext": { - "formats": "ipynb,py:percent" - }, - "kernelspec": { - "display_name": "Python 3", - "language": "python", - "name": "python3" - } - }, - "nbformat": 4, - "nbformat_minor": 4 -} + "nbformat": 4, + "nbformat_minor": 0 +} \ No newline at end of file diff --git a/notebooks/08-HypothesisTesting.ipynb b/notebooks/08-HypothesisTesting.ipynb index a908050..07a6fde 100644 --- a/notebooks/08-HypothesisTesting.ipynb +++ b/notebooks/08-HypothesisTesting.ipynb @@ -1,179 +1,275 @@ { - "cells": [ - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Hypothesis testing in Python\n", - "In this chapter we will present several examples of using Python to perform hypothesis testing.\n", - "\n", - "## Simple example: Coin-flipping\n", - "Let's say that we flipped 100 coins and observed 70 heads. We would like to use these data to test the hypothesis that the true probability is 0.5.\n", - "First let's generate our data, simulating 100,000 sets of 100 flips. We use such a large number because it turns out that it's very rare to get 70 heads, so we need many attempts in order to get a reliable estimate of these probabilties. This will take a couple of minutes to complete." - ] + "cells": [ + { + "cell_type": "markdown", + "metadata": { + "id": "yQHZoAndW5zs" + }, + "source": [ + "# Hypothesis testing in Python\n", + "In this chapter we will present several examples of using Python to perform hypothesis testing.\n", + "\n", + "## Simple example: Coin-flipping\n", + "Let's say that we flipped 100 coins and observed 70 heads. We would like to use these data to test the hypothesis that the true probability is 0.5.\n", + "First let's generate our data, simulating 200,000 sets of 100 flips. We use such a large number because it turns out that it's very rare to get 70 heads, so we need many attempts in order to get a reliable estimate of these probabilities. This will take a couple of minutes to complete." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "lines_to_next_cell": 2, + "id": "zGgG23i1W5zu" + }, + "outputs": [], + "source": [ + "\n", + "import numpy as np\n", + "import pandas as pd\n", + "\n", + "num_runs = 200000\n", + "\n", + "\n", + "def toss_coins_and_count_heads(num_coins=100, p_heads=0.5):\n", + " \"\"\"\n", + " flip a coin num_coins times and return number of heads\n", + " \"\"\"\n", + "\n", + " flips = np.random.rand(num_coins) > (1 - p_heads)\n", + " return(np.sum(flips))\n", + "\n", + "\n", + "flip_results_df = pd.DataFrame({'n_heads': np.zeros(num_runs)})\n", + "\n", + "for run in range(num_runs):\n", + " flip_results_df.loc[run, 'n_heads'] = toss_coins_and_count_heads()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Yddd8XBaW5zv" + }, + "source": [ + "Now we can compute the proportion of simulated samples in which 70 or more heads were observed, given that the true probability of heads is 0.5.
" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "iFKqf2rFW5zv" + }, + "outputs": [], + "source": [ + "import scipy.stats\n", + "\n", + "pvalue = 100 - scipy.stats.percentileofscore(flip_results_df['n_heads'], 70)\n", + "print(pvalue)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "5bEDSYZuW5zv" + }, + "source": [ + "For comparison, we can also compute the p-value for 70 or more heads based on a null hypothesis of $P_{heads}=0.5$, using the binomial distribution.\n", + "\n", + "\n", + "compute the probability of 69 or fewer heads, when P(heads)=0.5" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "F_QCV9l7W5zv" + }, + "outputs": [], + "source": [ + "\n", + "p_lt_70 = scipy.stats.binom.cdf(k=69, n=100, p=0.5)\n", + "p_lt_70" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "lines_to_next_cell": 0, + "id": "hQeCiqjHW5zv" + }, + "source": [ + "the probability of 70 or more heads is simply the complement of p_lt_70" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "lines_to_next_cell": 0, + "id": "xgO_h8PNW5zv" + }, + "outputs": [], + "source": [ + "\n", + "p_ge_70 = 1 - p_lt_70\n", + "p_ge_70\n", + "#" + ] + }, + { + "cell_type": "markdown", + "source": [ + "## Performing t-test with Python\n", + "Let's draw a sample of 250 participants from the \"population\" who participated the NHANES study\n" + ], + "metadata": { + "id": "Jb361YaoZQG4" + } + }, + { + "cell_type": "code", + "source": [ + "! pip install nhanes\n", + "from nhanes.load import load_NHANES_data\n", + "nhanes_data = load_NHANES_data()\n", + "\n" + ], + "metadata": { + "id": "oYoivOQeZ2nt" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "source": [ + "import numpy as np\n", + "import seaborn as sns\n", + "sample_size = 250\n", + "nhanes_data['PhysActive'] = np.logical_or(nhanes_data['VigorousRecreationalActivities'], nhanes_data['ModerateRecreationalActivities'])\n", + "print('Unique values in PhysActive:',nhanes_data['PhysActive'].unique())\n", + "\n", + "sample = nhanes_data.dropna(subset=['PhysActive', 'BodyMassIndexKgm2']).sample(sample_size)\n", + "sns.boxplot(data=sample, x=\"PhysActive\", y=\"BodyMassIndexKgm2\")" + ], + "metadata": { + "id": "BjiYATqLc_lA" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "source": [ + "We will use `scipy.stats.ttest_ind` to perform t-test between two independently drawn samples." 
+ ], + "metadata": { + "id": "xtZofsU6qGU3" + } + }, + { + "cell_type": "code", + "source": [ + "from scipy.stats import ttest_ind\n", + "# By default, ttest_ind assumes equal variance of the two samples\n", + "print('assuming equal variance of the two population:')\n", + "t, p = ttest_ind(sample.query('PhysActive==1.0')['BodyMassIndexKgm2'], sample.query('PhysActive==0.0')['BodyMassIndexKgm2'])\n", + "print('t-statistic:', t)\n", + "print('p-value:', p)\n", + "\n", + "# If we don't make the assumption, the result may be slightly different:\n", + "print('without assuming equal variance of the two populations:')\n", + "t, p = ttest_ind(sample.query('PhysActive==1.0')['BodyMassIndexKgm2'], sample.query('PhysActive==0.0')['BodyMassIndexKgm2'], equal_var=False)\n", + "print('t-statistic:', t)\n", + "print('p-value:', p)\n" + ], + "metadata": { + "id": "gSmTuPZmlqsn" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "lines_to_next_cell": 0, + "id": "dhBkcLs9W5zw" + }, + "source": [ + "## Simulating p-values\n", + "\n", + "In this exercise we will perform hypothesis testing many times in order to test whether the p-values provided by our statistical test are valid. We will sample data from a normal distribution with a mean of zero, and for each sample perform a t-test to determine whether the mean is different from zero. We will then count how often we reject the null hypothesis; since we know that the true mean is zero, these are by definition Type I errors.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "cUMidxAmW5zw" + }, + "outputs": [], + "source": [ + "\n", + "num_runs = 5000\n", + "\n", + "\n", + "# create a function that will take a sample\n", + "# and perform a one-sample t-test\n", + "def sample_ttest(sampSize=32):\n", + " \"\"\"\n", + " perform a ttest on random data of n=sampSize\n", + " \"\"\"\n", + "\n", + " ttresult = scipy.stats.ttest_1samp(np.random.normal(loc=0.0, scale=1.0, size=sampSize), 0)\n", + " return(ttresult.pvalue)\n", + "\n", + "\n", + "# create input data frame for the function\n", + "sim_results_df = pd.DataFrame({'p_value': np.zeros(num_runs)})\n", + "\n", + "# perform simulations\n", + "for run in range(num_runs):\n", + " sim_results_df.loc[run, 'p_value'] = sample_ttest()\n", + "\n", + "p_error = sim_results_df['p_value'] < 0.05\n", + "p_error = p_error.mean(axis=0)\n", + "p_error" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "V6u-l5WJW5zw" + }, + "source": [ + "We should see that the proportion of samples with p < .05 is about 5%." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "AhY-KrHcW5zw" + }, + "outputs": [], + "source": [] + } + ], + "metadata": { + "jupytext": { + "formats": "ipynb,py:percent" + }, + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "colab": { + "provenance": [] + } }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "lines_to_next_cell": 2 - }, - "outputs": [], - "source": [ - "\n", - "import numpy as np\n", - "import pandas as pd\n", - "\n", - "num_runs = 10000\n", - "\n", - "\n", - "def toss_coins_and_count_heads(num_coins=100, p_heads=0.5):\n", - " \"\"\"\n", - " flip a coin num_coins times and return number of heads\n", - " \"\"\"\n", - "\n", - " flips = np.random.rand(num_coins) > (1 - p_heads)\n", - " return(np.sum(flips))\n", - "\n", - " \n", - "flip_results_df = pd.DataFrame({'n_heads': np.zeros(num_runs)})\n", - "\n", - "for run in range(num_runs):\n", - " flip_results_df.loc[run, 'n_heads'] = toss_coins_and_count_heads()" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Now we can compute the proportion of samples from the distribution observed when the true proportion of heads is 0.5. " - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "import scipy.stats\n", - "\n", - "pvalue = 100 - scipy.stats.percentileofscore(flip_results_df, 70) \n", - "pvalue" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "For comparison, we can also compute the p-value for 70 or more heads based on a null hypothesis of $P_{heads}=0.5$, using the binomial distribution.\n", - "\n", - "\n", - "compute the probability of 69 or fewer heads, when P(heads)=0.5" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "\n", - "p_lt_70 = scipy.stats.binom.cdf(k=69, n=100, p=0.5)\n", - "p_lt_70" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "lines_to_next_cell": 0 - }, - "source": [ - "the probability of 70 or more heads is simply the complement of p_lt_70" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "lines_to_next_cell": 0 - }, - "outputs": [], - "source": [ - "\n", - "p_ge_70 = 1 - p_lt_70\n", - "p_ge_70\n", - "#" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "lines_to_next_cell": 0 - }, - "source": [ - "## Simulating p-values\n", - "\n", - "In this exercise we will perform hypothesis testing many times in order to test whether the p-values provided by our statistical test are valid. We will sample data from a normal distribution with a mean of zero, and for each sample perform a t-test to determine whether the mean is different from zero. 
We will then count how often we reject the null hypothesis; since we know that the true mean is zero, these are by definition Type I errors.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "\n", - "num_runs = 5000\n", - "\n", - "\n", - "# create a function that will take a sample\n", - "# and perform a one-sample t-test\n", - "def sample_ttest(sampSize=32):\n", - " \"\"\"\n", - " perform a ttest on random data of n=sampSize\n", - " \"\"\"\n", - "\n", - " ttresult = scipy.stats.ttest_1samp(np.random.normal(loc=0.0, scale=1.0, size=sampSize), 0)\n", - " return(ttresult.pvalue)\n", - "\n", - "\n", - "# create input data frame for the function\n", - "sim_results_df = pd.DataFrame({'p_value': np.zeros(num_runs)})\n", - "\n", - "# perform simulations\n", - "for run in range(num_runs):\n", - " sim_results_df.loc[run, 'p_value'] = sample_ttest()\n", - "\n", - "p_error = sim_results_df['p_value'] < 0.05\n", - "p_error = p_error.mean(axis=0)\n", - "p_error" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "We should see that the proportion of samples with p < .05 is about 5%." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [] - } - ], - "metadata": { - "jupytext": { - "formats": "ipynb,py:percent" - }, - "kernelspec": { - "display_name": "Python 3", - "language": "python", - "name": "python3" - } - }, - "nbformat": 4, - "nbformat_minor": 4 -} + "nbformat": 4, + "nbformat_minor": 0 +} \ No newline at end of file