Consumer Expenditure Survey (CEX) data #5

rickecon opened this issue Oct 14, 2017 · 1 comment

In calibrating this model, we have to incorporate consumption data. The best source for consumption expenditures in the United States is the Consumer Expenditure Survey (CEX). We need data on household consumer expenditure by age of the primary respondent (head of household). I see two ways that we can do this.

  1. The CEX publishes precomputed summary tables (PDF and Excel) of consumption by broad age category for each year. These summary data look like the light blue bars in the following figure. To get consumption expenditure by age at a finer resolution than the coarse age bins in the summary data, we could fit a curve such that its average across the ages in each bin equals the summary value for that bin.

[Figure: cexbyage]

  2. The more accurate approach is to use the CEX public-use microdata (PUMD) directly to calculate average consumption expenditure for each age. A good reference is Jesus Fernandez-Villaverde and Dirk Krueger, "Consumption over the Life Cycle: Facts from Consumer Expenditure Survey Data" (REStat, 89:3, Aug. 2007), which calculates exactly the lifecycle consumption profiles by age that we are interested in using the CEX microdata. For our calibration, we would probably want to average data from two or three of the most recent surveys to reduce the noise that comes with the fine granularity of one-year age bins.

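As a rough sketch of what Method 2 could look like once the microdata are loaded, the snippet below computes a weighted average expenditure for each one-year age after pooling several survey years. The DataFrame is synthetic, and the column names (`year`, `age`, `expend`, `weight`) are hypothetical stand-ins, not actual PUMD variable names.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for cleaned microdata (NOT real CEX values).
# Column names are hypothetical; the real PUMD files use different ones.
rng = np.random.default_rng(0)
n = 5000
df = pd.DataFrame({
    "year": rng.choice([2014, 2015, 2016], size=n),          # survey year
    "age": rng.integers(20, 81, size=n),                     # age of reference person
    "expend": rng.lognormal(mean=10.5, sigma=0.5, size=n),   # total expenditure
    "weight": rng.uniform(0.5, 2.0, size=n),                 # sampling weight
})

# Pool all survey years, then take the weighted mean of expenditure by age
df["wx"] = df["weight"] * df["expend"]
sums = df.groupby("age")[["wx", "weight"]].sum()
avg_by_age = sums["wx"] / sums["weight"]

print(avg_by_age.head())
```

Pooling the years before averaging is only one possible choice; averaging the per-year profiles would weight the surveys differently.
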
Each method has advantages and disadvantages. Method 1 is less precise but significantly easier than method 2, although it is tricky to estimate a smooth curve whose integral over each bin (or whose average across the discrete one-year ages in the bin) equals the average in the summary data. Method 2 is more accurate but might be significantly harder because of the need to access, manipulate, and clean the source data. Method 2 also produces noisier averages from year to year.
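One possible way around the tricky part of Method 1 is to interpolate the cumulative total across bin edges rather than the bin averages themselves, and then difference the result at one-year steps. This is a swapped-in technique, not something proposed in this thread, and the bin edges and averages below are made-up numbers rather than CEX values. Because the interpolant passes through the cumulative totals at every bin edge, the one-year values average back to the bin value by construction.

```python
import numpy as np
from scipy.interpolate import PchipInterpolator

# Hypothetical age-bin edges and bin-average expenditures (made-up numbers)
bin_edges = np.array([20.0, 25.0, 35.0, 45.0, 55.0, 65.0, 75.0, 85.0])
bin_avgs = np.array([30.0, 48.0, 58.0, 60.0, 55.0, 45.0, 35.0])

# Cumulative expenditure at each bin edge: sum of (bin average * bin width)
widths = np.diff(bin_edges)
cum = np.concatenate(([0.0], np.cumsum(bin_avgs * widths)))

# Monotone interpolant through the cumulative totals
cum_func = PchipInterpolator(bin_edges, cum)

# One-year values are differences of the cumulative curve
ages = np.arange(20, 85)  # ages 20, 21, ..., 84
cons_by_age = cum_func(ages + 1) - cum_func(ages)

# Check: the one-year values in each bin average back to the bin value
implied = np.array([
    cons_by_age[(ages >= lo) & (ages < hi)].mean()
    for lo, hi in zip(bin_edges[:-1], bin_edges[1:])
])
print(np.allclose(implied, bin_avgs))
```

The monotone (PCHIP) interpolant is an assumption chosen so the cumulative curve never decreases, which keeps every one-year value nonnegative.
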

@rickecon rickecon self-assigned this Oct 14, 2017

rickecon commented Oct 23, 2017

Option (1) above requires creating an interpolating function to fit data. The scipy.interpolate library is great for this. In this case, we will be using the interp1d function in this library. Begin by creating some data. As an illustration, the data below are average fertility rates for age bins from the National Vital Statistics Reports, Volume 64, Number 1, January 15, 2015, Table 3, final 2013 data.

# Import libraries
import numpy as np
import scipy.interpolate as si
import matplotlib.pyplot as plt
from matplotlib.ticker import MultipleLocator
...
# Read in (create) the data
fert_data = (np.array([0.0, 0.0, 0.3, 12.3, 47.1, 80.7, 105.5, 98.0,
                       49.3, 10.4, 0.8, 0.0, 0.0]) / 2000)
age_midp = np.array([9, 10, 12, 16, 18.5, 22, 27, 32, 37, 42, 47,
                     55, 56])

The fert_data vector represents the average fertility rate for the age ranges 9, 10, 10-14, 15-17, 18-19, 20-24, 25-29, 30-34, 35-39, 40-44, 45-49, 55, and 56. The midpoints of those age ranges are given in the age_midp vector. The following code plots the fertility rates (y-axis) against the age bin midpoints (x-axis) as a scatter plot.

# Create the scatter plot
fig, ax = plt.subplots()
plt.scatter(age_midp, fert_data, s=70, c='blue', marker='o', label='Data')
minorLocator = MultipleLocator(1)
ax.xaxis.set_minor_locator(minorLocator)
plt.grid(True, which='major', color='0.65', linestyle='-')
plt.title('Average fertility rates by age bin ($f_{s}$)', fontsize=20)
plt.xlabel(r'Age $s$')
plt.ylabel(r'Fertility rate $f_{s}$')
plt.xlim((1, 60))
plt.ylim((-0.01, 1.15*(fert_data.max())))
plt.text(-5, -0.022, "Source: National Vital Statistics Reports, " +
         "Volume 64, Number 1, January 15, 2015.", fontsize=9)
plt.tight_layout(rect=(0, 0.03, 1, 1))

[Figure: scatter plot of average fertility rates by age bin]
These scatterplot points are probably more accurately represented as histogram bars in order to communicate that they are average fertility rates across the entire age bin.
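A minimal sketch of that bar representation follows. It assumes each bar should span the ages in its range (so 10-14 is five ages wide, 15-17 three, and so on) and drops the zero-valued single-age points at 9, 10, 55, and 56. The bin widths are inferred from the age ranges listed above, not taken from the source tables.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend; remove this line for interactive use
import matplotlib.pyplot as plt

# Fertility rates for the multi-year age bins only
fert_bins = np.array([0.3, 12.3, 47.1, 80.7, 105.5, 98.0, 49.3, 10.4,
                      0.8]) / 2000
bin_midp = np.array([12, 16, 18.5, 22, 27, 32, 37, 42, 47])
# Assumed widths: number of ages in each range (10-14, 15-17, 18-19, ...)
bin_width = np.array([5, 3, 2, 5, 5, 5, 5, 5, 5])

# Draw one bar per bin, centered on the bin midpoint
fig, ax = plt.subplots()
ax.bar(bin_midp, fert_bins, width=bin_width, color='lightblue',
       edgecolor='blue', label='Bin average')
ax.set_title('Average fertility rates by age bin ($f_{s}$)')
ax.set_xlabel(r'Age $s$')
ax.set_ylabel(r'Fertility rate $f_{s}$')
ax.set_xlim((1, 60))
ax.legend(loc='upper right')
```
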

Now we can fit a function to these points using the interp1d function. This function will fit a curve to the data that passes through each point. We will use the "cubic spline" option, although other interpolating functional forms are available (e.g., linear).

# Generate interpolation function for fertility rates
fert_func = si.interp1d(age_midp, fert_data, kind='cubic')
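To make the choice of `kind` concrete, here is a small self-contained sketch that builds both a cubic and a linear interpolant from the same data. Both pass through the data points, and they generally differ in between. The names `fert_cubic` and `fert_linear` are introduced here for illustration only.

```python
import numpy as np
import scipy.interpolate as si

fert_data = (np.array([0.0, 0.0, 0.3, 12.3, 47.1, 80.7, 105.5, 98.0,
                       49.3, 10.4, 0.8, 0.0, 0.0]) / 2000)
age_midp = np.array([9, 10, 12, 16, 18.5, 22, 27, 32, 37, 42, 47,
                     55, 56])

# Two interpolants through the same points
fert_cubic = si.interp1d(age_midp, fert_data, kind='cubic')
fert_linear = si.interp1d(age_midp, fert_data, kind='linear')

# Both reproduce the data at the bin midpoints
print(np.allclose(fert_cubic(age_midp), fert_data))
print(np.allclose(fert_linear(age_midp), fert_data))

# Between midpoints the two curves can give different values
ages = np.array([20.0, 25.0, 30.0])
print(fert_cubic(ages))
print(fert_linear(ages))
```
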

The fert_func output is a function object that can take a vector of ages and compute the points that lie on the cubic spline curve that goes through the data. For example, if I choose 10,000 equally spaced ages between 1 and 100, the plot of those points would trace out the fitted interpolating cubic spline function that goes through the data. Notice that I can only interpolate values between the minimum and maximum ages in the source data. By default, interp1d raises a ValueError for inputs outside that range, so the code below evaluates the spline only at in-range ages and leaves the rest at zero.

# Use interpolation function to get interpolated values
age_fine = np.linspace(1, 100, 10000)
age_fine_sub = (age_fine >= age_midp.min()) & (age_fine <= age_midp.max())
fert_rates_fine = np.zeros_like(age_fine)
fert_rates_fine[age_fine_sub] = fert_func(age_fine[age_fine_sub])

# Plot interpolated values and original data
fig, ax = plt.subplots()
plt.scatter(age_midp, fert_data, s=70, c='blue', marker='o', label='Data')
plt.plot(age_fine, fert_rates_fine, label='Cubic spline')
minorLocator = MultipleLocator(1)
ax.xaxis.set_minor_locator(minorLocator)
plt.grid(True, which='major', color='0.65', linestyle='-')
plt.title('Fitted cubic spline fertility rates by age ($f_{s}$)', fontsize=20)
plt.xlabel(r'Age $s$')
plt.ylabel(r'Fertility rate $f_{s}$')
plt.xlim((1, 100))
plt.ylim((-0.01, 1.15*(fert_data.max())))
plt.legend(loc='upper right')
plt.text(-5, -0.022, "Source: National Vital Statistics Reports, " +
         "Volume 64, Number 1, January 15, 2015.", fontsize=9)
plt.tight_layout(rect=(0, 0.03, 1, 1))

[Figure: fitted cubic spline fertility rates by age]
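As an alternative to masking the in-range ages by hand, interp1d can be told to return a fill value outside the data range. The sketch below shows the default out-of-range behavior and then the `bounds_error=False, fill_value=0.0` variant; the name `fert_func_zero` is introduced here for illustration.

```python
import numpy as np
import scipy.interpolate as si

fert_data = (np.array([0.0, 0.0, 0.3, 12.3, 47.1, 80.7, 105.5, 98.0,
                       49.3, 10.4, 0.8, 0.0, 0.0]) / 2000)
age_midp = np.array([9, 10, 12, 16, 18.5, 22, 27, 32, 37, 42, 47,
                     55, 56])

# Default behavior: evaluating outside [9, 56] raises a ValueError
fert_func = si.interp1d(age_midp, fert_data, kind='cubic')
try:
    fert_func(5.0)
    raised = False
except ValueError:
    raised = True
print(raised)

# Variant: return 0.0 outside the data range instead of raising
fert_func_zero = si.interp1d(age_midp, fert_data, kind='cubic',
                             bounds_error=False, fill_value=0.0)
age_fine = np.linspace(1, 100, 10000)
fert_rates_fine = fert_func_zero(age_fine)
```

Filling with zero matches the masked version above only because fertility outside the observed age range is taken to be zero; a different fill value may suit other data.
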