Statistical Models

Since our exploratory and clustering analysis highlighted evident patterns in how different neighborhoods request 311 services, we decided to quantify the relationship between service requests and neighborhood characteristics while controlling for space, time, and other confounding factors.

To capture these relationships, we used Poisson Generalized Linear Models (GLMs), a suitable formalism for modeling rates, since the model accounts for the fact that service request counts must be non-negative integers and treats requests as occurring independently over time.

A Poisson GLM relates a dependent variable, in our case the number of service requests of a given type, to a set of independent variables (time, neighborhood characteristics) through an exponential relation.
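As a minimal sketch of this exponential relation, the snippet below computes the expected count from a linear combination of predictors and evaluates the Poisson probability of a given count. The function names and the numeric values are hypothetical and chosen only for illustration; they are not taken from the fitted model.

```python
import math

def poisson_rate(intercept, coefficients, features):
    """Expected count under a log-link Poisson GLM: exp(intercept + sum_k coef_k * x_k)."""
    linear_predictor = intercept + sum(c * x for c, x in zip(coefficients, features))
    return math.exp(linear_predictor)

def poisson_pmf(k, rate):
    """Probability of observing exactly k requests when the expected count is `rate`."""
    return math.exp(-rate) * rate ** k / math.factorial(k)

# Hypothetical intercept, coefficients, and feature values (illustration only).
rate = poisson_rate(1.0, [0.5, -0.2], [2.0, 1.0])  # exp(1.0 + 1.0 - 0.2)
print(f"expected requests: {rate:.2f}")
print(f"P(exactly 5 requests): {poisson_pmf(5, rate):.3f}")
```

Because the predictors enter through an exponential, increasing a predictor by one unit multiplies the expected count by the exponential of its coefficient rather than adding a fixed amount.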

To put this in formal terms, we denote as $y_t$ the volume of requests for a particular service during month $t$ in a given census tract, and consider the following model:

$$y_t \sim \mathrm{Poisson}(\theta_t), \qquad \log \theta_t = \alpha + \beta_1 y_{t-1} + \beta_2 y_{t-2} + \sum_{k=1}^{7} w_k x_k + \sum_{m=2}^{12} \gamma_m \, \mathbb{1}[s_t = m]$$

In this model:

$\alpha$ is the intercept, which sets the baseline level of $\theta_t$ on the log scale.
$\beta_1, \beta_2$ are the two auto-regressive coefficients, which express the relation of the current value of $y_t$ to the values one and two months before, respectively.
$x = (x_1, \dots, x_7)$ is a sequence of numerical neighborhood features, namely: population, proportion of population above age 65, proportion of Black population, proportion of Hispanic population, proportion of population below the poverty line, unemployment rate, and median household income.
$w = (w_1, \dots, w_7)$ is a sequence of weights, one for each of the above neighborhood features.
$s_t$ is the month value (2 through 12) at time $t$.
$\gamma_m$ measures the impact of month $m$ with respect to January, which is left out because it is taken to represent the baseline.

Fitting this model to our data means finding the values of $\alpha$, $\beta_1$, $\beta_2$, $w$, and $\gamma$ such that the relation expressed by the model is as close as possible to the observed data, using observations from all census tracts over a 44-month time span. In other words, we try to find the set of coefficients that "best explains" the relation between the selected demographic indicators and the monthly volume of 311 requests for a specific type of service. We used the common maximum likelihood technique to fit our models.
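To illustrate what a maximum likelihood fit involves, here is a minimal sketch for a reduced Poisson model with only an intercept and a single predictor, solved with Newton-Raphson steps. The function `fit_poisson_glm` and the toy data are hypothetical; this is not the code used for the analysis, and in practice a statistics package would handle the full model with all predictors.

```python
import math

def fit_poisson_glm(xs, ys, iterations=25):
    """Maximum likelihood fit of log(rate) = a + b * x by Newton-Raphson."""
    a = math.log(sum(ys) / len(ys))  # start from the log of the mean count
    b = 0.0
    for _ in range(iterations):
        rates = [math.exp(a + b * x) for x in xs]
        # Gradient of the log-likelihood: sum of (y - rate) * [1, x]
        g_a = sum(y - r for y, r in zip(ys, rates))
        g_b = sum((y - r) * x for x, y, r in zip(xs, ys, rates))
        # Negative Hessian (Fisher information): sum of rate * [1, x][1, x]^T
        h_aa = sum(rates)
        h_ab = sum(r * x for x, r in zip(xs, rates))
        h_bb = sum(r * x * x for x, r in zip(xs, rates))
        det = h_aa * h_bb - h_ab * h_ab
        # Newton step: solve the 2x2 linear system H * step = gradient
        a += (h_bb * g_a - h_ab * g_b) / det
        b += (h_aa * g_b - h_ab * g_a) / det
    return a, b

# Toy data (illustration only): one predictor value and one count per observation.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1, 3, 4, 9, 15]
a_hat, b_hat = fit_poisson_glm(xs, ys)
print(f"intercept = {a_hat:.3f}, slope = {b_hat:.3f}")
```

At the maximum, the fitted rates reproduce the observed total count and the observed predictor-weighted total, which is the sense in which the coefficients "best explain" the data.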

Below, we provide the results obtained by fitting the model to data from all census tracts, considering graffiti removal as the request type under study, using data from 2009 to 2012.

Case study: graffiti removal requests

We provide the results of fitting our model for graffiti removal requests, a type of service for which we expect a strong correlation with neighborhood characteristics.

Below, we report the value of the coefficient associated with each predictor: the month of March, the number of reports in the previous month, the unemployment rate of a census tract, and so on. The higher the absolute value of a coefficient, the larger the impact of a one-unit change in that predictor on the number of reports.

Predictor	Coefficient	Std. Error	z value	p value	Significance	Description
(Intercept)	2.029e+00	5.391e-03	376.409	< 2e-16	***	Global mean
counts_lag1	8.037e-03	4.747e-05	169.308	< 2e-16	***	# Reports previous month
counts_lag2	5.662e-03	4.838e-05	117.044	< 2e-16	***	# Reports two months before
s_2	-5.126e-02	7.294e-03	-7.027	2.11e-12	***	February
s_3	1.725e-01	6.466e-03	26.671	< 2e-16	***	March
s_4	2.091e-02	6.620e-03	3.158	0.00159	**	April
s_5	-4.597e-02	6.674e-03	-6.888	5.66e-12	***	May
s_6	1.818e-03	6.679e-03	0.272	0.78543		June
s_7	1.523e-02	6.721e-03	2.266	0.02345	*	July
s_8	8.373e-02	6.655e-03	12.581	< 2e-16	***	August
s_9	1.304e-02	6.742e-03	1.935	0.05300	.	September
s_10	-6.528e-04	6.738e-03	-0.097	0.92281		October
s_11	3.457e-02	7.111e-03	4.862	1.16e-06	***	November
s_12	-8.108e-02	7.333e-03	-11.057	< 2e-16	***	December
x_1	7.229e-02	8.041e-04	89.895	< 2e-16	***	Tract's total population
x_2	-3.361e+00	3.531e-02	-95.185	< 2e-16	***	Prop. pop. over 65
x_3	-1.224e+00	9.220e-03	-132.727	< 2e-16	***	Prop. Black
x_4	7.339e-01	7.837e-03	93.642	< 2e-16	***	Prop. Hispanic
x_5	-1.598e-01	1.801e-02	-8.874	< 2e-16	***	Prop. pop. below poverty line
x_6	-2.233e-03	4.513e-04	-4.947	7.55e-07	***	Unemployment rate
x_7	-1.088e-03	1.190e-04	-9.142	< 2e-16	***	Median household income
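Because the model is exponential, each coefficient in the table above can be read as a multiplicative effect: a one-unit increase in a predictor multiplies the expected number of requests by the exponential of its coefficient. The sketch below converts a few coefficients copied from the table into percent changes. The helper `percent_change` is a hypothetical name, and the example is restricted to the lag and month predictors, whose units are clear from the text.

```python
import math

# Coefficients copied from the table above.
coefficients = {
    "counts_lag1": 8.037e-03,
    "s_3 (March)": 1.725e-01,
    "s_8 (August)": 8.373e-02,
    "s_12 (December)": -8.108e-02,
}

def percent_change(coefficient):
    """Percent change in the expected count for a one-unit increase in the predictor."""
    return (math.exp(coefficient) - 1.0) * 100.0

for name, value in coefficients.items():
    print(f"{name}: {percent_change(value):+.1f}%")
```

Read this way, March is associated with roughly 19% more expected requests than January, and December with roughly 8% fewer, all other predictors held fixed.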

The columns labeled "z value", "p value", and "Significance" describe the statistical significance of the obtained coefficients. To avoid over-complicating this explanation, it suffices to say that the p value is the probability of obtaining a coefficient at least as far from zero as the one observed if the number of requests truly did not depend on that particular variable, everything else left unchanged. A small p value therefore indicates that the estimated effect is unlikely to be the result of random fluctuations alone. The number of stars is a standard way to represent this significance graphically: the more stars, the stronger the evidence that the coefficient differs from zero.
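The relation between these columns can be sketched as follows: the z value is the coefficient divided by its standard error, and the p value is the two-sided tail probability of that z value under a standard normal distribution. The star thresholds below follow the convention of R's model summaries (0.001, 0.01, 0.05, 0.1); that convention is an assumption here, although it is consistent with the table. The function names are hypothetical.

```python
import math

def wald_z(coefficient, std_error):
    """z value: the coefficient measured in units of its standard error."""
    return coefficient / std_error

def two_sided_p(z):
    """Two-sided p value under a standard normal reference distribution."""
    return math.erfc(abs(z) / math.sqrt(2.0))

def stars(p):
    """Significance codes, assuming the thresholds used by R's model summaries."""
    if p < 0.001:
        return "***"
    if p < 0.01:
        return "**"
    if p < 0.05:
        return "*"
    if p < 0.1:
        return "."
    return ""

# Coefficient and standard error for April (s_4), copied from the table.
z = wald_z(2.091e-02, 6.620e-03)
p = two_sided_p(z)
print(f"z = {z:.3f}, p = {p:.5f}, significance = '{stars(p)}'")
```

Applied to the April row, this reproduces the reported z value of about 3.158 and a p value of about 0.0016, which corresponds to two stars.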

Let's analyze the dependency of the number of requests on the month of the year. The values of these coefficients represent the contribution of being in a particular month compared to January, which is taken as the baseline. We can see that the number of reported graffiti seems to have a strong positive association with the months of March and August and a negative one with December, but in general it doesn't seem to present a reliable, clear dependence on seasonality.

This is in line with what we can see in the monthly plot of city-wide requests for graffiti removal from January 2011 to May 2013, depicted below: there is no clear seasonal fluctuation.

Graffiti plot

We now take a look at the coefficients for the demographic indicators. It comes as no surprise that the proportion of Hispanic population is one of the principal drivers for this type of request, since we previously noticed this effect in the exploratory visual analysis and in the results of the K-Means clustering. What is important is that we are now able to quantify this effect.

At the same time, it seems that areas populated in large part by African-Americans tend to report less graffiti.

The percentage of elderly population is also a major factor, and contributes negatively to the number of requests. This result is also in line with our common sense: we wouldn't expect neighborhoods populated predominantly by older people to be covered in graffiti.

In this analysis, we need to remember that some of these demographic and socio-economic indicators are not independent, and the explanatory power for some neighborhoods could be "shared" among correlated predictors.
