Statistical Models
Since our exploratory and clustering analysis highlighted evident patterns in how different neighborhoods request 311 services, we decided to quantify the relationship between service requests and neighborhood characteristics while controlling for space, time, and other confounding factors.
To capture these relationships, we used Poisson Generalized Linear Models (GLMs), a suitable formalism for modeling rates, since the model accounts for the fact that service requests must be integer-valued and time-independent.
A Poisson GLM relates a dependent variable, in our case the number of service requests per request type, to a set of independent variables (time, neighborhood characteristics) through an exponential relation.
To put this in formal terms, we denote by $y_t$ the volume of requests for a particular service in month $t$ in a given census tract, and consider the following model:

$$
y_t \sim \mathrm{Poisson}\left(e^{\theta_t}\right), \qquad
\theta_t = \mu + \alpha_1 y_{t-1} + \alpha_2 y_{t-2} + \sum_{i=1}^{7} w_i x_i + \sum_{m=2}^{12} \gamma_m \, \mathbf{1}[s_t = m]
$$

In this model:
- $\mu$ is the intercept, or the mean value of $\theta$.
- $\alpha_1, \alpha_2$ are the two auto-regressive coefficients, which express the relation of the current value of $y_t$ to its value one and two months before, respectively.
- $x_1, \dots, x_7$ is a sequence of numerical neighborhood features, namely: population, proportion of population above age 65, proportion of Black population, proportion of Hispanic population, proportion of population below the poverty line, unemployment rate, and median household income.
- $w_1, \dots, w_7$ is a sequence of weights, one for each of the above neighborhood features.
- $s_t$ is the month value (2 through 12) at time $t$.
- $\gamma_m$ measures the impact of month $m$ with respect to January, which is left out because it is taken as the baseline.
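To make the exponential relation concrete, here is a minimal Python sketch of how such a model turns its terms into an expected monthly count. The function name `expected_requests`, its argument names, and all numeric values are hypothetical and chosen only for illustration; they are not the fitted values from this analysis.

```python
import math

def expected_requests(intercept, ar_coefs, lagged_counts,
                      weights, features, month_effects, month):
    """Expected monthly count under a log-link Poisson model.

    The log of the expected count is a sum of: the intercept, the
    auto-regressive terms, the weighted neighborhood features, and the
    effect of the current month. January (month 1) has no entry in
    `month_effects`, so it acts as the baseline.
    """
    log_rate = intercept
    log_rate += sum(a * c for a, c in zip(ar_coefs, lagged_counts))
    log_rate += sum(w * x for w, x in zip(weights, features))
    log_rate += month_effects.get(month, 0.0)
    return math.exp(log_rate)

# Hypothetical toy values, for illustration only.
toy = dict(
    intercept=math.log(10.0),
    ar_coefs=(0.01, 0.005),      # one and two months back
    lagged_counts=(12, 8),       # requests in the two previous months
    weights=(0.5,),              # a single made-up feature weight
    features=(0.2,),             # a single made-up feature value
    month_effects={3: 0.17},     # made-up March effect
)

january = expected_requests(month=1, **toy)
march = expected_requests(month=3, **toy)
print(f"January: {january:.2f}, March: {march:.2f}")
```

Because every term enters through the exponent, changing one term multiplies the expected count by a fixed factor rather than adding a fixed amount to it.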
Fitting this model to our data means finding the values of the coefficients so that the relation expressed by the model is as close as possible to the observed data, using observations from all census tracts over a 44-month time span. In other words, we try to find the set of coefficients that "best explains" the relation between the selected demographic indicators and the monthly volume of 311 requests for a specific type of service. We used the common maximum likelihood technique to fit our models.
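As an illustration of what maximum likelihood fitting involves, the sketch below fits a log-link Poisson regression with Newton-Raphson steps on synthetic data. It is not the code used for this analysis: the helper names (`solve`, `fit_poisson`, `sample_poisson`), the single made-up predictor, and the "true" coefficients are all assumptions made for the example, and it uses only the Python standard library.

```python
import math
import random

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x

def fit_poisson(X, y, n_iter=50, tol=1e-10):
    """Maximize the Poisson log-likelihood; column 0 of X must be all 1s."""
    p = len(X[0])
    beta = [0.0] * p
    beta[0] = math.log(sum(y) / len(y))  # start the intercept at the log mean
    for _ in range(n_iter):
        mu = [math.exp(sum(b * v for b, v in zip(beta, row))) for row in X]
        # Gradient of the log-likelihood: X^T (y - mu)
        grad = [sum(row[j] * (yi - mi) for row, yi, mi in zip(X, y, mu))
                for j in range(p)]
        # Negative Hessian: X^T diag(mu) X
        hess = [[sum(row[j] * row[k] * mi for row, mi in zip(X, mu))
                 for k in range(p)] for j in range(p)]
        step = solve(hess, grad)
        beta = [b + s for b, s in zip(beta, step)]
        if max(abs(s) for s in step) < tol:
            break
    return beta

def sample_poisson(lam, rng):
    """Draw one Poisson(lam) value with Knuth's multiplication method."""
    limit = math.exp(-lam)
    k, prod = 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k

# Synthetic data: an intercept plus one made-up predictor.
rng = random.Random(0)
true_beta = [1.0, 0.5]
X = [[1.0, rng.uniform(-1.0, 1.0)] for _ in range(2000)]
y = [sample_poisson(math.exp(true_beta[0] + true_beta[1] * row[1]), rng)
     for row in X]

beta_hat = fit_poisson(X, y)
print("estimated coefficients:", [round(b, 3) for b in beta_hat])
```

With enough observations the estimates land close to the coefficients used to generate the data. In practice a statistics package would also report standard errors, which this sketch omits.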
Below, we provide the results of fitting our model to graffiti removal requests across census tracts, using data from 2009 to 2012. Graffiti removal is a type of service for which we expect a strong correlation with neighborhood characteristics.
The table reports the value of the coefficient associated with each predictor (the month of March, the number of reports in the previous month, the unemployment rate of a census tract, and so on). The higher the absolute value of the coefficient, the higher the impact of a one-unit change in that predictor on the number of reports.
Predictor | Coefficient | Std. Error | z value | p value | Significance | Description |
---|---|---|---|---|---|---|
(Intercept) | 2.029e+00 | 5.391e-03 | 376.409 | < 2e-16 | *** | Global mean |
counts_lag1 | 8.037e-03 | 4.747e-05 | 169.308 | < 2e-16 | *** | # Reports previous month |
counts_lag2 | 5.662e-03 | 4.838e-05 | 117.044 | < 2e-16 | *** | # Reports two months before |
s_2 | -5.126e-02 | 7.294e-03 | -7.027 | 2.11e-12 | *** | February |
s_3 | 1.725e-01 | 6.466e-03 | 26.671 | < 2e-16 | *** | March |
s_4 | 2.091e-02 | 6.620e-03 | 3.158 | 0.00159 | ** | April |
s_5 | -4.597e-02 | 6.674e-03 | -6.888 | 5.66e-12 | *** | May |
s_6 | 1.818e-03 | 6.679e-03 | 0.272 | 0.78543 | | June |
s_7 | 1.523e-02 | 6.721e-03 | 2.266 | 0.02345 | * | July |
s_8 | 8.373e-02 | 6.655e-03 | 12.581 | < 2e-16 | *** | August |
s_9 | 1.304e-02 | 6.742e-03 | 1.935 | 0.05300 | . | September |
s_10 | -6.528e-04 | 6.738e-03 | -0.097 | 0.92281 | | October |
s_11 | 3.457e-02 | 7.111e-03 | 4.862 | 1.16e-06 | *** | November |
s_12 | -8.108e-02 | 7.333e-03 | -11.057 | < 2e-16 | *** | December |
x_1 | 7.229e-02 | 8.041e-04 | 89.895 | < 2e-16 | *** | Tract's total population |
x_2 | -3.361e+00 | 3.531e-02 | -95.185 | < 2e-16 | *** | Prop. pop. over 65 |
x_3 | -1.224e+00 | 9.220e-03 | -132.727 | < 2e-16 | *** | Prop. Black |
x_4 | 7.339e-01 | 7.837e-03 | 93.642 | < 2e-16 | *** | Prop. Hispanic |
x_5 | -1.598e-01 | 1.801e-02 | -8.874 | < 2e-16 | *** | Prop. pop. below poverty line |
x_6 | -2.233e-03 | 4.513e-04 | -4.947 | 7.55e-07 | *** | Unemployment rate |
x_7 | -1.088e-03 | 1.190e-04 | -9.142 | < 2e-16 | *** | Median household income |
The columns labeled "z value", "p value", and "Significance" describe the statistical significance of the obtained coefficients. To avoid over-complicating this explanation, it suffices to say that the p value represents the probability of obtaining a coefficient at least as far from zero as the one observed, under the hypothesis that the number of requests truly did not depend on that particular variable, everything else left unchanged. That is to say, a small p value means that the estimated effect is unlikely to be the result of random fluctuations alone. The number of stars is a standard way to graphically represent such significance: the more stars, the stronger the evidence (`***` for p < 0.001, `**` for p < 0.01, `*` for p < 0.05, and `.` for p < 0.1).
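For readers who want to check how these columns relate to each other, the following Python sketch recomputes the z value and a two-sided p value from a coefficient and its standard error, using the April row of the table as input. It assumes the usual Wald test with a normal approximation, which is what the reported numbers are consistent with; the function names `wald_test` and `stars` are ours.

```python
import math

def wald_test(coef, std_err):
    """Return (z, two-sided p value) under a standard normal approximation."""
    z = coef / std_err
    # For a standard normal Z, P(|Z| > |z|) = erfc(|z| / sqrt(2)).
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p

def stars(p):
    """Map a p value to the conventional significance markers."""
    if p < 0.001:
        return "***"
    if p < 0.01:
        return "**"
    if p < 0.05:
        return "*"
    if p < 0.1:
        return "."
    return ""

# April row (s_4): coefficient 2.091e-02, standard error 6.620e-03.
z_april, p_april = wald_test(2.091e-02, 6.620e-03)
print(f"z = {z_april:.3f}, p = {p_april:.5f}, significance = {stars(p_april)!r}")
```

Small differences from the table in the last digit can arise because the coefficient and standard error shown there are rounded.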
Let's analyze the dependence of the number of requests on the month of the year. The values of these coefficients represent the contribution of being in a particular month compared to January, which is taken as the baseline. We can see that the number of reported graffiti seems to have a strong positive association with the months of March and August, but in general it doesn't seem to present a reliable, clear dependence on seasonality.
This is in line with what we can see in the monthly plot of city-wide requests for graffiti removal from January 2011 to May 2013, depicted below: there is no clear seasonal fluctuation.
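Because the model relates predictors to counts through an exponential, a month coefficient can be read as a multiplicative factor relative to January, holding the other predictors fixed. The short Python sketch below converts three of the month coefficients from the table into such factors; the helper name `rate_ratio` is ours.

```python
import math

# Month coefficients copied from the table (log scale, January = baseline).
month_coefs = {
    "March": 1.725e-01,
    "August": 8.373e-02,
    "December": -8.108e-02,
}

def rate_ratio(coef):
    """Multiplicative change in the expected count implied by a coefficient."""
    return math.exp(coef)

for month, coef in month_coefs.items():
    ratio = rate_ratio(coef)
    change = (ratio - 1.0) * 100.0
    print(f"{month}: x{ratio:.3f} ({change:+.1f}% vs. January)")
```

Read this way, the March coefficient corresponds to roughly 19% more expected requests than in January, all else equal, which helps put the size of the monthly effects in perspective.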
We now take a look at the coefficients for the demographic indicators. It comes as no surprise that the proportion of Hispanic population is one of the principal drivers of this type of request, since we previously noticed this effect in the exploratory visual analysis and in the results of the K-Means clustering. What is important is that we are now able to quantify this effect.
At the same time, it seems that areas populated in large part by African-Americans tend to report less graffiti.
The percentage of elderly population is also a major factor, and contributes negatively to the number of requests. This result is also in line with our common sense: we wouldn't expect neighborhoods populated predominantly by older people to be covered in graffiti.
In this analysis, we need to remember that some of these demographic and socioeconomic indicators are not independent, and the explanatory power for some neighborhoods could be "shared" among correlated predictors.
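One simple way to check for this kind of overlap is to compute the correlation between pairs of indicators before interpreting their coefficients separately. The Python sketch below does this for two indicators; the tract values are hypothetical and are not taken from the data used in this analysis.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical values for five census tracts (illustration only).
poverty_rate = [0.10, 0.25, 0.40, 0.15, 0.30]
unemployment_rate = [5.0, 11.0, 18.0, 8.0, 13.0]

r = pearson(poverty_rate, unemployment_rate)
print(f"correlation between poverty and unemployment: {r:.2f}")
```

When two indicators are this strongly correlated, their individual coefficients should be read with caution, since the model cannot cleanly separate their contributions.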