Statistical Models
Since our exploratory and clustering analyses highlighted evident correlations, we decided to quantify the relationship between service requests and neighborhood characteristics while accounting for space, time, and other confounding factors. To do this, we used Poisson Generalized Linear Models (GLMs), a formalism well suited to modeling rates: among its other useful properties, it accounts for the fact that service request counts must be integer-valued, and it treats requests as occurring independently over time.
A Poisson GLM relates a dependent variable, in our case the number of service requests of each type, to a set of independent factors (time, neighborhood characteristics) through an exponential relation. To put this in formal, mathematical terms, we denote by $y_t$ the volume of requests for a particular service in month $t$ in a given census tract, and consider the following model:

$$y_t \sim \mathrm{Poisson}(\theta_t), \qquad \log \theta_t = \alpha + \beta_1 y_{t-1} + \beta_2 y_{t-2} + \sum_{k=1}^{7} w_k x_k + \gamma_{s_t}$$

In this model:
- $\alpha$ is the intercept, which sets the baseline level of $\theta_t$ on the log scale.
- $\beta_1, \beta_2$ are the two auto-regressive coefficients, which express how the current value of $y_t$ depends on the values one and two months earlier, respectively.
- $x = (x_1, \dots, x_7)$ is a sequence of numerical neighborhood features, namely: population, proportion of population above age 65, proportion of Black population, proportion of Hispanic population, proportion of population below the poverty line, unemployment rate, and median household income.
- $w = (w_1, \dots, w_7)$ is a sequence of weights, one for each of the above neighborhood features.
- $s_t$ is the month of the year at time $t$ (coefficients are estimated for months 2 through 12).
- $\gamma_{s_t}$ measures the impact of each month with respect to January, which is omitted because it serves as the baseline.

Fitting this model to our data means finding the values of $\alpha$, $\beta_1$, $\beta_2$, $w$, and $\gamma$ for which the relation expressed by the model is as close as possible to the observed data. We used the common Maximum Likelihood technique to fit our models. Below, we provide the results obtained by fitting the model for each census tract, considering graffiti removal as the request type under study, using data from 2009 to 2012.
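To make the exponential relation and the maximum likelihood fit concrete, here is a minimal sketch in plain Python. It is not the code used for this analysis: the function names, the toy data, and the choice of Newton-Raphson as the optimizer are all assumptions made for illustration, and the fit is reduced to an intercept plus a single predictor to keep it short.

```python
import math

def expected_count(intercept, lag_coefs, weights, month_effects,
                   prev_counts, features, month):
    """Poisson mean for one tract and month: exp of the linear predictor."""
    eta = intercept
    eta += lag_coefs[0] * prev_counts[0]   # requests one month earlier
    eta += lag_coefs[1] * prev_counts[1]   # requests two months earlier
    eta += sum(w * x for w, x in zip(weights, features))
    eta += month_effects.get(month, 0.0)   # January (month 1) is the baseline
    return math.exp(eta)

def fit_poisson_1d(xs, ys, iters=25):
    """Maximum likelihood for log(mean) = a + b * x (hypothetical helper).

    Uses Newton-Raphson on the Poisson log-likelihood, which is concave,
    so the iteration settles on the maximum for well-behaved data.
    """
    a, b = math.log(sum(ys) / len(ys)), 0.0
    for _ in range(iters):
        mus = [math.exp(a + b * x) for x in xs]
        # Gradient of the log-likelihood with respect to (a, b).
        g0 = sum(y - m for y, m in zip(ys, mus))
        g1 = sum((y - m) * x for x, y, m in zip(xs, ys, mus))
        # Fisher information (negative Hessian), a 2x2 symmetric matrix.
        h00 = sum(mus)
        h01 = sum(m * x for x, m in zip(xs, mus))
        h11 = sum(m * x * x for x, m in zip(xs, mus))
        det = h00 * h11 - h01 * h01
        # Newton step: (a, b) += information^{-1} @ gradient.
        a += (h11 * g0 - h01 * g1) / det
        b += (h00 * g1 - h01 * g0) / det
    return a, b

# Toy data, invented for illustration only.
toy_x = [0, 1, 2, 3, 4]
toy_y = [2, 3, 6, 7, 12]
a_hat, b_hat = fit_poisson_1d(toy_x, toy_y)
```

At the maximum likelihood estimate the fitted means sum to the observed total, which is a quick sanity check on any Poisson GLM fit that includes an intercept.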
We chose graffiti removal because it is a type of service for which we expect a strong correlation with neighborhood characteristics.
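Below, we report the values of the coefficients associated with each predictor (e.g. the number of reports in the previous month, being in March, the unemployment rate of a census tract, ...). The larger a coefficient in absolute value, the larger the effect of a one-unit change in that predictor on the expected number of reports. Since the predictors are measured in different units, magnitudes are not directly comparable across predictors.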
Predictor | Coefficient | Std. Error | z value | p value | Significance | Description |
---|---|---|---|---|---|---|
(Intercept) | 2.029e+00 | 5.391e-03 | 376.409 | < 2e-16 | *** | Global mean |
counts_lag1 | 8.037e-03 | 4.747e-05 | 169.308 | < 2e-16 | *** | # Reports previous month |
counts_lag2 | 5.662e-03 | 4.838e-05 | 117.044 | < 2e-16 | *** | # Reports two months before |
s_2 | -5.126e-02 | 7.294e-03 | -7.027 | 2.11e-12 | *** | February |
s_3 | 1.725e-01 | 6.466e-03 | 26.671 | < 2e-16 | *** | March |
s_4 | 2.091e-02 | 6.620e-03 | 3.158 | 0.00159 | ** | April |
s_5 | -4.597e-02 | 6.674e-03 | -6.888 | 5.66e-12 | *** | May |
s_6 | 1.818e-03 | 6.679e-03 | 0.272 | 0.78543 | | June |
s_7 | 1.523e-02 | 6.721e-03 | 2.266 | 0.02345 | * | July |
s_8 | 8.373e-02 | 6.655e-03 | 12.581 | < 2e-16 | *** | August |
s_9 | 1.304e-02 | 6.742e-03 | 1.935 | 0.05300 | . | September |
s_10 | -6.528e-04 | 6.738e-03 | -0.097 | 0.92281 | | October |
s_11 | 3.457e-02 | 7.111e-03 | 4.862 | 1.16e-06 | *** | November |
s_12 | -8.108e-02 | 7.333e-03 | -11.057 | < 2e-16 | *** | December |
x_1 | 7.229e-02 | 8.041e-04 | 89.895 | < 2e-16 | *** | Tract's total population |
x_2 | -3.361e+00 | 3.531e-02 | -95.185 | < 2e-16 | *** | Prop. pop. over 65 |
x_3 | -1.224e+00 | 9.220e-03 | -132.727 | < 2e-16 | *** | Prop. Black |
x_4 | 7.339e-01 | 7.837e-03 | 93.642 | < 2e-16 | *** | Prop. Hispanic |
x_5 | -1.598e-01 | 1.801e-02 | -8.874 | < 2e-16 | *** | Prop. pop. below poverty line |
x_6 | -2.233e-03 | 4.513e-04 | -4.947 | 7.55e-07 | *** | Unemployment rate |
x_7 | -1.088e-03 | 1.190e-04 | -9.142 | < 2e-16 | *** | Median household income |
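Because the model uses an exponential relation, each coefficient acts on the logarithm of the expected count, so exponentiating it gives a multiplicative factor. The short sketch below applies this to a few coefficients copied from the table. The helper name `rate_ratio` is ours and not part of the original analysis.

```python
import math

# Selected coefficients copied from the table (log scale).
coefficients = {
    "s_3 (March)": 1.725e-01,
    "s_8 (August)": 8.373e-02,
    "s_12 (December)": -8.108e-02,
    "counts_lag1": 8.037e-03,
}

def rate_ratio(coef, delta=1.0):
    """Multiplicative change in the expected count when the predictor
    increases by `delta`, all other predictors held fixed."""
    return math.exp(coef * delta)

for name, coef in coefficients.items():
    print(f"{name}: expected count multiplied by {rate_ratio(coef):.3f}")
```

Read this way, the March coefficient corresponds to roughly 19% more expected requests than in January, other things being equal, while the December coefficient corresponds to fewer requests than in January.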
The columns labeled "z value", "p value", and "Significance" describe the statistical significance of the obtained coefficients. To avoid over-complicating this explanation, it suffices to say that the p value is the probability of obtaining a coefficient at least as far from zero as the one we estimated if the number of requests truly did not depend on that particular variable, everything else left unchanged. A small p value therefore means that the estimated effect would be unlikely to arise from random fluctuations alone. The number of stars is a standard way to represent this significance graphically: the more stars, the smaller the p value and the stronger the evidence of a dependence. A coefficient with no symbol is not statistically distinguishable from zero.
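The significance columns can be recomputed from the coefficient and its standard error. The sketch below assumes a two-sided Wald test against a standard normal distribution and the star cutoffs conventionally printed by R's model summaries (0.001, 0.01, 0.05, 0.1). The source does not state either explicitly, so treat both as assumptions, and the function names as hypothetical.

```python
import math

def wald_z(coef, std_err):
    """z value: the estimate measured in units of its standard error."""
    return coef / std_err

def two_sided_p(z):
    """P(|Z| >= |z|) for a standard normal Z, via the complementary error function."""
    return math.erfc(abs(z) / math.sqrt(2.0))

def stars(p):
    """Significance symbol under the assumed conventional cutoffs."""
    if p < 0.001:
        return "***"
    if p < 0.01:
        return "**"
    if p < 0.05:
        return "*"
    if p < 0.1:
        return "."
    return ""

# April row of the table: coefficient 2.091e-02, standard error 6.620e-03.
z_april = wald_z(2.091e-02, 6.620e-03)
p_april = two_sided_p(z_april)
print(f"z = {z_april:.3f}, p = {p_april:.5f}, significance = '{stars(p_april)}'")
```

Small differences from the tabulated values are expected, since the table reports coefficients and standard errors rounded to four significant digits.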
Let's analyze how the number of requests depends on the month of the year. These coefficients represent the contribution of being in a particular month compared to January, which is taken as the baseline. The number of reported graffiti shows its largest positive coefficients in March and August, but in general it does not present a reliable, clear dependence on seasonality. This is in line with what we can see in the monthly plot of city-wide requests for graffiti removal from January 2011 to May 2013, depicted below: there is no clear seasonal fluctuation.
We now take a look at the coefficients for the demographic indicators. It comes as no surprise that the proportion of Hispanic population is one of the principal drivers of this type of request, since we previously noticed this effect in the exploratory visual analysis and in the results of the K-Means clustering. What is important is that we are now able to quantify this effect. At the same time, it seems that areas populated in large part by African-Americans tend to report less graffiti. The percentage of elderly population is also a major factor, and contributes negatively to the number of requests. This result is also in line with common sense: we wouldn't expect neighborhoods populated predominantly by older people to be covered in graffiti. In this analysis, we need to remember that some of these demographic and socioeconomic indicators are not independent, and the explanatory power for some neighborhoods could be "shared" among correlated predictors.