Clustering analysis
In this part of the analysis, we used the City of Chicago 311 database provided by Chapin Hall, which contains more than 600 request types. By observing the data over the years 2009-2012, we wanted to investigate whether there are areas in the city that request 311 services in the same way.
To do this, we aggregated the data at the census tract level, a unit of analysis that is small enough to reflect consistent, distinctive characteristics (as opposed to, say, groups of neighborhoods), but large enough to contain enough requests for the counts to be statistically meaningful. There are approximately 800 census tracts in Chicago.
To assign each service request to the census tract where it was generated, we used commercial GIS software that allowed us to join the database of 311 requests with a "shapefile" of census tracts on the basis of each request's geographic location. The resulting table, which contains the tract identifier for every request, was then converted into a table of aggregated counts of each service type requested from each census tract.
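The aggregation step can be sketched with pandas. This is only an illustration: it assumes the joined table has one row per request, and the column names `tract_id` and `request_type` and the toy values are hypothetical, not taken from the Chapin Hall data.

```python
import pandas as pd

# Toy stand-in for the joined table: one row per 311 request.
# Column names and values are hypothetical.
requests = pd.DataFrame({
    "tract_id": ["T1", "T1", "T2", "T2", "T2", "T3"],
    "request_type": ["Graffiti Removal", "Pothole", "Graffiti Removal",
                     "Graffiti Removal", "Street Light Out", "Pothole"],
})

# Count requests of each type per tract; missing combinations become 0.
counts = pd.crosstab(requests["tract_id"], requests["request_type"])
print(counts)
```

The result has one row per tract and one column per request type, which is the shape the clustering step works on.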
We are in the process of replicating this analysis using only open data from the City of Chicago data portal. The Python module munging/open_311_munging.py (currently incomplete) will contain the code for finding the census tract in which each 311 request was generated, without the use of commercial GIS software. For more information, check out issue #18.
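Without GIS software, the core operation is a point-in-polygon test: given a request's coordinates, find the tract polygon that contains them. The following is a minimal pure-Python sketch using ray casting. It is not the code from the module above; the function names and the two square "tracts" are hypothetical, and real tract boundaries (which may have holes or several parts) would need more careful handling.

```python
def point_in_polygon(x, y, polygon):
    """Ray-casting test: return True if (x, y) lies inside the polygon.

    `polygon` is a list of (x, y) vertices; holes and multi-part shapes
    are not handled in this sketch.
    """
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does this edge straddle the horizontal line through y?
        if (y1 > y) != (y2 > y):
            # x coordinate where the edge crosses that line
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def find_tract(x, y, tracts):
    """Return the id of the first tract containing (x, y), or None."""
    for tract_id, polygon in tracts.items():
        if point_in_polygon(x, y, polygon):
            return tract_id
    return None


# Two hypothetical square "tracts" side by side.
tracts = {
    "T1": [(0, 0), (1, 0), (1, 1), (0, 1)],
    "T2": [(1, 0), (2, 0), (2, 1), (1, 1)],
}
print(find_tract(0.5, 0.5, tracts))
print(find_tract(1.5, 0.25, tracts))
print(find_tract(3.0, 3.0, tracts))
```

Looping over every tract for every request is slow for large datasets; a spatial index would likely be needed in practice.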
We can formulate the problem of finding tracts with similar 311 request patterns as one of finding points that are close to each other in the space of request types. Each point represents a tract, and each point has a value for each request type - specifically, the volume of requests coming from that tract, normalized by population. So each tract-point has roughly 600 request-type-frequency values associated with it, and each of these values is the point's coordinate along one dimension of a 600-dimensional request-frequency space.
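As a small illustration of the normalization, assuming the counts are held in a NumPy array with one row per tract and the populations in a matching vector (all numbers below are made up):

```python
import numpy as np

# Hypothetical counts: rows are tracts, columns are request types.
counts = np.array([
    [120.0, 30.0, 0.0],
    [10.0, 45.0, 60.0],
])
# Hypothetical tract populations, in the same row order.
population = np.array([4000.0, 2500.0])

# Requests per resident: divide each row by that tract's population.
X = counts / population[:, np.newaxis]
print(X.shape)  # one row per tract, one column per request type
```

Each row of `X` is then the coordinate vector of one tract-point.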
With this setup, finding groups of similar tracts can be expressed as a clustering problem. A "cluster" is a group of points that are closer to each other than they are to points belonging to other clusters, according to some notion of distance. In a two-dimensional space, it is easy to identify clusters visually: they are usually recognizable "clouds" of points.
In a 600-dimensional space, such as ours, the problem is exactly the same, although it loses its geometrical interpretation and takes place in a vector space that is impossible to visualize. Nonetheless, the notion of distance between two points carries over from the two-dimensional example, and so does the notion of nearby points.
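For example, assuming the Euclidean distance (the one that standard k-means works with), the computation in 600 dimensions has the same form as in two. The two vectors below are random placeholders, not real tract data:

```python
import numpy as np

rng = np.random.default_rng(0)
# Two hypothetical tracts, each described by 600 request frequencies.
tract_a = rng.random(600)
tract_b = rng.random(600)

# Euclidean distance: square root of the summed squared differences,
# exactly as in two dimensions but over 600 coordinates.
distance = np.sqrt(np.sum((tract_a - tract_b) ** 2))
print(distance)
```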
One simple yet powerful algorithm for finding distance-based clusters is called k-means. Here's how it works: given a set of points that we want to cluster, we decide how many clusters we want to identify (k) and randomly pick k "centroids", each being the representative of one cluster. Then, the algorithm iteratively:
- Assigns each datapoint to the closest centroid
- Recomputes the position of each centroid to be the mean position of the points that have been assigned to it.
This continues until the centroids don't move anymore, i.e. until all points are assigned to the same cluster as in the previous iteration. This iteration is depicted in the figure below from Wikipedia.
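The loop described above can be sketched in a few lines of NumPy. This is an illustrative toy version, not the implementation used for the analysis; the function name `kmeans`, its parameters, and the sample points are assumptions made for the example.

```python
import numpy as np


def kmeans(X, k, seed=0, max_iter=100):
    """Plain k-means: returns (labels, centroids)."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    labels = None
    for _ in range(max_iter):
        # Assignment step: distance from every point to every centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        # Stop when no point changes cluster.
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Update step: move each centroid to the mean of its points.
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # keep the old centroid if a cluster is empty
                centroids[j] = members.mean(axis=0)
    return labels, centroids


# Two well-separated toy groups in two dimensions.
X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
              [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])
labels, centroids = kmeans(X, k=2)
print(labels)
```

Which group receives label 0 and which receives label 1 depends on the random starting centroids, so only the grouping itself is meaningful.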
In our case, each point represents a tract, so the algorithm will group together tracts that have "similar" 311 request frequency patterns across all 600 request types.
One of the issues with k-means is choosing the right k, the number of clusters we want to identify. In our case, we didn't know a priori what a good k was, so we ran the algorithm for different values of k, then chose the result that appeared to be the most meaningful. We used the k-means implementation from the Python library scikit-learn.
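Running scikit-learn's `KMeans` for several values of k might look like the sketch below. The matrix is random placeholder data, the range of k is arbitrary, and the printed inertia is only one possible numeric aid; the choice described above was made by inspecting the results, not by this number.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Hypothetical stand-in for the tract-by-request-type matrix:
# 200 "tracts" with 20 "request types" (the real data is about 800 x 600).
X = rng.random((200, 20))

inertia_by_k = {}
for k in range(2, 9):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    # inertia_ is the sum of squared distances to the closest centroid.
    inertia_by_k[k] = model.inertia_
    print(k, round(model.inertia_, 2))
```

After fitting, `model.labels_` holds the cluster assigned to each row, which is what gets mapped back to tracts.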
Once we obtained results, we mapped the tracts, assigning an arbitrary color to each cluster. Here we show the clustering for k = 6, along with information on the request makeup of each cluster.
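One way to summarize the request makeup of each cluster is to average the per-capita rates of the tracts assigned to it. The sketch below assumes a pandas table of rates and an array of cluster labels; all names and numbers are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical per-capita request rates for six tracts and three request types.
rates = pd.DataFrame(
    {"Graffiti Removal": [0.30, 0.28, 0.02, 0.01, 0.15, 0.14],
     "Pothole":          [0.05, 0.06, 0.04, 0.05, 0.15, 0.16],
     "Street Light Out": [0.02, 0.03, 0.20, 0.22, 0.05, 0.04]},
    index=["T1", "T2", "T3", "T4", "T5", "T6"],
)
# Hypothetical cluster labels, e.g. as returned by a k-means run.
labels = np.array([0, 0, 1, 1, 2, 2])

# Average rate of each request type within each cluster.
makeup = rates.groupby(labels).mean()
# Most requested service in each cluster.
top_request = makeup.idxmax(axis=1)
print(makeup)
print(top_request)
```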
Note that we didn't include any geographic or demographic information in our clustering - what we see is just a result of grouping together 311 request data. Under the "null hypothesis", one would expect to observe a random patchwork of differently colored tiles, while here we see definite spatial and demographic patterns: tracts that are next to each other tend to exhibit the same request patterns. And these clusters aren't only geographically contiguous, but they overlap to a great extent with Chicago's racial boundaries.
The lilac-colored tiles are largely white-majority neighborhoods, which tend to request graffiti removal and pothole repair in approximately equal volumes.
The purple and light green clusters largely overlap with Hispanic-majority neighborhoods. The predominant type of request there is for graffiti removal, a trend that we already observed in our exploratory analysis.
The deeper green clusters largely overlap with black-majority neighborhoods. There are virtually no requests for graffiti removal, while street lights and weed removal seem to be the most requested services. This is also in line with what we observed in the exploratory analysis. The small cluster colored in orange-pink shows an overwhelmingly high number of requests for weed removal. This is probably due to the large number of vacant lots in the Englewood neighborhood, which leads City employees to file work orders and tickets for neglected lots.