Scatter plots

Goals

  • Plot X vs Y as scatter plot
  • More cool list comprehension use
  • Fit a line to data with numpy
  • Use of numpy's arange()

Description

We often have several bits of data associated with the same entity. For example, here are some public transit statistics:

# http://journalistsresource.org/wp-content/.../Sample-data-sets-for-linear-regression1.xlsx
income = [5800, 6200, 6400, 6500, 6550, 6580, 8200, 8600, 8800, 9200, 9630, 10570, 11330,
          11600, 11800, 11830, 12650, 13000, 13224, 13766, 14010, 14468, 15000, 15200,
          15600, 16000, 16200]
riders = [192000, 190400, 191200, 177600, 176800, 178400, 180800, 175200, 174400, 173920,
          172800, 163200, 161600, 161600, 160800, 159200, 148800, 115696, 147200, 150400,
          152000, 136000, 126240, 123888, 126080, 151680, 152800]

One of the things we often look for is the relationship between two variables. Here is a plot of monthly income versus the number of weekly riders on public transit:

Let's figure out how to create that graph:

import matplotlib.pyplot as plt
# ... define the income and riders lists from above ...
plt.plot(income, riders, "ro")
plt.xlabel("Monthly income in dollars", fontsize=16)
plt.ylabel("Weekly public transit riders", fontsize=16, rotation='vertical')
plt.show()

Eyeballing it, there seems to be a strong negative correlation: ridership drops as income rises. What we really need, though, is a best-fit line through that data. We will use numpy, so import it and call the np.polyfit() function with degree 1:

import numpy as np
fit = np.polyfit(income, riders, deg=1)
m, b = fit[0], fit[1]
print("m = %5.2f, b = %8.1f" % (m, b)) # gives "m = -5.44, b = 220217.6"
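To put a number on the eyeballed correlation, one option is the Pearson correlation coefficient from numpy's np.corrcoef(). This is a minimal sketch, assuming the income and riders lists from above; the variable name r is just illustrative:

```python
import numpy as np

# Same data as above
income = [5800, 6200, 6400, 6500, 6550, 6580, 8200, 8600, 8800, 9200, 9630, 10570, 11330,
          11600, 11800, 11830, 12650, 13000, 13224, 13766, 14010, 14468, 15000, 15200,
          15600, 16000, 16200]
riders = [192000, 190400, 191200, 177600, 176800, 178400, 180800, 175200, 174400, 173920,
          172800, 163200, 161600, 161600, 160800, 159200, 148800, 115696, 147200, 150400,
          152000, 136000, 126240, 123888, 126080, 151680, 152800]

# np.corrcoef returns a 2x2 correlation matrix; the off-diagonal entry
# is the correlation between income and riders.
r = np.corrcoef(income, riders)[0, 1]
print("r = %5.2f" % r)
```

A value close to -1 or 1 suggests a strong linear relationship, and the sign of r matches the sign of the fitted slope m.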

Now we have the slope and Y intercept, but to draw the line we still need points along it. To compute those points, define a function that is the equivalent of y = mx + b:

def line(m, b, x):
    return m * x + b

Now, let's use np.arange() to get a bunch of X values that span the range of the X axis (the income variable). Then we can compute the matching Y values with a list comprehension:

LEFT = round(min(income))
RIGHT = round(max(income))
linex = np.arange(LEFT, RIGHT, 0.1)
liney = [line(m, b, x) for x in linex]
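Because linex is a numpy array, the same Y values can also be computed without a list comprehension: arithmetic on arrays is applied elementwise, and np.polyval() evaluates a polynomial from coefficients in the same order that np.polyfit() returns them. A minimal sketch, using made-up values for m, b, and the X range (assumptions for illustration only, not the fitted values):

```python
import numpy as np

def line(m, b, x):
    return m * x + b

# Hypothetical slope, intercept, and X range, chosen only for illustration
m, b = -2.0, 10.0
linex = np.arange(0, 5, 0.5)

liney_list = [line(m, b, x) for x in linex]   # list comprehension, as above
liney_array = line(m, b, linex)               # elementwise array arithmetic
liney_polyval = np.polyval([m, b], linex)     # coefficients, highest power first

print(np.allclose(liney_list, liney_array))
print(np.allclose(liney_list, liney_polyval))
```

All three approaches give the same values; the list comprehension simply makes the per-point computation explicit.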

Now all we have to do is plot:

# Add these calls before plt.show() so the line appears on the same figure
plt.plot(linex, liney, '--')
plt.title("Fit $y = %2.3f x + %2.3f$" % (m, b), fontsize=16)

This should give you the following graph:

Student exercise

Fit a cubic curve, rather than a line, through the data so that your graph looks like:

You have to increase the degree parameter passed to numpy's polyfit function and then create a cubic function that takes more coefficients:

def cubic(a, b, c, d, x):
    ...

Note that just because we can use a higher-order polynomial to get a closer fit of the curve to the data doesn't mean that we should. The cubic curve has clearly "overfit the data," meaning that it is adjusting to the random variations in the data when we should be looking at the overall trend. Of course, if you have reason to suspect a quadratic or cubic relationship between the variables, then you should use a higher-order polynomial than degree 1.
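One way to see why a closer fit is not, by itself, evidence of a better model is to compare the sum of squared errors for degree 1 and degree 3 on the same data. This is a rough sketch, assuming the income and riders lists from above; the helper name sum_squared_error is hypothetical, and it uses np.polyval() to evaluate the fitted polynomial rather than a hand-written function:

```python
import numpy as np

# Same data as above
income = [5800, 6200, 6400, 6500, 6550, 6580, 8200, 8600, 8800, 9200, 9630, 10570, 11330,
          11600, 11800, 11830, 12650, 13000, 13224, 13766, 14010, 14468, 15000, 15200,
          15600, 16000, 16200]
riders = [192000, 190400, 191200, 177600, 176800, 178400, 180800, 175200, 174400, 173920,
          172800, 163200, 161600, 161600, 160800, 159200, 148800, 115696, 147200, 150400,
          152000, 136000, 126240, 123888, 126080, 151680, 152800]

def sum_squared_error(deg):
    """Fit a polynomial of the given degree and measure its error on the same data."""
    coeffs = np.polyfit(income, riders, deg=deg)
    predicted = np.polyval(coeffs, income)
    return float(np.sum((np.array(riders) - predicted) ** 2))

for deg in (1, 3):
    print("degree %d: sum of squared errors = %.4g" % (deg, sum_squared_error(deg)))
```

A least-squares fit of higher degree cannot do worse on the data it was fit to, because the straight line is a special case of the cubic. The drop in error therefore says nothing about whether the extra wiggles reflect a real trend.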