This week we will cover checking model assumptions (a.k.a. diagnostics), an introduction to GLMs, and visualization theory. As before, this is all the homework for the week up front and you can work at your own pace. Do at least (2) and (3) by Monday, preferably also (4). We will talk about visualization theory and ggplot on Wednesday, so be sure to have read the paper by then.
I have returned feedback on your repositories through week 4. This feedback tells you what you have successfully completed. If there are incomplete items, you'll need to revise to get completion credit. See 00_feedback_revision.md for how to view and revise those items. This is also practice for collaborative work using git and GitHub. There is no particular timeline for revisions; you can revise as many times as you like until you reach completion. Email me when you push your revisions back to GitHub.
View and listen to the lectures on model diagnostics (for Normal linear and generalized linear models). The audio is separate from the slides; advance the slides manually as you listen. This recording is from a previous year. Explore the code.
There is a separate R tutorial for QQ plots (there is no accompanying lecture material).
- tutorial: 09_5_quantiles-qqplots.md
An issue that is not covered in these lectures is leverage. There are currently no widespread methods for assessing leverage in multilevel models, so it is a bit of a specialist tool for simple linear models and GLMs, and I'm going to skip the theory for it. The general idea is that a point with high leverage is an influential point in a special way: typically it is at one end of the independent variable and it is far from the fitted line relative to other points. Because of the geometry, the point "pulls" on the line like a lever, thus affecting the estimate of the slope. For Normal linear models and some other GLMs, leverage can be visualized like this (where `fit` is the saved fitted-model object from `lm()` or `glm()`):

```r
plot(fit, 5)
```
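If you want the leverage values themselves rather than the built-in plot, base R's `hatvalues()` extracts them from an `lm()` or `glm()` fit. A minimal sketch (the 2p/n cutoff is just a common rule of thumb, not a hard rule):

```r
# Leverage (hat) values, one per observation
h <- hatvalues(fit)

# Common rule of thumb: flag points with leverage above 2 * p / n,
# where p is the number of estimated coefficients and n the sample size
p <- length(coef(fit))
n <- length(h)
which(h > 2 * p / n)
```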
We will continue working with the ants data on Monday, sharing your ideas for analysis. Each group noticed different things and had great ideas. You also identified features of the data that would make a straight-up linear regression problematic. Here in the homework, we will dive deeper into identifying problems with the assumptions of a standard Normal linear regression model for these data.
- Read in the ants data and convert the habitat variable to a factor:

```r
# Read in data
ant <- read.csv("data/ants.csv")

# Quick view of the dataframe
head(ant)

# Set habitat to be a factor
ant$habitat <- factor(ant$habitat)

# Print the habitat column. The factor levels are shown at the end.
ant$habitat
```
- A factor is an R data structure for categorical variables. See `?factor`. One attribute of a factor is its levels, which are sorted alphabetically by default. A toy example is sketched below.
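A quick illustration of factor levels (a hypothetical vector, not part of the assignment):

```r
# Hypothetical example: levels are sorted alphabetically by default
x <- factor(c("forest", "bog", "forest", "bog"))
levels(x)     # "bog" "forest"
as.integer(x) # underlying integer codes: 2 1 2 1
```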
- Using a `.R` file (i.e. not `.Rmd`), fit a Normal linear model to the ants data that would best address the scientific questions posed in class. The appropriate model is a multiple regression with an interaction of latitude and habitat. We will not include the `elevation` or `site` variables for now. It's easiest to fit this model with `lm()`:

```r
fit <- lm(richness ~ habitat + latitude + habitat:latitude, data=ant)
```
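Once fitted, you can inspect the estimates before moving on to diagnostics (standard base-R accessors, shown here as a sketch):

```r
# Coefficient estimates, standard errors, t statistics, and R-squared
summary(fit)

# Just the estimated coefficients
coef(fit)
```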
- Don't attempt to fix any problems with the model (i.e. don't transform the data, use a different distributional assumption, or fit a nonlinear model).
- To extract some basic information to construct diagnostic plots, you can use:

```r
r <- fit$residuals
fv <- fit$fitted
cooks <- cooks.distance(fit)
```
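As a minimal sketch of how these objects feed into base-R diagnostic plots (one possible layout; construct whichever versions were described in the lectures):

```r
# Residuals vs fitted values: check linearity and constant variance
plot(fv, r, xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)

# QQ plot of residuals: check the Normality assumption
qqnorm(r)
qqline(r)

# Cook's distance by case: check for influential points
plot(cooks, type = "h", ylab = "Cook's distance")
```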
- Construct all the diagnostic plots described in task (2) above for this model. For the case deletion influence plot, you can use Cook's distance in lieu of the full case-deletion algorithm.
- Describe the patterns in the diagnostic plots. What assumptions of the Normal linear model are violated according to these patterns?
- In a separate `.R` file, do the same for the dataset you have been using for linear models so far.
- Knit each analysis to a markdown report and push to GitHub.
The models we have encountered so far are a special case of generalized linear models (GLMs). We are now going to extend to the general case of a GLM: a model that is linear on the scale of a transformed mean (the link function), with a range of possible distributions for the data. This is within the assumed prerequisites for this class, so hopefully for most of you this is revision, reinforcement, or another perspective.
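As a preview (a sketch, not part of the required reading), fitting a GLM in R looks much like `lm()`; for count data such as the ant species richness, a Poisson GLM with a log link could be specified as:

```r
# Sketch: Poisson GLM for count data; same formula as the linear model above,
# with family specifying the distribution and link function
fitglm <- glm(richness ~ habitat + latitude + habitat:latitude,
              family = poisson(link = "log"), data = ant)
summary(fitglm)
```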
- Read section 9.2, pp. 280-288.
- Skip section 9.2.4. The information there is incorrect: you can indeed compare models with different likelihoods via AIC, although you need to be careful to do it correctly.
- Optionally, the perspective on maximum entropy in section 9.1 is quite interesting and somewhat distinctive (it is heavily influenced by Jaynes 2003, *Probability Theory*), but you don't need to understand it. If this is your first time encountering GLMs, skip it, since it is more likely to be bewildering than helpful.
- Theory and concepts behind `ggplot`. Read Wickham's paper:
  - http://vita.had.co.nz/papers/layered-grammar.pdf
- Work through Chapter 3 of R for Data Science, including the exercises:
  - http://r4ds.had.co.nz/data-visualisation.html
- We will be using `ggplot` a lot from now on. A first taste is sketched below.
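As a small first taste (a sketch using the ants data read in above; assumes the ggplot2 package is installed):

```r
# Sketch: species richness vs latitude, colored by habitat
library(ggplot2)
ggplot(ant, aes(x = latitude, y = richness, color = habitat)) +
    geom_point()
```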