Brett Melbourne 2021-09-01
Note: The .md version of this document is best for viewing on GitHub. See the .Rmd version for the latex equation markup. The .Rmd version does not display the plots in GitHub and is best for viewing within RStudio.
Here I outline all the main inferences that we get from lm() for a simple linear regression fit, and explain where they all come from.
First we’ll generate some fake data for illustration.
set.seed(4.6) #make example reproducible
n <- 30 #size of dataset
b0 <- 20 #true y intercept
b1 <- 10 #true slope
s <- 20 #true standard deviation of the errors
x <- runif(n, min=0, max=25) #nb while we have used runif, x is not a random variable
y <- b0 + b1 * x + rnorm(n, sd=s) #random sample of y from the population
The first thing we do with any data is plot them. Never proceed to fitting a model without first plotting the data. Of course, this all looks good.
plot(x, y)
Satisfied that the data would be appropriately described by a linear model, we’ll go ahead and fit the model and then ask for a summary of the fit.
fit <- lm(y ~ x)
summary(fit)
##
## Call:
## lm(formula = y ~ x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -39.579 -11.269 -2.065 15.997 25.268
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 32.0961 7.6909 4.173 0.000264 ***
## x 9.5111 0.4705 20.216 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 18.2 on 28 degrees of freedom
## Multiple R-squared: 0.9359, Adjusted R-squared: 0.9336
## F-statistic: 408.7 on 1 and 28 DF, p-value: < 2.2e-16
Let’s walk through all the output in this summary. First is the call that we made to lm() - it tells us that we fitted the linear model y ~ x. If you’re not yet familiar with this notation, we’ll cover it in more detail shortly. The tilde here is not the same as in Bayesian analysis (i.e. it is not a distributional statement). It is a linear model shorthand for the following linear model:

y_i = \beta_0 + \beta_1 x_i + e_i, \qquad e_i \sim Normal(0, \sigma_e)

where \beta_0 and \beta_1 are the intercept and slope parameters and the e_i are errors drawn independently from a Normal distribution with mean zero and standard deviation \sigma_e.
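As a brief aside on the notation (not part of the summary output above), the intercept is implicit in an R formula, so the following two calls specify the same model; the second form simply writes the intercept term out explicitly.

fit <- lm(y ~ x)     #intercept included implicitly
fit <- lm(y ~ 1 + x) #identical model with the intercept written explicitly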
The next section of the summary is some distribution information about the residuals. Remember that the residuals are the deviations of the observed y_i from the fitted values, i.e. \hat{e}_i = y_i - \hat{y}_i. The summary reports the minimum, maximum, median, and quartiles of these deviations.
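As a quick check (an extra step, not in the output above), those five numbers are just the quantiles of the residuals of the fit, so the following should reproduce the Residuals block of the summary.

quantile(residuals(fit)) #min, 1st quartile, median, 3rd quartile, max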
The next section of the summary is the parameter (aka coefficient) estimates, i.e. \hat{\beta}_0 for the intercept and \hat{\beta}_1 for the slope of x. The standard errors, t-values, and p-values are inferences about these parameters. Where do these inferences come from? They are frequentist inferences, so they all stem from inference algorithms based on the sampling distribution. I’m going to go ahead and print the coefficient table again so we have them in view. But first, let’s take a look at the object that is created by the call to the summary function.
sf <- summary(fit)
str(sf)
## List of 11
## $ call : language lm(formula = y ~ x)
## $ terms :Classes 'terms', 'formula' language y ~ x
## .. ..- attr(*, "variables")= language list(y, x)
## .. ..- attr(*, "factors")= int [1:2, 1] 0 1
## .. .. ..- attr(*, "dimnames")=List of 2
## .. .. .. ..$ : chr [1:2] "y" "x"
## .. .. .. ..$ : chr "x"
## .. ..- attr(*, "term.labels")= chr "x"
## .. ..- attr(*, "order")= int 1
## .. ..- attr(*, "intercept")= int 1
## .. ..- attr(*, "response")= int 1
## .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv>
## .. ..- attr(*, "predvars")= language list(y, x)
## .. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "numeric"
## .. .. ..- attr(*, "names")= chr [1:2] "y" "x"
## $ residuals : Named num [1:30] -1.56 11.31 -9.39 -10.71 -7.82 ...
## ..- attr(*, "names")= chr [1:30] "1" "2" "3" "4" ...
## $ coefficients : num [1:2, 1:4] 32.1 9.51 7.69 0.47 4.17 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:2] "(Intercept)" "x"
## .. ..$ : chr [1:4] "Estimate" "Std. Error" "t value" "Pr(>|t|)"
## $ aliased : Named logi [1:2] FALSE FALSE
## ..- attr(*, "names")= chr [1:2] "(Intercept)" "x"
## $ sigma : num 18.2
## $ df : int [1:3] 2 28 2
## $ r.squared : num 0.936
## $ adj.r.squared: num 0.934
## $ fstatistic : Named num [1:3] 409 1 28
## ..- attr(*, "names")= chr [1:3] "value" "numdf" "dendf"
## $ cov.unscaled : num [1:2, 1:2] 0.178623 -0.009855 -0.009855 0.000668
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:2] "(Intercept)" "x"
## .. ..$ : chr [1:2] "(Intercept)" "x"
## - attr(*, "class")= chr "summary.lm"
Above, we see that the result of the summary function is an object that contains a bunch of other objects, such as “residuals”, “coefficients”, and “sigma”. Next we’ll look just at the coefficients, which are also known as the estimated parameters, or simply the estimates. This is often the most important part of the analysis since some or all of the parameter estimates and their uncertainties are typically the target of scientific questions. For example, we might be interested primarily in the value of the slope. Is it steep or shallow? Does it represent a rate? Is it positive or negative? Is it non-zero? And so on.
sf$coefficients
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 32.096061 7.690906 4.173248 2.636342e-04
## x 9.511143 0.470481 20.215786 3.060848e-18
We see that the estimated intercept is about 32.1 and the estimated slope is about 9.5. Compare these to the true values used to generate the data: the slope estimate is close to the true value of 10, while the intercept estimate is further from its true value of 20, although, as we will see, well within the uncertainty of the estimate.
Where do those standard errors for the parameters in the coefficients table come from? Those are estimates of the standard deviations of the sampling distributions of \hat{\beta}_0 and \hat{\beta}_1. For the Normal linear model, mathematical theory gives these standard deviations as

\sigma_0 = \frac{\sigma_e}{\sqrt{n}} \sqrt{1 + \frac{\bar{x}^2}{\sigma^2_x}}, \qquad \sigma_1 = \frac{\sigma_e}{\sigma_x \sqrt{n}}

where \sigma_e is the standard deviation of the errors and \sigma^2_x is the variance of the x values (with divisor n). The theory also says that, when the errors are Normal, the sampling distributions of \hat{\beta}_0 and \hat{\beta}_1 are themselves Normal, centered on the true parameter values, with these standard deviations.
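As a quick check within R (an extra step, not in the summary output), the standard errors reported by summary() are the square roots of the diagonal of the estimated variance-covariance matrix of the parameters, which vcov() returns, so the following should reproduce the Std. Error column.

sqrt(diag(vcov(fit))) #standard errors of the intercept and slope estimates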
Simulation is a good way to explore and understand this concept and to see how the sampling distributions of the parameters will be exactly Normal when the errors in y are Normal. In pseudocode, the sampling distribution algorithm is:

repeat very many times
    sample data from the population
    fit the linear model
    estimate the parameters
plot sampling distribution (histogram) of the parameter estimates
We’ll compare the results of this sampling distribution algorithm (implemented as a stochastic simulation) to the mathematical theory. In R code now:
# Simulation of sampling distribution for estimated beta_0 and beta_1
# About 1 minute
# Imagine a large population of Normal errors, which will be
# the true population in our simulation.
pop_e <- rnorm(1e6,mean=0,sd=s)
# (pop_e has mean = 0.00 and sd = 20.0)
# Sample from this population and estimate parameters many times
reps <- 10000 #Large enough (set larger for even better precision)
b0sims <- rep(NA, reps) #to store sim results
b1sims <- rep(NA, reps)
for ( i in 1:reps ) {
errs <- sample(pop_e,30) #sample from population of errors
ysim <- b0 + b1*x + errs #simulate data for y
fitsim <- lm(ysim ~ x) #fit linear model
b0sims[i] <- coef(fitsim)[1] #estimated beta_0
b1sims[i] <- coef(fitsim)[2] #estimated beta_1
}
# Plot simulated sampling distributions and compare to theory
par(mfrow=c(1,2))
# Simulated sampling distribution for beta_0
hist(b0sims, freq=FALSE, ylim=c(0,0.05), col="#56B4E9",
main="Sampling distribution beta_0")
# Theoretical sampling distribution for beta_0
varpx <- mean((x-mean(x)) ^2 ) #population variance of x
k0 <- sqrt(1 + mean(x)^2 / varpx)
s0 <- k0 * s / sqrt(n)
yy <- seq(min(b0sims), max(b0sims), length.out=100)
lines(yy, dnorm(yy, mean=b0, sd=s0), lwd=2, col="#E69F00")
# Simulated sampling distribution for beta_1
hist(b1sims, freq=FALSE, ylim=c(0,0.8), col="#56B4E9",
main="Sampling distribution beta_1")
# Theoretical sampling distribution for beta_1
k1 <- 1 / sqrt( varpx )
s1 <- k1 * s / sqrt(n)
yy <- seq(min(b1sims), max(b1sims), length.out=100)
lines(yy, dnorm(yy, mean=b1, sd=s1), lwd=2, col="#E69F00")
There is an extremely close match of the theory (curve) and the simulation (histogram) of the sampling distribution. We used a large number of replications (10,000) to inspect this concordance. You can increase the replication even further to see that the theory exactly matches simulation. This demonstrates that the mathematical theory for the sampling distribution is sound. Most importantly, it demonstrates from an algorithmic point of view that the sampling distribution concept is actually quite simple (a few lines of pseudocode) and that we can implement the algorithm readily.
However, unlike the simulation, for our actual dataset we don’t know the true population of errors or the true sampling distributions, so we don’t know the true \sigma_e. Instead, we plug in its estimate \hat{\sigma}_e from the fitted model (the residual standard error) to get the estimated standard errors of the parameters, which match the Std. Error column of the summary:
sigma_e <- sf$sigma #\hat{\sigma}_e
print(sigma_e)
## [1] 18.19736
k0 * sigma_e / sqrt(n) #\hat{\sigma}_0
## [1] 7.690906
k1 * sigma_e / sqrt(n) #\hat{\sigma}_1
## [1] 0.470481
The assumption of normality of the errors is also what justifies the use of the t distribution in what follows: because we plug in the estimate \hat{\sigma}_e rather than the unknown \sigma_e, the standardized parameter estimates follow a t distribution with n - 2 = 28 degrees of freedom rather than a Normal distribution.
After that interlude to explore sampling distributions, let’s get back to our table of parameter estimates, which I’ll print again here:
sf$coefficients
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 32.096061 7.690906 4.173248 2.636342e-04
## x 9.511143 0.470481 20.215786 3.060848e-18
What about the t values? The t value is the size of the estimate in standard error units. For example, the t value for the intercept above is:
32.096061 / 7.690906
## [1] 4.173248
Thus, 4.17 standard error units. Simple.
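To get the whole column at once, we can divide the Estimate column by the Std. Error column; this should reproduce the t value column of the coefficient table.

sf$coefficients[,"Estimate"] / sf$coefficients[,"Std. Error"] #t values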
What about the p-values? Recall that a p-value is defined as:
The probability of a sample statistic as large or larger than the one observed given that some hypothesis is true.
It is a measure of the improbability of the sample statistic under
some hypothesis. The hypothesis under consideration is typically a
null hypothesis. For example, here, for the intercept, it is the null hypothesis that the true (unobserved) value of the intercept is zero (along with other conditions, such as that the model and its assumptions are correct). We can use the pt() function to calculate the cumulative probability up to the observed t of 4.173, which is the probability that a sampled value of t is less than or equal to 4.173. The probability that t is as large or larger than 4.173 is then one minus this cumulative probability (close enough; we haven’t bothered to add back in the probability that
t=4.173). But we are not done yet since t could be as “large or
larger” than 4.173 in the negative direction also (i.e. t could be <
-4.173). We need to add that possibility. The p value is the sum of
these two probabilities. In other words, this is what is called a
two-tailed p-value because it allows for t to be larger than
expected in either tail of the t distribution. As a calculation
shortcut we could alternatively double one of the tail probabilities
since the t distribution is symmetric.
1 - pt(4.173248, 28) #as large or larger in positive direction
## [1] 0.0001318173
pt(-4.173248, 28) #"as large or larger" in negative direction (as small or smaller than -t)
## [1] 0.0001318173
1 - pt(4.173248, 28) + pt(-4.173248, 28) #two-tailed p-value
## [1] 0.0002636345
2 * (1 - pt(4.173248, 28)) #Equivalent to twice one of the tails since symmetric
## [1] 0.0002636345
Thus, the p-values for the parameters come from the t sampling distribution with n - 2 = 28 degrees of freedom: each is the two-tailed probability of a t value as large as or larger in magnitude than the one observed, under the null hypothesis that the parameter is zero. Our calculation above matches the p-value for the intercept in the coefficient table.
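To reproduce the whole Pr(>|t|) column at once (an extra check, not in the original output), we can double the lower-tail probability of minus the absolute t values.

2 * pt(-abs(sf$coefficients[,"t value"]), df=fit$df.residual) #two-tailed p-values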
We can get a precise 95% confidence interval for each of these
parameters with the confint
function.
confint(fit)
## 2.5 % 97.5 %
## (Intercept) 16.341954 47.85017
## x 8.547406 10.47488
These are exactly the parameter estimates plus or minus the 2.5% and 97.5% quantiles of the t distribution (with 28 degrees of freedom) multiplied by the standard errors:
lo <- sf$coefficients[,"Estimate"] + qt(p=0.025, df=28) * sf$coefficients[,"Std. Error"]
up <- sf$coefficients[,"Estimate"] + qt(p=0.975, df=28) * sf$coefficients[,"Std. Error"]
cbind(lo, up)
## lo up
## (Intercept) 16.341954 47.85017
## x 8.547406 10.47488
You can see that these are the same values given by confint. Notice that these are close enough to the rough approximation of the estimate plus or minus 2 standard errors:
lo <- sf$coefficients[,"Estimate"] - 2 * sf$coefficients[,"Std. Error"]
up <- sf$coefficients[,"Estimate"] + 2 * sf$coefficients[,"Std. Error"]
cbind(lo, up)
## lo up
## (Intercept) 16.714249 47.47787
## x 8.570181 10.45210
So, you should get used to scanning summary tables and doubling the standard error to get approximate 95% CIs for the parameters.
What else is in that summary? The final lines of the summary report:
Residual standard error: 18.2 on 28 degrees of freedom
Multiple R-squared: 0.9359, Adjusted R-squared: 0.9336
F-statistic: 408.7 on 1 and 28 DF, p-value: < 2.2e-16
To understand these, it is helpful to look at the analysis of variance (ANOVA) table as well.
anova(fit)
## Analysis of Variance Table
##
## Response: y
## Df Sum Sq Mean Sq F value Pr(>F)
## x 1 135331 135331 408.68 < 2.2e-16 ***
## Residuals 28 9272 331
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The ANOVA table partitions the sums of squares into contributions from the deterministic part of the model (the part associated with x) and the stochastic part (the residuals). The total sum of squares is the sum of the squared deviations of y from its mean, \sum_i (y_i - \bar{y})^2:
totssq <- sum((y-mean(y)) ^ 2)
totssq
## [1] 144603.2
This number is the same as the sum of the “Sum Sq” column in the table. The residual SSQ in the table is the minimum SSQ that our training algorithm found to determine the best-fit parameters for intercept and slope. In this case, the best fit gives us a residual sum of squares of 9272. We can access the residuals from the fit also.
residuals(fit)
## 1 2 3 4 5 6
## -1.55620951 11.31380659 -9.39022334 -10.71351167 -7.82191419 21.90303800
## 7 8 9 10 11 12
## 0.06059346 25.13012559 25.26770263 0.65580802 -8.53172861 16.51695561
## 13 14 15 16 17 18
## 7.32351925 -18.99654092 17.78686819 -3.46477607 20.81027668 -20.04311845
## 19 20 21 22 23 24
## -29.98032952 14.43566709 -11.45414338 -4.46413778 12.77321555 -15.42617272
## 25 26 27 28 29 30
## -16.91325236 24.93175175 -2.57463733 24.04152606 -39.57886391 -22.04129472
and see that the sum of their squares is the residual sum of squares
rssq <- sum(residuals(fit) ^ 2)
rssq
## [1] 9272.029
The sum of squares associated with including x in the model is the difference between the total SSQ and the residual SSQ:
xssq <- totssq - rssq
xssq
## [1] 135331.2
Thus, we can say that including x in the model accounts for 135331 of the 144603 total units of sum-of-squares variation in y, or about 94%. This proportion is the “Multiple R-squared” reported in the summary.
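A quick check of that proportion against the value stored in the summary object (an extra step, not in the original output):

xssq / totssq #proportion of the total SSQ accounted for by x
sf$r.squared  #Multiple R-squared reported by summary()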
The next column of the ANOVA table is the mean square, which is the SSQ divided by its corresponding degrees of freedom. The degrees of freedom for the residuals is n - 2 = 28 because two parameters (the intercept and the slope) were estimated to obtain the residuals. The residual mean square is an estimate of the variance of the errors, \sigma^2_e. A naive estimate would simply divide the residual SSQ by n, but mathematical theory tells us that if we used that formula as is, the estimate of the variance would be biased downward. Dividing the SSQ by the degrees of freedom instead of n corrects this bias. Thus, the residual mean square is:
rms <- rssq / 28
rms
## [1] 331.1439
and its square root is the estimate for the standard deviation of
sqrt(rms)
## [1] 18.19736
This number is the residual standard error in our summary output.
The F-statistic in the summary output can also be understood from the ANOVA table. F is the mean square associated with x divided by the residual mean square:
xms <- xssq / 1 #mean square of x is xssq / df, where df for x = 1
F <- xms / rms #read this as "x mean square divided by residual mean square"
F
## [1] 408.678
Finally, the p-value for F is the probability of sampling an F as large as or larger than the observed F, given the null hypothesis that the true slope \beta_1 is zero (i.e. that x contributes nothing to explaining the variation in y). Again, mathematical theory tells us how to construct a sampling distribution for F under the null hypothesis. We won’t cover that detail here (although we could do it with a simple simulation similar to the one above), but the sampling distribution is the F-distribution, so we can obtain the p-value thus:
1 - pf(F, df1=1, df2=28)
## [1] 0
For this example, it turns out that the p-value is very close to zero, but there is a limit to how close to zero the computer can get and still be accurate, so R reports this in the summary table as < 2.2e-16, a value below which the calculated probability is no longer accurate.
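That threshold is the machine precision for double-precision numbers, which R stores in the .Machine object.

.Machine$double.eps #about 2.2e-16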
You’ll notice that both the p-value for the F-statistic associated
with x and the p-value for the t statistic associated with x in the
summary table are the same. They are both ultimately getting at the same
null hypothesis here, that the true slope \beta_1 is zero. Indeed, with a single predictor the two tests are equivalent: the F statistic is exactly the square of the t statistic for x.
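We can check that equivalence directly (an extra check, not in the original output); squaring the t value for x should reproduce the F statistic.

sf$coefficients["x","t value"]^2 #t value for x, squared
sf$fstatistic["value"]           #F statistic from summary()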
The R-squared statistics are different standardized measures of the proportion of variance explained by the linear model fit. For a variety of reasons, these are rarely useful in practical work and we won’t go into more detail here.
To get frequentist intervals for the mean of y, or for predictions of new observations of y, we use the predict() function. By default, predict() will provide predictions (both point predictions and confidence intervals or prediction intervals) for each data point in the current dataset based on their x values. To get a prediction at a new value of x, say x = 10, we first put it in a new data frame.
newd <- data.frame(x=10)
newd
## x
## 1 10
We can then get the predicted mean, along with its confidence interval, by passing the new data to predict().
predict(fit, newd, interval = "confidence")
## fit lwr upr
## 1 127.2075 119.0094 135.4056
The predicted (or estimated) mean of y at x = 10 is about 127.2, with a 95% confidence interval from about 119.0 to 135.4.
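For a peek under the hood (an extra step, not in the original), the confidence interval is the predicted mean plus or minus a t quantile times the standard error of the predicted mean, which predict() returns when asked with se.fit=TRUE; this should reproduce the interval above.

pr <- predict(fit, newd, se.fit=TRUE) #also return the standard error of the mean
pr$fit + qt(c(0.025, 0.975), df=fit$df.residual) * pr$se.fit #lwr and upr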
We can similarly get a prediction interval for a new observation of y at x = 10.
predict(fit, newd, interval = "prediction")
## fit lwr upr
## 1 127.2075 89.04102 165.374
We see that the prediction interval is wider, as expected, because it also includes the uncertainty from sampling a new value of y from the distribution of errors around the mean, in addition to the uncertainty in the estimated mean itself.
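Continuing the sketch above, the prediction interval adds the estimated error variance \hat{\sigma}^2_e to the variance of the estimated mean before taking the square root; this should reproduce the prediction interval.

se_pred <- sqrt(pr$se.fit^2 + sf$sigma^2) #standard error for a new observation
pr$fit + qt(c(0.025, 0.975), df=fit$df.residual) * se_pred #lwr and upr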
To calculate confidence intervals and prediction intervals for a range of x values (say, to plot them with the data), we can cbind the output matrix from predict() to the new data.
newd <- data.frame(x = seq(min(x), max(x), length.out=100))
pred_w_ci <- cbind(newd, predict(fit, newd, interval = "confidence"))
pred_w_pi <- cbind(newd, predict(fit, newd, interval = "prediction"))
We can then plot these with the data. Use your favorite graphing facilities. Here, I use base plotting functions.
# y limits for plots
ylwr <- min(pred_w_pi[,c("fit","lwr","upr")])
yupr <- max(pred_w_pi[,c("fit","lwr","upr")])
# Plot
plot(y ~ x, ylim=c(ylwr,yupr))
lines(pred_w_ci[c(1,nrow(pred_w_ci)),c("x","fit")], col="#56B4E9")
lines(pred_w_ci[,c("x","lwr")], col="grey")
lines(pred_w_ci[,c("x","upr")], col="grey")
lines(pred_w_pi[,c("x","lwr")], col="grey", lty=2)
lines(pred_w_pi[,c("x","upr")], col="grey", lty=2)
An important task is to check the model fit. We want to assess assumptions of the model, check for outliers, and consider data points that could have an unusual influence on the inference. There are good standard diagnostic tools for doing this for standard linear models. The workhorse is a series of diagnostic plots, some of which are given by the default call to plot.
plot(fit)
But there are more plots available and a range of options:
plot(fit, 1:6)
?plot.lm
We will discuss diagnostics in more detail later as this is a very important topic and most issues are common to frequentist, likelihood, and Bayesian analyses.
- We have looked at the inferences that are provided by R’s lm() function and explored the frequentist theory behind them.
- The inferences are all the result of the frequentist way of counting how data could happen - the sampling distribution for a statistic - and using the plug-in principle.
- From an algorithmic point of view we saw that the sampling distribution concept is quite simple (a few lines of pseudocode) and that we can implement the algorithm readily. As we shall see, we can take further advantage of this to devise purely algorithmic approaches to frequentist inference, such as the bootstrap.
- Realize also that the algorithm is what is important, rather than the intricacies of the mathematical theory. Much of the math was devised essentially to get around computational limitations that existed when linear regression was invented. These limitations are no longer so constraining.
- The code for a basic analysis of a linear model using lm(), with the most common inferences that you might desire, is recapitulated here:
plot(x, y)
fit <- lm(y ~ x)
summary(fit)
confint(fit)
newd <- data.frame(x=seq(min(x), max(x), length.out=100))
pred_w_ci <- cbind(newd, predict(fit, newd, interval="confidence"))
pred_w_pi <- cbind(newd, predict(fit, newd, interval="prediction"))
lines(pred_w_ci[c(1,nrow(pred_w_ci)),c("x","fit")], col="#56B4E9")
lines(pred_w_ci[,c("x","lwr")], col="grey")
lines(pred_w_ci[,c("x","upr")], col="grey")
lines(pred_w_pi[,c("x","lwr")], col="grey", lty=2)
lines(pred_w_pi[,c("x","upr")], col="grey", lty=2)
plot(fit, 1:6)