---
title: "Linear Modeling in R"
author: "Clay Ford, UVA Library"
output: html_notebook
editor_options:
  chunk_output_type: inline
---
## Quick Intro to R Notebooks and R Markdown
This is an R Markdown Notebook. When you execute R code within the notebook, the results appear beneath the code.
This file was created in RStudio by going to File...New File...R Notebook.
R code needs to be in "chunks" in an R Markdown Notebook. Below is an example of an R code chunk. It makes a parabola.
Try executing this chunk by clicking the *Run* button within the chunk or by placing your cursor inside it and pressing *Ctrl+Shift+Enter* (Win/Linux) or *Cmd+Shift+Return* (Mac).
```{r}
x <- seq(-1, 1, by = 0.01)
y <- x^2
plot(x, y, type = "l")
```
To hide the output, click the Expand/Collapse output button. To clear results (or an error), click the "x".
You can also press *Ctrl+Enter* (Win/Linux) or *Cmd+Return* (Mac) to run one line of code at a time (instead of the entire chunk).
Add a new R code chunk by clicking the *Insert Chunk* button on the toolbar or by pressing *Ctrl+Alt+I* (Win/Linux) or *Cmd+Option+I* (Mac).
## CODE ALONG 0
Insert a new R code chunk below and type and run the code: `Sys.time()`
## Linear Modeling with Simulated Data
Instead of using theory and formulas, let's explore and review linear modeling using simulated data.
Below we assign to x the values 1 through 25. Then we generate y as a function of x using the formula 10 + 5*x:
```{r}
x <- 1:25
y <- 10 + 5*x # formula for a line
d <- data.frame(x, y)
plot(y ~ x, data = d)
```
- 10 is the intercept.
- 5 is the slope.
- y is completely determined by x (y = 10 + 5*x)
Now let's add some "noise" to our data by adding random draws from a Normal
distribution with mean = 0 and a standard deviation = 10. The `rnorm()` function allows us to draw random values from a Normal distribution.
`set.seed(1)` ensures we all generate the same "random" data:
```{r}
set.seed(1)
noise <- rnorm(n = 25, mean = 0, sd = 10)
# Add the noise to 10 + 5*x and re-draw plot
d$y <- 10 + 5*x + noise
plot(y ~ x, data = d)
```
Now y appears to be _associated_ with x, but not completely determined by x.
y is the combination of a fixed part and a random part:
1. fixed: `10 + 5*x`
2. random: `rnorm(n = 25, mean = 0, sd = 10)`
What if we were given this data and told to determine the process that generated it? In other words, _work backwards_ and fill in the blanks:
1. ___ + ___*x
2. rnorm(n = 25, mean = 0, sd = ____)
_That's one way to think of linear modeling/regression_. You have some numeric response (or dependent) variable, and you want to find the model (or the formula) that generated the data.
Traditional linear modeling/multiple regression assumes the following (among others):
1. the formula is a _weighted sum_ of predictors (eg, y = 10 + 5*x)
2. the noise is a random draw from a Normal distribution with mean = 0
3. the standard deviation of the Normal distribution is constant (eg, 10)
Linear modeling tries to recover the weights in the first assumption and the standard deviation in the third assumption.
Let's demonstrate. Below we attempt to recover the data generating process for our data. For this we use the `lm()` function. We have to specify the formula for the first assumption. The 2nd and 3rd assumptions are built into `lm()`.
The formula "y ~ x" means we think the model is "y = intercept + slope*x". (Unless we specify otherwise, this assumes we want to estimate an intercept.) This tells `lm()` to take our data and find the best intercept and slope. _Notice this is the correct model!_
When you use `lm()` you'll usually want to save the results to an object. Below I save it to "mod". Then we view the results of the model using `summary()`
```{r}
mod <- lm(y ~ x, data = d)
summary(mod)
```
The model returns the following estimates:
1. y = 11.135 + 5.042 * x
2. noise = rnorm(n = 25, mean = 0, sd = 9.7)
These are pretty close to the "true" values of 10, 5, and 10 we used to generate the data.
**In real life, we DO NOT KNOW the formula in part 1. The real data generation process will be far more complicated. The formula we propose will just be an approximation and may not be good.**
**In real life, we DO NOT KNOW if the Normality assumption or constant variance assumption of the noise is plausible.**
How can we evaluate our model formula?
We could use our model estimates to generate data and see if they look similar to our original data. Run the entire chunk at once, and run it more than once. The black points don't change but the red ones do. That looks pretty good! Our model-generated data appears similar to our observed data.
```{r}
# d$y <- 10 + 5*x + noise
d$y2 <- 11.135 + 5.042*d$x + rnorm(25, 0, 9.7)
plot(y ~ x, data = d) # original data
points(d$x, d$y2, col = "red") # simulated data
```
We can also compare _smooth density curves_ of the original and model-generated data. Smooth density curves are basically smooth versions of histograms. If we have a good model, data generated by our model should have a similar distribution to the original data. Run the entire chunk at once, and run it more than once.
```{r}
hist(d$y, freq = FALSE) # freq = FALSE means area of bars sums to 1
lines(density(d$y)) # original data
d$y2 <- 11.135 + 5.042*d$x + rnorm(25, 0, 9.7)
lines(density(d$y2), col = "red") # simulated data
```
This looks good as well. The distribution of our model-generated data is very similar to our observed data. You should do this more than once, say 50 times, to ensure the model consistently generates data similar to the observed data. We show one way to do that later in the workshop.
Since we think our model is "good", we might use it to make a prediction. For example, when x = 10 what's the expected value of y? Put another way, what's the _mean_ of y conditional on x = 10? We can do that with the `predict()` function. The `interval = "confidence"` argument says to return a 95% confidence interval (CI) for this mean.
```{r}
predict(mod, newdata = data.frame(x = 10), interval = "confidence")
```
The expected mean of y when x = 10 is about 61.6 with a 95% CI of (57.2, 65.9). The CI gives us some notion of how uncertain this expected mean is. In fact it would be better to report this as, "the expected mean of y when x = 10 is about 57 to 66."
We might also try to summarize the relationship between y and x by examining the coefficients (or weights) in the summary output. We can extract the coefficients from the summary with the `coef()` function:
```{r}
coef(summary(mod))
```
The x coefficient says that y increases by about 5 for every one-unit increase in x, give or take about 0.27. The standard error gives us some indication of the uncertainty in this estimate. We talk more about t values and p values below.
**TO SUMMARIZE: This is basic linear modeling:**
1. propose and fit a model
2. determine if the model is good and that assumptions are mostly met
3. use the model to explain relationships and/or make predictions
## CODE ALONG 1
Let's see what happens when we fit a bad model. Below we add a new column to data frame `d` named `z`, which is a random sample of numbers in the range -100 to 100. `runif()` samples numbers from a uniform distribution.
```{r}
set.seed(4)
d$z <- runif(25, min = -100, max = 100)
```
REMINDER: Add a new R code chunk by clicking the *Insert Chunk* button on the toolbar or by pressing *Ctrl+Alt+I* (Win/Linux) or *Cmd+Option+I* (Mac).
1. Model `y` as a function of `z` using `lm(y ~ z, data = d)` and save to an object called `mod2`. View the summary. What formula does the model return? What is the estimated standard deviation of the Normally distributed noise?
2. Use the model to simulate data and compare its smooth density curve to that of `d$y`. Run the chunk several times to see how the model-generated density curve varies.
Let's do some linear modeling with real data.
## Import data
Let's import the data we'll be using today. The data we'll work with is Albemarle County real estate data which was downloaded from the Office of Geographic Data Services. We'll use a _random sample_ of the data.
```{r}
URL <- 'https://raw.githubusercontent.com/clayford/dataviz_with_ggplot2/master/alb_homes.csv'
homes <- read.csv(file = URL)
# set categorical variables as factors
homes$hsdistrict <- factor(homes$hsdistrict)
homes$cooling <- factor(homes$cooling)
```
Let's look at the first few rows:
```{r}
head(homes)
```
Variable name definitions:
- *yearbuilt*: year house was built
- *finsqft*: size of house in finished square feet
- *cooling*: 'Central Air' versus 'No Central Air'
- *bedroom*: number of bedrooms
- *fullbath*: number of full bathrooms (toilet, sink and bath)
- *halfbath*: number of half bathrooms (toilet and sink only)
- *lotsize*: size of land on which home is located, in acres
- *totalvalue*: total assessed value of home and property
- *esdistrict*: the elementary school the home feeds into
- *msdistrict*: the middle school the home feeds into
- *hsdistrict*: the high school the home feeds into
- *censustract*: the census tract the home is located in
- *age*: age of the house in years as of 2018
- *condition*: assessed condition of home (Substandard, Poor, Fair, Average, Good, Excellent)
- *fp*: indicator if house has fireplace (0=no, 1=yes)
## Linear Modeling with Real Estate Data
Let's say we want to model the mean *total value* of a home as a function of various characteristics such as lot size, finished square feet, presence of central air, etc.
Let's see how total value is distributed using a histogram. Notice it's very skewed.
```{r}
hist(homes$totalvalue)
```
We can also create a smooth density plot, which is a smooth version of a histogram:
```{r}
plot(density(homes$totalvalue))
```
In order to model mean totalvalue of homes as a function of various characteristics, we need to propose a linear model. Unlike the previous example this is not simulated data for which we know the data generating process. How to propose a model? It helps to have some subject matter expertise.
Let's fit a linear model using `finsqft`, `bedroom`, and `lotsize`. The plus (+) sign means "include in the model".
```{r}
m1 <- lm(totalvalue ~ finsqft + bedroom + lotsize, data = homes)
summary(m1)
```
The `coef()` function extracts the coefficients (or weights):
```{r}
coef(m1)
```
This translates to:
```
totalvalue = -133328.2482 + 284.4613*finsqft + -13218.4091*bedroom +
             4268.7655*lotsize
```
Some naive interpretation:
- each additional finished square foot adds about $284 to price
- each additional bedroom drops the price by $13,218 (?)
- each additional lot size acre adds about $4268 to price
Each of these interpretations assumes _all other variables are held constant_! So adding a bedroom to a house, without increasing the lot size or finished square feet of the house, is estimated to drop the value of a home. Does this make sense?
Is this a "good" model? Let's _simulate data from the model_ and compare it to our observed data. A "good" model should generate data that looks similar to the original data.
We could do this by hand:
```{r}
sim_values <- -133328.2482 + 284.4613*homes$finsqft +
-13218.4091*homes$bedroom + 4268.7655*homes$lotsize +
rnorm(3025, mean = 0, sd = 227200)
```
An easier and faster way is to use the `simulate()` function which allows you to generate multiple samples. Here we generate 50 samples. Each sample will have the same number of observations as our original sample (n = 3025). Each sample value is generated using our observed values for `finsqft`, `bedroom`, and `lotsize`. The result is a data frame with 50 columns.
```{r}
sim1 <- simulate(m1, nsim = 50)
```
Now let's plot our simulated data with our observed data using smooth density plots. We use a for loop to add smooth density estimates of the 50 simulations. The syntax `sim1[[i]]` extracts column _i_ as a vector. (Run entire chunk at once.)
```{r}
plot(density(homes$totalvalue))
for(i in 1:50)lines(density(sim1[[i]]), col = "grey80")
```
(See end of notebook for how to create this plot using ggplot2 and for how to turn this into a function.)
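Since we'll repeat this observed-vs-simulated comparison for later models, here is one base R sketch of wrapping it in a function. (The function name `compare_sims` and its arguments are our own choice, not from the notebook; the ggplot2 version at the end differs.)

```r
# A sketch: wrap the observed-vs-simulated density comparison in a function.
compare_sims <- function(model, y, nsim = 50) {
  sims <- simulate(model, nsim = nsim)   # one column per simulated sample
  plot(density(y), main = "Observed (black) vs simulated (grey)")
  for (i in seq_len(nsim)) lines(density(sims[[i]]), col = "grey80")
}

# For example: compare_sims(m1, homes$totalvalue)
```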
This does not appear to be a good model. In fact some of our simulated values are negative!
Before we revise our model recall the main assumptions:
1) totalvalue can be modeled by a weighted sum:
   totalvalue = Intercept + b1*finsqft + b2*bedroom + b3*lotsize
2) noise/error is from Normal dist'n with mean 0
3) the SD of the Normal dist'n is constant
R provides some basic diagnostic plots to assess 2 and 3. Just call `plot()` on your model object.
```{r}
plot(m1)
```
How to interpret plots:
1. Residuals vs Fitted: for checking constant variance; should have a horizontal line with uniform and symmetric scatter of points; if not, evidence that SD is not constant
2. Normal Q-Q: for checking normality of residuals; points should lie close to diagonal line; if not, evidence that noise is not drawn from N(0, SD). This assumption is only really critical if you're planning to use your model to predict exact values (instead of means) and calculate a confidence interval.
3. Scale-Location: for checking constant variance; should have a horizontal line with uniform scatter of point; (similar to #1 but easier to detect trend in dispersion)
4. Residuals vs Leverage: for checking for influential observations; points outside the contour lines are influential observations. Leverage is the distance from the center of all predictors. An observation with high leverage has substantial influence on the fitted value.
By default the 3 "most extreme" points are labeled by row number. 2658 appears in all four plots. It's a really big expensive home.
```{r}
homes[2658,]
```
These plots reveal that our assumptions of normally distributed residuals and constant variance are highly suspect. Our model is just bad.
What can we do?
Non-constant variance can be evidence of a wrong model or a very skewed response (or a bit of both). Recall that our response is quite skewed:
```{r}
hist(homes$totalvalue)
```
When dealing with a response that is strictly positive and very skewed (like dollar amounts), it is common to transform the response to a different scale. A common transformation is a log transformation. When we log transform `totalvalue`, the distribution looks a little more symmetric, though it's important to note that symmetry of the response is not an assumption of linear modeling!
```{r}
hist(log(homes$totalvalue))
```
Let's try modeling log-transformed `totalvalue`.
```{r}
m2 <- lm(log(totalvalue) ~ finsqft + bedroom + lotsize, data = homes)
```
The diagnostic plots look better.
```{r}
plot(m2)
```
But is this a "good model"? Is our proposed model of weighted sums good? Let's simulate data and compare to observed data.
```{r}
sim2 <- simulate(m2, nsim = 50)
plot(density(log(homes$totalvalue)))
for(i in 1:50)lines(density(sim2[[i]]), lty = 2, col = "grey80")
```
This doesn't look too bad!
Let's say we're happy with this model. How do we interpret the coefficients? Since the response is log transformed, we interpret the coefficients as _multiplicative instead of additive_. Below we view the coefficients rounded to 4 decimal places.
```{r}
round(coef(m2), 4)
```
These are approximately proportional changes. To get percentages, multiply by 100.
```{r}
round(coef(m2), 4) * 100
```
Some naive interpretations:
- each additional finished square foot increases expected mean price by about 0.05%. Scaling up, each additional 100 finished square feet increases expected mean price by about 5%.
- each additional bedroom increases expected mean price by about 4.3%
- each additional acre of lot size increases expected mean price by about 0.47%
Remember, the interpretation assumes _all other variables are held constant_!
A slightly more precise estimation can be obtained by _exponentiating_ the coefficients. Below we exponentiate using the `exp()` function and then round to 4 decimal places.
```{r}
round(exp(coef(m2)), 4)
```
For example, each additional bedroom (assuming all else held constant) increases expected total price by about 4.4%. Multiplying by 1.0439 is equivalent to adding 4.39%.
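To see why exponentiating works, here's a quick self-contained check using the rounded bedroom coefficient reported above (0.0430 on the log scale):

```r
# A log-scale coefficient b implies the response is multiplied by exp(b)
# for each one-unit increase in the predictor (all else held constant).
b <- 0.0430            # bedroom coefficient (log scale, rounded, from above)
exp(b)                 # multiplicative factor: about 1.0439
(exp(b) - 1) * 100     # equivalent percent change: about 4.39%
```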
Let's review the summary output:
```{r}
summary(m2)
```
SUMMARY OVERVIEW
- Residuals section: quick assessment of residuals. Ideally 1Q/3Q and Min/Max will be roughly equivalent in absolute value.
- Coefficients: lists the estimated coefficients along with hypothesis tests for the null that each coefficient is 0. Est/SE = t-value.
- Residual standard error: estimate of the constant standard deviation of the normal dist'n of the residuals (noise)
- degrees of freedom: sample size - number of coefficients (3025 - 4)
- R-squared: proportion of variance explained
- F-statistic: overall test that all coefficients (except intercept) are 0.
All of the p-values refer to hypothesis tests that the coefficients are 0. Many statisticians and researchers prefer to look at confidence intervals.
```{r}
round(confint(m2) * 100, 4)
```
According to our model, each additional bedroom adds about 3% to 6% to the value of a home, assuming all else held constant.
## CODE ALONG 2
1. Insert a code chunk below and model `log(totalvalue)` as a function of `fullbath` and `finsqft`. Call your model `m3`.
2. Insert a code chunk below and check the diagnostic plots
3. How do we interpret the fullbath coefficient?
4. Insert a code chunk below and simulate data from the model and compare to the observed `totalvalue`. Does this look like a good model?
## Categorical predictors
Let's add `hsdistrict` to the model we just fit. Does being in a particular high school district affect total value of a home? Here's a quick tally:
```{r}
table(homes$hsdistrict)
```
These levels are not numbers, so how does R handle this in a linear model? It creates a _contrast_. This is a matrix of zeroes and ones. If you have K levels, you'll have K-1 columns. In this case we'll have two columns: one for Monticello HS and one for Western Albemarle HS. By default R takes whatever level comes first alphabetically and makes it the _baseline_ or _reference_ level.
Let's look at the default contrast, called a _treatment contrast_, using the `contrasts()` function. (This works because we set hsdistrict to a factor when we imported the data. We don't have to do this to add hsdistrict to our model! I'm just doing it to output the contrast "definition".)
```{r}
contrasts(homes$hsdistrict)
```
A model with `hsdistrict` will have two coefficients: `Monticello` and `Western Albemarle`
- a home in Albemarle HS district gets two zeroes
- a home in Monticello HS district gets a one in the Monticello column
- a home in Western Albemarle HS district gets a one in the Western Albemarle column
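We can watch this coding happen by peeking at a few rows of the model matrix R builds behind the scenes (a quick sketch; this assumes the `homes` data frame with `hsdistrict` as a factor, as set up above):

```r
# Each home gets a 1 in the column matching its district;
# Albemarle (the reference level) gets all zeroes.
head(model.matrix(~ hsdistrict, data = homes))
```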
Let's fit our new model.
```{r}
m4 <- lm(log(totalvalue) ~ fullbath + finsqft + hsdistrict, data = homes)
summary(m4)
```
The coefficients for Monticello and Western Albemarle are in relation to Albemarle HS.
```{r}
round(coef(m4) * 100, 4)
```
It appears that the value of a home in Western Albemarle will be about 10% higher than an equivalent home in Albemarle. Likewise it appears that the value of a home in the Monticello district will be about 7% less than an equivalent home in the Albemarle district.
## CODE ALONG 3
1. Insert a code chunk below and model `log(totalvalue)` as a function of `fullbath`, `finsqft` and `cooling`. Call your model `m5`.
2. What is the interpretation of cooling?
## Modeling Interactions
In our model above that included `hsdistrict` we assumed the effects were _additive_. For example, it didn't matter what high school district your home was in, the effect of `fullbath` or `finsqft` was the same. It also assumed the effect of each additional `fullbath` was the same regardless of how big the house was, and vice versa. This may be too simple.
Interactions allow the effects of variables to depend on other variables. Again subject matter knowledge helps with the proposal of interactions. As we'll see interactions make your model more flexible but harder to understand.
R makes it simple to include interactions in models. Just indicate an interaction between two variables by placing a colon (:) between them. Below we include 2-way interactions. (You can have 3-way and higher interactions but they're very difficult to interpret.)
```{r}
m6 <- lm(log(totalvalue) ~ fullbath + finsqft + hsdistrict +
fullbath:finsqft + fullbath:hsdistrict +
finsqft:hsdistrict, data = homes)
summary(m6)
```
Interpretation is now much more difficult. We cannot directly interpret the _main effects_ of `fullbath`, `finsqft` or `hsdistrict`. They interact. What's the effect of `finsqft`? It depends on `fullbath` and `hsdistrict`.
Are the interactions "significant" or necessary? We can use the `anova()` function to evaluate this question. This runs a series of _partial F-tests_. Each row below is a hypothesis test. The null is the model with this predictor is the same as the model without the predictor. The anova tests below use what are called _Type I_ sums of squares. This respects the order of the variables in the model. So...
- the first test compares a model with just an intercept to a model with an intercept and fullbath.
- the second test compares a model with an intercept and fullbath to a model with an intercept, fullbath and finsqft.
- And so on.
If the null is true, the F value should be close to 1.
```{r}
anova(m6)
```
The interaction `finsqft:hsdistrict` doesn't appear to contribute much to the model given that _everything else above it_ is in the model.
Just because an interaction is significant doesn't necessarily mean it's interesting or worthwhile. Nor can we infer anything about the nature of the interaction from the ANOVA table.
_Effect plots_ can help us visualize and make sense of models with interactions. Let's make one using the ggeffects package and talk about what it's showing.
```{r}
library(ggeffects)
plot(ggpredict(m6, terms = c("fullbath", "hsdistrict")))
# place fullbath on x-axis, group by hsdistrict
```
What's the effect of `fullbath`? It depends. It's more dramatic in Western Albemarle and Monticello. Of course a lot of the difference comes at extreme values of `fullbath`. The "ribbons" around the lines are 95% confidence intervals.
What exactly was plotted? We can see by calling `ggpredict` without `plot`
```{r}
ggpredict(m6, terms = c("fullbath", "hsdistrict"))
```
`ggpredict` used our model to make `totalvalue` predictions for various values of `fullbath` in the three school districts, holding finsqft at 1828 (the median of finsqft).
We can specify our own values if we like. For example, make an effect plot for 1 to 5 bathrooms and hold `finsqft` at 2000:
```{r}
plot(ggpredict(m6, terms = c("fullbath[1:5]", "hsdistrict"),
condition = c(finsqft = 2000)))
```
What about the effects of `finsqft` and `fullbath`? This is _an interaction of two numeric variables_. The second variable has to serve as a grouping variable when creating an effect plot. Below we set `fullbath` to take values 2 to 5 and `finsqft` to take values from 1000 to 4000 in steps of 500.
```{r}
plot(ggpredict(m6, terms = c("finsqft[1000:4000 by=500]", "fullbath[2:5]")))
```
The effect of `finsqft` seems to taper off the more fullbaths a house has. But there are few large homes with 2 full baths, and likewise, few small homes with 5 full baths. Even though the interaction is "significant" in the model, it's clearly a very small interaction and probably not important.
## CODE ALONG 4
1. Insert a code chunk below and model `log(totalvalue)` as a function of `fullbath`, `finsqft`, `cooling`, and the interaction of `finsqft` and `cooling`. Call your model `m7`. Is the interaction warranted?
2. Visualize the interaction using the `ggpredict` function. Perhaps use `[1000:4000 by=500]` to set the range of `finsqft` on the x-axis. How notable is this interaction?
## Nonlinear Effects
So far we have assumed that the relationship between a predictor and the response is _linear_ (eg, for a 1-unit change in a predictor, the response changes by a fixed amount). That assumption can sometimes be too simple and not realistic. Fortunately there are ways to fit non-linear effects in a linear model.
Here's a quick example of simulated non-linear data: a 2nd degree polynomial.
```{r}
x <- seq(from = -10, to = 10, length.out = 100)
set.seed(3)
y <- 1.2 + 2*x + 0.9*x^2 + rnorm(100, mean = 0, sd = 10)
nl_dat <- data.frame(y, x)
plot(y ~ x, nl_dat)
```
Clearly a straight-line model will not work well for this data; the relationship between x and y is not adequately summarized by a straight line.
If we wanted to try to "recover" the weights we used in simulating this data we could fit a polynomial model using the `poly()` function in the formula syntax:
```
## EXAMPLE CODE; NOT INTENDED TO BE RUN
nlm1 <- lm(y ~ poly(x, degree = 2, raw = TRUE), data = nl_dat)
```
However, the recommended approach to fitting non-linear effects is to use _natural splines_ instead of polynomials. Natural splines essentially allow us to fit a series of cubic polynomials connected at knots located in the range of our data.
The easiest option is to use the `ns()` function from the splines package, which comes installed with R. `ns` stands for "natural splines". The second argument is the degrees of freedom (`df`). It may help to think of `df` as the number of times the smooth line changes directions.
Frank Harrell states in his book _Regression Modeling Strategies_ that 3 to 5 `df` is almost always sufficient. His basic advice is to allocate more `df` to variables you think are more important.
Let's see how it works with our simulated data.
```{r}
library(splines)
nlm2 <- lm(y ~ ns(x, df = 2), data = nl_dat)
summary(nlm2)
```
The summary output is impossible to interpret. Let's visualize the fit with an effect plot.
```{r}
library(ggeffects)
plot(ggpredict(nlm2, terms = "x"), show_data = TRUE)
```
Let's return to the homes data and fit a non-linear effect for `finsqft` using a natural spline with 5 `df`. Below we also include `hsdistrict` and `lotsize` and allow `finsqft` and `hsdistrict` to interact.
```{r}
nlm3 <- lm(log(totalvalue) ~ ns(finsqft, 5) + lotsize + hsdistrict +
ns(finsqft, 5):hsdistrict,
data = homes)
```
The `anova` function allows us to assess the non-linear effect and the interaction. Some sort of interaction between finsqft and hsdistrict appears to be present.
```{r}
anova(nlm3)
```
The summary output is impossible to interpret.
```{r}
summary(nlm3)
```
Effect plots are our only hope for understanding this model. Below we plot predicted totalvalue over finsqft ranging from 1000 - 3000, grouped by hsdistrict.
```{r}
plot(ggpredict(nlm3, terms = c("finsqft[1000:3000 by=250]", "hsdistrict")))
```
The effect of `finsqft` on `totalvalue` seems more dramatic in Western Albemarle once you go beyond 1500 sq ft.
Does the model simulate data similar in distribution to the observed data?
```{r}
sim4 <- simulate(nlm3, nsim = 50)
plot(density(log(homes$totalvalue)))
for(i in 1:50)lines(density(sim4[[i]]), col = "grey80")
```
We should still check model assumptions.
```{r}
plot(nlm3)
```
Homes 12, 40, 963 and 1810 seem to stand out. Let's have a look.
```{r}
h <- c(12, 40, 963, 1810)
homes[h,c("totalvalue", "finsqft", "lotsize")]
```
Homes 12 and 40 are very low in totalvalue and the model badly overpredicts their values. Home 963 has a massive totalvalue with 0 acres of lotsize. Home 1810 is on 611 acres, which gives it high leverage on its fitted value.
```{r}
cbind(observed = homes$totalvalue[h], fitted = exp(fitted(nlm3)[h]))
```
## CODE ALONG 5
1. Insert a code chunk below and model `log(totalvalue)` as a function of `finsqft` with a natural spline of 5 `df`, `cooling`, and the interaction of `cooling` and `finsqft` (natural spline of 5 `df`). Call your model `nlm4`.
2. Use the `anova` function to check whether the interaction appears necessary. What do you think?
3. Create an effect plot of `finsqft` by `cooling`. Maybe try `[1000:5000 by=250]` for the range of values for `finsqft`.
## Wrap-up
This was meant to show you the basics of linear modeling in R. Hopefully you have a better grasp of how linear modeling works. Scroll down for a few more topics.
What we did today works for _independent, numeric outcomes_. We had one observation per home and our response was `totalvalue`, a number. Our models returned expected _mean_ total value given various predictors. This is a pretty simple design.
Things get more complicated when you have, say, binary responses or multiple measures on the same subject (or home). A non-exhaustive list of other types of statistical models includes:
- generalized linear models (for binary and count responses)
- multinomial logit models (for categorical responses)
- ordered logit models (for ordered categorical responses)
- mixed-effect or longitudinal linear models (for responses with multiple measures on the same subject or clusters of related measures)
- survival models (for responses that measure time to an event)
- time series models (for responses that exhibit, say, seasonal variation over time)
**References**
- Faraway, J. (2005). _Linear Models with R_. London: Chapman & Hall.
- Fox, J. (2002). _An R and S-Plus Companion to Applied Regression_. London: Sage.
- Harrell, F. (2015). _Regression Modeling Strategies_ (2nd ed.). New York: Springer.
- Kutner, M., et al. (2005). _Applied Linear Statistical Models_ (5th ed.). New York: McGraw-Hill.
- Maindonald, J., Braun, W.J. (2010). _Data Analysis and Graphics Using R_ (3rd ed.). Cambridge: Cambridge University Press.
## Bonus material/topics cut for time
## How does lm() work?
`lm()` estimates regression coefficients using ordinary least squares, or OLS. The formula for this using matrix algebra is expressed as follows:
$$\hat{\beta} = (X'X)^{-1}X'Y $$
where X is the model matrix (our predictors as they appear in the model) and Y is the dependent variable. The prime symbol `'` means transpose the matrix. The `-1` means take the inverse. R was developed to make calculations such as these quite easy.
Let's revisit model m2.
```{r}
formula(m2)
```
Here are the coefficients returned by `lm()`:
```{r}
round(coef(m2), 4)
```
Let's find these using the formula above. We can use `model.matrix()` to get X, `t()` to transpose, and `solve()` to take the inverse. Matrix multiplication is performed with `%*%`.
```{r}
X <- model.matrix(~ finsqft + bedroom + lotsize, data = homes)
Y <- log(homes$totalvalue)
B <- solve(t(X) %*% X) %*% t(X) %*% Y
round(B, 4)
```
While this is theoretically what `lm()` does, it actually uses more sophisticated methods for faster performance and protection against numeric instability. This page goes into more details if you're interested in learning more:
https://genomicsclass.github.io/book/pages/qr_and_regression.html
## ANOVA revisited
We can also use the `anova()` function to compare _nested models_. This means one model is a subset of another. For example, below we fit progressively more complicated models, building up to our model, `m2`: `log(totalvalue) ~ finsqft + bedroom + lotsize`
```{r}
mod_00 <- lm(log(totalvalue) ~ 1, data = homes) # intercept-only
mod_01 <- lm(log(totalvalue) ~ finsqft, data = homes)
mod_02 <- lm(log(totalvalue) ~ finsqft + bedroom, data = homes)
mod_03 <- lm(log(totalvalue) ~ finsqft + bedroom + lotsize, data = homes)
```
The `anova()` function allows us to compare these nested models. The null hypothesis is that two models are the same, in the sense they explain the same amount of variance in the response variable, `log(totalvalue)`. A low p-value provides evidence that the more complicated model with more variables is a better model. The usual way to use `anova()` to compare models is to list the smaller models first. Below are three tests:
1. `mod_01` vs `mod_00`
2. `mod_02` vs `mod_01`
3. `mod_03` vs `mod_02`
The end result is that `mod_03`, the model with all three predictors, is superior to the other three.
```{r}
anova(mod_00, mod_01, mod_02, mod_03)
```
Notice this is identical to calling `anova()` on the full model.
```{r}
anova(m2)
```
Another approach uses Type II sums of squares, where each variable is tested assuming _all other variables are in the model_. One way to perform this test is with the `drop1()` function. The name comes from the fact that we drop one variable at a time. The null hypothesis is that dropping the variable from the model has no effect. A low p-value provides evidence against this hypothesis. Below are three tests:
1. Full model vs Full model without finsqft
2. Full model vs Full model without bedroom
3. Full model vs Full model without lotsize
Each test is soundly rejected. The full model is much better with all three predictors.
```{r}
drop1(m2, test = "F")
```
This test is also implemented in the `Anova()` function in the car package.
```{r}
# install.packages("car")
library(car)
Anova(m2)
```
## AIC and BIC
AIC (Akaike Information Criterion) is a statistic designed to help us choose a model with the best predictive power among a group of models. The value of AIC doesn't really have any interpretation. It's meant to be compared to other AIC values. The lower the AIC the better. We can use the `AIC()` function in R to compare multiple models. For example, `mod_03` seems preferable to `mod_02` because the AIC value is so much smaller.
```{r}
AIC(mod_02, mod_03)
```
AIC is the log-likelihood of the model multiplied by -2 with 2 x df added to it. The 2 x df part is a _penalty_. "df" is short for "degrees of freedom" and is the number of parameters estimated in the model. This includes the residual standard error. For example, `mod_03` has 5 degrees of freedom because there are 4 model coefficients and the residual standard error. We call this part a "penalty" because AIC can get bigger with more coefficients.
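As a quick sanity check on that degrees-of-freedom count (a small aside, not in the original workshop), the object returned by `logLik()` stores the df that AIC uses:

```{r}
# The df attribute should be 5: four coefficients
# plus the residual standard error.
attr(logLik(mod_03), "df")
length(coef(mod_03)) + 1
```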
Log-likelihood is the log-transformed likelihood. _Likelihood_ is the joint probability of the observed data as a function of the parameters of the chosen statistical model. Imagine turning the coefficients in the model like dials on a machine and trying to find the maximum likelihood. In other words, which combination of coefficients makes the data we observed most probable? R does not estimate linear model coefficients using maximum likelihood, but we can calculate the log-likelihood after the fact using the `logLik()` function:
```{r}
logLik(mod_03)
```
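To demystify that number, here's a sketch of computing the log-likelihood by hand. For a Gaussian linear model it is the sum of log normal densities of the residuals, using the maximum-likelihood estimate of the residual standard deviation (which divides by n, not n - p):

```{r}
# Sketch: reproduce logLik(mod_03) from the residuals.
r <- residuals(mod_03)
n <- length(r)
sigma_ml <- sqrt(sum(r^2) / n)   # ML estimate of sigma
sum(dnorm(r, mean = 0, sd = sigma_ml, log = TRUE))
```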
We can then calculate AIC "by-hand" as follows.
```{r}
rbind("mod_02" = -2*logLik(mod_02) + 2*4,
"mod_03" = -2*logLik(mod_03) + 2*5)
```
The AIC is inclined to choose overly complex models, so some researchers prefer BIC (Bayesian Information Criterion), which places a bigger penalty on the number of predictors. Again use the `BIC()` function in the same way.
```{r}
BIC(mod_02, mod_03)
```
The BIC is calculated the same as the AIC but with a different penalty, `log(n)`, where n is the number of observations. Again we can calculate "by hand":
```{r}
rbind("mod_02" = -2*logLik(mod_02) + log(nrow(homes))*4,
"mod_03" = -2*logLik(mod_03) + log(nrow(homes))*5)
```
Of course, you don't have to choose one. You can use both AIC and BIC and report both. They will often choose the same models.
## Collinearity
When predictors are highly correlated (ie, have strong linear relationships), the precision of coefficient estimates can decline. This phenomenon is often referred to as _collinearity_ or _multicollinearity_. Let's demonstrate with a toy example. First we generate two variables, x1 and x2, that are highly correlated:
```{r}
x1 <- seq(1,10,length = 100)
set.seed(123)
x2 <- 1 + 2*x1 + rnorm(100, sd = 0.2)
cor(x1, x2) # calculate correlation; perfect correlation = 1
```
Now we generate a new variable, `y`, using `x1` and `x2` along with some noise, and then fit a linear model to try to recover the true coefficients of 2 and 3. Notice in the pairwise scatterplot that `x1` and `x2` both seem associated with `y` in the same way.
```{r}
set.seed(321)
y <- 0.5 + 2*x1 + 3*x2 + rnorm(100, sd = 10)
d_collinear <- data.frame(y, x1, x2)
pairs(d_collinear)
```
When we fit the model, our estimated coefficients are way off and have enormous standard errors. The "true" values are 2 and 3. The model estimates are about 12 and -2! Recall we interpret the `x1` coefficient as its effect on `y` holding `x2` constant. But since `x1` and `x2` are highly correlated, it's all but impossible to estimate the effect of `x1` while holding `x2` constant.
```{r}
mod_colinear <- lm(y ~ x1 + x2)
summary(mod_colinear)
```
The most common way to check for and quantify collinearity after fitting a model is calculating _variance inflation factors_ (VIF). The details of the calculation can be found with a web search, but the basic idea is that if one of the raw VIFs is greater than 10, then we may have evidence that collinearity is influencing coefficient estimates. Fortunately this is easy to do using the `vif()` function in the car package. The VIFs are sky high for our contrived predictors!
```{r}
library(car)
vif(mod_colinear)
```
The square root of the VIF can be interpreted as how much larger a coefficient's standard error is than it would be if the predictors were uncorrelated.
```{r}
vif(mod_colinear) |> sqrt()
```
This says the standard error for both variables is about 29 times larger than it would have been without collinearity. The best solution in this case would be to simply drop one of the variables. It appears `x1` is almost completely determined by `x2` and vice versa. Knowing one means you know the other. In the extreme case, when two variables are perfectly correlated, the model fitting procedure will return NA for one of the variables, as demonstrated below. Notice the message: "(1 not defined because of singularities)"
```{r}
x2 <- 2*x1 # perfect correlation
summary(lm(y ~ x1 + x2))
```
Let's check our `m2` model where we modeled `log(totalvalue)` as a function of finsqft, bedroom and lotsize:
```{r}
vif(m2)
```
These VIFs look very good. We might suspect that finsqft and number of bedrooms could be highly correlated, but the VIF checks out.
One approach to addressing collinearity concerns is to use a data reduction technique such as principal components analysis (PCA). When it works, this method basically takes several variables and reduces them to one or two summary scores. This may be preferable to arbitrarily dropping variables.
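Here's a hedged sketch of that idea using the toy data from above: `prcomp()` combines `x1` and `x2` into principal components, and regressing `y` on the first component sidesteps the collinearity (at the cost of losing the interpretation of the original variables).

```{r}
# Sketch: reduce the two correlated predictors to one summary score.
pc <- prcomp(d_collinear[, c("x1", "x2")], scale. = TRUE)
summary(pc)  # the first component captures nearly all the variance
mod_pc <- lm(y ~ pc$x[, 1], data = d_collinear)
summary(mod_pc)
```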
For models that are intended for making predictions, collinearity is not much of a concern.
## Transformation guidelines
Above we log-transformed `totalvalue` to help meet modeling assumptions. Recall without the log transformation our residuals were large and skewed, which is a fancy way of saying our model was a bad fit. A good fitting model should have relatively small residuals with even scatter.
A log-transformation made sense for two reasons:
1. `totalvalue` was strictly positive, had a large upper bound, and covered several orders of magnitude.
2. changes in `totalvalue` according to the predictors were relative (multiplicative) and not absolute (additive), which corresponds to the natural log scale.
It's important to note that not all skewed variables need to be transformed when it comes to linear modeling. The distributional assumptions are on the residuals. However there may be times when you need to investigate transformations other than the log when it comes to modeling. These almost always take the form of a _power transformation_ (ie, raising your variable to a power using an exponent). Powers are usually symbolized with a Greek lambda (λ). A power of 0 translates to a log transformation.
Say your variable is `y`. A basic palette of possible power transformations include:
- λ = -1: `1/y`
- λ = -0.5: `1/sqrt(y)`
- λ = 0: `log(y)`
- λ = 0.5: `sqrt(y)`
- λ = 1: `y` (no transformation)
- λ = 2: `y^2`
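As a rough numeric companion to this palette (a sketch using a simple moment-based skewness statistic, not any particular package), we can scan how symmetric `totalvalue` becomes under each power. Values near 0 indicate symmetry; the sign flips for negative powers because `1/y` reverses the ordering, so it's the magnitude that matters.

```{r}
# Sketch: moment-based skewness of totalvalue under each power.
skew <- function(x) mean((x - mean(x))^3) / sd(x)^3
y <- homes$totalvalue
sapply(c("-1" = -1, "-0.5" = -0.5, "0" = 0, "0.5" = 0.5, "1" = 1, "2" = 2),
       function(p) if (p == 0) skew(log(y)) else skew(y^p))
```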
The car function `symbox()` creates a visual assessment of which power makes the distribution reasonably symmetric. Below when we use it with `totalvalue` we see that the log transformation (λ = 0) does the best job of making the distribution more symmetric.
```{r}
symbox(homes$totalvalue)
```
We can also use `symbox()` on a model object. For example, this produces essentially the same plot using the residuals of the model instead of `totalvalue`. Simply pipe the model into `symbox()`.
```{r}
lm(totalvalue ~ finsqft + bedroom + lotsize, data = homes) |>
symbox()
```
A statistical "search" for the "best" power transformation can be performed with the `powerTransform()` function, also in the car package. The usual practice is to convert the result to the closest simple power listed above. For example, we can pipe the model result into `powerTransform()` and see the "best" transformation is about 0.16.
```{r}
lm(totalvalue ~ finsqft + bedroom + lotsize, data = homes) |>
powerTransform()
```
0.16 is close to 0, so it makes sense to proceed with a log transformation. That greatly simplifies interpretation.
## ggplot2 code for creating plot of simulated data
Here's how to make the simulation plot using ggplot2. I find base R graphics easier for this type of plot.
```{r}
sim2 <- simulate(m2, nsim = 50)
library(ggplot2)
library(tidyr)
sim2 %>%
pivot_longer(everything(),
names_to = "simulation",
values_to = "totalvalue") %>%
ggplot() +
geom_density(mapping = aes(x = totalvalue, group = simulation),
color = "grey80") +
geom_density(mapping = aes(x = log(totalvalue)),
data = homes)
```
## How to make a function for simulating and plotting data for a linear model
**base R**
We basically copy and paste the original code in between the curly braces of the `function()` function. We call the function `plot_sims` but you can name it whatever you like. The changes are to the model name and number of simulations. We generalize those with arguments: `mod` and `nsim`. When we fit a model with `lm`, the data is stored with the model object by default and can be accessed as `mod$model`. The first column contains the model response, so we can access it with `[,1]` and use it to draw the density of the observed data.
```{r}
plot_sims <- function(mod, nsim){
sim <- simulate(mod, nsim = nsim)
plot(density(mod$model[,1]))
for(i in 1:nsim)lines(density(sim[[i]]), col = "grey80")
}
# try the function
plot_sims(mod = m2, nsim = 20)
```
**ggplot2**
Same idea as the previous function: we copy the original code in between the curly braces of the `function()` function, except we now preface functions with their package name (eg, `ggplot2::`) so we can use the function without having the packages loaded. We also do away with the pipe `%>%` since it comes from yet another package (magrittr), though it is accessible when tidyr is loaded. We extract the name of the response with `resp <- names(mod$model)[1]` and then use the `.data` pronoun (`.data[[resp]]`) from the rlang package to refer to it inside `aes()`.
```{r}
plot_sims <- function(mod, nsim){
sim1 <- simulate(mod, nsim = nsim)
resp <- names(mod$model)[1]
sim1 <- tidyr::pivot_longer(sim1, everything(),
names_to = "simulation",
values_to = "totalvalue")
ggplot2::ggplot() +
ggplot2::geom_density(mapping = ggplot2::aes(x = totalvalue,
group = simulation),
color = "grey80",
data = sim1) +
ggplot2::geom_density(mapping = ggplot2::aes(x = .data[[resp]]),
data = mod$model)
}
plot_sims(m2, nsim = 65)
```
**bayesplot package**
The bayesplot package has a function for this called `ppc_dens_overlay`, but it's a little weird to use for linear models because it was designed to be used with Bayesian models. However it's not that hard to deploy. The first argument is simply the observed data. The second argument expects the simulations per row as opposed to per column. So we need to transpose, which we can do with the `t()` function. The result is a clean plot with the y-axis unlabeled since it really isn't needed and a legend to distinguish between observed and simulated (or replicated) data.
```{r}
# install.packages("bayesplot")
library(bayesplot)
ppc_dens_overlay(log(homes$totalvalue), t(sim2))
```
**performance package**
The easystats collection of packages "aims to provide a unifying and consistent framework to tame, discipline, and harness the scary R statistics and their pesky models." https://easystats.github.io/easystats/ One of the packages is called {performance}, which provides the `check_predictions()` function for simulating data from a model and comparing it to the distribution of the original data. Two potential drawbacks at the time of this writing: