# Determine functional form
```{r message=FALSE, warning=FALSE}
library(tidyverse)
```
In Chapter 1, we mentioned that a first step in building a model is to explore your data and choose an appropriate function family. This chapter focuses on that step. Using `diamonds`, we'll explore the relationship between `price` and the four Cs: `carat`, `clarity`, `color`, and `cut`. By the end of the chapter, we'll understand which function family or families might approximate this data well.
## Formulas
In Chapter 1, we mentioned that the modeling function `lm()` requires two arguments: a function family and data. With `lm()`, as with many modeling functions, you specify the function family using a _formula_.
Our model from Chapter 1 belonged to the family with the form
`y = a_0 + a_1 * x_1`
where `y` is the response, `x_1` is a continuous variable, and `a_0` and `a_1` are parameters to be determined. Recall that `a_0` is the intercept and `a_1` is the slope.
This function family is specified with the formula
`y ~ x_1` .
You place the name of the response variable to the left of the `~` and the predictor variable(s) to the right of the `~`. The general form for the formula of a linear function of `n` variables is
`y ~ x_1 + x_2 + ... + x_n` .
Formulas are their own class of R object, and you specify them unquoted.
```{r}
y ~ x_1
class(y ~ x_1)
```
You don't need to explicitly specify an intercept term, as one is included by default. If you don't want an intercept term, add a `0` to your formula.
`y ~ 0 + x_1`
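If you'd like to see the difference, `model.matrix()` (from base R) shows the columns R builds from a formula. Here's a minimal sketch on a made-up tibble:
```{r}
d <- tibble(y = c(1, 3, 5), x_1 = c(2, 4, 6))

# The default formula includes an (Intercept) column of 1s
model.matrix(y ~ x_1, data = d)

# Adding 0 drops the intercept column
model.matrix(y ~ 0 + x_1, data = d)
```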
We'll now see how to explore and understand the suitability of modeling your data with linear functions.
## Continuous predictors
Let's start with continuous predictors, as they should be familiar after reading Chapter 1.
In Chapter 2, we modified `diamonds` quite a bit. Here are all of those modifications in one place:
```{r}
df <-
  diamonds %>%
  filter(x > 0, y > 0, z > 0) %>%                      # drop rows with impossible zero dimensions
  filter(y < 20, z < 10) %>%                           # drop extreme outliers in width and depth
  filter(carat <= quantile(.$carat, probs = 0.99)) %>% # keep diamonds at or below the 99th percentile of carat
  mutate(color = fct_rev(color))                       # reverse the levels so color runs from worst (J) to best (D)
```
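As a quick sanity check, we can see how many rows these filters removed:
```{r}
nrow(diamonds) - nrow(df)
```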
Now, we can visualize the relationship between `carat` and `price` and look for a linear relationship.
```{r}
df %>%
ggplot(aes(carat, price)) +
geom_point(alpha = 0.01)
```
The larger diamonds cover a wide range of prices, so we'll add a smooth line to show the central tendency.
```{r, message=FALSE, cache=TRUE}
df %>%
ggplot(aes(carat, price)) +
geom_point(alpha = 0.01) +
geom_smooth()
```
Notice how `price`, especially for the larger diamonds, varies substantially among diamonds of the same size. This variation suggests that `carat` is not the only variable that influences `price`. From the smooth line, we can also see that the relationship between `carat` and `price` is positive, but not linear. Luckily, we can still use a linear model by first transforming the data. We'll talk more about transformations next.
### Transformations
_Power law_ relationships take the form
`y = c * x^d`
where `c` and `d` are both constants. Taking the log of both sides gives us
`log(y) = log(c) + d * log(x)`
which might look familiar. Power laws are linear in terms of `log(y)` and `log(x)`, which is even clearer if we replace `log(c)` with `a_0` and `d` with `a_1`.
`log(y) = a_0 + a_1 * log(x)`
Phenomena that follow power laws are common. You can determine if your data approximately follow a power law by plotting them with log scales on both the x- and y-axes, then visually checking if the relationship is linear.
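To make this concrete, here's a quick simulation with made-up constants `c = 2` and `d = 1.5`:
```{r, message=FALSE}
set.seed(123)

# Power law with multiplicative noise: y = 2 * x^1.5
sim <- tibble(
  x = runif(500, min = 1, max = 10),
  y = 2 * x^1.5 * exp(rnorm(500, sd = 0.1))
)

sim %>%
  ggplot(aes(x, y)) +
  geom_point(alpha = 0.2) +
  geom_smooth() +
  scale_x_log10() +
  scale_y_log10()
```
On log scales, the points fall along a straight line with slope `d = 1.5`. Now let's check `carat` and `price`: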
```{r, message=FALSE, cache=TRUE}
df %>%
ggplot(aes(carat, price)) +
geom_point(alpha = 0.01) +
geom_smooth() +
scale_x_log10() +
scale_y_log10()
```
The smooth line is now approximately linear, suggesting that the functional form `log(price) = a_0 + a_1 * log(carat)` will approximate the relationship between `price` and `carat` well.
To specify this function family in a formula, we can apply the `log()` transformations in the formula:
`log(price) ~ log(carat)` .
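Applying transformations inside the formula is equivalent to creating transformed columns first and referring to those. Here's a sketch of the second approach, with hypothetical column names `log_price` and `log_carat`:
```{r}
df_log <-
  df %>%
  mutate(log_price = log(price), log_carat = log(carat))

# The corresponding formula would then be
log_price ~ log_carat
```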
Log-log transformations are common, but you can apply any transformation (to your response, predictor, or both) that makes your data linear.
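For example, exponential relationships of the form `y = c * exp(d * x)` become linear if you log-transform only the response. Here's a minimal sketch with simulated data and made-up constants:
```{r, message=FALSE}
set.seed(456)

# Exponential growth with multiplicative noise: y = 2 * exp(0.8 * x)
sim_exp <- tibble(
  x = runif(500, min = 0, max = 5),
  y = 2 * exp(0.8 * x) * exp(rnorm(500, sd = 0.1))
)

sim_exp %>%
  ggplot(aes(x, y)) +
  geom_point(alpha = 0.2) +
  geom_smooth() +
  scale_y_log10()
```
The matching formula would be `log(y) ~ x`.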
You might wonder why you should bother transforming your data instead of just fitting a nonlinear model. Linear models are generally simpler, easier to interpret, and often produce better approximations.
### Visual checks
`carat` is the only continuous variable of the four Cs, but other datasets will have multiple possible continuous predictors. Here are two features to check for when considering adding a continuous predictor to your model.
First, as you've already seen, the relationship between the predictor and the response should be linear. If necessary, transform the data so that the relationship is linear. If the relationship is nonlinear and you can't find a transformation that makes it linear, don't add the predictor to your linear model.
Second, the predictor should influence the response. In our visualization of `price` and `carat`, we noticed that `price` increased as `carat` increased. If the predictor does not influence the response, you might see something like this:
```{r echo=FALSE, cache=TRUE}
set.seed(234)
n <- 500
tibble(
x = runif(n, min = 0, max = 5),
y = rnorm(n)
) %>%
ggplot(aes(x, y)) +
geom_point() +
geom_smooth(method = "loess", formula = y ~ x)
```
The smooth line is horizontal, which tells you that `y` does not change with `x`.
The `diamonds` data is relatively simple, and we've made it even simpler by restricting attention to the four Cs. When you have more variables, you'll need to conduct more preprocessing and pay more attention to which variables to include and which to exclude. If you'd like to learn more, a good resource is Kuhn and Johnson, [Feature Engineering and Selection](https://bookdown.org/max/FES/), particularly the [Preprocessing](https://bookdown.org/max/FES/stroke-preprocessing.html) section.
## Discrete predictors
Earlier, we noticed that `price` varies substantially for diamonds of a given `carat`. In this section, we'll investigate how the three discrete Cs, `clarity`, `color`, and `cut`, influence price.
### Functional form
Discrete predictors work differently from continuous predictors. When a linear model includes a discrete predictor, a different constant is added for each value of the variable.
We'll visualize the effect of `cut` by adding a smooth line for each value to our plot of `price` vs. `carat`. We'll leave off the individual points to focus on the trends.^[Note that we used `method = "loess"` instead of the default `method = "gam"`. `"loess"` can take more time and memory for large amounts of data, but often gives better results. For more information, see https://dcl-data-vis.stanford.edu/continuous-continuous.html#smoothing .]
```{r, cache=TRUE}
plot_smooth <- function(var, data = df) {
  data %>%
    ggplot(aes(carat, price, color = {{ var }})) +  # {{ }} lets us pass an unquoted column name
    geom_smooth(method = "loess", formula = y ~ x) +
    scale_x_log10() +
    scale_y_log10() +
    scale_color_discrete(guide = guide_legend(reverse = TRUE))  # list the highest level first in the legend
}
plot_smooth(cut)
```
The smooth lines look linear, and all appear to have similar slopes but different intercepts, which increase as `cut` quality increases. This suggests that the following functional form, in which the intercept of the line depends on `cut`, would approximate our data well.
`log(price) = a_0 + a_1 * log(carat) + f_cut(cut)`
We need `f_cut()` to return a different constant for each value of `cut`. We can write `f_cut()` in R using `case_when()`.
```{r}
f_cut <- function(cut) {
  # The a_cut_* constants are parameters to be determined when fitting the model
  case_when(
cut == "Fair" ~ 0,
cut == "Good" ~ a_cut_Good,
cut == "Very Good" ~ a_cut_Very_Good,
cut == "Premium" ~ a_cut_Premium,
cut == "Ideal" ~ a_cut_Ideal
)
}
```
Note that the constant for Fair is 0. Fair is the worst quality cut and the lowest factor level of `cut`. By convention, the lowest factor level's constant is absorbed into `a_0`, so we don't need an `a_cut_Fair`. The rest of the constants are then defined relative to this baseline level.
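To see how `f_cut()` behaves, we can plug in made-up placeholder values for the constants (the real values would come from fitting the model):
```{r}
# Made-up placeholder values, purely for illustration
a_cut_Good <- 0.1
a_cut_Very_Good <- 0.2
a_cut_Premium <- 0.25
a_cut_Ideal <- 0.3

f_cut(c("Fair", "Good", "Ideal"))
```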
In formula syntax, this functional form is
`log(price) ~ log(carat) + cut`
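If you'd like to see this coding explicitly, `model.matrix()` again shows the columns R builds from a formula. Here's a minimal sketch with a made-up unordered factor (R codes an ordered factor like `cut` with a different default scheme, but the one-constant-per-level idea is the same):
```{r}
d_fct <- tibble(g = factor(c("a", "b", "c", "a")))

# The lowest level, "a", gets no column of its own; it's absorbed into the intercept
model.matrix(~ g, data = d_fct)
```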
In the next section, we'll visualize `clarity` and `color` and discuss visualizing discrete predictors in more detail.
### Visual checks
Now that you understand the form of `f_cut()`, we can be more precise about what to look for when visualizing a potential discrete predictor. Assume you created a visualization like we did above, with a continuous predictor on the x-axis, the response on the y-axis, and a smooth line for each value of the potential discrete predictor. Here's a set of characteristics to check for:
* The smooth lines should all be approximately linear.
For a linear function to approximate the relationship well, each of the smooth lines needs to be linear.
* The smooth lines should all be approximately parallel to each other.
In our example, the slopes of the lines depend only on our continuous predictor: `log(carat)`. Changing `cut` does not change the slope. Non-parallel smooth lines suggest that slope depends on both the continuous predictor and the discrete predictor. This is called an _interaction_. Interactions aren't part of the set of function families we're looking at in this chapter, so we won't talk more about them here.
* The smooth lines should have different intercepts.
Smooth lines that are stacked on top of each other and don't have different intercepts suggest that the discrete variable doesn't influence the response.
`cut` met all three of our criteria pretty well, although the Premium and Very Good smooth lines are very close together.
```{r, cache=TRUE}
plot_smooth(cut)
```
This overlap suggests that there isn't much of a price difference between Very Good and Premium diamonds.
Next, let's look at `clarity`.
```{r, cache=TRUE}
plot_smooth(clarity)
```
`clarity` also mostly meets our criteria, although two of the levels (VVS1 and VVS2) are again very close together.
The slope of the line for I1 also looks a bit different from the rest. However, only `r round(100 * mean(df$clarity == "I1"), 2)`% of diamonds are I1, so it will be okay if our linear model doesn't approximate them as well as other diamonds.
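You can check how rare each level is with a quick count:
```{r}
df %>%
  count(clarity) %>%
  mutate(prop = n / sum(n))
```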
Finally, we'll look at `color`.
```{r, cache=TRUE}
plot_smooth(color)
```
`color` also meets all three of our criteria pretty well. Again, some of the smooth lines (D, E, F, and G) are almost entirely on top of each other, suggesting that increasing quality above G does not influence price very much. We could try combining D, E, F, and G into one level.
```{r, cache=TRUE}
df %>%
mutate(color_combined = fct_collapse(color, DEFG = c("D", "E", "F", "G"))) %>%
plot_smooth(color_combined, data = .)
```
The smooth lines are now more clearly separated. Later, when we fit our models, we'll compare models that combine nearby factor levels to models that use all the original values.
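The same trick could be applied to `cut`, whose Very Good and Premium lines overlapped. Here's a sketch (the combined level name is our own choice):
```{r, cache=TRUE}
df %>%
  mutate(
    cut_combined = fct_collapse(cut, "Very Good/Premium" = c("Very Good", "Premium"))
  ) %>%
  plot_smooth(cut_combined, data = .)
```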
The separation between the smooth lines varies across `clarity`, `cut`, and `color`. The larger the separation between smooth lines, the more that variable influences `price`. So `clarity` likely influences `price` the most, followed by `color`, then `cut`.
## Summary
From our EDA, we now know that:
* We need to log-transform both `price` and `carat` to make the relationship linear.
* `clarity`, `color`, and `cut` all influence `price`, and we can add any of them to our model.
* `clarity` is the most important of the three discrete Cs, followed by `color`, then `cut`.
* Increasing the quality of the color beyond G doesn't influence price very much. We could experiment with combining the top levels of `color` into one level before fitting our model.
Even though `carat`, `clarity`, `color`, and `cut` all influence price, we shouldn't necessarily add them all to our model. In the next chapter, we'll explain how to fit a model for a given function family. Then, in the following chapter, we'll discuss the trade-offs involved in adding additional predictors, and we'll explain how to choose between different models.