---
jupyter:
  jupytext:
    cell_metadata_filter: -all
    formats: Rmd,ipynb
    main_language: python
    text_representation:
      extension: .Rmd
      format_name: rmarkdown
      format_version: '1.2'
      jupytext_version: 1.14.7
---
# Chapter 7
# Lab: Non-Linear Modeling
In this lab, we demonstrate some of the nonlinear models discussed in
this chapter. We use the `Wage` data as a running example, and show that many of the complex non-linear fitting procedures discussed can easily be implemented in Python.
As usual, we start with some of our standard imports.
```{python}
import numpy as np
import pandas as pd
from matplotlib.pyplot import subplots
import statsmodels.api as sm
from ISLP import load_data
from ISLP.models import (summarize,
                         poly,
                         ModelSpec as MS)
from statsmodels.stats.anova import anova_lm
```
We again collect the new imports
needed for this lab. Many of these are developed specifically for the
`ISLP` package.
```{python}
from pygam import (s as s_gam,
                   l as l_gam,
                   f as f_gam,
                   LinearGAM,
                   LogisticGAM)

from ISLP.transforms import (BSpline,
                             NaturalSpline)
from ISLP.models import bs, ns
from ISLP.pygam import (approx_lam,
                        degrees_of_freedom,
                        plot as plot_gam,
                        anova as anova_gam)
```
## Polynomial Regression and Step Functions
We start by demonstrating how Figure 7.1 can be reproduced.
Let's begin by loading the data.
```{python}
Wage = load_data('Wage')
y = Wage['wage']
age = Wage['age']
```
Throughout most of this lab, our response is `Wage['wage']`, which
we have stored as `y` above.
As in Section 3.6.6, we will use the `poly()` function to create a model matrix
that will fit a $4$th degree polynomial in `age`.
```{python}
poly_age = MS([poly('age', degree=4)]).fit(Wage)
M = sm.OLS(y, poly_age.transform(Wage)).fit()
summarize(M)
```
This polynomial is constructed using the function `poly()`,
which creates
a special *transformer* `Poly()` (using `sklearn` terminology
for feature transformations such as `PCA()` seen in Section 6.5.3) that
allows for easy evaluation of the polynomial at new data points. Here `poly()` is referred to as a *helper* function, and sets up the transformation; `Poly()` is the actual workhorse that computes the transformation. See also
the
discussion of transformations on
page 129.
In the code above, the first line executes the `fit()` method
using the dataframe
`Wage`. This recomputes and stores as attributes any parameters needed by `Poly()`
on the training data, and these will be used on all subsequent
evaluations of the `transform()` method. For example, it is used
on the second line, as well as in the plotting function developed below.
We now create a grid of values for `age` at which we want
predictions.
```{python}
age_grid = np.linspace(age.min(),
                       age.max(),
                       100)
age_df = pd.DataFrame({'age': age_grid})
```
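As a quick check (not part of the original lab), the fitted `poly_age` specification can be applied directly to this new data frame; `transform()` reuses the parameters stored by `fit()` on the training data.
```{python}
# Evaluate the degree-4 basis on the new grid using the stored parameters.
poly_age.transform(age_df)[:5]
```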
Finally, we wish to plot the data and add the fit from the fourth-degree polynomial. As we will make
several similar plots below, we first write a function
to create all the ingredients and produce the plot. Our function
takes in a model specification (here a basis specified by a
transform), as well as a grid of `age` values. The function
produces a fitted curve as well as 95% confidence bands. By using
an argument for `basis` we can produce and plot the results with several different
transforms, such as the splines we will see shortly.
```{python}
def plot_wage_fit(age_df,
                  basis,
                  title):
    X = basis.transform(Wage)
    Xnew = basis.transform(age_df)
    M = sm.OLS(y, X).fit()
    preds = M.get_prediction(Xnew)
    bands = preds.conf_int(alpha=0.05)
    fig, ax = subplots(figsize=(8,8))
    ax.scatter(age,
               y,
               facecolor='gray',
               alpha=0.5)
    for val, ls in zip([preds.predicted_mean,
                        bands[:,0],
                        bands[:,1]],
                       ['b','r--','r--']):
        ax.plot(age_df.values, val, ls, linewidth=3)
    ax.set_title(title, fontsize=20)
    ax.set_xlabel('Age', fontsize=20)
    ax.set_ylabel('Wage', fontsize=20);
    return ax
```
We include an argument `alpha` to `ax.scatter()`
to add some transparency to the points. This provides a visual indication
of density. Notice the use of the `zip()` function in the
`for` loop above (see Section 2.3.8).
We have three lines to plot, each with a different color and line
type. Here `zip()` conveniently bundles them into a single iterator
for the loop. {In `Python` speak, an "iterator" is an object whose values can be retrieved one at a time, as in a loop.}
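As a tiny illustration (not part of the lab), this is how `zip()` pairs the three curves with their line styles:
```{python}
# Illustration only: zip() pairs corresponding elements of two sequences.
list(zip(['mean', 'lower', 'upper'], ['b', 'r--', 'r--']))
```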
We now plot the fit of the fourth-degree polynomial using this
function.
```{python}
plot_wage_fit(age_df,
              poly_age,
              'Degree-4 Polynomial');
```
With polynomial regression we must decide on the degree of
the polynomial to use. Sometimes we just wing it, and decide to use
second or third degree polynomials, simply to obtain a nonlinear fit. But we can
make such a decision in a more systematic way. One way to do this is through hypothesis
tests, which we demonstrate here. We now fit a series of models ranging from
linear (degree-one) to degree-five polynomials,
and look to determine the simplest model that is sufficient to
explain the relationship between `wage` and `age`. We use the
`anova_lm()` function, which performs a series of ANOVA
tests.
An *analysis of variance* or ANOVA tests the null
hypothesis that a model $\mathcal{M}_1$ is sufficient to explain the
data against the alternative hypothesis that a more complex model
$\mathcal{M}_2$ is required. The determination is based on an F-test.
To perform the test, the models $\mathcal{M}_1$ and $\mathcal{M}_2$ must be *nested*:
the space spanned by the predictors in $\mathcal{M}_1$ must be a subspace of the
space spanned by the predictors in $\mathcal{M}_2$. In this case, we
fit five different polynomial
models and sequentially compare the simpler model to the more complex
model.
```{python}
models = [MS([poly('age', degree=d)])
          for d in range(1, 6)]
Xs = [model.fit_transform(Wage) for model in models]
anova_lm(*[sm.OLS(y, X_).fit()
           for X_ in Xs])
```
Notice the `*` in the `anova_lm()` line above. This
function takes a variable number of non-keyword arguments, in this case fitted models.
When these models are provided as a list (as is done here), it must be
prefixed by `*`.
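For instance, the call above is equivalent to passing the five fitted models one by one; the `*` simply unpacks the list (a quick illustration, not part of the lab).
```{python}
# Equivalent call without * unpacking (illustration only).
fits = [sm.OLS(y, X_).fit() for X_ in Xs]
anova_lm(fits[0], fits[1], fits[2], fits[3], fits[4])
```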
The p-value comparing the linear `models[0]` to the quadratic
`models[1]` is essentially zero, indicating that a linear
fit is not sufficient. {Indexing starting at zero is confusing for the polynomial degree example, since `models[1]` is quadratic rather than linear!} Similarly the p-value comparing the quadratic
`models[1]` to the cubic `models[2]` is very low (0.0017), so the
quadratic fit is also insufficient. The p-value comparing the cubic
and degree-four polynomials, `models[2]` and `models[3]`, is
approximately 5%, while the degree-five polynomial `models[4]` seems
unnecessary because its p-value is 0.37. Hence, either a cubic or a
quartic polynomial appears to provide a reasonable fit to the data, but
lower- or higher-order models are not justified.
In this case, instead of using the `anova_lm()` function, we could
have obtained these p-values more succinctly by exploiting the fact
that `poly()` creates orthogonal polynomials.
```{python}
summarize(M)
```
Notice that the p-values are the same, and in fact the square of
the t-statistics are equal to the F-statistics from the
`anova_lm()` function; for example:
```{python}
(-11.983)**2
```
However, the ANOVA method works whether or not we used orthogonal
polynomials, provided the models are nested. For example, we can use
`anova_lm()` to compare the following three
models, which all have a linear term in `education` and a
polynomial in `age` of different degrees:
```{python}
models = [MS(['education', poly('age', degree=d)])
          for d in range(1, 4)]
XEs = [model.fit_transform(Wage)
       for model in models]
anova_lm(*[sm.OLS(y, X_).fit() for X_ in XEs])
```
As an alternative to using hypothesis tests and ANOVA, we could choose
the polynomial degree using cross-validation, as discussed in Chapter 5.
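A rough sketch of how that might look, in the style of the Chapter 5 lab, is shown below. It assumes the `sklearn_sm` wrapper from `ISLP.models` together with `sklearn`'s `cross_validate`; the metric reported in `test_score` follows that wrapper's convention, and the fold setup here is our own choice.
```{python}
# Sketch (not part of this lab): choosing the degree by cross-validation,
# using the sklearn_sm wrapper introduced in the Chapter 5 lab.
from sklearn.model_selection import KFold, cross_validate
from ISLP.models import sklearn_sm

cv = KFold(n_splits=10, shuffle=True, random_state=0)
cv_errors = []
for d in range(1, 6):
    spec = MS([poly('age', degree=d)])
    M_cv = sklearn_sm(sm.OLS, spec)
    results = cross_validate(M_cv, Wage, y, cv=cv)
    # test_score is the validation metric reported by the wrapper
    cv_errors.append(np.mean(results['test_score']))
cv_errors
```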
Next we consider the task of predicting whether an individual earns
more than $250,000 per year. We proceed much as before, except
that first we create the appropriate response vector, and then use
`sm.GLM()` with the binomial family in order
to fit a polynomial logistic regression model.
```{python}
X = poly_age.transform(Wage)
high_earn = Wage['high_earn'] = y > 250 # shorthand
glm = sm.GLM(y > 250,
             X,
             family=sm.families.Binomial())
B = glm.fit()
summarize(B)
```
Once again, we make predictions using the `get_prediction()` method.
```{python}
newX = poly_age.transform(age_df)
preds = B.get_prediction(newX)
bands = preds.conf_int(alpha=0.05)
```
We now plot the estimated relationship.
```{python}
fig, ax = subplots(figsize=(8,8))
rng = np.random.default_rng(0)
ax.scatter(age +
           0.2 * rng.uniform(size=y.shape[0]),
           np.where(high_earn, 0.198, 0.002),
           fc='gray',
           marker='|')
for val, ls in zip([preds.predicted_mean,
                    bands[:,0],
                    bands[:,1]],
                   ['b','r--','r--']):
    ax.plot(age_df.values, val, ls, linewidth=3)
ax.set_title('Degree-4 Polynomial', fontsize=20)
ax.set_xlabel('Age', fontsize=20)
ax.set_ylim([0,0.2])
ax.set_ylabel('P(Wage > 250)', fontsize=20);
```
We have drawn the `age` values corresponding to the observations with
`wage` values above 250 as gray marks at the top of the plot, and
those with `wage` values below 250 as gray marks at the
bottom of the plot. We added a small amount of noise to jitter
the `age` values a bit so that observations with the same `age`
value do not cover each other up. This type of plot is often called a
*rug plot*.
In order to fit a step function, as discussed in
Section 7.2, we first use the `pd.qcut()`
function to discretize `age` based on quantiles. Then we use `pd.get_dummies()` to create the
columns of the model matrix for this categorical variable. Note that this function will
include *all* columns for a given categorical, rather than following the usual
approach of dropping one of the levels.
```{python}
cut_age = pd.qcut(age, 4)
summarize(sm.OLS(y, pd.get_dummies(cut_age)).fit())
```
Here `pd.qcut()` automatically picked the cutpoints based on the quantiles 25%, 50% and 75%, which results in four regions. We could also have specified our own
quantiles directly instead of the argument `4`. For cuts not based
on quantiles we would use the `pd.cut()` function.
The function `pd.qcut()` (and `pd.cut()`) returns an ordered categorical variable.
The regression model then creates a set of
dummy variables for use in the regression. Since `age` is the only variable in the model, the value $94,158.40 is the average salary for those under 33.75 years of
age, and the other coefficients are the average
salary for those in the other age groups. We can produce
predictions and plots just as we did in the case of the polynomial
fit.
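Returning to the cutpoints: for illustration (these particular cutpoints are our own choices, not part of the lab), the same four bins can be requested with explicit quantiles, and `pd.cut()` accepts fixed cutpoints instead.
```{python}
# Explicit quantiles with pd.qcut(), and fixed (non-quantile) cutpoints
# with pd.cut(); the cutpoints below are illustrative choices.
quantile_bins = pd.qcut(age, [0, 0.25, 0.5, 0.75, 1.0])
fixed_bins = pd.cut(age, [age.min() - 1, 35, 50, 65, age.max()])
quantile_bins.value_counts(), fixed_bins.value_counts()
```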
## Splines
In order to fit regression splines, we use transforms
from the `ISLP` package. The actual spline
evaluation functions are in the `scipy.interpolate` package;
we have simply wrapped them as transforms
similar to `Poly()` and `PCA()`.
In Section 7.4, we saw
that regression splines can be fit by constructing an appropriate
matrix of basis functions. The `BSpline()` function generates the
entire matrix of basis functions for splines with the specified set of
knots. By default, the B-splines produced are cubic. To change the degree, use
the argument `degree`.
```{python}
bs_ = BSpline(internal_knots=[25,40,60], intercept=True).fit(age)
bs_age = bs_.transform(age)
bs_age.shape
```
This results in a seven-column matrix, which is what is expected for a cubic-spline basis with 3 interior knots.
We can form this same matrix using the `bs()` object,
which facilitates adding this to a model-matrix builder (as in `poly()` versus its workhorse `Poly()`) described in Section 7.8.1.
We now fit a cubic spline model to the `Wage` data.
```{python}
bs_age = MS([bs('age', internal_knots=[25,40,60])])
Xbs = bs_age.fit_transform(Wage)
M = sm.OLS(y, Xbs).fit()
summarize(M)
```
The column names are a little cumbersome, and have caused us to truncate the printed summary. They can be set on construction using the `name` argument as follows.
```{python}
bs_age = MS([bs('age',
                internal_knots=[25,40,60],
                name='bs(age)')])
Xbs = bs_age.fit_transform(Wage)
M = sm.OLS(y, Xbs).fit()
summarize(M)
```
Notice that there are 6 spline coefficients rather than 7. This is because, by default,
`bs()` assumes `intercept=False`, since we typically have an overall intercept in the model.
So it generates the spline basis with the given knots, and then discards one of the basis functions to account for the intercept.
We could also use the `df` (degrees of freedom) option to
specify the complexity of the spline. We see above that with 3 knots,
the spline basis has 6 columns or degrees of freedom. When we specify
`df=6` rather than the actual knots, `bs()` will produce a
spline with 3 knots chosen at uniform quantiles of the training data.
We can see these chosen knots most easily using `BSpline()` directly:
```{python}
BSpline(df=6).fit(age).internal_knots_
```
When asking for six degrees of freedom,
the transform chooses knots at ages 33.75, 42.0, and 51.0,
which correspond to the 25th, 50th, and 75th percentiles of
`age`.
When using B-splines we need not limit ourselves to cubic polynomials
(i.e. `degree=3`). For instance, using `degree=0` results
in piecewise constant functions, as in our example with
`pd.qcut()` above.
```{python}
bs_age0 = MS([bs('age',
                 df=3,
                 degree=0)]).fit(Wage)
Xbs0 = bs_age0.transform(Wage)
summarize(sm.OLS(y, Xbs0).fit())
```
This fit should be compared with cell [15] where we use `qcut()`
to create four bins by cutting at the 25%, 50% and 75% quantiles of
`age`. Since we specified `df=3` for degree-zero splines here, there will also be
knots at the same three quantiles. Although the coefficients appear different, we
see that this is a result of the different coding. For example, the
first coefficient is identical in both cases, and is the mean response
in the first bin. For the second coefficient, we have
$94.158 + 22.349 = 116.507 \approx 116.611$, the latter being the mean
in the second bin in cell [15]. Here the intercept is coded by a column
of ones, so the second, third and fourth coefficients are increments
for those bins. Why is the sum not exactly the same? It turns out that
`pd.qcut()` uses $\leq$, while `bs()` uses $<$ when
deciding bin membership.
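As a quick sanity check (not in the original lab), the per-bin means of `wage` can be computed directly and compared with the values reconstructed above.
```{python}
# Mean wage within each pd.qcut() bin.
y.groupby(cut_age).mean()
```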
In order to fit a natural spline, we use the `NaturalSpline()`
transform with the corresponding helper `ns()`. Here we fit a natural spline with five
degrees of freedom (excluding the intercept) and plot the results.
```{python}
ns_age = MS([ns('age', df=5)]).fit(Wage)
M_ns = sm.OLS(y, ns_age.transform(Wage)).fit()
summarize(M_ns)
```
We now plot the natural spline using our plotting function.
```{python}
plot_wage_fit(age_df,
              ns_age,
              'Natural spline, df=5');
```
## Smoothing Splines and GAMs
A smoothing spline is a special case of a GAM with squared-error loss
and a single feature. To fit GAMs in `Python` we will use the
`pygam` package which can be installed via `pip install pygam`. The
estimator `LinearGAM()` uses squared-error loss.
The GAM is specified by associating each column
of a model matrix with a particular smoothing operation:
`s` for smoothing spline; `l` for linear, and `f` for factor or categorical variables.
The argument `0` passed to `s` below indicates that this smoother will
apply to the first column of a feature matrix. Below, we pass it a
matrix with a single column: `X_age`. The argument `lam` is the penalty parameter $\lambda$ as discussed in Section 7.5.2.
```{python}
X_age = np.asarray(age).reshape((-1,1))
gam = LinearGAM(s_gam(0, lam=0.6))
gam.fit(X_age, y)
```
The `pygam` library generally expects a matrix of features so we reshape `age` to be a matrix (a two-dimensional array) instead
of a vector (i.e. a one-dimensional array). The `-1` in the call to the `reshape()` method tells `numpy` to infer the
size of that dimension based on the remaining entries of the shape tuple.
Let’s investigate how the fit changes with the smoothing parameter `lam`.
The function `np.logspace()` is similar to `np.linspace()` but spaces points
evenly on the log-scale. Below we vary `lam` from $10^{-2}$ to $10^6$.
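For reference (a quick check, not in the lab), these are the five penalty values the loop below will use.
```{python}
# The five values of lam used below: 1e-2, 1e0, 1e2, 1e4, 1e6.
np.logspace(-2, 6, 5)
```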
```{python}
fig, ax = subplots(figsize=(8,8))
ax.scatter(age, y, facecolor='gray', alpha=0.5)
for lam in np.logspace(-2, 6, 5):
    gam = LinearGAM(s_gam(0, lam=lam)).fit(X_age, y)
    ax.plot(age_grid,
            gam.predict(age_grid),
            label='{:.1e}'.format(lam),
            linewidth=3)
ax.set_xlabel('Age', fontsize=20)
ax.set_ylabel('Wage', fontsize=20);
ax.legend(title=r'$\lambda$');
```
The `pygam` package can perform a search for an optimal smoothing parameter.
```{python}
gam_opt = gam.gridsearch(X_age, y)
ax.plot(age_grid,
        gam_opt.predict(age_grid),
        label='Grid search',
        linewidth=4)
ax.legend()
fig
```
Alternatively, we can fix the degrees of freedom of the smoothing
spline using a function included in the `ISLP.pygam` package. Below we
find a value of $\lambda$ that gives us roughly four degrees of
freedom. We note here that these degrees of freedom include the
unpenalized intercept and linear term of the smoothing spline, hence there are at least two
degrees of freedom.
```{python}
age_term = gam.terms[0]
lam_4 = approx_lam(X_age, age_term, 4)
age_term.lam = lam_4
degrees_of_freedom(X_age, age_term)
```
Let’s vary the degrees of freedom in a similar plot to above. We choose the degrees of freedom
as the desired degrees of freedom plus one to account for the fact that these smoothing
splines always have an intercept term. Hence, a value of one for `df` is just a linear fit.
```{python}
fig, ax = subplots(figsize=(8,8))
ax.scatter(X_age,
           y,
           facecolor='gray',
           alpha=0.3)
for df in [1,3,4,8,15]:
    lam = approx_lam(X_age, age_term, df+1)
    age_term.lam = lam
    gam.fit(X_age, y)
    ax.plot(age_grid,
            gam.predict(age_grid),
            label='{:d}'.format(df),
            linewidth=4)
ax.set_xlabel('Age', fontsize=20)
ax.set_ylabel('Wage', fontsize=20);
ax.legend(title='Degrees of freedom');
```
### Additive Models with Several Terms
The strength of generalized additive models lies in their ability to fit multivariate regression models with more flexibility than linear models. We demonstrate two approaches: the first in a more manual fashion using natural splines and piecewise constant functions, and the second using the `pygam` package and smoothing splines.
We now fit a GAM by hand to predict
`wage` using natural spline functions of `year` and `age`,
treating `education` as a qualitative predictor, as in (7.16).
Since this is just a big linear regression model
using an appropriate choice of basis functions, we can simply do this
using the `sm.OLS()` function.
We will build the model matrix in a more manual fashion here, since we wish
to access the pieces separately when constructing partial dependence plots.
```{python}
ns_age = NaturalSpline(df=4).fit(age)
ns_year = NaturalSpline(df=5).fit(Wage['year'])
Xs = [ns_age.transform(age),
      ns_year.transform(Wage['year']),
      pd.get_dummies(Wage['education']).values]
X_bh = np.hstack(Xs)
gam_bh = sm.OLS(y, X_bh).fit()
```
Here the function `NaturalSpline()` is the workhorse supporting
the `ns()` helper function. We chose to use all columns of the
indicator matrix for the categorical variable `education`, making an intercept redundant.
Finally, we stacked the three component matrices horizontally to form the model matrix `X_bh`.
We now show how to construct partial dependence plots for each of the terms in our rudimentary GAM. We can do this by hand,
given grids for `age` and `year`.
We simply predict with new $X$ matrices, fixing all but one of the features at a time.
```{python}
age_grid = np.linspace(age.min(),
                       age.max(),
                       100)
X_age_bh = X_bh.copy()[:100]
X_age_bh[:] = X_bh[:].mean(0)[None,:]
X_age_bh[:,:4] = ns_age.transform(age_grid)
preds = gam_bh.get_prediction(X_age_bh)
bounds_age = preds.conf_int(alpha=0.05)
partial_age = preds.predicted_mean
center = partial_age.mean()
partial_age -= center
bounds_age -= center
fig, ax = subplots(figsize=(8,8))
ax.plot(age_grid, partial_age, 'b', linewidth=3)
ax.plot(age_grid, bounds_age[:,0], 'r--', linewidth=3)
ax.plot(age_grid, bounds_age[:,1], 'r--', linewidth=3)
ax.set_xlabel('Age')
ax.set_ylabel('Effect on wage')
ax.set_title('Partial dependence of age on wage', fontsize=20);
```
Let's explain in some detail what we did above. The idea is to create a new prediction matrix, where all but the columns belonging to `age` are constant (and set to their training-data means). The four columns for `age` are filled in with the natural spline basis evaluated at the 100 values in `age_grid`.
* We made a grid of length 100 in `age`, and created a matrix `X_age_bh` with 100 rows and the same number of columns as `X_bh`.
* We replaced every row of this matrix with the column means of the original.
* We then replaced just the first four columns, representing `age`, with the natural spline basis evaluated at the values in `age_grid`.
The remaining steps should by now be familiar; a small helper that wraps these steps is sketched below.
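This is our own sketch, not part of the `ISLP` package; the function name and arguments are hypothetical, and it simply repeats the steps listed above.
```{python}
# Sketch of a reusable partial-dependence helper (our own, not from ISLP).
def partial_dependence_ns(fitted, X_full, transform, grid, cols):
    Xnew = X_full.copy()[:len(grid)]           # one row per grid value
    Xnew[:] = X_full.mean(0)[None, :]          # hold every column at its mean
    Xnew[:, cols] = transform.transform(grid)  # fill in the term of interest
    preds = fitted.get_prediction(Xnew)
    partial = preds.predicted_mean
    bands = preds.conf_int(alpha=0.05)
    center = partial.mean()
    return partial - center, bands - center
```
For example, `partial_dependence_ns(gam_bh, X_bh, ns_age, age_grid, slice(0, 4))` should reproduce `partial_age` and `bounds_age` from the code above.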
We also look at the effect of `year` on `wage`; the process is the same.
```{python}
year_grid = np.linspace(Wage['year'].min(),
                        Wage['year'].max(),
                        100)
X_year_bh = X_bh.copy()[:100]
X_year_bh[:] = X_bh[:].mean(0)[None,:]
X_year_bh[:,4:9] = ns_year.transform(year_grid)
preds = gam_bh.get_prediction(X_year_bh)
bounds_year = preds.conf_int(alpha=0.05)
partial_year = preds.predicted_mean
center = partial_year.mean()
partial_year -= center
bounds_year -= center
fig, ax = subplots(figsize=(8,8))
ax.plot(year_grid, partial_year, 'b', linewidth=3)
ax.plot(year_grid, bounds_year[:,0], 'r--', linewidth=3)
ax.plot(year_grid, bounds_year[:,1], 'r--', linewidth=3)
ax.set_xlabel('Year')
ax.set_ylabel('Effect on wage')
ax.set_title('Partial dependence of year on wage', fontsize=20);
```
We now fit the model (7.16) using smoothing splines rather
than natural splines. All of the
terms in (7.16) are fit simultaneously, taking each other
into account to explain the response. The `pygam` package only works with matrices, so we must convert
the categorical series `education` to its array representation, which can be found
with the `cat.codes` attribute of `education`. As `year` only has 7 unique values, we
use only seven basis functions for it.
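A quick check of that claim (not in the original lab):
```{python}
Wage['year'].nunique()
```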
```{python}
gam_full = LinearGAM(s_gam(0) +
                     s_gam(1, n_splines=7) +
                     f_gam(2, lam=0))
Xgam = np.column_stack([age,
                        Wage['year'],
                        Wage['education'].cat.codes])
gam_full = gam_full.fit(Xgam, y)
```
The two `s_gam()` terms result in smoothing spline fits, and use a default value for $\lambda$ (`lam=0.6`), which is somewhat arbitrary. For the categorical term `education`, specified using a `f_gam()` term, we specify `lam=0` to avoid any shrinkage.
We produce the partial dependence plot in `age` to see the effect of these choices.
The values for the plot
are generated by the `pygam` package. We provide a `plot_gam()`
function for partial-dependence plots in `ISLP.pygam`, which makes this job easier than in our last example with natural splines.
```{python}
fig, ax = subplots(figsize=(8,8))
plot_gam(gam_full, 0, ax=ax)
ax.set_xlabel('Age')
ax.set_ylabel('Effect on wage')
ax.set_title('Partial dependence of age on wage - default lam=0.6', fontsize=20);
```
We see that the function is somewhat wiggly. It is more natural to specify the `df` than a value for `lam`.
We refit a GAM using four degrees of freedom each for
`age` and `year`. Recall that the addition of one below takes into account the intercept
of the smoothing spline.
```{python}
age_term = gam_full.terms[0]
age_term.lam = approx_lam(Xgam, age_term, df=4+1)
year_term = gam_full.terms[1]
year_term.lam = approx_lam(Xgam, year_term, df=4+1)
gam_full = gam_full.fit(Xgam, y)
```
Note that updating `age_term.lam` above updates it in `gam_full.terms[0]` as well! Likewise for `year_term.lam`.
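A quick way to see this aliasing (our own check, not in the lab) is to compare the penalty stored in the GAM's own term list with the value we just set:
```{python}
# Both names refer to the same term object, so the two values agree.
gam_full.terms[0].lam, age_term.lam
```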
Repeating the plot for `age`, we see that it is much smoother.
We also produce the plot for `year`.
```{python}
fig, ax = subplots(figsize=(8,8))
plot_gam(gam_full,
         1,
         ax=ax)
ax.set_xlabel('Year')
ax.set_ylabel('Effect on wage')
ax.set_title('Partial dependence of year on wage', fontsize=20);
```
Finally we plot `education`, which is categorical. Here the partial dependence plot looks different: it displays the fitted constant for each level of this variable.
```{python}
fig, ax = subplots(figsize=(8, 8))
plot_gam(gam_full, 2, ax=ax)
ax.set_xlabel('Education')
ax.set_ylabel('Effect on wage')
ax.set_title('Partial dependence of wage on education',
             fontsize=20);
ax.set_xticklabels(Wage['education'].cat.categories, fontsize=8);
```
### ANOVA Tests for Additive Models
In all of our models, the function of `year` looks rather linear. We can
perform a series of ANOVA tests in order to determine which of these
three models is best: a GAM that excludes `year` ($\mathcal{M}_1$),
a GAM that uses a linear function of `year` ($\mathcal{M}_2$), or a
GAM that uses a spline function of `year` ($\mathcal{M}_3$).
```{python}
gam_0 = LinearGAM(age_term + f_gam(2, lam=0))
gam_0.fit(Xgam, y)
gam_linear = LinearGAM(age_term +
                       l_gam(1, lam=0) +
                       f_gam(2, lam=0))
gam_linear.fit(Xgam, y)
```
Notice our use of `age_term` in the expressions above. We do this because
earlier we set the value for `lam` in this term to achieve four degrees of freedom.
To directly assess the effect of `year` we run an ANOVA on the
three models fit above.
```{python}
anova_gam(gam_0, gam_linear, gam_full)
```
We find that there is compelling evidence that a GAM with a linear
function in `year` is better than a GAM that does not include
`year` at all ($p$-value= 0.002). However, there is no
evidence that a non-linear function of `year` is needed
($p$-value=0.435). In other words, based on the results of this
ANOVA, $\mathcal{M}_2$ is preferred.
We can repeat the same process for `age` as well. We see there is very clear evidence that
a non-linear term is required for `age`.
```{python}
gam_0 = LinearGAM(year_term +
                  f_gam(2, lam=0))
gam_linear = LinearGAM(l_gam(0, lam=0) +
                       year_term +
                       f_gam(2, lam=0))
gam_0.fit(Xgam, y)
gam_linear.fit(Xgam, y)
anova_gam(gam_0, gam_linear, gam_full)
```
There is a (verbose) `summary()` method for the GAM fit.
```{python}
gam_full.summary()
```
We can make predictions from fitted GAM objects, just as we did with
fitted OLS models, using the `predict()` method. Here we make predictions on the
training set.
```{python}
Yhat = gam_full.predict(Xgam)
```
In order to fit a logistic regression GAM, we use `LogisticGAM()`
from `pygam`.
```{python}
gam_logit = LogisticGAM(age_term +
                        l_gam(1, lam=0) +
                        f_gam(2, lam=0))
gam_logit.fit(Xgam, high_earn)
```
```{python}
fig, ax = subplots(figsize=(8, 8))
plot_gam(gam_logit, 2, ax=ax)
ax.set_xlabel('Education')
ax.set_ylabel('Effect on high earner status')
ax.set_title('Partial dependence of high earner status on education',
             fontsize=20);
ax.set_xticklabels(Wage['education'].cat.categories, fontsize=8);
```
The fitted effect of `education` appears very flat, with especially wide error bands for the first category.
Let's look at the data a bit more closely.
```{python}
pd.crosstab(Wage['high_earn'], Wage['education'])
```
We see that there are no high earners in the first category of
education, meaning that the model will have a hard time fitting. We
will fit a logistic regression GAM excluding all observations falling into this
category. This provides more sensible results.
To do so,
we could subset the model matrix, though this will not remove the
column from `Xgam`. While we can deduce which column corresponds
to this feature, for reproducibility’s sake we reform the model matrix
on this smaller subset.
```{python}
only_hs = Wage['education'] == '1. < HS Grad'
Wage_ = Wage.loc[~only_hs]
Xgam_ = np.column_stack([Wage_['age'],
                         Wage_['year'],
                         Wage_['education'].cat.codes-1])
high_earn_ = Wage_['high_earn']
```
In the second-to-last line above, we subtract one from the codes of the category, due to a bug in `pygam`. It just relabels
the education values and hence has no effect on the fit.
We now fit the model.
```{python}
gam_logit_ = LogisticGAM(age_term +
                         year_term +
                         f_gam(2, lam=0))
gam_logit_.fit(Xgam_, high_earn_)
```
Let’s look at the effect of `education`, `year` and `age` on high earner status now that we’ve
removed those observations.
```{python}
fig, ax = subplots(figsize=(8, 8))
plot_gam(gam_logit_, 2, ax=ax)
ax.set_xlabel('Education')
ax.set_ylabel('Effect on high earner status')
ax.set_title('Partial dependence of high earner status on education', fontsize=20);
ax.set_xticklabels(Wage['education'].cat.categories[1:],
                   fontsize=8);
```
```{python}
fig, ax = subplots(figsize=(8, 8))
plot_gam(gam_logit_, 1, ax=ax)
ax.set_xlabel('Year')
ax.set_ylabel('Effect on high earner status')
ax.set_title('Partial dependence of high earner status on year',
             fontsize=20);
```
```{python}
fig, ax = subplots(figsize=(8, 8))
plot_gam(gam_logit_, 0, ax=ax)
ax.set_xlabel('Age')
ax.set_ylabel('Effect on high earner status')
ax.set_title('Partial dependence of high earner status on age', fontsize=20);
```
## Local Regression
We illustrate the use of local regression using the `lowess()`
function from `sm.nonparametric`. Some implementations of
GAMs allow terms to be local regression operators; this is not the case in `pygam`.
Here we fit local linear regression models using spans of 0.2
and 0.5; that is, each neighborhood consists of 20% or 50% of
the observations. As expected, the fit using a span of 0.5 is smoother than the fit using a span of 0.2.
```{python}
lowess = sm.nonparametric.lowess
fig, ax = subplots(figsize=(8,8))
ax.scatter(age, y, facecolor='gray', alpha=0.5)
for span in [0.2, 0.5]:
    fitted = lowess(y,
                    age,
                    frac=span,
                    xvals=age_grid)
    ax.plot(age_grid,
            fitted,
            label='{:.1f}'.format(span),
            linewidth=4)
ax.set_xlabel('Age', fontsize=20)
ax.set_ylabel('Wage', fontsize=20);
ax.legend(title='span', fontsize=15);
```