# Prediction understanding {#predictionUnderstanding}
In this chapter we introduce two groups of explainers that can be used to deepen our understanding of model predictions.

* Section \@ref(outlierDetection) presents explainers that help to identify outliers.
* Section \@ref(predictionBreakdown) presents explainers for model predictions. Each prediction can be split into parts attributed to particular variables. Having learned which variables matter and whether the prediction is accurate, one can validate the model.

The explainers presented here are illustrated with two models fitted to the `apartments` data.
```{r, warning=FALSE, message=FALSE}
library("DALEX")
apartments_lm_model <- lm(m2.price ~ construction.year + surface + floor +
no.rooms + district, data = apartments)
library("randomForest")
set.seed(59)
apartments_rf_model <- randomForest(m2.price ~ construction.year + surface + floor +
no.rooms + district, data = apartments)
```
First, we need to prepare wrappers for these models; they are stored in the `explainer_lm` and `explainer_rf` objects.
```{r, warning=FALSE, message=FALSE}
explainer_lm <- explain(apartments_lm_model,
data = apartmentsTest[,2:6], y = apartmentsTest$m2.price)
explainer_rf <- explain(apartments_rf_model,
data = apartmentsTest[,2:6], y = apartmentsTest$m2.price)
```
## Outlier detection {#outlierDetection}
The `model_performance()` function may be used to identify outliers. It was already introduced in Section \@ref(modelPerformance), but here we present another of its uses.

As you may remember, residuals for the random forest model were generally smaller, except for a small fraction of very large residuals. Let's use `model_performance()` to extract the residuals and plot them against the observed true values.
```{r outliersML, message=FALSE, warning=FALSE, fig.cap="Diagnostic plot for the random forest model. Clearly, the more expensive the apartment, the more underestimated the model prediction"}
mp_rf <- model_performance(explainer_rf)
library("ggplot2")
ggplot(mp_rf, aes(observed, diff)) + geom_point() +
xlab("Observed") + ylab("Predicted - Observed") +
ggtitle("Diagnostic plot for the random forest model") + theme_mi2()
```
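The plot highlights a group of underestimated predictions. As a quick numeric complement, the sketch below (which relies only on the `observed` and `diff` columns of `mp_rf` used in the plot above) ranks the observations by absolute residual:

```{r, eval=FALSE}
# Rank observations by the magnitude of their residuals to list the
# strongest outliers; `observed` and `diff` are the columns plotted above.
outlier_order <- order(abs(mp_rf$diff), decreasing = TRUE)
head(data.frame(observed = mp_rf$observed[outlier_order],
                residual = mp_rf$diff[outlier_order]))
```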
Let's see which variables stand behind the model prediction for the apartment with the largest residual.
```{r, eval=FALSE}
which.min(mp_rf$diff)
## 1161
new_apartment <- apartmentsTest[which.min(mp_rf$diff), ]
new_apartment
```
```{r, echo=FALSE}
new_apartment <- apartmentsTest[which.min(mp_rf$diff), ]
knitr::kable(
head(new_apartment),
caption = 'Observation with the largest residual in the random forest model'
)
```
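For completeness, the residual itself can be read off directly; the sketch below reuses `mp_rf` and `which.min()` from above, and its strongly negative value confirms that the model underestimates this apartment's price.

```{r, eval=FALSE}
# Residual (predicted - observed) for the apartment selected above;
# a large negative value means the model underestimates its price.
mp_rf$diff[which.min(mp_rf$diff)]
```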
## Prediction breakDown {#predictionBreakdown}
Does your ML algorithm learn from its mistakes? Understanding what causes wrong model predictions helps to improve the model itself.

Many arguments in favor of such explainers can be found in the [@lime] article. That approach is implemented in the **live** package (see [@live]), which may be seen as an extension of the LIME method. In this section we present another method for explaining model predictions, namely the one implemented in the `breakDown` package [@breakDown]. The function `single_prediction()` is a wrapper around this package.
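As a minimal sketch of what this wrapper does (hedged: it assumes that `broken()`, the main function of the `breakDown` package, accepts a fitted `lm` model and a one-row data frame, as described in that package's documentation), the attributions could also be computed by calling the package directly:

```{r, eval=FALSE}
# A direct call to the breakDown package that single_prediction() wraps;
# broken() computes per-variable contributions for a single observation.
# (Sketch only; see the breakDown documentation for the exact interface.)
library("breakDown")
broken(apartments_lm_model, new_apartment)
```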
Model predictions are visualized with Break Down Plots, which were inspired by the waterfall plots used in the [`xgboostExplainer` package](https://github.com/AppliedDataSciencePartners/xgboostExplainer). Break Down Plots show the contribution of every variable present in the model. The `single_prediction()` function generates variable attributions for a selected prediction, and the generic `plot()` function shows these attributions.
```{r single_prediction_break, fig.height=2.5, fig.cap="Break Down Plot for prediction from the random forest model"}
new_apartment_rf <- single_prediction(explainer_rf, observation = new_apartment)
breakDown:::print.broken(new_apartment_rf)
plot(new_apartment_rf)
```
Both the plot and the table confirm that all variables (`district`, `surface`, `floor`, `no.rooms`) have positive effects, as expected. Still, these effects are too small: the final prediction, `3505 + 1881`, is much smaller than the real price per square meter, `6679`.
Let's see how the linear model behaves for this observation.
```{r single_prediction_break2, fig.height=3.5, fig.cap="Break Down Plots that compare the linear model and the random forest model"}
new_apartment_lm <- single_prediction(explainer_lm, observation = new_apartment)
plot(new_apartment_lm, new_apartment_rf)
```
The prediction from the linear model is much closer to the real price per square meter for this apartment.
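As a quick numeric check (a sketch using only objects defined above), we can compare both models' raw predictions with the true value:

```{r, eval=FALSE}
# Compare raw model predictions with the true price per square meter
# for the selected apartment.
predict(apartments_lm_model, newdata = new_apartment)
predict(apartments_rf_model, newdata = new_apartment)
new_apartment$m2.price
```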