---
title: "Mapping the area of applicability (AOA) of prediction models"
subtitle: "Questions to OpenGeoHub Summer school 2021 participants"
author: "Hanna Meyer"
date: "08/18/2021"
output:
  html_document:
    toc: true
    toc_float: true
    toc_depth: 3
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
# Introduction
This exercise should help you gain a better understanding of how the AOA estimation works, why it is important, and what you might want to consider in order to develop models that are applicable for their respective purpose.
# Prediction task
The task is to perform a supervised land cover classification for Münster in Germany.
The challenge is: there are no reference data available... at least none for training, you'll get some later for validation ;-)
Instead, reference data from a few regions across Germany are provided for model training (thanks to my Master Landscape ecology group at WWU!). This is a realistic scenario given the large variety of prediction maps, e.g. in the context of ecology, where predictions are usually made far beyond the training samples.
```{r start, message=FALSE, warning=FALSE,echo=FALSE}
#First, load a number of packages we'll need.
library(raster)
library(caret)
library(mapview)
library(sf)
#library(devtools)
#install_github("HannaMeyer/CAST")
library(CAST)
#additional required packages:
library(tmap)
library(latticeExtra)
library(doParallel)
library(parallel)
library(Orcs)
```
## Data
To start with, let's load and explore the data...
### Raster data (predictor variables)
As predictor variables for spatial mapping of the area of Münster, you have a Sentinel-2 image available together with some derived artificial channels (NDVI, standard deviation of NDVI in a 5x5 pixel environment, latitude, longitude).
```{r load, message=FALSE,warning=FALSE}
sen_ms <- stack("data/Sen_Muenster_sub.grd")
```
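As a quick, optional check, we can print the layer names of the stack to confirm that it contains the spectral bands and the derived channels described above:
```{r checknames}
# Layer names of the predictor stack (spectral bands plus derived channels)
names(sen_ms)
```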
```{r rgbplots,echo=FALSE}
rgbplot_ms <- spplot(sen_ms[[1]], col.regions="transparent",sp.layout =rgb2spLayout(sen_ms[[3:1]], quantiles = c(0.02, 0.98), alpha = 1))
```
Let's plot the rasterStack to get an idea of what the variables look like.
```{r visPredictors}
plot(sen_ms)
plotRGB(sen_ms,stretch="lin",r=3,g=2,b=1)
```
### Reference data
As reference data we have digitized and labeled training polygons for some spatially distinct regions across Germany. Sentinel-2 data have been extracted for the training polygons, and each pixel covered by a polygon is used as a potential training data point.
We start with splitting the data into validation (Münster) and training (other regions) data.
```{r loadPoly}
trainSites <- readRDS("data/data_combined_ll.RDS")
trainDat <- trainSites[trainSites$Region!="Muenster",]
validationDat <- trainSites[trainSites$Region=="Muenster",]
head(trainSites)
#see unique regions in train set:
unique(trainDat$Region)
```
### Predictors and response
In order to speed things up, we will reduce the data for this example: from each training polygon, only 15% of the pixels will be used for model training.
```{r subset}
set.seed(100)
trainids <- createDataPartition(trainDat$ID,list=FALSE,p=0.15)
trainDat <- trainDat[trainids,]
trainDat <- trainDat[complete.cases(trainDat),]
```
For model training we need to define the predictor and response variables. To start with, we simply use all information from the raster stack as predictors, without further consideration. As response variable we use the "Label" column of the data frame.
```{r vars}
predictors <- names(sen_ms)
response <- "Label"
```
## Model training and validation
We then train a random forest algorithm to learn the relationships between the predictors and the land cover.
As a first, naive approach we use the default modelling workflow with a random cross-validation.
```{r train_basic, warning=FALSE, message=FALSE}
# train the model
ctrl_default <- trainControl(method="cv", number = 3, savePredictions = TRUE)
set.seed(100)
model <- train(trainDat[,predictors],
trainDat[,response],
method="rf",
metric="Kappa",
trControl=ctrl_default,
importance=TRUE,
ntree=50)
model
```
Looking at the performance estimates, the result looks great: apparently a nearly perfect prediction model (keep in mind that Kappa can take values up to 1 = perfect fit).
## Model prediction
To do the classification we can then use the trained model and apply it to each pixel of the raster stack using the predict function.
```{r predict, message=FALSE, warning=FALSE}
prediction <- predict(sen_ms,model)
```
```{r vispredict, message=FALSE, warning=FALSE, echo=FALSE}
prediction <- predict(sen_ms,model)
cols <- rev(c("blue","red","darkorange","lightgreen","brown","green", "grey","green2","forestgreen","darkgreen"))
tm_shape(deratify(prediction)) +
tm_raster(palette = cols,title = "LUC")+
tm_scale_bar(bg.color="white",bg.alpha=0.75)+
tm_layout(legend.bg.color = "white",
legend.bg.alpha = 0.75)
```
However, quite obviously, the result doesn't look very good compared to the RGB composite. Most strikingly, the city of Münster is mainly predicted as "open Soil". Why is this possible, keeping in mind that the cross-validation results indicated a nearly perfect model?
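To back up this visual impression, a small optional check is to tabulate how often each class was predicted (using `raster::freq()` on the categorical prediction):
```{r predfreq}
# Count how many pixels were assigned to each predicted class;
# the dominant classes should match the visual impression from the map above.
freq(prediction)
```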
### Can the model predict beyond training regions?
The reason is that we apply the model to make predictions for a new region, but the random CV doesn't help to estimate the performance for Münster (or other regions). This is because the data are heavily clustered in space (i.e. several data points come from the same training polygon; see the explanations in the recordings of the previous summer schools), hence random CV only answers the question of how well the model can make predictions for the same clusters as used during training.
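How strongly the training data are clustered can be checked directly from the columns used above (a small illustrative check):
```{r clustercheck}
# Number of training pixels per region and number of distinct polygons:
# many pixels per polygon/region means that random CV folds contain
# near-duplicates of the held-out points.
table(trainDat$Region)
length(unique(trainDat$ID))
nrow(trainDat)
```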
To get a more suitable performance estimate, we re-train the model with spatial CV. Since we want to apply the model to a new region within Germany, an obvious way of splitting the data is to split by region: In each iteration of cross-validation, one region is left out and used for validation instead of training.
```{r train_sp, warning=FALSE, message=FALSE, results='hide'}
# train the model
indices <- CreateSpacetimeFolds(trainDat,spacevar = "Region",k=length(unique(trainDat$Region)))
ctrl_sp <- trainControl(method="cv", index = indices$index, savePredictions = TRUE)
set.seed(100)
model_sp <- train(trainDat[,predictors],
trainDat[,response],
method="rf",
metric="Kappa",
trControl=ctrl_sp,
importance=TRUE,
ntree=50)
```
Let's compare the performance estimates.
```{r compare_performance, warning=FALSE, message=FALSE, results='hide'}
boxplot(model$resample$Kappa, model_sp$resample$Kappa,
names=c("random CV", "spatial CV"),ylab="Kappa Index")
```
Based on the spatial CV, we expect the performance for Münster to be considerably lower compared to the random CV (Kappa around 0.4 to 0.6).
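To put numbers on this comparison, we can look at the mean fold-wise Kappa of both strategies (a small optional summary):
```{r cv_means}
# Mean Kappa across the CV folds for the random and the spatial CV
mean(model$resample$Kappa)
mean(model_sp$resample$Kappa)
```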
### Validation with external validation data of Münster
We see large differences between the random and the spatial CV. Recently, arguments have been made against spatial CV for estimating map accuracy (https://www.sciencedirect.com/science/article/abs/pii/S0304380021002489?via%3Dihub). However, as outlined above, a spatial CV is the only meaningful way to estimate the performance for clustered data if no probability sample is available. Let's test whether the spatial CV estimates are reasonable: we now use the validation data of Münster and validate the predictions with these external data.
```{r definition, echo = FALSE}
validation <- function(x, y = validationDat){
  # use the union of predicted and reference classes as common factor levels,
  # so classes missing from either side are not silently dropped
  lev <- unique(c(as.character(x), as.character(y$Label)))
  confusionMatrix(factor(x, levels = lev),
                  factor(y$Label, levels = lev))$overall[1:2]
}
```
```{r validation, warning=FALSE, message=FALSE}
pred_muenster_valid <- predict(model_sp,validationDat)
validation(pred_muenster_valid)
```
It seems the external validation agrees well with the spatial CV estimate (and, as expected, not at all with the random CV estimate).
## Area of Applicability
Technically it was no problem to make predictions for Münster, but let's assess whether we really should have done this. A trained model should only be applied to locations that feature predictor properties comparable to those of the training data. Read https://besjournals.onlinelibrary.wiley.com/doi/10.1111/2041-210X.13650 to get familiar with the idea. If the dissimilarity to the training data is larger than the dissimilarity within the training data, Meyer and Pebesma (2021) suggest that the model should not be applied to this location.
The calculation of the AOA is quite time consuming. To make it a bit faster, we use parallelization.
```{r aoa}
cl <- makeCluster(4)
registerDoParallel(cl)
AOA <- aoa(sen_ms, model, cl = cl)
# stop the worker processes once the AOA has been computed
stopCluster(cl)
plot(AOA)
```
We see that the AOA contains only values of 0, which means that no pixel falls inside the AOA. Now we can visualize the prediction for the AOA only.
```{r aoavis, echo=FALSE}
cols_aoa <- c("black","transparent")
if (sum(values(AOA$AOA))==0){
cols_aoa <- c("black")
}
predplot <- spplot(deratify(prediction),col.regions=cols, main = list(label="Prediction (left), prediction only for the AOA (right) and RGB composite (bottom)",cex=0.8))
predplotaoa <- spplot(deratify(prediction),col.regions=cols)+
spplot(AOA$AOA,col.regions=cols_aoa)
latticeCombineGrid(list(predplot,predplotaoa,rgbplot_ms),layout=c(2,2))
```
We see that no predictions are shown because the model is not considered applicable to any part of the area of Münster.
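To understand why, it can help to inspect the dissimilarity index (DI) on which the AOA decision is based. The sketch below assumes that the returned `AOA` object contains a `DI` layer (as returned by `CAST::aoa`); DI values above the threshold derived from the training data are classified as outside the AOA.
```{r di_check}
# Distribution of the DI for Münster; consistently high values explain
# why no pixel falls inside the AOA.
summary(values(AOA$DI))
plot(AOA$DI)
```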
### Percentage of AOA
As a starting point for further improvement, let's calculate the percentage of Münster that falls within the AOA.
```{r aoa_performance,echo=FALSE}
print(paste0("Percentage of Münster that is within the AOA: ",
round(sum(values(AOA$AOA)==1)/ncell(AOA),2)*100
," %"))
```
# Problem to be solved
Obviously, the model was not applicable to any part of Münster. The prediction patterns confirm that the model cannot make any reliable predictions here. Now it's up to you to solve this.
### Questions
* Apparently the model is not applicable to the area of Münster. Why is this the case? Describe in not more than 150 words.
* What can you do to increase the transferability? Extend this Rmd and, without using any further data (e.g. don't increase the number of training samples; you can use less data if you like), implement a workflow to increase the transferability.
### How to start ?
You might want to look into the previous summer school recordings to gain ideas for a potential solution (none of them will explicitly tell you what to do though):
https://www.uni-muenster.de/RemoteSensing/en/lehre/summer_schools/index.html
It is further recommended to read about the area of applicability: https://besjournals.onlinelibrary.wiley.com/doi/10.1111/2041-210X.13650.
### Judgement of your contributions
You will be judged on the comprehensibility and plausibility of your approach, as well as on the percentage of the AOA for Münster and on the Kappa and Accuracy based on the external validation.
# Answer
*to be filled by you*
### Why is the original model not applicable to Münster?
*Describe in not more than 150 words*
### Increasing the transferability
*Describe in words, code and figures*
### Summary statistics
*Predict the test data, validate with the test data, visualize the results and calculate the percentage within the AOA*
```{r check}
#Predict on test data:
#pred_muenster_valid <- predict(model,validationDat)
# Validate the predictions:
#validation(pred_muenster_valid)
#Visualize: Use a 3-panel figure like shown above!
# Calculate percentage AOA:
#print(paste0("Percentage of Münster that is within the AOA: ",round(sum(values(AOA$AOA)==1)/ncell(AOA),2)*100," %"))
```