---
title: "Mapping the area of applicability (AOA) of prediction models"
subtitle: "Questions to OpenGeoHub Summer school 2021 participants"
author: "Hanna Meyer"
date: "08/18/2021"
output:
  html_document:
    toc: true
    toc_float: true
    toc_depth: 3
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
# Introduction
This exercise should help you gain a better understanding of how the AOA estimation works, why it is important, and what you might want to consider in order to develop models that are applicable for their respective purpose.
# Prediction task
The task is to perform a supervised land cover classification for Münster in Germany.
The challenge is: there are no reference data available... at least none for training, you'll get some later for validation ;-)
Instead, reference data from a few regions across Germany are provided for model training (thanks to my Master Landscape ecology group at WWU!). This is a realistic scenario given the large variety of prediction maps, e.g. in the context of ecology, where predictions are usually made far beyond the training samples.
```{r start, message=FALSE, warning=FALSE,echo=FALSE}
#First, load a number of packages we'll need.
library(raster)
library(caret)
library(mapview)
library(sf)
#library(devtools)
#install_github("HannaMeyer/CAST")
library(CAST)
#additional required packages:
library(tmap)
library(latticeExtra)
library(doParallel)
library(parallel)
library(Orcs)
```
## Data
To start with, let's load and explore the data...
### Raster data (predictor variables)
As predictor variables for spatial mapping of the area of Münster, you have a Sentinel-2 image available together with some derived artificial channels (NDVI, standard deviation of NDVI in a 5x5 pixel environment, latitude, longitude).
```{r load, message=FALSE,warning=FALSE}
sen_ms <- stack("data/Sen_Muenster_sub.grd")
```
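As a quick, optional check, we can print the layer names of the stack to confirm that it contains the spectral bands and the derived channels described above:
```{r checknames}
# Layer names of the predictor stack (spectral bands plus derived channels)
names(sen_ms)
```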
```{r rgbplots,echo=FALSE}
rgbplot_ms <- spplot(sen_ms[[1]], col.regions="transparent",sp.layout =rgb2spLayout(sen_ms[[3:1]], quantiles = c(0.02, 0.98), alpha = 1))
```
Let's plot the rasterStack to get an idea of what the variables look like.
```{r visPredictors}
plot(sen_ms)
plotRGB(sen_ms,stretch="lin",r=3,g=2,b=1)
```
### Reference data
As reference data we have digitized and labeled training polygons for some spatially distinct regions across Germany. Sentinel-2 data have been extracted for the training polygons, and each pixel covered by a polygon is used as a potential training data point.
We start with splitting the data into validation (Münster) and training (other regions) data.
```{r loadPoly}
trainSites <- readRDS("data/data_combined_ll.RDS")
trainDat <- trainSites[trainSites$Region!="Muenster",]
validationDat <- trainSites[trainSites$Region=="Muenster",]
head(trainSites)
#see unique regions in train set:
unique(trainDat$Region)
```
### Predictors and response
In order to speed things up, we will reduce the data for this example: from each training polygon, only 15% of the pixels will be used for model training.
```{r subset}
set.seed(100)
trainids <- createDataPartition(trainDat$ID,list=FALSE,p=0.15)
trainDat <- trainDat[trainids,]
trainDat <- trainDat[complete.cases(trainDat),]
```
For model training we need to define the predictor and response variables. To start with, we simply use all information from the raster stack as predictors, without further consideration. As response variable we use the "Label" column of the data frame.
```{r vars}
predictors <- names(sen_ms)
response <- "Label"
```
## Model training and validation
We then train a random forest algorithm to learn the relationships between the predictors and the land cover.
As a first, naive approach we use the default modelling workflow with a random cross-validation.
```{r train_basic, warning=FALSE, message=FALSE}
# train the model
ctrl_default <- trainControl(method="cv", number = 3, savePredictions = TRUE)
set.seed(100)
model <- train(trainDat[,predictors],
trainDat[,response],
method="rf",
metric="Kappa",
trControl=ctrl_default,
importance=TRUE,
ntree=50)
model
```
Looking at the performance estimates, the result looks great: apparently a nearly perfect prediction model (keep in mind that Kappa can take values up to 1 = perfect fit).
## Model prediction
To do the classification we can then use the trained model and apply it to each pixel of the raster stack using the predict function.
```{r predict, message=FALSE, warning=FALSE}
prediction <- predict(sen_ms,model)
```
```{r vispredict, message=FALSE, warning=FALSE, echo=FALSE}
prediction <- predict(sen_ms,model)
cols <- rev(c("blue","red","darkorange","lightgreen","brown","green", "grey","green2","forestgreen","darkgreen"))
tm_shape(deratify(prediction)) +
tm_raster(palette = cols,title = "LUC")+
tm_scale_bar(bg.color="white",bg.alpha=0.75)+
tm_layout(legend.bg.color = "white",
legend.bg.alpha = 0.75)
```
However, quite obviously, the result doesn't look very good compared to the RGB composite. Most strikingly, the city of Münster is mainly predicted as "open Soil". Why is this possible, keeping in mind that the cross-validation results indicated a nearly perfect model?
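To back up this visual impression, a small optional check is to tabulate how often each class was predicted (using `raster::freq()` on the categorical prediction):
```{r predfreq}
# Count how many pixels were assigned to each predicted class;
# the dominant classes should match the visual impression from the map above.
freq(prediction)
```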
### Can the model predict beyond training regions?
The reason is that we apply the model to make predictions for a new region, but the random CV doesn't help to estimate the performance for Münster (or other regions). This is because the data are heavily clustered in space (i.e. several data points come from the same training polygon; see the explanations in the recordings of the previous summer schools), hence random CV only answers the question of how well the model can make predictions for the same clusters as used during training.
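How strongly the training data are clustered can be checked directly from the columns used above (a small illustrative check):
```{r clustercheck}
# Number of training pixels per region and number of distinct polygons:
# many pixels per polygon/region means that random CV folds contain
# near-duplicates of the held-out points.
table(trainDat$Region)
length(unique(trainDat$ID))
nrow(trainDat)
```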
To get a more suitable performance estimate, we re-train the model with spatial CV. Since we want to apply the model to a new region within Germany, an obvious way of splitting the data is to split by region: In each iteration of cross-validation, one region is left out and used for validation instead of training.
```{r train_sp, warning=FALSE, message=FALSE, results='hide'}
# train the model
indices <- CreateSpacetimeFolds(trainDat,spacevar = "Region",k=length(unique(trainDat$Region)))
ctrl_sp <- trainControl(method="cv", index = indices$index, savePredictions = TRUE)
set.seed(100)
model_sp <- train(trainDat[,predictors],
trainDat[,response],
method="rf",
metric="Kappa",
trControl=ctrl_sp,
importance=TRUE,
ntree=50)
```
Let's compare the performance estimates.
```{r compare_performance, warning=FALSE, message=FALSE, results='hide'}
boxplot(model$resample$Kappa, model_sp$resample$Kappa,
names=c("random CV", "spatial CV"),ylab="Kappa Index")
```
Based on the spatial CV, we expect the performance for Münster to be considerably lower compared to the random CV (Kappa around 0.4 to 0.6).
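To put numbers on this comparison, we can look at the mean fold-wise Kappa of both strategies (a small optional summary):
```{r cv_means}
# Mean Kappa across the CV folds for the random and the spatial CV
mean(model$resample$Kappa)
mean(model_sp$resample$Kappa)
```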
### Validation with external validation data of Münster
We see large differences between the random and the spatial CV. Recently, arguments have been made against spatial CV for estimating map accuracy (https://www.sciencedirect.com/science/article/abs/pii/S0304380021002489?via%3Dihub). However, as outlined above, a spatial CV is the only meaningful way to estimate the performance for clustered data if no probability sample is available. Let's test whether the spatial CV estimates are reasonable: we now use the validation data of Münster and validate the predictions with these external data.
```{r definition, echo = FALSE}
validation <- function(x, y = validationDat){
  # use the union of predicted and reference classes as common factor levels,
  # so classes missing from either side are not silently dropped
  lev <- unique(c(as.character(x), as.character(y$Label)))
  confusionMatrix(factor(x, levels = lev),
                  factor(y$Label, levels = lev))$overall[1:2]
}
```
```{r validation, warning=FALSE, message=FALSE}
pred_muenster_valid <- predict(model_sp,validationDat)
validation(pred_muenster_valid)
```
It seems the external validation agrees well with the spatial CV estimate (and, as expected, not at all with the random CV estimate).
## Area of Applicability
Technically it was no problem to make predictions for Münster, but let's assess whether we really should have done this. A trained model should only be applied to locations that feature predictor properties comparable to those of the training data. Read https://besjournals.onlinelibrary.wiley.com/doi/10.1111/2041-210X.13650 to get familiar with the idea. If the dissimilarity to the training data is larger than the dissimilarity within the training data, Meyer and Pebesma (2021) suggest that the model should not be applied to this location.
The calculation of the AOA is quite time consuming. To make it a bit faster, we use parallelization.
```{r aoa}
cl <- makeCluster(4)
registerDoParallel(cl)
AOA <- aoa(sen_ms, model, cl = cl)
# stop the worker processes once the AOA has been computed
stopCluster(cl)
plot(AOA)
```
We see that the AOA contains only values of 0, which means that no pixel falls inside the AOA. Now we can visualize the prediction for the AOA only.
```{r aoavis, echo=FALSE}
cols_aoa <- c("black","transparent")
if (sum(values(AOA$AOA))==0){
cols_aoa <- c("black")
}
predplot <- spplot(deratify(prediction),col.regions=cols, main = list(label="Prediction (left), prediction only for the AOA (right) and RGB composite (bottom)",cex=0.8))
predplotaoa <- spplot(deratify(prediction),col.regions=cols)+
spplot(AOA$AOA,col.regions=cols_aoa)
latticeCombineGrid(list(predplot,predplotaoa,rgbplot_ms),layout=c(2,2))
```
We see that no predictions are shown because the model is not considered applicable to any part of the area of Münster.
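To understand why, it can help to inspect the dissimilarity index (DI) on which the AOA decision is based. The sketch below assumes that the returned `AOA` object contains a `DI` layer (as returned by `CAST::aoa`); DI values above the threshold derived from the training data are classified as outside the AOA.
```{r di_check}
# Distribution of the DI for Münster; consistently high values explain
# why no pixel falls inside the AOA.
summary(values(AOA$DI))
plot(AOA$DI)
```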
### Percentage of AOA
As a starting point for further improvement, let's calculate the percentage of Münster that falls within the AOA.
```{r aoa_performance,echo=FALSE}
print(paste0("Percentage of Münster that is within the AOA: ",
round(sum(values(AOA$AOA)==1)/ncell(AOA),2)*100
," %"))
```
# Problem to be solved
Obviously, the model was not applicable to any part of Münster. The prediction patterns confirm that the model cannot make any reliable predictions here. Now it's up to you to solve this.
### Questions
* Apparently the model is not applicable to the area of Münster. Why is this the case? Describe in not more than 150 words.
* What can you do to increase the transferability? Extend this Rmd and, without using any further data (e.g. don't increase the number of training samples; you can use less data if you like), implement a workflow to increase the transferability.
### How to start ?
You might want to look into the previous summer school recordings to gain ideas for a potential solution (none of them will explicitly tell you what to do though):
https://www.uni-muenster.de/RemoteSensing/en/lehre/summer_schools/index.html
It is further recommended to read about the area of applicability: https://besjournals.onlinelibrary.wiley.com/doi/10.1111/2041-210X.13650.
### Judgement of your contributions
You will be judged on the comprehensibility and plausibility of your approach, as well as on the percentage of the AOA for Münster and on the Kappa and Accuracy based on the external validation.
# Answer
*to be filled by you*
### Why is the original model not applicable to Münster?
*Describe in not more than 150 words*
### Increasing the transferability
*Describe in words, code and figures*
### Summary statistics
*Predict the test data, validate with the test data, visualize the results and calculate the percentage within the AOA*
```{r check}
#Predict on test data:
#pred_muenster_valid <- predict(model,validationDat)
# Validate the predictions:
#validation(pred_muenster_valid)
#Visualize: Use a 3-panel figure like shown above!
# Calculate percentage AOA:
#print(paste0("Percentage of Münster that is within the AOA: ",round(sum(values(AOA$AOA)==1)/ncell(AOA),2)*100," %"))
```