diff --git a/articles/cast01-CAST-intro-cookfarm.html b/articles/cast01-CAST-intro-cookfarm.html
index 2c89b265..e826da73 100644
--- a/articles/cast01-CAST-intro-cookfarm.html
+++ b/articles/cast01-CAST-intro-cookfarm.html
@@ -78,7 +78,7 @@
Hanna
Meyer
- 2024-01-08
+ 2024-01-22
Source: vignettes/cast01-CAST-intro-cookfarm.Rmd
cast01-CAST-intro-cookfarm.Rmd
diff --git a/articles/cast02-AOA-tutorial.html b/articles/cast02-AOA-tutorial.html
index 51a5cb6a..3f16b704 100644
--- a/articles/cast02-AOA-tutorial.html
+++ b/articles/cast02-AOA-tutorial.html
@@ -78,7 +78,7 @@
Hanna
Meyer
- 2024-01-08
+ 2024-01-22
Source: vignettes/cast02-AOA-tutorial.Rmd
cast02-AOA-tutorial.Rmd
diff --git a/articles/cast03-AOA-parallel.html b/articles/cast03-AOA-parallel.html
index c6374a0a..958ec20e 100644
--- a/articles/cast03-AOA-parallel.html
+++ b/articles/cast03-AOA-parallel.html
@@ -78,7 +78,7 @@
Marvin
Ludwig
- 2024-01-08
+ 2024-01-22
Source: vignettes/cast03-AOA-parallel.Rmd
cast03-AOA-parallel.Rmd
diff --git a/articles/cast04-plotgeodist.html b/articles/cast04-plotgeodist.html
index 657e84fa..a0d14993 100644
--- a/articles/cast04-plotgeodist.html
+++ b/articles/cast04-plotgeodist.html
@@ -78,7 +78,7 @@
Hanna
Meyer
- 2024-01-08
+ 2024-01-22
Source: vignettes/cast04-plotgeodist.Rmd
cast04-plotgeodist.Rmd
diff --git a/news/index.html b/news/index.html
index 03b0d817..5a8b5d63 100644
--- a/news/index.html
+++ b/news/index.html
@@ -55,7 +55,7 @@
-CAST
0.9.0
+
CAST
0.9.0
CRAN release: 2024-01-09
- new features:
- CAST functions now return classes with generic plotting and printing
- new dataset for examples, tutorials and testing: data(splotdata)
diff --git a/pkgdown.yml b/pkgdown.yml
index a9defa14..4145d642 100644
--- a/pkgdown.yml
+++ b/pkgdown.yml
@@ -6,7 +6,7 @@ articles:
cast02-AOA-tutorial: cast02-AOA-tutorial.html
cast03-AOA-parallel: cast03-AOA-parallel.html
cast04-plotgeodist: cast04-plotgeodist.html
-last_built: 2024-01-08T15:01Z
+last_built: 2024-01-22T16:23Z
urls:
reference: https://hannameyer.github.io/CAST/reference
article: https://hannameyer.github.io/CAST/articles
diff --git a/search.json b/search.json
index 73081add..5d37a1df 100644
--- a/search.json
+++ b/search.json
@@ -1 +1 @@
-[{"path":"https://hannameyer.github.io/CAST/articles/cast01-CAST-intro-cookfarm.html","id":"introduction","dir":"Articles","previous_headings":"","what":"Introduction","title":"1. Introduction to CAST","text":"!!Note: recent developments CAST yet fully documented tutorial. major update can expected Apr 2024!!","code":""},{"path":"https://hannameyer.github.io/CAST/articles/cast01-CAST-intro-cookfarm.html","id":"background","dir":"Articles","previous_headings":"Introduction","what":"Background","title":"1. Introduction to CAST","text":"One key task environmental science obtaining information environmental variables continuously space space time, usually based remote sensing limited field data. respect, machine learning algorithms proven important tool learn patterns nonlinear complex systems. However, standard machine learning applications suitable spatio-temporal data, usually ignore spatio-temporal dependencies data. becomes problematic (least) two aspects predictive modelling: Overfitted models well overly optimistic error assessment (see Meyer et al 2018 Meyer et al 2019 ). approach problems, CAST supports well-known caret package (Kuhn 2018 provide methods designed spatio-temporal data. tutorial shows set spatio-temporal prediction model includes objective reliable error estimation. shows spatio-temporal overfitting can detected comparison validation strategies. shown certain variables responsible problem overfitting due spatio-temporal autocorrelation patterns. Therefore, tutorial also shows automatically exclude variables lead overfitting aim improve spatio-temporal prediction model. order follow tutorial, assume reader familiar basics predictive modelling nicely explained Kuhn Johnson 2013 well machine learning applications via caret package.","code":""},{"path":"https://hannameyer.github.io/CAST/articles/cast01-CAST-intro-cookfarm.html","id":"how-to-start","dir":"Articles","previous_headings":"Introduction","what":"How to start","title":"1. 
Introduction to CAST","text":"work tutorial, first install CAST package load library: need help, see","code":"#install.packages(\"CAST\") library(CAST) help(CAST)"},{"path":"https://hannameyer.github.io/CAST/articles/cast01-CAST-intro-cookfarm.html","id":"example-of-a-typical-spatio-temporal-prediction-task","dir":"Articles","previous_headings":"","what":"Example of a typical spatio-temporal prediction task","title":"1. Introduction to CAST","text":"example prediction task tutorial following: set data loggers distributed farm, want map soil moisture, based set spatial temporal predictor variables. use Random Forests machine learning algorithm tutorial.","code":""},{"path":"https://hannameyer.github.io/CAST/articles/cast01-CAST-intro-cookfarm.html","id":"description-of-the-example-dataset","dir":"Articles","previous_headings":"Example of a typical spatio-temporal prediction task","what":"Description of the example dataset","title":"1. Introduction to CAST","text":", work cookfarm dataset, described e.g. Gasch et al 2015 available via GSIF package (Hengl 2017). dataset included CAST package re-structured dataset used analysis Meyer et al 2018. want point following information dataset: “SOURCEID” represents ID data logger, “VW” soil moisture response variable, “Easting” “Northing” coordinates data loggers, “altitude” indicates depth soil VW measured, remaining columns represent different potential predictor variables terrain related (e.g. “DEM”, “TWI”), vegetation indices (e.g. “NDRE”), soil properties (e.g. “BLD”) climate-related predictors (e.g. “Precip_wrcc”). See Gasch et al 2015 description dataset. get impression spatial properties dataset, let’s look spatial distribution data loggers cookfarm: see data taken 42 locations (SOURCEID) field. loggers recorded data 2007 2013 (dataset contains data 2010 ). 
VW data given daily basis.","code":"data <- readRDS(system.file(\"extdata\",\"Cookfarm.RDS\",package=\"CAST\")) head(data) ## SOURCEID VW Easting Northing altitude DEM TWI NDRE.M ## 101689 CAF357 0.303 493828.1 5181021 -0.3 792.5756 3.791253 0.08161208 ## 213001 CAF357 0.328 493828.1 5181021 -0.6 792.5756 3.791253 0.08161208 ## 324313 CAF357 0.376 493828.1 5181021 -0.9 792.5756 3.791253 0.08161208 ## 435625 CAF357 0.350 493828.1 5181021 -1.2 792.5756 3.791253 0.08161208 ## 546937 CAF357 0.323 493828.1 5181021 -1.5 792.5756 3.791253 0.08161208 ## 101690 CAF357 0.297 493828.1 5181021 -0.3 792.5756 3.791253 0.08161208 ## NDRE.Sd Bt BLD Date Precip_wrcc MaxT_wrcc MinT_wrcc ## 101689 0.2805182 0.0000 1.22 2010-01-01 5.8 2.8 -3.3 ## 213001 0.2805182 0.0000 1.36 2010-01-01 5.8 2.8 -3.3 ## 324313 0.2805182 0.0000 1.48 2010-01-01 5.8 2.8 -3.3 ## 435625 0.2805182 0.0000 1.56 2010-01-01 5.8 2.8 -3.3 ## 546937 0.2805182 0.0106 1.60 2010-01-01 5.8 2.8 -3.3 ## 101690 0.2805182 0.0000 1.22 2010-01-02 6.9 6.1 0.6 ## Precip_cum cday ## 101689 5.8 14611 ## 213001 5.8 14611 ## 324313 5.8 14611 ## 435625 5.8 14611 ## 546937 5.8 14611 ## 101690 12.7 14612 library(sf) data_sp <- unique(data[,c(\"SOURCEID\",\"Easting\",\"Northing\")]) data_sp <- st_as_sf(data_sp,coords=c(\"Easting\",\"Northing\"),crs=26911) plot(data_sp,axes=T,col=\"black\") #...or plot the data with mapview: library(mapview) mapviewOptions(basemaps = c(\"Esri.WorldImagery\")) mapview(data_sp)"},{"path":"https://hannameyer.github.io/CAST/articles/cast01-CAST-intro-cookfarm.html","id":"data-subsetting","dir":"Articles","previous_headings":"Example of a typical spatio-temporal prediction task","what":"Data subsetting","title":"1. Introduction to CAST","text":"reduce data amount can handled tutorial, let’s restrict data depth -0.3 two weeks year 2012. subsetting let’s overview soil moisture time series measured data loggers. 
can see (expected) logger location unique time series soil moisture.","code":"library(lubridate) library(ggplot2) trainDat <- data[data$altitude==-0.3& year(data$Date)==2012& week(data$Date)%in%c(10:12),] ggplot(data = trainDat, aes(x=Date, y=VW)) + geom_line(aes(colour=SOURCEID))"},{"path":"https://hannameyer.github.io/CAST/articles/cast01-CAST-intro-cookfarm.html","id":"model-training-and-prediction","dir":"Articles","previous_headings":"","what":"Model training and prediction","title":"1. Introduction to CAST","text":"following use subset cookfarm data example spatially predict soil moisture (.e. map soil moisture) (without) consideration spatio-temporal dependencies. start , lets use dataset create “default” Random Forest model predicts soil moisture based predictor variables. keep computation time minimum, don’t include hyperparameter tuning (hence mtry set 2) reasonable Random Forests comparably insensitive tuning. Based trained model can make spatial predictions soil moisture. load multiband raster contains spatial data predictor variables 25th March 2012 (example). apply trained model data set. result spatially comprehensive map soil moisture day. see simply creating map using machine learning caret easy task, however accurately measuring performance less simple. Though map looks good first sight now follow question accurate map , hence need ask well model able map soil moisture. visible inspection noticeable model produces strange linear features eastern side farm looks suspicious. 
let’s come back later first focus statistical validation model.","code":"library(caret) predictors <- c(\"DEM\",\"TWI\",\"Precip_cum\",\"cday\", \"MaxT_wrcc\",\"Precip_wrcc\",\"BLD\", \"Northing\",\"Easting\",\"NDRE.M\") set.seed(10) model <- train(trainDat[,predictors],trainDat$VW, method=\"rf\",tuneGrid=data.frame(\"mtry\"=2), importance=TRUE,ntree=50, trControl=trainControl(method=\"cv\",number=3)) library(terra) predictors_sp <- rast(system.file(\"extdata\",\"predictors_2012-03-25.tif\",package=\"CAST\")) prediction <- predict(predictors_sp,model,na.rm=TRUE) plot(prediction)"},{"path":"https://hannameyer.github.io/CAST/articles/cast01-CAST-intro-cookfarm.html","id":"cross-validation-strategies-for-spatio-temporal-data","dir":"Articles","previous_headings":"","what":"Cross validation strategies for spatio-temporal data","title":"1. Introduction to CAST","text":"Among validation strategies, k-fold cross validation (CV) popular estimate performance model view data used model training. CV, models repeatedly trained (k models) model run, data one fold put side used model training model validation. way, performance model can estimated using data included model training.","code":""},{"path":"https://hannameyer.github.io/CAST/articles/cast01-CAST-intro-cookfarm.html","id":"the-standard-approach-random-k-fold-cv","dir":"Articles","previous_headings":"Cross validation strategies for spatio-temporal data","what":"The Standard approach: Random k-fold CV","title":"1. Introduction to CAST","text":"example used random k-fold CV defined caret’s trainControl argument. specifically, used random 3-fold CV. Hence, data points dataset RANDOMLY split 3 folds. assess performance model let’s look output Random CV: see soil moisture modelled high R² (0.90) indicates nearly perfect fit data. Sounds good, unfortunately, random k fold CV give us good indication map accuracy. Random k-fold CV means three folds (highest certainty) contains data points data logger. 
Therefore, random CV indicate ability model make predictions beyond location training data (.e. map soil moisture). Since aim map soil moisture, rather need perform target-oriented validation validates model view spatial mapping.","code":"model ## Random Forest ## ## 654 samples ## 10 predictor ## ## No pre-processing ## Resampling: Cross-Validated (3 fold) ## Summary of sample sizes: 436, 437, 435 ## Resampling results: ## ## RMSE Rsquared MAE ## 0.02188303 0.9044144 0.01273172 ## ## Tuning parameter 'mtry' was held constant at a value of 2"},{"path":"https://hannameyer.github.io/CAST/articles/cast01-CAST-intro-cookfarm.html","id":"target-oriented-validation","dir":"Articles","previous_headings":"Cross validation strategies for spatio-temporal data","what":"Target-oriented validation","title":"1. Introduction to CAST","text":"interested model performance view random subsets data loggers, need know well model able make predictions areas without data loggers. find , need repeatedly leave complete time series one data loggers use test data CV. first need create meaningful folds rather random folds. CAST’s function “CreateSpaceTimeFolds” designed provide index arguments used caret’s trainControl. index defines data points used model training model run reversely defines data points held back. Hence, using index argument can account dependencies data leaving complete data one data loggers (LLO CV), one time steps (LTO CV) data loggers time steps (LLTO CV). example ’re focusing LLO CV, therefore use column “SOURCEID” define location data logger split data folds using information. Analog random CV split data five folds, hence five model runs performed leaving one fifth data loggers validation. Note several suggestions spatial CV exist. call LLO just simple example. See references Meyer Pebesma 2022 examples look Mila et al 2022 methodology implemented CAST function nndm. 
inspecting output model, see view new locations, R² 0.16 performance much lower expected random CV (R² = 0.90). Apparently, considerable overfitting model, causing good random performance poor performance view new locations. might partly attributed choice variables must suspect certain variables misinterpreted model (see Meyer et al 2018 [talk OpenGeoHub summer school 2019] (https://www.youtube.com/watch?v=mkHlmYEzsVQ)). Let’s look variable importance ranking Random Forest see find something suspicious: importance ranking indicates among others, “Easting” important variable. fits observation inappropriate linear features predicted map. Apparently model assigns high importance variable causes high random CV performance. time model fails prediction new locations variable unsuitable predictions beyond locations data loggers used model training. Assuming certain variables misinterpreted algorithm able produce higher LLO performance variables removed. Let’s see true…","code":"set.seed(10) indices <- CreateSpacetimeFolds(trainDat,spacevar = \"SOURCEID\", k=3) set.seed(10) model_LLO <- train(trainDat[,predictors],trainDat$VW, method=\"rf\",tuneGrid=data.frame(\"mtry\"=2), importance=TRUE, trControl=trainControl(method=\"cv\", index = indices$index)) model_LLO ## Random Forest ## ## 654 samples ## 10 predictor ## ## No pre-processing ## Resampling: Cross-Validated (10 fold) ## Summary of sample sizes: 433, 430, 445 ## Resampling results: ## ## RMSE Rsquared MAE ## 0.07645742 0.1616273 0.05994028 ## ## Tuning parameter 'mtry' was held constant at a value of 2 plot(varImp(model_LLO))"},{"path":"https://hannameyer.github.io/CAST/articles/cast01-CAST-intro-cookfarm.html","id":"removing-variables-that-cause-overfitting","dir":"Articles","previous_headings":"","what":"Removing variables that cause overfitting","title":"1. 
Introduction to CAST","text":"CAST’s forward feature selection (ffs) selects variables make sense view selected CV method excludes counterproductive (meaningless) view selected CV method. use LLO CV method, ffs selects variables lead combination highest LLO performance (.e. best spatial model). variables spatial meaning even counterproductive won’t improve even reduce LLO performance therefore excluded model ffs. ffs job first training models using possible pairs two predictor variables. best model initial models kept. basis best model predictor variables iterativly increased remaining variables tested improvement currently best model. process stops none remaining variables increases model performance added current best model. let’s run ffs case study using R² metric select optimal variables. process take 1-2 minutes… Using ffs LLO CV, R² increased 0.16 0.28. variables used model “DEM”,“NDRE.M” “Northing”. others removed (least small example) spatial meaning even counterproductive. Using plot\\(\\_\\)ffs function can visualize performance model changed depending variables used: See best model using two variables led R² slightly 0.2. Using third variable slightly increase R². variable improve LLO performance. Note R² features high standard deviation regardless variables used. due small dataset used lead robust results. effect new model spatial representation soil moisture? see variable selection effect statistical performance also predicted spatial patterns change considerably. 
note linear feature resulting soil moisture map likely “Easting” removed set predictor variables ffs.","code":"set.seed(10) ffsmodel_LLO <- ffs(trainDat[,predictors],trainDat$VW,metric=\"Rsquared\", method=\"rf\", tuneGrid=data.frame(\"mtry\"=2), verbose=FALSE,ntree=50, trControl=trainControl(method=\"cv\", index = indices$index)) ffsmodel_LLO ## Selected Variables: ## DEM NDRE.M Northing ## --- ## Random Forest ## ## 654 samples ## 3 predictor ## ## No pre-processing ## Resampling: Cross-Validated (10 fold) ## Summary of sample sizes: 433, 430, 445 ## Resampling results: ## ## RMSE Rsquared MAE ## 0.1013101 0.2833983 0.0767997 ## ## Tuning parameter 'mtry' was held constant at a value of 2 ffsmodel_LLO$selectedvars ## [1] \"DEM\" \"NDRE.M\" \"Northing\" plot(ffsmodel_LLO) prediction_ffs <- predict(predictors_sp,ffsmodel_LLO,na.rm=TRUE) plot(prediction_ffs)"},{"path":"https://hannameyer.github.io/CAST/articles/cast01-CAST-intro-cookfarm.html","id":"area-of-applicability","dir":"Articles","previous_headings":"","what":"Area of Applicability","title":"1. Introduction to CAST","text":"Still required analyse model can applied entire study area locations different predictor properties model learned . See details vignette Area applicability Meyer Pebesma 2021. figure shows grey areas outside area applicability, hence predictions considered locations. 
See tutorial AOA package information.","code":"### AOA for which the spatial CV error applies: AOA <- aoa(predictors_sp,ffsmodel_LLO) plot(prediction_ffs,main=\"prediction for the AOA \\n(spatial CV error applied)\") plot(AOA$AOA,col=c(\"grey\",\"transparent\"),add=T) #spplot(prediction_ffs,main=\"prediction for the AOA \\n(spatial CV error applied)\")+ #spplot(AOA$AOA,col.regions=c(\"grey\",\"transparent\")) ### AOA for which the random CV error applies: AOA_random <- aoa(predictors_sp,model) plot(prediction,main=\"prediction for the AOA \\n(random CV error applied)\") plot(AOA_random$AOA,col=c(\"grey\",\"transparent\"),add=T) #spplot(prediction,main=\"prediction for the AOA \\n(random CV error applied)\")+ #spplot(AOA_random$AOA,col.regions=c(\"grey\",\"transparent\"))"},{"path":"https://hannameyer.github.io/CAST/articles/cast01-CAST-intro-cookfarm.html","id":"conclusions","dir":"Articles","previous_headings":"","what":"Conclusions","title":"1. Introduction to CAST","text":"conclude, tutorial shown CAST can used facilitate target-oriented (: spatial) CV spatial spatio-temporal data crucial obtain meaningful validation results. Using ffs conjunction target-oriented validation, variables can excluded counterproductive view target-oriented performance due misinterpretations algorithm. ffs therefore helps select ideal set predictor variables spatio-temporal prediction tasks gives objective error estimates.","code":""},{"path":"https://hannameyer.github.io/CAST/articles/cast01-CAST-intro-cookfarm.html","id":"final-notes","dir":"Articles","previous_headings":"","what":"Final notes","title":"1. Introduction to CAST","text":"intention tutorial describe motivation led development CAST well functionality. Priority modelling soil moisture cookfarm best possible way provide example motivation functionality CAST can run within minutes. Hence, small subset entire cookfarm dataset used. 
Keep mind due small subset example robust quite different results might obtained depending small changes settings. intention showing motivation CAST also reason coordinates used predictor variables. Though coordinates used predictors quite scientific studies rather provide extreme example misleading variables can lead overfitting.","code":""},{"path":"https://hannameyer.github.io/CAST/articles/cast01-CAST-intro-cookfarm.html","id":"further-reading","dir":"Articles","previous_headings":"","what":"Further reading","title":"1. Introduction to CAST","text":"Meyer, H., & Pebesma, E. (2022): Machine learning-based global maps ecological variables challenge assessing . Nature Communications. Accepted. Meyer, H., & Pebesma, E. (2021). Predicting unknown space? Estimating area applicability spatial prediction models. Methods Ecology Evolution, 12, 1620– 1633. [https://doi.org/10.1111/2041-210X.13650] Meyer H, Reudenbach C, Wöllauer S,Nauss T (2019) Importance spatial predictor variable selection machine learning applications–Moving data reproduction spatial prediction. Ecological Modelling 411: 108815 [https://doi.org/10.1016/j.ecolmodel.2019.108815] Meyer H, Reudenbach C, Hengl T, Katurij M, Nauss T (2018) Improving performance spatio-temporal machine learning models using forward feature selection target-oriented validation. Environmental Modelling & Software 101: 1–9 [https://doi.org/10.1016/j.envsoft.2017.12.001] Talk OpenGeoHub summer school 2019 spatial validation variable selection: https://www.youtube.com/watch?v=mkHlmYEzsVQ. Tutorial (https://youtu./EyP04zLe9qo) Lecture (https://youtu./OoNH6Nl-X2s) recording OpenGeoHub summer school 2020 area applicability. well talk OpenGeoHub summer school 2021: https://av.tib.eu/media/54879","code":""},{"path":"https://hannameyer.github.io/CAST/articles/cast02-AOA-tutorial.html","id":"introduction","dir":"Articles","previous_headings":"","what":"Introduction","title":"2. 
Area of applicability of spatial prediction models","text":"spatial predictive mapping, models often applied make predictions far beyond sampling locations (.e. field observations used map variable even global scale), new locations might considerably differ environmental properties. However, areas predictor space without support training data problematic. model enabled learn relationships environments predictions areas considered highly uncertain. CAST, implement methodology described Meyer&Pebesma (2021) estimate “area applicability” (AOA) (spatial) prediction models. AOA defined area enabled model learn relationships based training data, estimated cross-validation performance holds. delineate AOA, first dissimilarity index (DI) calculated based distances training data multidimensional predictor variable space. account relevance predictor variables responsible prediction patterns weight variables model-derived importance scores prior distance calculation. AOA derived applying threshold based DI observed training data using cross-validation. tutorial shows example estimate area applicability spatial prediction models. information see: Meyer, H., & Pebesma, E. (2021). Predicting unknown space? Estimating area applicability spatial prediction models. Methods Ecology Evolution, 12, 1620– 1633. [https://doi.org/10.1111/2041-210X.13650]","code":""},{"path":"https://hannameyer.github.io/CAST/articles/cast02-AOA-tutorial.html","id":"getting-started","dir":"Articles","previous_headings":"Introduction","what":"Getting started","title":"2. Area of applicability of spatial prediction models","text":"","code":"library(CAST) library(caret) library(terra) library(sf) library(viridis) library(gridExtra)"},{"path":[]},{"path":[]},{"path":"https://hannameyer.github.io/CAST/articles/cast02-AOA-tutorial.html","id":"generate-predictors","dir":"Articles","previous_headings":"Example 1: Using simulated data > Get data","what":"Generate Predictors","title":"2. 
Area of applicability of spatial prediction models","text":"predictor variables, set bioclimatic variables used (https://www.worldclim.org). tutorial, originally downloaded using getData function raster package cropped area central Europe. cropped data provided CAST package.","code":"predictors <- rast(system.file(\"extdata\",\"bioclim.tif\",package=\"CAST\")) plot(predictors,col=viridis(100))"},{"path":"https://hannameyer.github.io/CAST/articles/cast02-AOA-tutorial.html","id":"generate-response","dir":"Articles","previous_headings":"Example 1: Using simulated data > Get data","what":"Generate Response","title":"2. Area of applicability of spatial prediction models","text":"able test reliability method, ’re using simulated prediction task. therefore simulate virtual response variable bioclimatic variables.","code":"generate_random_response <- function(raster, predictornames = names(raster), seed = sample(seq(1000), 1)){ operands_1 = c(\"+\", \"-\", \"*\", \"/\") operands_2 = c(\"^1\",\"^2\") expression <- paste(as.character(predictornames, sep=\"\")) # assign random power to predictors set.seed(seed) expression <- paste(expression, sample(operands_2, length(predictornames), replace = TRUE), sep = \"\") # assign random math function between predictors (expect after the last one) set.seed(seed) expression[-length(expression)] <- paste(expression[- length(expression)], sample(operands_1, length(predictornames)-1, replace = TRUE), sep = \" \") print(paste0(expression, collapse = \" \")) # collapse e = paste0(\"raster$\", expression, collapse = \" \") response = eval(parse(text = e)) names(response) <- \"response\" return(response) } response <- generate_random_response (predictors, seed = 10) ## [1] \"bio2^1 * bio5^1 + bio10^2 - bio13^2 / bio14^2 / bio19^1\" plot(response,col=viridis(100),main=\"virtual 
response\")"},{"path":"https://hannameyer.github.io/CAST/articles/cast02-AOA-tutorial.html","id":"simulate-sampling-locations","dir":"Articles","previous_headings":"Example 1: Using simulated data > Get data","what":"Simulate sampling locations","title":"2. Area of applicability of spatial prediction models","text":"simulate typical prediction task, field sampling locations randomly selected. , randomly select 20 points. Note small data set, used avoid long computation times.","code":"mask <- predictors[[1]] values(mask)[!is.na(values(mask))] <- 1 mask <- st_as_sf(as.polygons(mask)) mask <- st_make_valid(mask) set.seed(15) samplepoints <- st_as_sf(st_sample(mask,20,\"random\")) plot(response,col=viridis(100)) plot(samplepoints,col=\"red\",add=T,pch=3)"},{"path":"https://hannameyer.github.io/CAST/articles/cast02-AOA-tutorial.html","id":"model-training","dir":"Articles","previous_headings":"Example 1: Using simulated data","what":"Model training","title":"2. Area of applicability of spatial prediction models","text":"Next, machine learning algorithm applied learn relationships predictors response.","code":""},{"path":"https://hannameyer.github.io/CAST/articles/cast02-AOA-tutorial.html","id":"prepare-data","dir":"Articles","previous_headings":"Example 1: Using simulated data > Model training","what":"Prepare data","title":"2. Area of applicability of spatial prediction models","text":"Therefore, predictors response extracted sampling locations.","code":"trainDat <- extract(predictors,samplepoints,na.rm=FALSE) trainDat$response <- extract(response,samplepoints,na.rm=FALSE, ID=FALSE)$response trainDat <- na.omit(trainDat)"},{"path":"https://hannameyer.github.io/CAST/articles/cast02-AOA-tutorial.html","id":"train-the-model","dir":"Articles","previous_headings":"Example 1: Using simulated data > Model training","what":"Train the model","title":"2. 
Area of applicability of spatial prediction models","text":"Random Forest applied machine learning algorithm (others can used well, long variable importance returned). model validated default cross-validation estimate prediction error.","code":"set.seed(10) model <- train(trainDat[,names(predictors)], trainDat$response, method=\"rf\", importance=TRUE, trControl = trainControl(method=\"cv\")) print(model) ## Random Forest ## ## 20 samples ## 6 predictor ## ## No pre-processing ## Resampling: Cross-Validated (10 fold) ## Summary of sample sizes: 18, 18, 18, 18, 18, 18, ... ## Resampling results across tuning parameters: ## ## mtry RMSE Rsquared MAE ## 2 3854.481 1 3310.203 ## 4 3084.764 1 2675.126 ## 6 2960.314 1 2571.475 ## ## RMSE was used to select the optimal model using the smallest value. ## The final value used for the model was mtry = 6."},{"path":"https://hannameyer.github.io/CAST/articles/cast02-AOA-tutorial.html","id":"variable-importance","dir":"Articles","previous_headings":"Example 1: Using simulated data > Model training","what":"Variable importance","title":"2. Area of applicability of spatial prediction models","text":"estimation AOA require importance individual predictor variables.","code":"plot(varImp(model,scale = F),col=\"black\")"},{"path":"https://hannameyer.github.io/CAST/articles/cast02-AOA-tutorial.html","id":"predict-and-calculate-error","dir":"Articles","previous_headings":"Example 1: Using simulated data > Model training","what":"Predict and calculate error","title":"2. Area of applicability of spatial prediction models","text":"trained model used make predictions entire area interest. 
Since simulated area-wide response used, ’s possible tutorial compare predictions true reference.","code":"prediction <- predict(predictors,model,na.rm=T) truediff <- abs(prediction-response) plot(rast(list(prediction,response)),main=c(\"prediction\",\"reference\"))"},{"path":"https://hannameyer.github.io/CAST/articles/cast02-AOA-tutorial.html","id":"aoa-calculation","dir":"Articles","previous_headings":"Example 1: Using simulated data","what":"AOA Calculation","title":"2. Area of applicability of spatial prediction models","text":"visualization shows predictions made model. next step, DI AOA calculated. AOA calculation takes model input extract importance predictors, used weights multidimensional distance calculation. Note AOA can also calculated without trained model (.e. using training data new data ). case predictor variables treated equally important (unless weights given form table). Plotting aoa object shows distribution DI values within training data DI new data. output aoa function two raster data: first DI normalized weighted minimum distance nearest training data point divided average distance within training data. AOA derived DI using threshold. threshold (outlier-removed) maximum DI observed training data DI training data calculated considering cross-validation folds. used threshold relevant information training data DI returned parameters list entry. can plot DI well predictions onyl AOA: patterns DI general agreement true prediction error. high values present Alps, covered training data feature distinct environmental conditions. Since DI values areas threshold, regard area outside AOA.","code":"AOA <- aoa(predictors, model) class(AOA) ## [1] \"aoa\" names(AOA) ## [1] \"parameters\" \"DI\" \"AOA\" print(AOA) ## DI: ## class : SpatRaster ## dimensions : 102, 123, 1 (nrow, ncol, nlyr) ## resolution : 14075.98, 14075.98 (x, y) ## extent : 3496791, 5228136, 2143336, 3579086 (xmin, xmax, ymin, ymax) ## coord. ref. 
: +proj=laea +lat_0=52 +lon_0=10 +x_0=4321000 +y_0=3210000 +ellps=GRS80 +units=m +no_defs ## source(s) : memory ## varname : bioclim ## name : DI ## min value : 0.000000 ## max value : 3.408739 ## AOA: ## class : SpatRaster ## dimensions : 102, 123, 1 (nrow, ncol, nlyr) ## resolution : 14075.98, 14075.98 (x, y) ## extent : 3496791, 5228136, 2143336, 3579086 (xmin, xmax, ymin, ymax) ## coord. ref. : +proj=laea +lat_0=52 +lon_0=10 +x_0=4321000 +y_0=3210000 +ellps=GRS80 +units=m +no_defs ## source(s) : memory ## varname : bioclim ## name : AOA ## min value : 0 ## max value : 1 ## ## ## Predictor Weights: ## bio2 bio5 bio10 bio13 bio14 bio19 ## 1 3.746582 17.92456 17.04888 2.15925 0 0 ## ## ## AOA Threshold: 0.3221291 plot(AOA) plot(truediff,col=viridis(100),main=\"true prediction error\") plot(AOA$DI,col=viridis(100),main=\"DI\") plot(prediction, col=viridis(100),main=\"prediction for AOA\") plot(AOA$AOA,col=c(\"grey\",\"transparent\"),add=T,plg=list(x=\"topleft\",box.col=\"black\",bty=\"o\",title=\"AOA\"))"},{"path":"https://hannameyer.github.io/CAST/articles/cast02-AOA-tutorial.html","id":"aoa-for-spatially-clustered-data","dir":"Articles","previous_headings":"Example 1: Using simulated data","what":"AOA for spatially clustered data?","title":"2. Area of applicability of spatial prediction models","text":"example randomly distributed training samples. However, sampling locations might also highly clustered space. case, random cross-validation meaningful (see e.g. Meyer et al. 2018, Meyer et al. 2019, Valavi et al. 2019, Roberts et al. 2018, Pohjankukka et al. 2017, Brenning 2012) Also threshold AOA reliable, based distance nearest data point within training data (usually small data clustered). Instead, cross-validation based leave-cluster-approach, AOA estimation based distances nearest data point located spatial cluster. show looks like, use 15 spatial locations simulate 5 data points around location. first train model (case) inappropriate random cross-validation. 
…model based leave-cluster-cross-validation. AOA calculated (comparison) using model validated random cross-validation, second taking spatial clusters account calculating threshold based minimum distances nearest training point located cluster. done aoa function, folds used cross-validation automatically extracted model. Note AOA much larger spatial CV approach. However, spatial cross-validation error considerably larger, hence also area error applies larger. random cross-validation performance high, however, area performance applies small. fact also apparent plot aoa objects display distributions DI training data well DI new data. random CV predictionDI larger AOA threshold determined trainDI. Using spatial CV, predictionDI well within DI training samples.","code":"set.seed(25) samplepoints <- clustered_sample(mask,75,15,radius=25000) plot(response,col=viridis(100)) plot(samplepoints,col=\"red\",add=T,pch=3) trainDat <- extract(predictors,samplepoints,na.rm=FALSE) trainDat$response <- extract(response,samplepoints,na.rm=FALSE)$response trainDat <- data.frame(trainDat,samplepoints) trainDat <- na.omit(trainDat) set.seed(10) model_random <- train(trainDat[,names(predictors)], trainDat$response, method=\"rf\", importance=TRUE, trControl = trainControl(method=\"cv\")) prediction_random <- predict(predictors,model_random,na.rm=TRUE) print(model_random) ## Random Forest ## ## 75 samples ## 6 predictor ## ## No pre-processing ## Resampling: Cross-Validated (10 fold) ## Summary of sample sizes: 68, 67, 68, 68, 68, 67, ... ## Resampling results across tuning parameters: ## ## mtry RMSE Rsquared MAE ## 2 1088.1729 0.9956237 790.2191 ## 4 921.1760 0.9968527 717.5578 ## 6 922.1137 0.9967308 715.7016 ## ## RMSE was used to select the optimal model using the smallest value. ## The final value used for the model was mtry = 4. 
folds <- CreateSpacetimeFolds(trainDat, spacevar=\"parent\",k=10) set.seed(15) model <- train(trainDat[,names(predictors)], trainDat$response, method=\"rf\", importance=TRUE, tuneGrid = expand.grid(mtry = c(2:length(names(predictors)))), trControl = trainControl(method=\"cv\",index=folds$index)) print(model) ## Random Forest ## ## 75 samples ## 6 predictor ## ## No pre-processing ## Resampling: Cross-Validated (10 fold) ## Summary of sample sizes: 70, 70, 65, 70, 70, 65, ... ## Resampling results across tuning parameters: ## ## mtry RMSE Rsquared MAE ## 2 3227.421 0.9382904 2740.529 ## 3 2761.092 0.9433621 2396.941 ## 4 2677.002 0.9570317 2349.310 ## 5 2587.598 0.9486190 2282.064 ## 6 2494.756 0.9425158 2190.718 ## ## RMSE was used to select the optimal model using the smallest value. ## The final value used for the model was mtry = 6. prediction <- predict(predictors,model,na.rm=TRUE) AOA_spatial <- aoa(predictors, model) AOA_random <- aoa(predictors, model_random) plot(AOA_spatial$DI,col=viridis(100),main=\"DI\") plot(prediction, col=viridis(100),main=\"prediction for AOA \\n(spatial CV error applies)\") plot(AOA_spatial$AOA,col=c(\"grey\",\"transparent\"),add=TRUE,plg=list(x=\"topleft\",box.col=\"black\",bty=\"o\",title=\"AOA\")) plot(prediction_random, col=viridis(100),main=\"prediction for AOA \\n(random CV error applies)\") plot(AOA_random$AOA,col=c(\"grey\",\"transparent\"),add=TRUE,plg=list(x=\"topleft\",box.col=\"black\",bty=\"o\",title=\"AOA\")) grid.arrange(plot(AOA_spatial) + ggplot2::ggtitle(\"Spatial CV\"), plot(AOA_random) + ggplot2::ggtitle(\"Random CV\"), ncol = 2)"},{"path":"https://hannameyer.github.io/CAST/articles/cast02-AOA-tutorial.html","id":"comparison-prediction-error-with-model-error","dir":"Articles","previous_headings":"Example 1: Using simulated data","what":"Comparison prediction error with model error","title":"2. 
Area of applicability of spatial prediction models","text":"Since used simulated response variable, can now compare prediction error within AOA model error, assuming model error applies inside AOA outside. results indicate high agreement model CV error (RMSE) true prediction RMSE. case , random well spatial model.","code":"###for the spatial CV: RMSE(values(prediction)[values(AOA_spatial$AOA)==1], values(response)[values(AOA_spatial$AOA)==1]) ## [1] 3308.808 RMSE(values(prediction)[values(AOA_spatial$AOA)==0], values(response)[values(AOA_spatial$AOA)==0]) ## [1] 10855.31 model$results ## mtry RMSE Rsquared MAE RMSESD RsquaredSD MAESD ## 1 2 3227.421 0.9382904 2740.529 2335.609 0.06774290 2168.398 ## 2 3 2761.092 0.9433621 2396.941 1823.280 0.07190124 1674.310 ## 3 4 2677.002 0.9570317 2349.310 1690.078 0.04208035 1549.323 ## 4 5 2587.598 0.9486190 2282.064 1595.276 0.05220790 1410.225 ## 5 6 2494.756 0.9425158 2190.718 1507.700 0.07431001 1289.825 ###and for the random CV: RMSE(values(prediction_random)[values(AOA_random$AOA)==1], values(response)[values(AOA_random$AOA)==1]) ## [1] 1365.329 RMSE(values(prediction_random)[values(AOA_random$AOA)==0], values(response)[values(AOA_random$AOA)==0]) ## [1] 3959.685 model_random$results ## mtry RMSE Rsquared MAE RMSESD RsquaredSD MAESD ## 1 2 1088.1729 0.9956237 790.2191 595.2632 0.004567068 407.8754 ## 2 4 921.1760 0.9968527 717.5578 437.1580 0.002792369 311.1915 ## 3 6 922.1137 0.9967308 715.7016 412.0427 0.002498990 306.1030"},{"path":"https://hannameyer.github.io/CAST/articles/cast02-AOA-tutorial.html","id":"relationship-between-the-di-and-the-performance-measure","dir":"Articles","previous_headings":"Example 1: Using simulated data","what":"Relationship between the DI and the performance measure","title":"2. Area of applicability of spatial prediction models","text":"relationship error DI can used limit predictions area (within AOA) required performance (e.g. RMSE, R2, Kappa, Accuracy) applies. 
can done using result DItoErrormetric used relationship analyzed window DI values. corresponding model (: shape constrained additive models default: Monotone increasing P-splines dimension basis used represent smooth term 6 2nd order penalty.) can used estimate performance pixel level, allows limiting predictions using threshold. Note used multi-purpose CV estimate relationship DI RMSE (see details paper).","code":"DI_RMSE_relation <- DItoErrormetric(model, AOA_spatial$parameters, multiCV=TRUE, window.size = 5, length.out = 5) plot(DI_RMSE_relation) expected_RMSE = terra::predict(AOA_spatial$DI, DI_RMSE_relation) # account for multiCV changing the DI threshold updated_AOA = AOA_spatial$DI > attr(DI_RMSE_relation, \"AOA_threshold\") plot(expected_RMSE,col=viridis(100),main=\"expected RMSE\") plot(updated_AOA, col=c(\"grey\",\"transparent\"),add=TRUE,plg=list(x=\"topleft\",box.col=\"black\",bty=\"o\",title=\"AOA\"))"},{"path":"https://hannameyer.github.io/CAST/articles/cast02-AOA-tutorial.html","id":"example-2-a-real-world-example","dir":"Articles","previous_headings":"","what":"Example 2: A real-world example","title":"2. Area of applicability of spatial prediction models","text":"example used simulated data allows analyze reliability AOA. However, simulated area-wide response available usual prediction tasks. Therefore, second example AOA estimated dataset point observations reference .","code":""},{"path":"https://hannameyer.github.io/CAST/articles/cast02-AOA-tutorial.html","id":"data-and-preprocessing","dir":"Articles","previous_headings":"Example 2: A real-world example","what":"Data and preprocessing","title":"2. Area of applicability of spatial prediction models","text":", work cookfarm dataset, described e.g. Gasch et al 2015. dataset included CAST re-structured dataset. Find details also vignette “Introduction CAST”. use soil moisture (VW) response variable . 
Hence, ’re aiming making spatial continuous prediction based limited measurements data loggers.","code":"dat <- readRDS(system.file(\"extdata\",\"Cookfarm.RDS\",package=\"CAST\")) # calculate average of VW for each sampling site: dat <- aggregate(dat[,c(\"VW\",\"Easting\",\"Northing\")],by=list(as.character(dat$SOURCEID)),mean) # create sf object from the data: pts <- st_as_sf(dat,coords=c(\"Easting\",\"Northing\")) ##### Extract Predictors for the locations of the sampling points studyArea <- rast(system.file(\"extdata\",\"predictors_2012-03-25.tif\",package=\"CAST\")) st_crs(pts) <- crs(studyArea) trainDat <- extract(studyArea,pts,na.rm=FALSE) pts$ID <- 1:nrow(pts) trainDat <- merge(trainDat,pts,by.x=\"ID\",by.y=\"ID\") # The final training dataset with potential predictors and VW: head(trainDat) ## ID DEM TWI BLD NDRE.M NDRE.Sd Bt Easting Northing ## 1 1 788.1906 4.304258 1.42 -0.051189531 0.2506899 0.0000 493384 5180587 ## 2 2 788.3813 3.863605 1.29 -0.046459336 0.1754623 0.0000 493514 5180567 ## 3 3 790.5244 3.947488 1.36 -0.040845532 0.2225785 0.0000 493574 5180577 ## 4 4 775.7229 5.395786 1.55 -0.004329725 0.2099845 0.0501 493244 5180587 ## 5 5 796.7618 3.534822 1.31 0.027252737 0.2002646 0.0000 493624 5180607 ## 6 6 795.8370 3.815516 1.40 -0.123434804 0.2180606 0.0000 493694 5180607 ## MinT_wrcc MaxT_wrcc Precip_cum cday Precip_wrcc Group.1 VW ## 1 1.1 36.2 10.6 15425 0 CAF003 0.2894505 ## 2 1.1 36.2 10.6 15425 0 CAF007 0.2705531 ## 3 1.1 36.2 10.6 15425 0 CAF009 0.2629683 ## 4 1.1 36.2 10.6 15425 0 CAF019 0.2993580 ## 5 1.1 36.2 10.6 15425 0 CAF031 0.2664754 ## 6 1.1 36.2 10.6 15425 0 CAF033 0.2650177 ## geometry ## 1 POINT (493383.1 5180586) ## 2 POINT (493510.7 5180568) ## 3 POINT (493574.6 5180573) ## 4 POINT (493246.6 5180590) ## 5 POINT (493628.3 5180612) ## 6 POINT (493692.2 5180610)"},{"path":"https://hannameyer.github.io/CAST/articles/cast02-AOA-tutorial.html","id":"model-training-and-prediction","dir":"Articles","previous_headings":"Example 2: A 
real-world example","what":"Model training and prediction","title":"2. Area of applicability of spatial prediction models","text":"set variables used predictors VW random Forest model. model validated leave one cross-validation. Note model performance low, due small dataset used (small dataset low ability predictors model VW).","code":"predictors <- c(\"DEM\",\"NDRE.Sd\",\"TWI\",\"Bt\") response <- \"VW\" model <- train(trainDat[,predictors],trainDat[,response], method=\"rf\",tuneLength=3,importance=TRUE, trControl=trainControl(method=\"LOOCV\")) model ## Random Forest ## ## 42 samples ## 4 predictor ## ## No pre-processing ## Resampling: Leave-One-Out Cross-Validation ## Summary of sample sizes: 41, 41, 41, 41, 41, 41, ... ## Resampling results across tuning parameters: ## ## mtry RMSE Rsquared MAE ## 2 0.04049575 0.01826180 0.03233088 ## 3 0.04100862 0.02199224 0.03305649 ## 4 0.04153769 0.01562694 0.03340031 ## ## RMSE was used to select the optimal model using the smallest value. ## The final value used for the model was mtry = 2."},{"path":"https://hannameyer.github.io/CAST/articles/cast02-AOA-tutorial.html","id":"prediction","dir":"Articles","previous_headings":"Example 2: A real-world example > Model training and prediction","what":"Prediction","title":"2. Area of applicability of spatial prediction models","text":"Next, model used make predictions entire study area.","code":"#Predictors: plot(stretch(studyArea[[predictors]])) #prediction: prediction <- predict(studyArea,model,na.rm=TRUE)"},{"path":"https://hannameyer.github.io/CAST/articles/cast02-AOA-tutorial.html","id":"aoa-estimation","dir":"Articles","previous_headings":"Example 2: A real-world example","what":"AOA estimation","title":"2. Area of applicability of spatial prediction models","text":"Next ’re limiting predictions AOA. 
Predictions outside AOA excluded.","code":"AOA <- aoa(studyArea,model) #### Plot results: plot(AOA$DI,col=viridis(100),main=\"DI with sampling locations (red)\") plot(pts,zcol=\"ID\",col=\"red\",add=TRUE) plot(prediction, col=viridis(100),main=\"prediction for AOA \\n(LOOCV error applies)\") plot(AOA$AOA,col=c(\"grey\",\"transparent\"),add=TRUE,plg=list(x=\"topleft\",box.col=\"black\",bty=\"o\",title=\"AOA\"))"},{"path":"https://hannameyer.github.io/CAST/articles/cast02-AOA-tutorial.html","id":"final-notes","dir":"Articles","previous_headings":"","what":"Final notes","title":"2. Area of applicability of spatial prediction models","text":"AOA estimated based training data new data (.e. raster group entire area interest). trained model used getting variable importance needed weight predictor variables. can given table either, approach can used packages caret well. Knowledge AOA important predictions used baseline decision making subsequent environmental modelling. suggest AOA provided alongside prediction map complementary communication validation performances.","code":""},{"path":"https://hannameyer.github.io/CAST/articles/cast02-AOA-tutorial.html","id":"further-reading","dir":"Articles","previous_headings":"Final notes","what":"Further reading","title":"2. Area of applicability of spatial prediction models","text":"Meyer, H., & Pebesma, E. (2022): Machine learning-based global maps ecological variables challenge assessing . Nature Communications. Accepted. Meyer, H., & Pebesma, E. (2021). Predicting unknown space? Estimating area applicability spatial prediction models. Methods Ecology Evolution, 12, 1620– 1633. [https://doi.org/10.1111/2041-210X.13650] Tutorial (https://youtu.be/EyP04zLe9qo) Lecture (https://youtu.be/OoNH6Nl-X2s) recording OpenGeoHub summer school 2020 area applicability. 
well talk OpenGeoHub summer school 2021: https://av.tib.eu/media/54879","code":""},{"path":"https://hannameyer.github.io/CAST/articles/cast03-AOA-parallel.html","id":"generate-example-data","dir":"Articles","previous_headings":"","what":"Generate Example Data","title":"3. AOA in Parallel","text":"","code":"library(CAST) library(caret) library(terra) library(sf) data(\"splotdata\") predictors <- rast(system.file(\"extdata\",\"predictors_chile.tif\",package=\"CAST\")) splotdata <- st_drop_geometry(splotdata) set.seed(10) model_random <- train(splotdata[,names(predictors)], splotdata$Species_richness, method=\"rf\", importance=TRUE, ntrees = 50, trControl = trainControl(method=\"cv\")) prediction_random <- predict(predictors,model_random,na.rm=TRUE)"},{"path":"https://hannameyer.github.io/CAST/articles/cast03-AOA-parallel.html","id":"parallel-aoa-by-dividing-the-new-data","dir":"Articles","previous_headings":"","what":"Parallel AOA by dividing the new data","title":"3. AOA in Parallel","text":"better performances, recommended compute AOA two steps. First, DI training data resulting DI threshold computed model training data function trainDI. result trainDI usually first step aoa function, however can skipped providing trainDI object function call. makes possible compute AOA multiple raster tiles (e.g. different cores). especially useful large prediction areas, e.g. global mapping. large raster, divide multiple smaller tiles apply trainDI object afterwards tile. Use trainDI argument aoa function specify, want use previously computed trainDI object. can now run aoa function parallel different tiles! course can use favorite parallel backend task, use mclapply parallel package. 
larger tasks might useful save tiles hard-drive load one one avoid filling RAM.","code":"model_random_trainDI = trainDI(model_random) print(model_random_trainDI) ## DI of 703 observation ## Predictors: bio_1 bio_4 bio_5 bio_6 bio_8 bio_9 bio_12 bio_13 bio_14 bio_15 elev ## ## AOA Threshold: 0.1941761 saveRDS(model_random_trainDI, \"path/to/file\") r1 = crop(predictors, c(-75.66667, -67, -30, -17.58333)) r2 = crop(predictors, c(-75.66667, -67, -45, -30)) r3 = crop(predictors, c(-75.66667, -67, -55.58333, -45)) plot(r1[[1]],main = \"Tile 1\") plot(r2[[1]],main = \"Tile 2\") plot(r3[[1]],main = \"Tile 3\") aoa_r1 = aoa(newdata = r1, trainDI = model_random_trainDI) plot(r1[[1]], main = \"Tile 1: Predictors\") plot(aoa_r1$DI, main = \"Tile 1: DI\") plot(aoa_r1$AOA, main = \"Tile 1: AOA\") library(parallel) tiles_aoa = mclapply(list(r1, r2, r3), function(tile){ aoa(newdata = tile, trainDI = model_random_trainDI) }, mc.cores = 3) plot(tiles_aoa[[1]]$AOA, main = \"Tile 1\") plot(tiles_aoa[[2]]$AOA, main = \"Tile 2\") plot(tiles_aoa[[3]]$AOA, main = \"Tile 3\") # Simple Example Code for raster tiles on the hard drive tiles = list.files(\"path/to/tiles\", full.names = TRUE) tiles_aoa = mclapply(tiles, function(tile){ current = terra::rast(tile) aoa(newdata = current, trainDI = model_random_trainDI) }, mc.cores = 3)"},{"path":"https://hannameyer.github.io/CAST/articles/cast04-plotgeodist.html","id":"introduction","dir":"Articles","previous_headings":"","what":"Introduction","title":"4. Visualization of nearest neighbor distance distributions","text":"tutorial shows euclidean nearest neighbor distances geographic space feature space can calculated visualized using CAST. type visualization allows assess whether training data feature representative coverage prediction area cross-validation (CV) folds (independent test data) adequately chosen representative prediction locations. See e.g. Meyer Pebesma (2022) Milà et al. 
(2022) discussion topic.","code":""},{"path":"https://hannameyer.github.io/CAST/articles/cast04-plotgeodist.html","id":"sample-data","dir":"Articles","previous_headings":"","what":"Sample data","title":"4. Visualization of nearest neighbor distance distributions","text":"example data, use two different sets global virtual reference data: One spatial random sample second example, reference data clustered geographic space (see Meyer Pebesma (2022) discussions ). can define parameters run example different settings","code":"library(CAST) library(caret) library(terra) library(sf) library(rnaturalearth) library(ggplot2) seed <- 10 # random realization samplesize <- 300 # how many samples will be used? nparents <- 20 #For clustered samples: How many clusters? radius <- 500000 # For clustered samples: What is the radius of a cluster?"},{"path":"https://hannameyer.github.io/CAST/articles/cast04-plotgeodist.html","id":"prediction-area","dir":"Articles","previous_headings":"Sample data","what":"Prediction area","title":"4. Visualization of nearest neighbor distance distributions","text":"prediction area entire global land area, .e. imagine prediction task aim making global predictions based set reference data.","code":"ee <- st_crs(\"+proj=eqearth\") co <- ne_countries(returnclass = \"sf\") co.ee <- st_transform(co, ee)"},{"path":"https://hannameyer.github.io/CAST/articles/cast04-plotgeodist.html","id":"spatial-random-sample","dir":"Articles","previous_headings":"Sample data","what":"Spatial random sample","title":"4. 
Visualization of nearest neighbor distance distributions","text":", simulate random sample visualize data entire global prediction area.","code":"sf_use_s2(FALSE) set.seed(seed) pts_random <- st_sample(co.ee, samplesize) ### See points on the map: ggplot() + geom_sf(data = co.ee, fill=\"#00BFC4\",col=\"#00BFC4\") + geom_sf(data = pts_random, color = \"#F8766D\",size=0.5, shape=3) + guides(fill = \"none\", col = \"none\") + labs(x = NULL, y = NULL)"},{"path":"https://hannameyer.github.io/CAST/articles/cast04-plotgeodist.html","id":"clustered-sample","dir":"Articles","previous_headings":"Sample data","what":"Clustered sample","title":"4. Visualization of nearest neighbor distance distributions","text":"second data set use clustered design size.","code":"set.seed(seed) sf_use_s2(FALSE) pts_clustered <- clustered_sample(co.ee, samplesize, nparents, radius) ggplot() + geom_sf(data = co.ee, fill=\"#00BFC4\",col=\"#00BFC4\") + geom_sf(data = pts_clustered, color = \"#F8766D\",size=0.5, shape=3) + guides(fill = \"none\", col = \"none\") + labs(x = NULL, y = NULL)"},{"path":"https://hannameyer.github.io/CAST/articles/cast04-plotgeodist.html","id":"distances-in-geographic-space","dir":"Articles","previous_headings":"","what":"Distances in geographic space","title":"4. Visualization of nearest neighbor distance distributions","text":"can plot distributions spatial distances reference data nearest neighbor (“sample--sample”) distribution distances points global land surface nearest reference data point (“sample--prediction”). Note samples prediction locations used calculate sample--prediction nearest neighbor distances. Since ’re using global case study , throughout tutorial use sampling=Fibonacci draw prediction locations constant point density sphere. Note random data set nearest neighbor distance distribution training data quasi identical nearest neighbor distance distribution prediction area. 
comparison, second data set number training data heavily clustered geographic space. therefore see nearest neighbor distances within reference data rather small. Prediction locations, however, average much away.","code":"dist_random <- geodist(pts_random,co.ee, sampling=\"Fibonacci\") dist_clstr <- geodist(pts_clustered,co.ee, sampling=\"Fibonacci\") plot(dist_random, unit = \"km\")+scale_x_log10(labels=round)+ggtitle(\"Randomly distributed reference data\") plot(dist_clstr, unit = \"km\")+scale_x_log10(labels=round)+ggtitle(\"Clustered reference data\")"},{"path":[]},{"path":"https://hannameyer.github.io/CAST/articles/cast04-plotgeodist.html","id":"random-cross-validation","dir":"Articles","previous_headings":"Distances in geographic space > Accounting for cross-validation folds","what":"Random Cross-validation","title":"4. Visualization of nearest neighbor distance distributions","text":"Let’s use clustered data set show distribution spatial nearest neighbor distances cross-validation can visualized well. Therefore, first use “default” way random 10-fold cross validation randomly split reference data training test (see Meyer et al., 2018 2019 see might good idea). Obviously CV folds representative prediction locations (least terms distance nearest training data point). .e. folds used performance assessment model, can expect overly optimistic estimates validate predictions close proximity reference data.","code":"randomfolds <- caret::createFolds(1:nrow(pts_clustered)) dist_clstr <- geodist(pts_clustered,co.ee, sampling=\"Fibonacci\", cvfolds= randomfolds) plot(dist_clstr, unit = \"km\")+scale_x_log10(labels=round)"},{"path":"https://hannameyer.github.io/CAST/articles/cast04-plotgeodist.html","id":"spatial-cross-validation","dir":"Articles","previous_headings":"Distances in geographic space > Accounting for cross-validation folds","what":"Spatial Cross-validation","title":"4. 
Visualization of nearest neighbor distance distributions","text":", however, case CV performance regarded representative prediction task. Therefore, use spatial CV instead. , use leave-cluster-CV, means iteration, one spatial clusters held back. See fits nearest neighbor distribution prediction area much better. Note geodist also allows inspecting independent test data instead cross validation folds. See ?geodist ?plot.geodist.","code":"spatialfolds <- CreateSpacetimeFolds(pts_clustered,spacevar=\"parent\",k=length(unique(pts_clustered$parent))) dist_clstr <- geodist(pts_clustered,co.ee, sampling=\"Fibonacci\", cvfolds= spatialfolds$indexOut) plot(dist_clstr, unit = \"km\")+scale_x_log10(labels=round)"},{"path":"https://hannameyer.github.io/CAST/articles/cast04-plotgeodist.html","id":"why-has-spatial-cv-sometimes-blamed-for-being-too-pessimistic","dir":"Articles","previous_headings":"Distances in geographic space > Accounting for cross-validation folds","what":"Why has spatial CV sometimes blamed for being too pessimistic ?","title":"4. Visualization of nearest neighbor distance distributions","text":"Recently, Wadoux et al. (2021) published paper title “Spatial cross-validation right way evaluate map accuracy” state “spatial cross-validation strategies resulted grossly pessimistic map accuracy assessment”. come conclusion? reference data used study either regularly, random comparably mildly clustered geographic space, applied spatial CV strategies held large spatial units back CV. can see happens apply spatial CV randomly distributed reference data. see nearest neighbor distances cross-validation don’t match nearest neighbor distances prediction. compared section , time cross-validation folds far away reference data. Naturally end overly pessimistic performance estimates make prediction situations cross-validation harder, compared required model application entire area interest (global). 
spatial CV chosen therefore suitable prediction task, prediction situations created CV resemble encountered prediction.","code":"# create a spatial CV for the randomly distributed data. Here: # \"leave region-out-CV\" sf_use_s2(FALSE) pts_random_co <- st_join(st_as_sf(pts_random),co.ee) ggplot() + geom_sf(data = co.ee, fill=\"#00BFC4\",col=\"#00BFC4\") + geom_sf(data = pts_random_co, aes(color=subregion),size=0.5, shape=3) + scale_color_manual(values=rainbow(length(unique(pts_random_co$subregion))))+ guides(fill = FALSE, col = FALSE) + labs(x = NULL, y = NULL)+ ggtitle(\"spatial fold membership by color\") spfolds_rand <- CreateSpacetimeFolds(pts_random_co,spacevar = \"subregion\", k=length(unique(pts_random_co$subregion))) dist_rand_sp <- geodist(pts_random_co,co.ee, sampling=\"Fibonacci\", cvfolds= spfolds_rand$indexOut) plot(dist_rand_sp, unit = \"km\")+scale_x_log10(labels=round)"},{"path":"https://hannameyer.github.io/CAST/articles/cast04-plotgeodist.html","id":"nearest-neighbour-distance-matching-cv","dir":"Articles","previous_headings":"Distances in geographic space > Accounting for cross-validation folds","what":"Nearest Neighbour Distance Matching CV","title":"4. Visualization of nearest neighbor distance distributions","text":"good way approximate geographical prediction distances CV use Nearest Neighbour Distance Matching (NNDM) CV (see Milà et al., 2022 details). NNDM CV variation LOO CV empirical distribution function nearest neighbour distances found prediction matched CV process. NNDM CV-distance distribution matches sample--prediction distribution well. happens use NNDM CV randomly-distributed sampling points instead? 
NNDM CV-distance still matches sample--prediction distance function.","code":"nndmfolds_clstr <- nndm(pts_clustered, modeldomain=co.ee, samplesize = 2000) dist_clstr <- geodist(pts_clustered,co.ee, sampling = \"Fibonacci\", cvfolds = nndmfolds_clstr$indx_test, cvtrain = nndmfolds_clstr$indx_train) plot(dist_clstr, unit = \"km\")+scale_x_log10(labels=round) nndmfolds_rand <- nndm(pts_random_co, modeldomain=co.ee, samplesize = 2000) dist_rand <- geodist(pts_random_co,co.ee, sampling = \"Fibonacci\", cvfolds = nndmfolds_rand$indx_test, cvtrain = nndmfolds_rand$indx_train) plot(dist_rand, unit = \"km\")+scale_x_log10(labels=round)"},{"path":"https://hannameyer.github.io/CAST/articles/cast04-plotgeodist.html","id":"k-fold-nearest-neighbour-distance-matching-cv","dir":"Articles","previous_headings":"Distances in geographic space > Accounting for cross-validation folds","what":"k-fold Nearest Neighbour Distance Matching CV","title":"4. Visualization of nearest neighbor distance distributions","text":"Since NNDM CV highly time consuming, k-fold version may provide good trade-. 
See (see Linnenbrink et al., 2023 details)","code":"knndmfolds_clstr <- knndm(pts_clustered, modeldomain=co.ee, samplesize = 2000) pts_clustered$knndmCV <- as.character(knndmfolds_clstr$clusters) ggplot() + geom_sf(data = co.ee, fill=\"#00BFC4\",col=\"#00BFC4\") + geom_sf(data = pts_clustered, aes(color=knndmCV),size=0.5, shape=3) + scale_color_manual(values=rainbow(length(unique(pts_clustered$knndmCV))))+ guides(fill = FALSE, col = FALSE) + labs(x = NULL, y = NULL)+ ggtitle(\"spatial fold membership by color\") dist_clstr <- geodist(pts_clustered,co.ee, sampling = \"Fibonacci\", cvfolds = knndmfolds_clstr$indx_test, cvtrain = knndmfolds_clstr$indx_train) plot(dist_clstr, unit = \"km\")+scale_x_log10(labels=round)"},{"path":"https://hannameyer.github.io/CAST/articles/cast04-plotgeodist.html","id":"distances-in-feature-space","dir":"Articles","previous_headings":"","what":"Distances in feature space","title":"4. Visualization of nearest neighbor distance distributions","text":"far compared nearest neighbor distances geographic space. can also feature space. Therefore, set bioclimatic variables used (https://www.worldclim.org) features (.e. predictors) virtual prediction task. visualize nearest neighbor feature space distances consideration cross-validation. regard chosen predictor variables see nearest neighbor distance clustered training data rather small, compared required prediction. 
random CV representative prediction locations spatial CV better job.","code":"predictors_global <- rast(system.file(\"extdata\",\"bioclim_global.tif\",package=\"CAST\")) plot(predictors_global) # use random CV: dist_clstr_rCV <- geodist(pts_clustered,predictors_global, type = \"feature\", sampling=\"Fibonacci\", cvfolds = randomfolds) # use spatial CV: dist_clstr_sCV <- geodist(pts_clustered,predictors_global, type = \"feature\", sampling=\"Fibonacci\", cvfolds = spatialfolds$indexOut) # Plot results: plot(dist_clstr_rCV)+scale_x_log10()+ggtitle(\"Clustered reference data and random CV\") plot(dist_clstr_sCV)+scale_x_log10()+ggtitle(\"Clustered reference data and spatial CV\")"},{"path":"https://hannameyer.github.io/CAST/articles/cast04-plotgeodist.html","id":"references","dir":"Articles","previous_headings":"Distances in feature space","what":"References","title":"4. Visualization of nearest neighbor distance distributions","text":"Meyer, H., Pebesma, E. (2022): Machine learning-based global maps ecological variables challenge assessing . Nature Communications 13, 2208. https://doi.org/10.1038/s41467-022-29838-9 Milà, C., Mateu, J., Pebesma, E., Meyer, H. (2022): Nearest Neighbour Distance Matching Leave-One-Cross-Validation map validation. Methods Ecology Evolution 00, 1– 13. https://doi.org/10.1111/2041-210X.13851. Linnenbrink, J., Milà, C., Ludwig, M., Meyer, H. (2023): kNNDM: k-fold Nearest Neighbour Distance Matching Cross-Validation map accuracy estimation, EGUsphere [preprint], https://doi.org/10.5194/egusphere-2023-1308.","code":""},{"path":"https://hannameyer.github.io/CAST/authors.html","id":null,"dir":"","previous_headings":"","what":"Authors","title":"Authors and Citation","text":"Hanna Meyer. Maintainer, author. Carles Milà. Author. Marvin Ludwig. Author. Jan Linnenbrink. Author. Philipp Otto. Contributor. Chris Reudenbach. Contributor. Thomas Nauss. Contributor. Edzer Pebesma. 
Contributor.","code":""},{"path":"https://hannameyer.github.io/CAST/authors.html","id":"citation","dir":"","previous_headings":"","what":"Citation","title":"Authors and Citation","text":"Meyer H, Milà C, Ludwig M, Linnenbrink J (2024). CAST: 'caret' Applications Spatial-Temporal Models. R package version 0.9.0, https://hannameyer.github.io/CAST/, https://github.com/HannaMeyer/CAST.","code":"@Manual{, title = {CAST: 'caret' Applications for Spatial-Temporal Models}, author = {Hanna Meyer and Carles Milà and Marvin Ludwig and Jan Linnenbrink}, year = {2024}, note = {R package version 0.9.0, https://hannameyer.github.io/CAST/}, url = {https://github.com/HannaMeyer/CAST}, }"},{"path":"https://hannameyer.github.io/CAST/index.html","id":"cast-caret-applications-for-spatio-temporal-models","dir":"","previous_headings":"","what":"caret Applications for Spatial-Temporal Models","title":"caret Applications for Spatial-Temporal Models","text":"Supporting functionality run ‘caret’ spatial spatial-temporal data. ‘caret’ frequently used package model training prediction using machine learning. CAST includes functions improve spatial spatial-temporal modelling tasks using ‘caret’. decrease spatial overfitting improve model performances, package implements forward feature selection selects suitable predictor variables view contribution spatial spatio-temporal model performance. CAST includes functionality estimate (spatial) area applicability prediction models. Note: developer version CAST can found https://github.com/HannaMeyer/CAST. 
CRAN Version can found https://CRAN.R-project.org/package=CAST","code":""},{"path":"https://hannameyer.github.io/CAST/index.html","id":"package-website","dir":"","previous_headings":"","what":"Package Website","title":"caret Applications for Spatial-Temporal Models","text":"https://hannameyer.github.io/CAST/","code":""},{"path":"https://hannameyer.github.io/CAST/index.html","id":"tutorials","dir":"","previous_headings":"","what":"Tutorials","title":"caret Applications for Spatial-Temporal Models","text":"Introduction CAST Area applicability spatial prediction models Area applicability parallel Visualization nearest neighbor distance distributions talk OpenGeoHub summer school 2019 spatial validation variable selection: https://www.youtube.com/watch?v=mkHlmYEzsVQ. Tutorial (https://youtu.be/EyP04zLe9qo) Lecture (https://youtu.be/OoNH6Nl-X2s) recording OpenGeoHub summer school 2020 area applicability. well talk OpenGeoHub summer school 2021: https://av.tib.eu/media/54879 Talk tutorial OpenGeoHub 2022 summer school Machine learning-based maps environment - challenges extrapolation overfitting, including discussions area applicability nearest neighbor distance matching cross-validation (https://doi.org/10.5446/59412).","code":""},{"path":[]},{"path":"https://hannameyer.github.io/CAST/index.html","id":"spatial-cross-validation","dir":"","previous_headings":"Scientific documentation of the methods","what":"Spatial cross-validation","title":"caret Applications for Spatial-Temporal Models","text":"Milà, C., Mateu, J., Pebesma, E., Meyer, H. (2022): Nearest Neighbour Distance Matching Leave-One-Cross-Validation map validation. Methods Ecology Evolution 00, 1– 13. https://doi.org/10.1111/2041-210X.13851 Linnenbrink, J., Milà, C., Ludwig, M., Meyer, H.: kNNDM (2023): k-fold Nearest Neighbour Distance Matching Cross-Validation map accuracy estimation. EGUsphere [preprint]. 
https://doi.org/10.5194/egusphere-2023-1308","code":""},{"path":"https://hannameyer.github.io/CAST/index.html","id":"spatial-variable-selection","dir":"","previous_headings":"Scientific documentation of the methods","what":"Spatial variable selection","title":"caret Applications for Spatial-Temporal Models","text":"Meyer, H., Reudenbach, C., Hengl, T., Katurji, M., Nauss, T. (2018): Improving performance spatio-temporal machine learning models using forward feature selection target-oriented validation. Environmental Modelling & Software, 101, 1-9. https://doi.org/10.1016/j.envsoft.2017.12.001 Meyer, H., Reudenbach, C., Wöllauer, S., Nauss, T. (2019): Importance spatial predictor variable selection machine learning applications - Moving data reproduction spatial prediction. Ecological Modelling. 411. https://doi.org/10.1016/j.ecolmodel.2019.108815","code":""},{"path":"https://hannameyer.github.io/CAST/index.html","id":"area-of-applicability","dir":"","previous_headings":"Scientific documentation of the methods","what":"Area of applicability","title":"caret Applications for Spatial-Temporal Models","text":"Meyer, H., Pebesma, E. (2021). Predicting unknown space? Estimating area applicability spatial prediction models. Methods Ecology Evolution, 12, 1620– 1633. https://doi.org/10.1111/2041-210X.13650","code":""},{"path":"https://hannameyer.github.io/CAST/index.html","id":"applications-and-use-cases","dir":"","previous_headings":"Scientific documentation of the methods","what":"Applications and use cases","title":"caret Applications for Spatial-Temporal Models","text":"Meyer, H., Pebesma, E. (2022): Machine learning-based global maps ecological variables challenge assessing . Nature Communications, 13. https://www.nature.com/articles/s41467-022-29838-9 Ludwig, M., Moreno-Martinez, A., Hoelzel, N., Pebesma, E., Meyer, H. (2023): Assessing improving transferability current global spatial prediction models. Global Ecology Biogeography. 
https://doi.org/10.1111/geb.13635.","code":""},{"path":"https://hannameyer.github.io/CAST/reference/CAST.html","id":null,"dir":"Reference","previous_headings":"","what":"'caret' Applications for Spatial-Temporal Models — CAST","title":"'caret' Applications for Spatial-Temporal Models — CAST","text":"Supporting functionality run 'caret' spatial spatial-temporal data. 'caret' frequently used package model training prediction using machine learning. CAST includes functions improve spatial-temporal modelling tasks using 'caret'. includes newly suggested 'Nearest neighbor distance matching' cross-validation estimate performance spatial prediction models allows spatial variable selection selects suitable predictor variables view contribution spatial model performance. CAST includes functionality estimate (spatial) area applicability prediction models analysing similarity new data training data. Methods described Meyer et al. (2018); Meyer et al. (2019); Meyer Pebesma (2021); Milà et al. (2022); Meyer Pebesma (2022).","code":""},{"path":"https://hannameyer.github.io/CAST/reference/CAST.html","id":"details","dir":"Reference","previous_headings":"","what":"Details","title":"'caret' Applications for Spatial-Temporal Models — CAST","text":"'caret' Applications Spatio-Temporal models","code":""},{"path":"https://hannameyer.github.io/CAST/reference/CAST.html","id":"references","dir":"Reference","previous_headings":"","what":"References","title":"'caret' Applications for Spatial-Temporal Models — CAST","text":"Linnenbrink, J., Milà, C., Ludwig, M., Meyer, H.: kNNDM: k-fold Nearest Neighbour Distance Matching Cross-Validation map accuracy estimation, EGUsphere [preprint], https://doi.org/10.5194/egusphere-2023-1308, 2023. Milà, C., Mateu, J., Pebesma, E., Meyer, H. (2022): Nearest Neighbour Distance Matching Leave-One-Cross-Validation map validation. Methods Ecology Evolution 00, 1– 13. Meyer, H., Pebesma, E. 
(2022): Machine learning-based global maps ecological variables challenge assessing . Nature Communications. 13. Meyer, H., Pebesma, E. (2021): Predicting unknown space? Estimating area applicability spatial prediction models. Methods Ecology Evolution. 12, 1620– 1633. Meyer, H., Reudenbach, C., Wöllauer, S., Nauss, T. (2019): Importance spatial predictor variable selection machine learning applications - Moving data reproduction spatial prediction. Ecological Modelling. 411, 108815. Meyer, H., Reudenbach, C., Hengl, T., Katurji, M., Nauß, T. (2018): Improving performance spatio-temporal machine learning models using forward feature selection target-oriented validation. Environmental Modelling & Software 101: 1-9.","code":""},{"path":"https://hannameyer.github.io/CAST/reference/CAST.html","id":"author","dir":"Reference","previous_headings":"","what":"Author","title":"'caret' Applications for Spatial-Temporal Models — CAST","text":"Hanna Meyer, Carles Milà, Marvin Ludwig, Jan Linnenbrink","code":""},{"path":"https://hannameyer.github.io/CAST/reference/CreateSpacetimeFolds.html","id":null,"dir":"Reference","previous_headings":"","what":"Create Space-time Folds — CreateSpacetimeFolds","title":"Create Space-time Folds — CreateSpacetimeFolds","text":"Create spatial, temporal spatio-temporal Folds cross validation based pre-defined groups","code":""},{"path":"https://hannameyer.github.io/CAST/reference/CreateSpacetimeFolds.html","id":"ref-usage","dir":"Reference","previous_headings":"","what":"Usage","title":"Create Space-time Folds — CreateSpacetimeFolds","text":"","code":"CreateSpacetimeFolds( x, spacevar = NA, timevar = NA, k = 10, class = NA, seed = sample(1:1000, 1) )"},{"path":"https://hannameyer.github.io/CAST/reference/CreateSpacetimeFolds.html","id":"arguments","dir":"Reference","previous_headings":"","what":"Arguments","title":"Create Space-time Folds — CreateSpacetimeFolds","text":"x data.frame containing spatio-temporal data spacevar Character indicating 
column x identifies spatial units (e.g. ID weather stations) timevar Character indicating column x identifies temporal units (e.g. day year) k numeric. Number folds. spacevar timevar NA leave one location leave one time step cv performed, set k number unique spatial temporal units. class Character indicating column x identifies class unit (e.g. land cover) seed numeric. See ?seed","code":""},{"path":"https://hannameyer.github.io/CAST/reference/CreateSpacetimeFolds.html","id":"value","dir":"Reference","previous_headings":"","what":"Value","title":"Create Space-time Folds — CreateSpacetimeFolds","text":"list contains list model training list model validation can directly used \"index\" \"indexOut\" caret's trainControl function","code":""},{"path":"https://hannameyer.github.io/CAST/reference/CreateSpacetimeFolds.html","id":"details","dir":"Reference","previous_headings":"","what":"Details","title":"Create Space-time Folds — CreateSpacetimeFolds","text":"function creates train test sets taking (spatial /temporal) groups account. contrast nndm, requires groups already defined (e.g. spatial clusters blocks temporal units). Using \"class\" helpful case data clustered space categorical. E.g case land cover classifications training data come training polygons. case data split way entire polygons held back (spacevar=\"polygonID\") time distribution classes similar fold (class=\"LUC\").","code":""},{"path":"https://hannameyer.github.io/CAST/reference/CreateSpacetimeFolds.html","id":"note","dir":"Reference","previous_headings":"","what":"Note","title":"Create Space-time Folds — CreateSpacetimeFolds","text":"Standard k-fold cross-validation can lead considerable misinterpretation spatial-temporal modelling tasks. function can used prepare Leave-Location-, Leave-Time-Leave-Location--Time-cross-validation target-oriented validation strategies spatial-temporal prediction tasks. See Meyer et al. 
(2018) information.","code":""},{"path":"https://hannameyer.github.io/CAST/reference/CreateSpacetimeFolds.html","id":"references","dir":"Reference","previous_headings":"","what":"References","title":"Create Space-time Folds — CreateSpacetimeFolds","text":"Meyer, H., Reudenbach, C., Hengl, T., Katurji, M., Nauß, T. (2018): Improving performance spatio-temporal machine learning models using forward feature selection target-oriented validation. Environmental Modelling & Software 101: 1-9.","code":""},{"path":[]},{"path":"https://hannameyer.github.io/CAST/reference/CreateSpacetimeFolds.html","id":"author","dir":"Reference","previous_headings":"","what":"Author","title":"Create Space-time Folds — CreateSpacetimeFolds","text":"Hanna Meyer","code":""},{"path":"https://hannameyer.github.io/CAST/reference/CreateSpacetimeFolds.html","id":"ref-examples","dir":"Reference","previous_headings":"","what":"Examples","title":"Create Space-time Folds — CreateSpacetimeFolds","text":"","code":"if (FALSE) { dat <- readRDS(system.file(\"extdata\",\"Cookfarm.RDS\",package=\"CAST\")) ### Prepare for 10-fold Leave-Location-and-Time-Out cross validation indices <- CreateSpacetimeFolds(dat,\"SOURCEID\",\"Date\") str(indices) ### Prepare for 10-fold Leave-Location-Out cross validation indices <- CreateSpacetimeFolds(dat,spacevar=\"SOURCEID\") str(indices) ### Prepare for leave-One-Location-Out cross validation indices <- CreateSpacetimeFolds(dat,spacevar=\"SOURCEID\", k=length(unique(dat$SOURCEID))) str(indices) }"},{"path":"https://hannameyer.github.io/CAST/reference/DItoErrormetric.html","id":null,"dir":"Reference","previous_headings":"","what":"Model the relationship between the DI and the prediction error — DItoErrormetric","title":"Model the relationship between the DI and the prediction error — DItoErrormetric","text":"Performance metrics calculated moving windows DI values cross-validated training 
data","code":""},{"path":"https://hannameyer.github.io/CAST/reference/DItoErrormetric.html","id":"ref-usage","dir":"Reference","previous_headings":"","what":"Usage","title":"Model the relationship between the DI and the prediction error — DItoErrormetric","text":"","code":"DItoErrormetric( model, trainDI, multiCV = FALSE, length.out = 10, window.size = 5, calib = \"scam\", method = \"L2\", useWeight = TRUE, k = 6, m = 2 )"},{"path":"https://hannameyer.github.io/CAST/reference/DItoErrormetric.html","id":"arguments","dir":"Reference","previous_headings":"","what":"Arguments","title":"Model the relationship between the DI and the prediction error — DItoErrormetric","text":"model model used get AOA trainDI result trainDI aoa object aoa multiCV Logical. Re-run model fitting validation different CV strategies. See details. length.out Numeric. used multiCV=TRUE. Number cross-validation folds. See details. window.size Numeric. Size moving window. See rollapply. calib Character. Function model DI~performance relationship. Currently lm scam supported method Character. Method used distance calculation. Currently euclidean distance (L2) Mahalanobis distance (MD) implemented L2 tested. Note MD takes considerably longer. See ?aoa explanation useWeight Logical. model given. Weight variables according importance model? k Numeric. See mgcv::s m Numeric. 
See mgcv::s","code":""},{"path":"https://hannameyer.github.io/CAST/reference/DItoErrormetric.html","id":"value","dir":"Reference","previous_headings":"","what":"Value","title":"Model the relationship between the DI and the prediction error — DItoErrormetric","text":"scam linear model","code":""},{"path":"https://hannameyer.github.io/CAST/reference/DItoErrormetric.html","id":"details","dir":"Reference","previous_headings":"","what":"Details","title":"Model the relationship between the DI and the prediction error — DItoErrormetric","text":"multiCV=TRUE model re-fitted validated length.out new cross-validations cross-validation folds defined clusters predictor space, ranging three clusters LOOCV. Hence, large range DI values created cross-validation. AOA threshold based calibration data multiple CV larger original AOA threshold (likely extrapolation situations created CV), AOA threshold changes accordingly. See Meyer Pebesma (2021) full documentation methodology.","code":""},{"path":"https://hannameyer.github.io/CAST/reference/DItoErrormetric.html","id":"references","dir":"Reference","previous_headings":"","what":"References","title":"Model the relationship between the DI and the prediction error — DItoErrormetric","text":"Meyer, H., Pebesma, E. (2021): Predicting unknown space? Estimating area applicability spatial prediction models. 
doi:10.1111/2041-210X.13650","code":""},{"path":[]},{"path":"https://hannameyer.github.io/CAST/reference/DItoErrormetric.html","id":"author","dir":"Reference","previous_headings":"","what":"Author","title":"Model the relationship between the DI and the prediction error — DItoErrormetric","text":"Hanna Meyer, Marvin Ludwig","code":""},{"path":"https://hannameyer.github.io/CAST/reference/DItoErrormetric.html","id":"ref-examples","dir":"Reference","previous_headings":"","what":"Examples","title":"Model the relationship between the DI and the prediction error — DItoErrormetric","text":"","code":"if (FALSE) { library(CAST) library(sf) library(terra) library(caret) data(splotdata) splotdata <- st_drop_geometry(splotdata) predictors <- terra::rast(system.file(\"extdata\",\"predictors_chile.tif\", package=\"CAST\")) model <- caret::train(splotdata[,6:16], splotdata$Species_richness, ntree = 10, trControl = trainControl(method = \"cv\", savePredictions = TRUE)) AOA <- aoa(predictors, model) errormodel <- DItoErrormetric(model, AOA) plot(errormodel) expected_error = terra::predict(AOA$DI, errormodel) plot(expected_error) # with multiCV = TRUE errormodel = DItoErrormetric(model, AOA, multiCV = TRUE, length.out = 3) plot(errormodel) expected_error = terra::predict(AOA$DI, errormodel) plot(expected_error) # mask AOA based on new threshold from multiCV mask_aoa = terra::mask(expected_error, AOA$DI > attr(errormodel, 'AOA_threshold'), maskvalues = 1) plot(mask_aoa) }"},{"path":"https://hannameyer.github.io/CAST/reference/aoa.html","id":null,"dir":"Reference","previous_headings":"","what":"Area of Applicability — aoa","title":"Area of Applicability — aoa","text":"function estimates Dissimilarity Index (DI) derived Area Applicability (AOA) spatial prediction models considering distance new data (i.e. SpatRaster spatial predictors used models) predictor variable space data used model training. 
Predictors can weighted based internal variable importance machine learning algorithm used model training. AOA derived applying threshold DI (outlier-removed) maximum DI cross-validated training data.","code":""},{"path":"https://hannameyer.github.io/CAST/reference/aoa.html","id":"ref-usage","dir":"Reference","previous_headings":"","what":"Usage","title":"Area of Applicability — aoa","text":"","code":"aoa( newdata, model = NA, trainDI = NA, train = NULL, weight = NA, variables = \"all\", CVtest = NULL, CVtrain = NULL, method = \"L2\", useWeight = TRUE )"},{"path":"https://hannameyer.github.io/CAST/reference/aoa.html","id":"arguments","dir":"Reference","previous_headings":"","what":"Arguments","title":"Area of Applicability — aoa","text":"newdata SpatRaster, stars object data.frame containing data model meant make predictions . model train object created caret used extract weights (based variable importance) well cross-validation folds. See examples case model available models trained via e.g. mlr3. trainDI trainDI object. Optional trainDI calculated beforehand. train data.frame containing data used model training. Optional. required model given weight data.frame containing weights variable. Optional. required model given. variables character vector predictor variables. \"\" variables model used model given train dataset. CVtest list vector. Either list element contains data points used testing cross validation iteration (i.e. held back data). vector contains ID fold training point. required model given. CVtrain list. element contains data points used training cross validation iteration (i.e. held back data). required model given required CVtrain opposite CVtest (i.e. data point used testing, used training). Relevant data points excluded, e.g. using nndm. method Character. Method used distance calculation. Currently euclidean distance (L2) Mahalanobis distance (MD) implemented L2 tested. Note MD takes considerably longer. useWeight Logical. model given. 
Weight variables according importance model?","code":""},{"path":"https://hannameyer.github.io/CAST/reference/aoa.html","id":"value","dir":"Reference","previous_headings":"","what":"Value","title":"Area of Applicability — aoa","text":"object class aoa containing: parameters object class trainDI. see trainDI DI SpatRaster, stars object data frame. Dissimilarity index newdata AOA SpatRaster, stars object data frame. Area Applicability newdata. AOA values 0 (outside AOA) 1 (inside AOA)","code":""},{"path":"https://hannameyer.github.io/CAST/reference/aoa.html","id":"details","dir":"Reference","previous_headings":"","what":"Details","title":"Area of Applicability — aoa","text":"Dissimilarity Index (DI) corresponding Area Applicability (AOA) calculated. variables factors, dummy variables created prior weighting distance calculation. Interpretation results: location similar properties training data low distance predictor variable space (DI towards 0) locations different properties high DI. See Meyer Pebesma (2021) full documentation methodology.","code":""},{"path":"https://hannameyer.github.io/CAST/reference/aoa.html","id":"note","dir":"Reference","previous_headings":"","what":"Note","title":"Area of Applicability — aoa","text":"classification models used, currently variable importance can automatically retrieved models trained via train(predictors,response) via formula-interface. fixed.","code":""},{"path":"https://hannameyer.github.io/CAST/reference/aoa.html","id":"references","dir":"Reference","previous_headings":"","what":"References","title":"Area of Applicability — aoa","text":"Meyer, H., Pebesma, E. (2021): Predicting unknown space? Estimating area applicability spatial prediction models. Methods Ecology Evolution 12: 1620-1633. 
doi:10.1111/2041-210X.13650","code":""},{"path":[]},{"path":"https://hannameyer.github.io/CAST/reference/aoa.html","id":"author","dir":"Reference","previous_headings":"","what":"Author","title":"Area of Applicability — aoa","text":"Hanna Meyer","code":""},{"path":"https://hannameyer.github.io/CAST/reference/aoa.html","id":"ref-examples","dir":"Reference","previous_headings":"","what":"Examples","title":"Area of Applicability — aoa","text":"","code":"if (FALSE) { library(sf) library(terra) library(caret) library(viridis) # prepare sample data: dat <- readRDS(system.file(\"extdata\",\"Cookfarm.RDS\",package=\"CAST\")) dat <- aggregate(dat[,c(\"VW\",\"Easting\",\"Northing\")],by=list(as.character(dat$SOURCEID)),mean) pts <- st_as_sf(dat,coords=c(\"Easting\",\"Northing\")) pts$ID <- 1:nrow(pts) set.seed(100) pts <- pts[1:30,] studyArea <- rast(system.file(\"extdata\",\"predictors_2012-03-25.tif\",package=\"CAST\"))[[1:8]] trainDat <- extract(studyArea,pts,na.rm=FALSE) trainDat <- merge(trainDat,pts,by.x=\"ID\",by.y=\"ID\") # visualize data spatially: plot(studyArea) plot(studyArea$DEM) plot(pts[,1],add=TRUE,col=\"black\") # train a model: set.seed(100) variables <- c(\"DEM\",\"NDRE.Sd\",\"TWI\") model <- train(trainDat[,which(names(trainDat)%in%variables)], trainDat$VW, method=\"rf\", importance=TRUE, tuneLength=1, trControl=trainControl(method=\"cv\",number=5,savePredictions=T)) print(model) #note that this is a quite poor prediction model prediction <- predict(studyArea,model,na.rm=TRUE) plot(varImp(model,scale=FALSE)) #...then calculate the AOA of the trained model for the study area: AOA <- aoa(studyArea,model) plot(AOA) #### #The AOA can also be calculated without a trained model. 
#All variables are weighted equally in this case: #### AOA <- aoa(studyArea,train=trainDat,variables=variables) #### # The AOA can also be used for models trained via mlr3 (parameters have to be assigned manually): #### library(mlr3) library(mlr3learners) library(mlr3spatial) library(mlr3spatiotempcv) library(mlr3extralearners) # initiate and train model: train_df <- trainDat[, c(\"DEM\",\"NDRE.Sd\",\"TWI\", \"VW\")] backend <- as_data_backend(train_df) task <- as_task_regr(backend, target = \"VW\") lrn <- lrn(\"regr.randomForest\", importance = \"mse\") lrn$train(task) # cross-validation folds rsmp_cv <- rsmp(\"cv\", folds = 5L)$instantiate(task) ## predict: prediction <- predict(studyArea,lrn$model,na.rm=TRUE) ### Estimate AOA AOA <- aoa(studyArea, train = as.data.frame(task$data()), variables = task$feature_names, weight = data.frame(t(lrn$importance())), CVtest = rsmp_cv$instance[order(row_id)]$fold) }"},{"path":"https://hannameyer.github.io/CAST/reference/bss.html","id":null,"dir":"Reference","previous_headings":"","what":"Best subset feature selection — bss","title":"Best subset feature selection — bss","text":"Evaluate combinations predictors model training","code":""},{"path":"https://hannameyer.github.io/CAST/reference/bss.html","id":"ref-usage","dir":"Reference","previous_headings":"","what":"Usage","title":"Best subset feature selection — bss","text":"","code":"bss( predictors, response, method = \"rf\", metric = ifelse(is.factor(response), \"Accuracy\", \"RMSE\"), maximize = ifelse(metric == \"RMSE\", FALSE, TRUE), globalval = FALSE, trControl = caret::trainControl(), tuneLength = 3, tuneGrid = NULL, seed = 100, verbose = TRUE, ... )"},{"path":"https://hannameyer.github.io/CAST/reference/bss.html","id":"arguments","dir":"Reference","previous_headings":"","what":"Arguments","title":"Best subset feature selection — bss","text":"predictors see train response see train method see train metric see train maximize see train globalval Logical. 
models evaluated based 'global' performance? See global_validation trControl see train tuneLength see train tuneGrid see train seed random number verbose Logical. information progress printed? ... arguments passed classification regression routine (randomForest).","code":""},{"path":"https://hannameyer.github.io/CAST/reference/bss.html","id":"value","dir":"Reference","previous_headings":"","what":"Value","title":"Best subset feature selection — bss","text":"list class train. Beside usual train content object contains vector \"selectedvars\" \"selectedvars_perf\" give best variables selected well corresponding performance. also contains \"perf_all\" gives performance model runs.","code":""},{"path":"https://hannameyer.github.io/CAST/reference/bss.html","id":"details","dir":"Reference","previous_headings":"","what":"Details","title":"Best subset feature selection — bss","text":"bss alternative ffs ideal training set small. Models iteratively fitted using different combinations predictor variables. Hence, 2^X models calculated. try running bss large datasets computation time much higher compared ffs. internal cross validation can run parallel. See information parallel processing carets train functions details.","code":""},{"path":"https://hannameyer.github.io/CAST/reference/bss.html","id":"note","dir":"Reference","previous_headings":"","what":"Note","title":"Best subset feature selection — bss","text":"variable selection particularly suitable spatial cross validations variable selection MUST based performance model predicting new spatial units. Note bss slow since combinations variables tested. 
time efficient alternative forward feature selection (ffs) (ffs).","code":""},{"path":[]},{"path":"https://hannameyer.github.io/CAST/reference/bss.html","id":"author","dir":"Reference","previous_headings":"","what":"Author","title":"Best subset feature selection — bss","text":"Hanna Meyer","code":""},{"path":"https://hannameyer.github.io/CAST/reference/bss.html","id":"ref-examples","dir":"Reference","previous_headings":"","what":"Examples","title":"Best subset feature selection — bss","text":"","code":"if (FALSE) { data(iris) bssmodel <- bss(iris[,1:4],iris$Species) bssmodel$perf_all }"},{"path":"https://hannameyer.github.io/CAST/reference/calibrate_aoa.html","id":null,"dir":"Reference","previous_headings":"","what":"Calibrate the AOA based on the relationship between the DI and the prediction error — calibrate_aoa","title":"Calibrate the AOA based on the relationship between the DI and the prediction error — calibrate_aoa","text":"Performance metrics calculated moving windows DI values cross-validated training data","code":""},{"path":"https://hannameyer.github.io/CAST/reference/calibrate_aoa.html","id":"ref-usage","dir":"Reference","previous_headings":"","what":"Usage","title":"Calibrate the AOA based on the relationship between the DI and the prediction error — calibrate_aoa","text":"","code":"calibrate_aoa( AOA, model, window.size = 5, calib = \"scam\", multiCV = FALSE, length.out = 10, maskAOA = TRUE, method = \"L2\", useWeight = TRUE, showPlot = TRUE, k = 6, m = 2 )"},{"path":"https://hannameyer.github.io/CAST/reference/calibrate_aoa.html","id":"arguments","dir":"Reference","previous_headings":"","what":"Arguments","title":"Calibrate the AOA based on the relationship between the DI and the prediction error — calibrate_aoa","text":"AOA result aoa model model used get AOA window.size Numeric. Size moving window. See rollapply. calib Character. Function model DI~performance relationship. Currently lm scam supported multiCV Logical. 
Re-run model fitting validation different CV strategies. See details. length.out Numeric. used multiCV=TRUE. Number cross-validation folds. See details. maskAOA Logical. areas outside AOA set NA? method Character. Method used distance calculation. Currently euclidean distance (L2) Mahalanobis distance (MD) implemented L2 tested. Note MD takes considerably longer. See ?aoa explanation useWeight Logical. model given. Weight variables according importance model? showPlot Logical. k Numeric. See mgcv::s m Numeric. See mgcv::s","code":""},{"path":"https://hannameyer.github.io/CAST/reference/calibrate_aoa.html","id":"value","dir":"Reference","previous_headings":"","what":"Value","title":"Calibrate the AOA based on the relationship between the DI and the prediction error — calibrate_aoa","text":"list length 2 elements \"AOA\": SpatRaster stars object contains original DI AOA (might updated new test data indicate option), well expected performance based relationship. Data used calibration stored attributes. second element plot showing relationship.","code":""},{"path":"https://hannameyer.github.io/CAST/reference/calibrate_aoa.html","id":"details","dir":"Reference","previous_headings":"","what":"Details","title":"Calibrate the AOA based on the relationship between the DI and the prediction error — calibrate_aoa","text":"multiCV=TRUE model re-fitted validated length.out new cross-validations cross-validation folds defined clusters predictor space, ranging three clusters LOOCV. Hence, large range DI values created cross-validation. AOA threshold based calibration data multiple CV larger original AOA threshold (likely extrapolation situations created CV), AOA updated accordingly. 
See Meyer Pebesma (2021) full documentation methodology.","code":""},{"path":"https://hannameyer.github.io/CAST/reference/calibrate_aoa.html","id":"references","dir":"Reference","previous_headings":"","what":"References","title":"Calibrate the AOA based on the relationship between the DI and the prediction error — calibrate_aoa","text":"Meyer, H., Pebesma, E. (2021): Predicting unknown space? Estimating area applicability spatial prediction models. doi:10.1111/2041-210X.13650","code":""},{"path":[]},{"path":"https://hannameyer.github.io/CAST/reference/calibrate_aoa.html","id":"author","dir":"Reference","previous_headings":"","what":"Author","title":"Calibrate the AOA based on the relationship between the DI and the prediction error — calibrate_aoa","text":"Hanna Meyer","code":""},{"path":"https://hannameyer.github.io/CAST/reference/calibrate_aoa.html","id":"ref-examples","dir":"Reference","previous_headings":"","what":"Examples","title":"Calibrate the AOA based on the relationship between the DI and the prediction error — calibrate_aoa","text":"","code":"if (FALSE) { library(sf) library(terra) library(caret) library(viridis) library(latticeExtra) #' # prepare sample data: dat <- readRDS(system.file(\"extdata\",\"Cookfarm.RDS\",package=\"CAST\")) dat <- aggregate(dat[,c(\"VW\",\"Easting\",\"Northing\")],by=list(as.character(dat$SOURCEID)),mean) pts <- st_as_sf(dat,coords=c(\"Easting\",\"Northing\")) pts$ID <- 1:nrow(pts) studyArea <- rast(system.file(\"extdata\",\"predictors_2012-03-25.tif\",package=\"CAST\"))[[1:8]] dat <- extract(studyArea,pts,na.rm=TRUE) trainDat <- merge(dat,pts,by.x=\"ID\",by.y=\"ID\") # train a model: variables <- c(\"DEM\",\"NDRE.Sd\",\"TWI\") set.seed(100) model <- train(trainDat[,which(names(trainDat)%in%variables)], trainDat$VW,method=\"rf\",importance=TRUE,tuneLength=1, trControl=trainControl(method=\"cv\",number=5,savePredictions=TRUE)) #...then calculate the AOA of the trained model for the study area: AOA <- aoa(studyArea,model) 
AOA_new <- calibrate_aoa(AOA,model) plot(AOA_new$AOA$expected_RMSE) }"},{"path":"https://hannameyer.github.io/CAST/reference/clustered_sample.html","id":null,"dir":"Reference","previous_headings":"","what":"Clustered samples simulation — clustered_sample","title":"Clustered samples simulation — clustered_sample","text":"simple procedure simulate clustered points based two-step sampling.","code":""},{"path":"https://hannameyer.github.io/CAST/reference/clustered_sample.html","id":"ref-usage","dir":"Reference","previous_headings":"","what":"Usage","title":"Clustered samples simulation — clustered_sample","text":"","code":"clustered_sample(sarea, nsamples, nparents, radius)"},{"path":"https://hannameyer.github.io/CAST/reference/clustered_sample.html","id":"arguments","dir":"Reference","previous_headings":"","what":"Arguments","title":"Clustered samples simulation — clustered_sample","text":"sarea polygon. Area samples simulated. nsamples integer. Number samples simulated. nparents integer. Number parents. radius integer. Radius buffer around parent offspring simulation.","code":""},{"path":"https://hannameyer.github.io/CAST/reference/clustered_sample.html","id":"value","dir":"Reference","previous_headings":"","what":"Value","title":"Clustered samples simulation — clustered_sample","text":"sf object simulated points parent point belongs .","code":""},{"path":"https://hannameyer.github.io/CAST/reference/clustered_sample.html","id":"details","dir":"Reference","previous_headings":"","what":"Details","title":"Clustered samples simulation — clustered_sample","text":"simple procedure simulate clustered points based two-step sampling. First, pre-specified number parents simulated using random sampling. 
parent, `(nsamples-nparents)/nparents` simulated within radius parent point using random sampling.","code":""},{"path":"https://hannameyer.github.io/CAST/reference/clustered_sample.html","id":"author","dir":"Reference","previous_headings":"","what":"Author","title":"Clustered samples simulation — clustered_sample","text":"Carles Milà","code":""},{"path":"https://hannameyer.github.io/CAST/reference/clustered_sample.html","id":"ref-examples","dir":"Reference","previous_headings":"","what":"Examples","title":"Clustered samples simulation — clustered_sample","text":"","code":"# Simulate 100 points in a 100x100 square with 5 parents and a radius of 10. library(sf) #> Linking to GEOS 3.10.2, GDAL 3.4.1, PROJ 8.2.1; sf_use_s2() is TRUE library(ggplot2) set.seed(1234) simarea <- list(matrix(c(0,0,0,100,100,100,100,0,0,0), ncol=2, byrow=TRUE)) simarea <- sf::st_polygon(simarea) simpoints <- clustered_sample(simarea, 100, 5, 10) simpoints$parent <- as.factor(simpoints$parent) ggplot() + geom_sf(data = simarea, alpha = 0) + geom_sf(data = simpoints, aes(col = parent))"},{"path":"https://hannameyer.github.io/CAST/reference/errorModel.html","id":null,"dir":"Reference","previous_headings":"","what":"Model expected error between Metric and DI — errorModel","title":"Model expected error between Metric and DI — errorModel","text":"Model expected error Metric DI","code":""},{"path":"https://hannameyer.github.io/CAST/reference/errorModel.html","id":"ref-usage","dir":"Reference","previous_headings":"","what":"Usage","title":"Model expected error between Metric and DI — errorModel","text":"","code":"errorModel(preds_all, model, window.size, calib, k, m)"},{"path":"https://hannameyer.github.io/CAST/reference/errorModel.html","id":"arguments","dir":"Reference","previous_headings":"","what":"Arguments","title":"Model expected error between Metric and DI — errorModel","text":"preds_all data.frame: pred, obs, DI model model used get AOA window.size Numeric. Size moving window. 
See rollapply. calib Character. Function model DI~performance relationship. Currently lm scam supported k Numeric. See mgcv::s m Numeric. See mgcv::s","code":""},{"path":"https://hannameyer.github.io/CAST/reference/errorModel.html","id":"value","dir":"Reference","previous_headings":"","what":"Value","title":"Model expected error between Metric and DI — errorModel","text":"scam lm","code":""},{"path":"https://hannameyer.github.io/CAST/reference/ffs.html","id":null,"dir":"Reference","previous_headings":"","what":"Forward feature selection — ffs","title":"Forward feature selection — ffs","text":"simple forward feature selection algorithm","code":""},{"path":"https://hannameyer.github.io/CAST/reference/ffs.html","id":"ref-usage","dir":"Reference","previous_headings":"","what":"Usage","title":"Forward feature selection — ffs","text":"","code":"ffs( predictors, response, method = \"rf\", metric = ifelse(is.factor(response), \"Accuracy\", \"RMSE\"), maximize = ifelse(metric == \"RMSE\", FALSE, TRUE), globalval = FALSE, withinSE = FALSE, minVar = 2, trControl = caret::trainControl(), tuneLength = 3, tuneGrid = NULL, seed = sample(1:1000, 1), verbose = TRUE, ... )"},{"path":"https://hannameyer.github.io/CAST/reference/ffs.html","id":"arguments","dir":"Reference","previous_headings":"","what":"Arguments","title":"Forward feature selection — ffs","text":"predictors see train response see train method see train metric see train maximize see train globalval Logical. models evaluated based 'global' performance? See global_validation withinSE Logical Models selected better currently best models Standard error minVar Numeric. Number variables combine first selection. See Details. trControl see train tuneLength see train tuneGrid see train seed random number used model training verbose Logical. information progress printed? ... 
arguments passed classification regression routine (randomForest).","code":""},{"path":"https://hannameyer.github.io/CAST/reference/ffs.html","id":"value","dir":"Reference","previous_headings":"","what":"Value","title":"Forward feature selection — ffs","text":"list class train. Beside usual train content object contains vector \"selectedvars\" \"selectedvars_perf\" give order best variables selected well corresponding performance (starting first two variables). also contains \"perf_all\" gives performance model runs.","code":""},{"path":"https://hannameyer.github.io/CAST/reference/ffs.html","id":"details","dir":"Reference","previous_headings":"","what":"Details","title":"Forward feature selection — ffs","text":"Models two predictors first trained using possible pairs predictor variables. best model initial models kept. basis best model predictor variables iteratively increased remaining variables tested improvement currently best model. process stops none remaining variables increases model performance added current best model. internal cross validation can run parallel. See information parallel processing carets train functions details. Using withinSE favour models less variables probably shorten calculation time Per Default, ffs starts possible 2-pair combinations. minVar allows start selection 2 variables, e.g. minVar=3 starts ffs testing combinations 3 (instead 2) variables first increasing number. important e.g. neural networks often make sense two variables. also relevant assumed optimal variables can found 2 considered time.","code":""},{"path":"https://hannameyer.github.io/CAST/reference/ffs.html","id":"note","dir":"Reference","previous_headings":"","what":"Note","title":"Forward feature selection — ffs","text":"variable selection particularly suitable spatial cross validations variable selection MUST based performance model predicting new spatial units. See Meyer et al. (2018) Meyer et al. 
(2019) details.","code":""},{"path":"https://hannameyer.github.io/CAST/reference/ffs.html","id":"references","dir":"Reference","previous_headings":"","what":"References","title":"Forward feature selection — ffs","text":"Gasch, C.K., Hengl, T., Gräler, B., Meyer, H., Magney, T., Brown, D.J. (2015): Spatio-temporal interpolation soil water, temperature, electrical conductivity 3D+T: Cook Agronomy Farm data set. Spatial Statistics 14: 70-90. Meyer, H., Reudenbach, C., Hengl, T., Katurji, M., Nauß, T. (2018): Improving performance spatio-temporal machine learning models using forward feature selection target-oriented validation. Environmental Modelling & Software 101: 1-9. doi:10.1016/j.envsoft.2017.12.001 Meyer, H., Reudenbach, C., Wöllauer, S., Nauss, T. (2019): Importance spatial predictor variable selection machine learning applications - Moving data reproduction spatial prediction. Ecological Modelling. 411, 108815. doi:10.1016/j.ecolmodel.2019.108815 . Ludwig, M., Moreno-Martinez, ., Hölzel, N., Pebesma, E., Meyer, H. (2023): Assessing improving transferability current global spatial prediction models. Global Ecology Biogeography. doi:10.1111/geb.13635 .","code":""},{"path":[]},{"path":"https://hannameyer.github.io/CAST/reference/ffs.html","id":"author","dir":"Reference","previous_headings":"","what":"Author","title":"Forward feature selection — ffs","text":"Hanna Meyer","code":""},{"path":"https://hannameyer.github.io/CAST/reference/ffs.html","id":"ref-examples","dir":"Reference","previous_headings":"","what":"Examples","title":"Forward feature selection — ffs","text":"","code":"if (FALSE) { data(iris) ffsmodel <- ffs(iris[,1:4],iris$Species) ffsmodel$selectedvars ffsmodel$selectedvars_perf } # or perform model with target-oriented validation (LLO CV) #the example is described in Gasch et al. (2015). The ffs approach for this dataset is described in #Meyer et al. (2018). 
Due to high computation time needed, only a small and thus not robust example #is shown here. if (FALSE) { #run the model on three cores: library(doParallel) library(lubridate) cl <- makeCluster(3) registerDoParallel(cl) #load and prepare dataset: dat <- readRDS(system.file(\"extdata\",\"Cookfarm.RDS\",package=\"CAST\")) trainDat <- dat[dat$altitude==-0.3&year(dat$Date)==2012&week(dat$Date)%in%c(13:14),] #visualize dataset: ggplot(data = trainDat, aes(x=Date, y=VW)) + geom_line(aes(colour=SOURCEID)) #create folds for Leave Location Out Cross Validation: set.seed(10) indices <- CreateSpacetimeFolds(trainDat,spacevar = \"SOURCEID\",k=3) ctrl <- trainControl(method=\"cv\",index = indices$index) #define potential predictors: predictors <- c(\"DEM\",\"TWI\",\"BLD\",\"Precip_cum\",\"cday\",\"MaxT_wrcc\", \"Precip_wrcc\",\"NDRE.M\",\"Bt\",\"MinT_wrcc\",\"Northing\",\"Easting\") #run ffs model with Leave Location out CV set.seed(10) ffsmodel <- ffs(trainDat[,predictors],trainDat$VW,method=\"rf\", tuneLength=1,trControl=ctrl) ffsmodel plot(ffsmodel) #or only selected variables: plot(ffsmodel,plotType=\"selected\") #compare to model without ffs: model <- train(trainDat[,predictors],trainDat$VW,method=\"rf\", tuneLength=1, trControl=ctrl) model stopCluster(cl) }"},{"path":"https://hannameyer.github.io/CAST/reference/geodist.html","id":null,"dir":"Reference","previous_headings":"","what":"Calculate euclidean nearest neighbor distances in geographic space or feature space — geodist","title":"Calculate euclidean nearest neighbor distances in geographic space or feature space — geodist","text":"Calculates nearest neighbor distances geographic space feature space training data well training data prediction locations. 
Optional, nearest neighbor distances training data test data training data CV iterations computed.","code":""},{"path":"https://hannameyer.github.io/CAST/reference/geodist.html","id":"ref-usage","dir":"Reference","previous_headings":"","what":"Usage","title":"Calculate euclidean nearest neighbor distances in geographic space or feature space — geodist","text":"","code":"geodist( x, modeldomain, type = \"geo\", cvfolds = NULL, cvtrain = NULL, testdata = NULL, preddata = NULL, samplesize = 2000, sampling = \"regular\", variables = NULL )"},{"path":"https://hannameyer.github.io/CAST/reference/geodist.html","id":"arguments","dir":"Reference","previous_headings":"","what":"Arguments","title":"Calculate euclidean nearest neighbor distances in geographic space or feature space — geodist","text":"x object class sf, training data locations modeldomain SpatRaster, stars sf object defining prediction area (see Details) type \"geo\" \"feature\". distance computed geographic space normalized multivariate predictor space (see Details) cvfolds optional. list vector. Either list element contains data points used testing cross validation iteration (.e. held back data). vector contains ID fold training point. See e.g. ?createFolds ?CreateSpacetimeFolds ?nndm cvtrain optional. List row indices x fit model CV iteration. cvtrain null cvfolds , samples included cvfolds used training data testdata optional. object class sf: Point data used independent validation preddata optional. object class sf: Point data indicating locations within modeldomain used target prediction points. Useful prediction objective subset locations within modeldomain rather whole area. samplesize numeric. many prediction samples used? sampling character. draw prediction samples? See spsample. Use sampling = \"Fibonacci\" global applications. variables character vector defining predictor variables used type=\"feature\". 
provided variables included modeldomain used.","code":""},{"path":"https://hannameyer.github.io/CAST/reference/geodist.html","id":"value","dir":"Reference","previous_headings":"","what":"Value","title":"Calculate euclidean nearest neighbor distances in geographic space or feature space — geodist","text":"data.frame containing distances. Unit returned geographic distances meters. attributes contain W statistic prediction area either sample data, CV folds test data. See details.","code":""},{"path":"https://hannameyer.github.io/CAST/reference/geodist.html","id":"details","dir":"Reference","previous_headings":"","what":"Details","title":"Calculate euclidean nearest neighbor distances in geographic space or feature space — geodist","text":"modeldomain sf polygon raster defines prediction area. function takes regular point sample (amount defined samplesize) spatial extent. type = \"feature\", argument modeldomain (provided also testdata /preddata) include predictors. Predictor values x, testdata preddata optional modeldomain raster. provided extracted modeldomain rasterStack. W statistic describes match distributions. 
See Linnenbrink et al (2023) details.","code":""},{"path":"https://hannameyer.github.io/CAST/reference/geodist.html","id":"note","dir":"Reference","previous_headings":"","what":"Note","title":"Calculate euclidean nearest neighbor distances in geographic space or feature space — geodist","text":"See Meyer Pebesma (2022) application plotting function","code":""},{"path":[]},{"path":"https://hannameyer.github.io/CAST/reference/geodist.html","id":"author","dir":"Reference","previous_headings":"","what":"Author","title":"Calculate euclidean nearest neighbor distances in geographic space or feature space — geodist","text":"Hanna Meyer, Edzer Pebesma, Marvin Ludwig","code":""},{"path":"https://hannameyer.github.io/CAST/reference/geodist.html","id":"ref-examples","dir":"Reference","previous_headings":"","what":"Examples","title":"Calculate euclidean nearest neighbor distances in geographic space or feature space — geodist","text":"","code":"if (FALSE) { library(CAST) library(sf) library(terra) library(caret) library(rnaturalearth) library(ggplot2) data(splotdata) studyArea <- rnaturalearth::ne_countries(continent = \"South America\", returnclass = \"sf\") ########### Distance between training data and new data: dist <- geodist(splotdata, studyArea) plot(dist) ########### Distance between training data, new data and test data (here Chile): plot(splotdata[,\"Country\"]) dist <- geodist(splotdata[splotdata$Country != \"Chile\",], studyArea, testdata = splotdata[splotdata$Country == \"Chile\",]) plot(dist) ########### Distance between training data, new data and CV folds: folds <- createFolds(1:nrow(splotdata), k=3, returnTrain=FALSE) dist <- geodist(x=splotdata, modeldomain=studyArea, cvfolds=folds) plot(dist) ########### Distances in the feature space: predictors <- terra::rast(system.file(\"extdata\",\"predictors_chile.tif\", package=\"CAST\")) dist <- geodist(x = splotdata, modeldomain = predictors, type = \"feature\", variables = c(\"bio_1\",\"bio_12\", \"elev\")) 
plot(dist) dist <- geodist(x = splotdata[splotdata$Country != \"Chile\",], modeldomain = predictors, cvfolds = folds, testdata = splotdata[splotdata$Country == \"Chile\",], type = \"feature\", variables=c(\"bio_1\",\"bio_12\", \"elev\")) plot(dist) ############ Example for a random global dataset ############ (refer to figure in Meyer and Pebesma 2022) ### Define prediction area (here: global): ee <- st_crs(\"+proj=eqearth\") co <- ne_countries(returnclass = \"sf\") co.ee <- st_transform(co, ee) ### Simulate a spatial random sample ### (alternatively replace pts_random by a real sampling dataset (see Meyer and Pebesma 2022): sf_use_s2(FALSE) pts_random <- st_sample(co.ee, 2000, exact=FALSE) ### See points on the map: ggplot() + geom_sf(data = co.ee, fill=\"#00BFC4\",col=\"#00BFC4\") + geom_sf(data = pts_random, color = \"#F8766D\",size=0.5, shape=3) + guides(fill = \"none\", col = \"none\") + labs(x = NULL, y = NULL) ### plot distances: dist <- geodist(pts_random,co.ee) plot(dist) + scale_x_log10(labels=round) }"},{"path":"https://hannameyer.github.io/CAST/reference/get_preds_all.html","id":null,"dir":"Reference","previous_headings":"","what":"Get Preds all — get_preds_all","title":"Get Preds all — get_preds_all","text":"Get Preds ","code":""},{"path":"https://hannameyer.github.io/CAST/reference/get_preds_all.html","id":"ref-usage","dir":"Reference","previous_headings":"","what":"Usage","title":"Get Preds all — get_preds_all","text":"","code":"get_preds_all(model, trainDI)"},{"path":"https://hannameyer.github.io/CAST/reference/get_preds_all.html","id":"arguments","dir":"Reference","previous_headings":"","what":"Arguments","title":"Get Preds all — get_preds_all","text":"model, model trainDI, trainDI","code":""},{"path":"https://hannameyer.github.io/CAST/reference/global_validation.html","id":null,"dir":"Reference","previous_headings":"","what":"Evaluate 'global' cross-validation — global_validation","title":"Evaluate 'global' cross-validation — 
global_validation","text":"Calculate validation metric using held back predictions ","code":""},{"path":"https://hannameyer.github.io/CAST/reference/global_validation.html","id":"ref-usage","dir":"Reference","previous_headings":"","what":"Usage","title":"Evaluate 'global' cross-validation — global_validation","text":"","code":"global_validation(model)"},{"path":"https://hannameyer.github.io/CAST/reference/global_validation.html","id":"arguments","dir":"Reference","previous_headings":"","what":"Arguments","title":"Evaluate 'global' cross-validation — global_validation","text":"model object class train","code":""},{"path":"https://hannameyer.github.io/CAST/reference/global_validation.html","id":"value","dir":"Reference","previous_headings":"","what":"Value","title":"Evaluate 'global' cross-validation — global_validation","text":"regression (postResample) classification (confusionMatrix) statistics","code":""},{"path":"https://hannameyer.github.io/CAST/reference/global_validation.html","id":"details","dir":"Reference","previous_headings":"","what":"Details","title":"Evaluate 'global' cross-validation — global_validation","text":"Relevant folds representative entire area interest. case, metrics like R2 meaningful since reflect general ability model explain entire gradient response. 
Comparable LOOCV, predictions held back folds used together calculate validation statistics.","code":""},{"path":[]},{"path":"https://hannameyer.github.io/CAST/reference/global_validation.html","id":"author","dir":"Reference","previous_headings":"","what":"Author","title":"Evaluate 'global' cross-validation — global_validation","text":"Hanna Meyer","code":""},{"path":"https://hannameyer.github.io/CAST/reference/global_validation.html","id":"ref-examples","dir":"Reference","previous_headings":"","what":"Examples","title":"Evaluate 'global' cross-validation — global_validation","text":"","code":"dat <- readRDS(system.file(\"extdata\",\"Cookfarm.RDS\",package=\"CAST\")) dat <- dat[sample(1:nrow(dat),500),] indices <- CreateSpacetimeFolds(dat,\"SOURCEID\",\"Date\") ctrl <- caret::trainControl(method=\"cv\",index = indices$index,savePredictions=\"final\") model <- caret::train(dat[,c(\"DEM\",\"TWI\",\"BLD\")],dat$VW, method=\"rf\", trControl=ctrl, ntree=10) #> note: only 2 unique complexity parameters in default grid. Truncating the grid to 2 . 
#> #> Loading required package: lattice global_validation(model) #> RMSE Rsquared MAE #> 0.08848113 0.13992098 0.06953367"},{"path":"https://hannameyer.github.io/CAST/reference/knndm.html","id":null,"dir":"Reference","previous_headings":"","what":"K-fold Nearest Neighbour Distance Matching — knndm","title":"K-fold Nearest Neighbour Distance Matching — knndm","text":"function implements kNNDM algorithm returns necessary indices perform k-fold NNDM CV map validation.","code":""},{"path":"https://hannameyer.github.io/CAST/reference/knndm.html","id":"ref-usage","dir":"Reference","previous_headings":"","what":"Usage","title":"K-fold Nearest Neighbour Distance Matching — knndm","text":"","code":"knndm( tpoints, modeldomain = NULL, ppoints = NULL, space = \"geographical\", k = 10, maxp = 0.5, clustering = \"hierarchical\", linkf = \"ward.D2\", samplesize = 1000, sampling = \"regular\" )"},{"path":"https://hannameyer.github.io/CAST/reference/knndm.html","id":"arguments","dir":"Reference","previous_headings":"","what":"Arguments","title":"K-fold Nearest Neighbour Distance Matching — knndm","text":"tpoints sf sfc point object. Contains training points samples. modeldomain sf polygon object defining prediction area. Optional; alternative ppoints (see Details). ppoints sf sfc point object. Contains target prediction points. Optional; alternative modeldomain (see Details). space character. \"geographical\" knndm, .e. kNNDM geographical space, currently implemented. k integer. Number folds desired CV. Defaults 10. maxp numeric. Maximum fold size allowed, defaults 0.5, .e. single fold can hold maximum half training points. clustering character. Possible values include \"hierarchical\" \"kmeans\". See details. linkf character. relevant clustering = \"hierarchical\". Link function agglomerative hierarchical clustering. Defaults \"ward.D2\". Check `stats::hclust` options. samplesize numeric. many points modeldomain sampled prediction points? 
required modeldomain used instead ppoints. sampling character. draw prediction points modeldomain? See `sf::st_sample`. required modeldomain used instead ppoints.","code":""},{"path":"https://hannameyer.github.io/CAST/reference/knndm.html","id":"value","dir":"Reference","previous_headings":"","what":"Value","title":"K-fold Nearest Neighbour Distance Matching — knndm","text":"object class knndm consisting list eight elements: indx_train, indx_test (indices observations use training/test data kNNDM CV iteration), Gij (distances G function construction prediction target points), Gj (distances G function construction LOO CV), Gjstar (distances modified G function kNNDM CV), clusters (list cluster IDs), W (Wasserstein statistic), space (stated user function call).","code":""},{"path":"https://hannameyer.github.io/CAST/reference/knndm.html","id":"details","dir":"Reference","previous_headings":"","what":"Details","title":"K-fold Nearest Neighbour Distance Matching — knndm","text":"knndm k-fold version NNDM LOO CV medium large datasets. Briefly, algorithm tries find k-fold configuration integral absolute differences (Wasserstein W statistic) empirical nearest neighbour distance distribution function test training data CV (Gj*), empirical nearest neighbour distance distribution function prediction training points (Gij), minimised. performing clustering training points' coordinates different numbers clusters range k N (number observations), merging k final folds, selecting configuration lowest W. Using projected CRS `knndm` large computational advantages since fast nearest neighbour search can done via `FNN` package, working geographic coordinates requires computing full spherical distance matrices. clustering algorithm, `kmeans` can used projected CRS `hierarchical` can work projected geographical coordinates, though requires calculating full distance matrix training points even projected CRS. 
order select clustering algorithms number folds `k`, different `knndm` configurations can run compared, one lower W statistic one offers better match. W statistics `knndm` runs comparable long `tpoints` `ppoints` `modeldomain` stay . Map validation using knndm used using `CAST::global_validation`, .e. stacking --sample predictions evaluating . reasons behind 1) resulting folds can unbalanced 2) nearest neighbour functions constructed matched using CV folds simultaneously. training data points clustered respect prediction area presented knndm configuration still show signs Gj* > Gij, several things can tried. First, increase `maxp` parameter; may help control strong clustering (cost unbalanced folds). Secondly, decrease number final folds `k`, may help larger clusters. `modeldomain` sf polygon defines prediction area. function takes regular point sample (amount defined `samplesize`) spatial extent. alternative use `ppoints` instead `modeldomain`, already defined prediction locations (e.g. raster pixel centroids). using either `modeldomain` `ppoints`, advise plot study area polygon training/prediction points previous step ensure aligned.","code":""},{"path":"https://hannameyer.github.io/CAST/reference/knndm.html","id":"note","dir":"Reference","previous_headings":"","what":"Note","title":"K-fold Nearest Neighbour Distance Matching — knndm","text":"Experimental cycle. Article describing testing algorithm preparation.","code":""},{"path":"https://hannameyer.github.io/CAST/reference/knndm.html","id":"references","dir":"Reference","previous_headings":"","what":"References","title":"K-fold Nearest Neighbour Distance Matching — knndm","text":"Linnenbrink, J., Milà, C., Ludwig, M., Meyer, H.: kNNDM: k-fold Nearest Neighbour Distance Matching Cross-Validation map accuracy estimation, EGUsphere [preprint], https://doi.org/10.5194/egusphere-2023-1308, 2023. Milà, C., Mateu, J., Pebesma, E., Meyer, H. 
(2022): Nearest Neighbour Distance Matching Leave-One-Cross-Validation map validation. Methods Ecology Evolution 00, 1– 13.","code":""},{"path":[]},{"path":"https://hannameyer.github.io/CAST/reference/knndm.html","id":"author","dir":"Reference","previous_headings":"","what":"Author","title":"K-fold Nearest Neighbour Distance Matching — knndm","text":"Carles Milà Jan Linnenbrink","code":""},{"path":"https://hannameyer.github.io/CAST/reference/knndm.html","id":"ref-examples","dir":"Reference","previous_headings":"","what":"Examples","title":"K-fold Nearest Neighbour Distance Matching — knndm","text":"","code":"######################################################################## # Example 1: Simulated data - Randomly-distributed training points ######################################################################## library(sf) library(ggplot2) # Simulate 1000 random training points in a 100x100 square set.seed(1234) simarea <- list(matrix(c(0,0,0,100,100,100,100,0,0,0), ncol=2, byrow=TRUE)) simarea <- sf::st_polygon(simarea) train_points <- sf::st_sample(simarea, 1000, type = \"random\") pred_points <- sf::st_sample(simarea, 1000, type = \"regular\") plot(simarea) plot(pred_points, add = TRUE, col = \"blue\") plot(train_points, add = TRUE, col = \"red\") # Run kNNDM for the whole domain, here the prediction points are known. knndm_folds <- knndm(train_points, ppoints = pred_points, k = 5) #> Warning: Missing CRS in training or prediction points. Assuming projected CRS. 
#> Gij <= Gj; a random CV assignment is returned knndm_folds #> knndm object #> Space: geographical #> Clustering algorithm: hierarchical #> Intermediate clusters (q): random CV #> W statistic: 0.1338 #> Number of folds: 5 #> Observations in each fold: 200 200 200 200 200 plot(knndm_folds) folds <- as.character(knndm_folds$clusters) ggplot() + geom_sf(data = simarea, alpha = 0) + geom_sf(data = train_points, aes(col = folds)) ######################################################################## # Example 2: Simulated data - Clustered training points ######################################################################## if (FALSE) { library(sf) library(ggplot2) # Simulate 1000 clustered training points in a 100x100 square set.seed(1234) simarea <- list(matrix(c(0,0,0,100,100,100,100,0,0,0), ncol=2, byrow=TRUE)) simarea <- sf::st_polygon(simarea) train_points <- clustered_sample(simarea, 1000, 50, 5) pred_points <- sf::st_sample(simarea, 1000, type = \"regular\") plot(simarea) plot(pred_points, add = TRUE, col = \"blue\") plot(train_points, add = TRUE, col = \"red\") # Run kNNDM for the whole domain, here the prediction points are known. 
knndm_folds <- knndm(train_points, ppoints = pred_points, k = 5) knndm_folds plot(knndm_folds) folds <- as.character(knndm_folds$clusters) ggplot() + geom_sf(data = simarea, alpha = 0) + geom_sf(data = train_points, aes(col = folds)) } ######################################################################## # Example 3: Real-world example; using a modeldomain instead of previously # sampled prediction locations ######################################################################## if (FALSE) { library(sf) library(terra) library(ggplot2) ### prepare sample data: dat <- readRDS(system.file(\"extdata\",\"Cookfarm.RDS\",package=\"CAST\")) dat <- aggregate(dat[,c(\"DEM\",\"TWI\", \"NDRE.M\", \"Easting\", \"Northing\",\"VW\")], by=list(as.character(dat$SOURCEID)),mean) pts <- dat[,-1] pts <- st_as_sf(pts,coords=c(\"Easting\",\"Northing\")) st_crs(pts) <- 26911 studyArea <- rast(system.file(\"extdata\",\"predictors_2012-03-25.tif\",package=\"CAST\")) studyArea[!is.na(studyArea)] <- 1 studyArea <- as.polygons(studyArea, values = FALSE, na.all = TRUE) |> st_as_sf() |> st_union() pts <- st_transform(pts, crs = st_crs(studyArea)) plot(studyArea) plot(st_geometry(pts), add = TRUE, col = \"red\") knndm_folds <- knndm(pts, modeldomain=studyArea, k = 5) knndm_folds plot(knndm_folds) folds <- as.character(knndm_folds$clusters) ggplot() + geom_sf(data = pts, aes(col = folds)) #use for cross-validation: library(caret) ctrl <- trainControl(method=\"cv\", index=knndm_folds$indx_train, savePredictions='final') model_knndm <- train(dat[,c(\"DEM\",\"TWI\", \"NDRE.M\")], dat$VW, method=\"rf\", trControl = ctrl) global_validation(model_knndm) }"},{"path":"https://hannameyer.github.io/CAST/reference/multiCV.html","id":null,"dir":"Reference","previous_headings":"","what":"MultiCV — multiCV","title":"MultiCV — multiCV","text":"Multiple Cross-Validation increasing feature space 
clusters","code":""},{"path":"https://hannameyer.github.io/CAST/reference/multiCV.html","id":"ref-usage","dir":"Reference","previous_headings":"","what":"Usage","title":"MultiCV — multiCV","text":"","code":"multiCV(model, length.out, method, useWeight, ...)"},{"path":"https://hannameyer.github.io/CAST/reference/multiCV.html","id":"arguments","dir":"Reference","previous_headings":"","what":"Arguments","title":"MultiCV — multiCV","text":"model model used get AOA length.out Numeric. used multiCV=TRUE. Number cross-validation folds. See details. method Character. Method used distance calculation. Currently euclidean distance (L2) Mahalanobis distance (MD) implemented L2 tested. Note MD takes considerably longer. See ?aoa explanation useWeight Logical. model given. Weight variables according importance model? ... additional parameters trainDI","code":""},{"path":"https://hannameyer.github.io/CAST/reference/multiCV.html","id":"value","dir":"Reference","previous_headings":"","what":"Value","title":"MultiCV — multiCV","text":"preds_all","code":""},{"path":"https://hannameyer.github.io/CAST/reference/nndm.html","id":null,"dir":"Reference","previous_headings":"","what":"Nearest Neighbour Distance Matching (NNDM) algorithm — nndm","title":"Nearest Neighbour Distance Matching (NNDM) algorithm — nndm","text":"function implements NNDM algorithm returns necessary indices perform NNDM LOO CV map validation.","code":""},{"path":"https://hannameyer.github.io/CAST/reference/nndm.html","id":"ref-usage","dir":"Reference","previous_headings":"","what":"Usage","title":"Nearest Neighbour Distance Matching (NNDM) algorithm — nndm","text":"","code":"nndm( tpoints, modeldomain = NULL, ppoints = NULL, samplesize = 1000, sampling = \"regular\", phi = \"max\", min_train = 0.5 )"},{"path":"https://hannameyer.github.io/CAST/reference/nndm.html","id":"arguments","dir":"Reference","previous_headings":"","what":"Arguments","title":"Nearest Neighbour Distance Matching (NNDM) algorithm — 
nndm","text":"tpoints sf sfc point object. Contains training points samples. modeldomain sf polygon object defining prediction area (see Details). ppoints sf sfc point object. Contains target prediction points. Optional. Alternative modeldomain (see Details). samplesize numeric. many points modeldomain sampled prediction points? required modeldomain used instead ppoints. sampling character. draw prediction points modeldomain? See `sf::st_sample`. required modeldomain used instead ppoints. phi Numeric. Estimate landscape autocorrelation range units tpoints ppoints projected CRS, meters geographic CRS. Per default (phi=\"max\"), size prediction area used. See Details. min_train Numeric 0 1. Minimum proportion training data must used CV fold. Defaults 0.5 (.e. half training points).","code":""},{"path":"https://hannameyer.github.io/CAST/reference/nndm.html","id":"value","dir":"Reference","previous_headings":"","what":"Value","title":"Nearest Neighbour Distance Matching (NNDM) algorithm — nndm","text":"object class nndm consisting list six elements: indx_train, indx_test, indx_exclude (indices observations use training/test/excluded data NNDM LOO CV iteration), Gij (distances G function construction prediction target points), Gj (distances G function construction LOO CV), Gjstar (distances modified G function NNDM LOO CV), phi (landscape autocorrelation range). indx_train indx_test can directly used \"index\" \"indexOut\" caret's trainControl function used initiate custom validation strategy mlr3.","code":""},{"path":"https://hannameyer.github.io/CAST/reference/nndm.html","id":"details","dir":"Reference","previous_headings":"","what":"Details","title":"Nearest Neighbour Distance Matching (NNDM) algorithm — nndm","text":"NNDM proposes LOO CV scheme nearest neighbour distance distribution function test training data CV process matched nearest neighbour distance distribution function prediction training points. Details method can found Milà et al. (2022). 
Specifying phi allows limiting distance matching area assumed relevant due spatial autocorrelation. Distances matched phi. Beyond range, data points used training, without exclusions. phi set \"max\", nearest neighbor distance matching performed entire prediction area. Euclidean distances used projected non-defined CRS, great circle distances used geographic CRS (units meters). modeldomain sf polygon defines prediction area. function takes regular point sample (amount defined samplesize) spatial extent. alternative use ppoints instead modeldomain, already defined prediction locations (e.g. raster pixel centroids). using either modeldomain ppoints, advise plot study area polygon training/prediction points previous step ensure aligned.","code":""},{"path":"https://hannameyer.github.io/CAST/reference/nndm.html","id":"note","dir":"Reference","previous_headings":"","what":"Note","title":"Nearest Neighbour Distance Matching (NNDM) algorithm — nndm","text":"NNDM variation LOOCV therefore may take long time large training data sets. k-fold variant implemented shortly.","code":""},{"path":"https://hannameyer.github.io/CAST/reference/nndm.html","id":"references","dir":"Reference","previous_headings":"","what":"References","title":"Nearest Neighbour Distance Matching (NNDM) algorithm — nndm","text":"Milà, C., Mateu, J., Pebesma, E., Meyer, H. (2022): Nearest Neighbour Distance Matching Leave-One-Cross-Validation map validation. Methods Ecology Evolution 00, 1– 13. Meyer, H., Pebesma, E. (2022): Machine learning-based global maps ecological variables challenge assessing . Nature Communications. 
13.","code":""},{"path":[]},{"path":"https://hannameyer.github.io/CAST/reference/nndm.html","id":"author","dir":"Reference","previous_headings":"","what":"Author","title":"Nearest Neighbour Distance Matching (NNDM) algorithm — nndm","text":"Carles Milà","code":""},{"path":"https://hannameyer.github.io/CAST/reference/nndm.html","id":"ref-examples","dir":"Reference","previous_headings":"","what":"Examples","title":"Nearest Neighbour Distance Matching (NNDM) algorithm — nndm","text":"","code":"######################################################################## # Example 1: Simulated data - Randomly-distributed training points ######################################################################## library(sf) # Simulate 100 random training points in a 100x100 square set.seed(123) poly <- list(matrix(c(0,0,0,100,100,100,100,0,0,0), ncol=2, byrow=TRUE)) sample_poly <- sf::st_polygon(poly) train_points <- sf::st_sample(sample_poly, 100, type = \"random\") pred_points <- sf::st_sample(sample_poly, 100, type = \"regular\") plot(sample_poly) plot(pred_points, add = TRUE, col = \"blue\") plot(train_points, add = TRUE, col = \"red\") # Run NNDM for the whole domain, here the prediction points are known nndm_pred <- nndm(train_points, ppoints=pred_points) nndm_pred #> nndm object #> Total number of points: 100 #> Mean number of training points: 98.54 #> Minimum number of training points: 83 plot(nndm_pred) # ...or run NNDM with a known autocorrelation range of 10 # to restrict the matching to distances lower than that. 
nndm_pred <- nndm(train_points, ppoints=pred_points, phi = 10) nndm_pred #> nndm object #> Total number of points: 100 #> Mean number of training points: 98.72 #> Minimum number of training points: 96 plot(nndm_pred) ######################################################################## # Example 2: Simulated data - Clustered training points ######################################################################## library(sf) # Simulate 100 clustered training points in a 100x100 square set.seed(123) poly <- list(matrix(c(0,0,0,100,100,100,100,0,0,0), ncol=2, byrow=TRUE)) sample_poly <- sf::st_polygon(poly) train_points <- clustered_sample(sample_poly, 100, 10, 5) pred_points <- sf::st_sample(sample_poly, 100, type = \"regular\") plot(sample_poly) plot(pred_points, add = TRUE, col = \"blue\") plot(train_points, add = TRUE, col = \"red\") # Run NNDM for the whole domain nndm_pred <- nndm(train_points, ppoints=pred_points) nndm_pred #> nndm object #> Total number of points: 100 #> Mean number of training points: 86.84 #> Minimum number of training points: 50 plot(nndm_pred) ######################################################################## # Example 3: Real- world example; using a modeldomain instead of previously # sampled prediction locations ######################################################################## if (FALSE) { library(sf) library(terra) ### prepare sample data: dat <- readRDS(system.file(\"extdata\",\"Cookfarm.RDS\",package=\"CAST\")) dat <- aggregate(dat[,c(\"DEM\",\"TWI\", \"NDRE.M\", \"Easting\", \"Northing\",\"VW\")], by=list(as.character(dat$SOURCEID)),mean) pts <- dat[,-1] pts <- st_as_sf(pts,coords=c(\"Easting\",\"Northing\")) st_crs(pts) <- 26911 studyArea <- rast(system.file(\"extdata\",\"predictors_2012-03-25.tif\",package=\"CAST\")) studyArea[!is.na(studyArea)] <- 1 studyArea <- as.polygons(studyArea, values = FALSE, na.all = TRUE) |> st_as_sf() |> st_union() pts <- st_transform(pts, crs = st_crs(studyArea)) plot(studyArea) 
plot(st_geometry(pts), add = TRUE, col = \"red\") nndm_folds <- nndm(pts, modeldomain= studyArea) plot(nndm_folds) #use for cross-validation: library(caret) ctrl <- trainControl(method=\"cv\", index=nndm_folds$indx_train, indexOut=nndm_folds$indx_test, savePredictions='final') model_nndm <- train(dat[,c(\"DEM\",\"TWI\", \"NDRE.M\")], dat$VW, method=\"rf\", trControl = ctrl) global_validation(model_nndm) }"},{"path":"https://hannameyer.github.io/CAST/reference/plot.html","id":null,"dir":"Reference","previous_headings":"","what":"Plot CAST classes — plot","title":"Plot CAST classes — plot","text":"Generic plot function CAST Classes plotting function forward feature selection result. point mean performance model run. Error bars represent standard errors cross validation. Marked points show best model number variables variable improve results. type==\"selected\", contribution selected variables model performance shown. Density plot nearest neighbor distances geographic space feature space training data well training data prediction locations. Optional, nearest neighbor distances training data test data training data CV iterations shown. plot can used check suitability chosen CV method representative estimate map accuracy. Plot DI errormetric Cross-Validation modelled relationship","code":""},{"path":"https://hannameyer.github.io/CAST/reference/plot.html","id":"ref-usage","dir":"Reference","previous_headings":"","what":"Usage","title":"Plot CAST classes — plot","text":"","code":"# S3 method for trainDI plot(x, ...) # S3 method for aoa plot(x, samplesize = 1000, ...) # S3 method for nndm plot(x, ...) # S3 method for knndm plot(x, ...) # S3 method for ffs plot( x, plotType = \"all\", palette = rainbow, reverse = FALSE, marker = \"black\", size = 1.5, lwd = 0.5, pch = 21, ... ) # S3 method for geodist plot(x, unit = \"m\", stat = \"density\", ...) 
# S3 method for errorModel plot(x, ...)"},{"path":"https://hannameyer.github.io/CAST/reference/plot.html","id":"arguments","dir":"Reference","previous_headings":"","what":"Arguments","title":"Plot CAST classes — plot","text":"x errorModel, see DItoErrormetric ... params samplesize numeric. many prediction samples plotted? plotType character. Either \"\" \"selected\" palette color palette reverse Character. palette reversed? marker Character. Color mark best models size Numeric. Size points lwd Numeric. Width error bars pch Numeric. Type point marking best models unit character. type==\"geo\" applied plot. Supported: \"m\" \"km\". stat \"density\" density plot \"ecdf\" empirical cumulative distribution function plot.","code":""},{"path":"https://hannameyer.github.io/CAST/reference/plot.html","id":"value","dir":"Reference","previous_headings":"","what":"Value","title":"Plot CAST classes — plot","text":"ggplot ggplot","code":""},{"path":[]},{"path":"https://hannameyer.github.io/CAST/reference/plot.html","id":"author","dir":"Reference","previous_headings":"","what":"Author","title":"Plot CAST classes — plot","text":"Marvin Ludwig, Hanna Meyer Carles Milà Marvin Ludwig Hanna Meyer","code":""},{"path":"https://hannameyer.github.io/CAST/reference/plot.html","id":"ref-examples","dir":"Reference","previous_headings":"","what":"Examples","title":"Plot CAST classes — plot","text":"","code":"if (FALSE) { data(splotdata) splotdata <- st_drop_geometry(splotdata) ffsmodel <- ffs(splotdata[,6:16], splotdata$Species_richness, ntree = 10) plot(ffsmodel) #plot performance of selected variables only: plot(ffsmodel,plotType=\"selected\") }"},{"path":"https://hannameyer.github.io/CAST/reference/plot_ffs.html","id":null,"dir":"Reference","previous_headings":"","what":"Plot results of a Forward feature selection or best subset selection — plot_ffs","title":"Plot results of a Forward feature selection or best subset selection — plot_ffs","text":"plot_ffs() deprecated removed soon. 
Please use generic plot() function ffs object. plotting function forward feature selection result. point mean performance model run. Error bars represent standard errors cross validation. Marked points show best model number variables variable improve results. type==\"selected\", contribution selected variables model performance shown.","code":""},{"path":"https://hannameyer.github.io/CAST/reference/plot_ffs.html","id":"ref-usage","dir":"Reference","previous_headings":"","what":"Usage","title":"Plot results of a Forward feature selection or best subset selection — plot_ffs","text":"","code":"plot_ffs( ffs_model, plotType = \"all\", palette = rainbow, reverse = FALSE, marker = \"black\", size = 1.5, lwd = 0.5, pch = 21, ... )"},{"path":"https://hannameyer.github.io/CAST/reference/plot_ffs.html","id":"arguments","dir":"Reference","previous_headings":"","what":"Arguments","title":"Plot results of a Forward feature selection or best subset selection — plot_ffs","text":"ffs_model Result forward feature selection see ffs plotType character. Either \"\" \"selected\" palette color palette reverse Character. palette reversed? marker Character. Color mark best models size Numeric. Size points lwd Numeric. Width error bars pch Numeric. Type point marking best models ... 
arguments base plot type=\"selected\"","code":""},{"path":[]},{"path":"https://hannameyer.github.io/CAST/reference/plot_ffs.html","id":"author","dir":"Reference","previous_headings":"","what":"Author","title":"Plot results of a Forward feature selection or best subset selection — plot_ffs","text":"Marvin Ludwig Hanna Meyer","code":""},{"path":"https://hannameyer.github.io/CAST/reference/plot_ffs.html","id":"ref-examples","dir":"Reference","previous_headings":"","what":"Examples","title":"Plot results of a Forward feature selection or best subset selection — plot_ffs","text":"","code":"if (FALSE) { data(iris) ffsmodel <- ffs(iris[,1:4],iris$Species) plot(ffsmodel) #plot performance of selected variables only: plot(ffsmodel,plotType=\"selected\") }"},{"path":"https://hannameyer.github.io/CAST/reference/plot_geodist.html","id":null,"dir":"Reference","previous_headings":"","what":"Plot euclidean nearest neighbor distances in geographic space or feature space — plot_geodist","title":"Plot euclidean nearest neighbor distances in geographic space or feature space — plot_geodist","text":"Density plot nearest neighbor distances geographic space feature space training data well training data prediction locations. Optional, nearest neighbor distances training data test data training data CV iterations shown. plot can used check suitability chosen CV method representative estimate map accuracy. 
Alternatively distances can also calculated multivariate feature space.","code":""},{"path":"https://hannameyer.github.io/CAST/reference/plot_geodist.html","id":"ref-usage","dir":"Reference","previous_headings":"","what":"Usage","title":"Plot euclidean nearest neighbor distances in geographic space or feature space — plot_geodist","text":"","code":"plot_geodist( x, modeldomain, type = \"geo\", cvfolds = NULL, cvtrain = NULL, testdata = NULL, samplesize = 2000, sampling = \"regular\", variables = NULL, unit = \"m\", stat = \"density\", showPlot = TRUE )"},{"path":"https://hannameyer.github.io/CAST/reference/plot_geodist.html","id":"arguments","dir":"Reference","previous_headings":"","what":"Arguments","title":"Plot euclidean nearest neighbor distances in geographic space or feature space — plot_geodist","text":"x object class sf, training data locations modeldomain SpatRaster, stars sf object defining prediction area (see Details) type \"geo\" \"feature\". distance computed geographic space normalized multivariate predictor space (see Details) cvfolds optional. list vector. Either list element contains data points used testing cross validation iteration (.e. held back data). vector contains ID fold training point. See e.g. ?createFolds ?CreateSpacetimeFolds ?nndm cvtrain optional. List row indices x fit model CV iteration. cvtrain null cvfolds , samples included cvfolds used training data testdata optional. object class sf: Data used independent validation samplesize numeric. many prediction samples used? sampling character. draw prediction samples? See spsample. Use sampling = \"Fibonacci\" global applications. variables character vector defining predictor variables used type=\"feature. provided variables included modeldomain used. unit character. type==\"geo\" applied plot. Supported: \"m\" \"km\". stat \"density\" density plot \"ecdf\" empirical cumulative distribution function plot. 
showPlot logical","code":""},{"path":"https://hannameyer.github.io/CAST/reference/plot_geodist.html","id":"value","dir":"Reference","previous_headings":"","what":"Value","title":"Plot euclidean nearest neighbor distances in geographic space or feature space — plot_geodist","text":"list including plot corresponding data.frame containing distances. Unit returned geographic distances meters.","code":""},{"path":"https://hannameyer.github.io/CAST/reference/plot_geodist.html","id":"details","dir":"Reference","previous_headings":"","what":"Details","title":"Plot euclidean nearest neighbor distances in geographic space or feature space — plot_geodist","text":"modeldomain sf polygon raster defines prediction area. function takes regular point sample (amount defined samplesize) spatial extent. type = \"feature\", argument modeldomain (provided also testdata) include predictors. Predictor values x optional modeldomain raster. provided extracted modeldomain rasterStack.","code":""},{"path":"https://hannameyer.github.io/CAST/reference/plot_geodist.html","id":"note","dir":"Reference","previous_headings":"","what":"Note","title":"Plot euclidean nearest neighbor distances in geographic space or feature space — plot_geodist","text":"See Meyer Pebesma (2022) application plotting function","code":""},{"path":[]},{"path":"https://hannameyer.github.io/CAST/reference/plot_geodist.html","id":"author","dir":"Reference","previous_headings":"","what":"Author","title":"Plot euclidean nearest neighbor distances in geographic space or feature space — plot_geodist","text":"Hanna Meyer, Edzer Pebesma, Marvin Ludwig","code":""},{"path":"https://hannameyer.github.io/CAST/reference/plot_geodist.html","id":"ref-examples","dir":"Reference","previous_headings":"","what":"Examples","title":"Plot euclidean nearest neighbor distances in geographic space or feature space — plot_geodist","text":"","code":"if (FALSE) { library(sf) library(terra) library(caret) ########### prepare sample data: dat <- 
readRDS(system.file(\"extdata\",\"Cookfarm.RDS\",package=\"CAST\")) dat <- aggregate(dat[,c(\"DEM\",\"TWI\", \"NDRE.M\", \"Easting\", \"Northing\")], by=list(as.character(dat$SOURCEID)),mean) pts <- st_as_sf(dat,coords=c(\"Easting\",\"Northing\")) st_crs(pts) <- 26911 pts_train <- pts[1:29,] pts_test <- pts[30:42,] studyArea <- terra::rast(system.file(\"extdata\",\"predictors_2012-03-25.tif\",package=\"CAST\")) studyArea <- studyArea[[c(\"DEM\",\"TWI\", \"NDRE.M\", \"NDRE.Sd\", \"Bt\")]] ########### Distance between training data and new data: dist <- plot_geodist(pts_train,studyArea) ########### Distance between training data, new data and test data: #mapview(pts_train,col.regions=\"blue\")+mapview(pts_test,col.regions=\"red\") dist <- plot_geodist(pts_train,studyArea,testdata=pts_test) ########### Distance between training data, new data and CV folds: folds <- createFolds(1:nrow(pts_train),k=3,returnTrain=FALSE) dist <- plot_geodist(x=pts_train, modeldomain=studyArea, cvfolds=folds) ## or use nndm to define folds AOI <- as.polygons(rast(studyArea), values = F) |> st_as_sf() |> st_union() |> st_transform(crs = st_crs(pts_train)) nndm_pred <- nndm(pts_train, AOI) dist <- plot_geodist(x=pts_train, modeldomain=studyArea, cvfolds=nndm_pred$indx_test, cvtrain=nndm_pred$indx_train) ########### Distances in the feature space: plot_geodist(x=pts_train, modeldomain=studyArea, type = \"feature\",variables=c(\"DEM\",\"TWI\", \"NDRE.M\")) dist <- plot_geodist(x=pts_train, modeldomain=studyArea, cvfolds = folds, testdata = pts_test, type = \"feature\",variables=c(\"DEM\",\"TWI\", \"NDRE.M\")) ############ Example for a random global dataset ############ (refer to figure in Meyer and Pebesma 2022) library(sf) library(rnaturalearth) library(ggplot2) ### Define prediction area (here: global): ee <- st_crs(\"+proj=eqearth\") co <- ne_countries(returnclass = \"sf\") co.ee <- st_transform(co, ee) ### Simulate a spatial random sample ### (alternatively replace pts_random by a real 
sampling dataset (see Meyer and Pebesma 2022): sf_use_s2(FALSE) pts_random <- st_sample(co.ee, 2000, exact=FALSE) ### See points on the map: ggplot() + geom_sf(data = co.ee, fill=\"#00BFC4\",col=\"#00BFC4\") + geom_sf(data = pts_random, color = \"#F8766D\",size=0.5, shape=3) + guides(fill = FALSE, col = FALSE) + labs(x = NULL, y = NULL) ### plot distances: dist <- plot_geodist(pts_random,co.ee,showPlot=FALSE) dist$plot+scale_x_log10(labels=round) }"},{"path":"https://hannameyer.github.io/CAST/reference/print.html","id":null,"dir":"Reference","previous_headings":"","what":"Print CAST classes — print","title":"Print CAST classes — print","text":"Generic print function trainDI aoa","code":""},{"path":"https://hannameyer.github.io/CAST/reference/print.html","id":"ref-usage","dir":"Reference","previous_headings":"","what":"Usage","title":"Print CAST classes — print","text":"","code":"# S3 method for trainDI print(x, ...) show.trainDI(x, ...) # S3 method for aoa print(x, ...) show.aoa(x, ...) # S3 method for nndm print(x, ...) show.nndm(x, ...) # S3 method for knndm print(x, ...) show.knndm(x, ...) # S3 method for ffs print(x, ...) show.ffs(x, ...)"},{"path":"https://hannameyer.github.io/CAST/reference/print.html","id":"arguments","dir":"Reference","previous_headings":"","what":"Arguments","title":"Print CAST classes — print","text":"x object type ffs ... 
arguments.","code":""},{"path":"https://hannameyer.github.io/CAST/reference/splotdata.html","id":null,"dir":"Reference","previous_headings":"","what":"sPlotOpen Data of Species Richness — splotdata","title":"sPlotOpen Data of Species Richness — splotdata","text":"sPlotOpen Species Richness South America associated predictors","code":""},{"path":"https://hannameyer.github.io/CAST/reference/splotdata.html","id":"ref-usage","dir":"Reference","previous_headings":"","what":"Usage","title":"sPlotOpen Data of Species Richness — splotdata","text":"","code":"data(splotdata)"},{"path":"https://hannameyer.github.io/CAST/reference/splotdata.html","id":"format","dir":"Reference","previous_headings":"","what":"Format","title":"sPlotOpen Data of Species Richness — splotdata","text":"sf points / data.frame 703 rows 17 columns: PlotObeservationID, GIVD_ID, Country, Biome sPlotOpen Metadata Species_richness Response Variable - Plant species richness sPlotOpen bio_x, elev Predictor Variables - Worldclim SRTM elevation geometry Lat/Lon","code":""},{"path":"https://hannameyer.github.io/CAST/reference/splotdata.html","id":"source","dir":"Reference","previous_headings":"","what":"Source","title":"sPlotOpen Data of Species Richness — splotdata","text":"Plot Species_richness sPlotOpen predictors acquired via R package geodata","code":""},{"path":"https://hannameyer.github.io/CAST/reference/splotdata.html","id":"references","dir":"Reference","previous_headings":"","what":"References","title":"sPlotOpen Data of Species Richness — splotdata","text":"Sabatini, F. M. et al. sPlotOpen – environmentally balanced, open‐access, global dataset vegetation plots. (2021). doi:10.1111/geb.13346 Lopez-Gonzalez, G. et al. ForestPlots.net: web application research tool manage analyse tropical forest plot data: ForestPlots.net. Journal Vegetation Science (2011). Pauchard, . et al. Alien Plants Homogenise Protected Areas: Evidence Landscape Regional Scales South Central Chile. 
Plant Invasions Protected Areas (2013). Peyre, G. et al. VegPáramo, flora vegetation database Andean páramo. phytocoenologia (2015). Vibrans, . C. et al. Insights large-scale inventory southern Brazilian Atlantic Forest. Scientia Agricola (2020).","code":""},{"path":"https://hannameyer.github.io/CAST/reference/trainDI.html","id":null,"dir":"Reference","previous_headings":"","what":"Calculate Dissimilarity Index of training data — trainDI","title":"Calculate Dissimilarity Index of training data — trainDI","text":"function estimates Dissimilarity Index (DI) within training data set used prediction model. Predictors can weighted based internal variable importance machine learning algorithm used model training.","code":""},{"path":"https://hannameyer.github.io/CAST/reference/trainDI.html","id":"ref-usage","dir":"Reference","previous_headings":"","what":"Usage","title":"Calculate Dissimilarity Index of training data — trainDI","text":"","code":"trainDI( model = NA, train = NULL, variables = \"all\", weight = NA, CVtest = NULL, CVtrain = NULL, method = \"L2\", useWeight = TRUE )"},{"path":"https://hannameyer.github.io/CAST/reference/trainDI.html","id":"arguments","dir":"Reference","previous_headings":"","what":"Arguments","title":"Calculate Dissimilarity Index of training data — trainDI","text":"model train object created caret used extract weights (based variable importance) well cross-validation folds train data.frame containing data used model training. required model given variables character vector predictor variables. \"\" variables model used model given train dataset. weight data.frame containing weights variable. required model given. CVtest list vector. Either list element contains data points used testing cross validation iteration (.e. held back data). vector contains ID fold training point. required model given. CVtrain list. element contains data points used training cross validation iteration (.e. held back data). 
required model given required CVtrain opposite CVtest (.e. data point used testing, used training). Relevant data points excluded, e.g. using nndm. method Character. Method used distance calculation. Currently euclidean distance (L2) Mahalanobis distance (MD) implemented L2 tested. Note MD takes considerably longer. useWeight Logical. model given. Weight variables according importance model?","code":""},{"path":"https://hannameyer.github.io/CAST/reference/trainDI.html","id":"value","dir":"Reference","previous_headings":"","what":"Value","title":"Calculate Dissimilarity Index of training data — trainDI","text":"list class trainDI containing: train data frame containing training data weight data frame weights based variable importance. variables Names used variables catvars variables categorial scaleparam Scaling parameters. Output scale trainDist_avrg data frame average distance training point every point trainDist_avrgmean mean trainDist_avrg. Used normalizing DI trainDI Dissimilarity Index training data threshold DI threshold used inside/outside AOA","code":""},{"path":"https://hannameyer.github.io/CAST/reference/trainDI.html","id":"note","dir":"Reference","previous_headings":"","what":"Note","title":"Calculate Dissimilarity Index of training data — trainDI","text":"function called within aoa estimate DI AOA new data. However, may also used DI training data interest, facilitate parallelization aoa avoiding repeated calculation DI within training data.","code":""},{"path":"https://hannameyer.github.io/CAST/reference/trainDI.html","id":"references","dir":"Reference","previous_headings":"","what":"References","title":"Calculate Dissimilarity Index of training data — trainDI","text":"Meyer, H., Pebesma, E. (2021): Predicting unknown space? Estimating area applicability spatial prediction models. 
doi:10.1111/2041-210X.13650","code":""},{"path":[]},{"path":"https://hannameyer.github.io/CAST/reference/trainDI.html","id":"author","dir":"Reference","previous_headings":"","what":"Author","title":"Calculate Dissimilarity Index of training data — trainDI","text":"Hanna Meyer, Marvin Ludwig","code":""},{"path":"https://hannameyer.github.io/CAST/reference/trainDI.html","id":"ref-examples","dir":"Reference","previous_headings":"","what":"Examples","title":"Calculate Dissimilarity Index of training data — trainDI","text":"","code":"if (FALSE) { library(sf) library(terra) library(caret) library(viridis) library(ggplot2) # prepare sample data: dat <- readRDS(system.file(\"extdata\",\"Cookfarm.RDS\",package=\"CAST\")) dat <- aggregate(dat[,c(\"VW\",\"Easting\",\"Northing\")],by=list(as.character(dat$SOURCEID)),mean) pts <- st_as_sf(dat,coords=c(\"Easting\",\"Northing\")) pts$ID <- 1:nrow(pts) set.seed(100) pts <- pts[1:30,] studyArea <- rast(system.file(\"extdata\",\"predictors_2012-03-25.tif\",package=\"CAST\"))[[1:8]] trainDat <- extract(studyArea,pts,na.rm=FALSE) trainDat <- merge(trainDat,pts,by.x=\"ID\",by.y=\"ID\") # visualize data spatially: plot(studyArea) plot(studyArea$DEM) plot(pts[,1],add=TRUE,col=\"black\") # train a model: set.seed(100) variables <- c(\"DEM\",\"NDRE.Sd\",\"TWI\") model <- train(trainDat[,which(names(trainDat)%in%variables)], trainDat$VW, method=\"rf\", importance=TRUE, tuneLength=1, trControl=trainControl(method=\"cv\",number=5,savePredictions=T)) print(model) #note that this is a quite poor prediction model prediction <- predict(studyArea,model,na.rm=TRUE) plot(varImp(model,scale=FALSE)) #...then calculate the DI of the trained model: DI = trainDI(model=model) plot(DI) # the DI can now be used to compute the AOA: AOA = aoa(studyArea, model = model, trainDI = DI) print(AOA) plot(AOA) }"},{"path":"https://hannameyer.github.io/CAST/news/index.html","id":"cast-090","dir":"Changelog","previous_headings":"","what":"CAST 0.9.0","title":"CAST 
0.9.0","text":"CAST functions now return classes generic plotting printing new dataset examples, tutorials testing: data(splotdata) calibrate_aoa now DItoErrormetric returns model (see function documentation) plot_geodist now geodist. result can visualized plot() plot_ffs now plot(ffs) fix issue #65 (threshold) plot_geodist, plot_ffs, calibrate_aoa","code":""},{"path":"https://hannameyer.github.io/CAST/news/index.html","id":"cast-081","dir":"Changelog","previous_headings":"","what":"CAST 0.8.1","title":"CAST 0.8.1","text":"CRAN release: 2023-05-30 failed checks Fedora 34 fixed","code":""},{"path":"https://hannameyer.github.io/CAST/news/index.html","id":"cast-080","dir":"Changelog","previous_headings":"","what":"CAST 0.8.0","title":"CAST 0.8.0","text":"CRAN release: 2023-05-21 knndm alternative nndm large training data transition raster terra","code":""},{"path":"https://hannameyer.github.io/CAST/news/index.html","id":"cast-071","dir":"Changelog","previous_headings":"","what":"CAST 0.7.1","title":"CAST 0.7.1","text":"CRAN release: 2023-02-04 Mahalanobis distance AOA assessment option faster estimation AOA delineation default threshold fixed suggested github.com/HannaMeyer/CAST/issues/46 fixed issue github.com/ropensci/rnaturalearth/issues/69","code":""},{"path":"https://hannameyer.github.io/CAST/news/index.html","id":"cast-070","dir":"Changelog","previous_headings":"","what":"CAST 0.7.0","title":"CAST 0.7.0","text":"CRAN release: 2022-08-24 nndm cross-validation suggested Milà et al. 
(2022) plot_geodist works NNDM trainDI works NNDM rename parameter folds AOA trainDI","code":""},{"path":"https://hannameyer.github.io/CAST/news/index.html","id":"cast-060","dir":"Changelog","previous_headings":"","what":"CAST 0.6.0","title":"CAST 0.6.0","text":"CRAN release: 2022-03-17 trainDI allows calculate DI training dataset separately aoa function plot print functions AOA function plot nearest neighbor distance distributions geographic feature space function global_validation added extensive restructuring AOA function ffs bss can used global_validation error manual assignment weights fixed","code":""},{"path":"https://hannameyer.github.io/CAST/news/index.html","id":"cast-051","dir":"Changelog","previous_headings":"","what":"CAST 0.5.1","title":"CAST 0.5.1","text":"CRAN release: 2021-04-07 resolved dependence package “GSIF” removed CRAN repository","code":""},{"path":"https://hannameyer.github.io/CAST/news/index.html","id":"cast-050","dir":"Changelog","previous_headings":"","what":"CAST 0.5.0","title":"CAST 0.5.0","text":"CRAN release: 2021-02-19 AOA can run parallel calibration DI (calibrate_aoa) aoa work now large training sets default threshold AOA changed","code":""},{"path":"https://hannameyer.github.io/CAST/news/index.html","id":"cast-042","dir":"Changelog","previous_headings":"","what":"CAST 0.4.2","title":"CAST 0.4.2","text":"CRAN release: 2020-07-17 aoa now working categorical variables fixed error ffs >170 variables used changed order parameters aoa tutorial “Introduction CAST” improved","code":""},{"path":"https://hannameyer.github.io/CAST/news/index.html","id":"cast-041","dir":"Changelog","previous_headings":"","what":"CAST 0.4.1","title":"CAST 0.4.1","text":"CRAN release: 2020-05-19 vignette: tutorial introducing “area applicability” variable threshold aoa various modifications aoa line submitted paper","code":""},{"path":"https://hannameyer.github.io/CAST/news/index.html","id":"cast-040","dir":"Changelog","previous_headings":"","what":"CAST 
0.4.0","title":"CAST 0.4.0","text":"CRAN release: 2020-04-06 new function “aoa”: quantify visualize area applicability spatial prediction models “minVar” ffs: Instead always starting 2-pair combinations, ffs can now also started combinations variables (e.g starting combinations 3) ffs failed “svmLinear” previous version S4 class issues. Fixed now.","code":""},{"path":"https://hannameyer.github.io/CAST/news/index.html","id":"cast-031","dir":"Changelog","previous_headings":"","what":"CAST 0.3.1","title":"CAST 0.3.1","text":"CRAN release: 2018-11-19 CreateSpaceTimeFolds accepts tibbles CreateSpaceTimeFolds automatically reduces k necessary ffs accepts arguments taken caret::train new feature: plot_ffs option plot selected variables ","code":""},{"path":"https://hannameyer.github.io/CAST/news/index.html","id":"cast-030","dir":"Changelog","previous_headings":"","what":"CAST 0.3.0","title":"CAST 0.3.0","text":"CRAN release: 2018-10-11 new feature: Best subset selection (bss) target-oriented validation (slow reliable) alternative ffs minor adaptations: verbose option included, improved examples ffs bugfix: minor adaptations done usage plsr","code":""},{"path":"https://hannameyer.github.io/CAST/news/index.html","id":"cast-021","dir":"Changelog","previous_headings":"","what":"CAST 0.2.1","title":"CAST 0.2.1","text":"CRAN release: 2018-07-12 new feature: Introduction CAST included vignette. bugfix: minor error fixed using user defined metrics model selection.","code":""},{"path":"https://hannameyer.github.io/CAST/news/index.html","id":"cast-020","dir":"Changelog","previous_headings":"","what":"CAST 0.2.0","title":"CAST 0.2.0","text":"CRAN release: 2018-05-03 bugfix: ffs option withinSE=TRUE choose model “best model” within SE model trained earlier run number variables. bug fixed withinSE=TRUE ffs now compares performance models use less variables (e.g. model using 5 variables better model using 4 variables still SE 4-variable model, 4-variable model rated better model). 
new feature: plot_ffs plots results ffs visualize performance changes according model run number variables used.","code":""},{"path":"https://hannameyer.github.io/CAST/news/index.html","id":"cast-010","dir":"Changelog","previous_headings":"","what":"CAST 0.1.0","title":"CAST 0.1.0","text":"CRAN release: 2018-01-09 Initial public version CRAN","code":""}]
+[{"path":"https://hannameyer.github.io/CAST/articles/cast01-CAST-intro-cookfarm.html","id":"introduction","dir":"Articles","previous_headings":"","what":"Introduction","title":"1. Introduction to CAST","text":"!!Note: recent developments CAST yet fully documented tutorial. major update can expected Apr 2024!!","code":""},{"path":"https://hannameyer.github.io/CAST/articles/cast01-CAST-intro-cookfarm.html","id":"background","dir":"Articles","previous_headings":"Introduction","what":"Background","title":"1. Introduction to CAST","text":"One key task environmental science obtaining information environmental variables continuously space space time, usually based remote sensing limited field data. respect, machine learning algorithms proven important tool learn patterns nonlinear complex systems. However, standard machine learning applications suitable spatio-temporal data, usually ignore spatio-temporal dependencies data. becomes problematic (least) two aspects predictive modelling: Overfitted models well overly optimistic error assessment (see Meyer et al 2018 Meyer et al 2019 ). approach problems, CAST supports well-known caret package (Kuhn 2018 provide methods designed spatio-temporal data. tutorial shows set spatio-temporal prediction model includes objective reliable error estimation. shows spatio-temporal overfitting can detected comparison validation strategies. shown certain variables responsible problem overfitting due spatio-temporal autocorrelation patterns. Therefore, tutorial also shows automatically exclude variables lead overfitting aim improve spatio-temporal prediction model. order follow tutorial, assume reader familiar basics predictive modelling nicely explained Kuhn Johnson 2013 well machine learning applications via caret package.","code":""},{"path":"https://hannameyer.github.io/CAST/articles/cast01-CAST-intro-cookfarm.html","id":"how-to-start","dir":"Articles","previous_headings":"Introduction","what":"How to start","title":"1. 
Introduction to CAST","text":"work tutorial, first install CAST package load library: need help, see","code":"#install.packages(\"CAST\") library(CAST) help(CAST)"},{"path":"https://hannameyer.github.io/CAST/articles/cast01-CAST-intro-cookfarm.html","id":"example-of-a-typical-spatio-temporal-prediction-task","dir":"Articles","previous_headings":"","what":"Example of a typical spatio-temporal prediction task","title":"1. Introduction to CAST","text":"example prediction task tutorial following: set data loggers distributed farm, want map soil moisture, based set spatial temporal predictor variables. use Random Forests machine learning algorithm tutorial.","code":""},{"path":"https://hannameyer.github.io/CAST/articles/cast01-CAST-intro-cookfarm.html","id":"description-of-the-example-dataset","dir":"Articles","previous_headings":"Example of a typical spatio-temporal prediction task","what":"Description of the example dataset","title":"1. Introduction to CAST","text":", work cookfarm dataset, described e.g. Gasch et al 2015 available via GSIF package (Hengl 2017). dataset included CAST package re-structured dataset used analysis Meyer et al 2018. want point following information dataset: “SOURCEID” represents ID data logger, “VW” soil moisture response variable, “Easting” “Northing” coordinates data loggers, “altitude” indicates depth soil VW measured, remaining columns represent different potential predictor variables terrain related (e.g. “DEM”, “TWI”), vegetation indices (e.g. “NDRE”), soil properties (e.g. “BLD”) climate-related predictors (e.g. “Precip_wrcc”). See Gasch et al 2015 description dataset. get impression spatial properties dataset, let’s look spatial distribution data loggers cookfarm: see data taken 42 locations (SOURCEID) field. loggers recorded data 2007 2013 (dataset contains data 2010 ). 
VW data given daily basis.","code":"data <- readRDS(system.file(\"extdata\",\"Cookfarm.RDS\",package=\"CAST\")) head(data) ## SOURCEID VW Easting Northing altitude DEM TWI NDRE.M ## 101689 CAF357 0.303 493828.1 5181021 -0.3 792.5756 3.791253 0.08161208 ## 213001 CAF357 0.328 493828.1 5181021 -0.6 792.5756 3.791253 0.08161208 ## 324313 CAF357 0.376 493828.1 5181021 -0.9 792.5756 3.791253 0.08161208 ## 435625 CAF357 0.350 493828.1 5181021 -1.2 792.5756 3.791253 0.08161208 ## 546937 CAF357 0.323 493828.1 5181021 -1.5 792.5756 3.791253 0.08161208 ## 101690 CAF357 0.297 493828.1 5181021 -0.3 792.5756 3.791253 0.08161208 ## NDRE.Sd Bt BLD Date Precip_wrcc MaxT_wrcc MinT_wrcc ## 101689 0.2805182 0.0000 1.22 2010-01-01 5.8 2.8 -3.3 ## 213001 0.2805182 0.0000 1.36 2010-01-01 5.8 2.8 -3.3 ## 324313 0.2805182 0.0000 1.48 2010-01-01 5.8 2.8 -3.3 ## 435625 0.2805182 0.0000 1.56 2010-01-01 5.8 2.8 -3.3 ## 546937 0.2805182 0.0106 1.60 2010-01-01 5.8 2.8 -3.3 ## 101690 0.2805182 0.0000 1.22 2010-01-02 6.9 6.1 0.6 ## Precip_cum cday ## 101689 5.8 14611 ## 213001 5.8 14611 ## 324313 5.8 14611 ## 435625 5.8 14611 ## 546937 5.8 14611 ## 101690 12.7 14612 library(sf) data_sp <- unique(data[,c(\"SOURCEID\",\"Easting\",\"Northing\")]) data_sp <- st_as_sf(data_sp,coords=c(\"Easting\",\"Northing\"),crs=26911) plot(data_sp,axes=T,col=\"black\") #...or plot the data with mapview: library(mapview) mapviewOptions(basemaps = c(\"Esri.WorldImagery\")) mapview(data_sp)"},{"path":"https://hannameyer.github.io/CAST/articles/cast01-CAST-intro-cookfarm.html","id":"data-subsetting","dir":"Articles","previous_headings":"Example of a typical spatio-temporal prediction task","what":"Data subsetting","title":"1. Introduction to CAST","text":"reduce data amount can handled tutorial, let’s restrict data depth -0.3 two weeks year 2012. subsetting let’s overview soil moisture time series measured data loggers. 
can see (expected) logger location unique time series soil moisture.","code":"library(lubridate) library(ggplot2) trainDat <- data[data$altitude==-0.3& year(data$Date)==2012& week(data$Date)%in%c(10:12),] ggplot(data = trainDat, aes(x=Date, y=VW)) + geom_line(aes(colour=SOURCEID))"},{"path":"https://hannameyer.github.io/CAST/articles/cast01-CAST-intro-cookfarm.html","id":"model-training-and-prediction","dir":"Articles","previous_headings":"","what":"Model training and prediction","title":"1. Introduction to CAST","text":"following use subset cookfarm data example spatially predict soil moisture (.e. map soil moisture) (without) consideration spatio-temporal dependencies. start , let’s use dataset create “default” Random Forest model predicts soil moisture based predictor variables. keep computation time minimum, don’t include hyperparameter tuning (hence mtry set 2) reasonable Random Forests comparably insensitive tuning. Based trained model can make spatial predictions soil moisture. load multiband raster contains spatial data predictor variables 25th March 2012 (example). apply trained model data set. result spatially comprehensive map soil moisture day. see simply creating map using machine learning caret easy task, however accurately measuring performance less simple. Though map looks good first sight now follow question accurate map , hence need ask well model able map soil moisture. visible inspection noticeable model produces strange linear features eastern side farm looks suspicious. 
let’s come back later first focus statistical validation model.","code":"library(caret) predictors <- c(\"DEM\",\"TWI\",\"Precip_cum\",\"cday\", \"MaxT_wrcc\",\"Precip_wrcc\",\"BLD\", \"Northing\",\"Easting\",\"NDRE.M\") set.seed(10) model <- train(trainDat[,predictors],trainDat$VW, method=\"rf\",tuneGrid=data.frame(\"mtry\"=2), importance=TRUE,ntree=50, trControl=trainControl(method=\"cv\",number=3)) library(terra) predictors_sp <- rast(system.file(\"extdata\",\"predictors_2012-03-25.tif\",package=\"CAST\")) prediction <- predict(predictors_sp,model,na.rm=TRUE) plot(prediction)"},{"path":"https://hannameyer.github.io/CAST/articles/cast01-CAST-intro-cookfarm.html","id":"cross-validation-strategies-for-spatio-temporal-data","dir":"Articles","previous_headings":"","what":"Cross validation strategies for spatio-temporal data","title":"1. Introduction to CAST","text":"Among validation strategies, k-fold cross validation (CV) popular estimate performance model view data used model training. CV, models repeatedly trained (k models) model run, data one fold put side used model training model validation. way, performance model can estimated using data included model training.","code":""},{"path":"https://hannameyer.github.io/CAST/articles/cast01-CAST-intro-cookfarm.html","id":"the-standard-approach-random-k-fold-cv","dir":"Articles","previous_headings":"Cross validation strategies for spatio-temporal data","what":"The Standard approach: Random k-fold CV","title":"1. Introduction to CAST","text":"example used random k-fold CV defined caret’s trainControl argument. specifically, used random 3-fold CV. Hence, data points dataset RANDOMLY split 3 folds. assess performance model let’s look output Random CV: see soil moisture modelled high R² (0.90) indicates nearly perfect fit data. Sounds good, unfortunately, random k fold CV give us good indication map accuracy. Random k-fold CV means three folds (highest certainty) contains data points data logger. 
Therefore, random CV indicate ability model make predictions beyond location training data (.e. map soil moisture). Since aim map soil moisture, rather need perform target-oriented validation validates model view spatial mapping.","code":"model ## Random Forest ## ## 654 samples ## 10 predictor ## ## No pre-processing ## Resampling: Cross-Validated (3 fold) ## Summary of sample sizes: 436, 437, 435 ## Resampling results: ## ## RMSE Rsquared MAE ## 0.02188303 0.9044144 0.01273172 ## ## Tuning parameter 'mtry' was held constant at a value of 2"},{"path":"https://hannameyer.github.io/CAST/articles/cast01-CAST-intro-cookfarm.html","id":"target-oriented-validation","dir":"Articles","previous_headings":"Cross validation strategies for spatio-temporal data","what":"Target-oriented validation","title":"1. Introduction to CAST","text":"interested model performance view random subsets data loggers, need know well model able make predictions areas without data loggers. find , need repeatedly leave complete time series one data loggers use test data CV. first need create meaningful folds rather random folds. CAST’s function “CreateSpaceTimeFolds” designed provide index arguments used caret’s trainControl. index defines data points used model training model run reversely defines data points held back. Hence, using index argument can account dependencies data leaving complete data one data loggers (LLO CV), one time steps (LTO CV) data loggers time steps (LLTO CV). example ’re focusing LLO CV, therefore use column “SOURCEID” define location data logger split data folds using information. Analog random CV split data five folds, hence five model runs performed leaving one fifth data loggers validation. Note several suggestions spatial CV exist. call LLO just simple example. See references Meyer Pebesma 2022 examples look Mila et al 2022 methodology implemented CAST function nndm. 
inspecting output model, see view new locations, R² 0.16 performance much lower expected random CV (R² = 0.90). Apparently, considerable overfitting model, causing good random performance poor performance view new locations. might partly attributed choice variables must suspect certain variables misinterpreted model (see Meyer et al 2018 [talk OpenGeoHub summer school 2019] (https://www.youtube.com/watch?v=mkHlmYEzsVQ)). Let’s look variable importance ranking Random Forest see find something suspicious: importance ranking indicates among others, “Easting” important variable. fits observation inappropriate linear features predicted map. Apparently model assigns high importance variable causes high random CV performance. time model fails prediction new locations variable unsuitable predictions beyond locations data loggers used model training. Assuming certain variables misinterpreted algorithm able produce higher LLO performance variables removed. Let’s see true…","code":"set.seed(10) indices <- CreateSpacetimeFolds(trainDat,spacevar = \"SOURCEID\", k=3) set.seed(10) model_LLO <- train(trainDat[,predictors],trainDat$VW, method=\"rf\",tuneGrid=data.frame(\"mtry\"=2), importance=TRUE, trControl=trainControl(method=\"cv\", index = indices$index)) model_LLO ## Random Forest ## ## 654 samples ## 10 predictor ## ## No pre-processing ## Resampling: Cross-Validated (10 fold) ## Summary of sample sizes: 433, 430, 445 ## Resampling results: ## ## RMSE Rsquared MAE ## 0.07645742 0.1616273 0.05994028 ## ## Tuning parameter 'mtry' was held constant at a value of 2 plot(varImp(model_LLO))"},{"path":"https://hannameyer.github.io/CAST/articles/cast01-CAST-intro-cookfarm.html","id":"removing-variables-that-cause-overfitting","dir":"Articles","previous_headings":"","what":"Removing variables that cause overfitting","title":"1. 
Introduction to CAST","text":"CAST’s forward feature selection (ffs) selects variables make sense view selected CV method excludes counterproductive (meaningless) view selected CV method. use LLO CV method, ffs selects variables lead combination highest LLO performance (.e. best spatial model). variables spatial meaning even counterproductive won’t improve even reduce LLO performance therefore excluded model ffs. ffs job first training models using possible pairs two predictor variables. best model initial models kept. basis best model predictor variables iteratively increased remaining variables tested improvement currently best model. process stops none remaining variables increases model performance added current best model. let’s run ffs case study using R² metric select optimal variables. process take 1-2 minutes… Using ffs LLO CV, R² increased 0.16 0.28. variables used model “DEM”,“NDRE.M” “Northing”. others removed (least small example) spatial meaning even counterproductive. Using plot_ffs function can visualize performance model changed depending variables used: See best model using two variables led R² slightly 0.2. Using third variable slightly increase R². variable improve LLO performance. Note R² features high standard deviation regardless variables used. due small dataset used lead robust results. effect new model spatial representation soil moisture? see variable selection effect statistical performance also predicted spatial patterns change considerably. 
note linear feature resulting soil moisture map likely “Easting” removed set predictor variables ffs.","code":"set.seed(10) ffsmodel_LLO <- ffs(trainDat[,predictors],trainDat$VW,metric=\"Rsquared\", method=\"rf\", tuneGrid=data.frame(\"mtry\"=2), verbose=FALSE,ntree=50, trControl=trainControl(method=\"cv\", index = indices$index)) ffsmodel_LLO ## Selected Variables: ## DEM NDRE.M Northing ## --- ## Random Forest ## ## 654 samples ## 3 predictor ## ## No pre-processing ## Resampling: Cross-Validated (10 fold) ## Summary of sample sizes: 433, 430, 445 ## Resampling results: ## ## RMSE Rsquared MAE ## 0.1013101 0.2833983 0.0767997 ## ## Tuning parameter 'mtry' was held constant at a value of 2 ffsmodel_LLO$selectedvars ## [1] \"DEM\" \"NDRE.M\" \"Northing\" plot(ffsmodel_LLO) prediction_ffs <- predict(predictors_sp,ffsmodel_LLO,na.rm=TRUE) plot(prediction_ffs)"},{"path":"https://hannameyer.github.io/CAST/articles/cast01-CAST-intro-cookfarm.html","id":"area-of-applicability","dir":"Articles","previous_headings":"","what":"Area of Applicability","title":"1. Introduction to CAST","text":"Still required analyse model can applied entire study area locations different predictor properties model learned . See details vignette Area applicability Meyer Pebesma 2021. figure shows grey areas outside area applicability, hence predictions considered locations. 
See tutorial AOA package information.","code":"### AOA for which the spatial CV error applies: AOA <- aoa(predictors_sp,ffsmodel_LLO) plot(prediction_ffs,main=\"prediction for the AOA \\n(spatial CV error applied)\") plot(AOA$AOA,col=c(\"grey\",\"transparent\"),add=T) #spplot(prediction_ffs,main=\"prediction for the AOA \\n(spatial CV error applied)\")+ #spplot(AOA$AOA,col.regions=c(\"grey\",\"transparent\")) ### AOA for which the random CV error applies: AOA_random <- aoa(predictors_sp,model) plot(prediction,main=\"prediction for the AOA \\n(random CV error applied)\") plot(AOA_random$AOA,col=c(\"grey\",\"transparent\"),add=T) #spplot(prediction,main=\"prediction for the AOA \\n(random CV error applied)\")+ #spplot(AOA_random$AOA,col.regions=c(\"grey\",\"transparent\"))"},{"path":"https://hannameyer.github.io/CAST/articles/cast01-CAST-intro-cookfarm.html","id":"conclusions","dir":"Articles","previous_headings":"","what":"Conclusions","title":"1. Introduction to CAST","text":"conclude, tutorial shown CAST can used facilitate target-oriented (: spatial) CV spatial spatio-temporal data crucial obtain meaningful validation results. Using ffs conjunction target-oriented validation, variables can excluded counterproductive view target-oriented performance due misinterpretations algorithm. ffs therefore helps select ideal set predictor variables spatio-temporal prediction tasks gives objective error estimates.","code":""},{"path":"https://hannameyer.github.io/CAST/articles/cast01-CAST-intro-cookfarm.html","id":"final-notes","dir":"Articles","previous_headings":"","what":"Final notes","title":"1. Introduction to CAST","text":"intention tutorial describe motivation led development CAST well functionality. Priority modelling soil moisture cookfarm best possible way provide example motivation functionality CAST can run within minutes. Hence, small subset entire cookfarm dataset used. 
Keep mind due small subset example robust quite different results might obtained depending small changes settings. intention showing motivation CAST also reason coordinates used predictor variables. Though coordinates used predictors quite scientific studies rather provide extreme example misleading variables can lead overfitting.","code":""},{"path":"https://hannameyer.github.io/CAST/articles/cast01-CAST-intro-cookfarm.html","id":"further-reading","dir":"Articles","previous_headings":"","what":"Further reading","title":"1. Introduction to CAST","text":"Meyer, H., & Pebesma, E. (2022): Machine learning-based global maps ecological variables challenge assessing . Nature Communications. Accepted. Meyer, H., & Pebesma, E. (2021). Predicting unknown space? Estimating area applicability spatial prediction models. Methods Ecology Evolution, 12, 1620– 1633. [https://doi.org/10.1111/2041-210X.13650] Meyer H, Reudenbach C, Wöllauer S,Nauss T (2019) Importance spatial predictor variable selection machine learning applications–Moving data reproduction spatial prediction. Ecological Modelling 411: 108815 [https://doi.org/10.1016/j.ecolmodel.2019.108815] Meyer H, Reudenbach C, Hengl T, Katurij M, Nauss T (2018) Improving performance spatio-temporal machine learning models using forward feature selection target-oriented validation. Environmental Modelling & Software 101: 1–9 [https://doi.org/10.1016/j.envsoft.2017.12.001] Talk OpenGeoHub summer school 2019 spatial validation variable selection: https://www.youtube.com/watch?v=mkHlmYEzsVQ. Tutorial (https://youtu./EyP04zLe9qo) Lecture (https://youtu./OoNH6Nl-X2s) recording OpenGeoHub summer school 2020 area applicability. well talk OpenGeoHub summer school 2021: https://av.tib.eu/media/54879","code":""},{"path":"https://hannameyer.github.io/CAST/articles/cast02-AOA-tutorial.html","id":"introduction","dir":"Articles","previous_headings":"","what":"Introduction","title":"2. 
Area of applicability of spatial prediction models","text":"spatial predictive mapping, models often applied make predictions far beyond sampling locations (.e. field observations used map variable even global scale), new locations might considerably differ environmental properties. However, areas predictor space without support training data problematic. model enabled learn relationships environments predictions areas considered highly uncertain. CAST, implement methodology described Meyer&Pebesma (2021) estimate “area applicability” (AOA) (spatial) prediction models. AOA defined area enabled model learn relationships based training data, estimated cross-validation performance holds. delineate AOA, first dissimilarity index (DI) calculated based distances training data multidimensional predictor variable space. account relevance predictor variables responsible prediction patterns weight variables model-derived importance scores prior distance calculation. AOA derived applying threshold based DI observed training data using cross-validation. tutorial shows example estimate area applicability spatial prediction models. information see: Meyer, H., & Pebesma, E. (2021). Predicting unknown space? Estimating area applicability spatial prediction models. Methods Ecology Evolution, 12, 1620– 1633. [https://doi.org/10.1111/2041-210X.13650]","code":""},{"path":"https://hannameyer.github.io/CAST/articles/cast02-AOA-tutorial.html","id":"getting-started","dir":"Articles","previous_headings":"Introduction","what":"Getting started","title":"2. Area of applicability of spatial prediction models","text":"","code":"library(CAST) library(caret) library(terra) library(sf) library(viridis) library(gridExtra)"},{"path":[]},{"path":[]},{"path":"https://hannameyer.github.io/CAST/articles/cast02-AOA-tutorial.html","id":"generate-predictors","dir":"Articles","previous_headings":"Example 1: Using simulated data > Get data","what":"Generate Predictors","title":"2. 
Area of applicability of spatial prediction models","text":"predictor variables, set bioclimatic variables used (https://www.worldclim.org). tutorial, originally downloaded using getData function raster package cropped area central Europe. cropped data provided CAST package.","code":"predictors <- rast(system.file(\"extdata\",\"bioclim.tif\",package=\"CAST\")) plot(predictors,col=viridis(100))"},{"path":"https://hannameyer.github.io/CAST/articles/cast02-AOA-tutorial.html","id":"generate-response","dir":"Articles","previous_headings":"Example 1: Using simulated data > Get data","what":"Generate Response","title":"2. Area of applicability of spatial prediction models","text":"able test reliability method, ’re using simulated prediction task. therefore simulate virtual response variable bioclimatic variables.","code":"generate_random_response <- function(raster, predictornames = names(raster), seed = sample(seq(1000), 1)){ operands_1 = c(\"+\", \"-\", \"*\", \"/\") operands_2 = c(\"^1\",\"^2\") expression <- paste(as.character(predictornames, sep=\"\")) # assign random power to predictors set.seed(seed) expression <- paste(expression, sample(operands_2, length(predictornames), replace = TRUE), sep = \"\") # assign random math function between predictors (expect after the last one) set.seed(seed) expression[-length(expression)] <- paste(expression[- length(expression)], sample(operands_1, length(predictornames)-1, replace = TRUE), sep = \" \") print(paste0(expression, collapse = \" \")) # collapse e = paste0(\"raster$\", expression, collapse = \" \") response = eval(parse(text = e)) names(response) <- \"response\" return(response) } response <- generate_random_response (predictors, seed = 10) ## [1] \"bio2^1 * bio5^1 + bio10^2 - bio13^2 / bio14^2 / bio19^1\" plot(response,col=viridis(100),main=\"virtual 
response\")"},{"path":"https://hannameyer.github.io/CAST/articles/cast02-AOA-tutorial.html","id":"simulate-sampling-locations","dir":"Articles","previous_headings":"Example 1: Using simulated data > Get data","what":"Simulate sampling locations","title":"2. Area of applicability of spatial prediction models","text":"simulate typical prediction task, field sampling locations randomly selected. , randomly select 20 points. Note small data set, used avoid long computation times.","code":"mask <- predictors[[1]] values(mask)[!is.na(values(mask))] <- 1 mask <- st_as_sf(as.polygons(mask)) mask <- st_make_valid(mask) set.seed(15) samplepoints <- st_as_sf(st_sample(mask,20,\"random\")) plot(response,col=viridis(100)) plot(samplepoints,col=\"red\",add=T,pch=3)"},{"path":"https://hannameyer.github.io/CAST/articles/cast02-AOA-tutorial.html","id":"model-training","dir":"Articles","previous_headings":"Example 1: Using simulated data","what":"Model training","title":"2. Area of applicability of spatial prediction models","text":"Next, machine learning algorithm applied learn relationships predictors response.","code":""},{"path":"https://hannameyer.github.io/CAST/articles/cast02-AOA-tutorial.html","id":"prepare-data","dir":"Articles","previous_headings":"Example 1: Using simulated data > Model training","what":"Prepare data","title":"2. Area of applicability of spatial prediction models","text":"Therefore, predictors response extracted sampling locations.","code":"trainDat <- extract(predictors,samplepoints,na.rm=FALSE) trainDat$response <- extract(response,samplepoints,na.rm=FALSE, ID=FALSE)$response trainDat <- na.omit(trainDat)"},{"path":"https://hannameyer.github.io/CAST/articles/cast02-AOA-tutorial.html","id":"train-the-model","dir":"Articles","previous_headings":"Example 1: Using simulated data > Model training","what":"Train the model","title":"2. 
Area of applicability of spatial prediction models","text":"Random Forest applied machine learning algorithm (others can used well, long variable importance returned). model validated default cross-validation estimate prediction error.","code":"set.seed(10) model <- train(trainDat[,names(predictors)], trainDat$response, method=\"rf\", importance=TRUE, trControl = trainControl(method=\"cv\")) print(model) ## Random Forest ## ## 20 samples ## 6 predictor ## ## No pre-processing ## Resampling: Cross-Validated (10 fold) ## Summary of sample sizes: 18, 18, 18, 18, 18, 18, ... ## Resampling results across tuning parameters: ## ## mtry RMSE Rsquared MAE ## 2 3854.481 1 3310.203 ## 4 3084.764 1 2675.126 ## 6 2960.314 1 2571.475 ## ## RMSE was used to select the optimal model using the smallest value. ## The final value used for the model was mtry = 6."},{"path":"https://hannameyer.github.io/CAST/articles/cast02-AOA-tutorial.html","id":"variable-importance","dir":"Articles","previous_headings":"Example 1: Using simulated data > Model training","what":"Variable importance","title":"2. Area of applicability of spatial prediction models","text":"estimation AOA require importance individual predictor variables.","code":"plot(varImp(model,scale = F),col=\"black\")"},{"path":"https://hannameyer.github.io/CAST/articles/cast02-AOA-tutorial.html","id":"predict-and-calculate-error","dir":"Articles","previous_headings":"Example 1: Using simulated data > Model training","what":"Predict and calculate error","title":"2. Area of applicability of spatial prediction models","text":"trained model used make predictions entire area interest. 
Since simulated area-wide response used, ’s possible tutorial compare predictions true reference.","code":"prediction <- predict(predictors,model,na.rm=T) truediff <- abs(prediction-response) plot(rast(list(prediction,response)),main=c(\"prediction\",\"reference\"))"},{"path":"https://hannameyer.github.io/CAST/articles/cast02-AOA-tutorial.html","id":"aoa-calculation","dir":"Articles","previous_headings":"Example 1: Using simulated data","what":"AOA Calculation","title":"2. Area of applicability of spatial prediction models","text":"visualization shows predictions made model. next step, DI AOA calculated. AOA calculation takes model input extract importance predictors, used weights multidimensional distance calculation. Note AOA can also calculated without trained model (.e. using training data new data ). case predictor variables treated equally important (unless weights given form table). Plotting aoa object shows distribution DI values within training data DI new data. output aoa function two raster data: first DI normalized weighted minimum distance nearest training data point divided average distance within training data. AOA derived DI using threshold. threshold (outlier-removed) maximum DI observed training data DI training data calculated considering cross-validation folds. used threshold relevant information training data DI returned parameters list entry. can plot DI well predictions only AOA: patterns DI general agreement true prediction error. high values present Alps, covered training data feature distinct environmental conditions. Since DI values areas threshold, regard area outside AOA.","code":"AOA <- aoa(predictors, model) class(AOA) ## [1] \"aoa\" names(AOA) ## [1] \"parameters\" \"DI\" \"AOA\" print(AOA) ## DI: ## class : SpatRaster ## dimensions : 102, 123, 1 (nrow, ncol, nlyr) ## resolution : 14075.98, 14075.98 (x, y) ## extent : 3496791, 5228136, 2143336, 3579086 (xmin, xmax, ymin, ymax) ## coord. ref. 
: +proj=laea +lat_0=52 +lon_0=10 +x_0=4321000 +y_0=3210000 +ellps=GRS80 +units=m +no_defs ## source(s) : memory ## varname : bioclim ## name : DI ## min value : 0.000000 ## max value : 3.408739 ## AOA: ## class : SpatRaster ## dimensions : 102, 123, 1 (nrow, ncol, nlyr) ## resolution : 14075.98, 14075.98 (x, y) ## extent : 3496791, 5228136, 2143336, 3579086 (xmin, xmax, ymin, ymax) ## coord. ref. : +proj=laea +lat_0=52 +lon_0=10 +x_0=4321000 +y_0=3210000 +ellps=GRS80 +units=m +no_defs ## source(s) : memory ## varname : bioclim ## name : AOA ## min value : 0 ## max value : 1 ## ## ## Predictor Weights: ## bio2 bio5 bio10 bio13 bio14 bio19 ## 1 3.746582 17.92456 17.04888 2.15925 0 0 ## ## ## AOA Threshold: 0.3221291 plot(AOA) plot(truediff,col=viridis(100),main=\"true prediction error\") plot(AOA$DI,col=viridis(100),main=\"DI\") plot(prediction, col=viridis(100),main=\"prediction for AOA\") plot(AOA$AOA,col=c(\"grey\",\"transparent\"),add=T,plg=list(x=\"topleft\",box.col=\"black\",bty=\"o\",title=\"AOA\"))"},{"path":"https://hannameyer.github.io/CAST/articles/cast02-AOA-tutorial.html","id":"aoa-for-spatially-clustered-data","dir":"Articles","previous_headings":"Example 1: Using simulated data","what":"AOA for spatially clustered data?","title":"2. Area of applicability of spatial prediction models","text":"example randomly distributed training samples. However, sampling locations might also highly clustered space. case, random cross-validation meaningful (see e.g. Meyer et al. 2018, Meyer et al. 2019, Valavi et al. 2019, Roberts et al. 2018, Pohjankukka et al. 2017, Brenning 2012) Also threshold AOA reliable, based distance nearest data point within training data (usually small data clustered). Instead, cross-validation based leave-cluster-approach, AOA estimation based distances nearest data point located spatial cluster. show looks like, use 15 spatial locations simulate 5 data points around location. first train model (case) inappropriate random cross-validation. 
…model based leave-cluster-cross-validation. AOA calculated (comparison) using model validated random cross-validation, second taking spatial clusters account calculating threshold based minimum distances nearest training point located cluster. done aoa function, folds used cross-validation automatically extracted model. Note AOA much larger spatial CV approach. However, spatial cross-validation error considerably larger, hence also area error applies larger. random cross-validation performance high, however, area performance applies small. fact also apparent plot aoa objects display distributions DI training data well DI new data. random CV predictionDI larger AOA threshold determined trainDI. Using spatial CV, predictionDI well within DI training samples.","code":"set.seed(25) samplepoints <- clustered_sample(mask,75,15,radius=25000) plot(response,col=viridis(100)) plot(samplepoints,col=\"red\",add=T,pch=3) trainDat <- extract(predictors,samplepoints,na.rm=FALSE) trainDat$response <- extract(response,samplepoints,na.rm=FALSE)$response trainDat <- data.frame(trainDat,samplepoints) trainDat <- na.omit(trainDat) set.seed(10) model_random <- train(trainDat[,names(predictors)], trainDat$response, method=\"rf\", importance=TRUE, trControl = trainControl(method=\"cv\")) prediction_random <- predict(predictors,model_random,na.rm=TRUE) print(model_random) ## Random Forest ## ## 75 samples ## 6 predictor ## ## No pre-processing ## Resampling: Cross-Validated (10 fold) ## Summary of sample sizes: 68, 67, 68, 68, 68, 67, ... ## Resampling results across tuning parameters: ## ## mtry RMSE Rsquared MAE ## 2 1088.1729 0.9956237 790.2191 ## 4 921.1760 0.9968527 717.5578 ## 6 922.1137 0.9967308 715.7016 ## ## RMSE was used to select the optimal model using the smallest value. ## The final value used for the model was mtry = 4. 
folds <- CreateSpacetimeFolds(trainDat, spacevar=\"parent\",k=10) set.seed(15) model <- train(trainDat[,names(predictors)], trainDat$response, method=\"rf\", importance=TRUE, tuneGrid = expand.grid(mtry = c(2:length(names(predictors)))), trControl = trainControl(method=\"cv\",index=folds$index)) print(model) ## Random Forest ## ## 75 samples ## 6 predictor ## ## No pre-processing ## Resampling: Cross-Validated (10 fold) ## Summary of sample sizes: 70, 70, 65, 70, 70, 65, ... ## Resampling results across tuning parameters: ## ## mtry RMSE Rsquared MAE ## 2 3227.421 0.9382904 2740.529 ## 3 2761.092 0.9433621 2396.941 ## 4 2677.002 0.9570317 2349.310 ## 5 2587.598 0.9486190 2282.064 ## 6 2494.756 0.9425158 2190.718 ## ## RMSE was used to select the optimal model using the smallest value. ## The final value used for the model was mtry = 6. prediction <- predict(predictors,model,na.rm=TRUE) AOA_spatial <- aoa(predictors, model) AOA_random <- aoa(predictors, model_random) plot(AOA_spatial$DI,col=viridis(100),main=\"DI\") plot(prediction, col=viridis(100),main=\"prediction for AOA \\n(spatial CV error applies)\") plot(AOA_spatial$AOA,col=c(\"grey\",\"transparent\"),add=TRUE,plg=list(x=\"topleft\",box.col=\"black\",bty=\"o\",title=\"AOA\")) plot(prediction_random, col=viridis(100),main=\"prediction for AOA \\n(random CV error applies)\") plot(AOA_random$AOA,col=c(\"grey\",\"transparent\"),add=TRUE,plg=list(x=\"topleft\",box.col=\"black\",bty=\"o\",title=\"AOA\")) grid.arrange(plot(AOA_spatial) + ggplot2::ggtitle(\"Spatial CV\"), plot(AOA_random) + ggplot2::ggtitle(\"Random CV\"), ncol = 2)"},{"path":"https://hannameyer.github.io/CAST/articles/cast02-AOA-tutorial.html","id":"comparison-prediction-error-with-model-error","dir":"Articles","previous_headings":"Example 1: Using simulated data","what":"Comparison prediction error with model error","title":"2. 
Area of applicability of spatial prediction models","text":"Since used simulated response variable, can now compare prediction error within AOA model error, assuming model error applies inside AOA outside. results indicate high agreement model CV error (RMSE) true prediction RMSE. case , random well spatial model.","code":"###for the spatial CV: RMSE(values(prediction)[values(AOA_spatial$AOA)==1], values(response)[values(AOA_spatial$AOA)==1]) ## [1] 3308.808 RMSE(values(prediction)[values(AOA_spatial$AOA)==0], values(response)[values(AOA_spatial$AOA)==0]) ## [1] 10855.31 model$results ## mtry RMSE Rsquared MAE RMSESD RsquaredSD MAESD ## 1 2 3227.421 0.9382904 2740.529 2335.609 0.06774290 2168.398 ## 2 3 2761.092 0.9433621 2396.941 1823.280 0.07190124 1674.310 ## 3 4 2677.002 0.9570317 2349.310 1690.078 0.04208035 1549.323 ## 4 5 2587.598 0.9486190 2282.064 1595.276 0.05220790 1410.225 ## 5 6 2494.756 0.9425158 2190.718 1507.700 0.07431001 1289.825 ###and for the random CV: RMSE(values(prediction_random)[values(AOA_random$AOA)==1], values(response)[values(AOA_random$AOA)==1]) ## [1] 1365.329 RMSE(values(prediction_random)[values(AOA_random$AOA)==0], values(response)[values(AOA_random$AOA)==0]) ## [1] 3959.685 model_random$results ## mtry RMSE Rsquared MAE RMSESD RsquaredSD MAESD ## 1 2 1088.1729 0.9956237 790.2191 595.2632 0.004567068 407.8754 ## 2 4 921.1760 0.9968527 717.5578 437.1580 0.002792369 311.1915 ## 3 6 922.1137 0.9967308 715.7016 412.0427 0.002498990 306.1030"},{"path":"https://hannameyer.github.io/CAST/articles/cast02-AOA-tutorial.html","id":"relationship-between-the-di-and-the-performance-measure","dir":"Articles","previous_headings":"Example 1: Using simulated data","what":"Relationship between the DI and the performance measure","title":"2. Area of applicability of spatial prediction models","text":"relationship error DI can used limit predictions area (within AOA) required performance (e.g. RMSE, R2, Kappa, Accuracy) applies. 
can done using result DItoErrormetric used relationship analyzed window DI values. corresponding model (: shape constrained additive models default: Monotone increasing P-splines dimension basis used represent smooth term 6 2nd order penalty.) can used estimate performance pixel level, allows limiting predictions using threshold. Note used multi-purpose CV estimate relationship DI RMSE (see details paper).","code":"DI_RMSE_relation <- DItoErrormetric(model, AOA_spatial$parameters, multiCV=TRUE, window.size = 5, length.out = 5) plot(DI_RMSE_relation) expected_RMSE = terra::predict(AOA_spatial$DI, DI_RMSE_relation) # account for multiCV changing the DI threshold updated_AOA = AOA_spatial$DI > attr(DI_RMSE_relation, \"AOA_threshold\") plot(expected_RMSE,col=viridis(100),main=\"expected RMSE\") plot(updated_AOA, col=c(\"grey\",\"transparent\"),add=TRUE,plg=list(x=\"topleft\",box.col=\"black\",bty=\"o\",title=\"AOA\"))"},{"path":"https://hannameyer.github.io/CAST/articles/cast02-AOA-tutorial.html","id":"example-2-a-real-world-example","dir":"Articles","previous_headings":"","what":"Example 2: A real-world example","title":"2. Area of applicability of spatial prediction models","text":"example used simulated data allows analyze reliability AOA. However, simulated area-wide response available usual prediction tasks. Therefore, second example AOA estimated dataset point observations reference .","code":""},{"path":"https://hannameyer.github.io/CAST/articles/cast02-AOA-tutorial.html","id":"data-and-preprocessing","dir":"Articles","previous_headings":"Example 2: A real-world example","what":"Data and preprocessing","title":"2. Area of applicability of spatial prediction models","text":", work cookfarm dataset, described e.g. Gasch et al 2015. dataset included CAST re-structured dataset. Find details also vignette “Introduction CAST”. use soil moisture (VW) response variable . 
Hence, ’re aiming making spatial continuous prediction based limited measurements data loggers.","code":"dat <- readRDS(system.file(\"extdata\",\"Cookfarm.RDS\",package=\"CAST\")) # calculate average of VW for each sampling site: dat <- aggregate(dat[,c(\"VW\",\"Easting\",\"Northing\")],by=list(as.character(dat$SOURCEID)),mean) # create sf object from the data: pts <- st_as_sf(dat,coords=c(\"Easting\",\"Northing\")) ##### Extract Predictors for the locations of the sampling points studyArea <- rast(system.file(\"extdata\",\"predictors_2012-03-25.tif\",package=\"CAST\")) st_crs(pts) <- crs(studyArea) trainDat <- extract(studyArea,pts,na.rm=FALSE) pts$ID <- 1:nrow(pts) trainDat <- merge(trainDat,pts,by.x=\"ID\",by.y=\"ID\") # The final training dataset with potential predictors and VW: head(trainDat) ## ID DEM TWI BLD NDRE.M NDRE.Sd Bt Easting Northing ## 1 1 788.1906 4.304258 1.42 -0.051189531 0.2506899 0.0000 493384 5180587 ## 2 2 788.3813 3.863605 1.29 -0.046459336 0.1754623 0.0000 493514 5180567 ## 3 3 790.5244 3.947488 1.36 -0.040845532 0.2225785 0.0000 493574 5180577 ## 4 4 775.7229 5.395786 1.55 -0.004329725 0.2099845 0.0501 493244 5180587 ## 5 5 796.7618 3.534822 1.31 0.027252737 0.2002646 0.0000 493624 5180607 ## 6 6 795.8370 3.815516 1.40 -0.123434804 0.2180606 0.0000 493694 5180607 ## MinT_wrcc MaxT_wrcc Precip_cum cday Precip_wrcc Group.1 VW ## 1 1.1 36.2 10.6 15425 0 CAF003 0.2894505 ## 2 1.1 36.2 10.6 15425 0 CAF007 0.2705531 ## 3 1.1 36.2 10.6 15425 0 CAF009 0.2629683 ## 4 1.1 36.2 10.6 15425 0 CAF019 0.2993580 ## 5 1.1 36.2 10.6 15425 0 CAF031 0.2664754 ## 6 1.1 36.2 10.6 15425 0 CAF033 0.2650177 ## geometry ## 1 POINT (493383.1 5180586) ## 2 POINT (493510.7 5180568) ## 3 POINT (493574.6 5180573) ## 4 POINT (493246.6 5180590) ## 5 POINT (493628.3 5180612) ## 6 POINT (493692.2 5180610)"},{"path":"https://hannameyer.github.io/CAST/articles/cast02-AOA-tutorial.html","id":"model-training-and-prediction","dir":"Articles","previous_headings":"Example 2: A 
real-world example","what":"Model training and prediction","title":"2. Area of applicability of spatial prediction models","text":"set variables used predictors VW random Forest model. model validated leave one cross-validation. Note model performance low, due small dataset used (small dataset low ability predictors model VW).","code":"predictors <- c(\"DEM\",\"NDRE.Sd\",\"TWI\",\"Bt\") response <- \"VW\" model <- train(trainDat[,predictors],trainDat[,response], method=\"rf\",tuneLength=3,importance=TRUE, trControl=trainControl(method=\"LOOCV\")) model ## Random Forest ## ## 42 samples ## 4 predictor ## ## No pre-processing ## Resampling: Leave-One-Out Cross-Validation ## Summary of sample sizes: 41, 41, 41, 41, 41, 41, ... ## Resampling results across tuning parameters: ## ## mtry RMSE Rsquared MAE ## 2 0.04049575 0.01826180 0.03233088 ## 3 0.04100862 0.02199224 0.03305649 ## 4 0.04153769 0.01562694 0.03340031 ## ## RMSE was used to select the optimal model using the smallest value. ## The final value used for the model was mtry = 2."},{"path":"https://hannameyer.github.io/CAST/articles/cast02-AOA-tutorial.html","id":"prediction","dir":"Articles","previous_headings":"Example 2: A real-world example > Model training and prediction","what":"Prediction","title":"2. Area of applicability of spatial prediction models","text":"Next, model used make predictions entire study area.","code":"#Predictors: plot(stretch(studyArea[[predictors]])) #prediction: prediction <- predict(studyArea,model,na.rm=TRUE)"},{"path":"https://hannameyer.github.io/CAST/articles/cast02-AOA-tutorial.html","id":"aoa-estimation","dir":"Articles","previous_headings":"Example 2: A real-world example","what":"AOA estimation","title":"2. Area of applicability of spatial prediction models","text":"Next ’re limiting predictions AOA. 
Predictions outside AOA excluded.","code":"AOA <- aoa(studyArea,model) #### Plot results: plot(AOA$DI,col=viridis(100),main=\"DI with sampling locations (red)\") plot(pts,zcol=\"ID\",col=\"red\",add=TRUE) plot(prediction, col=viridis(100),main=\"prediction for AOA \\n(LOOCV error applies)\") plot(AOA$AOA,col=c(\"grey\",\"transparent\"),add=TRUE,plg=list(x=\"topleft\",box.col=\"black\",bty=\"o\",title=\"AOA\"))"},{"path":"https://hannameyer.github.io/CAST/articles/cast02-AOA-tutorial.html","id":"final-notes","dir":"Articles","previous_headings":"","what":"Final notes","title":"2. Area of applicability of spatial prediction models","text":"AOA estimated based training data new data (i.e. raster group entire area interest). trained model used getting variable importance needed weight predictor variables. can given table either, approach can used packages caret well. Knowledge AOA important predictions used baseline decision making subsequent environmental modelling. suggest AOA provided alongside prediction map complementary communication validation performances.","code":""},{"path":"https://hannameyer.github.io/CAST/articles/cast02-AOA-tutorial.html","id":"further-reading","dir":"Articles","previous_headings":"Final notes","what":"Further reading","title":"2. Area of applicability of spatial prediction models","text":"Meyer, H., & Pebesma, E. (2022): Machine learning-based global maps of ecological variables and the challenge of assessing them. Nature Communications 13, 2208. Meyer, H., & Pebesma, E. (2021). Predicting into unknown space? Estimating the area of applicability of spatial prediction models. Methods in Ecology and Evolution, 12, 1620–1633. [https://doi.org/10.1111/2041-210X.13650] Tutorial (https://youtu.be/EyP04zLe9qo) Lecture (https://youtu.be/OoNH6Nl-X2s) recording OpenGeoHub summer school 2020 area applicability. 
well talk OpenGeoHub summer school 2021: https://av.tib.eu/media/54879","code":""},{"path":"https://hannameyer.github.io/CAST/articles/cast03-AOA-parallel.html","id":"generate-example-data","dir":"Articles","previous_headings":"","what":"Generate Example Data","title":"3. AOA in Parallel","text":"","code":"library(CAST) library(caret) library(terra) library(sf) data(\"splotdata\") predictors <- rast(system.file(\"extdata\",\"predictors_chile.tif\",package=\"CAST\")) splotdata <- st_drop_geometry(splotdata) set.seed(10) model_random <- train(splotdata[,names(predictors)], splotdata$Species_richness, method=\"rf\", importance=TRUE, ntrees = 50, trControl = trainControl(method=\"cv\")) prediction_random <- predict(predictors,model_random,na.rm=TRUE)"},{"path":"https://hannameyer.github.io/CAST/articles/cast03-AOA-parallel.html","id":"parallel-aoa-by-dividing-the-new-data","dir":"Articles","previous_headings":"","what":"Parallel AOA by dividing the new data","title":"3. AOA in Parallel","text":"better performances, recommended compute AOA two steps. First, DI training data resulting DI threshold computed model training data function trainDI. result trainDI usually first step aoa function, however can skipped providing trainDI object function call. makes possible compute AOA multiple raster tiles (e.g. different cores). especially useful large prediction areas, e.g. global mapping. large raster, divide multiple smaller tiles apply trainDI object afterwards tile. Use trainDI argument aoa function specify, want use previously computed trainDI object. can now run aoa function parallel different tiles! course can use favorite parallel backend task, use mclapply parallel package. 
larger tasks might useful save tiles hard-drive load one one avoid filling RAM.","code":"model_random_trainDI = trainDI(model_random) print(model_random_trainDI) ## DI of 703 observation ## Predictors: bio_1 bio_4 bio_5 bio_6 bio_8 bio_9 bio_12 bio_13 bio_14 bio_15 elev ## ## AOA Threshold: 0.1941761 saveRDS(model_random_trainDI, \"path/to/file\") r1 = crop(predictors, c(-75.66667, -67, -30, -17.58333)) r2 = crop(predictors, c(-75.66667, -67, -45, -30)) r3 = crop(predictors, c(-75.66667, -67, -55.58333, -45)) plot(r1[[1]],main = \"Tile 1\") plot(r2[[1]],main = \"Tile 2\") plot(r3[[1]],main = \"Tile 3\") aoa_r1 = aoa(newdata = r1, trainDI = model_random_trainDI) plot(r1[[1]], main = \"Tile 1: Predictors\") plot(aoa_r1$DI, main = \"Tile 1: DI\") plot(aoa_r1$AOA, main = \"Tile 1: AOA\") library(parallel) tiles_aoa = mclapply(list(r1, r2, r3), function(tile){ aoa(newdata = tile, trainDI = model_random_trainDI) }, mc.cores = 3) plot(tiles_aoa[[1]]$AOA, main = \"Tile 1\") plot(tiles_aoa[[2]]$AOA, main = \"Tile 2\") plot(tiles_aoa[[3]]$AOA, main = \"Tile 3\") # Simple Example Code for raster tiles on the hard drive tiles = list.files(\"path/to/tiles\", full.names = TRUE) tiles_aoa = mclapply(tiles, function(tile){ current = terra::rast(tile) aoa(newdata = current, trainDI = model_random_trainDI) }, mc.cores = 3)"},{"path":"https://hannameyer.github.io/CAST/articles/cast04-plotgeodist.html","id":"introduction","dir":"Articles","previous_headings":"","what":"Introduction","title":"4. Visualization of nearest neighbor distance distributions","text":"tutorial shows euclidean nearest neighbor distances geographic space feature space can calculated visualized using CAST. type visualization allows assess whether training data feature representative coverage prediction area cross-validation (CV) folds (independent test data) adequately chosen representative prediction locations. See e.g. Meyer Pebesma (2022) Milà et al. 
(2022) discussion topic.","code":""},{"path":"https://hannameyer.github.io/CAST/articles/cast04-plotgeodist.html","id":"sample-data","dir":"Articles","previous_headings":"","what":"Sample data","title":"4. Visualization of nearest neighbor distance distributions","text":"example data, use two different sets global virtual reference data: One spatial random sample second example, reference data clustered geographic space (see Meyer Pebesma (2022) discussions ). can define parameters run example different settings","code":"library(CAST) library(caret) library(terra) library(sf) library(rnaturalearth) library(ggplot2) seed <- 10 # random realization samplesize <- 300 # how many samples will be used? nparents <- 20 #For clustered samples: How many clusters? radius <- 500000 # For clustered samples: What is the radius of a cluster?"},{"path":"https://hannameyer.github.io/CAST/articles/cast04-plotgeodist.html","id":"prediction-area","dir":"Articles","previous_headings":"Sample data","what":"Prediction area","title":"4. Visualization of nearest neighbor distance distributions","text":"prediction area entire global land area, .e. imagine prediction task aim making global predictions based set reference data.","code":"ee <- st_crs(\"+proj=eqearth\") co <- ne_countries(returnclass = \"sf\") co.ee <- st_transform(co, ee)"},{"path":"https://hannameyer.github.io/CAST/articles/cast04-plotgeodist.html","id":"spatial-random-sample","dir":"Articles","previous_headings":"Sample data","what":"Spatial random sample","title":"4. 
Visualization of nearest neighbor distance distributions","text":", simulate random sample visualize data entire global prediction area.","code":"sf_use_s2(FALSE) set.seed(seed) pts_random <- st_sample(co.ee, samplesize) ### See points on the map: ggplot() + geom_sf(data = co.ee, fill=\"#00BFC4\",col=\"#00BFC4\") + geom_sf(data = pts_random, color = \"#F8766D\",size=0.5, shape=3) + guides(fill = \"none\", col = \"none\") + labs(x = NULL, y = NULL)"},{"path":"https://hannameyer.github.io/CAST/articles/cast04-plotgeodist.html","id":"clustered-sample","dir":"Articles","previous_headings":"Sample data","what":"Clustered sample","title":"4. Visualization of nearest neighbor distance distributions","text":"second data set use clustered design size.","code":"set.seed(seed) sf_use_s2(FALSE) pts_clustered <- clustered_sample(co.ee, samplesize, nparents, radius) ggplot() + geom_sf(data = co.ee, fill=\"#00BFC4\",col=\"#00BFC4\") + geom_sf(data = pts_clustered, color = \"#F8766D\",size=0.5, shape=3) + guides(fill = \"none\", col = \"none\") + labs(x = NULL, y = NULL)"},{"path":"https://hannameyer.github.io/CAST/articles/cast04-plotgeodist.html","id":"distances-in-geographic-space","dir":"Articles","previous_headings":"","what":"Distances in geographic space","title":"4. Visualization of nearest neighbor distance distributions","text":"can plot distributions spatial distances reference data nearest neighbor (“sample--sample”) distribution distances points global land surface nearest reference data point (“sample--prediction”). Note samples prediction locations used calculate sample--prediction nearest neighbor distances. Since ’re using global case study , throughout tutorial use sampling=Fibonacci draw prediction locations constant point density sphere. Note random data set nearest neighbor distance distribution training data quasi identical nearest neighbor distance distribution prediction area. 
comparison, second data set number training data heavily clustered geographic space. therefore see nearest neighbor distances within reference data rather small. Prediction locations, however, average much away.","code":"dist_random <- geodist(pts_random,co.ee, sampling=\"Fibonacci\") dist_clstr <- geodist(pts_clustered,co.ee, sampling=\"Fibonacci\") plot(dist_random, unit = \"km\")+scale_x_log10(labels=round)+ggtitle(\"Randomly distributed reference data\") plot(dist_clstr, unit = \"km\")+scale_x_log10(labels=round)+ggtitle(\"Clustered reference data\")"},{"path":[]},{"path":"https://hannameyer.github.io/CAST/articles/cast04-plotgeodist.html","id":"random-cross-validation","dir":"Articles","previous_headings":"Distances in geographic space > Accounting for cross-validation folds","what":"Random Cross-validation","title":"4. Visualization of nearest neighbor distance distributions","text":"Let’s use clustered data set show distribution spatial nearest neighbor distances cross-validation can visualized well. Therefore, first use “default” way random 10-fold cross validation randomly split reference data training test (see Meyer et al., 2018 2019 see might good idea). Obviously CV folds representative prediction locations (least terms distance nearest training data point). .e. folds used performance assessment model, can expect overly optimistic estimates validate predictions close proximity reference data.","code":"randomfolds <- caret::createFolds(1:nrow(pts_clustered)) dist_clstr <- geodist(pts_clustered,co.ee, sampling=\"Fibonacci\", cvfolds= randomfolds) plot(dist_clstr, unit = \"km\")+scale_x_log10(labels=round)"},{"path":"https://hannameyer.github.io/CAST/articles/cast04-plotgeodist.html","id":"spatial-cross-validation","dir":"Articles","previous_headings":"Distances in geographic space > Accounting for cross-validation folds","what":"Spatial Cross-validation","title":"4. 
Visualization of nearest neighbor distance distributions","text":", however, case CV performance regarded representative prediction task. Therefore, use spatial CV instead. , use leave-cluster-CV, means iteration, one spatial clusters held back. See fits nearest neighbor distribution prediction area much better. Note geodist also allows inspecting independent test data instead cross validation folds. See ?geodist ?plot.geodist.","code":"spatialfolds <- CreateSpacetimeFolds(pts_clustered,spacevar=\"parent\",k=length(unique(pts_clustered$parent))) dist_clstr <- geodist(pts_clustered,co.ee, sampling=\"Fibonacci\", cvfolds= spatialfolds$indexOut) plot(dist_clstr, unit = \"km\")+scale_x_log10(labels=round)"},{"path":"https://hannameyer.github.io/CAST/articles/cast04-plotgeodist.html","id":"why-has-spatial-cv-sometimes-blamed-for-being-too-pessimistic","dir":"Articles","previous_headings":"Distances in geographic space > Accounting for cross-validation folds","what":"Why has spatial CV sometimes blamed for being too pessimistic ?","title":"4. Visualization of nearest neighbor distance distributions","text":"Recently, Wadoux et al. (2021) published paper title “Spatial cross-validation right way evaluate map accuracy” state “spatial cross-validation strategies resulted grossly pessimistic map accuracy assessment”. come conclusion? reference data used study either regularly, random comparably mildly clustered geographic space, applied spatial CV strategies held large spatial units back CV. can see happens apply spatial CV randomly distributed reference data. see nearest neighbor distances cross-validation don’t match nearest neighbor distances prediction. compared section , time cross-validation folds far away reference data. Naturally end overly pessimistic performance estimates make prediction situations cross-validation harder, compared required model application entire area interest (global). 
spatial CV chosen therefore suitable prediction task, prediction situations created CV resemble encountered prediction.","code":"# create a spatial CV for the randomly distributed data. Here: # \"leave region-out-CV\" sf_use_s2(FALSE) pts_random_co <- st_join(st_as_sf(pts_random),co.ee) ggplot() + geom_sf(data = co.ee, fill=\"#00BFC4\",col=\"#00BFC4\") + geom_sf(data = pts_random_co, aes(color=subregion),size=0.5, shape=3) + scale_color_manual(values=rainbow(length(unique(pts_random_co$subregion))))+ guides(fill = FALSE, col = FALSE) + labs(x = NULL, y = NULL)+ ggtitle(\"spatial fold membership by color\") spfolds_rand <- CreateSpacetimeFolds(pts_random_co,spacevar = \"subregion\", k=length(unique(pts_random_co$subregion))) dist_rand_sp <- geodist(pts_random_co,co.ee, sampling=\"Fibonacci\", cvfolds= spfolds_rand$indexOut) plot(dist_rand_sp, unit = \"km\")+scale_x_log10(labels=round)"},{"path":"https://hannameyer.github.io/CAST/articles/cast04-plotgeodist.html","id":"nearest-neighbour-distance-matching-cv","dir":"Articles","previous_headings":"Distances in geographic space > Accounting for cross-validation folds","what":"Nearest Neighbour Distance Matching CV","title":"4. Visualization of nearest neighbor distance distributions","text":"good way approximate geographical prediction distances CV use Nearest Neighbour Distance Matching (NNDM) CV (see Milà et al., 2022 details). NNDM CV variation LOO CV empirical distribution function nearest neighbour distances found prediction matched CV process. NNDM CV-distance distribution matches sample--prediction distribution well. happens use NNDM CV randomly-distributed sampling points instead? 
NNDM CV-distance still matches sample--prediction distance function.","code":"nndmfolds_clstr <- nndm(pts_clustered, modeldomain=co.ee, samplesize = 2000) dist_clstr <- geodist(pts_clustered,co.ee, sampling = \"Fibonacci\", cvfolds = nndmfolds_clstr$indx_test, cvtrain = nndmfolds_clstr$indx_train) plot(dist_clstr, unit = \"km\")+scale_x_log10(labels=round) nndmfolds_rand <- nndm(pts_random_co, modeldomain=co.ee, samplesize = 2000) dist_rand <- geodist(pts_random_co,co.ee, sampling = \"Fibonacci\", cvfolds = nndmfolds_rand$indx_test, cvtrain = nndmfolds_rand$indx_train) plot(dist_rand, unit = \"km\")+scale_x_log10(labels=round)"},{"path":"https://hannameyer.github.io/CAST/articles/cast04-plotgeodist.html","id":"k-fold-nearest-neighbour-distance-matching-cv","dir":"Articles","previous_headings":"Distances in geographic space > Accounting for cross-validation folds","what":"k-fold Nearest Neighbour Distance Matching CV","title":"4. Visualization of nearest neighbor distance distributions","text":"Since NNDM CV highly time consuming, k-fold version may provide good trade-. 
See Linnenbrink et al., 2023 for details.","code":"knndmfolds_clstr <- knndm(pts_clustered, modeldomain=co.ee, samplesize = 2000) pts_clustered$knndmCV <- as.character(knndmfolds_clstr$clusters) ggplot() + geom_sf(data = co.ee, fill=\"#00BFC4\",col=\"#00BFC4\") + geom_sf(data = pts_clustered, aes(color=knndmCV),size=0.5, shape=3) + scale_color_manual(values=rainbow(length(unique(pts_clustered$knndmCV))))+ guides(fill = \"none\", col = \"none\") + labs(x = NULL, y = NULL)+ ggtitle(\"spatial fold membership by color\") dist_clstr <- geodist(pts_clustered,co.ee, sampling = \"Fibonacci\", cvfolds = knndmfolds_clstr$indx_test, cvtrain = knndmfolds_clstr$indx_train) plot(dist_clstr, unit = \"km\")+scale_x_log10(labels=round)"},{"path":"https://hannameyer.github.io/CAST/articles/cast04-plotgeodist.html","id":"distances-in-feature-space","dir":"Articles","previous_headings":"","what":"Distances in feature space","title":"4. Visualization of nearest neighbor distance distributions","text":"far compared nearest neighbor distances geographic space. can also feature space. Therefore, set bioclimatic variables used (https://www.worldclim.org) features (i.e. predictors) virtual prediction task. visualize nearest neighbor feature space distances consideration cross-validation. regard chosen predictor variables see nearest neighbor distance clustered training data rather small, compared required prediction. 
random CV is not representative of prediction locations, while spatial CV does a better job.","code":"predictors_global <- rast(system.file(\"extdata\",\"bioclim_global.tif\",package=\"CAST\")) plot(predictors_global) # use random CV: dist_clstr_rCV <- geodist(pts_clustered,predictors_global, type = \"feature\", sampling=\"Fibonacci\", cvfolds = randomfolds) # use spatial CV: dist_clstr_sCV <- geodist(pts_clustered,predictors_global, type = \"feature\", sampling=\"Fibonacci\", cvfolds = spatialfolds$indexOut) # Plot results: plot(dist_clstr_rCV)+scale_x_log10()+ggtitle(\"Clustered reference data and random CV\") plot(dist_clstr_sCV)+scale_x_log10()+ggtitle(\"Clustered reference data and spatial CV\")"},{"path":"https://hannameyer.github.io/CAST/articles/cast04-plotgeodist.html","id":"references","dir":"Articles","previous_headings":"Distances in feature space","what":"References","title":"4. Visualization of nearest neighbor distance distributions","text":"Meyer, H., Pebesma, E. (2022): Machine learning-based global maps of ecological variables and the challenge of assessing them. Nature Communications 13, 2208. https://doi.org/10.1038/s41467-022-29838-9 Milà, C., Mateu, J., Pebesma, E., Meyer, H. (2022): Nearest Neighbour Distance Matching Leave-One-Out Cross-Validation for map validation. Methods in Ecology and Evolution 00, 1–13. https://doi.org/10.1111/2041-210X.13851. Linnenbrink, J., Milà, C., Ludwig, M., Meyer, H. (2023): kNNDM: k-fold Nearest Neighbour Distance Matching Cross-Validation for map accuracy estimation, EGUsphere [preprint], https://doi.org/10.5194/egusphere-2023-1308.","code":""},{"path":"https://hannameyer.github.io/CAST/authors.html","id":null,"dir":"","previous_headings":"","what":"Authors","title":"Authors and Citation","text":"Hanna Meyer. Maintainer, author. Carles Milà. Author. Marvin Ludwig. Author. Jan Linnenbrink. Author. Philipp Otto. Contributor. Chris Reudenbach. Contributor. Thomas Nauss. Contributor. Edzer Pebesma. 
Contributor.","code":""},{"path":"https://hannameyer.github.io/CAST/authors.html","id":"citation","dir":"","previous_headings":"","what":"Citation","title":"Authors and Citation","text":"Meyer H, Milà C, Ludwig M, Linnenbrink J (2024). CAST: 'caret' Applications Spatial-Temporal Models. R package version 0.9.0, https://hannameyer.github.io/CAST/, https://github.com/HannaMeyer/CAST.","code":"@Manual{, title = {CAST: 'caret' Applications for Spatial-Temporal Models}, author = {Hanna Meyer and Carles Milà and Marvin Ludwig and Jan Linnenbrink}, year = {2024}, note = {R package version 0.9.0, https://hannameyer.github.io/CAST/}, url = {https://github.com/HannaMeyer/CAST}, }"},{"path":"https://hannameyer.github.io/CAST/index.html","id":"cast-caret-applications-for-spatio-temporal-models","dir":"","previous_headings":"","what":"caret Applications for Spatial-Temporal Models","title":"caret Applications for Spatial-Temporal Models","text":"Supporting functionality run ‘caret’ spatial spatial-temporal data. ‘caret’ frequently used package model training prediction using machine learning. CAST includes functions improve spatial spatial-temporal modelling tasks using ‘caret’. decrease spatial overfitting improve model performances, package implements forward feature selection selects suitable predictor variables view contribution spatial spatio-temporal model performance. CAST includes functionality estimate (spatial) area applicability prediction models. Note: developer version CAST can found https://github.com/HannaMeyer/CAST. 
CRAN Version can found https://CRAN.R-project.org/package=CAST","code":""},{"path":"https://hannameyer.github.io/CAST/index.html","id":"package-website","dir":"","previous_headings":"","what":"Package Website","title":"caret Applications for Spatial-Temporal Models","text":"https://hannameyer.github.io/CAST/","code":""},{"path":"https://hannameyer.github.io/CAST/index.html","id":"tutorials","dir":"","previous_headings":"","what":"Tutorials","title":"caret Applications for Spatial-Temporal Models","text":"Introduction CAST Area applicability spatial prediction models Area applicability parallel Visualization nearest neighbor distance distributions talk OpenGeoHub summer school 2019 spatial validation variable selection: https://www.youtube.com/watch?v=mkHlmYEzsVQ. Tutorial (https://youtu.be/EyP04zLe9qo) Lecture (https://youtu.be/OoNH6Nl-X2s) recording OpenGeoHub summer school 2020 area applicability. well talk OpenGeoHub summer school 2021: https://av.tib.eu/media/54879 Talk tutorial OpenGeoHub 2022 summer school Machine learning-based maps environment - challenges extrapolation overfitting, including discussions area applicability nearest neighbor distance matching cross-validation (https://doi.org/10.5446/59412).","code":""},{"path":[]},{"path":"https://hannameyer.github.io/CAST/index.html","id":"spatial-cross-validation","dir":"","previous_headings":"Scientific documentation of the methods","what":"Spatial cross-validation","title":"caret Applications for Spatial-Temporal Models","text":"Milà, C., Mateu, J., Pebesma, E., Meyer, H. (2022): Nearest Neighbour Distance Matching Leave-One-Out Cross-Validation for map validation. Methods in Ecology and Evolution 00, 1–13. https://doi.org/10.1111/2041-210X.13851 Linnenbrink, J., Milà, C., Ludwig, M., Meyer, H. (2023): kNNDM: k-fold Nearest Neighbour Distance Matching Cross-Validation for map accuracy estimation. EGUsphere [preprint]. 
https://doi.org/10.5194/egusphere-2023-1308","code":""},{"path":"https://hannameyer.github.io/CAST/index.html","id":"spatial-variable-selection","dir":"","previous_headings":"Scientific documentation of the methods","what":"Spatial variable selection","title":"caret Applications for Spatial-Temporal Models","text":"Meyer, H., Reudenbach, C., Hengl, T., Katurji, M., Nauss, T. (2018): Improving performance of spatio-temporal machine learning models using forward feature selection and target-oriented validation. Environmental Modelling & Software, 101, 1-9. https://doi.org/10.1016/j.envsoft.2017.12.001 Meyer, H., Reudenbach, C., Wöllauer, S., Nauss, T. (2019): Importance of spatial predictor variable selection in machine learning applications - Moving from data reproduction to spatial prediction. Ecological Modelling. 411. https://doi.org/10.1016/j.ecolmodel.2019.108815","code":""},{"path":"https://hannameyer.github.io/CAST/index.html","id":"area-of-applicability","dir":"","previous_headings":"Scientific documentation of the methods","what":"Area of applicability","title":"caret Applications for Spatial-Temporal Models","text":"Meyer, H., Pebesma, E. (2021). Predicting into unknown space? Estimating the area of applicability of spatial prediction models. Methods in Ecology and Evolution, 12, 1620–1633. https://doi.org/10.1111/2041-210X.13650","code":""},{"path":"https://hannameyer.github.io/CAST/index.html","id":"applications-and-use-cases","dir":"","previous_headings":"Scientific documentation of the methods","what":"Applications and use cases","title":"caret Applications for Spatial-Temporal Models","text":"Meyer, H., Pebesma, E. (2022): Machine learning-based global maps of ecological variables and the challenge of assessing them. Nature Communications, 13. https://www.nature.com/articles/s41467-022-29838-9 Ludwig, M., Moreno-Martinez, A., Hoelzel, N., Pebesma, E., Meyer, H. (2023): Assessing and improving the transferability of current global spatial prediction models. Global Ecology and Biogeography. 
https://doi.org/10.1111/geb.13635.","code":""},{"path":"https://hannameyer.github.io/CAST/reference/CAST.html","id":null,"dir":"Reference","previous_headings":"","what":"'caret' Applications for Spatial-Temporal Models — CAST","title":"'caret' Applications for Spatial-Temporal Models — CAST","text":"Supporting functionality run 'caret' spatial spatial-temporal data. 'caret' frequently used package model training prediction using machine learning. CAST includes functions improve spatial-temporal modelling tasks using 'caret'. includes newly suggested 'Nearest neighbor distance matching' cross-validation estimate performance spatial prediction models allows spatial variable selection selects suitable predictor variables view contribution spatial model performance. CAST includes functionality estimate (spatial) area applicability prediction models analysing similarity new data training data. Methods described Meyer et al. (2018); Meyer et al. (2019); Meyer Pebesma (2021); Milà et al. (2022); Meyer Pebesma (2022).","code":""},{"path":"https://hannameyer.github.io/CAST/reference/CAST.html","id":"details","dir":"Reference","previous_headings":"","what":"Details","title":"'caret' Applications for Spatial-Temporal Models — CAST","text":"'caret' Applications Spatio-Temporal models","code":""},{"path":"https://hannameyer.github.io/CAST/reference/CAST.html","id":"references","dir":"Reference","previous_headings":"","what":"References","title":"'caret' Applications for Spatial-Temporal Models — CAST","text":"Linnenbrink, J., Milà, C., Ludwig, M., Meyer, H.: kNNDM: k-fold Nearest Neighbour Distance Matching Cross-Validation map accuracy estimation, EGUsphere [preprint], https://doi.org/10.5194/egusphere-2023-1308, 2023. Milà, C., Mateu, J., Pebesma, E., Meyer, H. (2022): Nearest Neighbour Distance Matching Leave-One-Cross-Validation map validation. Methods Ecology Evolution 00, 1– 13. Meyer, H., Pebesma, E. 
(2022): Machine learning-based global maps ecological variables challenge assessing them. Nature Communications. 13. Meyer, H., Pebesma, E. (2021): Predicting unknown space? Estimating area applicability spatial prediction models. Methods Ecology Evolution. 12, 1620–1633. Meyer, H., Reudenbach, C., Wöllauer, S., Nauss, T. (2019): Importance spatial predictor variable selection machine learning applications - Moving data reproduction spatial prediction. Ecological Modelling. 411, 108815. Meyer, H., Reudenbach, C., Hengl, T., Katurji, M., Nauß, T. (2018): Improving performance spatio-temporal machine learning models using forward feature selection target-oriented validation. Environmental Modelling & Software 101: 1-9.","code":""},{"path":"https://hannameyer.github.io/CAST/reference/CAST.html","id":"author","dir":"Reference","previous_headings":"","what":"Author","title":"'caret' Applications for Spatial-Temporal Models — CAST","text":"Hanna Meyer, Carles Milà, Marvin Ludwig, Jan Linnenbrink","code":""},{"path":"https://hannameyer.github.io/CAST/reference/CreateSpacetimeFolds.html","id":null,"dir":"Reference","previous_headings":"","what":"Create Space-time Folds — CreateSpacetimeFolds","title":"Create Space-time Folds — CreateSpacetimeFolds","text":"Create spatial, temporal spatio-temporal Folds cross validation based pre-defined groups","code":""},{"path":"https://hannameyer.github.io/CAST/reference/CreateSpacetimeFolds.html","id":"ref-usage","dir":"Reference","previous_headings":"","what":"Usage","title":"Create Space-time Folds — CreateSpacetimeFolds","text":"","code":"CreateSpacetimeFolds( x, spacevar = NA, timevar = NA, k = 10, class = NA, seed = sample(1:1000, 1) )"},{"path":"https://hannameyer.github.io/CAST/reference/CreateSpacetimeFolds.html","id":"arguments","dir":"Reference","previous_headings":"","what":"Arguments","title":"Create Space-time Folds — CreateSpacetimeFolds","text":"x data.frame containing spatio-temporal data spacevar Character indicating 
column x identifies spatial units (e.g. ID weather stations) timevar Character indicating column x identifies temporal units (e.g. day year) k numeric. Number folds. spacevar timevar NA leave one location leave one time step cv performed, set k number unique spatial temporal units. class Character indicating column x identifies class unit (e.g. land cover) seed numeric. See ?set.seed","code":""},{"path":"https://hannameyer.github.io/CAST/reference/CreateSpacetimeFolds.html","id":"value","dir":"Reference","previous_headings":"","what":"Value","title":"Create Space-time Folds — CreateSpacetimeFolds","text":"list contains list model training list model validation can directly used \"index\" \"indexOut\" caret's trainControl function","code":""},{"path":"https://hannameyer.github.io/CAST/reference/CreateSpacetimeFolds.html","id":"details","dir":"Reference","previous_headings":"","what":"Details","title":"Create Space-time Folds — CreateSpacetimeFolds","text":"function creates train test sets taking (spatial and/or temporal) groups account. contrast nndm, requires groups already defined (e.g. spatial clusters blocks temporal units). Using \"class\" helpful case data clustered space categorical. E.g. case land cover classifications training data come training polygons. case data split way entire polygons held back (spacevar=\"polygonID\") time distribution classes similar fold (class=\"LUC\").","code":""},{"path":"https://hannameyer.github.io/CAST/reference/CreateSpacetimeFolds.html","id":"note","dir":"Reference","previous_headings":"","what":"Note","title":"Create Space-time Folds — CreateSpacetimeFolds","text":"Standard k-fold cross-validation can lead considerable misinterpretation spatial-temporal modelling tasks. function can used prepare Leave-Location-Out, Leave-Time-Out Leave-Location-and-Time-Out cross-validation target-oriented validation strategies spatial-temporal prediction tasks. See Meyer et al. 
(2018) information.","code":""},{"path":"https://hannameyer.github.io/CAST/reference/CreateSpacetimeFolds.html","id":"references","dir":"Reference","previous_headings":"","what":"References","title":"Create Space-time Folds — CreateSpacetimeFolds","text":"Meyer, H., Reudenbach, C., Hengl, T., Katurji, M., Nauß, T. (2018): Improving performance spatio-temporal machine learning models using forward feature selection target-oriented validation. Environmental Modelling & Software 101: 1-9.","code":""},{"path":[]},{"path":"https://hannameyer.github.io/CAST/reference/CreateSpacetimeFolds.html","id":"author","dir":"Reference","previous_headings":"","what":"Author","title":"Create Space-time Folds — CreateSpacetimeFolds","text":"Hanna Meyer","code":""},{"path":"https://hannameyer.github.io/CAST/reference/CreateSpacetimeFolds.html","id":"ref-examples","dir":"Reference","previous_headings":"","what":"Examples","title":"Create Space-time Folds — CreateSpacetimeFolds","text":"","code":"if (FALSE) { dat <- readRDS(system.file(\"extdata\",\"Cookfarm.RDS\",package=\"CAST\")) ### Prepare for 10-fold Leave-Location-and-Time-Out cross validation indices <- CreateSpacetimeFolds(dat,\"SOURCEID\",\"Date\") str(indices) ### Prepare for 10-fold Leave-Location-Out cross validation indices <- CreateSpacetimeFolds(dat,spacevar=\"SOURCEID\") str(indices) ### Prepare for leave-One-Location-Out cross validation indices <- CreateSpacetimeFolds(dat,spacevar=\"SOURCEID\", k=length(unique(dat$SOURCEID))) str(indices) }"},{"path":"https://hannameyer.github.io/CAST/reference/DItoErrormetric.html","id":null,"dir":"Reference","previous_headings":"","what":"Model the relationship between the DI and the prediction error — DItoErrormetric","title":"Model the relationship between the DI and the prediction error — DItoErrormetric","text":"Performance metrics calculated moving windows DI values cross-validated training 
data","code":""},{"path":"https://hannameyer.github.io/CAST/reference/DItoErrormetric.html","id":"ref-usage","dir":"Reference","previous_headings":"","what":"Usage","title":"Model the relationship between the DI and the prediction error — DItoErrormetric","text":"","code":"DItoErrormetric( model, trainDI, multiCV = FALSE, length.out = 10, window.size = 5, calib = \"scam\", method = \"L2\", useWeight = TRUE, k = 6, m = 2 )"},{"path":"https://hannameyer.github.io/CAST/reference/DItoErrormetric.html","id":"arguments","dir":"Reference","previous_headings":"","what":"Arguments","title":"Model the relationship between the DI and the prediction error — DItoErrormetric","text":"model model used get AOA trainDI result trainDI aoa object aoa multiCV Logical. Re-run model fitting validation different CV strategies. See details. length.out Numeric. used multiCV=TRUE. Number cross-validation folds. See details. window.size Numeric. Size moving window. See rollapply. calib Character. Function model DI~performance relationship. Currently lm scam supported method Character. Method used distance calculation. Currently Euclidean distance (L2) Mahalanobis distance (MD) implemented L2 tested. Note MD takes considerably longer. See ?aoa explanation useWeight Logical. model given. Weight variables according importance model? k Numeric. See mgcv::s m Numeric. 
See mgcv::s","code":""},{"path":"https://hannameyer.github.io/CAST/reference/DItoErrormetric.html","id":"value","dir":"Reference","previous_headings":"","what":"Value","title":"Model the relationship between the DI and the prediction error — DItoErrormetric","text":"scam linear model","code":""},{"path":"https://hannameyer.github.io/CAST/reference/DItoErrormetric.html","id":"details","dir":"Reference","previous_headings":"","what":"Details","title":"Model the relationship between the DI and the prediction error — DItoErrormetric","text":"multiCV=TRUE model re-fitted validated length.out new cross-validations cross-validation folds defined clusters predictor space, ranging three clusters LOOCV. Hence, large range DI values created cross-validation. AOA threshold based calibration data multiple CV larger original AOA threshold (likely extrapolation situations created CV), AOA threshold changes accordingly. See Meyer Pebesma (2021) full documentation methodology.","code":""},{"path":"https://hannameyer.github.io/CAST/reference/DItoErrormetric.html","id":"references","dir":"Reference","previous_headings":"","what":"References","title":"Model the relationship between the DI and the prediction error — DItoErrormetric","text":"Meyer, H., Pebesma, E. (2021): Predicting unknown space? Estimating area applicability spatial prediction models. 
doi:10.1111/2041-210X.13650","code":""},{"path":[]},{"path":"https://hannameyer.github.io/CAST/reference/DItoErrormetric.html","id":"author","dir":"Reference","previous_headings":"","what":"Author","title":"Model the relationship between the DI and the prediction error — DItoErrormetric","text":"Hanna Meyer, Marvin Ludwig","code":""},{"path":"https://hannameyer.github.io/CAST/reference/DItoErrormetric.html","id":"ref-examples","dir":"Reference","previous_headings":"","what":"Examples","title":"Model the relationship between the DI and the prediction error — DItoErrormetric","text":"","code":"if (FALSE) { library(CAST) library(sf) library(terra) library(caret) data(splotdata) splotdata <- st_drop_geometry(splotdata) predictors <- terra::rast(system.file(\"extdata\",\"predictors_chile.tif\", package=\"CAST\")) model <- caret::train(splotdata[,6:16], splotdata$Species_richness, ntree = 10, trControl = trainControl(method = \"cv\", savePredictions = TRUE)) AOA <- aoa(predictors, model) errormodel <- DItoErrormetric(model, AOA) plot(errormodel) expected_error = terra::predict(AOA$DI, errormodel) plot(expected_error) # with multiCV = TRUE errormodel = DItoErrormetric(model, AOA, multiCV = TRUE, length.out = 3) plot(errormodel) expected_error = terra::predict(AOA$DI, errormodel) plot(expected_error) # mask AOA based on new threshold from multiCV mask_aoa = terra::mask(expected_error, AOA$DI > attr(errormodel, 'AOA_threshold'), maskvalues = 1) plot(mask_aoa) }"},{"path":"https://hannameyer.github.io/CAST/reference/aoa.html","id":null,"dir":"Reference","previous_headings":"","what":"Area of Applicability — aoa","title":"Area of Applicability — aoa","text":"function estimates Dissimilarity Index (DI) derived Area Applicability (AOA) spatial prediction models considering distance new data (.e. SpatRaster spatial predictors used models) predictor variable space data used model training. 
Predictors can weighted based internal variable importance machine learning algorithm used model training. AOA derived applying threshold DI (outlier-removed) maximum DI cross-validated training data.","code":""},{"path":"https://hannameyer.github.io/CAST/reference/aoa.html","id":"ref-usage","dir":"Reference","previous_headings":"","what":"Usage","title":"Area of Applicability — aoa","text":"","code":"aoa( newdata, model = NA, trainDI = NA, train = NULL, weight = NA, variables = \"all\", CVtest = NULL, CVtrain = NULL, method = \"L2\", useWeight = TRUE )"},{"path":"https://hannameyer.github.io/CAST/reference/aoa.html","id":"arguments","dir":"Reference","previous_headings":"","what":"Arguments","title":"Area of Applicability — aoa","text":"newdata SpatRaster, stars object data.frame containing data model meant make predictions . model train object created caret used extract weights (based variable importance) well cross-validation folds. See examples case model available models trained via e.g. mlr3. trainDI trainDI object. Optional trainDI calculated beforehand. train data.frame containing data used model training. Optional. required model given weight data.frame containing weights variable. Optional. required model given. variables character vector predictor variables. \"\" variables model used model given train dataset. CVtest list vector. Either list element contains data points used testing cross validation iteration (.e. held back data). vector contains ID fold training point. required model given. CVtrain list. element contains data points used training cross validation iteration (.e. held back data). required model given required CVtrain opposite CVtest (.e. data point used testing, used training). Relevant data points excluded, e.g. using nndm. method Character. Method used distance calculation. Currently euclidean distance (L2) Mahalanobis distance (MD) implemented L2 tested. Note MD takes considerably longer. useWeight Logical. model given. 
Weight variables according importance model?","code":""},{"path":"https://hannameyer.github.io/CAST/reference/aoa.html","id":"value","dir":"Reference","previous_headings":"","what":"Value","title":"Area of Applicability — aoa","text":"object class aoa containing: parameters object class trainDI. see trainDI DI SpatRaster, stars object data frame. Dissimilarity index newdata AOA SpatRaster, stars object data frame. Area Applicability newdata. AOA values 0 (outside AOA) 1 (inside AOA)","code":""},{"path":"https://hannameyer.github.io/CAST/reference/aoa.html","id":"details","dir":"Reference","previous_headings":"","what":"Details","title":"Area of Applicability — aoa","text":"Dissimilarity Index (DI) corresponding Area Applicability (AOA) calculated. variables factors, dummy variables created prior weighting distance calculation. Interpretation results: location similar properties training data low distance predictor variable space (DI towards 0) locations different properties high DI. See Meyer Pebesma (2021) full documentation methodology.","code":""},{"path":"https://hannameyer.github.io/CAST/reference/aoa.html","id":"note","dir":"Reference","previous_headings":"","what":"Note","title":"Area of Applicability — aoa","text":"classification models used, currently variable importance can automatically retrieved models trained via train(predictors,response) via formula-interface. fixed.","code":""},{"path":"https://hannameyer.github.io/CAST/reference/aoa.html","id":"references","dir":"Reference","previous_headings":"","what":"References","title":"Area of Applicability — aoa","text":"Meyer, H., Pebesma, E. (2021): Predicting unknown space? Estimating area applicability spatial prediction models. Methods Ecology Evolution 12: 1620-1633. 
doi:10.1111/2041-210X.13650","code":""},{"path":[]},{"path":"https://hannameyer.github.io/CAST/reference/aoa.html","id":"author","dir":"Reference","previous_headings":"","what":"Author","title":"Area of Applicability — aoa","text":"Hanna Meyer","code":""},{"path":"https://hannameyer.github.io/CAST/reference/aoa.html","id":"ref-examples","dir":"Reference","previous_headings":"","what":"Examples","title":"Area of Applicability — aoa","text":"","code":"if (FALSE) { library(sf) library(terra) library(caret) library(viridis) # prepare sample data: dat <- readRDS(system.file(\"extdata\",\"Cookfarm.RDS\",package=\"CAST\")) dat <- aggregate(dat[,c(\"VW\",\"Easting\",\"Northing\")],by=list(as.character(dat$SOURCEID)),mean) pts <- st_as_sf(dat,coords=c(\"Easting\",\"Northing\")) pts$ID <- 1:nrow(pts) set.seed(100) pts <- pts[1:30,] studyArea <- rast(system.file(\"extdata\",\"predictors_2012-03-25.tif\",package=\"CAST\"))[[1:8]] trainDat <- extract(studyArea,pts,na.rm=FALSE) trainDat <- merge(trainDat,pts,by.x=\"ID\",by.y=\"ID\") # visualize data spatially: plot(studyArea) plot(studyArea$DEM) plot(pts[,1],add=TRUE,col=\"black\") # train a model: set.seed(100) variables <- c(\"DEM\",\"NDRE.Sd\",\"TWI\") model <- train(trainDat[,which(names(trainDat)%in%variables)], trainDat$VW, method=\"rf\", importance=TRUE, tuneLength=1, trControl=trainControl(method=\"cv\",number=5,savePredictions=T)) print(model) #note that this is a quite poor prediction model prediction <- predict(studyArea,model,na.rm=TRUE) plot(varImp(model,scale=FALSE)) #...then calculate the AOA of the trained model for the study area: AOA <- aoa(studyArea,model) plot(AOA) #### #The AOA can also be calculated without a trained model. 
#All variables are weighted equally in this case: #### AOA <- aoa(studyArea,train=trainDat,variables=variables) #### # The AOA can also be used for models trained via mlr3 (parameters have to be assigned manually): #### library(mlr3) library(mlr3learners) library(mlr3spatial) library(mlr3spatiotempcv) library(mlr3extralearners) # initiate and train model: train_df <- trainDat[, c(\"DEM\",\"NDRE.Sd\",\"TWI\", \"VW\")] backend <- as_data_backend(train_df) task <- as_task_regr(backend, target = \"VW\") lrn <- lrn(\"regr.randomForest\", importance = \"mse\") lrn$train(task) # cross-validation folds rsmp_cv <- rsmp(\"cv\", folds = 5L)$instantiate(task) ## predict: prediction <- predict(studyArea,lrn$model,na.rm=TRUE) ### Estimate AOA AOA <- aoa(studyArea, train = as.data.frame(task$data()), variables = task$feature_names, weight = data.frame(t(lrn$importance())), CVtest = rsmp_cv$instance[order(row_id)]$fold) }"},{"path":"https://hannameyer.github.io/CAST/reference/bss.html","id":null,"dir":"Reference","previous_headings":"","what":"Best subset feature selection — bss","title":"Best subset feature selection — bss","text":"Evaluate combinations predictors model training","code":""},{"path":"https://hannameyer.github.io/CAST/reference/bss.html","id":"ref-usage","dir":"Reference","previous_headings":"","what":"Usage","title":"Best subset feature selection — bss","text":"","code":"bss( predictors, response, method = \"rf\", metric = ifelse(is.factor(response), \"Accuracy\", \"RMSE\"), maximize = ifelse(metric == \"RMSE\", FALSE, TRUE), globalval = FALSE, trControl = caret::trainControl(), tuneLength = 3, tuneGrid = NULL, seed = 100, verbose = TRUE, ... )"},{"path":"https://hannameyer.github.io/CAST/reference/bss.html","id":"arguments","dir":"Reference","previous_headings":"","what":"Arguments","title":"Best subset feature selection — bss","text":"predictors see train response see train method see train metric see train maximize see train globalval Logical. 
models evaluated based 'global' performance? See global_validation trControl see train tuneLength see train tuneGrid see train seed random number verbose Logical. information progress printed? ... arguments passed classification regression routine (randomForest).","code":""},{"path":"https://hannameyer.github.io/CAST/reference/bss.html","id":"value","dir":"Reference","previous_headings":"","what":"Value","title":"Best subset feature selection — bss","text":"list class train. Beside usual train content object contains vector \"selectedvars\" \"selectedvars_perf\" give best variables selected well corresponding performance. also contains \"perf_all\" gives performance model runs.","code":""},{"path":"https://hannameyer.github.io/CAST/reference/bss.html","id":"details","dir":"Reference","previous_headings":"","what":"Details","title":"Best subset feature selection — bss","text":"bss alternative ffs ideal training set small. Models iteratively fitted using different combinations predictor variables. Hence, 2^X models calculated. try running bss large datasets computation time much higher compared ffs. internal cross validation can run parallel. See information parallel processing carets train functions details.","code":""},{"path":"https://hannameyer.github.io/CAST/reference/bss.html","id":"note","dir":"Reference","previous_headings":"","what":"Note","title":"Best subset feature selection — bss","text":"variable selection particularly suitable spatial cross validations variable selection MUST based performance model predicting new spatial units. Note bss slow since combinations variables tested. 
time efficient alternative forward feature selection (ffs).","code":""},{"path":[]},{"path":"https://hannameyer.github.io/CAST/reference/bss.html","id":"author","dir":"Reference","previous_headings":"","what":"Author","title":"Best subset feature selection — bss","text":"Hanna Meyer","code":""},{"path":"https://hannameyer.github.io/CAST/reference/bss.html","id":"ref-examples","dir":"Reference","previous_headings":"","what":"Examples","title":"Best subset feature selection — bss","text":"","code":"if (FALSE) { data(iris) bssmodel <- bss(iris[,1:4],iris$Species) bssmodel$perf_all }"},{"path":"https://hannameyer.github.io/CAST/reference/calibrate_aoa.html","id":null,"dir":"Reference","previous_headings":"","what":"Calibrate the AOA based on the relationship between the DI and the prediction error — calibrate_aoa","title":"Calibrate the AOA based on the relationship between the DI and the prediction error — calibrate_aoa","text":"Performance metrics calculated moving windows DI values cross-validated training data","code":""},{"path":"https://hannameyer.github.io/CAST/reference/calibrate_aoa.html","id":"ref-usage","dir":"Reference","previous_headings":"","what":"Usage","title":"Calibrate the AOA based on the relationship between the DI and the prediction error — calibrate_aoa","text":"","code":"calibrate_aoa( AOA, model, window.size = 5, calib = \"scam\", multiCV = FALSE, length.out = 10, maskAOA = TRUE, method = \"L2\", useWeight = TRUE, showPlot = TRUE, k = 6, m = 2 )"},{"path":"https://hannameyer.github.io/CAST/reference/calibrate_aoa.html","id":"arguments","dir":"Reference","previous_headings":"","what":"Arguments","title":"Calibrate the AOA based on the relationship between the DI and the prediction error — calibrate_aoa","text":"AOA result aoa model model used get AOA window.size Numeric. Size moving window. See rollapply. calib Character. Function model DI~performance relationship. Currently lm scam supported multiCV Logical. 
Re-run model fitting validation different CV strategies. See details. length.out Numeric. used multiCV=TRUE. Number cross-validation folds. See details. maskAOA Logical. areas outside AOA set NA? method Character. Method used distance calculation. Currently Euclidean distance (L2) Mahalanobis distance (MD) implemented L2 tested. Note MD takes considerably longer. See ?aoa explanation useWeight Logical. model given. Weight variables according importance model? showPlot Logical. k Numeric. See mgcv::s m Numeric. See mgcv::s","code":""},{"path":"https://hannameyer.github.io/CAST/reference/calibrate_aoa.html","id":"value","dir":"Reference","previous_headings":"","what":"Value","title":"Calibrate the AOA based on the relationship between the DI and the prediction error — calibrate_aoa","text":"list length 2 elements \"AOA\": SpatRaster stars object contains original DI AOA (might updated new test data indicate option), well expected performance based relationship. Data used calibration stored attributes. second element plot showing relationship.","code":""},{"path":"https://hannameyer.github.io/CAST/reference/calibrate_aoa.html","id":"details","dir":"Reference","previous_headings":"","what":"Details","title":"Calibrate the AOA based on the relationship between the DI and the prediction error — calibrate_aoa","text":"multiCV=TRUE model re-fitted validated length.out new cross-validations cross-validation folds defined clusters predictor space, ranging three clusters LOOCV. Hence, large range DI values created cross-validation. AOA threshold based calibration data multiple CV larger original AOA threshold (likely extrapolation situations created CV), AOA updated accordingly. 
See Meyer Pebesma (2021) full documentation methodology.","code":""},{"path":"https://hannameyer.github.io/CAST/reference/calibrate_aoa.html","id":"references","dir":"Reference","previous_headings":"","what":"References","title":"Calibrate the AOA based on the relationship between the DI and the prediction error — calibrate_aoa","text":"Meyer, H., Pebesma, E. (2021): Predicting unknown space? Estimating area applicability spatial prediction models. doi:10.1111/2041-210X.13650","code":""},{"path":[]},{"path":"https://hannameyer.github.io/CAST/reference/calibrate_aoa.html","id":"author","dir":"Reference","previous_headings":"","what":"Author","title":"Calibrate the AOA based on the relationship between the DI and the prediction error — calibrate_aoa","text":"Hanna Meyer","code":""},{"path":"https://hannameyer.github.io/CAST/reference/calibrate_aoa.html","id":"ref-examples","dir":"Reference","previous_headings":"","what":"Examples","title":"Calibrate the AOA based on the relationship between the DI and the prediction error — calibrate_aoa","text":"","code":"if (FALSE) { library(sf) library(terra) library(caret) library(viridis) library(latticeExtra) #' # prepare sample data: dat <- readRDS(system.file(\"extdata\",\"Cookfarm.RDS\",package=\"CAST\")) dat <- aggregate(dat[,c(\"VW\",\"Easting\",\"Northing\")],by=list(as.character(dat$SOURCEID)),mean) pts <- st_as_sf(dat,coords=c(\"Easting\",\"Northing\")) pts$ID <- 1:nrow(pts) studyArea <- rast(system.file(\"extdata\",\"predictors_2012-03-25.tif\",package=\"CAST\"))[[1:8]] dat <- extract(studyArea,pts,na.rm=TRUE) trainDat <- merge(dat,pts,by.x=\"ID\",by.y=\"ID\") # train a model: variables <- c(\"DEM\",\"NDRE.Sd\",\"TWI\") set.seed(100) model <- train(trainDat[,which(names(trainDat)%in%variables)], trainDat$VW,method=\"rf\",importance=TRUE,tuneLength=1, trControl=trainControl(method=\"cv\",number=5,savePredictions=TRUE)) #...then calculate the AOA of the trained model for the study area: AOA <- aoa(studyArea,model) 
AOA_new <- calibrate_aoa(AOA,model) plot(AOA_new$AOA$expected_RMSE) }"},{"path":"https://hannameyer.github.io/CAST/reference/clustered_sample.html","id":null,"dir":"Reference","previous_headings":"","what":"Clustered samples simulation — clustered_sample","title":"Clustered samples simulation — clustered_sample","text":"simple procedure simulate clustered points based two-step sampling.","code":""},{"path":"https://hannameyer.github.io/CAST/reference/clustered_sample.html","id":"ref-usage","dir":"Reference","previous_headings":"","what":"Usage","title":"Clustered samples simulation — clustered_sample","text":"","code":"clustered_sample(sarea, nsamples, nparents, radius)"},{"path":"https://hannameyer.github.io/CAST/reference/clustered_sample.html","id":"arguments","dir":"Reference","previous_headings":"","what":"Arguments","title":"Clustered samples simulation — clustered_sample","text":"sarea polygon. Area samples simulated. nsamples integer. Number samples simulated. nparents integer. Number parents. radius integer. Radius buffer around parent offspring simulation.","code":""},{"path":"https://hannameyer.github.io/CAST/reference/clustered_sample.html","id":"value","dir":"Reference","previous_headings":"","what":"Value","title":"Clustered samples simulation — clustered_sample","text":"sf object simulated points parent point belongs .","code":""},{"path":"https://hannameyer.github.io/CAST/reference/clustered_sample.html","id":"details","dir":"Reference","previous_headings":"","what":"Details","title":"Clustered samples simulation — clustered_sample","text":"simple procedure simulate clustered points based two-step sampling. First, pre-specified number parents simulated using random sampling. 
parent, `(nsamples-nparents)/nparents` simulated within radius parent point using random sampling.","code":""},{"path":"https://hannameyer.github.io/CAST/reference/clustered_sample.html","id":"author","dir":"Reference","previous_headings":"","what":"Author","title":"Clustered samples simulation — clustered_sample","text":"Carles Milà","code":""},{"path":"https://hannameyer.github.io/CAST/reference/clustered_sample.html","id":"ref-examples","dir":"Reference","previous_headings":"","what":"Examples","title":"Clustered samples simulation — clustered_sample","text":"","code":"# Simulate 100 points in a 100x100 square with 5 parents and a radius of 10. library(sf) #> Linking to GEOS 3.10.2, GDAL 3.4.1, PROJ 8.2.1; sf_use_s2() is TRUE library(ggplot2) set.seed(1234) simarea <- list(matrix(c(0,0,0,100,100,100,100,0,0,0), ncol=2, byrow=TRUE)) simarea <- sf::st_polygon(simarea) simpoints <- clustered_sample(simarea, 100, 5, 10) simpoints$parent <- as.factor(simpoints$parent) ggplot() + geom_sf(data = simarea, alpha = 0) + geom_sf(data = simpoints, aes(col = parent))"},{"path":"https://hannameyer.github.io/CAST/reference/errorModel.html","id":null,"dir":"Reference","previous_headings":"","what":"Model expected error between Metric and DI — errorModel","title":"Model expected error between Metric and DI — errorModel","text":"Model expected error Metric DI","code":""},{"path":"https://hannameyer.github.io/CAST/reference/errorModel.html","id":"ref-usage","dir":"Reference","previous_headings":"","what":"Usage","title":"Model expected error between Metric and DI — errorModel","text":"","code":"errorModel(preds_all, model, window.size, calib, k, m)"},{"path":"https://hannameyer.github.io/CAST/reference/errorModel.html","id":"arguments","dir":"Reference","previous_headings":"","what":"Arguments","title":"Model expected error between Metric and DI — errorModel","text":"preds_all data.frame: pred, obs, DI model model used get AOA window.size Numeric. Size moving window. 
See rollapply. calib Character. Function model DI~performance relationship. Currently lm scam supported k Numeric. See mgcv::s m Numeric. See mgcv::s","code":""},{"path":"https://hannameyer.github.io/CAST/reference/errorModel.html","id":"value","dir":"Reference","previous_headings":"","what":"Value","title":"Model expected error between Metric and DI — errorModel","text":"scam lm","code":""},{"path":"https://hannameyer.github.io/CAST/reference/ffs.html","id":null,"dir":"Reference","previous_headings":"","what":"Forward feature selection — ffs","title":"Forward feature selection — ffs","text":"simple forward feature selection algorithm","code":""},{"path":"https://hannameyer.github.io/CAST/reference/ffs.html","id":"ref-usage","dir":"Reference","previous_headings":"","what":"Usage","title":"Forward feature selection — ffs","text":"","code":"ffs( predictors, response, method = \"rf\", metric = ifelse(is.factor(response), \"Accuracy\", \"RMSE\"), maximize = ifelse(metric == \"RMSE\", FALSE, TRUE), globalval = FALSE, withinSE = FALSE, minVar = 2, trControl = caret::trainControl(), tuneLength = 3, tuneGrid = NULL, seed = sample(1:1000, 1), verbose = TRUE, ... )"},{"path":"https://hannameyer.github.io/CAST/reference/ffs.html","id":"arguments","dir":"Reference","previous_headings":"","what":"Arguments","title":"Forward feature selection — ffs","text":"predictors see train response see train method see train metric see train maximize see train globalval Logical. models evaluated based 'global' performance? See global_validation withinSE Logical Models selected better currently best models Standard error minVar Numeric. Number variables combine first selection. See Details. trControl see train tuneLength see train tuneGrid see train seed random number used model training verbose Logical. information progress printed? ... 
arguments passed classification regression routine (randomForest).","code":""},{"path":"https://hannameyer.github.io/CAST/reference/ffs.html","id":"value","dir":"Reference","previous_headings":"","what":"Value","title":"Forward feature selection — ffs","text":"list class train. Beside usual train content object contains vector \"selectedvars\" \"selectedvars_perf\" give order best variables selected well corresponding performance (starting first two variables). also contains \"perf_all\" gives performance model runs.","code":""},{"path":"https://hannameyer.github.io/CAST/reference/ffs.html","id":"details","dir":"Reference","previous_headings":"","what":"Details","title":"Forward feature selection — ffs","text":"Models two predictors first trained using possible pairs predictor variables. best model initial models kept. basis best model predictor variables iteratively increased remaining variables tested improvement currently best model. process stops none remaining variables increases model performance added current best model. internal cross validation can run parallel. See information parallel processing carets train functions details. Using withinSE favour models less variables probably shorten calculation time Per Default, ffs starts possible 2-pair combinations. minVar allows start selection 2 variables, e.g. minVar=3 starts ffs testing combinations 3 (instead 2) variables first increasing number. important e.g. neural networks often make sense two variables. also relevant assumed optimal variables can found 2 considered time.","code":""},{"path":"https://hannameyer.github.io/CAST/reference/ffs.html","id":"note","dir":"Reference","previous_headings":"","what":"Note","title":"Forward feature selection — ffs","text":"variable selection particularly suitable spatial cross validations variable selection MUST based performance model predicting new spatial units. See Meyer et al. (2018) Meyer et al. 
(2019) details.","code":""},{"path":"https://hannameyer.github.io/CAST/reference/ffs.html","id":"references","dir":"Reference","previous_headings":"","what":"References","title":"Forward feature selection — ffs","text":"Gasch, C.K., Hengl, T., Gräler, B., Meyer, H., Magney, T., Brown, D.J. (2015): Spatio-temporal interpolation soil water, temperature, electrical conductivity 3D+T: Cook Agronomy Farm data set. Spatial Statistics 14: 70-90. Meyer, H., Reudenbach, C., Hengl, T., Katurji, M., Nauß, T. (2018): Improving performance spatio-temporal machine learning models using forward feature selection target-oriented validation. Environmental Modelling & Software 101: 1-9. doi:10.1016/j.envsoft.2017.12.001 Meyer, H., Reudenbach, C., Wöllauer, S., Nauss, T. (2019): Importance spatial predictor variable selection machine learning applications - Moving data reproduction spatial prediction. Ecological Modelling. 411, 108815. doi:10.1016/j.ecolmodel.2019.108815 . Ludwig, M., Moreno-Martinez, ., Hölzel, N., Pebesma, E., Meyer, H. (2023): Assessing improving transferability current global spatial prediction models. Global Ecology Biogeography. doi:10.1111/geb.13635 .","code":""},{"path":[]},{"path":"https://hannameyer.github.io/CAST/reference/ffs.html","id":"author","dir":"Reference","previous_headings":"","what":"Author","title":"Forward feature selection — ffs","text":"Hanna Meyer","code":""},{"path":"https://hannameyer.github.io/CAST/reference/ffs.html","id":"ref-examples","dir":"Reference","previous_headings":"","what":"Examples","title":"Forward feature selection — ffs","text":"","code":"if (FALSE) { data(iris) ffsmodel <- ffs(iris[,1:4],iris$Species) ffsmodel$selectedvars ffsmodel$selectedvars_perf } # or perform model with target-oriented validation (LLO CV) #the example is described in Gasch et al. (2015). The ffs approach for this dataset is described in #Meyer et al. (2018). 
Due to high computation time needed, only a small and thus not robust example #is shown here. if (FALSE) { #run the model on three cores: library(doParallel) library(lubridate) cl <- makeCluster(3) registerDoParallel(cl) #load and prepare dataset: dat <- readRDS(system.file(\"extdata\",\"Cookfarm.RDS\",package=\"CAST\")) trainDat <- dat[dat$altitude==-0.3&year(dat$Date)==2012&week(dat$Date)%in%c(13:14),] #visualize dataset: ggplot(data = trainDat, aes(x=Date, y=VW)) + geom_line(aes(colour=SOURCEID)) #create folds for Leave Location Out Cross Validation: set.seed(10) indices <- CreateSpacetimeFolds(trainDat,spacevar = \"SOURCEID\",k=3) ctrl <- trainControl(method=\"cv\",index = indices$index) #define potential predictors: predictors <- c(\"DEM\",\"TWI\",\"BLD\",\"Precip_cum\",\"cday\",\"MaxT_wrcc\", \"Precip_wrcc\",\"NDRE.M\",\"Bt\",\"MinT_wrcc\",\"Northing\",\"Easting\") #run ffs model with Leave Location out CV set.seed(10) ffsmodel <- ffs(trainDat[,predictors],trainDat$VW,method=\"rf\", tuneLength=1,trControl=ctrl) ffsmodel plot(ffsmodel) #or only selected variables: plot(ffsmodel,plotType=\"selected\") #compare to model without ffs: model <- train(trainDat[,predictors],trainDat$VW,method=\"rf\", tuneLength=1, trControl=ctrl) model stopCluster(cl) }"},{"path":"https://hannameyer.github.io/CAST/reference/geodist.html","id":null,"dir":"Reference","previous_headings":"","what":"Calculate euclidean nearest neighbor distances in geographic space or feature space — geodist","title":"Calculate euclidean nearest neighbor distances in geographic space or feature space — geodist","text":"Calculates nearest neighbor distances geographic space feature space training data well training data prediction locations. 
Optional, nearest neighbor distances training data test data training data CV iterations computed.","code":""},{"path":"https://hannameyer.github.io/CAST/reference/geodist.html","id":"ref-usage","dir":"Reference","previous_headings":"","what":"Usage","title":"Calculate euclidean nearest neighbor distances in geographic space or feature space — geodist","text":"","code":"geodist( x, modeldomain, type = \"geo\", cvfolds = NULL, cvtrain = NULL, testdata = NULL, preddata = NULL, samplesize = 2000, sampling = \"regular\", variables = NULL )"},{"path":"https://hannameyer.github.io/CAST/reference/geodist.html","id":"arguments","dir":"Reference","previous_headings":"","what":"Arguments","title":"Calculate euclidean nearest neighbor distances in geographic space or feature space — geodist","text":"x object class sf, training data locations modeldomain SpatRaster, stars sf object defining prediction area (see Details) type \"geo\" \"feature\". distance computed geographic space normalized multivariate predictor space (see Details) cvfolds optional. list vector. Either list element contains data points used testing cross validation iteration (i.e. held back data). vector contains ID fold training point. See e.g. ?createFolds ?CreateSpacetimeFolds ?nndm cvtrain optional. List row indices x fit model CV iteration. cvtrain null cvfolds , samples included cvfolds used training data testdata optional. object class sf: Point data used independent validation preddata optional. object class sf: Point data indicating locations within modeldomain used target prediction points. Useful prediction objective subset locations within modeldomain rather whole area. samplesize numeric. many prediction samples used? sampling character. draw prediction samples? See spsample. Use sampling = \"Fibonacci\" global applications. variables character vector defining predictor variables used type=\"feature\". 
provided variables included modeldomain used.","code":""},{"path":"https://hannameyer.github.io/CAST/reference/geodist.html","id":"value","dir":"Reference","previous_headings":"","what":"Value","title":"Calculate euclidean nearest neighbor distances in geographic space or feature space — geodist","text":"data.frame containing distances. Unit returned geographic distances meters. attributes contain W statistic prediction area either sample data, CV folds test data. See details.","code":""},{"path":"https://hannameyer.github.io/CAST/reference/geodist.html","id":"details","dir":"Reference","previous_headings":"","what":"Details","title":"Calculate euclidean nearest neighbor distances in geographic space or feature space — geodist","text":"modeldomain sf polygon raster defines prediction area. function takes regular point sample (amount defined samplesize) spatial extent. type = \"feature\", argument modeldomain (provided also testdata /preddata) include predictors. Predictor values x, testdata preddata optional modeldomain raster. provided extracted modeldomain rasterStack. W statistic describes match distributions. 
See Linnenbrink et al (2023) details.","code":""},{"path":"https://hannameyer.github.io/CAST/reference/geodist.html","id":"note","dir":"Reference","previous_headings":"","what":"Note","title":"Calculate euclidean nearest neighbor distances in geographic space or feature space — geodist","text":"See Meyer Pebesma (2022) application plotting function","code":""},{"path":[]},{"path":"https://hannameyer.github.io/CAST/reference/geodist.html","id":"author","dir":"Reference","previous_headings":"","what":"Author","title":"Calculate euclidean nearest neighbor distances in geographic space or feature space — geodist","text":"Hanna Meyer, Edzer Pebesma, Marvin Ludwig","code":""},{"path":"https://hannameyer.github.io/CAST/reference/geodist.html","id":"ref-examples","dir":"Reference","previous_headings":"","what":"Examples","title":"Calculate euclidean nearest neighbor distances in geographic space or feature space — geodist","text":"","code":"if (FALSE) { library(CAST) library(sf) library(terra) library(caret) library(rnaturalearth) library(ggplot2) data(splotdata) studyArea <- rnaturalearth::ne_countries(continent = \"South America\", returnclass = \"sf\") ########### Distance between training data and new data: dist <- geodist(splotdata, studyArea) plot(dist) ########### Distance between training data, new data and test data (here Chile): plot(splotdata[,\"Country\"]) dist <- geodist(splotdata[splotdata$Country != \"Chile\",], studyArea, testdata = splotdata[splotdata$Country == \"Chile\",]) plot(dist) ########### Distance between training data, new data and CV folds: folds <- createFolds(1:nrow(splotdata), k=3, returnTrain=FALSE) dist <- geodist(x=splotdata, modeldomain=studyArea, cvfolds=folds) plot(dist) ########### Distances in the feature space: predictors <- terra::rast(system.file(\"extdata\",\"predictors_chile.tif\", package=\"CAST\")) dist <- geodist(x = splotdata, modeldomain = predictors, type = \"feature\", variables = c(\"bio_1\",\"bio_12\", \"elev\")) 
plot(dist) dist <- geodist(x = splotdata[splotdata$Country != \"Chile\",], modeldomain = predictors, cvfolds = folds, testdata = splotdata[splotdata$Country == \"Chile\",], type = \"feature\", variables=c(\"bio_1\",\"bio_12\", \"elev\")) plot(dist) ############ Example for a random global dataset ############ (refer to figure in Meyer and Pebesma 2022) ### Define prediction area (here: global): ee <- st_crs(\"+proj=eqearth\") co <- ne_countries(returnclass = \"sf\") co.ee <- st_transform(co, ee) ### Simulate a spatial random sample ### (alternatively replace pts_random by a real sampling dataset (see Meyer and Pebesma 2022): sf_use_s2(FALSE) pts_random <- st_sample(co.ee, 2000, exact=FALSE) ### See points on the map: ggplot() + geom_sf(data = co.ee, fill=\"#00BFC4\",col=\"#00BFC4\") + geom_sf(data = pts_random, color = \"#F8766D\",size=0.5, shape=3) + guides(fill = \"none\", col = \"none\") + labs(x = NULL, y = NULL) ### plot distances: dist <- geodist(pts_random,co.ee) plot(dist) + scale_x_log10(labels=round) }"},{"path":"https://hannameyer.github.io/CAST/reference/get_preds_all.html","id":null,"dir":"Reference","previous_headings":"","what":"Get Preds all — get_preds_all","title":"Get Preds all — get_preds_all","text":"Get Preds ","code":""},{"path":"https://hannameyer.github.io/CAST/reference/get_preds_all.html","id":"ref-usage","dir":"Reference","previous_headings":"","what":"Usage","title":"Get Preds all — get_preds_all","text":"","code":"get_preds_all(model, trainDI)"},{"path":"https://hannameyer.github.io/CAST/reference/get_preds_all.html","id":"arguments","dir":"Reference","previous_headings":"","what":"Arguments","title":"Get Preds all — get_preds_all","text":"model, model trainDI, trainDI","code":""},{"path":"https://hannameyer.github.io/CAST/reference/global_validation.html","id":null,"dir":"Reference","previous_headings":"","what":"Evaluate 'global' cross-validation — global_validation","title":"Evaluate 'global' cross-validation — 
global_validation","text":"Calculate validation metric using held back predictions ","code":""},{"path":"https://hannameyer.github.io/CAST/reference/global_validation.html","id":"ref-usage","dir":"Reference","previous_headings":"","what":"Usage","title":"Evaluate 'global' cross-validation — global_validation","text":"","code":"global_validation(model)"},{"path":"https://hannameyer.github.io/CAST/reference/global_validation.html","id":"arguments","dir":"Reference","previous_headings":"","what":"Arguments","title":"Evaluate 'global' cross-validation — global_validation","text":"model object class train","code":""},{"path":"https://hannameyer.github.io/CAST/reference/global_validation.html","id":"value","dir":"Reference","previous_headings":"","what":"Value","title":"Evaluate 'global' cross-validation — global_validation","text":"regression (postResample) classification (confusionMatrix) statistics","code":""},{"path":"https://hannameyer.github.io/CAST/reference/global_validation.html","id":"details","dir":"Reference","previous_headings":"","what":"Details","title":"Evaluate 'global' cross-validation — global_validation","text":"Relevant folds representative entire area interest. case, metrics like R2 meaningful since reflect general ability model explain entire gradient response. 
Comparable LOOCV, predictions held back folds used together calculate validation statistics.","code":""},{"path":[]},{"path":"https://hannameyer.github.io/CAST/reference/global_validation.html","id":"author","dir":"Reference","previous_headings":"","what":"Author","title":"Evaluate 'global' cross-validation — global_validation","text":"Hanna Meyer","code":""},{"path":"https://hannameyer.github.io/CAST/reference/global_validation.html","id":"ref-examples","dir":"Reference","previous_headings":"","what":"Examples","title":"Evaluate 'global' cross-validation — global_validation","text":"","code":"dat <- readRDS(system.file(\"extdata\",\"Cookfarm.RDS\",package=\"CAST\")) dat <- dat[sample(1:nrow(dat),500),] indices <- CreateSpacetimeFolds(dat,\"SOURCEID\",\"Date\") ctrl <- caret::trainControl(method=\"cv\",index = indices$index,savePredictions=\"final\") model <- caret::train(dat[,c(\"DEM\",\"TWI\",\"BLD\")],dat$VW, method=\"rf\", trControl=ctrl, ntree=10) #> note: only 2 unique complexity parameters in default grid. Truncating the grid to 2 . 
#> #> Loading required package: lattice global_validation(model) #> RMSE Rsquared MAE #> 0.08848113 0.13992098 0.06953367"},{"path":"https://hannameyer.github.io/CAST/reference/knndm.html","id":null,"dir":"Reference","previous_headings":"","what":"K-fold Nearest Neighbour Distance Matching — knndm","title":"K-fold Nearest Neighbour Distance Matching — knndm","text":"function implements kNNDM algorithm returns necessary indices perform k-fold NNDM CV map validation.","code":""},{"path":"https://hannameyer.github.io/CAST/reference/knndm.html","id":"ref-usage","dir":"Reference","previous_headings":"","what":"Usage","title":"K-fold Nearest Neighbour Distance Matching — knndm","text":"","code":"knndm( tpoints, modeldomain = NULL, ppoints = NULL, space = \"geographical\", k = 10, maxp = 0.5, clustering = \"hierarchical\", linkf = \"ward.D2\", samplesize = 1000, sampling = \"regular\" )"},{"path":"https://hannameyer.github.io/CAST/reference/knndm.html","id":"arguments","dir":"Reference","previous_headings":"","what":"Arguments","title":"K-fold Nearest Neighbour Distance Matching — knndm","text":"tpoints sf sfc point object. Contains training points samples. modeldomain sf polygon object defining prediction area. Optional; alternative ppoints (see Details). ppoints sf sfc point object. Contains target prediction points. Optional; alternative modeldomain (see Details). space character. \"geographical\" knndm, .e. kNNDM geographical space, currently implemented. k integer. Number folds desired CV. Defaults 10. maxp numeric. Maximum fold size allowed, defaults 0.5, .e. single fold can hold maximum half training points. clustering character. Possible values include \"hierarchical\" \"kmeans\". See details. linkf character. relevant clustering = \"hierarchical\". Link function agglomerative hierarchical clustering. Defaults \"ward.D2\". Check `stats::hclust` options. samplesize numeric. many points modeldomain sampled prediction points? 
required modeldomain used instead ppoints. sampling character. draw prediction points modeldomain? See `sf::st_sample`. required modeldomain used instead ppoints.","code":""},{"path":"https://hannameyer.github.io/CAST/reference/knndm.html","id":"value","dir":"Reference","previous_headings":"","what":"Value","title":"K-fold Nearest Neighbour Distance Matching — knndm","text":"object class knndm consisting list eight elements: indx_train, indx_test (indices observations use training/test data kNNDM CV iteration), Gij (distances G function construction prediction target points), Gj (distances G function construction LOO CV), Gjstar (distances modified G function kNNDM CV), clusters (list cluster IDs), W (Wasserstein statistic), space (stated user function call).","code":""},{"path":"https://hannameyer.github.io/CAST/reference/knndm.html","id":"details","dir":"Reference","previous_headings":"","what":"Details","title":"K-fold Nearest Neighbour Distance Matching — knndm","text":"knndm k-fold version NNDM LOO CV medium large datasets. Briefly, algorithm tries find k-fold configuration integral absolute differences (Wasserstein W statistic) empirical nearest neighbour distance distribution function test training data CV (Gj*), empirical nearest neighbour distance distribution function prediction training points (Gij), minimised. performing clustering training points' coordinates different numbers clusters range k N (number observations), merging k final folds, selecting configuration lowest W. Using projected CRS `knndm` large computational advantages since fast nearest neighbour search can done via `FNN` package, working geographic coordinates requires computing full spherical distance matrices. clustering algorithm, `kmeans` can used projected CRS `hierarchical` can work projected geographical coordinates, though requires calculating full distance matrix training points even projected CRS. 
order select clustering algorithms number folds `k`, different `knndm` configurations can run compared, one lower W statistic one offers better match. W statistics `knndm` runs comparable long `tpoints` `ppoints` `modeldomain` stay . Map validation using knndm used using `CAST::global_validation`, .e. stacking --sample predictions evaluating . reasons behind 1) resulting folds can unbalanced 2) nearest neighbour functions constructed matched using CV folds simultaneously. training data points clustered respect prediction area presented knndm configuration still show signs Gj* > Gij, several things can tried. First, increase `maxp` parameter; may help control strong clustering (cost unbalanced folds). Secondly, decrease number final folds `k`, may help larger clusters. `modeldomain` sf polygon defines prediction area. function takes regular point sample (amount defined `samplesize`) spatial extent. alternative use `ppoints` instead `modeldomain`, already defined prediction locations (e.g. raster pixel centroids). using either `modeldomain` `ppoints`, advise plot study area polygon training/prediction points previous step ensure aligned.","code":""},{"path":"https://hannameyer.github.io/CAST/reference/knndm.html","id":"note","dir":"Reference","previous_headings":"","what":"Note","title":"K-fold Nearest Neighbour Distance Matching — knndm","text":"Experimental cycle. Article describing testing algorithm preparation.","code":""},{"path":"https://hannameyer.github.io/CAST/reference/knndm.html","id":"references","dir":"Reference","previous_headings":"","what":"References","title":"K-fold Nearest Neighbour Distance Matching — knndm","text":"Linnenbrink, J., Milà, C., Ludwig, M., Meyer, H.: kNNDM: k-fold Nearest Neighbour Distance Matching Cross-Validation map accuracy estimation, EGUsphere [preprint], https://doi.org/10.5194/egusphere-2023-1308, 2023. Milà, C., Mateu, J., Pebesma, E., Meyer, H. 
(2022): Nearest Neighbour Distance Matching Leave-One-Cross-Validation map validation. Methods Ecology Evolution 00, 1– 13.","code":""},{"path":[]},{"path":"https://hannameyer.github.io/CAST/reference/knndm.html","id":"author","dir":"Reference","previous_headings":"","what":"Author","title":"K-fold Nearest Neighbour Distance Matching — knndm","text":"Carles Milà Jan Linnenbrink","code":""},{"path":"https://hannameyer.github.io/CAST/reference/knndm.html","id":"ref-examples","dir":"Reference","previous_headings":"","what":"Examples","title":"K-fold Nearest Neighbour Distance Matching — knndm","text":"","code":"######################################################################## # Example 1: Simulated data - Randomly-distributed training points ######################################################################## library(sf) library(ggplot2) # Simulate 1000 random training points in a 100x100 square set.seed(1234) simarea <- list(matrix(c(0,0,0,100,100,100,100,0,0,0), ncol=2, byrow=TRUE)) simarea <- sf::st_polygon(simarea) train_points <- sf::st_sample(simarea, 1000, type = \"random\") pred_points <- sf::st_sample(simarea, 1000, type = \"regular\") plot(simarea) plot(pred_points, add = TRUE, col = \"blue\") plot(train_points, add = TRUE, col = \"red\") # Run kNNDM for the whole domain, here the prediction points are known. knndm_folds <- knndm(train_points, ppoints = pred_points, k = 5) #> Warning: Missing CRS in training or prediction points. Assuming projected CRS. 
#> Gij <= Gj; a random CV assignment is returned knndm_folds #> knndm object #> Space: geographical #> Clustering algorithm: hierarchical #> Intermediate clusters (q): random CV #> W statistic: 0.1338 #> Number of folds: 5 #> Observations in each fold: 200 200 200 200 200 plot(knndm_folds) folds <- as.character(knndm_folds$clusters) ggplot() + geom_sf(data = simarea, alpha = 0) + geom_sf(data = train_points, aes(col = folds)) ######################################################################## # Example 2: Simulated data - Clustered training points ######################################################################## if (FALSE) { library(sf) library(ggplot2) # Simulate 1000 clustered training points in a 100x100 square set.seed(1234) simarea <- list(matrix(c(0,0,0,100,100,100,100,0,0,0), ncol=2, byrow=TRUE)) simarea <- sf::st_polygon(simarea) train_points <- clustered_sample(simarea, 1000, 50, 5) pred_points <- sf::st_sample(simarea, 1000, type = \"regular\") plot(simarea) plot(pred_points, add = TRUE, col = \"blue\") plot(train_points, add = TRUE, col = \"red\") # Run kNNDM for the whole domain, here the prediction points are known. 
knndm_folds <- knndm(train_points, ppoints = pred_points, k = 5) knndm_folds plot(knndm_folds) folds <- as.character(knndm_folds$clusters) ggplot() + geom_sf(data = simarea, alpha = 0) + geom_sf(data = train_points, aes(col = folds)) } ######################################################################## # Example 3: Real- world example; using a modeldomain instead of previously # sampled prediction locations ######################################################################## if (FALSE) { library(sf) library(terra) library(ggplot2) ### prepare sample data: dat <- readRDS(system.file(\"extdata\",\"Cookfarm.RDS\",package=\"CAST\")) dat <- aggregate(dat[,c(\"DEM\",\"TWI\", \"NDRE.M\", \"Easting\", \"Northing\",\"VW\")], by=list(as.character(dat$SOURCEID)),mean) pts <- dat[,-1] pts <- st_as_sf(pts,coords=c(\"Easting\",\"Northing\")) st_crs(pts) <- 26911 studyArea <- rast(system.file(\"extdata\",\"predictors_2012-03-25.tif\",package=\"CAST\")) studyArea[!is.na(studyArea)] <- 1 studyArea <- as.polygons(studyArea, values = FALSE, na.all = TRUE) |> st_as_sf() |> st_union() pts <- st_transform(pts, crs = st_crs(studyArea)) plot(studyArea) plot(st_geometry(pts), add = TRUE, col = \"red\") knndm_folds <- knndm(pts, modeldomain=studyArea, k = 5) knndm_folds plot(knndm_folds) folds <- as.character(knndm_folds$clusters) ggplot() + geom_sf(data = pts, aes(col = folds)) #use for cross-validation: library(caret) ctrl <- trainControl(method=\"cv\", index=knndm_folds$indx_train, savePredictions='final') model_knndm <- train(dat[,c(\"DEM\",\"TWI\", \"NDRE.M\")], dat$VW, method=\"rf\", trControl = ctrl) global_validation(model_knndm) }"},{"path":"https://hannameyer.github.io/CAST/reference/multiCV.html","id":null,"dir":"Reference","previous_headings":"","what":"MultiCV — multiCV","title":"MultiCV — multiCV","text":"Multiple Cross-Validation increasing feature space 
clusters","code":""},{"path":"https://hannameyer.github.io/CAST/reference/multiCV.html","id":"ref-usage","dir":"Reference","previous_headings":"","what":"Usage","title":"MultiCV — multiCV","text":"","code":"multiCV(model, length.out, method, useWeight, ...)"},{"path":"https://hannameyer.github.io/CAST/reference/multiCV.html","id":"arguments","dir":"Reference","previous_headings":"","what":"Arguments","title":"MultiCV — multiCV","text":"model model used get AOA length.out Numeric. used multiCV=TRUE. Number cross-validation folds. See details. method Character. Method used distance calculation. Currently euclidean distance (L2) Mahalanobis distance (MD) implemented L2 tested. Note MD takes considerably longer. See ?aoa explanation useWeight Logical. model given. Weight variables according importance model? ... additional parameters trainDI","code":""},{"path":"https://hannameyer.github.io/CAST/reference/multiCV.html","id":"value","dir":"Reference","previous_headings":"","what":"Value","title":"MultiCV — multiCV","text":"preds_all","code":""},{"path":"https://hannameyer.github.io/CAST/reference/nndm.html","id":null,"dir":"Reference","previous_headings":"","what":"Nearest Neighbour Distance Matching (NNDM) algorithm — nndm","title":"Nearest Neighbour Distance Matching (NNDM) algorithm — nndm","text":"function implements NNDM algorithm returns necessary indices perform NNDM LOO CV map validation.","code":""},{"path":"https://hannameyer.github.io/CAST/reference/nndm.html","id":"ref-usage","dir":"Reference","previous_headings":"","what":"Usage","title":"Nearest Neighbour Distance Matching (NNDM) algorithm — nndm","text":"","code":"nndm( tpoints, modeldomain = NULL, ppoints = NULL, samplesize = 1000, sampling = \"regular\", phi = \"max\", min_train = 0.5 )"},{"path":"https://hannameyer.github.io/CAST/reference/nndm.html","id":"arguments","dir":"Reference","previous_headings":"","what":"Arguments","title":"Nearest Neighbour Distance Matching (NNDM) algorithm — 
nndm","text":"tpoints sf sfc point object. Contains training points samples. modeldomain sf polygon object defining prediction area (see Details). ppoints sf sfc point object. Contains target prediction points. Optional. Alternative modeldomain (see Details). samplesize numeric. many points modeldomain sampled prediction points? required modeldomain used instead ppoints. sampling character. draw prediction points modeldomain? See `sf::st_sample`. required modeldomain used instead ppoints. phi Numeric. Estimate landscape autocorrelation range units tpoints ppoints projected CRS, meters geographic CRS. Per default (phi=\"max\"), size prediction area used. See Details. min_train Numeric 0 1. Minimum proportion training data must used CV fold. Defaults 0.5 (.e. half training points).","code":""},{"path":"https://hannameyer.github.io/CAST/reference/nndm.html","id":"value","dir":"Reference","previous_headings":"","what":"Value","title":"Nearest Neighbour Distance Matching (NNDM) algorithm — nndm","text":"object class nndm consisting list six elements: indx_train, indx_test, indx_exclude (indices observations use training/test/excluded data NNDM LOO CV iteration), Gij (distances G function construction prediction target points), Gj (distances G function construction LOO CV), Gjstar (distances modified G function NNDM LOO CV), phi (landscape autocorrelation range). indx_train indx_test can directly used \"index\" \"indexOut\" caret's trainControl function used initiate custom validation strategy mlr3.","code":""},{"path":"https://hannameyer.github.io/CAST/reference/nndm.html","id":"details","dir":"Reference","previous_headings":"","what":"Details","title":"Nearest Neighbour Distance Matching (NNDM) algorithm — nndm","text":"NNDM proposes LOO CV scheme nearest neighbour distance distribution function test training data CV process matched nearest neighbour distance distribution function prediction training points. Details method can found Milà et al. (2022). 
Specifying phi allows limiting distance matching area assumed relevant due spatial autocorrelation. Distances matched phi. Beyond range, data points used training, without exclusions. phi set \"max\", nearest neighbor distance matching performed entire prediction area. Euclidean distances used projected non-defined CRS, great circle distances used geographic CRS (units meters). modeldomain sf polygon defines prediction area. function takes regular point sample (amount defined samplesize) spatial extent. alternative use ppoints instead modeldomain, already defined prediction locations (e.g. raster pixel centroids). using either modeldomain ppoints, advise plot study area polygon training/prediction points previous step ensure aligned.","code":""},{"path":"https://hannameyer.github.io/CAST/reference/nndm.html","id":"note","dir":"Reference","previous_headings":"","what":"Note","title":"Nearest Neighbour Distance Matching (NNDM) algorithm — nndm","text":"NNDM variation LOOCV therefore may take long time large training data sets. k-fold variant implemented shortly.","code":""},{"path":"https://hannameyer.github.io/CAST/reference/nndm.html","id":"references","dir":"Reference","previous_headings":"","what":"References","title":"Nearest Neighbour Distance Matching (NNDM) algorithm — nndm","text":"Milà, C., Mateu, J., Pebesma, E., Meyer, H. (2022): Nearest Neighbour Distance Matching Leave-One-Cross-Validation map validation. Methods Ecology Evolution 00, 1– 13. Meyer, H., Pebesma, E. (2022): Machine learning-based global maps ecological variables challenge assessing . Nature Communications. 
13.","code":""},{"path":[]},{"path":"https://hannameyer.github.io/CAST/reference/nndm.html","id":"author","dir":"Reference","previous_headings":"","what":"Author","title":"Nearest Neighbour Distance Matching (NNDM) algorithm — nndm","text":"Carles Milà","code":""},{"path":"https://hannameyer.github.io/CAST/reference/nndm.html","id":"ref-examples","dir":"Reference","previous_headings":"","what":"Examples","title":"Nearest Neighbour Distance Matching (NNDM) algorithm — nndm","text":"","code":"######################################################################## # Example 1: Simulated data - Randomly-distributed training points ######################################################################## library(sf) # Simulate 100 random training points in a 100x100 square set.seed(123) poly <- list(matrix(c(0,0,0,100,100,100,100,0,0,0), ncol=2, byrow=TRUE)) sample_poly <- sf::st_polygon(poly) train_points <- sf::st_sample(sample_poly, 100, type = \"random\") pred_points <- sf::st_sample(sample_poly, 100, type = \"regular\") plot(sample_poly) plot(pred_points, add = TRUE, col = \"blue\") plot(train_points, add = TRUE, col = \"red\") # Run NNDM for the whole domain, here the prediction points are known nndm_pred <- nndm(train_points, ppoints=pred_points) nndm_pred #> nndm object #> Total number of points: 100 #> Mean number of training points: 98.54 #> Minimum number of training points: 83 plot(nndm_pred) # ...or run NNDM with a known autocorrelation range of 10 # to restrict the matching to distances lower than that. 
nndm_pred <- nndm(train_points, ppoints=pred_points, phi = 10) nndm_pred #> nndm object #> Total number of points: 100 #> Mean number of training points: 98.72 #> Minimum number of training points: 96 plot(nndm_pred) ######################################################################## # Example 2: Simulated data - Clustered training points ######################################################################## library(sf) # Simulate 100 clustered training points in a 100x100 square set.seed(123) poly <- list(matrix(c(0,0,0,100,100,100,100,0,0,0), ncol=2, byrow=TRUE)) sample_poly <- sf::st_polygon(poly) train_points <- clustered_sample(sample_poly, 100, 10, 5) pred_points <- sf::st_sample(sample_poly, 100, type = \"regular\") plot(sample_poly) plot(pred_points, add = TRUE, col = \"blue\") plot(train_points, add = TRUE, col = \"red\") # Run NNDM for the whole domain nndm_pred <- nndm(train_points, ppoints=pred_points) nndm_pred #> nndm object #> Total number of points: 100 #> Mean number of training points: 86.84 #> Minimum number of training points: 50 plot(nndm_pred) ######################################################################## # Example 3: Real- world example; using a modeldomain instead of previously # sampled prediction locations ######################################################################## if (FALSE) { library(sf) library(terra) ### prepare sample data: dat <- readRDS(system.file(\"extdata\",\"Cookfarm.RDS\",package=\"CAST\")) dat <- aggregate(dat[,c(\"DEM\",\"TWI\", \"NDRE.M\", \"Easting\", \"Northing\",\"VW\")], by=list(as.character(dat$SOURCEID)),mean) pts <- dat[,-1] pts <- st_as_sf(pts,coords=c(\"Easting\",\"Northing\")) st_crs(pts) <- 26911 studyArea <- rast(system.file(\"extdata\",\"predictors_2012-03-25.tif\",package=\"CAST\")) studyArea[!is.na(studyArea)] <- 1 studyArea <- as.polygons(studyArea, values = FALSE, na.all = TRUE) |> st_as_sf() |> st_union() pts <- st_transform(pts, crs = st_crs(studyArea)) plot(studyArea) 
plot(st_geometry(pts), add = TRUE, col = \"red\") nndm_folds <- nndm(pts, modeldomain= studyArea) plot(nndm_folds) #use for cross-validation: library(caret) ctrl <- trainControl(method=\"cv\", index=nndm_folds$indx_train, indexOut=nndm_folds$indx_test, savePredictions='final') model_nndm <- train(dat[,c(\"DEM\",\"TWI\", \"NDRE.M\")], dat$VW, method=\"rf\", trControl = ctrl) global_validation(model_nndm) }"},{"path":"https://hannameyer.github.io/CAST/reference/plot.html","id":null,"dir":"Reference","previous_headings":"","what":"Plot CAST classes — plot","title":"Plot CAST classes — plot","text":"Generic plot function CAST Classes plotting function forward feature selection result. point mean performance model run. Error bars represent standard errors cross validation. Marked points show best model number variables variable improve results. type==\"selected\", contribution selected variables model performance shown. Density plot nearest neighbor distances geographic space feature space training data well training data prediction locations. Optional, nearest neighbor distances training data test data training data CV iterations shown. plot can used check suitability chosen CV method representative estimate map accuracy. Plot DI errormetric Cross-Validation modelled relationship","code":""},{"path":"https://hannameyer.github.io/CAST/reference/plot.html","id":"ref-usage","dir":"Reference","previous_headings":"","what":"Usage","title":"Plot CAST classes — plot","text":"","code":"# S3 method for trainDI plot(x, ...) # S3 method for aoa plot(x, samplesize = 1000, ...) # S3 method for nndm plot(x, ...) # S3 method for knndm plot(x, ...) # S3 method for ffs plot( x, plotType = \"all\", palette = rainbow, reverse = FALSE, marker = \"black\", size = 1.5, lwd = 0.5, pch = 21, ... ) # S3 method for geodist plot(x, unit = \"m\", stat = \"density\", ...) 
# S3 method for errorModel plot(x, ...)"},{"path":"https://hannameyer.github.io/CAST/reference/plot.html","id":"arguments","dir":"Reference","previous_headings":"","what":"Arguments","title":"Plot CAST classes — plot","text":"x errorModel, see DItoErrormetric ... params samplesize numeric. many prediction samples plotted? plotType character. Either \"\" \"selected\" palette color palette reverse Character. palette reversed? marker Character. Color mark best models size Numeric. Size points lwd Numeric. Width error bars pch Numeric. Type point marking best models unit character. type==\"geo\" applied plot. Supported: \"m\" \"km\". stat \"density\" density plot \"ecdf\" empirical cumulative distribution function plot.","code":""},{"path":"https://hannameyer.github.io/CAST/reference/plot.html","id":"value","dir":"Reference","previous_headings":"","what":"Value","title":"Plot CAST classes — plot","text":"ggplot ggplot","code":""},{"path":[]},{"path":"https://hannameyer.github.io/CAST/reference/plot.html","id":"author","dir":"Reference","previous_headings":"","what":"Author","title":"Plot CAST classes — plot","text":"Marvin Ludwig, Hanna Meyer Carles Milà Marvin Ludwig Hanna Meyer","code":""},{"path":"https://hannameyer.github.io/CAST/reference/plot.html","id":"ref-examples","dir":"Reference","previous_headings":"","what":"Examples","title":"Plot CAST classes — plot","text":"","code":"if (FALSE) { data(splotdata) splotdata <- st_drop_geometry(splotdata) ffsmodel <- ffs(splotdata[,6:16], splotdata$Species_richness, ntree = 10) plot(ffsmodel) #plot performance of selected variables only: plot(ffsmodel,plotType=\"selected\") }"},{"path":"https://hannameyer.github.io/CAST/reference/plot_ffs.html","id":null,"dir":"Reference","previous_headings":"","what":"Plot results of a Forward feature selection or best subset selection — plot_ffs","title":"Plot results of a Forward feature selection or best subset selection — plot_ffs","text":"plot_ffs() deprecated removed soon. 
Please use generic plot() function ffs object. plotting function forward feature selection result. point mean performance model run. Error bars represent standard errors cross validation. Marked points show best model number variables variable improve results. type==\"selected\", contribution selected variables model performance shown.","code":""},{"path":"https://hannameyer.github.io/CAST/reference/plot_ffs.html","id":"ref-usage","dir":"Reference","previous_headings":"","what":"Usage","title":"Plot results of a Forward feature selection or best subset selection — plot_ffs","text":"","code":"plot_ffs( ffs_model, plotType = \"all\", palette = rainbow, reverse = FALSE, marker = \"black\", size = 1.5, lwd = 0.5, pch = 21, ... )"},{"path":"https://hannameyer.github.io/CAST/reference/plot_ffs.html","id":"arguments","dir":"Reference","previous_headings":"","what":"Arguments","title":"Plot results of a Forward feature selection or best subset selection — plot_ffs","text":"ffs_model Result forward feature selection see ffs plotType character. Either \"\" \"selected\" palette color palette reverse Character. palette reversed? marker Character. Color mark best models size Numeric. Size points lwd Numeric. Width error bars pch Numeric. Type point marking best models ... 
arguments base plot type=\"selected\"","code":""},{"path":[]},{"path":"https://hannameyer.github.io/CAST/reference/plot_ffs.html","id":"author","dir":"Reference","previous_headings":"","what":"Author","title":"Plot results of a Forward feature selection or best subset selection — plot_ffs","text":"Marvin Ludwig Hanna Meyer","code":""},{"path":"https://hannameyer.github.io/CAST/reference/plot_ffs.html","id":"ref-examples","dir":"Reference","previous_headings":"","what":"Examples","title":"Plot results of a Forward feature selection or best subset selection — plot_ffs","text":"","code":"if (FALSE) { data(iris) ffsmodel <- ffs(iris[,1:4],iris$Species) plot(ffsmodel) #plot performance of selected variables only: plot(ffsmodel,plotType=\"selected\") }"},{"path":"https://hannameyer.github.io/CAST/reference/plot_geodist.html","id":null,"dir":"Reference","previous_headings":"","what":"Plot euclidean nearest neighbor distances in geographic space or feature space — plot_geodist","title":"Plot euclidean nearest neighbor distances in geographic space or feature space — plot_geodist","text":"Density plot nearest neighbor distances geographic space feature space training data well training data prediction locations. Optional, nearest neighbor distances training data test data training data CV iterations shown. plot can used check suitability chosen CV method representative estimate map accuracy. 
Alternatively distances can also calculated multivariate feature space.","code":""},{"path":"https://hannameyer.github.io/CAST/reference/plot_geodist.html","id":"ref-usage","dir":"Reference","previous_headings":"","what":"Usage","title":"Plot euclidean nearest neighbor distances in geographic space or feature space — plot_geodist","text":"","code":"plot_geodist( x, modeldomain, type = \"geo\", cvfolds = NULL, cvtrain = NULL, testdata = NULL, samplesize = 2000, sampling = \"regular\", variables = NULL, unit = \"m\", stat = \"density\", showPlot = TRUE )"},{"path":"https://hannameyer.github.io/CAST/reference/plot_geodist.html","id":"arguments","dir":"Reference","previous_headings":"","what":"Arguments","title":"Plot euclidean nearest neighbor distances in geographic space or feature space — plot_geodist","text":"x object class sf, training data locations modeldomain SpatRaster, stars sf object defining prediction area (see Details) type \"geo\" \"feature\". distance computed geographic space normalized multivariate predictor space (see Details) cvfolds optional. list vector. Either list element contains data points used testing cross validation iteration (.e. held back data). vector contains ID fold training point. See e.g. ?createFolds ?CreateSpacetimeFolds ?nndm cvtrain optional. List row indices x fit model CV iteration. cvtrain null cvfolds, samples included cvfolds used training data testdata optional. object class sf: Data used independent validation samplesize numeric. many prediction samples used? sampling character. draw prediction samples? See spsample. Use sampling = \"Fibonacci\" global applications. variables character vector defining predictor variables used type=\"feature\". provided variables included modeldomain used. unit character. type==\"geo\" applied plot. Supported: \"m\" \"km\". stat \"density\" density plot \"ecdf\" empirical cumulative distribution function plot. 
showPlot logical","code":""},{"path":"https://hannameyer.github.io/CAST/reference/plot_geodist.html","id":"value","dir":"Reference","previous_headings":"","what":"Value","title":"Plot euclidean nearest neighbor distances in geographic space or feature space — plot_geodist","text":"list including plot corresponding data.frame containing distances. Unit returned geographic distances meters.","code":""},{"path":"https://hannameyer.github.io/CAST/reference/plot_geodist.html","id":"details","dir":"Reference","previous_headings":"","what":"Details","title":"Plot euclidean nearest neighbor distances in geographic space or feature space — plot_geodist","text":"modeldomain sf polygon raster defines prediction area. function takes regular point sample (amount defined samplesize) spatial extent. type = \"feature\", argument modeldomain (provided also testdata) include predictors. Predictor values x optional modeldomain raster. provided extracted modeldomain rasterStack.","code":""},{"path":"https://hannameyer.github.io/CAST/reference/plot_geodist.html","id":"note","dir":"Reference","previous_headings":"","what":"Note","title":"Plot euclidean nearest neighbor distances in geographic space or feature space — plot_geodist","text":"See Meyer Pebesma (2022) application plotting function","code":""},{"path":[]},{"path":"https://hannameyer.github.io/CAST/reference/plot_geodist.html","id":"author","dir":"Reference","previous_headings":"","what":"Author","title":"Plot euclidean nearest neighbor distances in geographic space or feature space — plot_geodist","text":"Hanna Meyer, Edzer Pebesma, Marvin Ludwig","code":""},{"path":"https://hannameyer.github.io/CAST/reference/plot_geodist.html","id":"ref-examples","dir":"Reference","previous_headings":"","what":"Examples","title":"Plot euclidean nearest neighbor distances in geographic space or feature space — plot_geodist","text":"","code":"if (FALSE) { library(sf) library(terra) library(caret) ########### prepare sample data: dat <- 
readRDS(system.file(\"extdata\",\"Cookfarm.RDS\",package=\"CAST\")) dat <- aggregate(dat[,c(\"DEM\",\"TWI\", \"NDRE.M\", \"Easting\", \"Northing\")], by=list(as.character(dat$SOURCEID)),mean) pts <- st_as_sf(dat,coords=c(\"Easting\",\"Northing\")) st_crs(pts) <- 26911 pts_train <- pts[1:29,] pts_test <- pts[30:42,] studyArea <- terra::rast(system.file(\"extdata\",\"predictors_2012-03-25.tif\",package=\"CAST\")) studyArea <- studyArea[[c(\"DEM\",\"TWI\", \"NDRE.M\", \"NDRE.Sd\", \"Bt\")]] ########### Distance between training data and new data: dist <- plot_geodist(pts_train,studyArea) ########### Distance between training data, new data and test data: #mapview(pts_train,col.regions=\"blue\")+mapview(pts_test,col.regions=\"red\") dist <- plot_geodist(pts_train,studyArea,testdata=pts_test) ########### Distance between training data, new data and CV folds: folds <- createFolds(1:nrow(pts_train),k=3,returnTrain=FALSE) dist <- plot_geodist(x=pts_train, modeldomain=studyArea, cvfolds=folds) ## or use nndm to define folds AOI <- as.polygons(rast(studyArea), values = F) |> st_as_sf() |> st_union() |> st_transform(crs = st_crs(pts_train)) nndm_pred <- nndm(pts_train, AOI) dist <- plot_geodist(x=pts_train, modeldomain=studyArea, cvfolds=nndm_pred$indx_test, cvtrain=nndm_pred$indx_train) ########### Distances in the feature space: plot_geodist(x=pts_train, modeldomain=studyArea, type = \"feature\",variables=c(\"DEM\",\"TWI\", \"NDRE.M\")) dist <- plot_geodist(x=pts_train, modeldomain=studyArea, cvfolds = folds, testdata = pts_test, type = \"feature\",variables=c(\"DEM\",\"TWI\", \"NDRE.M\")) ############ Example for a random global dataset ############ (refer to figure in Meyer and Pebesma 2022) library(sf) library(rnaturalearth) library(ggplot2) ### Define prediction area (here: global): ee <- st_crs(\"+proj=eqearth\") co <- ne_countries(returnclass = \"sf\") co.ee <- st_transform(co, ee) ### Simulate a spatial random sample ### (alternatively replace pts_random by a real 
sampling dataset (see Meyer and Pebesma 2022): sf_use_s2(FALSE) pts_random <- st_sample(co.ee, 2000, exact=FALSE) ### See points on the map: ggplot() + geom_sf(data = co.ee, fill=\"#00BFC4\",col=\"#00BFC4\") + geom_sf(data = pts_random, color = \"#F8766D\",size=0.5, shape=3) + guides(fill = FALSE, col = FALSE) + labs(x = NULL, y = NULL) ### plot distances: dist <- plot_geodist(pts_random,co.ee,showPlot=FALSE) dist$plot+scale_x_log10(labels=round) }"},{"path":"https://hannameyer.github.io/CAST/reference/print.html","id":null,"dir":"Reference","previous_headings":"","what":"Print CAST classes — print","title":"Print CAST classes — print","text":"Generic print function trainDI aoa","code":""},{"path":"https://hannameyer.github.io/CAST/reference/print.html","id":"ref-usage","dir":"Reference","previous_headings":"","what":"Usage","title":"Print CAST classes — print","text":"","code":"# S3 method for trainDI print(x, ...) show.trainDI(x, ...) # S3 method for aoa print(x, ...) show.aoa(x, ...) # S3 method for nndm print(x, ...) show.nndm(x, ...) # S3 method for knndm print(x, ...) show.knndm(x, ...) # S3 method for ffs print(x, ...) show.ffs(x, ...)"},{"path":"https://hannameyer.github.io/CAST/reference/print.html","id":"arguments","dir":"Reference","previous_headings":"","what":"Arguments","title":"Print CAST classes — print","text":"x object type ffs ... 
arguments.","code":""},{"path":"https://hannameyer.github.io/CAST/reference/splotdata.html","id":null,"dir":"Reference","previous_headings":"","what":"sPlotOpen Data of Species Richness — splotdata","title":"sPlotOpen Data of Species Richness — splotdata","text":"sPlotOpen Species Richness South America associated predictors","code":""},{"path":"https://hannameyer.github.io/CAST/reference/splotdata.html","id":"ref-usage","dir":"Reference","previous_headings":"","what":"Usage","title":"sPlotOpen Data of Species Richness — splotdata","text":"","code":"data(splotdata)"},{"path":"https://hannameyer.github.io/CAST/reference/splotdata.html","id":"format","dir":"Reference","previous_headings":"","what":"Format","title":"sPlotOpen Data of Species Richness — splotdata","text":"sf points / data.frame 703 rows 17 columns: PlotObeservationID, GIVD_ID, Country, Biome sPlotOpen Metadata Species_richness Response Variable - Plant species richness sPlotOpen bio_x, elev Predictor Variables - Worldclim SRTM elevation geometry Lat/Lon","code":""},{"path":"https://hannameyer.github.io/CAST/reference/splotdata.html","id":"source","dir":"Reference","previous_headings":"","what":"Source","title":"sPlotOpen Data of Species Richness — splotdata","text":"Plot Species_richness sPlotOpen predictors acquired via R package geodata","code":""},{"path":"https://hannameyer.github.io/CAST/reference/splotdata.html","id":"references","dir":"Reference","previous_headings":"","what":"References","title":"sPlotOpen Data of Species Richness — splotdata","text":"Sabatini, F. M. et al. sPlotOpen – environmentally balanced, open‐access, global dataset vegetation plots. (2021). doi:10.1111/geb.13346 Lopez-Gonzalez, G. et al. ForestPlots.net: web application research tool manage analyse tropical forest plot data: ForestPlots.net. Journal Vegetation Science (2011). Pauchard, . et al. Alien Plants Homogenise Protected Areas: Evidence Landscape Regional Scales South Central Chile. 
Plant Invasions Protected Areas (2013). Peyre, G. et al. VegPáramo, flora vegetation database Andean páramo. phytocoenologia (2015). Vibrans, . C. et al. Insights large-scale inventory southern Brazilian Atlantic Forest. Scientia Agricola (2020).","code":""},{"path":"https://hannameyer.github.io/CAST/reference/trainDI.html","id":null,"dir":"Reference","previous_headings":"","what":"Calculate Dissimilarity Index of training data — trainDI","title":"Calculate Dissimilarity Index of training data — trainDI","text":"function estimates Dissimilarity Index (DI) within training data set used prediction model. Predictors can weighted based internal variable importance machine learning algorithm used model training.","code":""},{"path":"https://hannameyer.github.io/CAST/reference/trainDI.html","id":"ref-usage","dir":"Reference","previous_headings":"","what":"Usage","title":"Calculate Dissimilarity Index of training data — trainDI","text":"","code":"trainDI( model = NA, train = NULL, variables = \"all\", weight = NA, CVtest = NULL, CVtrain = NULL, method = \"L2\", useWeight = TRUE )"},{"path":"https://hannameyer.github.io/CAST/reference/trainDI.html","id":"arguments","dir":"Reference","previous_headings":"","what":"Arguments","title":"Calculate Dissimilarity Index of training data — trainDI","text":"model train object created caret used extract weights (based variable importance) well cross-validation folds train data.frame containing data used model training. required model given variables character vector predictor variables. \"\" variables model used model given train dataset. weight data.frame containing weights variable. required model given. CVtest list vector. Either list element contains data points used testing cross validation iteration (.e. held back data). vector contains ID fold training point. required model given. CVtrain list. element contains data points used training cross validation iteration (.e. held back data). 
required model given required CVtrain opposite CVtest (.e. data point used testing, used training). Relevant data points excluded, e.g. using nndm. method Character. Method used distance calculation. Currently Euclidean distance (L2) Mahalanobis distance (MD) implemented L2 tested. Note MD takes considerably longer. useWeight Logical. model given. Weight variables according importance model?","code":""},{"path":"https://hannameyer.github.io/CAST/reference/trainDI.html","id":"value","dir":"Reference","previous_headings":"","what":"Value","title":"Calculate Dissimilarity Index of training data — trainDI","text":"list class trainDI containing: train data frame containing training data weight data frame weights based variable importance. variables Names used variables catvars variables categorical scaleparam Scaling parameters. Output scale trainDist_avrg data frame average distance training point every point trainDist_avrgmean mean trainDist_avrg. Used normalizing DI trainDI Dissimilarity Index training data threshold DI threshold used inside/outside AOA","code":""},{"path":"https://hannameyer.github.io/CAST/reference/trainDI.html","id":"note","dir":"Reference","previous_headings":"","what":"Note","title":"Calculate Dissimilarity Index of training data — trainDI","text":"function called within aoa estimate DI AOA new data. However, may also used DI training data interest, facilitate parallelization aoa avoiding repeated calculation DI within training data.","code":""},{"path":"https://hannameyer.github.io/CAST/reference/trainDI.html","id":"references","dir":"Reference","previous_headings":"","what":"References","title":"Calculate Dissimilarity Index of training data — trainDI","text":"Meyer, H., Pebesma, E. (2021): Predicting unknown space? Estimating area applicability spatial prediction models. 
doi:10.1111/2041-210X.13650","code":""},{"path":[]},{"path":"https://hannameyer.github.io/CAST/reference/trainDI.html","id":"author","dir":"Reference","previous_headings":"","what":"Author","title":"Calculate Dissimilarity Index of training data — trainDI","text":"Hanna Meyer, Marvin Ludwig","code":""},{"path":"https://hannameyer.github.io/CAST/reference/trainDI.html","id":"ref-examples","dir":"Reference","previous_headings":"","what":"Examples","title":"Calculate Dissimilarity Index of training data — trainDI","text":"","code":"if (FALSE) { library(sf) library(terra) library(caret) library(viridis) library(ggplot2) # prepare sample data: dat <- readRDS(system.file(\"extdata\",\"Cookfarm.RDS\",package=\"CAST\")) dat <- aggregate(dat[,c(\"VW\",\"Easting\",\"Northing\")],by=list(as.character(dat$SOURCEID)),mean) pts <- st_as_sf(dat,coords=c(\"Easting\",\"Northing\")) pts$ID <- 1:nrow(pts) set.seed(100) pts <- pts[1:30,] studyArea <- rast(system.file(\"extdata\",\"predictors_2012-03-25.tif\",package=\"CAST\"))[[1:8]] trainDat <- extract(studyArea,pts,na.rm=FALSE) trainDat <- merge(trainDat,pts,by.x=\"ID\",by.y=\"ID\") # visualize data spatially: plot(studyArea) plot(studyArea$DEM) plot(pts[,1],add=TRUE,col=\"black\") # train a model: set.seed(100) variables <- c(\"DEM\",\"NDRE.Sd\",\"TWI\") model <- train(trainDat[,which(names(trainDat)%in%variables)], trainDat$VW, method=\"rf\", importance=TRUE, tuneLength=1, trControl=trainControl(method=\"cv\",number=5,savePredictions=T)) print(model) #note that this is a quite poor prediction model prediction <- predict(studyArea,model,na.rm=TRUE) plot(varImp(model,scale=FALSE)) #...then calculate the DI of the trained model: DI = trainDI(model=model) plot(DI) # the DI can now be used to compute the AOA: AOA = aoa(studyArea, model = model, trainDI = DI) print(AOA) plot(AOA) }"},{"path":"https://hannameyer.github.io/CAST/news/index.html","id":"cast-090","dir":"Changelog","previous_headings":"","what":"CAST 0.9.0","title":"CAST 
0.9.0","text":"CRAN release: 2024-01-09 CAST functions now return classes generic plotting printing new dataset examples, tutorials testing: data(splotdata) calibrate_aoa now DItoErrormetric returns model (see function documentation) plot_geodist now geodist. result can visualized plot() plot_ffs now plot(ffs) fix issue #65 (threshold) plot_geodist, plot_ffs, calibrate_aoa","code":""},{"path":"https://hannameyer.github.io/CAST/news/index.html","id":"cast-081","dir":"Changelog","previous_headings":"","what":"CAST 0.8.1","title":"CAST 0.8.1","text":"CRAN release: 2023-05-30 failed checks Fedora 34 fixed","code":""},{"path":"https://hannameyer.github.io/CAST/news/index.html","id":"cast-080","dir":"Changelog","previous_headings":"","what":"CAST 0.8.0","title":"CAST 0.8.0","text":"CRAN release: 2023-05-21 knndm alternative nndm large training data transition raster terra","code":""},{"path":"https://hannameyer.github.io/CAST/news/index.html","id":"cast-071","dir":"Changelog","previous_headings":"","what":"CAST 0.7.1","title":"CAST 0.7.1","text":"CRAN release: 2023-02-04 Mahalanobis distance AOA assessment option faster estimation AOA delineation default threshold fixed suggested github.com/HannaMeyer/CAST/issues/46 fixed issue github.com/ropensci/rnaturalearth/issues/69","code":""},{"path":"https://hannameyer.github.io/CAST/news/index.html","id":"cast-070","dir":"Changelog","previous_headings":"","what":"CAST 0.7.0","title":"CAST 0.7.0","text":"CRAN release: 2022-08-24 nndm cross-validation suggested Milà et al. 
(2022) plot_geodist works NNDM trainDI works NNDM rename parameter folds AOA trainDI","code":""},{"path":"https://hannameyer.github.io/CAST/news/index.html","id":"cast-060","dir":"Changelog","previous_headings":"","what":"CAST 0.6.0","title":"CAST 0.6.0","text":"CRAN release: 2022-03-17 trainDI allows calculate DI training dataset separately aoa function plot print functions AOA function plot nearest neighbor distance distributions geographic feature space function global_validation added extensive restructuring AOA function ffs bss can used global_validation error manual assignment weights fixed","code":""},{"path":"https://hannameyer.github.io/CAST/news/index.html","id":"cast-051","dir":"Changelog","previous_headings":"","what":"CAST 0.5.1","title":"CAST 0.5.1","text":"CRAN release: 2021-04-07 resolved dependence package “GSIF” removed CRAN repository","code":""},{"path":"https://hannameyer.github.io/CAST/news/index.html","id":"cast-050","dir":"Changelog","previous_headings":"","what":"CAST 0.5.0","title":"CAST 0.5.0","text":"CRAN release: 2021-02-19 AOA can run parallel calibration DI (calibrate_aoa) aoa work now large training sets default threshold AOA changed","code":""},{"path":"https://hannameyer.github.io/CAST/news/index.html","id":"cast-042","dir":"Changelog","previous_headings":"","what":"CAST 0.4.2","title":"CAST 0.4.2","text":"CRAN release: 2020-07-17 aoa now working categorical variables fixed error ffs >170 variables used changed order parameters aoa tutorial “Introduction CAST” improved","code":""},{"path":"https://hannameyer.github.io/CAST/news/index.html","id":"cast-041","dir":"Changelog","previous_headings":"","what":"CAST 0.4.1","title":"CAST 0.4.1","text":"CRAN release: 2020-05-19 vignette: tutorial introducing “area applicability” variable threshold aoa various modifications aoa line submitted paper","code":""},{"path":"https://hannameyer.github.io/CAST/news/index.html","id":"cast-040","dir":"Changelog","previous_headings":"","what":"CAST 
0.4.0","title":"CAST 0.4.0","text":"CRAN release: 2020-04-06 new function “aoa”: quantify visualize area applicability spatial prediction models “minVar” ffs: Instead always starting 2-pair combinations, ffs can now also started combinations variables (e.g starting combinations 3) ffs failed “svmLinear” previous version S4 class issues. Fixed now.","code":""},{"path":"https://hannameyer.github.io/CAST/news/index.html","id":"cast-031","dir":"Changelog","previous_headings":"","what":"CAST 0.3.1","title":"CAST 0.3.1","text":"CRAN release: 2018-11-19 CreateSpaceTimeFolds accepts tibbles CreateSpaceTimeFolds automatically reduces k necessary ffs accepts arguments taken caret::train new feature: plot_ffs option plot selected variables ","code":""},{"path":"https://hannameyer.github.io/CAST/news/index.html","id":"cast-030","dir":"Changelog","previous_headings":"","what":"CAST 0.3.0","title":"CAST 0.3.0","text":"CRAN release: 2018-10-11 new feature: Best subset selection (bss) target-oriented validation (slow reliable) alternative ffs minor adaptations: verbose option included, improved examples ffs bugfix: minor adaptations done usage plsr","code":""},{"path":"https://hannameyer.github.io/CAST/news/index.html","id":"cast-021","dir":"Changelog","previous_headings":"","what":"CAST 0.2.1","title":"CAST 0.2.1","text":"CRAN release: 2018-07-12 new feature: Introduction CAST included vignette. bugfix: minor error fixed using user defined metrics model selection.","code":""},{"path":"https://hannameyer.github.io/CAST/news/index.html","id":"cast-020","dir":"Changelog","previous_headings":"","what":"CAST 0.2.0","title":"CAST 0.2.0","text":"CRAN release: 2018-05-03 bugfix: ffs option withinSE=TRUE choose model “best model” within SE model trained earlier run number variables. bug fixed withinSE=TRUE ffs now compares performance models use less variables (e.g. model using 5 variables better model using 4 variables still SE 4-variable model, 4-variable model rated better model). 
new feature: plot_ffs plots results ffs visualize performance changes according model run number variables used.","code":""},{"path":"https://hannameyer.github.io/CAST/news/index.html","id":"cast-010","dir":"Changelog","previous_headings":"","what":"CAST 0.1.0","title":"CAST 0.1.0","text":"CRAN release: 2018-01-09 Initial public version CRAN","code":""}]