Maxim Jaffe's geospatial analytics demonstration
This project showcases my geospatial data analytics skills with a case study species.
The case study is the great xenops (*Megaxenops parnaguae*), a typical furnariid bird of the Brazilian Caatinga (Wikipedia, BirdLife Factsheet).
Image source: Wikimedia (João Quental CC BY 2.0)
The project creates a Species Distribution Model (SDM) for the case study species. It uses a prototype tool, MAXDM (Maxim's Species Distribution Models), coded specifically for this demonstration.
MAXDM SDMs predict distribution patterns based on the similarity of environmental variables to those at occurrence sites. It implements a geometric median similarity (GMS) method and a k nearest neighbours similarity (KNNS) method.
These similarity methods are applicable to presence-only data and are relatively straightforward to calculate and reason about.
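As a rough illustration of the two measures, here is a conceptual sketch (not the actual MAXDM code): environmental variables are assumed to be standardised, and the `1 / (1 + d)` distance-to-similarity transform is an illustrative choice rather than MAXDM's exact formula.

```python
# Conceptual sketch of GMS and KNNS (not the actual MAXDM implementation).
# Rows of occ_env / cell_env are sites, columns are standardised environmental variables.
import numpy as np
from sklearn.neighbors import NearestNeighbors


def geometric_median(X, n_iter=100, eps=1e-7):
    """Geometric median of the occurrence sites via Weiszfeld's algorithm."""
    m = X.mean(axis=0)
    for _ in range(n_iter):
        d = np.maximum(np.linalg.norm(X - m, axis=1), eps)
        w = 1.0 / d
        m_new = (X * w[:, None]).sum(axis=0) / w.sum()
        if np.linalg.norm(m_new - m) < eps:
            return m_new
        m = m_new
    return m


def gms(occ_env, cell_env):
    """Geometric median similarity: closeness of each cell to the median occurrence."""
    d = np.linalg.norm(cell_env - geometric_median(occ_env), axis=1)
    return 1.0 / (1.0 + d)


def knns(occ_env, cell_env, k=5):
    """k nearest neighbours similarity: closeness to the k nearest occurrences."""
    nn = NearestNeighbors(n_neighbors=k).fit(occ_env)
    d, _ = nn.kneighbors(cell_env)
    return 1.0 / (1.0 + d.mean(axis=1))
```

Both measures need only the presence records themselves, which is why they suit the presence-only GBIF data used here.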
To better understand the author's choices for the project, see the justifications below.
Example of a map for a technical report:
Summary of tasks and tools:
- Setup project (folders, scripts, packages, GRASS GIS): `Makefile`, `bash`
- Download base data (WorldClim, 'Natural Earth', GBIF): `wget`, `bash`
- Process data: `bash`, GRASS GIS
- Fit and apply model: `python` (MAXDM), GRASS GIS python API (`grass.script`)
- Visualise model results (PNG map): `bash`, GRASS GIS
Current setup is for Linux Mint 20.3.
In the project root, run `make` on the command line. For specific tasks run:
- `make setup`
- `make download`
- `make process`
- `make model`
- `make visualise`

To list subtasks run `make summary`; for further details read the `Makefile`.
Look in the `scripts` folder for specific bash or python scripts; these have names similar to the tasks defined in the `Makefile`.
External data is downloaded into the `data/external` folder. Internal data is stored in the `data/internal` folder, including GRASS GIS data. Generated maps are saved into the `maps` folder.
Most scripts are written in bash, as it integrates well with GRASS GIS. Python is used for the more complex components.
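As a small example of the Python side, the GRASS GIS python API can drive the same modules the bash scripts call. This is an illustrative sketch only: the raster name `bio1` is hypothetical, and it assumes the code runs inside an active GRASS session (e.g. one started from the project's `grassdata` database).

```python
# Illustrative sketch: drive GRASS GIS modules from Python via grass.script.
# Assumes an active GRASS session; the raster name "bio1" is hypothetical.
import grass.script as gs

# Align the computational region with an environmental raster.
gs.run_command("g.region", raster="bio1")

# Parse the shell-style output of `r.univar -g` into a dictionary of statistics.
stats = gs.parse_command("r.univar", map="bio1", flags="g")
print(stats["mean"], stats["stddev"])
```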
If you get any warnings due to the GRASS GIS environment, run the GUI with the following inputs at startup:
- Database directory: analytics-demo/data/internal/grassdata/
- Location: WGS_84
- Mapset: FFI
The entire setup can be cleaned up as follows:
- `make clean`: removes the `data` folder for a clean data setup (still keeps installed packages)
- `make clean-grass`: removes the `grassdata` folder for a clean GRASS GIS setup
Dependencies:
- GRASS GIS 7.8
- python 3.8
- pandas 0.25
- xarray 0.16
- scikit-learn 0.22
- bash 5.0
- wget 1.20
- gawk 1:5.0.1
Porting to other operating systems should be possible:
- Linux distributions:
  - Adapt `setup-packages` in `Makefile` to use the OS package manager (`apt`, `yum`, etc.)
- Mac OS:
  - Adapt `setup-packages` in `Makefile` to use MacPorts or another ports/package manager
- Windows:
I chose the great xenops as I have a great interest in the Caatinga seasonally dry tropical forest and in ornithology. The species is interesting because it is closely associated with dense Caatinga while also tolerating degraded Caatinga. It is also an iconic Caatinga species.
I chose data sources with worldwide coverage to demonstrate how the project could be adapted for other target species/taxa. WorldClim 2.5-minute data was selected as a compromise between resolution and download time.
This project uses GRASS GIS, the Python ecosystem, `bash`, `make`, and other Linux/UNIX commands (e.g. `wget`, `awk`) for geospatial analysis. It is completely based on open source software and tools.
GRASS GIS is particularly apt for dealing with raster data, which is common in SDMs. It has good integration with python and bash, which makes it particularly suited to automated and reproducible data analysis.
It also provides a good user interface that is useful for interactive data analysis, for prototyping batch analyses, and for verifying batch analysis results.
GRASS GIS provides a more robust, homogeneous, and well-integrated geospatial analysis experience compared to using exclusively Python ecosystem packages (e.g. fiona, geopandas, rasterio, xarray, cartopy). A similar argument can be made for R. Nevertheless, GRASS GIS integrates well with both ecosystems.
GRASS GIS is also open source, which makes it particularly well suited for use in resource-constrained environments (e.g. conservation projects in the Global South).
Python is particularly useful due to the following packages (see the sketch after this list):
- numerical computation (numpy, scipy, xarray)
- data processing (pandas, numpy)
- machine learning and statistical modelling (scikit-learn, etc.)
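For instance, pandas and xarray can be combined to attach WorldClim values to GBIF occurrence records. This is an illustrative sketch only: the file names are hypothetical and it assumes `rasterio` is available for xarray's GeoTIFF reader.

```python
# Illustrative sketch: sample a WorldClim bioclim raster at GBIF occurrence points.
# File names are hypothetical; xarray's open_rasterio requires rasterio.
import pandas as pd
import xarray as xr

# GBIF occurrence downloads are tab-separated with Darwin Core column names.
occ = pd.read_csv("data/external/occurrences.csv", sep="\t")
occ = occ.dropna(subset=["decimalLatitude", "decimalLongitude"])

# Read one bioclim layer and drop the singleton band dimension.
bio1 = xr.open_rasterio("data/external/wc2.1_2.5m_bio_1.tif").squeeze("band")

# Nearest-cell lookup at each occurrence coordinate (vectorised selection).
occ["bio1"] = bio1.sel(
    x=xr.DataArray(occ["decimalLongitude"].values, dims="points"),
    y=xr.DataArray(occ["decimalLatitude"].values, dims="points"),
    method="nearest",
).values
```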
`make` is a useful tool for organising data analysis pipelines, as it allows different tasks and their data dependencies to be defined in a `Makefile`.
This is more flexible than a single 'task' script, since specific tasks can easily be run on their own. When dependencies are already met (e.g. data files have been downloaded), this also avoids repeating work.
- `wget`: easy-to-use tool for downloading data
- `awk`: useful language for text/CSV processing
I prototyped MAXDM to demonstrate my ability to develop tools/models, in this case using a flexible package (scikit-learn) with off-the-shelf components. This similarity/distance based approach was selected as it could be implemented in a short period of time (2-3 days).
Note that in previous positions I have worked heavily with the following kinds of modelling techniques / tools:
- Generalised Linear Models (GLMs) based on abundance monitoring data (using `statsmodels` and `scikit-learn`)
- Hybrid ecological models linking GLMs to land use / land cover dynamic models (agent-based / system dynamics models, using NetLogo and Stella)
Data sources:
- GBIF
  - Megaxenops parnaguae Reiser, 1905 occurrences with coordinates (presence-only)
- WorldClim 2.1 historical climate data, 2.5 minutes resolution
  - Bioclimatic variables
  - Elevation
- Natural Earth 1:10m
  - Cultural Vectors: Admin 1 – States, Provinces
References:
- GBIF.org (15 April 2022) GBIF Occurrence Download https://doi.org/10.15468/dl.mcet5w
- Fick, S.E. and R.J. Hijmans, 2017. WorldClim 2: new 1km spatial resolution climate surfaces for global land areas. International Journal of Climatology 37 (12): 4302-4315.