quipa/analytics-demo

analytics-demo

Maxim Jaffe's geospatial analytics demonstration

Introduction

This project showcases my geospatial data analytics skills using a case study species.

The case study is the great xenops (Megaxenops parnaguae), a typical furnariid bird of the Brazilian Caatinga (Wikipedia, BirdLife Factsheet).

Megaxenops parnaguae

Image source: Wikimedia (João Quental CC BY 2.0)

The project creates a Species Distribution Model (SDM) for the case study species. It uses a prototype tool, MAXDM (Maxim's Species Distribution Models), coded specifically for this demonstration.

MAXDM SDMs predict distribution patterns based on how similar the environmental variables at a location are to those at occurrence sites. MAXDM implements two methods: geometric median similarity (GMS) and k nearest neighbours similarity (KNNS).

These similarity methods are applicable to presence-only data and are relatively straightforward to calculate and reason about.
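As a rough illustration of the two ideas, the sketch below uses only the Python standard library. It is not the MAXDM code: the Weiszfeld iteration used to approximate the geometric median, the 1 / (1 + distance) similarity score, and all function names are assumptions made for this example. The environmental variables are assumed to be already scaled to comparable ranges.

```python
import math


def geometric_median(points, iterations=200, tol=1e-9):
    """Approximate the geometric median with Weiszfeld's algorithm."""
    dims = len(points[0])
    # Start from the centroid of the occurrence sites.
    current = [sum(p[d] for p in points) / len(points) for d in range(dims)]
    for _ in range(iterations):
        # Weight each site by the inverse of its distance to the estimate;
        # the floor avoids division by zero if the estimate sits on a site.
        weights = [1.0 / max(math.dist(p, current), 1e-12) for p in points]
        total = sum(weights)
        updated = [
            sum(w * p[d] for w, p in zip(weights, points)) / total
            for d in range(dims)
        ]
        if math.dist(updated, current) < tol:
            return updated
        current = updated
    return current


def gms(cell, occurrences):
    """Similarity of a cell to the geometric median of the occurrence sites."""
    return 1.0 / (1.0 + math.dist(cell, geometric_median(occurrences)))


def knns(cell, occurrences, k=3):
    """Similarity of a cell based on mean distance to its k nearest sites."""
    nearest = sorted(math.dist(cell, o) for o in occurrences)[:k]
    return 1.0 / (1.0 + sum(nearest) / len(nearest))


# Hypothetical scaled environmental values at four occurrence sites.
sites = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0), (2.0, 2.0)]
near_cell, far_cell = (1.0, 1.0), (5.0, 5.0)

print(gms(near_cell, sites), gms(far_cell, sites))
print(knns(near_cell, sites), knns(far_cell, sites))
```

Both scores only need the environmental values at presence locations, which is why methods of this kind suit presence-only data: a cell scores higher the closer its environment is to that of the occurrence sites.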

To better understand the author's choices for the project, see Project justification.

Example of map for a technical report: Megaxenops parnaguae SDM

Tasks

Summary of tasks and tools:

  1. Setup project (folders, scripts, packages, GRASS GIS): Makefile, bash
  2. Download base data (WorldClim, 'Natural Earth', GBIF): wget, bash
  3. Process data: bash, GRASS GIS
  4. Fit and apply model: python (MAXDM), GRASS GIS python API (grass.script)
  5. Visualise model results (PNG map): bash, GRASS GIS

Setup

Current setup is for Linux Mint 20.3.

In the project root, run make on the command line. For specific tasks run:

  1. make setup
  2. make download
  3. make process
  4. make model
  5. make visualise

To list subtasks, run make summary; for further details, read the Makefile.
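For readers who prefer to drive the pipeline from Python, the following sketch runs the five targets listed above in order. The target names come from that list; the function names and the dry_run flag are hypothetical helpers for this example and are not part of the project.

```python
import subprocess

# Target names as listed above, in pipeline order.
TARGETS = ["setup", "download", "process", "model", "visualise"]


def make_commands(targets=TARGETS):
    """Build one `make <target>` command per task."""
    return [["make", target] for target in targets]


def run_pipeline(project_root=".", dry_run=True):
    """Run each target in order; stop at the first failure."""
    for command in make_commands():
        if dry_run:
            # Only show what would be executed.
            print(" ".join(command))
        else:
            # check=True raises CalledProcessError if make exits non-zero.
            subprocess.run(command, cwd=project_root, check=True)


run_pipeline(dry_run=True)
```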

Look in the scripts folder for specific bash or python scripts; these have names similar to the targets defined in the Makefile.

External data is downloaded into the data/external folder. Internal data, including GRASS GIS data, is stored in the data/internal folder.

Generated maps are saved into the maps folder.

Most scripts are in bash as it integrates well with GRASS GIS. Python is used for complex components.

If you get any warnings related to the GRASS GIS environment, run the GRASS GIS GUI with the following inputs at startup:

  • Database directory: analytics-demo/data/internal/grassdata/
  • Location: WGS_84
  • Mapset: FFI
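GRASS GIS stores each mapset as a directory at database/location/mapset, so the three inputs above map onto a single path. The sketch below builds that path and reports which levels are missing. The helper names are hypothetical, and the relative database path assumes the script is run from the directory that contains analytics-demo.

```python
from pathlib import Path

# Values taken from the startup inputs listed above.
GISDBASE = Path("analytics-demo/data/internal/grassdata")
LOCATION = "WGS_84"
MAPSET = "FFI"


def mapset_path(gisdbase=GISDBASE, location=LOCATION, mapset=MAPSET):
    """Return the mapset directory: <database>/<location>/<mapset>."""
    return Path(gisdbase) / location / mapset


def missing_levels(gisdbase=GISDBASE, location=LOCATION, mapset=MAPSET):
    """Return the database, location and mapset directories that do not exist."""
    levels = [
        Path(gisdbase),
        Path(gisdbase) / location,
        mapset_path(gisdbase, location, mapset),
    ]
    return [level.as_posix() for level in levels if not level.is_dir()]


print(mapset_path().as_posix())
print(missing_levels())
```

An empty list from missing_levels() suggests the GRASS GIS database was set up as expected; otherwise, rerunning make setup is the likely fix.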

The entire setup can be cleaned up as follows:

  • make clean: removes the data folder for a clean data setup (installed packages are kept)
  • make clean-grass: removes the grassdata folder for a clean GRASS GIS setup

Dependencies

  • GRASS GIS 7.8
  • python 3.8
  • pandas 0.25
  • xarray 0.16
  • scikit-learn 0.22
  • bash 5.0
  • wget 1.20
  • gawk 1:5.0.1

Porting

Porting to other operating systems should be possible:

  • Linux distributions:
    • Adapt setup-packages in Makefile to use the OS package manager (apt, yum, etc.)
  • macOS:
    • Adapt setup-packages in Makefile to use MacPorts or another ports/package manager
  • Windows:
    • Install a POSIX-compliant subsystem/runtime (WSL, Cygwin)
    • Adapt setup-packages in Makefile to use the subsystem package manager

Project justification

Species choice

I chose the great xenops because I have a strong interest in ornithology and in the Caatinga seasonally dry tropical forest. The species is interesting because it is closely associated with dense Caatinga while also tolerating degraded Caatinga. It is also an iconic Caatinga species.

Data choice

I chose data sources with worldwide coverage to demonstrate how the project could be adapted to other target species/taxa. WorldClim data at 2.5 minutes resolution was selected as a compromise between resolution and download time.
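To give a feel for what a 2.5 arc-minute resolution means in practice, the sketch below works out the size of a full global grid and the approximate east-west cell width. It assumes a spherical Earth with a mean radius of 6371 km, and the latitude of 10°S is only an illustrative value, not a figure taken from the project.

```python
import math

ARC_MINUTES = 2.5
EARTH_RADIUS_KM = 6371.0  # assumption: spherical Earth, mean radius

cell_degrees = ARC_MINUTES / 60        # cell size in degrees
columns = round(360 / cell_degrees)    # cells around the globe
rows = round(180 / cell_degrees)       # cells from pole to pole


def cell_width_km(latitude_deg):
    """Approximate east-west width of one cell at a given latitude."""
    km_per_degree = 2 * math.pi * EARTH_RADIUS_KM / 360
    return cell_degrees * km_per_degree * math.cos(math.radians(latitude_deg))


print(columns, rows, columns * rows)
print(round(cell_width_km(0), 2), round(cell_width_km(-10), 2))
```

A full global grid at this resolution has 8640 by 4320 cells per variable. Each halving of the cell size quadruples the number of cells, which is the trade-off between resolution and download time.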

Geospatial analysis framework

This project uses GRASS GIS, the Python ecosystem, bash, make and other Linux/UNIX commands (e.g. wget, awk) for geospatial analysis. It is completely based on open source software and tools.

GRASS GIS

GRASS GIS is particularly apt for dealing with raster data, which is common in SDMs. It integrates well with python and bash, which makes it particularly suited to automated and reproducible data analysis.

It also provides a good user interface that is useful for interactive data analysis, for prototyping batch analysis, and for verifying batch analysis results.

GRASS GIS provides a more robust, homogeneous, and well-integrated geospatial analysis experience than using python ecosystem packages exclusively (e.g. fiona, geopandas, rasterio, xarray, cartopy). A similar argument can be made for R. Nevertheless it can integrate well with

GRASS GIS is also open source, which makes it particularly well suited for use in resource-constrained environments (e.g. conservation projects in the Global South).

Python

Python is particularly useful due to the following packages:

  • numerical computation (numpy, scipy, xarray)
  • data processing (pandas, numpy)
  • machine learning and statistical modelling (scikit-learn, etc.)

make

make is a useful tool for organising data analysis pipelines, as it allows tasks and their data dependencies to be defined in a Makefile.

This is more flexible than a single 'task' script since specific tasks can easily be run on their own. When dependencies are already met (e.g. data files already downloaded), it also avoids repeating work.
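The way make avoids repeating work can be sketched in a few lines of Python: a target is rebuilt only if it is missing or older than one of its dependencies. This is a simplified illustration of the idea, not how make is implemented, and the file names are hypothetical.

```python
import os
import tempfile


def needs_rebuild(target, dependencies):
    """Return True if the target is missing or older than any dependency."""
    if not os.path.exists(target):
        return True
    target_time = os.path.getmtime(target)
    return any(os.path.getmtime(dep) > target_time for dep in dependencies)


# Demonstration with hypothetical files in a temporary folder.
with tempfile.TemporaryDirectory() as tmp:
    downloaded = os.path.join(tmp, "climate.zip")
    processed = os.path.join(tmp, "climate.tif")

    open(downloaded, "w").close()
    print(needs_rebuild(processed, [downloaded]))  # target missing

    open(processed, "w").close()
    os.utime(downloaded, (100, 100))  # set explicit modification times
    os.utime(processed, (200, 200))
    print(needs_rebuild(processed, [downloaded]))  # target newer than input

    os.utime(downloaded, (300, 300))
    print(needs_rebuild(processed, [downloaded]))  # input changed afterwards
```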

Other unix tools

  • wget: easy to use tool for downloading data
  • awk: useful language for text/csv processing
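For comparison, the kind of text/csv filtering that awk is used for can also be sketched in Python. The example below keeps only occurrence rows that have both coordinates. It assumes a tab-separated file with the Darwin Core column names decimalLatitude and decimalLongitude; the sample rows are made up for illustration, and the project's own processing may differ.

```python
import csv
import io


def occurrences_with_coordinates(tsv_text):
    """Yield (latitude, longitude) for rows that have both coordinates."""
    reader = csv.DictReader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    for row in reader:
        lat = row.get("decimalLatitude")
        lon = row.get("decimalLongitude")
        if lat and lon:
            yield float(lat), float(lon)


# Made-up rows for illustration only; not real occurrence records.
sample = (
    "species\tdecimalLatitude\tdecimalLongitude\n"
    "Megaxenops parnaguae\t-9.5\t-40.5\n"
    "Megaxenops parnaguae\t\t\n"
)

print(list(occurrences_with_coordinates(sample)))
```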

MAXDM prototype

I prototyped MAXDM to demonstrate my ability to develop tools/models, in this case using a flexible package (scikit-learn) with off-the-shelf components. This similarity/distance based approach was selected as it could be implemented in a short period of time (2-3 days).

Note that in previous positions I have worked heavily with the following kinds of modelling techniques / tools:

  • Generalised Linear Models (GLMs) based on abundance monitoring data (using statsmodels and scikit-learn).
  • Hybrid ecological models linking GLMs to land use / land cover dynamic models (agent-based / system dynamics models, using NetLogo and Stella).

Data sources

  • GBIF
    • Megaxenops parnaguae Reiser, 1905 occurrences with coordinates (presence-only)
  • WorldClim 2.1 historical climate data, 2.5 minutes resolution
    • Bioclimatic variables
    • Elevation
  • Natural Earth 1:10m
    • Cultural Vectors: Admin 1 – States, Provinces

 Made with Natural Earth.

References

  • GBIF.org (15 April 2022) GBIF Occurrence Download https://doi.org/10.15468/dl.mcet5w
  • Fick, S.E. and R.J. Hijmans, 2017. WorldClim 2: new 1km spatial resolution climate surfaces for global land areas. International Journal of Climatology 37 (12): 4302-4315.