ST-DeepHydro

Python library for spatio-temporally aware hydrological modelling (especially rainfall-runoff modelling) using deep learning.

This library facilitates the training of Neural Networks for spatio-temporal timeseries prediction. It is based on the Deep Learning library Tensorflow and aims to support hydrological use cases. For this purpose, the library implements different Neural Network architectures, with a special focus on learning spatio-temporal processes within catchments. Model types comprise lumped and distributed models, which enable training on aggregated meteorological timeseries data as well as on spatially distributed forcings such as gridded datasets. In addition, the library comes with various data loading mechanisms for common hydrometeorological datasets, such as Daymet and CAMELS-US, as well as useful preprocessing utilities for handling spatio-temporal data.

To train and evaluate models on your own hydrometeorological datasets for one or more catchments, this library comes with a simple command line tool. You can also use this library to implement your own deep learning applications based on the already implemented models and data loading classes. The library is designed in a way that also facilitates adding your own models as well as data loading and processing pipelines. To get started, just follow the documentation below.

The library is inspired by the great NeuralHydrology package [1], which has been used for various research aspects regarding hydrological modelling. However, since NeuralHydrology mainly focuses on lumped models, the ST-DeepHydro package addresses the needs of spatially distributed modelling.

Get Started

Requirements

To be prepared for using the stdeephydro package for your own purposes, you first have to set up your local environment by installing all required dependencies. The easiest way to do so is to create a virtual environment using Conda. Make sure you have installed Miniconda or Anaconda, and create a new environment using the environment.yml file that comes with this repository:

conda env create -f environment.yml

If you'd rather create a virtual environment with venv or virtualenv, you can also use the requirements.txt that is shipped with this repository. Just create a virtual environment with your preferred tool and install all dependencies:

python3 -m pip install -r requirements.txt 

Installation

This package can simply be installed using pip. The package has not been published on PyPI yet. However, you can install the latest version of the package, which is based on the master branch:

python3 -m pip install git+https://github.com/SebaDro/st-deep-hydro.git

It is also possible to clone this repository to your local machine and develop your own models and data loading routines. Finally, you can install the package from your local copy:

python3 -m pip install -e .

The installation also makes a bash script (run_training) available within your environment.

Data

The ST-DeepHydro library mainly focuses on training models for rainfall-runoff prediction using hydrometeorological datasets. A variety of datasets are suitable as training data. In particular, the CAMELS-US and Daymet datasets have proven to be appropriate inputs for hydrological modelling.

For loading different types of hydrometeorological datasets, the stdeephydro.dataloader module comes with various data loader implementations.

CAMELS-US

The CAMELS-US dataset contains hydrometeorological timeseries data for 671 basins in the contiguous United States [2]. The meteorological products comprise basin-aggregated daily forcings from three different data sources (Daymet, Maurer and NLDAS). Daily streamflow data for the 671 gauges come from the United States Geological Survey National Water Information System. You can simply download this large-sample dataset from the NCAR website.

To load CAMELS-US datasets use the CamelsUsStreamflowDataLoader and CamelsUsForcingsDataLoader classes. See their documentation for further usage information.

Daymet

Daymet data contain gridded estimates of daily weather and climatology parameters on a 1 km x 1 km raster for North America, Hawaii, and Puerto Rico [3, 4]. Daymet Version 3 and Version 4 data are provided by ORNL DAAC (https://daymet.ornl.gov/) and can be downloaded via ORNL DAAC's Thematic Real-time Environmental Distributed Data Services (THREDDS). To download these datasets for your preferred region and prepare them for model training, you might want to use the Daymet PyProcessing toolset.

The DaymetDataLoader class is able to load 1-dimensional (temporally distributed) as well as 2-dimensional (raster-based, spatio-temporally distributed) Daymet NetCDF data. See its documentation for further details.

Models

To train neural networks for timeseries forecasting, the ST-DeepHydro library implements different network architectures based on the Deep Learning framework Tensorflow. Although these networks are intended to model rainfall-runoff behaviour in river catchments, other hydrological modelling use cases are conceivable. Model types comprise lumped and distributed models. While lumped models are trained on aggregated meteorological timeseries data, distributed models have a special focus on learning spatio-temporal catchment processes. This makes them suitable for training on spatio-temporal hydrometeorological datasets, i.e. timeseries of raster data.

The stdeephydro.models module contains all model implementations. Here, you will find a variety of Tensorflow models for different use cases and data types.

LSTM

stdeephydro.models.LstmModel builds a classical LSTM model, which is able to learn hydrological processes within catchment areas from aggregated hydrometeorological input datasets. The model is applicable to rainfall-runoff timeseries forecasting by predicting gauge streamflow.

The Tensorflow model comprises one or more stacked (hidden) LSTM layers with a fully connected layer on top for predicting one or more target variables from timeseries inputs.

LSTM Attributes:

Required values for cfg.params:

  • lstm:
    • hiddenLayers: number of LSTM layers (int)
    • units: units for each LSTM layer (list of int, with the same length as hiddenLayers)
    • dropout: dropout for each LSTM layer (list of float, with the same length as hiddenLayers)

Example:

params:
  lstm:
    hiddenLayers: 2
    units:
      - 32
      - 32
    dropout:
      - 0.1
      - 0
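To make the layer stack concrete, the following minimal Keras sketch builds an equivalent stacked LSTM for the example configuration above. The input dimensions and the function name are illustrative assumptions, not the library's internal implementation:

# Minimal stacked-LSTM sketch for the example configuration above
# (hiddenLayers: 2, units: [32, 32], dropout: [0.1, 0]).
# The input shape (timesteps, forcing variables) is an illustrative assumption.
import tensorflow as tf

def build_lstm(timesteps=30, n_variables=5, n_targets=1):
    inputs = tf.keras.Input(shape=(timesteps, n_variables))
    x = inputs
    units, dropout = [32, 32], [0.1, 0.0]
    for i, (u, d) in enumerate(zip(units, dropout)):
        # every LSTM layer returns full sequences, except the last one
        x = tf.keras.layers.LSTM(u, dropout=d, return_sequences=i < len(units) - 1)(x)
    # fully connected layer on top for predicting the target variable(s)
    outputs = tf.keras.layers.Dense(n_targets)(x)
    return tf.keras.Model(inputs, outputs)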

CNN-LSTM

stdeephydro.models.CnnLstmModel builds a combined Convolutional Neural Network (CNN) and Long Short-Term Memory (LSTM) Tensorflow model. This neural network architecture addresses the need to learn spatio-temporal processes within catchments from spatially distributed (raster-based) timeseries data. Therefore, this model type can be trained on meteorological raster data to forecast gauge streamflow or any other hydrological variable within river catchments.

The idea of this model architecture is to first extract features from a timeseries of 2-dimensional raster data by convolutional operations. The extracted timeseries features are then passed to a stack of LSTM layers to predict one or more target variables.

CNN-LSTM Attributes:

Required values for cfg.params:

  • cnn:
    • hiddenLayers: number of time-distributed Conv2D layers (int). Each Conv2D layer is followed by a MaxPooling2D layer, except the last one, which has a GlobalMaxPooling2D layer on top.
    • filters: number of filters for each Conv2D layer (list of int, with the same length as hiddenLayers)
  • lstm:
    • hiddenLayers: number of LSTM layers (int)
    • units: units for each LSTM layer (list of int, with the same length as hiddenLayers)
    • dropout: dropout for each LSTM layer (list of float, with the same length as hiddenLayers)

Example:

params:
  cnn:
    hiddenLayers: 3
    filters:
      - 8
      - 16
      - 32
  lstm:
    hiddenLayers: 2
    units:
      - 32
      - 32
    dropout:
      - 0.1
      - 0
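As an illustration of the layer stack, here is a minimal Keras sketch matching the example configuration above (3 Conv2D layers, 2 LSTM layers). The raster input dimensions are assumed values, and the sketch is not the library's internal implementation:

# Minimal CNN-LSTM sketch: time-distributed convolutions extract spatial
# features per timestep, which are then processed by a stack of LSTM layers.
# Input shape (timesteps, height, width, variables) is an illustrative assumption.
import tensorflow as tf

def build_cnn_lstm(timesteps=30, height=16, width=16, n_variables=5, n_targets=1):
    inputs = tf.keras.Input(shape=(timesteps, height, width, n_variables))
    x = inputs
    for i, f in enumerate([8, 16, 32]):
        x = tf.keras.layers.TimeDistributed(
            tf.keras.layers.Conv2D(f, 3, padding="same", activation="relu"))(x)
        if i < 2:
            x = tf.keras.layers.TimeDistributed(tf.keras.layers.MaxPooling2D())(x)
        else:
            # the last Conv2D layer gets a GlobalMaxPooling2D on top
            x = tf.keras.layers.TimeDistributed(tf.keras.layers.GlobalMaxPooling2D())(x)
    for i, (u, d) in enumerate(zip([32, 32], [0.1, 0.0])):
        x = tf.keras.layers.LSTM(u, dropout=d, return_sequences=i < 1)(x)
    outputs = tf.keras.layers.Dense(n_targets)(x)
    return tf.keras.Model(inputs, outputs)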

Multi Input CNN-LSTM

The stdeephydro.models.MultiInputCnnLstmModel class concatenates a combined Convolutional Neural Network (CNN) and Long Short-Term Memory (LSTM) model, CNN-LSTM, with a classical LSTM Tensorflow model. With this architecture design, the neural network is able to process two input datasets that differ in their spatio-temporal dimensions. Hence, it is possible to train the model with lumped meteorological long-term timeseries data as well as with spatially distributed short-term raster data.

The idea of this model is to enhance the capability of a classical LSTM model to predict target variables from one-dimensional timeseries data, while also considering spatially distributed timeseries data that is processed by the CNN-LSTM part of the model. This approach adds enhanced spatial information to the model and, at the same time, limits the computational effort for training it.

Multi input CNN-LSTM Attributes:

Required values for cfg.params:

  • cnn:
    • hiddenLayers: number of time-distributed Conv2D layers for the CNN-LSTM part of the model (int). Each Conv2D layer is followed by a MaxPooling2D layer, except the last one, which has a GlobalMaxPooling2D layer on top.
    • filters: number of filters for each time-distributed Conv2D layer (list of int, with the same length as hiddenLayers)
  • lstm:
    • hiddenLayers: number of LSTM layers for both the LSTM and CNN-LSTM part of the model (int)
    • units: units for each LSTM layer (list of int, with the same length as hiddenLayers)
    • dropout: dropout for each LSTM layer (list of float, with the same length as hiddenLayers)

Example:

params:
  cnn:
    hiddenLayers: 3
    filters:
      - 8
      - 16
      - 32
  lstm:
    hiddenLayers: 2
    units:
      - 32
      - 32
    dropout:
      - 0.1
      - 0
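The two-branch design can be sketched in Keras as follows; the input shapes, the concatenation point and the dense head are illustrative assumptions rather than the library's internal implementation:

# Minimal multi-input sketch: a classical LSTM branch for lumped long-term
# timeseries, concatenated with a CNN-LSTM branch for short-term raster data.
import tensorflow as tf

lumped_in = tf.keras.Input(shape=(365, 5))         # long-term 1D timeseries (assumed shape)
raster_in = tf.keras.Input(shape=(30, 16, 16, 5))  # short-term raster timeseries (assumed shape)

# classical LSTM branch on the lumped input
lstm_branch = tf.keras.layers.LSTM(32)(lumped_in)

# CNN-LSTM branch on the spatially distributed input
x = tf.keras.layers.TimeDistributed(
    tf.keras.layers.Conv2D(8, 3, activation="relu"))(raster_in)
x = tf.keras.layers.TimeDistributed(tf.keras.layers.GlobalMaxPooling2D())(x)
cnn_lstm_branch = tf.keras.layers.LSTM(32)(x)

# concatenate both branches and predict the target variable
merged = tf.keras.layers.Concatenate()([lstm_branch, cnn_lstm_branch])
outputs = tf.keras.layers.Dense(1)(merged)
model = tf.keras.Model([lumped_in, raster_in], outputs)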

ConvLSTM

The stdeephydro.models.ConvLstmModel class builds a Convolutional LSTM model that mainly builds on Tensorflow's ConvLSTM2D layers. This architecture is able to predict one or more target variables based on spatially distributed timeseries data. The neural network processes timeseries of raster data with a stack of LSTM layers that perform convolutional operations by using input-to-state and state-to-state transitions.

Up to now, the ConvLSTM model can be trained on meteorological raster data to predict one-dimensional target variables such as gauge streamflow, just as the CNN-LSTM model does. However, ConvLSTM layers were originally intended for building models that produce raster-based predictions. This may be implemented in future releases to support relevant hydrological use cases.

ConvLSTM Attributes:

Required values for cfg.params:

  • cnn:
    • hiddenLayers: number of ConvLSTM2D layers (int). Each ConvLSTM2D layer is followed by a MaxPooling3D layer, except the last one, which has a GlobalMaxPooling2D layer on top.
    • filters: number of filters for each ConvLSTM2D layer (list of int, with the same length as hiddenLayers)

Example:

cnn:
  hiddenLayers: 3
  filters:
    - 8
    - 16
    - 32
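A minimal Keras sketch of such a ConvLSTM stack for the example configuration might look like this; pooling sizes, input dimensions and the dense head are assumptions for illustration:

# Minimal ConvLSTM sketch: stacked ConvLSTM2D layers with spatial pooling
# in between, following the layer description above.
import tensorflow as tf

inputs = tf.keras.Input(shape=(30, 16, 16, 5))  # (timesteps, height, width, variables), assumed
x = inputs
filters = [8, 16, 32]
for i, f in enumerate(filters):
    last = i == len(filters) - 1
    x = tf.keras.layers.ConvLSTM2D(f, 3, padding="same", return_sequences=not last)(x)
    if not last:
        x = tf.keras.layers.MaxPooling3D(pool_size=(1, 2, 2))(x)  # pool spatial dimensions only
    else:
        x = tf.keras.layers.GlobalMaxPooling2D()(x)  # last layer gets global spatial pooling
outputs = tf.keras.layers.Dense(1)(x)
model = tf.keras.Model(inputs, outputs)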

Conv3D

The stdeephydro.models.Conv3DModel builds a Tensorflow model based on multiple stacked Conv3D and MaxPooling3D layers. Usually intended for processing video data, this architecture applies convolutional and max pooling operations to the spatial as well as the temporal dimensions of the input data. The model can be trained on a timeseries of meteorological raster data to predict gauge streamflow or any other hydrological variable.

Conv3D Attributes:

Required values for cfg.params:

  • cnn:
    • hiddenLayers: number of Conv3D layers (int). Each Conv3D layer is followed by a MaxPooling3D layer, except the last one, which has a GlobalMaxPooling3D layer on top.
    • filters: number of filters for each Conv3D layer (list of int, with the same length as hiddenLayers)

Example:

cnn:
  hiddenLayers: 3
  filters:
    - 8
    - 16
    - 32
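For illustration, a Conv3D stack matching the example configuration can be sketched in Keras as follows; kernel sizes, input dimensions and the dense head are assumptions:

# Minimal Conv3D sketch: convolutions and max pooling are applied jointly
# over the temporal and both spatial dimensions.
import tensorflow as tf

inputs = tf.keras.Input(shape=(30, 16, 16, 5))  # (timesteps, height, width, variables), assumed
x = inputs
filters = [8, 16, 32]
for i, f in enumerate(filters):
    x = tf.keras.layers.Conv3D(f, 3, padding="same", activation="relu")(x)
    if i < len(filters) - 1:
        x = tf.keras.layers.MaxPooling3D()(x)  # pools time and space
    else:
        x = tf.keras.layers.GlobalMaxPooling3D()(x)  # last Conv3D gets global pooling
outputs = tf.keras.layers.Dense(1)(x)
model = tf.keras.Model(inputs, outputs)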

How to use

Training

Model training and evaluation for multiple basins can simply be performed using the run_training command line tool. This tool is automatically available in your environment after installing this package.

Just run run_training ./config/your-training-config.yml to perform training according to your own configuration.

For testing purposes you can add the --dryrun flag to this call: run_training --dryrun ./config/your-training-config.yml. With this flag, no results such as model checkpoints or evaluation metrics are stored during the run.

Data preparation

To train a model, one of the supported datasets mentioned in the Data section is required. Before you start with model training, make sure that you have properly prepared all the datasets you want to use:

  1. Download one or more of the supported datasets. You'll need a streamflow dataset and a forcings dataset.
  2. Place all datasets within a separate folder. For each basin you want to train a model for, a corresponding dataset file that has the basin ID in its file name must exist in the data folder.
  3. Forcings files and streamflow files should be placed in different folders if your datasets do not contain both variable sets jointly.
  4. Create a text file (e.g. basins.txt) that lists all basins you want to perform model training and evaluation on, with each basin ID on a separate line (see the example below). There is an example file within this repository. Make sure that the data folder contains corresponding files for each basin you list in the basins file.
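For instance, a basins file for two CAMELS-US basins, referenced by their USGS gauge IDs, could look like this:

01013500
01022500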

Configuration

Several training aspects, such as the neural network architecture, dataset types or the number of training epochs, can be customized by providing a configuration file. The ./config folder comes with several examples of such files. In addition, this section describes all configuration parameters that are currently supported.

General Parameters

General configuration parameters must be defined under the general key:

  • name (string): Name of the experiment. This name will be used for prefixing some of the outputs.
  • logTensorboardEvents (boolean): Indicates whether to log events during training for Tensorboard or not.
  • loggingConfig (string): Path to a logging configuration file. This must be a YAML file according to the Python logging dictionary schema.
  • outputDir (string): Path to a directory, which will be used for storing outputs such as the trained model, checkpoints and evaluation results.
  • saveCheckpoints (boolean): Indicates whether to save training checkpoints or not.
  • saveModel (boolean): Indicates whether to store the trained model or not.
  • seed (int): Fixed seed, which e.g. affects weight initialization, in order to achieve reproducibility. If this parameter is not set, a random seed will be used.
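A general section assembled from these parameters might look like this; the experiment name and paths are illustrative:

general:
  name: lstm-camels-experiment
  logTensorboardEvents: true
  loggingConfig: ./config/logging.yml
  outputDir: ./output
  saveCheckpoints: true
  saveModel: true
  seed: 42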
Data Parameters

The data key contains several definitions for the hydrometeorological datasets that should be used for model training, validation and testing in your experiments.

  • forcings.dir (string): Path to a directory which contains the forcings datasets.
  • forcings.type (string): Type of the forcings datasets. Currently supported: daymet, camels-us
  • forcings.variables (array): List of forcing variables which should be considered for training the model.
  • streamflow.dir (string): Path to a directory which contains the streamflow datasets.
  • streamflow.type (string): Type of the streamflow datasets. Currently supported: camels-us
  • streamflow.variables (array): List of streamflow variables which should be considered for training the model. Only a single variable is supported, up to now.
  • training.startDate (string): Start of the training period (ISO 8601 date string in the format yyyy-MM-dd)
  • training.endDate (string): End of the training period (ISO 8601 date string in the format yyyy-MM-dd)
  • validation.startDate (string): Start of the validation period (ISO 8601 date string in the format yyyy-MM-dd)
  • validation.endDate (string): End of the validation period (ISO 8601 date string in the format yyyy-MM-dd)
  • test.startDate (string): Start of the testing period (ISO 8601 date string in the format yyyy-MM-dd)
  • test.endDate (string): End of the testing period (ISO 8601 date string in the format yyyy-MM-dd)
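Assuming the dotted keys above map to nested YAML keys, a data section could look like this; the directories, variable names and periods are illustrative:

data:
  forcings:
    dir: ./data/forcings
    type: camels-us
    variables:
      - prcp
      - tmax
      - tmin
  streamflow:
    dir: ./data/streamflow
    type: camels-us
    variables:
      - streamflow
  training:
    startDate: 1980-10-01
    endDate: 1995-09-30
  validation:
    startDate: 1995-10-01
    endDate: 2000-09-30
  test:
    startDate: 2000-10-01
    endDate: 2005-09-30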
Model Parameters

The model configuration section contains several parameters that define the model architecture and control the training process.

  • type (string): Name of the model type that will be trained. Currently supported: lstm, cnn-lstm, multi-cnn-lstm, convlstm, conv3d
  • timesteps (array): Timesteps that will be used for creating the input (forcings) timeseries windows. E.g., if timesteps is 10, the last 10 days of forcings values will be used as inputs for model training to predict the target variable with the defined offset. If you train a model that accepts multiple inputs, such as the multi-cnn-lstm model, you have to define a timesteps value for each input.
  • offset (int): Offset between inputs (forcings) and target (streamflow). An offset of 1 means that the forcings for the last n days will be taken as input and the streamflow for day n + 1 will be taken as target.
  • loss (array): List of loss functions to use for training. The names must refer to loss functions supported by Tensorflow.
  • metrics (array): List of metrics used for validation and evaluation. The names must refer to metrics supported by Tensorflow.
  • optimizer (string): Optimizer that will be used for training. Must be one of the optimizers supported by Tensorflow.
  • epochs (int): Number of training epochs
  • batchSize (int): Batch size to use for training
  • multiOutput (boolean): Indicates whether the model should predict multiple target variables at once or only one (multi-output is currently not supported)
  • params (dict): Additional model-specific configuration parameters. Which parameters can be defined depends on the type parameter value. The supported parameters for each model type are listed in the Models section.
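Putting these parameters together with the lstm model parameters from the Models section, a model section could look like this; the values and the top-level model key are illustrative assumptions:

model:
  type: lstm
  timesteps:
    - 30
  offset: 1
  loss:
    - mse
  metrics:
    - mae
  optimizer: Adam
  epochs: 50
  batchSize: 256
  multiOutput: false
  params:
    lstm:
      hiddenLayers: 2
      units:
        - 32
        - 32
      dropout:
        - 0.1
        - 0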

References

[1] Kratzert, F., Gauch, M., Nearing, G., & Klotz, D. (2022). NeuralHydrology — A Python library for Deep Learning research in hydrology. Journal of Open Source Software, 7(71), 4050. https://doi.org/10.21105/joss.04050

[2] Newman, A., Sampson, K., Clark, M. P., Bock, A., Viger, R. J., Blodgett, D. (2014). A large-sample watershed-scale hydrometeorological dataset for the contiguous USA. Boulder, CO: UCAR/NCAR. https://dx.doi.org/10.5065/D6MW2F4D

[3] Thornton, P.E., M.M. Thornton, B.W. Mayer, Y. Wei, R. Devarakonda, R.S. Vose, and R.B. Cook. 2016. Daymet: Daily Surface Weather Data on a 1-km Grid for North America, Version 3. ORNL DAAC, Oak Ridge, Tennessee, USA. https://doi.org/10.3334/ORNLDAAC/1328

[4] Thornton, M.M., R. Shrestha, Y. Wei, P.E. Thornton, S. Kao, and B.E. Wilson. 2020. Daymet: Daily Surface Weather Data on a 1-km Grid for North America, Version 4. ORNL DAAC, Oak Ridge, Tennessee, USA. https://doi.org/10.3334/ORNLDAAC/1840