
Add streamflow #5

Closed
hdugan opened this issue May 8, 2024 · 11 comments · Fixed by #15
Assignees: lindsayplatt
Labels: ASLO (something to do ahead of the ASLO talk), enhancement (New feature or request)

@hdugan (Collaborator) commented May 8, 2024

How can we do this? Is there a modeled discharge for CONUS using comid?

Thus far I have tried stream order, but it wasn't useful.

@hdugan added the enhancement (New feature or request) label on May 8, 2024
@lindsayplatt (Owner)

I believe we can get median streamflow for COMIDs from the National Water Model using a package called nwmTools as described in this paper (see Fig 4).

https://www.nature.com/articles/s41597-023-02316-7

@lindsayplatt (Owner)

So this might not be as easy as I thought because of a recent data hack. Waiting to hear back on this. BUT we could at least download the NetCDF files ourselves and parse them; it's just not as clean as using the package.

Here's the code I was hoping to test out, but I am getting an error:

# Get median streamflow from NWM data

# Data release: https://www.hydroshare.org/resource/89b0952512dd4b378dc5be8d2093310f/
# R package: https://github.com/mikejohnson51/nwmTools
# Paper: https://www.nature.com/articles/s41597-023-02316-7

library(nwmTools)
nwm_output <- readNWMdata(comid = 6078267, 
                          startDate = "1993-01-01",
                          endDate = "2017-12-31",
                          version = 1.2)

# Aggregate to yearly median (an overall median can then be taken from the result)
nwm_medianFlow_yearly <- aggregate_y(nwm_output, fun = "median")

@lindsayplatt (Owner)

Once that is working, I plan to compare the NWM median streamflow to the NWIS median streamflow for all the sites/comids in my thesis work. Hopefully it shows a mostly 1:1 relationship :)
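For reference, here is a minimal sketch of what the NWIS side of that comparison could look like using the dataRetrieval package (the gage number and date range are placeholders, not the actual thesis sites):

library(dataRetrieval)

# Daily mean discharge (parameter 00060, in cfs) for one example gage
nwis_dv <- readNWISdv(siteNumbers = "05427718",   # placeholder gage number
                      parameterCd = "00060",
                      startDate = "1993-01-01",
                      endDate = "2017-12-31") |>
  renameNWISColumns()

# Overall median, converted from cfs to m3/s to match NWM streamflow units
nwis_median_cms <- median(nwis_dv$Flow, na.rm = TRUE) * 0.0283168

Pairing these medians with the NWM medians by site/COMID would give the 1:1 comparison.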

@lindsayplatt lindsayplatt added the ASLO something to do ahead of the ASLO talk label May 22, 2024
@lindsayplatt lindsayplatt self-assigned this May 22, 2024
@lindsayplatt (Owner)

OK, here is an example of accessing NWM raw output via zarr files (no need to download files, and the computation can use cloud resources). I am going to work on formalizing this on Friday.

The upside is that I don't have to download any files and can get around the problems with the nwmTools package; the downside is that it is pretty slow even when calculating medians for just 10 years and 5 COMIDs (~7 minutes). I might be able to parallelize the calculation when we scale up (see the sketch after the example below).

Setup

You need the R package reticulate, and then you need to install a few additional Python libraries.

install.packages('reticulate')

# Install python libraries (only need to do this once)
reticulate::py_install("xarray") 
reticulate::py_install("fsspec")
reticulate::py_install("numpy")

Running example

Here is an example that calculates median Q for 5 COMIDs between 2008 and 2018. I followed along with Rich Signell's Jupyter Notebook to understand the data structure and basic commands.

# Load the reticulate package in order to run python commands
library(reticulate)

# Import Python packages
xr <- import("xarray")
fsspec <- import("fsspec")
np <- import("numpy")

# Use Python's slice function to create a slice object
slice <- import_builtins()$slice

# Load streamflow from zarr file stored in the NWM's Amazon S3 bucket as an xarray
nwm_url <- 's3://noaa-nwm-retro-v2-zarr-pds' # I tried newer versions but 
nwm_zarr <- xr$open_zarr(fsspec$get_mapper(nwm_url, anon=TRUE), consolidated=TRUE)
nwm_streamflow_xr <- nwm_zarr['streamflow']

# Define the time range to filter
# This version of the data is available from 1993-2018
# as declared https://registry.opendata.aws/nwm-archive/
start_time <- "2008-01-01 00:00:00"
end_time <- "2018-12-31 00:00:00"
nwm_streamflow_xr_lastDecade <- nwm_streamflow_xr$sel(time=slice(start_time, end_time))

# Define the 5 COMIDs to use and filter the streamflow xarray data to just those
desired_comids <- np$array(c(6078267L, 6078329L, 6078383L, 5867679L, 5867685L))
nwm_streamflow_comids_xr <- nwm_streamflow_xr_lastDecade$sel(feature_id=desired_comids)

# Calculate the median flow value for each COMID and convert to regular R vector
nwm_streamflow_median_xr <- nwm_streamflow_comids_xr$median(dim='time', skipna=TRUE) 
nwm_streamflow_median <- py_to_r(nwm_streamflow_median_xr$data)
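
Here is a rough, untested sketch of how the parallelization mentioned above might look with the parallel package. Each worker opens its own connection to the zarr store, since Python objects created through reticulate can't be shared across R worker processes; all_comids is a placeholder vector standing in for the full COMID list.

library(parallel)

all_comids <- c(6078267L, 6078329L, 6078383L, 5867679L, 5867685L) # placeholder; swap in the full vector
comid_groups <- split(all_comids, ceiling(seq_along(all_comids) / 5)) # groups of 5

cl <- makeCluster(4)
medians_by_group <- parLapply(cl, comid_groups, function(comids) {
  library(reticulate)
  xr <- import("xarray")
  fsspec <- import("fsspec")
  np <- import("numpy")
  slice <- import_builtins()$slice

  # Each worker re-opens the zarr store and pulls just its group of COMIDs
  nwm_zarr <- xr$open_zarr(
    fsspec$get_mapper('s3://noaa-nwm-retro-v2-zarr-pds', anon = TRUE),
    consolidated = TRUE)
  flow <- nwm_zarr['streamflow']$sel(time = slice("2008-01-01 00:00:00", "2018-12-31 00:00:00"))
  flow_comids <- flow$sel(feature_id = np$array(as.integer(comids)))
  py_to_r(flow_comids$median(dim = 'time', skipna = TRUE)$data)
})
stopCluster(cl)

nwm_streamflow_median_all <- unlist(medians_by_group)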

@lindsayplatt (Owner) commented May 22, 2024

I tried comparing calculation times with different numbers of COMIDs. I thought that grouping might be faster, and it is, but only up to a point; the ideal grouping for this method seems to be about 5 COMIDs at a time. So, we should definitely parallelize this or it will take a zillion years to calculate median Q.

It took 101 seconds for 1 COMID.
It took 240 seconds for 5 COMIDs (~48 seconds per COMID).
It took 927 seconds for 10 COMIDs (~93 seconds per COMID).
It took 1,868 seconds for 25 COMIDs (~75 seconds per COMID).

Here's my test setup:

time1 <- system.time({
  desired_comids1 <- np$array(14640391L)
  nwm_streamflow_comids_xr1 <- nwm_streamflow_xr_lastDecade$sel(feature_id=desired_comids1)
  message('Starting median Q calculation for 1 COMID')
  nwm_streamflow_median_xr1 <- nwm_streamflow_comids_xr1$median(dim='time', skipna=TRUE) 
  nwm_streamflow_median1 <- py_to_r(nwm_streamflow_median_xr1$data)
})

time5 <- system.time({
  desired_comids5 <- np$array(c(6078267L, 6078329L, 6078383L, 5867679L, 5867685L))
  nwm_streamflow_comids_xr5 <- nwm_streamflow_xr_lastDecade$sel(feature_id=desired_comids5)
  message('Starting median Q calculation for 5 COMIDs')
  nwm_streamflow_median_xr5 <- nwm_streamflow_comids_xr5$median(dim='time', skipna=TRUE) 
  nwm_streamflow_median5 <- py_to_r(nwm_streamflow_median_xr5$data)
})

time10 <- system.time({
  desired_comids10 <- np$array(c(6078383L, 9441199L, 4151212L, 6128763L, 13153643L, 3715602L, 
                                 7665728L, 6250614L, 12264996L, 1777616L))
  nwm_streamflow_comids_xr10 <- nwm_streamflow_xr_lastDecade$sel(feature_id=desired_comids10)
  message('Starting median Q calculation for 10 COMIDs')
  nwm_streamflow_median_xr10 <- nwm_streamflow_comids_xr10$median(dim='time', skipna=TRUE) 
  nwm_streamflow_median10 <- py_to_r(nwm_streamflow_median_xr10$data)
})

time25 <- system.time({
  desired_comids25 <- np$array(c(14365060L, 22336379L, 22337975L, 22338427L, 22340329L, 3717136L, 
                                 3808745L, 3806147L, 2087785L, 3405683L, 3407001L, 1814983L, 6848607L, 
                                 12005077L, 904040051L, 15582369L, 15537883L, 21972746L, 2252949L, 
                                 4390605L, 4390909L, 4390597L, 6013072L, 7665728L, 7666946L))
  nwm_streamflow_comids_xr25 <- nwm_streamflow_xr_lastDecade$sel(feature_id=desired_comids25)
  message('Starting median Q calculation for 25 COMIDs')
  nwm_streamflow_median_xr25 <- nwm_streamflow_comids_xr25$median(dim='time', skipna=TRUE) 
  nwm_streamflow_median25 <- py_to_r(nwm_streamflow_median_xr25$data)
})

@hdugan (Collaborator, Author) commented May 23, 2024

I took a look at this, because I'd rather learn something new than work on talks. I figured out how to do it in R. I avoid Python in R when I can, since it seemingly always leads to version errors. It's still slow as heck. Part of that is because the zarr files are in kind of useless chunks. It would be quicker if they were chunked by comid or time, but I'm sure there's a logical reason to chunk them in big rectangles across the entire array.

[Sidebar: I was also using v2. I could get v3 streamflow to load, but not the feature_id. There must be something wrong on their end.]

The chunks are 672 x 30000 [hours x comid]. If we pull from the fewest chunks possible, it speeds things up a lot.
This example takes about 30 sec to run, so across all comids it should take about 1 hour? It assumes that pulling 1/5 of the data gives you a representative streamflow.

library(Rarr) # devtools::install_github(repo = 'grimbough/Rarr')
library(tidyverse)
library(lubridate)

# Time parameters in the data file 
start_time <- as.POSIXct("1993-01-01 00:00:00")
end_time <- as.POSIXct("2018-12-31 23:59:00")
timeseq = seq.POSIXt(from = start_time, to = end_time, by = 'hour')

# Constrain data request to every 5th chunk after 2010
indx = data.frame(time = timeseq) |> 
  mutate(indx = row_number()) |> # add row number
  mutate(chunk = rep(1:340, each = 672)[1:length(timeseq)]) |> # assign each hour to its 672-row chunk
  filter(year(time) >= 2010) |> # keep 2010 onward
  filter(chunk %% 5 == 0) |> # keep every 5th chunk
  pull(indx)

# configure client
s3_client <- paws.storage::s3(
  config = list(
    credentials = list(anonymous = TRUE), 
    region = "us-west-2", 
    s3_force_path_style = TRUE)
)

# Look at features available from NWM v2 output 
s3_address <- "https://s3.amazonaws.com/noaa-nwm-retro-v2-zarr-pds"
zarr_overview(s3_address, s3_client = s3_client)

s3_address <- "https://s3.amazonaws.com/noaa-nwm-retro-v2-zarr-pds/feature_id"
zarr_overview(s3_address, s3_client = s3_client)
# Shape: 2729077
# Chunk Shape: 2729077
# No. of Chunks: 1 (1)
z_id <- read_zarr_array(s3_address, s3_client = s3_client)


# Inspect the streamflow array stored in the NWM's Amazon S3 zarr bucket
s3_address <- "https://s3.amazonaws.com/noaa-nwm-retro-v2-zarr-pds/streamflow"
zarr_overview(s3_address, s3_client = s3_client)
# Shape: 227904 x 2729077
# Chunk Shape: 672 x 30000
# No. of Chunks: 30940 (340 x 91)


# Start the clock
ptm <- proc.time()

# Request the chunk index across 30000 comids
z_streamflow <- read_zarr_array(s3_address, s3_client = s3_client, index = list(indx, 1:30000))

# Stop the clock
proc.time() - ptm
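
One small follow-up sketch (assuming z_streamflow comes back as a plain matrix whose columns follow the order of z_id and whose rows follow indx): attaching labels makes it easier to trace columns back to COMIDs and rows back to timestamps later.

# Label columns with feature_ids and rows with timestamps (illustrative)
colnames(z_streamflow) <- z_id[1:30000]
rownames(z_streamflow) <- format(timeseq[indx])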


@hdugan (Collaborator, Author) commented May 23, 2024

If this works (which I think it does), we should pull all the comids, make a median streamflow output file, and then use that in our workflow instead of running this more than once. Something like the sketch below.
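
Here is a rough sketch of what that one-time build could look like, looping over the 30000-wide column chunks so each request lines up with the zarr chunking. The output file name and the units label are illustrative, and it assumes -999900 is a fill value that should be set to NA before taking the medians.

library(Rarr)
library(matrixStats)

n_comids <- length(z_id)                      # 2,729,077 features in the v2 archive
col_starts <- seq(1, n_comids, by = 30000)

median_flow <- numeric(n_comids)
for (start in col_starts) {
  cols <- start:min(start + 30000 - 1, n_comids)
  chunk <- read_zarr_array(s3_address, s3_client = s3_client,
                           index = list(indx, cols))
  chunk[chunk == -999900] <- NA               # assumed no-data value; confirm against metadata
  median_flow[cols] <- colMedians(chunk, na.rm = TRUE)
}

# One-time output file keyed by feature_id (units assumed to be m3/s)
write.csv(data.frame(comid = z_id, median_flow_cms = median_flow),
          "nwm_median_streamflow.csv", row.names = FALSE)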

@lindsayplatt (Owner)

Amazing! Can't believe it only took 30 seconds to do this. Although, what is the output format for z_streamflow? I think it was the median calculation step that seemed to take a while in my example.

Also, I assume the 1:30000 for comids can just be substituted for our vector of COMIDs? I'll try running this tomorrow and then work on incorporating this into the pipeline :)
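
One possible wrinkle there: the index argument of read_zarr_array() takes positional indices rather than the feature IDs themselves, so the COMID vector would likely need to be matched against z_id first. A quick sketch, reusing the example COMIDs from earlier:

# Translate desired COMIDs into column positions within the feature_id dimension
desired_comids <- c(6078267, 6078329, 6078383, 5867679, 5867685)  # example COMIDs
comid_cols <- match(desired_comids, z_id)

z_streamflow_sub <- read_zarr_array(s3_address, s3_client = s3_client,
                                    index = list(indx, comid_cols))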

@hdugan (Collaborator, Author) commented May 23, 2024

It comes back as a matrix [hours x comids], so we can then take the median of each column. Doesn't seem too bad.

# Get median of each column 
ptm <- proc.time()
apply(z_streamflow, 2, FUN = median)
proc.time() - ptm # 73 sec

# The matrixStats package is faster than apply 
library(matrixStats)
ptm <- proc.time()
colMedians(x = z_streamflow)
proc.time() - ptm # 12 sec

There are a lot of zeros, and a lot of -999900 values, so we have to figure that out... but maybe they come from comids we're not using. Also, I assume feature_id = comid?
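
If -999900 is the array's no-data/fill value (an assumption worth checking against the NWM metadata), one simple way to handle it before taking the medians:

library(matrixStats)

z_clean <- z_streamflow
z_clean[z_clean == -999900] <- NA          # assumed fill value
median_flow <- colMedians(z_clean, na.rm = TRUE)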

@lindsayplatt (Owner)

Huh, maybe it's my computer. I just ran your first code chunk to get the data and my download time is much longer: 4 minutes compared to your 30 sec!

# This took 4 minutes (yours took 30 sec)
z_streamflow <- read_zarr_array(s3_address, s3_client = s3_client, index = list(indx, 1:30000)) 

The median calculation code was just as fast as yours, though. And I have also been assuming COMID == feature_id based on their docs.

@hdugan (Collaborator, Author) commented May 24, 2024

Tried it on my CFL desktop and it also took 30 seconds.
