
Add streamflow #5

Closed
hdugan opened this issue May 8, 2024 · 11 comments · Fixed by #15
Assignees: lindsayplatt
Labels: ASLO (something to do ahead of the ASLO talk), enhancement (New feature or request)

@hdugan (Collaborator) commented May 8, 2024

How can we do this? Is there a modeled discharge for CONUS using comid?

Thus far I have tried stream order, but it wasn't useful.

@hdugan added the enhancement (New feature or request) label on May 8, 2024
@lindsayplatt (Owner)

I believe we can get median streamflow for COMIDs from the National Water Model using a package called nwmTools as described in this paper (see Fig 4).

https://www.nature.com/articles/s41597-023-02316-7

@lindsayplatt (Owner)

So this might not be as easy as I thought because of a recent data hack. Waiting to hear back on this. BUT we could at least download the NetCDF files ourselves and parse them; it's just not as clean as using the package.

Here's the code I was hoping to test out, but I am getting an error:

# Get median streamflow from NWM data

# Data release: https://www.hydroshare.org/resource/89b0952512dd4b378dc5be8d2093310f/
# R package: https://github.com/mikejohnson51/nwmTools
# Paper: https://www.nature.com/articles/s41597-023-02316-7

library(nwmTools)
nwm_output <- readNWMdata(comid = 6078267, 
                          startDate = "1993-01-01",
                          endDate = "2017-12-31",
                          version = 1.2)

# Aggregate to yearly median (an overall median can then be taken from the result)
nwm_medianFlow_yearly <- aggregate_y(nwm_output, fun = "median")

@lindsayplatt (Owner)

Once that is working, I plan to compare the NWM median streamflow to the NWIS median streamflow for all the sites/comids in my thesis work. Hopefully it shows a mostly 1:1 relationship :)
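For reference, here is a minimal sketch of what the NWIS side of that comparison could look like using the dataRetrieval package (the gage number and date range are placeholders, not the actual thesis sites):

library(dataRetrieval)

# Daily mean discharge (parameter 00060, in cfs) for one example gage
nwis_dv <- readNWISdv(siteNumbers = "05427718",   # placeholder gage number
                      parameterCd = "00060",
                      startDate = "1993-01-01",
                      endDate = "2017-12-31") |>
  renameNWISColumns()

# Overall median, converted from cfs to m3/s to match NWM streamflow units
nwis_median_cms <- median(nwis_dv$Flow, na.rm = TRUE) * 0.0283168

Pairing these medians with the NWM medians by site/COMID would give the 1:1 comparison.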

@lindsayplatt lindsayplatt added the ASLO something to do ahead of the ASLO talk label May 22, 2024
@lindsayplatt lindsayplatt self-assigned this May 22, 2024
@lindsayplatt (Owner)

OK, here is an example of accessing NWM raw output via zarr files (no need to download files, and the computation can use cloud resources). I am going to work on formalizing this on Friday.

The upside is that I don't have to download any files and can get around the problems with the nwmTools package; the downside is that it is pretty slow even when calculating medians for just 10 years and 5 COMIDs (~7 minutes). I might be able to parallelize the calculation when we scale up (see the sketch after the example below).

Setup

You need the R package reticulate, and then you need to install a few additional Python libraries.

install.packages('reticulate')

# Install python libraries (only need to do this once)
reticulate::py_install("xarray") 
reticulate::py_install("fsspec")
reticulate::py_install("numpy")

Running example

Here is an example that calculates median Q for 5 COMIDs between 2008 and 2018. I followed along with Rich Signell's Jupyter Notebook to understand the data structure and basic commands.

# Load the reticulate package in order to run python commands
library(reticulate)

# Import Python packages
xr <- import("xarray")
fsspec <- import("fsspec")
np <- import("numpy")

# Use Python's slice function to create a slice object
slice <- import_builtins()$slice

# Load streamflow from zarr file stored in the NWM's Amazon S3 bucket as an xarray
nwm_url <- 's3://noaa-nwm-retro-v2-zarr-pds' # I tried newer versions but 
nwm_zarr <- xr$open_zarr(fsspec$get_mapper(nwm_url, anon=TRUE), consolidated=TRUE)
nwm_streamflow_xr <- nwm_zarr['streamflow']

# Define the time range to filter
# This version of the data is available from 1993-2018
# as declared https://registry.opendata.aws/nwm-archive/
start_time <- "2008-01-01 00:00:00"
end_time <- "2018-12-31 00:00:00"
nwm_streamflow_xr_lastDecade <- nwm_streamflow_xr$sel(time=slice(start_time, end_time))

# Define the 5 COMIDs to use and filter the streamflow xarray data to just those
desired_comids <- np$array(c(6078267L, 6078329L, 6078383L, 5867679L, 5867685L))
nwm_streamflow_comids_xr <- nwm_streamflow_xr_lastDecade$sel(feature_id=desired_comids)

# Calculate the median flow value for each COMID and convert to regular R vector
nwm_streamflow_median_xr <- nwm_streamflow_comids_xr$median(dim='time', skipna=TRUE) 
nwm_streamflow_median <- py_to_r(nwm_streamflow_median_xr$data)
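
Here is a rough, untested sketch of how the parallelization mentioned above might look with the parallel package. Each worker opens its own connection to the zarr store, since Python objects created through reticulate can't be shared across R worker processes; all_comids is a placeholder vector standing in for the full COMID list.

library(parallel)

all_comids <- c(6078267L, 6078329L, 6078383L, 5867679L, 5867685L) # placeholder; swap in the full vector
comid_groups <- split(all_comids, ceiling(seq_along(all_comids) / 5)) # groups of 5

cl <- makeCluster(4)
medians_by_group <- parLapply(cl, comid_groups, function(comids) {
  library(reticulate)
  xr <- import("xarray")
  fsspec <- import("fsspec")
  np <- import("numpy")
  slice <- import_builtins()$slice

  # Each worker re-opens the zarr store and pulls just its group of COMIDs
  nwm_zarr <- xr$open_zarr(
    fsspec$get_mapper('s3://noaa-nwm-retro-v2-zarr-pds', anon = TRUE),
    consolidated = TRUE)
  flow <- nwm_zarr['streamflow']$sel(time = slice("2008-01-01 00:00:00", "2018-12-31 00:00:00"))
  flow_comids <- flow$sel(feature_id = np$array(as.integer(comids)))
  py_to_r(flow_comids$median(dim = 'time', skipna = TRUE)$data)
})
stopCluster(cl)

nwm_streamflow_median_all <- unlist(medians_by_group)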

@lindsayplatt (Owner) commented May 22, 2024

I tried comparing calculation times with different numbers of COMIDs. I thought that grouping might be faster, and it is, but only up to a point; the ideal grouping for this method seems to be about 5 COMIDs at a time. So, we should definitely parallelize this or it will take a zillion years to calculate median Q.

It took 101 seconds for 1 COMID.
It took 240 seconds for 5 COMIDs (~48 seconds per COMID).
It took 927 seconds for 10 COMIDs (~93 seconds per COMID).
It took 1,868 seconds for 25 COMIDs (~75 seconds per COMID).

Here's my test setup:

time1 <- system.time({
  desired_comids1 <- np$array(14640391L)
  nwm_streamflow_comids_xr1 <- nwm_streamflow_xr_lastDecade$sel(feature_id=desired_comids1)
  message('Starting median Q calculation for 1 COMID')
  nwm_streamflow_median_xr1 <- nwm_streamflow_comids_xr1$median(dim='time', skipna=TRUE) 
  nwm_streamflow_median1 <- py_to_r(nwm_streamflow_median_xr1$data)
})

time5 <- system.time({
  desired_comids5 <- np$array(c(6078267L, 6078329L, 6078383L, 5867679L, 5867685L))
  nwm_streamflow_comids_xr5 <- nwm_streamflow_xr_lastDecade$sel(feature_id=desired_comids5)
  message('Starting median Q calculation for 5 COMIDs')
  nwm_streamflow_median_xr5 <- nwm_streamflow_comids_xr5$median(dim='time', skipna=TRUE) 
  nwm_streamflow_median5 <- py_to_r(nwm_streamflow_median_xr5$data)
})

time10 <- system.time({
  desired_comids10 <- np$array(c(6078383L, 9441199L, 4151212L, 6128763L, 13153643L, 3715602L, 
                                 7665728L, 6250614L, 12264996L, 1777616L))
  nwm_streamflow_comids_xr10 <- nwm_streamflow_xr_lastDecade$sel(feature_id=desired_comids10)
  message('Starting median Q calculation for 10 COMIDs')
  nwm_streamflow_median_xr10 <- nwm_streamflow_comids_xr10$median(dim='time', skipna=TRUE) 
  nwm_streamflow_median10 <- py_to_r(nwm_streamflow_median_xr10$data)
})

time25 <- system.time({
  desired_comids25 <- np$array(c(14365060L, 22336379L, 22337975L, 22338427L, 22340329L, 3717136L, 
                                 3808745L, 3806147L, 2087785L, 3405683L, 3407001L, 1814983L, 6848607L, 
                                 12005077L, 904040051L, 15582369L, 15537883L, 21972746L, 2252949L, 
                                 4390605L, 4390909L, 4390597L, 6013072L, 7665728L, 7666946L))
  nwm_streamflow_comids_xr25 <- nwm_streamflow_xr_lastDecade$sel(feature_id=desired_comids25)
  message('Starting median Q calculation for 25 COMIDs')
  nwm_streamflow_median_xr25 <- nwm_streamflow_comids_xr25$median(dim='time', skipna=TRUE) 
  nwm_streamflow_median25 <- py_to_r(nwm_streamflow_median_xr25$data)
})

@hdugan (Collaborator, Author) commented May 23, 2024

I took a look at this, because I'd rather learn something new than work on talks. I figured out how to do it in R. I avoid Python in R when I can, since it seemingly always leads to version errors. It's still slow as heck. Part of that is because the zarr files are in kind of useless chunks. It would be quicker if they were chunked by comid or time, but I'm sure there's a logical reason to chunk them in big rectangles across the entire array.

[Sidebar: I was also using v2. I could get v3 streamflow to load, but not the feature_id. There must be something wrong on their end.]

The chunks are 672 x 30000 [hours x comid]. If we pull from the fewest chunks possible, it speeds things up a lot.
This example takes about 30 sec to run, so across all comids it should take about 1 hour? It assumes that pulling 1/5 of the data gives you a representative streamflow.

library(Rarr) # devtools::install_github(repo = 'grimbough/Rarr')
library(tidyverse)
library(lubridate)

# Time parameters in the data file 
start_time <- as.POSIXct("1993-01-01 00:00:00")
end_time <- as.POSIXct("2018-12-31 23:59:00")
timeseq = seq.POSIXt(from = start_time, to = end_time, by = 'hour')

# Constrain data request to every 5th chunk after 2010
indx = data.frame(time = timeseq) |> 
  mutate(indx = row_number()) |> # add row number
  mutate(chunk = rep(1:340, each = 672)[1:length(timeseq)]) |> # assign each hour to its 672-row chunk
  filter(year(time) >= 2010) |> # keep 2010 onward
  filter(chunk %% 5 == 0) |> # keep every 5th chunk
  pull(indx)

# configure client
s3_client <- paws.storage::s3(
  config = list(
    credentials = list(anonymous = TRUE), 
    region = "us-west-2", 
    s3_force_path_style = TRUE)
)

# Look at features available from NWM v2 output 
s3_address <- "https://s3.amazonaws.com/noaa-nwm-retro-v2-zarr-pds"
zarr_overview(s3_address, s3_client = s3_client)

s3_address <- "https://s3.amazonaws.com/noaa-nwm-retro-v2-zarr-pds/feature_id"
zarr_overview(s3_address, s3_client = s3_client)
# Shape: 2729077
# Chunk Shape: 2729077
# No. of Chunks: 1 (1)
z_id <- read_zarr_array(s3_address, s3_client = s3_client)


# Inspect the streamflow array stored in the NWM's Amazon S3 zarr bucket
s3_address <- "https://s3.amazonaws.com/noaa-nwm-retro-v2-zarr-pds/streamflow"
zarr_overview(s3_address, s3_client = s3_client)
# Shape: 227904 x 2729077
# Chunk Shape: 672 x 30000
# No. of Chunks: 30940 (340 x 91)


# Start the clock
ptm <- proc.time()

# Request the chunk index across 30000 comids
z_streamflow <- read_zarr_array(s3_address, s3_client = s3_client, index = list(indx, 1:30000))

# Stop the clock
proc.time() - ptm
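
One small follow-up sketch (assuming z_streamflow comes back as a plain matrix whose columns follow the order of z_id and whose rows follow indx): attaching labels makes it easier to trace columns back to COMIDs and rows back to timestamps later.

# Label columns with feature_ids and rows with timestamps (illustrative)
colnames(z_streamflow) <- z_id[1:30000]
rownames(z_streamflow) <- format(timeseq[indx])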


@hdugan (Collaborator, Author) commented May 23, 2024

If this works (which I think it does), we should pull all the comids, make a median streamflow output file, and then use that in our workflow instead of running this more than once. Something like the sketch below.
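
Here is a rough sketch of what that one-time build could look like, looping over the 30000-wide column chunks so each request lines up with the zarr chunking. The output file name and the units label are illustrative, and it assumes -999900 is a fill value that should be set to NA before taking the medians.

library(Rarr)
library(matrixStats)

n_comids <- length(z_id)                      # 2,729,077 features in the v2 archive
col_starts <- seq(1, n_comids, by = 30000)

median_flow <- numeric(n_comids)
for (start in col_starts) {
  cols <- start:min(start + 30000 - 1, n_comids)
  chunk <- read_zarr_array(s3_address, s3_client = s3_client,
                           index = list(indx, cols))
  chunk[chunk == -999900] <- NA               # assumed no-data value; confirm against metadata
  median_flow[cols] <- colMedians(chunk, na.rm = TRUE)
}

# One-time output file keyed by feature_id (units assumed to be m3/s)
write.csv(data.frame(comid = z_id, median_flow_cms = median_flow),
          "nwm_median_streamflow.csv", row.names = FALSE)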

@lindsayplatt (Owner)

Amazing! Can't believe it only took 30 seconds to do this. Although, what is the output format for z_streamflow? I think it was the median calculation step that seemed to take a while in my example.

Also, I assume the 1:30000 for comids can just be substituted for our vector of COMIDs? I'll try running this tomorrow and then work on incorporating this into the pipeline :)
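
One possible wrinkle there: the index argument of read_zarr_array() takes positional indices rather than the feature IDs themselves, so the COMID vector would likely need to be matched against z_id first. A quick sketch, reusing the example COMIDs from earlier:

# Translate desired COMIDs into column positions within the feature_id dimension
desired_comids <- c(6078267, 6078329, 6078383, 5867679, 5867685)  # example COMIDs
comid_cols <- match(desired_comids, z_id)

z_streamflow_sub <- read_zarr_array(s3_address, s3_client = s3_client,
                                    index = list(indx, comid_cols))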

@hdugan (Collaborator, Author) commented May 23, 2024

It comes back as a matrix [hours x comids], so we can then take the median of each column. Doesn't seem too bad.

# Get median of each column 
ptm <- proc.time()
apply(z_streamflow, 2, FUN = median)
proc.time() - ptm # 73 sec

# The matrixStats package is faster than apply 
library(matrixStats)
ptm <- proc.time()
colMedians(x = z_streamflow)
proc.time() - ptm # 12 sec

There are a lot of zeros, and a lot of -999900 values, so we have to figure that out... but maybe they come from comids we're not using. Also, I assume feature_id = comid?
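
If -999900 is the array's no-data/fill value (an assumption worth checking against the NWM metadata), one simple way to handle it before taking the medians:

library(matrixStats)

z_clean <- z_streamflow
z_clean[z_clean == -999900] <- NA          # assumed fill value
median_flow <- colMedians(z_clean, na.rm = TRUE)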

@lindsayplatt (Owner)

Huh, maybe it's my computer. I just ran your first code chunk to get the data and my download time is much longer: 4 minutes compared to your 30 sec!

# This took 4 minutes (yours took 30 sec)
z_streamflow <- read_zarr_array(s3_address, s3_client = s3_client, index = list(indx, 1:30000)) 

The median calculation code was just as fast as yours, though. And I have also been assuming COMID == feature_id based on their docs.

@hdugan (Collaborator, Author) commented May 24, 2024

Tried it on my CFL desktop and it also took 30 seconds.
