Document best practice: I/O-free STAC item generation #369
Comments
I agree.
So you would pass, e.g., a pandas dataframe, to |
I think we need |
Hmm I had a reply but I might have closed that browser tab before submitting it. We chatted a bit about this on Tuesday, but I unfortunately mixed up two things here.
Hopefully 1 is uncontroversial, and is fairly straightforward to implement. Any function that takes an Item
2 is more subjective. I've just found it handy in the past to avoid filesystem(s) and I/O in functions where possible. One example use case is generating STAC metadata from in-memory data structures as a performance optimization (reading from / writing to disk can be relatively slow, and if you already have the data in memory, why pay that cost?). My hope is that the only cost to package developers is a single extra layer of indirection. I think most packages would be structured like
def create_item(asset_href, ...):
    data = read_href(...)  # into a rasterio.Dataset / dataframe / table / xarray structure / ...
    return create_item_from_data(data, asset_href)

def create_item_from_data(data, asset_href):
    ...
I suspect most packages are doing something like this, only the
Indeed, an |
I think it's possible, but it gets sticky when you need to look up some static information that's not contained in the dataset, such as classification semantics for a multi-band NetCDF dataset where each band is its own COG. In mostly-code:
RASTER_BANDS = {
    "sea_ice_concentration": ...,
    "sea_ice_other_variable": ...,
}

def create_item_from_data(data, asset_href):  # data is a rasterio dataset, asset_href is a COG href
    item = Item(...)
    item.add_asset("data", Asset(...))
    raster = RasterExtension.ext(item.assets["data"], add_if_missing=True)
    raster.bands = [RASTER_BANDS[variable]]  # <-- where do I learn what variable this COG represents?
I could parse the variable name from the file name, but that feels icky to me. I think you're still going to need to hit the original source NetCDF. So, I think the pattern is good for a one-to-one use case, but it gets harder for one-to-many.
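One possible way around the lookup problem, sketched here under the assumption that the caller already knows which variable each COG holds, is to pass the variable name in explicitly instead of parsing it from the file name. This sketch uses plain dictionaries shaped like STAC items rather than pystac objects so that it runs without dependencies. The `variable` parameter, the `create_items` helper, the shape of `data`, and the placeholder band metadata are all assumptions made for illustration and are not part of any stactools package.

```python
from typing import Any

# Static metadata that is not stored in the COG itself.
# The values are placeholders invented for this sketch.
RASTER_BANDS: dict[str, dict[str, Any]] = {
    "sea_ice_concentration": {"description": "placeholder band metadata"},
    "sea_ice_other_variable": {"description": "another placeholder"},
}


def create_item_from_data(data: dict, asset_href: str, variable: str) -> dict:
    """Build a STAC-item-shaped dict with no I/O.

    `data` stands in for an already-opened dataset; here it is a dict
    with "id", "bbox" and "datetime" keys (an assumption of this sketch).
    """
    if variable not in RASTER_BANDS:
        raise KeyError(f"unknown variable: {variable!r}")
    return {
        "type": "Feature",
        "id": f"{data['id']}-{variable}",
        "bbox": data["bbox"],
        "properties": {"datetime": data["datetime"]},
        "assets": {
            "data": {
                "href": asset_href,
                "raster:bands": [RASTER_BANDS[variable]],
            }
        },
    }


def create_items(data: dict, cog_hrefs: dict[str, str]) -> list[dict]:
    """One-to-many case: one source dataset, one item per variable/COG pair."""
    return [
        create_item_from_data(data, href, variable)
        for variable, href in cog_hrefs.items()
    ]
```

The cost of this design is that whoever creates the COGs has to keep track of the variable-to-href mapping and hand it over, which moves the bookkeeping to the caller rather than removing it.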
Agreed, especially for the simple one-to-one case. For a real world example of how I'm trying to work around this, here's me skipping re-creation of COGs when they already exist for a many-to-many NetCDF->COG dataset: stactools-packages/noaa-cdr#39 |
This issue is discussing what is (IMO) a best-practice for stactools packages: the ability to generate a STAC item without any I/O.
Currently most stactools packages have a high-level stac.create_item(asset_href: str, ...) -> pystac.Item function that generates a STAC item from a string. If the method requires reading any data / metadata, it will handle that I/O. This is very convenient, and ideally every stactools package has a way of doing this (especially useful when using a CLI).
Some of the more complicated stactools packages also generate cloud-optimized assets from the "source" asset at asset_href. In some of these packages, whether the output STAC item catalogs the cloud-optimized asset is directly tied to that function creating the cloud-optimized asset itself (see https://github.com/stactools-packages/goes-glm/blob/c9c3bc42685e66e0eaace599096ef6050c05eb57/src/stactools/goes_glm/stac.py#L46-L47 for example).
At a minimum, it should be easy to regenerate STAC metadata (including metadata for the cloud-optimized assets) without having to regenerate the cloud-optimized assets.
Now we have a couple ways to handle this:
1. The create_item method is responsible for reading the data. If the user provides cloud_optimized_asset_hrefs, then cloud-optimized asset (re)generation can be skipped.
2. The user passes in the data (and perhaps the hrefs, to easily set the href for each asset).
Of these, I think we should steer package developers towards option 2, but I'm curious to hear others' thoughts. That's the approach taken by stac-table and xstac, and I think it works pretty well. Users are able to provide (essentially) any dataframe or Dataset and we can generate STAC metadata for it. Crucially, all of rasterio, pyarrow / dask.dataframe, and xarray can lazily read data, so creating / passing around a DataFrame or Dataset doesn't actually read data (unless it's required by the method).
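The two options can live side by side, with option 1 as a thin wrapper around option 2. The following is a minimal sketch of that layering, assuming a JSON file as a stand-in for the source asset and plain dictionaries in place of pystac objects so it runs with only the standard library. The helpers `read_href` and `make_cogs`, and the keys expected in `data`, are hypothetical names chosen for illustration.

```python
import json
from typing import Optional


def read_href(asset_href: str) -> dict:
    """Hypothetical reader: the only function here that touches the filesystem."""
    with open(asset_href) as f:
        return json.load(f)


def make_cogs(data: dict, asset_href: str) -> dict[str, str]:
    """Stand-in for expensive cloud-optimized asset generation.

    A real package would write files here; this sketch only derives hrefs.
    """
    stem = asset_href.rsplit(".", 1)[0]
    return {name: f"{stem}_{name}.tif" for name in data["variables"]}


def create_item_from_data(
    data: dict, asset_href: str, cloud_optimized_asset_hrefs: dict[str, str]
) -> dict:
    """Option 2: the caller passes in the data and hrefs, so no I/O happens."""
    assets = {"source": {"href": asset_href}}
    for name, href in cloud_optimized_asset_hrefs.items():
        assets[name] = {"href": href, "roles": ["data"]}
    return {
        "type": "Feature",
        "id": data["id"],
        "bbox": data["bbox"],
        "properties": {"datetime": data["datetime"]},
        "assets": assets,
    }


def create_item(
    asset_href: str,
    cloud_optimized_asset_hrefs: Optional[dict[str, str]] = None,
) -> dict:
    """Option 1: reads the data, and skips asset generation if hrefs are given."""
    data = read_href(asset_href)
    if cloud_optimized_asset_hrefs is None:
        cloud_optimized_asset_hrefs = make_cogs(data, asset_href)
    return create_item_from_data(data, asset_href, cloud_optimized_asset_hrefs)
```

With this split, a CLI can keep calling `create_item` with just an href, while a pipeline that already holds the data in memory can call `create_item_from_data` directly and never touch the disk.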