
Document best practice: I/O-free STAC item generation #369

Open
TomAugspurger opened this issue Nov 1, 2022 · 4 comments
Labels
documentation (Improvements or additions to documentation), enhancement (New feature or request)

Comments

@TomAugspurger
Collaborator

TomAugspurger commented Nov 1, 2022

This issue discusses what is (IMO) a best practice for stactools packages: the ability to generate a STAC item without any I/O.

Currently, most stactools packages have a high-level stac.create_item(asset_href: str, ...) -> pystac.Item function that generates a STAC item from an href string. If the function needs to read any data or metadata, it handles that I/O itself. This is very convenient, and ideally every stactools package has a way of doing this (it's especially useful from the CLI).
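For reference, the usage pattern is roughly this (the package name and href are made up):

# Most stactools packages expose some variant of this entry point.
from stactools.example_package import stac

# create_item handles whatever reads are needed to populate the metadata.
item = stac.create_item("https://example.com/data/scene.tif")
item.validate()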

Some of the more complicated stactools packages also generate cloud-optimized assets from the "source" asset at asset_href. In some of these packages, whether the output STAC item catalogs the cloud-optimized asset is tied directly to whether that function created the cloud-optimized asset itself (see https://github.com/stactools-packages/goes-glm/blob/c9c3bc42685e66e0eaace599096ef6050c05eb57/src/stactools/goes_glm/stac.py#L46-L47 for example).

At a minimum, it should be easy to regenerate STAC metadata (including metadata for the cloud-optimized assets) without having to regenerate the cloud-optimized assets.

Now we have a couple of ways to handle this:

  1. The user passes the hrefs for both the source asset and the cloud-optimized assets. The create_item method is responsible for reading the data:

def create_item(source_asset_href, cloud_optimized_asset_hrefs, ...):
    ...

If the user provides cloud_optimized_asset_hrefs, then cloud-optimized asset (re)generation can be skipped.

  2. The user passes in the data (and perhaps the hrefs, to easily set the href for each asset).

def create_item(source_data, cloud_optimized_data):
    ...

Of these, I think we should steer package developers towards option 2, but I'm curious to hear others' thoughts. That's the approach taken by stac-table and xstac, and I think it works pretty well. Users are able to provide (essentially) any dataframe or Dataset and we can generate STAC metadata for it. Crucially, all of rasterio, pyarrow / dask.dataframe, and xarray can lazily read data so creating / passing around a DataFrame or Dataset doesn't actually read data (unless it's required by the method).
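To make option 2 concrete, here's a rough sketch for a simple raster case (the names are made up, and I'm skipping reprojecting the footprint to EPSG:4326 for brevity):

from datetime import datetime, timezone

import pystac
import rasterio
from rasterio.io import DatasetReader


def create_item(source_data: DatasetReader, source_href: str) -> pystac.Item:
    # All metadata comes from the already-open dataset; no additional reads happen here.
    left, bottom, right, top = source_data.bounds
    bbox = [left, bottom, right, top]
    geometry = {
        "type": "Polygon",
        "coordinates": [
            [[left, bottom], [right, bottom], [right, top], [left, top], [left, bottom]]
        ],
    }
    item = pystac.Item(
        id="example-item",  # would normally be derived from the source metadata
        geometry=geometry,
        bbox=bbox,
        datetime=datetime(2022, 11, 1, tzinfo=timezone.utc),  # placeholder
        properties={},
    )
    item.add_asset(
        "data",
        pystac.Asset(
            href=source_href,
            media_type=pystac.MediaType.COG,
            roles=["data"],
        ),
    )
    return item


# rasterio opens lazily, so this doesn't pull any pixels:
href = "https://example.com/scene.tif"
with rasterio.open(href) as src:
    item = create_item(src, href)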

@TomAugspurger TomAugspurger added the enhancement New feature or request label Nov 1, 2022
@pjhartzell
Collaborator

> At a minimum, it should be easy to regenerate STAC metadata (including metadata for the cloud-optimized assets) without having to regenerate the cloud-optimized assets.

I agree.

> The user passes in the data (and perhaps the hrefs, to easily set the href for each asset).

So you would pass, e.g., a pandas dataframe, to create_item (and potentially the href) rather than just the href to the dataframe? A COG would be handled by passing src from something like with rasterio.open("cog_href") as src:? I'm not clear on the advantage of passing the "data". Why not allow any data or metadata reading from the href to happen inside the create_item function?
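i.e., is something like this what you have in mind (create_item here is hypothetical)?

import pandas as pd
import rasterio

# Option 2 as I understand it: the caller opens/loads the data, then hands the object in.
dataframe = pd.read_parquet("table_href")
item = create_item(dataframe, href="table_href")

with rasterio.open("cog_href") as src:
    item = create_item(src, href="cog_href")

# vs. the current pattern, where create_item does its own reading:
item = create_item("cog_href")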

@gadomski
Member

gadomski commented Nov 4, 2022

> The user passes in the data (and perhaps the hrefs, to easily set the href for each asset).

I think we need hrefs for each Asset.href -- if we just have the data, how does the asset know where to point?

@TomAugspurger
Collaborator Author

TomAugspurger commented Nov 4, 2022

Hmm I had a reply but I might have closed that browser tab before submitting it.

We chatted a bit about this on Tuesday, but I unfortunately mixed up two things here.

  1. Easily regenerate (all) STAC metadata without having to regenerate cloud-optimized assets.
  2. Provide APIs for generating STAC metadata from data objects, rather than (just) from hrefs.

Hopefully 1 is uncontroversial, and it's fairly straightforward to implement. Any function that takes an asset_href and generates cloud-optimized assets should also take hrefs for the cloud-optimized assets. If provided, those (existing) cloud-optimized assets should be used to generate the STAC metadata.
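Concretely, for 1 I'm imagining something along these lines (the helper name is made up):

def create_item(asset_href, cog_hrefs=None):
    if cog_hrefs is None:
        # Expensive path: read the source asset and write new cloud-optimized assets.
        cog_hrefs = create_cogs(asset_href)  # hypothetical helper
    # Cheap path when cog_hrefs is provided: only the STAC metadata is (re)generated,
    # reading from the existing cloud-optimized assets as needed.
    ...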

Item 2 is more subjective. I've just found it handy in the past to avoid filesystems and I/O in functions where possible. One example use case is generating STAC metadata from in-memory data structures as a performance optimization (reading from / writing to disk can be relatively slow, and if you already have the data in memory, why pay that cost?).

My hope is that the only cost on package developers is a single extra layer of indirection. I think most packages would be structured like

def create_item(asset_href, ...):
    data = read_href(asset_href)  # into a rasterio dataset / dataframe / table / xarray structure / ...
    return create_item_from_data(data, asset_href)

def create_item_from_data(data, asset_href):
    ...

I suspect most packages are doing something like this, only the create_item_from_data might not be refactored into a standalone function.

> The user passes in the data (and perhaps the hrefs, to easily set the href for each asset).
>
> I think we need hrefs for each Asset.href -- if we just have the data, how does the asset know where to point?

Indeed, an href would be required in addition to the data object.

@gadomski
Member

gadomski commented Nov 4, 2022

> Provide APIs for generating STAC metadata from data objects, rather than (just) from hrefs.

I think it's possible, but it gets sticky when you need to look up some static information that isn't contained in the dataset, such as classification semantics for a multi-band NetCDF dataset where each band is its own COG. In mostly-code:

from pystac import Asset, Item
from pystac.extensions.raster import RasterExtension

# Static, per-variable band metadata that isn't stored in the COGs themselves.
RASTER_BANDS = {
    "sea_ice_concentration": ...,
    "sea_ice_other_variable": ...,
}

def create_item_from_data(data, asset_href):  # data is a rasterio dataset, asset_href is a COG href
    item = Item(...)
    item.add_asset("data", Asset(...))
    raster = RasterExtension.ext(item.assets["data"], add_if_missing=True)
    raster.bands = [RASTER_BANDS[variable]]  # <-- where do I learn what variable this COG represents?

I could parse which variable it is from the file name, but that feels icky to me. I think you're still going to need to hit the original source NetCDF. So, I think the pattern is good for a one-to-one use case, but it gets harder for one-to-many.
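In mostly-code again, one way around that (not saying it's the right one) would be for the caller to hand in the per-variable context alongside the data:

def create_item_from_data(source_data, cogs):
    # source_data: the opened source NetCDF (e.g. an xarray.Dataset)
    # cogs: mapping of variable name -> (opened COG dataset, COG href)
    item = Item(...)
    for variable, (cog_data, cog_href) in cogs.items():
        item.add_asset(variable, Asset(href=cog_href, ...))
        raster = RasterExtension.ext(item.assets[variable], add_if_missing=True)
        raster.bands = [RASTER_BANDS[variable]]  # the variable is now known
    return item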

> I suspect most packages are doing something like this, only the create_item_from_data might not be refactored into a standalone function.

Agreed, especially for the simple one-to-one case.

For a real world example of how I'm trying to work around this, here's me skipping re-creation of COGs when they already exist for a many-to-many NetCDF->COG dataset: stactools-packages/noaa-cdr#39

@gadomski gadomski added the documentation Improvements or additions to documentation label Nov 14, 2022