Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Granule to STAC Item / Kerchunk Script #32

Open
wants to merge 8 commits into
base: main
Choose a base branch
from

Conversation

agoodm
Copy link
Contributor

@agoodm agoodm commented Sep 23, 2024

This adds a script for generating Kerchunk JSON references for an image pair granule, and from these a STAC item.

Usage:

python gran_to_cat.py /path/to/gran.nc

There is an optional --s3_path flag if processing a granule locally, but the included URLs in both JSON files will reference their S3 Path.

Copy link
Contributor

@jhkennedy jhkennedy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This script doesn't run for me.

I tried regenerating the proposed STAC item using the same granule, and I get a KeyError:

python gran_to_cat.py LC08_L1GT_004010_20140206_20200912_02_T2_X_LC08_L1GT_004010_20140529_20200911_02_T2_G0120V02_P008.nc --s3_path s3://its-live-data/velocity_image_pair/landsatOLI/v02/N70W040/LC08_L1GT_004010_20140206_20200912_02_T2_X_LC08_L1GT_004010_20140529_20200911_02_T2_G0120V02_P008.nc
Traceback (most recent call last):
  File ".../_test/kerchunk/gran_to_cat.py", line 197, in <module>
    make_granule_stac_kerchunk()
  File ".../mambaforge/envs/its-kerchunk/lib/python3.12/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File ".../mambaforge/envs/its-kerchunk/lib/python3.12/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
         ^^^^^^^^^^^^^^^^
  File ".../mambaforge/envs/its-kerchunk/lib/python3.12/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File ".../mambaforge/envs/its-kerchunk/lib/python3.12/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File ".../_test/kerchunk/gran_to_cat.py", line 190, in make_granule_stac_kerchunk
    stac = refs_to_stac_item(refs)
           ^^^^^^^^^^^^^^^^^^^^^^^
  File ".../_test/kerchunk/gran_to_cat.py", line 168, in refs_to_stac_item
    item = xstac.xarray_to_stac(DuckDataset(ref),
                                ^^^^^^^^^^^^^^^^
  File ".../_test/kerchunk/gran_to_cat.py", line 92, in __init__
    zarray = ujson.loads(r[f"{v}/.zarray"])
                         ~^^^^^^^^^^^^^^^^
KeyError: 'img_pair_info/.zarray'

My environment file for this script:

name: its-kerchunk
channels:
  - conda-forge
  - nodefaults
dependencies:
  - python
  - click
  - fsspec
  - geopandas
  - numpy
  - pandas
  - pyproj
  - shapely
  - ujson
  - xstac
  - kerchunk
  - netcdf4
  - h5py
  - ca-certificates
  - certifi
  - openssl

and the environment listing:
env_list.txt

src/kerchunk/gran_to_cat.py Outdated Show resolved Hide resolved
src/kerchunk/gran_to_cat.py Outdated Show resolved Hide resolved
src/kerchunk/gran_to_cat.py Outdated Show resolved Hide resolved
src/kerchunk/gran_to_cat.py Outdated Show resolved Hide resolved
src/kerchunk/gran_to_cat.py Outdated Show resolved Hide resolved
src/kerchunk/gran_to_cat.py Outdated Show resolved Hide resolved
src/kerchunk/gran_to_cat.py Outdated Show resolved Hide resolved
@agoodm
Copy link
Contributor Author

agoodm commented Sep 26, 2024

@jhkennedy I went and addressed most of your comments but regarding your issue I could use some additional information about your package versions (mainly kerchunk, h5py, and netCDF4) since after these changes my code still works in my environment.

@jhkennedy
Copy link
Contributor

@agoodm I included the listing of the environment in my comment (env_list.txt) which has the versions for every package in the environment.

@agoodm
Copy link
Contributor Author

agoodm commented Sep 26, 2024

@jhkennedy Sorry for missing that!

I was unable to track down which package is causing the inconsistency but I realized that my code should be more robust in handling this case anyway, so I made a quick change. Please try running it again with the latest commit.

@agoodm
Copy link
Contributor Author

agoodm commented Sep 27, 2024

This latest version now uses pathlib with your suggested changes. how does it look to you now?

@jhkennedy
Copy link
Contributor

jhkennedy commented Sep 27, 2024

This latest version now uses pathlib with your suggested changes. How does it look to you now?

@agoodm looks good!

I've finished my review of the stac item (below) and have a couple of questions about how you're using the refs.


kerchunk refs

@agoodm in the stac item, I don't see the entire refs encoded in the STAC item; are you just using the chunks, shape, and attrs for each variable in item['properties']['cube:variables'], or are you expecting to use the refs file (either directly off S3 or separately encoded in the catalog)?

Overall, it'd be good to know exactly what you intend to use from the STAC item.


stac item

I've reviewed the STAC item from a search and discovery standpoint since that's the primary purpose of the STAC catalog (e.g., facilitating the queries @betolink will need to do). I don't know enough about how you're using the STAC item, so some of these points may affect that.

Using this granule as the test again:

s3://its-live-data/velocity_image_pair/landsatOLI/v02/N70W040/LC08_L1GT_004010_20140206_20200912_02_T2_X_LC08_L1GT_004010_20140529_20200911_02_T2_G0120V02_P008.nc

I have a few comments/suggestions on the generated STAC item:

  1. id shouldn't have the .nc extension
  2. for the cube:dimensions property,
    1. each dimension reference_system is a proj json object, but could just be the EPSG code: https://github.com/stac-extensions/datacube?tab=readme-ov-file#horizontal-spatial-raster-dimension-object
    2. there's no description for the dimension, but I'm not sure if it is worth putting "x coordinate of projection in meters", so maybe that's fine
  3. for the cube:variables property,
    1. the mapping variable is missing, so the grid_mapping attribute doesn't really make sense to include in each variable
    2. For these variables, I think we can drop attrs, shapes, and chunks entirely (they aren't in the cube extension spec) -- are they needed for your kerchunk work?
  4. for the assets,
    1. we should include the thumbnail and browse image
    2. we should describe them a bit better by including the type, description, and title as well as the href: https://github.com/radiantearth/stac-spec/blob/master/commons/assets.md#asset-object
    3. I think the asset key should be "data" so it's easy to select that asset for any item
  5. we'll need a bunch more metadata,
    1. most everything should be included from the img_pair_info variable as item properties
    2. we'll need most of the information encoded in the variable name as well in the item properties
    3. we'll need the ITS_LIVE datacube "tile" in the item properties
    4. we'll need some of the global attributes in the item properties

I've edited the output stac item for how I'd like to see it:
LC08_L1GT_004010_20140206_20200912_02_T2_X_LC08_L1GT_004010_20140529_20200911_02_T2_G0120V02_P008.stac-joe.json

which results in this mini catalog using stac_geoparquet (zipped b/c GitHub):
its_live_mini_catalog.zip

But there are a few things there that may need to change. @betolink @alex-s-gardner @markf6 it'd be good for y'all to review my proposed STAC item as well. I've got a few open questions regarding the item:

  • Is the "roles" specifier for assets valuable? It's not required and the spec implies it's mostly for items with multiple data assets.
  • Are we happy with the item property names? There are some changes as compared to the netCDF metadata. The major one is:
    • In @betolink's db he uses sat1 and sat2, while in the netCDF granule we use img1 and img2. I've used reference and secondary as that's more common in the remote sensing/InSAR world (ref and sec could also work) and is more descriptive, I think.
  • Is there metadata in the granule I missed?

Feedback

  • We'll want ascending/descending for at least Sentinel-1, and if it's easy Landsat and Sentinel-2
  • [] Can we have higher precision on center_latitude / center_longitude

@jhkennedy
Copy link
Contributor

@agoodm I'm happy to make some (all?) of these changes myself -- do you want me to just push to your branch or open a PR to your branch, or do you want to handle all of it?

@agoodm
Copy link
Contributor Author

agoodm commented Sep 27, 2024

@agoodm I'm happy to make some (all?) of these changes myself -- do you want me to just push to your branch or open a PR to your branch, or do you want to handle all of it?

You can go ahead and apply these yourself, none of them should have any major impact with my planned virtual cubes work. As far as that is concerned the additional STAC metadata is primarily useful for potential ways to narrow down queries when arrow expands these into separate columns in the geoparquet. However we decided to keep the kerchunk JSON entirely separate for now since it is a bit cleaner without it and also easier and faster to decode later this way.

Comment on lines +52 to +53
# Kerchunk can"t handle these structs properly so these are converted to
# global attributes

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
# Kerchunk can"t handle these structs properly so these are converted to
# global attributes
# Kerchunk can"t handle these structs properly so we convert them to
# global attributes

Comment on lines +54 to +64
for variable in ["mapping", "img_pair_info"]:
if variable in nc.variables:
obj = nc[variable]
for attr in obj.ncattrs():
val = getattr(obj, attr)
if hasattr(val, "item"):
val = val.item()
attrs[attr] = val
del refs["refs"][variable+"/0"]
del refs["refs"][variable+"/.zattrs"]
del refs["refs"][variable+"/.zarray"]

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we inject mapping into a dictionary, so the top-level metadata looks like:

...
"mapping": {grid_mapping_keys...},
...

In conversation with the MPI-BGC folks who also use quite a lot of geo-Zarr data, this is the format they've settled on and it's the most compliant with the CF spec that we can get. In NetCDF, because you can't have nested metadata attributes, these were stored in empty variables. But Zarr doesn't have the concept of empty variables, so a single top-level key holding a JSON dictionary is as close as we can get.

Given that each file has a single CRS, we may also want to have a top-level spatial_epsg or spatial_ref key as well, just for safety's sake.

Copy link

Check out this pull request on  ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.


Powered by ReviewNB

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants