-
Notifications
You must be signed in to change notification settings - Fork 7
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Granule to STAC Item / Kerchunk Script #32
base: main
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This script doesn't run for me.
I tried regenerating the proposed STAC item using the same granule, and I get a KeyError
:
python gran_to_cat.py LC08_L1GT_004010_20140206_20200912_02_T2_X_LC08_L1GT_004010_20140529_20200911_02_T2_G0120V02_P008.nc --s3_path s3://its-live-data/velocity_image_pair/landsatOLI/v02/N70W040/LC08_L1GT_004010_20140206_20200912_02_T2_X_LC08_L1GT_004010_20140529_20200911_02_T2_G0120V02_P008.nc
Traceback (most recent call last):
File ".../_test/kerchunk/gran_to_cat.py", line 197, in <module>
make_granule_stac_kerchunk()
File ".../mambaforge/envs/its-kerchunk/lib/python3.12/site-packages/click/core.py", line 1157, in __call__
return self.main(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^
File ".../mambaforge/envs/its-kerchunk/lib/python3.12/site-packages/click/core.py", line 1078, in main
rv = self.invoke(ctx)
^^^^^^^^^^^^^^^^
File ".../mambaforge/envs/its-kerchunk/lib/python3.12/site-packages/click/core.py", line 1434, in invoke
return ctx.invoke(self.callback, **ctx.params)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File ".../mambaforge/envs/its-kerchunk/lib/python3.12/site-packages/click/core.py", line 783, in invoke
return __callback(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File ".../_test/kerchunk/gran_to_cat.py", line 190, in make_granule_stac_kerchunk
stac = refs_to_stac_item(refs)
^^^^^^^^^^^^^^^^^^^^^^^
File ".../_test/kerchunk/gran_to_cat.py", line 168, in refs_to_stac_item
item = xstac.xarray_to_stac(DuckDataset(ref),
^^^^^^^^^^^^^^^^
File ".../_test/kerchunk/gran_to_cat.py", line 92, in __init__
zarray = ujson.loads(r[f"{v}/.zarray"])
~^^^^^^^^^^^^^^^^
KeyError: 'img_pair_info/.zarray'
My environment file for this script:
name: its-kerchunk
channels:
- conda-forge
- nodefaults
dependencies:
- python
- click
- fsspec
- geopandas
- numpy
- pandas
- pyproj
- shapely
- ujson
- xstac
- kerchunk
- netcdf4
- h5py
- ca-certificates
- certifi
- openssl
and the environment listing:
env_list.txt
Co-authored-by: Joseph H Kennedy <[email protected]>
Co-authored-by: Joseph H Kennedy <[email protected]>
Co-authored-by: Joseph H Kennedy <[email protected]>
@jhkennedy I went and addressed most of your comments but regarding your issue I could use some additional information about your package versions (mainly kerchunk, h5py, and netCDF4) since after these changes my code still works in my environment. |
@agoodm I included the listing of the environment in my comment ( |
@jhkennedy Sorry for missing that! I was unable to track down which package is causing the inconsistency but I realized that my code should be more robust in handling this case anyway, so I made a quick change. Please try running it again with the latest commit. |
This latest version now uses pathlib with your suggested changes. how does it look to you now? |
@agoodm looks good! I've finished my review of the stac item (below) and have a couple of questions about how you're using the refs. kerchunk refs@agoodm in the stac item, I don't see the entire refs encoded in the STAC item; are you just using the Overall, it'd be good to know exactly what you intend to use from the STAC item. stac itemI've reviewed the STAC item from a search and discovery standpoint since that's the primary purpose of the STAC catalog (e.g., facilitating the queries @betolink will need to do). I don't know enough about how you're using the STAC item, so some of these points may affect that. Using this granule as the test again:
I have a few comments/suggestions on the generated STAC item:
I've edited the output stac item for how I'd like to see it: which results in this mini catalog using But there are a few things there that may need to change. @betolink @alex-s-gardner @markf6 it'd be good for y'all to review my proposed STAC item as well. I've got a few open questions regarding the item:
Feedback
|
@agoodm I'm happy to make some (all?) of these changes myself -- do you want me to just push to your branch or open a PR to your branch, or do you want to handle all of it? |
You can go ahead and apply these yourself, none of them should have any major impact with my planned virtual cubes work. As far as that is concerned the additional STAC metadata is primarily useful for potential ways to narrow down queries when arrow expands these into separate columns in the geoparquet. However we decided to keep the kerchunk JSON entirely separate for now since it is a bit cleaner without it and also easier and faster to decode later this way. |
# Kerchunk can"t handle these structs properly so these are converted to | ||
# global attributes |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
# Kerchunk can"t handle these structs properly so these are converted to | |
# global attributes | |
# Kerchunk can"t handle these structs properly so we convert them to | |
# global attributes |
for variable in ["mapping", "img_pair_info"]: | ||
if variable in nc.variables: | ||
obj = nc[variable] | ||
for attr in obj.ncattrs(): | ||
val = getattr(obj, attr) | ||
if hasattr(val, "item"): | ||
val = val.item() | ||
attrs[attr] = val | ||
del refs["refs"][variable+"/0"] | ||
del refs["refs"][variable+"/.zattrs"] | ||
del refs["refs"][variable+"/.zarray"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we inject mapping
into a dictionary, so the top-level metadata looks like:
...
"mapping": {grid_mapping_keys...},
...
In conversation with the MPI-BGC folks who also use quite a lot of geo-Zarr data, this is the format they've settled on and it's the most compliant with the CF spec that we can get. In NetCDF, because you can't have nested metadata attributes, these were stored in empty variables. But Zarr doesn't have the concept of empty variables, so a single top-level key holding a JSON dictionary is as close as we can get.
Given that each file has a single CRS, we may also want to have a top-level spatial_epsg
or spatial_ref
key as well, just for safety's sake.
Check out this pull request on See visual diffs & provide feedback on Jupyter Notebooks. Powered by ReviewNB |
This adds a script for generating Kerchunk JSON references for an image pair granule, and from these a STAC item.
Usage:
There is an optional --s3_path flag if processing a granule locally, but the included URLs in both JSON files will reference their S3 Path.