JSON not properly decoded by backends #415

Open
TomNicholas opened this issue Feb 3, 2024 · 6 comments

@TomNicholas

TomNicholas commented Feb 3, 2024

Kerchunk doesn't properly decode the JSON for zarr array-level attributes, instead leaving dictionaries as long strings. For example:

import kerchunk.hdf
import xarray as xr

# create example netCDF4 file
xr.tutorial.open_dataset('air_temperature').to_netcdf('air.nc')

kerchunk.hdf.SingleHdf5ToZarr('air.nc', inline_threshold=300).translate()
{'version': 1,
 'refs': {'.zgroup': '{"zarr_format":2}',
  '.zattrs': '{"Conventions":"COARDS","description":"Data is from NMC initialized reanalysis\\n(4x\\/day).  These are the 0.9950 sigma level values.","platform":"Model","references":"http:\\/\\/www.esrl.noaa.gov\\/psd\\/data\\/gridded\\/data.ncep.reanalysis.html","title":"4x daily NMC reanalysis (1948)"}',
  'air/.zarray': '{"chunks":[2920,25,53],"compressor":null,"dtype":"<i2","fill_value":null,"filters":null,"order":"C","shape":[2920,25,53],"zarr_format":2}',
  'air/.zattrs': '{"GRIB_id":11,"GRIB_name":"TMP","_ARRAY_DIMENSIONS":["time","lat","lon"],"actual_range":[185.16000366210938,322.1000061035156],"dataset":"NMC Reanalysis","level_desc":"Surface","long_name":"4xDaily Air temperature at sigma level 995","parent_stat":"Other","precision":2,"scale_factor":0.01,"statistic":"Individual Obs","units":"degK","var_desc":"Air temperature"}',
  'air/0.0.0': ['air.nc', 15419, 7738000],
  'lat/.zarray': '{"chunks":[25],"compressor":null,"dtype":"<f4","fill_value":"NaN","filters":null,"order":"C","shape":[25],"zarr_format":2}',
  'lat/.zattrs': '{"_ARRAY_DIMENSIONS":["lat"],"axis":"Y","long_name":"Latitude","standard_name":"latitude","units":"degrees_north"}',
  'lat/0': ['air.nc', 5179, 100],
  'lon/.zarray': '{"chunks":[53],"compressor":null,"dtype":"<f4","fill_value":"NaN","filters":null,"order":"C","shape":[53],"zarr_format":2}',
  'lon/.zattrs': '{"_ARRAY_DIMENSIONS":["lon"],"axis":"X","long_name":"Longitude","standard_name":"longitude","units":"degrees_east"}',
  'lon/0': ['air.nc', 5279, 212],
  'time/.zarray': '{"chunks":[2920],"compressor":null,"dtype":"<f4","fill_value":"NaN","filters":null,"order":"C","shape":[2920],"zarr_format":2}',
  'time/.zattrs': '{"_ARRAY_DIMENSIONS":["time"],"calendar":"standard","long_name":"Time","standard_name":"time","units":"hours since 1800-01-01"}',
  'time/0': ['air.nc', 7757515, 11680]}}

Notice that this is only partially decoded: the top two levels are nested Python dictionaries, but below that the various zarr attributes are stored as long strings, e.g.:

'{"chunks":[2920,25,53],"compressor":null,"dtype":"<i2","fill_value":null,"filters":null,"order":"C","shape":[2920,25,53],"zarr_format":2}'

This seems silly. Why not just decode the whole thing properly at the beginning, so you can always treat it like a nested Python dictionary? (Or, even better, use a dedicated abstraction like the one suggested in #375.)

@martindurant
Member

This is exactly what zarr expects: JSON data encoded as (ascii) bytestrings. When you load this reference set with zarr, that's when the decoding happens.
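A minimal standard-library sketch of what that means in practice. The `refs` dict here is a hand-trimmed copy of two entries from the output above, not something produced by calling kerchunk, and `json.loads` stands in for the decoding step that zarr performs when it reads a metadata key.

```python
import json

# Two entries copied (and trimmed) from the reference set shown above.
refs = {
    "lat/.zattrs": '{"_ARRAY_DIMENSIONS":["lat"],"axis":"Y","units":"degrees_north"}',
    "lat/0": ["air.nc", 5179, 100],
}

# Metadata keys hold JSON text, so the value is a str rather than a dict.
raw = refs["lat/.zattrs"]
print(type(raw).__name__)

# Decoding is an explicit, separate step.
attrs = json.loads(raw)
print(attrs["axis"])

# Chunk keys are already structured: [path, byte offset, length].
path, offset, length = refs["lat/0"]
print(path, offset, length)
```

So the reference set mixes two kinds of values: encoded JSON documents for metadata, and plain lists for chunk locations.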

@TomNicholas
Author

huh, thanks. So is there an un-translated version accessible from within kerchunk that can actually be treated as a dictionary?

@martindurant
Member

SingleHDF actually uses zarr to fill in the JSON metadata into the references dict, so no: it converts the data immediately. You could of course open the dataset with zarr, and use its .attrs to get dicts back.

@dcherian
Contributor

dcherian commented Feb 7, 2024

does json.loads do what you need?
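For reference, a rough sketch of how `json.loads` could be applied across a whole reference set. The helper name `decode_metadata` and the constant `METADATA_NAMES` are hypothetical, made up for this illustration, and are not part of kerchunk; the sketch assumes zarr v2 key names as in the output above.

```python
import json

# Zarr v2 metadata key names; every other key is treated as a chunk reference.
METADATA_NAMES = (".zgroup", ".zattrs", ".zarray")


def decode_metadata(refs):
    """Return a copy of refs with JSON metadata strings decoded into Python objects."""
    decoded = {}
    for key, value in refs.items():
        # The last path component tells metadata keys apart from chunk keys.
        name = key.rsplit("/", 1)[-1]
        if name in METADATA_NAMES and isinstance(value, str):
            decoded[key] = json.loads(value)
        else:
            # Chunk references (and inlined chunk data) pass through unchanged.
            decoded[key] = value
    return decoded


# Trimmed copy of a few entries from the reference set in the issue.
refs = {
    ".zgroup": '{"zarr_format":2}',
    "air/.zarray": '{"chunks":[2920,25,53],"dtype":"<i2","shape":[2920,25,53],"zarr_format":2}',
    "air/0.0.0": ["air.nc", 15419, 7738000],
}
decoded = decode_metadata(refs)
print(decoded["air/.zarray"]["chunks"])
```

Checking the key name rather than just the value type matters because, with `inline_threshold` set, small chunks may also be stored as strings, and those are not JSON.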

@TomNicholas
Author

> does json.loads do what you need?

It does, but this issue was more of a complaint that, as is, the references are in the wrong form to be traversed and manipulated.

> This is exactly what zarr expects

I personally think that kerchunk should create a useful, full internal model of zarr (similar to the Zarr Object Models idea), manipulate that, and then encode it at the end before serializing, rather than carrying around just some part of the zarr info as dictionaries in encoded form.
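A small sketch of that decode, manipulate, encode flow for a single metadata entry, using only the standard library. The helper `edit_attrs` is a hypothetical name for illustration and not an existing kerchunk function, and a plain dict stands in for the richer internal model being proposed.

```python
import json


def edit_attrs(refs, key, **updates):
    """Decode one metadata entry, update it, and re-encode it as compact JSON."""
    model = json.loads(refs[key])   # decode into a plain dict
    model.update(updates)           # manipulate the decoded form
    new_refs = dict(refs)           # leave the caller's dict untouched
    # Encode only at the end, with no whitespace between tokens.
    new_refs[key] = json.dumps(model, separators=(",", ":"))
    return new_refs


# One entry copied from the reference set in the issue.
refs = {
    "time/.zattrs": '{"_ARRAY_DIMENSIONS":["time"],"calendar":"standard","units":"hours since 1800-01-01"}'
}
updated = edit_attrs(refs, "time/.zattrs", calendar="proleptic_gregorian")
print(updated["time/.zattrs"])
```

The re-encoded string is equivalent JSON, but it is not guaranteed to match byte for byte what zarr itself would write; for example, `json.dumps` does not escape forward slashes the way the `\\/` sequences in the output above are escaped.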

@martindurant
Member

Would it not be fine to just open the created reference set with zarr? The thing is, the bytestrings are written by zarr in the format you see, so we would have to decode them, only to encode them again at the time of writing to a file.
