Icechunk as storage format for JSON-like trees with array leaves #601
Comments
👋 welcome @dionhaefner! I think the answer is yes, this is a good fit for your use case. The main detail I would add is that Icechunk's user-facing data model is identical to Zarr's (which is itself based on HDF5). All Icechunk data are created via the Zarr API. Icechunk sits under the hood, managing the actual storage of the data. So I would reframe your questions to
For 1, here's how I would model the data above as Zarr. Zarr basically defines

import numpy as np
import zarr
from zarr.storage import LocalStore
store = LocalStore("tmp.zarr")
root_group = zarr.group(store, zarr_format=3)
root_group.update_attributes(
{
"a_float": 0.1,
"list_of_stuff": ["execute order", 66]
}
)
child_group = root_group.create_group("nested")
child_group.attrs["not_an_array"] = [1, 2, 3]
array = child_group.create_array("look_ma_an_array", shape=24, dtype="int32")
array[:] = np.arange(24)

Many organizations are ditching their bespoke data formats in favor of Zarr because of the many advantages of adopting a community-maintained framework. So this would be step 1 for you. I'd say that the main advantage of Zarr for your use case is a much more sophisticated encoding and storage of arrays than your simple flat binary format (chunking, compression, etc.), which translates to much better performance for large arrays.

On top of Zarr (or technically beneath Zarr in the stack 😆), Icechunk can provide several optimizations that would bring this format more in line with your requirements:
Hope this is helpful.
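To make the chunking and compression point above concrete, here is a toy sketch using only the Python standard library. It is not how Zarr or Icechunk store data internally; the function names `write_chunks` and `read_slice` and the chunk length of 6 are assumptions made up for illustration. The idea it shows is that each chunk is compressed independently, so reading a slice only decompresses the chunks that overlap it.

```python
import struct
import zlib

CHUNK_LEN = 6  # elements per chunk (arbitrary choice for this toy example)


def write_chunks(values, chunk_len=CHUNK_LEN):
    """Split a list of int32 values into independently compressed chunks."""
    chunks = {}
    for start in range(0, len(values), chunk_len):
        part = values[start:start + chunk_len]
        raw = struct.pack(f"<{len(part)}i", *part)
        chunks[start // chunk_len] = zlib.compress(raw)
    return chunks


def read_slice(chunks, start, stop, chunk_len=CHUNK_LEN):
    """Return values[start:stop] plus the indices of the chunks that were read."""
    out = []
    touched = []
    first = start // chunk_len
    last = (stop - 1) // chunk_len
    for idx in range(first, last + 1):
        raw = zlib.decompress(chunks[idx])
        part = struct.unpack(f"<{len(raw) // 4}i", raw)
        touched.append(idx)
        # Translate the global slice bounds into offsets within this chunk.
        lo = max(start, idx * chunk_len) - idx * chunk_len
        hi = min(stop, (idx + 1) * chunk_len) - idx * chunk_len
        out.extend(part[lo:hi])
    return out, touched


chunks = write_chunks(list(range(24)))
values, touched = read_slice(chunks, 7, 11)
print(values, touched)
```

With a flat linear buffer there is no equivalent of `touched`: a compressed buffer has to be decompressed as a whole before any element can be read.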
Thanks for providing that perspective, and especially the code sample :) Always good to see things in action. I understand where you're coming from, although I should mention that vanilla Zarr falls flat for us because it's missing this key property:
So, creating a second
We have a simple in-house data format that we use for I/O between simulations, ML models, and processing on geometric data structures. We're wondering whether Icechunk could fit the bill to replace it in situations where we want to support slicing / chunking / compression of array data, without losing the inherent flexibility and decent scaling to a large spectrum of use cases.
Our current data format works roughly like this:
base64-encoded binary, or contain a pointer to a binary file + offset.

For example, this would be valid data:

Where <ARRAY> could be either of those things:

Full example file
This works well because it allows us to change the tree structure of the data and non-array values without touching large arrays – they're just inserted as a reference so the data is only read when it's actually needed, and never copied.
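As a rough illustration of the reference mechanism described above, here is a minimal sketch in Python. The actual in-house schema is not shown in this thread, so the leaf keys (`encoding`, `count`, `buffer`, `path`, `offset`), the `binref` label, and the `load_array` helper are all hypothetical, and the sketch only handles little-endian int32 data.

```python
import base64
import json
import os
import struct
import tempfile


def load_array(leaf):
    """Resolve one (hypothetical) array leaf into a list of int32 values."""
    count = leaf["count"]
    if leaf["encoding"] == "base64":
        raw = base64.b64decode(leaf["buffer"])
    elif leaf["encoding"] == "binref":
        # The binary file is opened only here, and only the needed bytes are read.
        with open(leaf["path"], "rb") as f:
            f.seek(leaf["offset"])
            raw = f.read(4 * count)
    else:
        raise ValueError(f"unknown encoding: {leaf['encoding']}")
    return list(struct.unpack(f"<{count}i", raw))


# Write a binary file holding two int32 arrays back to back.
tmpdir = tempfile.mkdtemp()
bin_path = os.path.join(tmpdir, "arrays.bin")
with open(bin_path, "wb") as f:
    f.write(struct.pack("<3i", 1, 2, 3))
    f.write(struct.pack("<2i", 40, 50))

tree = {
    "a_float": 0.1,
    "nested": {
        "inline": {
            "encoding": "base64",
            "count": 2,
            "buffer": base64.b64encode(struct.pack("<2i", 7, 8)).decode("ascii"),
        },
        # Points at the second array: it starts after 3 * 4 = 12 bytes.
        "external": {"encoding": "binref", "count": 2, "path": bin_path, "offset": 12},
    },
}

# Restructuring the tree only rewrites a small JSON document;
# the .bin file is neither read nor copied.
moved = json.loads(json.dumps(tree))
moved["renamed"] = moved.pop("nested")

inline_values = load_array(moved["renamed"]["inline"])
external_values = load_array(moved["renamed"]["external"])
print(inline_values, external_values)
```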
However, this suffers from 2 issues:
.bin files just contain the array data as a linear buffer, so no efficient slicing due to the lack of chunking.

I'm intrigued by Icechunk because it seems to address similar problems and looks really well executed (congrats!). I can see how it could potentially solve (2) by inserting references to Icechunk stores instead of .bin files, and adding a mini-DSL for accessing slices. Does that sound reasonable?

Would love to hear your thoughts, also whether Icechunk could potentially help with issue (1).
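For what the mini-DSL idea could look like, here is one possible sketch. The reference syntax `<store>::<array path>[<index>]` is invented for this example and is not part of Zarr or Icechunk; `parse_reference` and `parse_index_part` are hypothetical helpers. The parser only turns a reference string into a store location, an array path, and a tuple of `int`/`slice` objects, leaving the actual read to whatever array library is in use.

```python
import re

# Hypothetical syntax: "<store>::<array path>[<comma-separated indices>]"
REF_PATTERN = re.compile(
    r"^(?P<store>.+?)::(?P<array>[^\[\]]+)(?:\[(?P<index>[^\[\]]*)\])?$"
)


def parse_index_part(text):
    """Turn '3' into the int 3 and 'a:b:c' into slice(a, b, c)."""
    text = text.strip()
    if ":" not in text:
        return int(text)
    fields = [int(f) if f.strip() else None for f in text.split(":")]
    if len(fields) > 3:
        raise ValueError(f"too many ':' in index {text!r}")
    return slice(*fields)


def parse_reference(ref):
    """Split a reference string into (store, array_path, selection)."""
    match = REF_PATTERN.match(ref)
    if match is None:
        raise ValueError(f"not a valid array reference: {ref!r}")
    index = match.group("index")
    if index is None or not index.strip():
        selection = ()  # no index given: the whole array
    else:
        selection = tuple(parse_index_part(p) for p in index.split(","))
    return match.group("store"), match.group("array"), selection


store, array_path, selection = parse_reference(
    "data.zarr::nested/look_ma_an_array[2:10, ::2]"
)
print(store, array_path, selection)
```

The resulting `selection` tuple has the shape that NumPy-style indexing accepts (`arr[selection]`), so a chunked backend would only need to fetch the chunks overlapping the requested region.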