Icechunk as storage format for JSON-like trees with array leaves #601

Open
dionhaefner opened this issue Jan 22, 2025 · 2 comments

@dionhaefner

We have a simple in-house data format that we use for I/O between simulations, ML models, and processing on geometric data structures. We're wondering whether Icechunk could fit the bill to replace it in situations where we want to support slicing / chunking / compression of array data, without losing the inherent flexibility and decent scaling to a large spectrum of use cases.

Our current data format works roughly like this:

  • It consists of arbitrary trees of JSON data, with a custom definition for array data at the leaves.
  • Arrays can be inserted as base64 encoded binary, or contain a pointer to a binary file + offset.
  • All files are valid JSON.

For example, this would be valid data:

{
    "a_float": 0.1,
    "list_of_stuff": ["execute order", 66],
    "nested": {
        "look_ma_an_array": <ARRAY>,
        "not_an_array": [1, 2, 3]
    }
}

where <ARRAY> can be either of the following:

# base64 encoded
{
    "object_type": "array",
    "shape": [24],
    "dtype": "int32",
    "data": {
        "buffer": "BwAAAAcAAAAHAAAABwAAAAcAAAAHAAAABwAAAAcAAAAHAAAABwAAAAcAAAAHAAAABwAAAAcAAAAHAAAABwAAAAcAAAAHAAAABwAAAAcAAAAHAAAABwAAAAcAAAAHAAAA",
        "encoding": "base64"
    }
}

# pointer to binary on disk
{
    "object_type": "array",
    "shape": [120],
    "dtype": "int32",
    "data": {
        "buffer": "fe5be72c-cc13-4c2c-bdd9-f628e33ea71e.bin:0",
        "encoding": "binref"
    }
}
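
To make the two encodings concrete, here is a minimal sketch of a reader for a single array leaf. The function name `load_array` and the `base_dir` parameter are hypothetical and not part of the in-house format. It assumes that the number after the colon in a "binref" buffer is a byte offset (consistent with the offsets in the example file, where a 120-element int32 array at offset 0 is followed by data at offset 480), and that buffers are stored in native byte order and C order.

```python
import base64
from pathlib import Path

import numpy as np


def load_array(node: dict, base_dir: Path = Path(".")) -> np.ndarray:
    """Decode one array leaf (hypothetical helper, not the in-house reader)."""
    dtype = np.dtype(node["dtype"])
    shape = tuple(node["shape"])
    buffer = node["data"]["buffer"]
    encoding = node["data"]["encoding"]

    if encoding == "base64":
        # Inline data: decode the string and view the bytes with the given dtype.
        raw = base64.b64decode(buffer)
        return np.frombuffer(raw, dtype=dtype).reshape(shape)

    if encoding == "binref":
        # Assumed layout of the reference: "<file name>:<byte offset>".
        filename, _, offset = buffer.rpartition(":")
        # A read-only memory map defers reading until elements are accessed.
        return np.memmap(
            base_dir / filename,
            dtype=dtype,
            mode="r",
            offset=int(offset),
            shape=shape,
        )

    raise ValueError(f"unknown encoding: {encoding!r}")
```

Both branches return read-only arrays, which matches the idea that array data is referenced rather than copied.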

Full example file:

[
    {
        "name": "bm01-flow-over-cylinder-sample037_510",
        "time": 51,
        "data": {
            "inlet": {
                "object_type": "PasteurSurfaceMesh",
                "cell_connectivity": {
                    "object_type": "array",
                    "shape": [
                        120
                    ],
                    "dtype": "int32",
                    "data": {
                        "buffer": "fe5be72c-cc13-4c2c-bdd9-f628e33ea71e.bin:0",
                        "encoding": "binref"
                    }
                },
                "cell_types": {
                    "object_type": "array",
                    "shape": [
                        24
                    ],
                    "dtype": "int32",
                    "data": {
                        "buffer": "BwAAAAcAAAAHAAAABwAAAAcAAAAHAAAABwAAAAcAAAAHAAAABwAAAAcAAAAHAAAABwAAAAcAAAAHAAAABwAAAAcAAAAHAAAABwAAAAcAAAAHAAAABwAAAAcAAAAHAAAA",
                        "encoding": "base64"
                    }
                },
                "points": {
                    "object_type": "array",
                    "shape": [
                        50,
                        3
                    ],
                    "dtype": "float32",
                    "data": {
                        "buffer": "fe5be72c-cc13-4c2c-bdd9-f628e33ea71e.bin:480",
                        "encoding": "binref"
                    }
                },
                "cell_data": {
                    "p": {
                        "object_type": "array",
                        "shape": [
                            24,
                            1
                        ],
                        "dtype": "float32",
                        "data": {
                            "buffer": "GnOVPDoVmzyDsKQ8d+6wPIjuvjyz7c08pULdPENl7Dx5APs8SHYEPVb2Cj1w/BA9fnUWPY9JGz3JbB890bwiPdIxJT0n0yY9HIgnPdDxJz3rjJM8TcuTPF7yJz1D9ic9",
                            "encoding": "base64"
                        }
                    },
                    "U": {
                        "object_type": "array",
                        "shape": [
                            24,
                            3
                        ],
                        "dtype": "float32",
                        "data": {
                            "buffer": "jO5nPgAAAAAAAAAAGgKmPgAAAAAAAAAAgEPJPgAAAAAAAAAAxPrgPgAAAAAAAAAAAADwPgAAAAAAAAAAHMT4PgAAAAAAAAAAzlD9PgAAAAAAAAAAmkj/PgAAAAAAAAAA0+b/PgAAAAAAAAAAl///PgAAAAAAAAAAl///PgAAAAAAAAAA0+b/PgAAAAAAAAAAmkj/PgAAAAAAAAAAzlD9PgAAAAAAAAAAHMT4PgAAAAAAAAAAAADwPgAAAAAAAAAAxPrgPgAAAAAAAAAAgEPJPgAAAAAAAAAAGwKmPgAAAAAAAAAAjO5nPgAAAAAAAAAAcf0ePQAAAAAAAAAAU/8APgAAAAAAAAAAU/8APgAAAAAAAAAAcf0ePQAAAAAAAAAA",
                            "encoding": "base64"
                        }
                    }
                },
                "point_data": {
                    "p": {
                        "object_type": "array",
                        "shape": [
                            50,
                            1
                        ],
                        "dtype": "float32",
                        "data": {
                            "buffer": "a4aUPCpEmDwqRJg8a4aUPN7inzze4p88fc+qPH3Pqjx/7rc8f+63PB1uxjwdbsY8LJjVPCyY1Tz00+Q89NPkPN6y8zzesvM8QvsAPUL7AD1Ptgc9T7YHPWP5DT1j+Q0997gTPfe4Ez2G3xg9ht8YPSxbHT0sWx09zRQhPc0UIT1R9yM9UfcjPXwCJj18AiY9oS0nPaEtJz32vCc99rwnPR/yJz0f8ic964yTPByskzwcrJM864yTPFD0Jz1Q9Cc9Q/YnPUP2Jz0=",
                            "encoding": "base64"
                        }
                    },
                    "U": {
                        "object_type": "array",
                        "shape": [
                            50,
                            3
                        ],
                        "dtype": "float32",
                        "data": {
                            "buffer": "fe5be72c-cc13-4c2c-bdd9-f628e33ea71e.bin:1080",
                            "encoding": "binref"
                        }
                    }
                }
            },
            "outlet": {
                "object_type": "PasteurSurfaceMesh",
                "cell_connectivity": {
                    "object_type": "array",
                    "shape": [
                        120
                    ],
                    "dtype": "int32",
                    "data": {
                        "buffer": "fe5be72c-cc13-4c2c-bdd9-f628e33ea71e.bin:1680",
                        "encoding": "binref"
                    }
                },
                "cell_types": {
                    "object_type": "array",
                    "shape": [
                        24
                    ],
                    "dtype": "int32",
                    "data": {
                        "buffer": "BwAAAAcAAAAHAAAABwAAAAcAAAAHAAAABwAAAAcAAAAHAAAABwAAAAcAAAAHAAAABwAAAAcAAAAHAAAABwAAAAcAAAAHAAAABwAAAAcAAAAHAAAABwAAAAcAAAAHAAAA",
                        "encoding": "base64"
                    }
                },
                "points": {
                    "object_type": "array",
                    "shape": [
                        50,
                        3
                    ],
                    "dtype": "float32",
                    "data": {
                        "buffer": "fe5be72c-cc13-4c2c-bdd9-f628e33ea71e.bin:2160",
                        "encoding": "binref"
                    }
                },
                "cell_data": {
                    "p": {
                        "object_type": "array",
                        "shape": [
                            24,
                            1
                        ],
                        "dtype": "float32",
                        "data": {
                            "buffer": "AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA",
                            "encoding": "base64"
                        }
                    },
                    "U": {
                        "object_type": "array",
                        "shape": [
                            24,
                            3
                        ],
                        "dtype": "float32",
                        "data": {
                            "buffer": "2+/PPmsIaTsAAAAAFmbRPlfe6LoAAAAAyB/RPjNPDzwAAAAAHoXOPpE7vjsAAAAA7ELPPmRwQjyP4Qgf7mPOPiFhOjzWOQmfX6zOPrj1dTyWlsue4TDOPuosdzw3yssem8LNPhOCjzwAAAAAsjfNPr7ZizwAAAAAetPMPv6BljwAAAAAU4TMPhBdjDyPYs6DMgvMPjMbkjw81s4dc/7LPvm1dTx/Q8+dIabLPlfsdTwAAAAAswTMPggKJzwAAAAA0CnLPnIuODzuXwifyuvLPtUoejulyQgfHAnLPlRv5juFdhEf1FHNPjMoXLtTohKfuzfLPrEnijsAAAAAZuHOPuFKWrzXgWGf69LOPpD5JTyYneEfwzXOPsRH37sAAAAA",
                            "encoding": "base64"
                        }
                    }
                },
                "point_data": {
                    "p": {
                        "object_type": "array",
                        "shape": [
                            50,
                            1
                        ],
                        "dtype": "float32",
                        "data": {
                            "buffer": "AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA=",
                            "encoding": "base64"
                        }
                    },
                    "U": {
                        "object_type": "array",
                        "shape": [
                            50,
                            3
                        ],
                        "dtype": "float32",
                        "data": {
                            "buffer": "fe5be72c-cc13-4c2c-bdd9-f628e33ea71e.bin:2760",
                            "encoding": "binref"
                        }
                    }
                }
            },
            "wall": {
                "object_type": "PasteurSurfaceMesh",
                "cell_connectivity": {
                    "object_type": "array",
                    "shape": [
                        450
                    ],
                    "dtype": "int32",
                    "data": {
                        "buffer": "fe5be72c-cc13-4c2c-bdd9-f628e33ea71e.bin:3360",
                        "encoding": "binref"
                    }
                },
                "cell_types": {
                    "object_type": "array",
                    "shape": [
                        90
                    ],
                    "dtype": "int32",
                    "data": {
                        "buffer": "BwAAAAcAAAAHAAAABwAAAAcAAAAHAAAABwAAAAcAAAAHAAAABwAAAAcAAAAHAAAABwAAAAcAAAAHAAAABwAAAAcAAAAHAAAABwAAAAcAAAAHAAAABwAAAAcAAAAHAAAABwAAAAcAAAAHAAAABwAAAAcAAAAHAAAABwAAAAcAAAAHAAAABwAAAAcAAAAHAAAABwAAAAcAAAAHAAAABwAAAAcAAAAHAAAABwAAAAcAAAAHAAAABwAAAAcAAAAHAAAABwAAAAcAAAAHAAAABwAAAAcAAAAHAAAABwAAAAcAAAAHAAAABwAAAAcAAAAHAAAABwAAAAcAAAAHAAAABwAAAAcAAAAHAAAABwAAAAcAAAAHAAAABwAAAAcAAAAHAAAABwAAAAcAAAAHAAAABwAAAAcAAAAHAAAABwAAAAcAAAAHAAAABwAAAAcAAAAHAAAABwAAAAcAAAAHAAAABwAAAAcAAAAHAAAA",
                        "encoding": "base64"
                    }
                },
                "points": {
                    "object_type": "array",
                    "shape": [
                        184,
                        3
                    ],
                    "dtype": "float32",
                    "data": {
                        "buffer": "fe5be72c-cc13-4c2c-bdd9-f628e33ea71e.bin:5160",
                        "encoding": "binref"
                    }
                },
                "cell_data": {
                    "p": {
                        "object_type": "array",
                        "shape": [
                            90,
                            1
                        ],
                        "dtype": "float32",
                        "data": {
                            "buffer": "Q/YnPfTsJz0tyCc9Z+EnPUuuKD1QxCk9EoQrPdlMLD3bzCw9VWwtPT+eMj2PuD09G3JJPWilVT0gnGM9z+51PYHfhD1H+Y09JxeTPYBbkD2LjYE9BehCPfwexTxmS7q6ZNekvGK6Dr0jomq9BOmsvRHL8L1ZugC+m9WiveffRL1X9Oq8UVxmvF6aqLuH8Fm5gEsqO30phTtNo5Q7a2qQO1mafjuCsEs7ZgIXO2xewjr0wxk664yTPF+DkzxCM5M8QyqTPFRqlDyLtZU86uCXPPQ3mDwIBpg8TbeWPEjVmzwGP6A8X2ujPPTYojzbrZU8iPSFPD9ZVjzJMhk8/35lOy3/ebtNumG8gs3HvAEvE73KMjq9jeRavUMzbL3Q/nO9Hwtrvf4hW73PqUO97NYrvU84F72p5wS9f3HnvCd4x7wVjKq8QXaQvD8AcrwHNke86o0hvGkM/bupL8C7vuWEu/7oG7vpl4a6",
                            "encoding": "base64"
                        }
                    },
                    "U": {
                        "object_type": "array",
                        "shape": [
                            90,
                            3
                        ],
                        "dtype": "float32",
                        "data": {
                            "buffer": "fe5be72c-cc13-4c2c-bdd9-f628e33ea71e.bin:7368",
                            "encoding": "binref"
                        }
                    }
                },
                "point_data": {
                    "p": {
                        "object_type": "array",
                        "shape": [
                            184,
                            1
                        ],
                        "dtype": "float32",
                        "data": {
                            "buffer": "fe5be72c-cc13-4c2c-bdd9-f628e33ea71e.bin:8448",
                            "encoding": "binref"
                        }
                    },
                    "U": {
                        "object_type": "array",
                        "shape": [
                            184,
                            3
                        ],
                        "dtype": "float32",
                        "data": {
                            "buffer": "fe5be72c-cc13-4c2c-bdd9-f628e33ea71e.bin:9184",
                            "encoding": "binref"
                        }
                    }
                }
            },
            "obstacle": {
                "object_type": "PasteurSurfaceMesh",
                "cell_connectivity": {
                    "object_type": "array",
                    "shape": [
                        400
                    ],
                    "dtype": "int32",
                    "data": {
                        "buffer": "fe5be72c-cc13-4c2c-bdd9-f628e33ea71e.bin:11392",
                        "encoding": "binref"
                    }
                },
                "cell_types": {
                    "object_type": "array",
                    "shape": [
                        80
                    ],
                    "dtype": "int32",
                    "data": {
                        "buffer": "BwAAAAcAAAAHAAAABwAAAAcAAAAHAAAABwAAAAcAAAAHAAAABwAAAAcAAAAHAAAABwAAAAcAAAAHAAAABwAAAAcAAAAHAAAABwAAAAcAAAAHAAAABwAAAAcAAAAHAAAABwAAAAcAAAAHAAAABwAAAAcAAAAHAAAABwAAAAcAAAAHAAAABwAAAAcAAAAHAAAABwAAAAcAAAAHAAAABwAAAAcAAAAHAAAABwAAAAcAAAAHAAAABwAAAAcAAAAHAAAABwAAAAcAAAAHAAAABwAAAAcAAAAHAAAABwAAAAcAAAAHAAAABwAAAAcAAAAHAAAABwAAAAcAAAAHAAAABwAAAAcAAAAHAAAABwAAAAcAAAAHAAAABwAAAAcAAAAHAAAABwAAAAcAAAAHAAAABwAAAAcAAAAHAAAABwAAAAcAAAA=",
                        "encoding": "base64"
                    }
                },
                "points": {
                    "object_type": "array",
                    "shape": [
                        160,
                        3
                    ],
                    "dtype": "float32",
                    "data": {
                        "buffer": "fe5be72c-cc13-4c2c-bdd9-f628e33ea71e.bin:12992",
                        "encoding": "binref"
                    }
                },
                "cell_data": {
                    "p": {
                        "object_type": "array",
                        "shape": [
                            80,
                            1
                        ],
                        "dtype": "float32",
                        "data": {
                            "buffer": "TrMBvscFH74Esj6+Eephvqj5gL7PnI6+vwOYvjJMnr68jKC+bOqevrU0mL7EY42+dKl9vqhvXL6Zjjq+lrAevtj8D76H9BK+GhgavrVsGL5CtR6+MXEivvcnJr4GFjO+qftBvlX/Ub4Dslm+0ndTvhCAPL44AB2+V3D3vbguw70pJqG9vW6WvTOjkb2Ydo+9EvaDveQpX71UVSa91BL1vGv8nTzUxx47mWmOvGKwH72UnnO9m12ZvfqLrr32f7u9hLG8vQLGsb0vtpi9fKdovalaE731MMO8kEq2vLFKq7z3y7G8HKvBvCRH0Lxg29S8uY7Cve5kdb2Lm8+8RvrkO1A+Gz0KkYM99gu0PbrN2j02+vk9pmIHPrCcDT7a/w4+aakLPnRBBT7C/fY9T3rdPVnyvD37U5k9H6hkPfaUEz0=",
                            "encoding": "base64"
                        }
                    },
                    "U": {
                        "object_type": "array",
                        "shape": [
                            80,
                            3
                        ],
                        "dtype": "float32",
                        "data": {
                            "buffer": "fe5be72c-cc13-4c2c-bdd9-f628e33ea71e.bin:14912",
                            "encoding": "binref"
                        }
                    }
                },
                "point_data": {
                    "p": {
                        "object_type": "array",
                        "shape": [
                            160,
                            1
                        ],
                        "dtype": "float32",
                        "data": {
                            "buffer": "fe5be72c-cc13-4c2c-bdd9-f628e33ea71e.bin:15872",
                            "encoding": "binref"
                        }
                    },
                    "U": {
                        "object_type": "array",
                        "shape": [
                            160,
                            3
                        ],
                        "dtype": "float32",
                        "data": {
                            "buffer": "fe5be72c-cc13-4c2c-bdd9-f628e33ea71e.bin:16512",
                            "encoding": "binref"
                        }
                    }
                }
            },
            "internal": {
                "object_type": "PasteurVolumetricMesh",
                "cell_connectivity": {
                    "object_type": "array",
                    "shape": [
                        20520
                    ],
                    "dtype": "int32",
                    "data": {
                        "buffer": "fe5be72c-cc13-4c2c-bdd9-f628e33ea71e.bin:18432",
                        "encoding": "binref"
                    }
                },
                "cell_types": {
                    "object_type": "array",
                    "shape": [
                        2280
                    ],
                    "dtype": "int32",
                    "data": {
                        "buffer": "fe5be72c-cc13-4c2c-bdd9-f628e33ea71e.bin:100512",
                        "encoding": "binref"
                    }
                },
                "points": {
                    "object_type": "array",
                    "shape": [
                        4778,
                        3
                    ],
                    "dtype": "float32",
                    "data": {
                        "buffer": "fe5be72c-cc13-4c2c-bdd9-f628e33ea71e.bin:109632",
                        "encoding": "binref"
                    }
                },
                "cell_data": {
                    "p": {
                        "object_type": "array",
                        "shape": [
                            2280,
                            1
                        ],
                        "dtype": "float32",
                        "data": {
                            "buffer": "fe5be72c-cc13-4c2c-bdd9-f628e33ea71e.bin:166968",
                            "encoding": "binref"
                        }
                    },
                    "U": {
                        "object_type": "array",
                        "shape": [
                            2280,
                            3
                        ],
                        "dtype": "float32",
                        "data": {
                            "buffer": "fe5be72c-cc13-4c2c-bdd9-f628e33ea71e.bin:176088",
                            "encoding": "binref"
                        }
                    }
                },
                "point_data": {
                    "p": {
                        "object_type": "array",
                        "shape": [
                            4778,
                            1
                        ],
                        "dtype": "float32",
                        "data": {
                            "buffer": "fe5be72c-cc13-4c2c-bdd9-f628e33ea71e.bin:203448",
                            "encoding": "binref"
                        }
                    },
                    "U": {
                        "object_type": "array",
                        "shape": [
                            4778,
                            3
                        ],
                        "dtype": "float32",
                        "data": {
                            "buffer": "fe5be72c-cc13-4c2c-bdd9-f628e33ea71e.bin:222560",
                            "encoding": "binref"
                        }
                    }
                }
            }
        }
    }
]

This works well because it allows us to change the tree structure of the data and non-array values without touching large arrays – they're just inserted as a reference so the data is only read when it's actually needed, and never copied.
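
As an illustration of that property, the sketch below regroups a tree from "patch, then field" to "field, then patch" without ever opening the .bin file. The helpers `swap_levels` and `binref` are hypothetical, and the tree is a flattened simplification of the example file above; only the file name, shapes, and offsets are taken from it.

```python
def swap_levels(tree: dict) -> dict:
    """Turn {outer: {inner: leaf}} into {inner: {outer: leaf}}."""
    out: dict = {}
    for outer, children in tree.items():
        for inner, leaf in children.items():
            # Leaves are reused as-is: same dict object, same buffer reference.
            out.setdefault(inner, {})[outer] = leaf
    return out


def binref(shape: list, dtype: str, offset: int) -> dict:
    """Build an array leaf that points into the .bin file from the example."""
    return {
        "object_type": "array",
        "shape": shape,
        "dtype": dtype,
        "data": {
            "buffer": f"fe5be72c-cc13-4c2c-bdd9-f628e33ea71e.bin:{offset}",
            "encoding": "binref",
        },
    }


# Simplified tree: one level of patches, one level of fields.
by_patch = {
    "inlet": {
        "points": binref([50, 3], "float32", 480),
        "U": binref([50, 3], "float32", 1080),
    },
    "outlet": {
        "points": binref([50, 3], "float32", 2160),
        "U": binref([50, 3], "float32", 2760),
    },
}

# Only small dicts are rearranged; no array bytes are read or copied.
by_field = swap_levels(by_patch)
```

After the call, `by_field["U"]["outlet"]` is the very same leaf object as `by_patch["outlet"]["U"]`, so the restructuring cost depends on the number of nodes, not on the array sizes.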

However, this suffers from two issues:

  1. JSON files can still get large if there are a lot of arrays.
  2. .bin files just contain the array data as a linear buffer, so there is no efficient slicing, since the data is not chunked.
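
To illustrate issue (2), the following sketch computes which byte ranges of a C-ordered 2-D linear buffer have to be read for a given slice. The function `byte_ranges` is hypothetical and only handles unit-step slices; it is meant as arithmetic, not as part of any reader.

```python
def byte_ranges(shape, itemsize, rows: slice, cols: slice):
    """Byte ranges (start, stop) covering a[rows, cols] in a C-order buffer.

    Sketch only: assumes a 2-D array and slices with step 1.
    """
    n_rows, n_cols = shape
    r0, r1, _ = rows.indices(n_rows)
    c0, c1, _ = cols.indices(n_cols)
    if (c0, c1) == (0, n_cols):
        # Full rows are adjacent in memory: one contiguous read.
        return [(r0 * n_cols * itemsize, r1 * n_cols * itemsize)]
    # Partial rows: one separate read per row.
    return [
        ((r * n_cols + c0) * itemsize, (r * n_cols + c1) * itemsize)
        for r in range(r0, r1)
    ]


# "points" of the internal mesh in the example: shape (4778, 3), float32.
row_block = byte_ranges((4778, 3), 4, slice(0, 100), slice(None))
one_column = byte_ranges((4778, 3), 4, slice(None), slice(0, 1))
```

A block of rows maps to a single contiguous read, while a single column needs one 4-byte read per row (4778 reads here). A chunked layout lets the writer pick chunk shapes that match the expected access pattern.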

I'm intrigued by Icechunk because it seems to address similar problems and looks really well executed (congrats!). I can see how it could potentially solve (2) by inserting references to Icechunk stores instead of .bin files, and adding a mini-DSL for accessing slices. Does that sound reasonable?
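
A mini-DSL like that could be as small as a reference string with an optional NumPy-style index suffix. The sketch below parses such a string; the syntax, the example path, and the function `parse_slice_ref` are all hypothetical and only meant to show the idea.

```python
import re

# "<target>" optionally followed by "[<index>]"; brackets may not nest.
_REF = re.compile(r"^(?P<target>[^\[\]]+)(?:\[(?P<index>[^\[\]]*)\])?$")


def _parse_dim(text: str):
    """Parse one dimension: an integer such as "5" or a slice such as "0:10:2"."""
    text = text.strip()
    if ":" not in text:
        return int(text)
    parts = [int(p) if p.strip() else None for p in text.split(":")]
    if len(parts) > 3:
        raise ValueError(f"too many ':' in {text!r}")
    return slice(*parts)


def parse_slice_ref(ref: str):
    """Split a reference into (target, index tuple usable for array indexing)."""
    match = _REF.match(ref.strip())
    if match is None:
        raise ValueError(f"malformed reference: {ref!r}")
    if match["index"] is None:
        # No suffix means "the whole array".
        return match["target"], (Ellipsis,)
    index = tuple(_parse_dim(part) for part in match["index"].split(","))
    return match["target"], index


target, index = parse_slice_ref("meshes/internal/points[0:100, :]")
```

The returned index tuple can be passed directly to anything that supports NumPy-style indexing, for example `array[index]`.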

Would love to hear your thoughts, also whether Icechunk could potentially help with issue (1).

@rabernat
Contributor

👋 welcome @dionhaefner!

I think the answer is yes, this is a good fit for your use case. The main detail I would add is that Icechunk's user-facing data model is identical to Zarr's (which is itself based on HDF5). All Icechunk data are created via the Zarr API. Icechunk sits under the hood, managing the actual storage of the data.

So I would reframe your questions as:

  1. Can we map the bespoke, in-house data model to Zarr?
  2. Do you need the extra features that Icechunk provides on top of vanilla Zarr?

For 1, here's how I would model the data above as Zarr. Zarr basically defines groups and arrays. Groups can contain other groups or arrays. Both objects can have arbitrary JSON-style attributes attached to them, but these attributes aren't meant for big data. Arrays can be arbitrarily large and are potentially spread over many files via chunking.

import numpy as np
import zarr
from zarr.storage import LocalStore

# Create a Zarr v3 hierarchy backed by a local directory
store = LocalStore("tmp.zarr")
root_group = zarr.group(store, zarr_format=3)

# Small JSON-like values become attributes on the group
root_group.update_attributes(
    {
        "a_float": 0.1,
        "list_of_stuff": ["execute order", 66]
    }
)

# Nested objects become child groups; array leaves become Zarr arrays
child_group = root_group.create_group("nested")
child_group.attrs["not_an_array"] = [1, 2, 3]
array = child_group.create_array("look_ma_an_array", shape=(24,), dtype="int32")
array[:] = np.arange(24)

Many organizations are ditching their bespoke data formats in favor of Zarr because of the many advantages of adopting a community-maintained framework. So this would be step 1 for you. I'd say that the main advantage of Zarr for your use case is a much more sophisticated encoding and storage of arrays than your simple flat binary format (chunking, compression, etc.), which translates to much better performance for large arrays.

On top of Zarr (or technically beneath Zarr in the stack 😆 ), Icechunk can provide several optimizations that would bring this format more in line with your requirements:

  • More efficient storage of metadata via a custom binary protocol
  • "Inlining" of small arrays into those binary files (as with your inline base64 array above)
  • Transactional updates

Hope this is helpful.

@dionhaefner
Author

Thanks for providing that perspective, and especially the code sample :) Always good to see things in action.

I understand where you're coming from, although I should mention that vanilla Zarr falls flat for us because it's missing this key property:

> This works well because it allows us to change the tree structure of the data and non-array values without touching large arrays – they're just inserted as a reference so the data is only read when it's actually needed, and never copied.

So creating a second LocalStore with a different group structure but the same arrays would trigger a copy, right? What we want is closer to a data structure that works by reference (to something like a tensor database?), whose tree we can restructure with impunity.
