Poor h5repack / H5DOpen performance for virtual datasets with 1000+ real underlying datasets #5187
Labels: Component - C Library, Type - Improvement
versioned-hdf5 is an abstraction layer on top of h5py which adds diff versioning to HDF5 files. It does so with copy-on-write at the chunk level: when the user modifies a dataset,
a. they now have two read-only datasets, before and after the change, and
b. the total disk usage is the size of the original dataset plus the size of the changed chunks.
This is achieved by creating a virtual dataset for each version, which is a stitch of individual chunks of a plain "raw data" dataset. Every time the user commits a new version which changes a subset of chunks, the original chunks are not modified; instead versioned-hdf5 appends the changed chunks to the raw dataset and creates a brand new virtual dataset.
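For context, this is roughly what the stitching looks like at the C API level. This is a minimal sketch, not versioned-hdf5's actual code (which goes through h5py); the function name `make_version`, the `/raw_data` path, the sizes and the offsets are made up for illustration, and error checking is omitted:

```c
#include "hdf5.h"

#define N_CHUNKS   10000   /* chunks per version (assumption) */
#define CHUNK_SIZE 1000    /* elements per chunk (assumption) */
#define RAW_SIZE   (N_CHUNKS * CHUNK_SIZE)

/* Create one version as a virtual dataset: every chunk is mapped onto a slice
 * of the shared "/raw_data" dataset living in the same file ("." below).
 * raw_offsets[i] is the element offset of chunk i inside /raw_data. */
static hid_t make_version(hid_t file, const char *name, const hsize_t *raw_offsets)
{
    hsize_t vdims[1]    = {N_CHUNKS * CHUNK_SIZE};
    hsize_t raw_dims[1] = {RAW_SIZE};
    hsize_t count[1]    = {CHUNK_SIZE};

    hid_t vspace   = H5Screate_simple(1, vdims, NULL);
    hid_t srcspace = H5Screate_simple(1, raw_dims, NULL);
    hid_t dcpl     = H5Pcreate(H5P_DATASET_CREATE);

    /* One mapping per chunk: a dataset with 10k chunks means 10k calls to
     * H5Pset_virtual, each passing the same source file and dataset name. */
    for (hsize_t i = 0; i < N_CHUNKS; i++) {
        hsize_t vstart[1]   = {i * CHUNK_SIZE};
        hsize_t rawstart[1] = {raw_offsets[i]};

        H5Sselect_hyperslab(vspace,   H5S_SELECT_SET, vstart,   NULL, count, NULL);
        H5Sselect_hyperslab(srcspace, H5S_SELECT_SET, rawstart, NULL, count, NULL);
        H5Pset_virtual(dcpl, vspace, ".", "/raw_data", srcspace);
    }

    hid_t dset = H5Dcreate2(file, name, H5T_NATIVE_DOUBLE, vspace,
                            H5P_DEFAULT, dcpl, H5P_DEFAULT);
    H5Pclose(dcpl);
    H5Sclose(srcspace);
    H5Sclose(vspace);
    return dset;
}
```

Committing a new version re-runs this whole loop, passing the same source file and dataset name to every single mapping.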
The problem we're facing is that libhdf5 performs very poorly in H5DOpen, presumably because nobody else is handling this many virtual datasets, each stitched from huge numbers of references to underlying datasets.
E.g. if you have 100 versions of a dataset with 10k chunks, the HDF5 file will contain 1 raw dataset plus 100 virtual datasets, each with 10k references to the same raw dataset, i.e. 1,000,000 virtual mappings in total.
The problem becomes unbearable with h5repack.
Here's a demo script that builds 95 datasets of 44 to 440 chunks each, then proceeds to create 100 incremental diff versions from them:
https://gist.github.com/crusaderky/b91549221447e966fb2b22c5177df724
h5repack is exceptionally slow to duplicate the resulting file (all tests on NVMe).

If I replace the virtual datasets with real datasets that conceptually store the same metadata, I get a drastic improvement.

Profiling shows that more than half of the time is, unsurprisingly, spent in H5DOpen2 opening the virtual datasets.
Before I redesign versioned-hdf5 from scratch to avoid using virtual datasets, I would like to figure out if it's possible to improve the situation in libhdf5:
https://github.com/deshaw/versioned-hdf5/blob/1a22450e90cea878ed16f99f42a4c82eb966249f/versioned_hdf5/backend.py#L465-L477
When calling `H5Pset_virtual`, I wonder if it would be possible to allow leaving the dataset name NULL in all calls beyond the first one for a virtual dataset. H5DOpen would need to be changed to match. I'm unsure, however, how much benefit such a change would bring.
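To make the suggestion concrete, here is a hypothetical sketch reusing the made-up names from the sketch above. This is not valid with today's libhdf5; the NULL convention is exactly what is being proposed:

```c
/* HYPOTHETICAL - not valid with current libhdf5.  A NULL dataset name in the
 * second and later mappings would mean "same source dataset as the previous
 * mapping", so the metadata would not have to repeat an identical string for
 * each of the 10k chunks. */
for (hsize_t i = 0; i < N_CHUNKS; i++) {
    /* ...select the hyperslabs for chunk i exactly as in the sketch above... */
    if (i == 0)
        H5Pset_virtual(dcpl, vspace, ".", "/raw_data", srcspace);
    else
        H5Pset_virtual(dcpl, vspace, ".", NULL, srcspace);   /* proposed */
}
```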
Separately, h5repack opens each dataset several times over the course of a run:

- `hdf5/tools/src/h5repack/h5repack_copy.c`, line 803 (at 331193f)
- `hdf5/tools/src/h5repack/h5repack_copy.c`, line 885 (at 331193f)
- `hdf5/tools/src/h5repack/h5repack_copy.c`, lines 1315 to 1317 (at 331193f)
- `hdf5/tools/src/h5repack/h5repack_refs.c`, line 102 (at 331193f)
- `hdf5/tools/src/h5repack/h5repack_refs.c`, line 318 (at 331193f)
In theory, I could reduce them to two (one for the input and one for the output) by keeping the object alive for the whole duration of h5repack. However, that would mean having all datasets open at the same time. Are there limits / caveats regarding the number of open objects referenced by `hid_t`?
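For what it's worth, here is a minimal sketch of what "keeping the object alive" could look like. This is not h5repack's actual code; the cache, its fixed size, and the function names are made up for illustration:

```c
#include <string.h>
#include "hdf5.h"

#define MAX_CACHED 4096                 /* arbitrary bound; hence the question about limits */

typedef struct {
    hid_t file;                         /* input or output file */
    char  path[1024];
    hid_t id;
} dset_cache_entry_t;

static dset_cache_entry_t cache[MAX_CACHED];
static size_t n_cached = 0;

/* Open a given dataset at most once per file for the whole run; later lookups
 * of the same (file, path) pair return the hid_t that is already open. */
static hid_t open_dataset_cached(hid_t file, const char *path)
{
    for (size_t i = 0; i < n_cached; i++)
        if (cache[i].file == file && strcmp(cache[i].path, path) == 0)
            return cache[i].id;

    hid_t id = H5Dopen2(file, path, H5P_DEFAULT);
    if (id >= 0 && n_cached < MAX_CACHED) {
        cache[n_cached].file = file;
        strncpy(cache[n_cached].path, path, sizeof(cache[n_cached].path) - 1);
        cache[n_cached].path[sizeof(cache[n_cached].path) - 1] = '\0';
        cache[n_cached].id = id;
        n_cached++;
    }
    return id;
}

/* Close everything once, at the very end of the run. */
static void close_cached_datasets(void)
{
    for (size_t i = 0; i < n_cached; i++)
        H5Dclose(cache[i].id);
    n_cached = 0;
}
```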
Another option could be to rework `tools/src/h5repack/h5repack_copy.c::do_copy_objects` and `h5repack_refs.c::do_copy_refsobjs` to process a single dataset from beginning to end before moving to the next. I guess, however, that the reason why the current algorithm is breadth-first is that you need to ensure target datasets exist before creating cross-references to them?