When writing an HDF5 file to HSDS with H5PYD, it appears that although chunks are being created in the final output file, the initial write of the data operates in a contiguous manner. This would sometimes produce interrupts (HTTP request errors) when writing large, ~GB-size HDF5 files with H5PYD to HSDS, despite there being more than enough memory in each of the HSDS data nodes. Writing smaller, ~MB-size files was hit and miss, and ~KB-size files had no issues. The 3D datasets in the HDF5 files used in these tests (~GB-, ~MB-, and ~KB-size) were filled with 3D random numpy arrays.
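For context, a minimal sketch of the write pattern used in these tests (the domain, endpoint, shape, and chunk layout are placeholder values, not from the original report). Passing data= to create_dataset goes through the initialization path in group.py that sends the whole array in effectively one contiguous write:

import numpy as np
import h5pyd

# ~1.6 GB of float64 data (placeholder shape)
data = np.random.rand(512, 512, 768)

# Placeholder domain and endpoint
with h5pyd.File("/home/myuser/chunk_test.h5", "w",
                endpoint="http://hsds.example.org:5101") as f:
    # Initialization currently does dset[...] = data, i.e. one large write
    dset = f.create_dataset("data", data=data, chunks=(64, 64, 96))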
In order to use the H5PYD ChunkIterator in create_dataset, the following fix is suggested:
The line below is added to the import statements in the group.py file in h5pyd/_hl:
from h5pyd._apps.chunkiter import ChunkIterator
In the group.py file under h5pyd/_hl, change lines 334-336 from this:
if data is not None:
    self.log.info("initialize data")
    dset[...] = data
to this:
if data is not None:
    self.log.info("initialize data")
    # dset[...] = data
    it = ChunkIterator(dset)
    for chunk in it:
        dset[chunk] = data[chunk]
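For reference, the same chunk-by-chunk pattern can also be exercised outside of create_dataset. A minimal standalone sketch (placeholder domain, shape, and chunk layout), creating the dataset empty and then filling it one chunk per request:

import numpy as np
import h5pyd
from h5pyd._apps.chunkiter import ChunkIterator

data = np.random.rand(256, 256, 256)

with h5pyd.File("/home/myuser/chunk_test.h5", "w") as f:
    # Create the dataset without data, then fill it chunk by chunk so each
    # request to HSDS carries roughly one chunk's worth of data.
    dset = f.create_dataset("data", shape=data.shape, dtype=data.dtype,
                            chunks=(64, 64, 64))
    for chunk in ChunkIterator(dset):
        # chunk is a tuple of slices covering one chunk of the dataset
        dset[chunk] = data[chunk]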
That's a good solution for initializing the dataset in h5pyd.
There's a max request size limit (it defaults to 100 MB), so the server will respond with a 413 error if you try to write more than that much data in one request. I don't know if that explains the problems you had with writing larger datasets or not.
I'd been planning to make changes that would paginate large writes - basically have the code for dset[...] = data send multiple requests to the server if the data is too large. Read operations are supported this way now. Your approach would be easier to implement since it just needs to deal with the dataset initialization. Have you tried making this change yourself?
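A rough sketch of what paginating a write could look like, splitting the assignment into slabs along the first axis so each request stays under the 100 MB limit mentioned above. The helper function and its name are illustrative only, not part of the h5pyd API:

def paginated_write(dset, data, max_request_size=100 * 1024 * 1024):
    # Write a numpy array `data` to `dset` in slabs along axis 0 so that no
    # single request exceeds max_request_size bytes (illustrative helper).
    row_bytes = data[0].nbytes                       # bytes per index along axis 0
    rows_per_request = max(1, max_request_size // row_bytes)
    for start in range(0, data.shape[0], rows_per_request):
        stop = min(start + rows_per_request, data.shape[0])
        dset[start:stop, ...] = data[start:stop, ...]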