
Review H5Easy "extend/part" API. #1018

Open · 1uc opened this issue Jun 10, 2024 · 4 comments

Comments

@1uc (Collaborator) commented Jun 10, 2024

In H5Easy there's an API for reading and writing one element at a time:

inline static DataSet dump_extend(File& file,
                                  const std::string& path,
                                  const T& data,
                                  const std::vector<size_t>& idx,
                                  const DumpOptions& options) {

inline static T load_part(const File& file,
                          const std::string& path,
                          const std::vector<size_t>& idx) {

It does this by creating a dataset that can be extended in all directions and that automatically grows whenever the index of the element being written requires it to (negating our ability to spot off-by-one programming errors).

The API for reading/writing one element at a time feels like it would tempt users into writing files that way in a loop, which is a rather serious performance issue on common HPC hardware (and not great on consumer hardware either).

To enable this API, a default chunk size must be chosen; currently it is 10^n. That seems very small and risks creating files that can't be read efficiently. Picking it reasonably large instead might inflate the size of the file by a factor of 100 or more.

I think it might be fine to allow users to read and write single elements of an existing dataset, i.e. without the automatic-growing aspect, together with a warning in the documentation not to use it in a loop. In core we support various selection APIs that are reasonably compact: lists of random points, regular hyperslabs (general ones too), and there's a proposal to allow Cartesian products of simple selections along each axis.
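For illustration, a minimal sketch of what such single-element access on an existing, fixed-size dataset could look like with the core HighFive API (the file and dataset names below are invented, and this is not an H5Easy helper):

#include <highfive/H5File.hpp>
#include <highfive/H5DataSet.hpp>

#include <vector>

int main() {
    // An existing, fixed-size 1-D dataset of 100 doubles; nothing is extendible.
    HighFive::File file("example.h5", HighFive::File::Truncate);
    auto dataset = file.createDataSet<double>("values", HighFive::DataSpace({100}));

    // Write a single element at index 42; an out-of-range index is an error
    // rather than a silent resize of the dataset.
    dataset.select({42}, {1}).write(std::vector<double>{3.14});

    // Read the same element back through an identical one-element selection.
    std::vector<double> out;
    dataset.select({42}, {1}).read(out);

    return 0;
}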

@petlist commented Dec 30, 2024

Hello @1uc ,

I was about to ask a related question when I noticed your post here. I am a user with exactly the use case you're mentioning: I would like to write data to a dataset element by element in a loop. Right now, doing so with the provided dump_extend interface is catastrophically slow.

To elaborate a bit more on the use case: the data is stored in a custom container and is not contiguous in memory. The size of the dataset is, however, known at the time I dump it to a file. What is the recommended way of writing the dataset element by element? At the moment I am trying to implement a solution using the lower-level API that avoids resizing the dataset on every write.

Thanks,
Petr

@1uc (Collaborator, Author) commented Dec 31, 2024

I would try copying the discontiguous data into a contiguous buffer and writing from there. This buffer could either be the full size of the dataset, or just something sufficiently large (candidates: 4 kB, 1 MB, 4 MB). I suspect sorting the data by index (while filling the buffer) will pay off.
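A rough sketch of that approach (the write_batch helper and the (index, value) pairs are made up here to stand in for the custom container): sort by file index, then write each contiguous run of indices into a 1-D dataset of doubles with a single selection write.

#include <highfive/H5DataSet.hpp>
#include <highfive/H5File.hpp>

#include <algorithm>
#include <utility>
#include <vector>

// Gather scattered (index, value) pairs into contiguous buffers and write each
// contiguous run of file indices with one call instead of one call per element.
void write_batch(HighFive::DataSet& dataset,
                 std::vector<std::pair<size_t, double>> batch) {
    std::sort(batch.begin(), batch.end());  // sorting by file index pays off

    size_t run_start = 0;
    while (run_start < batch.size()) {
        // Find the end of a run of consecutive file indices.
        size_t run_end = run_start + 1;
        while (run_end < batch.size() &&
               batch[run_end].first == batch[run_end - 1].first + 1) {
            ++run_end;
        }

        // Copy the run into a contiguous buffer and write it in one go.
        std::vector<double> buffer;
        buffer.reserve(run_end - run_start);
        for (size_t i = run_start; i < run_end; ++i) {
            buffer.push_back(batch[i].second);
        }
        dataset.select({batch[run_start].first}, {buffer.size()}).write(buffer);

        run_start = run_end;
    }
}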

If the elements are contiguous in the file, then that should be sufficient. If they are not, HDF5 supports a selection mechanism. Selections fall into two groups: hyperslabs for somewhat structured selections, and unstructured selection by index (efficient when there's no structure to exploit). HighFive supports a couple of options:

  • {Regular,}HyperSlab for the HDF5 hyperslab API,
  • ProductSet for Cartesian products of (unions of) index intervals (think slices),
  • ElementSet for unstructured selections.
    (There should be examples for each of the three, and tests too; hopefully documentation as well. The first two use the HDF5 hyperslab API internally, the third uses H5Sselect_elements. A short sketch of a hyperslab-style selection and an ElementSet follows below.)

The selection would be used to pick the elements of the dataset in the file (not in memory). If you want to select in memory (because you don't want to copy), then all I can say is that HDF5 supports this, but HighFive doesn't. One can use getId() to get the HDF5 HID of any HighFive object, and then use HDF5 C API directly for any missing functionality.
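As a hedged sketch of that escape hatch (names and sizes invented): the dataset lives in HighFive, but the memory-side selection (every second element of an in-memory buffer, written without copying) is done through the HDF5 C API using getId().

#include <highfive/H5DataSet.hpp>
#include <highfive/H5File.hpp>

#include <hdf5.h>

#include <vector>

int main() {
    HighFive::File file("raw_api.h5", HighFive::File::Truncate);
    auto dataset = file.createDataSet<double>("values", HighFive::DataSpace({50}));

    // In-memory buffer of 100 doubles; only every second one should be written.
    std::vector<double> buffer(100, 1.0);

    // Memory dataspace: 100 elements, of which 50 are selected with stride 2.
    hsize_t mem_dims[1] = {100};
    hid_t mem_space = H5Screate_simple(1, mem_dims, nullptr);
    hsize_t start[1] = {0};
    hsize_t stride[1] = {2};
    hsize_t count[1] = {50};
    H5Sselect_hyperslab(mem_space, H5S_SELECT_SET, start, stride, count, nullptr);

    // File dataspace: the whole 50-element dataset.
    hid_t file_space = H5Dget_space(dataset.getId());

    // Drop down to the C API for the actual transfer.
    H5Dwrite(dataset.getId(), H5T_NATIVE_DOUBLE, mem_space, file_space,
             H5P_DEFAULT, buffer.data());

    H5Sclose(file_space);
    H5Sclose(mem_space);
    return 0;
}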

Please note that today is the last day any HighFive devs can expect to have write access to this repository. I'd like to keep HighFive alive past the end of the Blue Brain Project and intend to maintain it at https://github.com/highfive-devs/highfive. In the event that this repository is made read-only, please feel free to continue the discussion there.

@petlist commented Jan 4, 2025

Hello @1uc,

Thank you for the quick and very detailed reply! Just to clarify your second point a little, since I think this is exactly what I would like to do. Right now I tried something very simple with HighFive; this function just writes a bunch of double values to a file:

// allocate necessary space for the data set since I know the size of my data chunk
const HighFive::DataSpace dataspace = HighFive::DataSpace({buffer->size()});

// chunked layout with at most 50 elements per chunk
const std::vector<hsize_t> chunks(1, std::min((size_t)50, buffer->size()));
HighFive::DataSetCreateProps props;
props.add(HighFive::Chunking(chunks));

// fid_ is (presumably) an open HighFive::File member of the surrounding class
HighFive::DataSet dataset =
    fid_.createDataSet(key, dataspace, HighFive::AtomicType<double>(), props, {}, true);

// write one element per call, advancing a running index into the dataset
size_t counter = 0;
const auto write = [&counter, &options, &dataset, &mask](const double& entry) {
  dataset.select({counter}, mask).write(entry);
  counter++;
};

// this function thread-safely applies writing on all elements in the container
cache_[idx]->applyOnAll(write);

I understand that this line: dataset.select({counter}, mask).write(entry); writes to a chosen slice of the dataset. Is it possible to write a fairly large chunk (4 kB or more, as you mentioned) given a pointer to memory and the chunk size (or a position in a vector and the number of elements I would like to write, instead of writing the whole vector)? Or are you saying that this is currently impossible and the user should use the native HDF5 C API?

Sorry again for the question, I struggled a bit with understanding the API.

Thanks!
Petr

@1uc (Collaborator, Author) commented Jan 5, 2025

I think you're asking about how to change the callback so it can write more than one double. I'm guessing mask is a vector of column indices and therefore you're trying to write certain columns of row counter.

You could use an std::span:

#include <highfive/span.hpp>

const auto write = [&counter, &options, &dataset, &mask](std::span<double> data) {
  dataset.select({counter}, mask).write(data);
  counter++;
};

or a pointer:

const auto write = [&counter, &options, &dataset, &mask](double const * data) {
  dataset.select({counter}, mask).write_raw(data);
  counter++;
};

Here data is the address of the first double (e.g. container.data() + offset), and write_raw will read as many doubles (starting from that address) as there are elements selected by select(...), i.e. mask.size().
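As a standalone illustration of those write_raw semantics, independent of the custom container (file and dataset names are invented): write a block of doubles taken from an arbitrary offset of a std::vector into an arbitrary offset of the dataset.

#include <highfive/H5DataSet.hpp>
#include <highfive/H5File.hpp>

#include <vector>

int main() {
    HighFive::File file("write_raw_demo.h5", HighFive::File::Truncate);
    auto dataset = file.createDataSet<double>("values", HighFive::DataSpace({100}));

    std::vector<double> container(100, 2.0);

    // Write 25 doubles from container[10..34] into dataset[40..64]:
    // write_raw reads exactly as many elements from the pointer as the
    // selection holds, so offset and count fully describe the transfer.
    size_t offset = 10;
    size_t count = 25;
    dataset.select({40}, {count}).write_raw(container.data() + offset);

    return 0;
}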
