
I/O performance improvements #100

Merged
merged 34 commits into legend-exp:main on Sep 10, 2024

Conversation

@iguinn
Contributor

iguinn commented Jul 11, 2024

  • Use the low-level h5py API for reads (a minimal sketch of this read path is shown after this list). Note that this resulted in moving the handling of lists of files from _serializers.read.composite to store.read and core.read. Only the _serializers functions use the low-level API; read itself still uses h5py.File to open the file and get the top-level group.
  • Require h5py >= 3.10. Not sure why, but there is a noticeable performance improvement starting at this version.
  • Encode string attributes to utf-8 before writing. This happens implicitly anyway, so I'm not sure why doing it explicitly helps, but it does.
  • Write files using paged aggregation. This is a setting used when opening a file for writing. Right now it is hard-coded to use 64 kB pages, which seemed to be optimal in terms of file size and read speed, although this was only tested on a file with many small datasets. Also use the latest file format version for writing.
    Among these changes, the low-level API made the biggest difference (a factor of almost 2), while the other two changes combine for an improvement of maybe 1.5 or so. The low-level API makes a difference without reprocessing our files, while the other two changes happen at file write time, so they will only be noticed after reprocessing.
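
Here is that sketch: a minimal illustration of reading a dataset through h5py's low-level bindings instead of the high-level h5py.Dataset wrapper. This is not the code added in this PR, and the file and dataset names are hypothetical.

    import h5py
    import numpy as np

    # Hypothetical file and dataset names, for illustration only.
    fname, dname = b"example.lh5", b"geds/raw/energy"

    fid = h5py.h5f.open(fname, h5py.h5f.ACC_RDONLY)  # low-level file handle
    dset = h5py.h5d.open(fid, dname)                 # low-level dataset handle

    # Allocate the output buffer up front and read directly into it,
    # bypassing the high-level h5py.Dataset wrapper.
    buf = np.empty(dset.shape, dtype=dset.dtype)
    dset.read(h5py.h5s.ALL, h5py.h5s.ALL, buf)

    fid.close()

Reading into a pre-allocated numpy buffer like this skips most of the high-level object construction, which is presumably where much of the per-read overhead goes.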


codecov bot commented Jul 11, 2024

Codecov Report

Attention: Patch coverage is 81.13208% with 40 lines in your changes missing coverage. Please review.

Project coverage is 76.69%. Comparing base (7e2c9ee) to head (c134065).
Report is 36 commits behind head on main.

Files with missing lines Patch % Lines
...rc/lgdo/lh5/_serializers/read/vector_of_vectors.py 59.09% 9 Missing ⚠️
src/lgdo/lh5/_serializers/read/composite.py 69.56% 7 Missing ⚠️
src/lgdo/lh5/_serializers/read/ndarray.py 75.86% 7 Missing ⚠️
src/lgdo/lh5/store.py 82.92% 7 Missing ⚠️
src/lgdo/lh5/_serializers/read/utils.py 84.21% 3 Missing ⚠️
src/lgdo/lh5/core.py 91.42% 3 Missing ⚠️
src/lgdo/lh5/_serializers/read/encoded.py 83.33% 2 Missing ⚠️
src/lgdo/lh5/_serializers/read/array.py 88.88% 1 Missing ⚠️
src/lgdo/lh5/_serializers/read/scalar.py 87.50% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #100      +/-   ##
==========================================
+ Coverage   76.28%   76.69%   +0.41%     
==========================================
  Files          46       46              
  Lines        3036     3133      +97     
==========================================
+ Hits         2316     2403      +87     
- Misses        720      730      +10     


gipert added the performance (Code performance) and lh5 (HDF5 I/O) labels on Jul 12, 2024
@gipert
Member

gipert commented Jul 12, 2024

@oschulz anything to be worried about from the Julia side?

@oschulz

oschulz commented Jul 12, 2024

@oschulz anything to be worried about from the Julia side?

Probably not, but can you give a test output file to @apmypb for testing?

@oschulz

oschulz commented Jul 12, 2024

Probably not, but can you give a test output file to @apmypb for testing?

The best thing would be to replace the HDF5 files (read and write again with these improvements) in the legend-testdata repo as part of a PR there. Then we can test on Julia and finally merge into legend-testdata.

@iguinn
Contributor Author

iguinn commented Jul 20, 2024

Added a change to read files with locking=False (see #78 (comment))
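
For context, disabling HDF5 file locking on read goes through the locking keyword of h5py.File (available in recent h5py/HDF5 versions, which this PR already requires). A minimal illustration, with a hypothetical file name:

    import h5py

    # Open read-only without acquiring an HDF5 file lock; this helps when
    # reading from a read-only filesystem. The file name is a placeholder.
    with h5py.File("example.lh5", "r", locking=False) as f:
        print(list(f.keys()))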

@gipert
Member

gipert commented Aug 12, 2024

I wonder if we really want to hardcode locking=False... Isn't it a useful feature?

@iguinn
Contributor Author

iguinn commented Aug 12, 2024

Yeah, perhaps we should make this an argument that defaults to True for write and False for read?

There are some other settings that I hardcoded here:

    {
        "fs_strategy": "page",
        "fs_page_size": 65536,
        "fs_persist": True,
        "fs_threshold": 1,
        "libver": ("latest", "latest"),
    }

which I can change.
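
For reference, these are keyword arguments accepted by h5py.File at file creation time. A minimal sketch of writing a file with paged aggregation and 64 kB pages (the output path and dataset are placeholders, not the lgdo API):

    import h5py
    import numpy as np

    # Create a file with paged file-space aggregation, 64 kB pages and the
    # latest file format version; path and dataset are placeholders.
    with h5py.File(
        "output.lh5",
        "w",
        fs_strategy="page",
        fs_page_size=65536,
        fs_persist=True,
        fs_threshold=1,
        libver=("latest", "latest"),
    ) as f:
        f.create_dataset("energy", data=np.arange(1000, dtype="float64"))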

@gipert
Member

gipert commented Aug 13, 2024

I'm a bit reluctant to hardcode settings that might produce unexpected behavior or sub-optimal performance on a system different from your test one. Looking at the h5py defaults:

https://github.com/h5py/h5py/blob/959604fe7f6ee1f2bdc5328e9036b96c0ec44323/h5py/_hl/files.py#L376

it seems like your fs_threshold and libver values are the same as the defaults? Should we avoid hardcoding them, then?

Why did you change fs_persist?

About fs_page_size: maybe we should override the page size value in the dataflow only? Or do you feel that this value should be universally good? Again, I'm worried about unexpected behavior in user applications.

@iguinn
Contributor Author

iguinn commented Aug 27, 2024

I un-hardcoded these settings. I added a locking option for read, which defaults to False (I think this is OK in most of our use cases, and it helps when reading from a read-only filesystem), and a page_buffer option for write, which now defaults to 0.

gipert self-requested a review on September 7, 2024 at 08:19
src/lgdo/lh5/store.py (outdated review comments, resolved)
Comment on lines 292 to 300
if page_buffer:
    file_kwargs.update(
        {
            "fs_strategy": "page",
            "fs_page_size": page_buffer,
            "fs_persist": True,
            "fs_threshold": 1,
        }
    )
Member

This code is duplicated in this file and in core.py; can it be defined in only one place?

Contributor Author

Back in pull request #97, I moved a lot of the code that opens files and finds the group being read into store.read and core.read, which created a lot of redundancy. I think both of these functions should be revisited. I actually tried fixing this at some point by having store.read use core.read, but it caused an error that I didn't have time to sort through; as I recall, it had to do with how the output is sometimes an LGDO object and sometimes a tuple.

src/lgdo/lh5/core.py (outdated review comments, resolved)
src/lgdo/lh5/store.py (outdated review comments, resolved)
src/lgdo/lh5/core.py (outdated review comments, resolved)
@gipert
Member

gipert commented Sep 10, 2024

Looks good to me, I'll merge. Thanks a lot Ian for all this!

gipert merged commit cd78184 into legend-exp:main on Sep 10, 2024
15 checks passed
Labels
lh5 (HDF5 I/O), performance (Code performance)
Development

Successfully merging this pull request may close these issues:

Increase read speed by x20-100 for most data
LH5Store.read() performance
4 participants