DataLoader feature requests #492

Open
3 of 8 tasks
iguinn opened this issue May 15, 2023 · 11 comments
Labels: enhancement (New feature or request), flow (High-level data management), help wanted (Extra attention is needed)

iguinn (Collaborator) commented May 15, 2023

Pygama now has a DataLoader that has been extremely helpful in accessing data across multiple runs and different tiers. As this tool has gotten more users, people have also had ideas for new features that will be useful for a variety of analysis tasks. This issue will be used to compile a list of those ideas; as people have new ideas, post them here, and I will try to add them to this top-level post. If someone is interested in implementing one or more of these, I can also put your name next to it. We can also discuss specifics of implementation in the comments below.

  • Optional argument to append or overwrite when adding data selection with setters; see the sketch after this list (Grace; Let DataLoader default setters overwrite (not append) + lgdo.Struct.int_dtype #493)
  • Get data as iterators (currently implemented in two different ways; need to settle on how to best do this) (Ian/Luigi)
  • Create template waveforms over a selection
  • Randomly sample waveforms within a selection
  • Access to metadata and analysis parameters
  • "Lazy" construction of the entry list for use with iterators and waveform browsers
  • Check performance vs loading files on their own; implement optimizations if possible -> DataLoader performance is bad? #521
  • Add clearer error messages
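
A rough sketch of how the first item could look from the user's side. The set_files/set_cuts calls follow the current DataLoader setters; the config paths are placeholders and the append keyword is purely hypothetical, only illustrating the request:

from pygama.flow import DataLoader

# placeholder paths to the loader configuration and file database
dl = DataLoader("loader_config.json", "filedb_config.json")

# current behavior: repeated calls to a setter add to the existing selection
dl.set_files("timestamp == '20220716T104550Z'")

# hypothetical keyword from the first item above: overwrite by default,
# append only on request
dl.set_files("timestamp == '20220716T130443Z'", append=True)
dl.set_cuts({"hit": "trapEmax > 1000"}, append=False)  # replaces any previous cuts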
gipert (Member) commented May 16, 2023

I also think that, since many new features will come, we should think about how to refactor the code to make it more readable. It's very hard at the moment for new people to understand the code and contribute (data_loader.py is huge!)

gipert (Member) commented May 16, 2023

We should also evaluate the performance and make sure that using the data loader does not add significant overhead to just manually loading the data from a known list of files (with and without filesystem caching)

iguinn (Collaborator, Author) commented May 16, 2023

Another thing I'll add to the list is clearer error messages, along with sanity checks (again with clear messages) as things are loading.

gipert (Member) commented May 17, 2023

Also: .load() currently fails if the provided entry list (Pandas DataFrame) is not properly indexed (e.g. if it has been previously filtered). This can be fixed on the user side by calling .reset_index(drop=True) on the entry list, but it should be done internally.

Alternatively, the code should be made independent of the DataFrame indices.
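
For reference, a minimal illustration of the user-side workaround described above, assuming a DataLoader dl that has already been configured with setters and assuming build_entry_list returns the entry list as a DataFrame (the cut expression is made up):

# build the entry list, then filter it with an ordinary pandas mask
el = dl.build_entry_list()
el = el[el["trapEmax_cal"] > 1000]   # filtering leaves gaps in the index

# workaround until this is handled internally: renumber the rows
el = el.reset_index(drop=True)
data = dl.load(el)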

erin717 (Contributor) commented May 17, 2023

For accessing analysis parameters such as the filter parameters or calibration results, do we want that to be handled by the DataLoader or by pylegendmeta?

gipert (Member) commented May 18, 2023

I would argue that we should try to avoid having LEGEND-specific stuff in the DataLoader, such as pylegendmeta.

erin717 (Contributor) commented May 18, 2023

I agree. I ask because, if that's the case, we should remove or rethink the fifth bullet on the list, as it goes beyond the scope of what the data loader should be doing.

bossioel commented

Hi all! I spent a couple of days trying to use the data loader to load the SiPM data, and I want to report some feedback (I hope this can be useful for future improvements).

  • One problem in general (also for the germanium data) is that if you try to load a variable, or cut on a variable, that doesn't exist, you don't get any error message. I think this is a problem especially when applying cuts, because you can't tell from the output whether it is empty because the cuts removed everything or because you asked for something that doesn't exist (but I see there is already a similar point on the list).
  • Similarly, there is no error if you try to load channels that are not in the files. For example, I was trying to load some SiPM channels which are not in the hit files (Patrick has to fix an issue with the metadata here), and instead of getting an error I got values in the arrays (energy_in_pe and trigger_pos) that didn't make any sense (e.g. all NaN in the energy and nonsensical numbers in the trigger_pos of the same event).
  • Last, working with the data loader for the SiPM data turned out to be quite complicated, because pandas does not handle lists in cells very robustly (but I am not an expert here, so I am not sure how much of this is my fault or how much it could be made easier). Also, cuts on these variables cannot be set with set_cuts (or I did not find a way to do it).

gipert pinned this issue Oct 12, 2023
jasondet (Collaborator) commented

At the CD review today, one of the reviewers (Ren Cooper) recommended we look into "FastQuery" -- basically fast data loading with selections for HDF5 files:
https://dav.lbl.gov/archive/Events/SC05/HDF5FastQuery/index.html

jasondet (Collaborator) commented

Another suggestion from Chris Jillings: https://www.dcache.org

iguinn (Collaborator, Author) commented Jan 3, 2024

I've been messing around a bit with the dataloader and view_as, and I think they will work really well together:

it = loader.load_iterator()
for tb, i, n in it:
    # n is the number of valid rows in the chunk buffer tb;
    # the last chunk may be only partially filled
    print(tb.view_as("pd")[:n])

produces an output of:

          trapEftp
0     14890.890625
1     16785.869141
2     13618.189453
3     11785.408203
4     16502.757812
...            ...
3195  18352.054688
3196  10460.607422
3197  11076.259766
3198  14756.632812
3199  17032.175781

[3200 rows x 1 columns]
          trapEftp
0     13427.619141
1     15362.182617
2     17098.033203
3     15356.582031
4     16032.803711
...            ...
3195  10453.760742
3196  14521.578125
3197  14786.087891
3198  13473.371094
3199  16450.130859
...

This would make an iterative pandas analysis really easy (or Awkward, or numpy, or whatever else we expand view_as to support in the future).

I would propose that we keep load_iterator as an internal thing (and perhaps rename it to _load_iterator to hide it), and implement a next method that wraps load_iterator and uses view_as to loop through the data however you want. This could look like:

for df in loader.next("pd", options):
    ...  # analyze the dataframe chunk

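A minimal sketch of what such a wrapper could look like, assuming the load_iterator behavior shown above; the method name, signature, and the slicing of the view are only illustrative:

def next(self, fmt="pd", **kwargs):
    """Yield one chunk at a time, converted with view_as."""
    for tb, i, n in self.load_iterator(**kwargs):
        # drop the unfilled tail; only relevant for the last chunk
        yield tb.view_as(fmt)[:n]
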
One issue is that the last entry is not going to have the same size as the others. In the above example, I solve this by making a slice of the view. This might not be feasible with every view_as we could imagine. For example, if we did view_as for histograms, we wouldn't be able to slice it in this way. I can think of a few possible solutions to this:

  1. Do it as above and restrict view_as and next to data structures that are truly just views
  2. Give view_as arguments to control the range, so in the loop we do view_as("pd", start=0, stop=n) or something like that
  3. Have the iterator resize the chunked tables; as implemented now, this would require reallocating the memory in the LGDO objects unless we added the ability to have a different capacity from size like in C++ vectors (this already exists for VectorOfVectors). This is somewhat high effort though.

Another issue is that the entrylist still has to get loaded before we can do load_iterator, which takes quite a while (and would be abysmally slow and probably have memory problems if we wanted to load, say, all calibrations in a partition). My proposal here would be to implement a lazy entrylist that calculates maybe one file at a time. If we gave it a list-like interface, it could still work very well with the LH5Iterator. This would also be a much more efficient way of dealing with file caching, I think.
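
A very rough sketch of the lazy entry-list idea, assuming per-file entry lists can be built one at a time (the _build_entries_for_file helper is hypothetical; today the entry list is built for the whole selection at once):

class LazyEntryList:
    """List-like entry list that only computes a file's entries when accessed."""

    def __init__(self, loader, files):
        self.loader = loader   # configured DataLoader
        self.files = files     # files in the current selection
        self._cache = {}       # file index -> entries for that file

    def _entries(self, i_file):
        if i_file not in self._cache:
            # hypothetical helper: apply the cuts to a single file
            self._cache[i_file] = self.loader._build_entries_for_file(self.files[i_file])
        return self._cache[i_file]

    def __getitem__(self, i):
        # walk file by file until the requested global entry is reached
        for i_file in range(len(self.files)):
            entries = self._entries(i_file)
            if i < len(entries):
                return entries[i]
            i -= len(entries)
        raise IndexError(i)

With a list-like interface along these lines, something like the LH5Iterator could consume entries without the full entry list ever being materialized (computing the total length would still touch every file, so that part would need more thought).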

gipert unpinned this issue Jan 12, 2024