DataLoader feature requests #492

Open
3 of 8 tasks
iguinn opened this issue May 15, 2023 · 11 comments
Labels: enhancement (New feature or request), flow (High-level data management), help wanted (Extra attention is needed)

iguinn (Collaborator) commented May 15, 2023

Pygama now has a DataLoader that has been extremely helpful in accessing data across multiple runs and different tiers. As this tool has gotten more users, people have also had ideas for new features that will be useful for a variety of analysis tasks. This issue will be used to compile a list of those ideas; as people have new ideas, post them here, and I will try to add them to this top-level post. If someone is interested in implementing one or more of these, I can also put your name next to it. We can also discuss specifics of implementation in the comments below.

  • Optional argument to append or overwrite when adding data selection with setters; see the sketch after this list (Grace; Let DataLoader default setters overwrite (not append) + lgdo.Struct.int_dtype #493)
  • Get data as iterators (currently implemented in two different ways; need to settle on how to best do this) (Ian/Luigi)
  • Create template waveforms over a selection
  • Randomly sample waveforms within a selection
  • Access to metadata and analysis parameters
  • "Lazy" construction of the entry list for use with iterators and waveform browsers
  • Check performance vs loading files on their own; implement optimizations if possible -> DataLoader performance is bad? #521
  • Add clearer error messages
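
A rough sketch of how the first item could look from the user's side. The set_files/set_cuts calls follow the current DataLoader setters; the config paths are placeholders and the append keyword is purely hypothetical, only illustrating the request:

from pygama.flow import DataLoader

# placeholder paths to the loader configuration and file database
dl = DataLoader("loader_config.json", "filedb_config.json")

# current behavior: repeated calls to a setter add to the existing selection
dl.set_files("timestamp == '20220716T104550Z'")

# hypothetical keyword from the first item above: overwrite by default,
# append only on request
dl.set_files("timestamp == '20220716T130443Z'", append=True)
dl.set_cuts({"hit": "trapEmax > 1000"}, append=False)  # replaces any previous cuts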
gipert (Member) commented May 16, 2023

I also think that, since many new features will come, we should think about how to refactor the code to make it more readable. It's very hard at the moment for new people to understand the code and contribute (data_loader.py is huge!)

gipert (Member) commented May 16, 2023

We should also evaluate the performance and make sure that using the data loader does not add significant overhead to just manually loading the data from a known list of files (with and without filesystem caching)

iguinn (Collaborator, Author) commented May 16, 2023

Another thing I'll add to the list is clearer error messages, along with sanity checks (again with clear messages) as things are loading.

gipert (Member) commented May 17, 2023

Also: .load() currently fails if the provided entry list (Pandas DataFrame) is not properly indexed (e.g. if it has been previously filtered). This can be fixed on the user side by calling .reset_index(drop=True) on the entry list, but it should be done internally.

Alternatively, the code should be made independent of the DataFrame indices.
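
For reference, a minimal illustration of the user-side workaround described above, assuming a DataLoader dl that has already been configured with setters and assuming build_entry_list returns the entry list as a DataFrame (the cut expression is made up):

# build the entry list, then filter it with an ordinary pandas mask
el = dl.build_entry_list()
el = el[el["trapEmax_cal"] > 1000]   # filtering leaves gaps in the index

# workaround until this is handled internally: renumber the rows
el = el.reset_index(drop=True)
data = dl.load(el)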

erin717 (Contributor) commented May 17, 2023

For accessing analysis parameters such as the filter parameters or calibration results, do we want that to be handled by the DataLoader or by pylegendmeta?

gipert (Member) commented May 18, 2023

I would argue that we should try to avoid having LEGEND-specific stuff in the DataLoader, such as pylegendmeta.

erin717 (Contributor) commented May 18, 2023

I agree. I ask because, if that's the case, we should remove or rethink the fifth bullet on the list, as it goes beyond the scope of what the data loader should be doing.

bossioel commented

Hi all! I spent a couple of days trying to use the data loader to load the SiPM data, and I want to report some feedback (I hope this can be useful for future improvements).

  • One problem in general (also for the germanium data) is that if you try to load a variable, or cut on a variable, that doesn't exist, you don't get any error message. I think this is a problem especially when applying cuts, because you can't tell from the output whether it is empty because the cuts removed everything or because you asked for something that doesn't exist (but I see there is already a similar point on the list).
  • Similarly, there is no error if you try to load channels that are not in the files. For example, I was trying to load some SiPM channels which are not in the hit files (Patrick has to fix an issue with the metadata here), and instead of getting an error I got values in the arrays (energy_in_pe and trigger_pos) that didn't make any sense (e.g. all NaN in the energy and nonsensical numbers in the trigger_pos of the same event).
  • Last, working with the data loader for the SiPM data turned out to be quite complicated, because pandas does not handle lists in cells very robustly (but I am not an expert here, so I am not sure how much of this is my fault or how much it could be made easier). Also, cuts on these variables cannot be set with set_cuts (or I did not find a way to do it).

gipert pinned this issue Oct 12, 2023
jasondet (Collaborator) commented

At the CD review today, one of the reviewers (Ren Cooper) recommended we look into "FastQuery" -- basically fast data loading with selections for HDF5 files:
https://dav.lbl.gov/archive/Events/SC05/HDF5FastQuery/index.html

jasondet (Collaborator) commented

Another suggestion from Chris Jillings: https://www.dcache.org

iguinn (Collaborator, Author) commented Jan 3, 2024

I've been messing around a bit with the dataloader and view_as, and I think they will work really well together:

it = loader.load_iterator()
for tb, i, n in it:
    # n is the number of valid rows in the chunk buffer tb;
    # the last chunk may be only partially filled
    print(tb.view_as("pd")[:n])

produces an output of:

          trapEftp
0     14890.890625
1     16785.869141
2     13618.189453
3     11785.408203
4     16502.757812
...            ...
3195  18352.054688
3196  10460.607422
3197  11076.259766
3198  14756.632812
3199  17032.175781

[3200 rows x 1 columns]
          trapEftp
0     13427.619141
1     15362.182617
2     17098.033203
3     15356.582031
4     16032.803711
...            ...
3195  10453.760742
3196  14521.578125
3197  14786.087891
3198  13473.371094
3199  16450.130859
...

This would make an iterative pandas analysis really easy (or Awkward, or numpy, or whatever else we expand view_as to support in the future).

I would propose that we keep load_iterator as an internal thing (and perhaps rename it to _load_iterator to hide it), and implement a next method that wraps load_iterator and uses view_as to loop through the data however you want. This could look like:

for df in loader.next("pd", options):
    ...  # analyze the dataframe chunk

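A minimal sketch of what such a wrapper could look like, assuming the load_iterator behavior shown above; the method name, signature, and the slicing of the view are only illustrative:

def next(self, fmt="pd", **kwargs):
    """Yield one chunk at a time, converted with view_as."""
    for tb, i, n in self.load_iterator(**kwargs):
        # drop the unfilled tail; only relevant for the last chunk
        yield tb.view_as(fmt)[:n]
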
One issue is that the last entry is not going to have the same size as the others. In the above example, I solve this by making a slice of the view. This might not be feasible with every view_as we could imagine. For example, if we did view_as for histograms, we wouldn't be able to slice it in this way. I can think of a few possible solutions to this:

  1. Do it as above and restrict view_as and next to data structures that are truly just views
  2. Give view_as arguments to control the range, so in the loop we do view_as("pd", start=0, stop=n) or something like that
  3. Have the iterator resize the chunked tables; as implemented now, this would require reallocating the memory in the LGDO objects unless we added the ability to have a different capacity from size like in C++ vectors (this already exists for VectorOfVectors). This is somewhat high effort though.

Another issue is that the entrylist still has to get loaded before we can do load_iterator, which takes quite a while (and would be abysmally slow and probably have memory problems if we wanted to load, say, all calibrations in a partition). My proposal here would be to implement a lazy entrylist that calculates maybe one file at a time. If we gave it a list-like interface, it could still work very well with the LH5Iterator. This would also be a much more efficient way of dealing with file caching, I think.
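
A very rough sketch of the lazy entry-list idea, assuming per-file entry lists can be built one at a time (the _build_entries_for_file helper is hypothetical; today the entry list is built for the whole selection at once):

class LazyEntryList:
    """List-like entry list that only computes a file's entries when accessed."""

    def __init__(self, loader, files):
        self.loader = loader   # configured DataLoader
        self.files = files     # files in the current selection
        self._cache = {}       # file index -> entries for that file

    def _entries(self, i_file):
        if i_file not in self._cache:
            # hypothetical helper: apply the cuts to a single file
            self._cache[i_file] = self.loader._build_entries_for_file(self.files[i_file])
        return self._cache[i_file]

    def __getitem__(self, i):
        # walk file by file until the requested global entry is reached
        for i_file in range(len(self.files)):
            entries = self._entries(i_file)
            if i < len(entries):
                return entries[i]
            i -= len(entries)
        raise IndexError(i)

With a list-like interface along these lines, something like the LH5Iterator could consume entries without the full entry list ever being materialized (computing the total length would still touch every file, so that part would need more thought).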

gipert unpinned this issue Jan 12, 2024