DataLoader feature requests #492
Comments
I also think that, since many new features will come, we should think about how to refactor the code to make it more readable. It's very hard at the moment for new people to understand the code and contribute.
We should also evaluate the performance and make sure that using the data loader does not add significant overhead compared to just manually loading the data from a known list of files (with and without filesystem caching).
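A minimal sketch of the kind of timing comparison meant here, assuming both loading paths are supplied as plain callables (the placeholder names in the usage comment are illustrative, not real pygama calls):

```python
import time

def time_loader(load_fn, repeats=3):
    """Run load_fn several times and return the wall-clock duration of each run.

    The first run typically sees a cold filesystem cache; later runs are
    cache-warm, so both regimes show up in the returned list.
    """
    durations = []
    for _ in range(repeats):
        t0 = time.perf_counter()
        load_fn()
        durations.append(time.perf_counter() - t0)
    return durations

# Usage sketch (both callables are placeholders, not the real API):
# print("DataLoader:", time_loader(lambda: loader.load()))
# print("manual    :", time_loader(lambda: read_known_file_list(files)))
```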
Another thing I'll add to the list is clearer error messages, and sanity checks (again with clear messages) as things are loading.
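As an illustration of the kind of sanity check with an actionable message meant here (the helper name and check are hypothetical, not existing DataLoader code):

```python
from pathlib import Path

def check_file_list(files):
    """Fail early, with a clear message, if any requested file is missing."""
    missing = [f for f in files if not Path(f).exists()]
    if missing:
        raise FileNotFoundError(
            f"{len(missing)} of {len(files)} requested files were not found "
            f"(e.g. {missing[0]!r}); check the file database and data directory"
        )
```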
Also: alternatively, the code should be made independent of the DataFrame indices.
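One way to read this suggestion, as a sketch (df stands for any DataFrame the loader hands back): rely on positional access or normalize the index, rather than assuming a particular index layout.

```python
import pandas as pd

# a DataFrame with a non-trivial index, as the loader might return
df = pd.DataFrame({"energy": [1.0, 2.0, 3.0]}, index=[10, 20, 30])

first_row = df.iloc[0]                # positional access works for any index
df_clean = df.reset_index(drop=True)  # or normalize to a 0..N-1 RangeIndex
```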
For accessing analysis parameters such as the filter parameters or calibration results, do we want that to be handled by the DataLoader or by pylegendmeta?
I would argue that we should try to avoid having LEGEND-specific stuff, such as pylegendmeta, in the DataLoader.
I agree. I ask because, if that's the case, we should remove or rethink the fifth bullet on the list, as it goes beyond the scope of what the data loader should be doing.
Hi all! I spent a couple of days trying to use the data loader to load the SiPM data, and I want to report some feedback (I hope this can be useful for future improvements).
At the CD review today, one of the reviewers (Ren Cooper) recommended we look into "FastQuery" -- basically fast data loading with selections for HDF5 files:
Another suggestion from Chris Jillings: https://www.dcache.org
I've been messing around a bit with the dataloader and view_as, and I think they will work really well together:

```python
it = loader.load_iterator()
for tb, i, n in it:
    print(tb.view_as("pd")[:n])
```

produces an output of:

Which would make an iterative pandas analysis really easy (or awkward or numpy or whatever we expand view_as to use in the future). I would propose that we keep load_iterator as an internal thing (and perhaps rename it to _load_iterator to hide it), and implement next to wrap load_iterator and use view_as to loop through however you want. This could look like:

```python
for pd in loader.next("pd", options):
    # analyze dataframe
```

One issue is that the last entry is not going to have the same size as the others. In the above example, I solve this by making a slice of the view. This might not be feasible with every view_as we could imagine. For example, if we did view_as for histograms, we wouldn't be able to slice it in this way. I can think of a few possible solutions to this:
Another issue is that the entrylist still has to get loaded before we can do load_iterator, which takes quite a while (and would be abysmally slow and probably have memory problems if we wanted to load, say, all calibrations in a partition). My proposal here would be to implement a lazy entrylist that calculates maybe one file at a time. If we gave it a list-like interface, it could still work very well with the LH5Iterator. This would also be a much more efficient way of dealing with file caching, I think.
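A rough sketch of the next()-style wrapper proposed above, written as a free-standing generator. The loop variables mirror the snippet earlier in this comment, and everything here is an assumption about the eventual interface rather than existing pygama API; a lazy, list-like entrylist would slot into the same loop unchanged.

```python
def iterate_views(loader, library="pd"):
    """Yield each chunk from the loader's iterator as the requested view.

    library is whatever view_as understands ("pd", "np", "ak", ...). The last
    chunk can be shorter than the read buffer, so only the first n_rows
    entries of each view are yielded.
    """
    for tb, entry, n_rows in loader.load_iterator():
        yield tb.view_as(library)[:n_rows]

# Usage, mirroring the proposal above:
# for df in iterate_views(loader, "pd"):
#     ...  # analyze the dataframe chunk
```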
Pygama now has a DataLoader that has been extremely helpful in accessing data across multiple runs and different tiers. As this tool has gotten more users, people have also had ideas for new features that will be useful in the future for a variety of analysis tasks. This issue will be used to compile a list of ideas; as people have new ideas, post them here, and I will try to add them to this top-level post. If you are interested in implementing one or more of these, I can also put your name next to that item. We can also discuss specifics of implementation in the comments below.
DataLoader
- default setters overwrite (not append) + lgdo.Struct.int_dtype (#493)
- Access to metadata and analysis parameters