Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Need more checks in RDataFrame #17485

Open
acampove opened this issue Jan 22, 2025 · 5 comments
Open

Need more checks in RDataFrame #17485

acampove opened this issue Jan 22, 2025 · 5 comments

Comments

@acampove
Copy link

Explain what you would like to see improved and how.

Hi,

  1. When I do:
from ROOT import RDataFrame

rdf=RDataFrame('tree', [])

I see a crash. That should never happen. I expect a message telling me that the list of files is empty.

  1. When I do:
from ROOT import RDataFrame

rdf=RDataFrame('tree', ['file_1.root', 'file_2.root'])

and I try to save the dataframe with Snapshot, I see a failure. It seems that in my particular case, both files have different number of entries. It took me 3 hours to figure this out. You need a preliminary check for cases like this. Please implement it.

ROOT version

6.34

Installation method

mamba

Operating system

alma9

Additional context

No response

@vepadulano
Copy link
Member

Dear @acampove ,

Thank you for reaching out!

Regarding part 1 of your issue, the check has already been put in place for ROOT 6.34, see

if (fileNameGlobs.size() == 0)
throw std::invalid_argument("RDataFrame: empty list of input files.");
. I can see it working on the C++ prompt, i.e.

root [0] ROOT::RDataFrame df{"events", {}};
Error in <TRint::HandleTermInput()>: std::invalid_argument caught: RDataFrame: empty list of input files.

But for some reason I see there is something wrong on the Python side, I can reproduce a segfault on my machine. Need to investigate.

Regarding part 2 of the issue, I can't understand the problem. The fact both files have different number of entries should not matter. Users of RDataFrame process billions of entries scattered across many files and there is nowhere the requirement that every file has the same amount of entries. Could you give a more detailed, self-contained reproducer?

@acampove
Copy link
Author

acampove commented Jan 22, 2025

Dear @acampove ,

Thank you for reaching out!

Regarding part 1 of your issue, the check has already been put in place for ROOT 6.34, see

root/tree/dataframe/src/RLoopManager.cxx

Lines 1236 to 1237 in 5358492
if (fileNameGlobs.size() == 0)
throw std::invalid_argument("RDataFrame: empty list of input files.");
. I can see it working on the C++ prompt, i.e.

root [0] ROOT::RDataFrame df{"events", {}};
Error in <TRint::HandleTermInput()>: std::invalid_argument caught: RDataFrame: empty list of input files.

But for some reason I see there is something wrong on the Python side, I can reproduce a segfault on my machine. Need to investigate.

Regarding part 2 of the issue, I can't understand the problem. The fact both files have different number of entries should not matter. Users of RDataFrame process billions of entries scattered across many files and there is nowhere the requirement that every file has the same amount of entries. Could you give a more detailed, self-contained reproducer?

Hi @vepadulano

Thanks for your reply. Regarding part 2, there seems to be a misunderstanding. The problem is not that the files have different number of entries. The problem is that the files can have different branches, i.e. the first one can have 20 branches and the second one can have 22.

Regarding the first, issue. There must be a test both for the C++ and python part, every time you implement a feature, like the check you mentioned above. It cannot be, that the first time this breaks is when a user like me runs the code. This has to be tested so that we never see it happening.

Cheers.

@vepadulano
Copy link
Member

vepadulano commented Jan 22, 2025

Dear @acampove ,

Thank you for your reply.

The problem is not that the files have different number of entries. The problem is that the files can have different branches, i.e. the first one can have 20 branches and the second one can have 22.

I understand, but I still do not see how the failure could happen. In principle, if you're using say column x which is present in file 1 but not in file 2, the Snapshot operation should tell you that it didn't find a certain event for column x, when reaching file 2. This can then be addressed by either skipping those files that do not have the column x, or by providing a default value for it. Documentation for this feature is provided at https://root.cern/doc/master/classROOT_1_1RDataFrame.html#missing-values . In any case, since this type of scenario is not that uncommon, could you still provide a reproducer of your issue so that I can further understand how to help?

Regarding the first, issue. There must be a test both for the C++ and python part, every time you implement a feature, like the check you mentioned above. It cannot be, that the first time this breaks is when a user like me runs the code. This has to be tested so that we never see it happening.

Thank you for your input, I completely agree the experience should be equivalent irrespective of how the user approaches the ROOT interfaces. The project uses powerful dynamic C++-Python bindings, so this specific type of error handling is usually transparent, but things might go wrong every once in a while.

@vepadulano
Copy link
Member

I was able to slim down part 1 of this issue into a simple reproducer that does not rely on RDataFrame. You can track the status at the linked issue.

@vepadulano
Copy link
Member

Dear @acampove ,

The underlying Python issue has been fixed, and #17581 is introducing a specific test for the situation you reported in part 1 of this issue.

If you can, I would greatly appreciate a reproducer for part 2 of your issue so that I can make sure your requirements are completely satisfied.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants