Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cluster-based scheduling of RNTuple processing in distributed mode #15152

Open
wants to merge 3 commits into
base: master
Choose a base branch
from

Conversation

vepadulano
Copy link
Member

The logic in RNTupleDS::PrepareNextRanges and RNTupleDS::GetRanges had to be changed a bit to accommodate the new case, I think it could be streamlined but I didn't want to change too many things in the same PR.

@vepadulano
Copy link
Member Author

Sibling PR at root-project/roottest#1105

Copy link

github-actions bot commented Apr 5, 2024

Test Results

     9 files       9 suites   1d 10h 31m 39s ⏱️
 2 609 tests  2 605 ✅ 0 💤 4 ❌
21 885 runs  21 881 ✅ 0 💤 4 ❌

For more details on these failures, see this check.

Results for commit 45a3cb6.

♻️ This comment has been updated with latest results.

@vepadulano vepadulano force-pushed the distrdf-rntuple branch 2 times, most recently from 2d860b8 to c303cd3 Compare April 9, 2024 13:35
This function may be useful anywhere in our code we need information about the
cluster boundaries. As a first example, use it when scheduling MT work in
RNTupleDS.
Copy link
Contributor

@jblomer jblomer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apologies for the long delay!

One general observation is that we have multiple (too many?) different "entry range" structs. Perhaps we can consolidate them to one REntryRange struct that we put in RNTupleUtil.h. I think I'd have a preference for representing the range with (firstEntryNumber, numberOfEntries) because it makes it easy to represent empty ranges.

It would be certainly nice to combine the code into a single PrepareNextRanges() call. I think, e.g., it would be nicer to always work with fGlobalRange, which has [0..inf] as a default. Let's discuss.

@@ -775,6 +775,8 @@ public:
NTupleSize_t GetNEntries() const { return fNEntries; }
NTupleSize_t GetNElements(DescriptorId_t physicalColumnId) const;

std::vector<std::pair<NTupleSize_t, NTupleSize_t>> GetClusterBoundaries() const;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be nicer to return a struct with named members. That would also make clear the exact meaning of the boundaries, e.g. fFirstEntry, fLastEntry or similar (looking at the implementation, it's rather "last entry + 1").

Comment on lines +51 to +52
std::pair<std::vector<std::pair<ROOT::Experimental::NTupleSize_t, ROOT::Experimental::NTupleSize_t>>,
ROOT::Experimental::NTupleSize_t>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The return type is a little heavy. Could this become a struct with named members?

Comment on lines +443 to +445
// We consider only the case of 1 process (fNSlots == 1) and 1 or more files to process
// This RNTupleDS instance needs to compute the ranges for the file(s) it is assigned
// at construction time, respecting the user-provided global range boundaries.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we assert fNSlots == 1

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants