Cluster-based scheduling of RNTuple processing in distributed mode #15152
Sibling PR at root-project/roottest#1105
Test Results 9 files 9 suites 1d 10h 31m 39s ⏱️ For more details on these failures, see this check. Results for commit 45a3cb6. ♻️ This comment has been updated with latest results.
2d860b8 to c303cd3
This function may be useful anywhere in our code where we need information about the cluster boundaries. As a first example, use it when scheduling MT work in RNTupleDS.
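To illustrate why cluster boundaries help with scheduling, here is a minimal sketch that hands whole clusters to slots so that no cluster is split across workers. It is not the logic in RNTupleDS: `AssignClustersToSlots` and `ClusterRange` are hypothetical names, `NTupleSize_t` is a local stand-in for the ROOT type, and the pairs are assumed to be `[first, end)`.

```cpp
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Assumption: local stand-in for ROOT::Experimental::NTupleSize_t.
using NTupleSize_t = std::uint64_t;
// Assumption: each pair is [first entry, one past the last entry).
using ClusterRange = std::pair<NTupleSize_t, NTupleSize_t>;

// Hypothetical helper: give each slot a contiguous run of whole clusters.
// The first (nClusters % nSlots) slots receive one extra cluster.
inline std::vector<std::vector<ClusterRange>>
AssignClustersToSlots(const std::vector<ClusterRange> &clusters, std::size_t nSlots)
{
   std::vector<std::vector<ClusterRange>> perSlot(nSlots);
   if (nSlots == 0)
      return perSlot;

   const std::size_t base = clusters.size() / nSlots;
   const std::size_t rem = clusters.size() % nSlots;
   std::size_t next = 0;
   for (std::size_t s = 0; s < nSlots; ++s) {
      const std::size_t count = base + (s < rem ? 1 : 0);
      perSlot[s].assign(clusters.begin() + next, clusters.begin() + next + count);
      next += count;
   }
   return perSlot;
}
```

With three clusters and two slots, the first slot would get two clusters and the second slot one. Balancing by entry count rather than by cluster count is a possible refinement that this sketch leaves out.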
c303cd3 to 45a3cb6
Apologies for the long delay!
One general observation is that we have multiple (too many?) different "entry range" structs. Perhaps we can consolidate them into one `REntryRange` struct that we put in `RNTupleUtil.h`. I think I'd have a preference for representing the range with `(firstEntryNumber, numberOfEntries)` because it makes it easy to represent empty ranges.
It would certainly be nice to combine the code into a single `PrepareNextRanges()` call. I think, e.g., it would be nicer to always work with `fGlobalRange`, which has `[0..inf]` as a default. Let's discuss.
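For discussion, a minimal sketch of what a consolidated range in the (first entry, number of entries) form could look like. Only the name `REntryRange` comes from the comment above; the member names, `IsEmpty()`, `End()`, and `Intersect()` are assumptions, and `NTupleSize_t` is a local stand-in for the ROOT type. The default-constructed range plays the role of the `[0..inf]` global range.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <limits>

// Assumption: local stand-in for ROOT::Experimental::NTupleSize_t.
using NTupleSize_t = std::uint64_t;

// Sketch only: member and method names are hypothetical.
struct REntryRange {
   NTupleSize_t fFirstEntry = 0;
   // Default covers "all entries", standing in for [0..inf].
   NTupleSize_t fNEntries = std::numeric_limits<NTupleSize_t>::max();

   // An empty range is simply one with zero entries.
   bool IsEmpty() const { return fNEntries == 0; }

   // One past the last entry, saturating instead of overflowing.
   NTupleSize_t End() const
   {
      const NTupleSize_t max = std::numeric_limits<NTupleSize_t>::max();
      return (fNEntries > max - fFirstEntry) ? max : fFirstEntry + fNEntries;
   }
};

// Clamp one range to another; disjoint inputs yield an empty range.
inline REntryRange Intersect(const REntryRange &a, const REntryRange &b)
{
   const NTupleSize_t first = std::max(a.fFirstEntry, b.fFirstEntry);
   const NTupleSize_t end = std::min(a.End(), b.End());
   return (end > first) ? REntryRange{first, end - first} : REntryRange{first, 0};
}
```

With this form, clamping a per-file range to the global range always goes through the same call, whether or not the user set a range, and an empty result needs no sentinel value.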
@@ -775,6 +775,8 @@ public:
   NTupleSize_t GetNEntries() const { return fNEntries; }
   NTupleSize_t GetNElements(DescriptorId_t physicalColumnId) const;

   std::vector<std::pair<NTupleSize_t, NTupleSize_t>> GetClusterBoundaries() const;
I think it would be nicer to return a struct with named members. That would also make clear the exact meaning of the boundaries, e.g. `fFirstEntry`, `fLastEntry` or similar (looking at the implementation, it's rather "last entry + 1").
std::pair<std::vector<std::pair<ROOT::Experimental::NTupleSize_t, ROOT::Experimental::NTupleSize_t>>,
          ROOT::Experimental::NTupleSize_t>
The return type is a little heavy. Could this become a struct with named members?
// We consider only the case of 1 process (fNSlots == 1) and 1 or more files to process
// This RNTupleDS instance needs to compute the ranges for the file(s) it is assigned
// at construction time, respecting the user-provided global range boundaries.
Should we assert `fNSlots == 1`?
The logic in `RNTupleDS::PrepareNextRanges` and `RNTupleDS::GetRanges` had to be changed a bit to accommodate the new case. I think it could be streamlined, but I didn't want to change too many things in the same PR.