[ntuple] Merger: fix merging RNTuples with projected fields and handling of the output file compression #16944
base: master
Conversation
Nice! Some smaller comments and I think we should have tests for the encountered issues before merging.
tree/ntuple/v7/src/RNTupleMerger.cxx
Outdated
auto readOpts = RNTupleReadOptions();
// disable the cluster cache so we can catch the exception that happens on LoadEntry
readOpts.SetClusterCache(RNTupleReadOptions::EClusterCache::kOff);
Should this stay as is? Background reading may be beneficial for merging.
Sorry, I forgot to remove that line (it was for debugging)
tree/ntuple/v7/src/RNTupleMerger.cxx
Outdated
DescriptorId_t fInputLogicalId;
DescriptorId_t fInputPhysicalId;
I think we may not need the logical columns at all.
Hm, I'm confused: didn't we recently change this so that alias columns (i.e. where logical column ids matter) always have higher ids than physical columns, which is what the merger should probably work on?
Yes, the merger should only work on non-alias columns, so I will remove the distinction and just call it fInputId
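For illustration, a minimal sketch of what the simplified bookkeeping could look like after the rename; the surrounding struct and the fOutputId member are assumptions for the example, not the PR's actual code:

```cpp
#include <string>
#include <ROOT/RNTupleUtil.hxx> // ROOT::Experimental::DescriptorId_t

// Hypothetical shape of the merger's per-column record: one input id suffices
// because alias columns are never merged, so the logical and physical ids of
// every column the merger touches coincide.
struct RColumnInfo {
   std::string fColumnName;
   ROOT::Experimental::DescriptorId_t fInputId;  // replaces fInputLogicalId / fInputPhysicalId
   ROOT::Experimental::DescriptorId_t fOutputId; // id of the corresponding column in the output
};
```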
tree/ntuple/v7/src/RNTupleMerger.cxx
Outdated
int firstFileCompression = kUnknownCompressionSettings;
while (const auto &pitr = itr()) {
   TFile *inFile = dynamic_cast<TFile *>(pitr);
   if (firstFileCompression == kUnknownCompressionSettings)
      firstFileCompression = inFile->GetCompressionSettings();
Not sure about this one: since we already diverged the RNTuple compression from the TFile compression (default 505 for RNTuple vs default 101 for TFile), should we rather interpret this option as "look at the first RNTuple as a reference for the compression settings"?
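For readers not familiar with those numbers: ROOT encodes a compression setting as algorithm * 100 + level, so 505 is ZSTD at level 5 and 101 is ZLIB at level 1. A minimal standalone sketch of the decomposition (not part of the PR):

```cpp
#include <cstdio>

int main()
{
   // ROOT compression settings are encoded as algorithm * 100 + level,
   // e.g. 505 = algorithm 5 (ZSTD) at level 5, 101 = algorithm 1 (ZLIB) at level 1.
   for (int setting : {505, 101}) {
      int algorithm = setting / 100;
      int level = setting % 100;
      std::printf("%d -> algorithm %d, level %d\n", setting, algorithm, level);
   }
   return 0;
}
```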
But technically there is no such thing as "the RNTuple compression setting", right? Each column range may in principle have a different compression setting, so which one should we pick?
That said, I agree that if I do -ff when merging RNTuples, I wouldn't expect the output RNTuple to be compressed with 101 if the input was using the default 505 compression...
Each column range may in principle have a different compression setting, so which one should we pick?
As a user I would expect each column of the output RNTuple to have the same compression setting as it has in the first input file (i.e. the output structure/metadata should be a "clone" of the structure/metadata of the first input RNTuple).
We might decide that's what we want eventually; I propose that initially we just assume that the first column has the same compression as every other column (since we currently don't expose an API to use different compression).
(since we currently don't expose an API to use different compression).
What do you mean?
We might decide that's what we want eventually
Humm ... unless we have a strong reason not to, we ought to be similar to TTree, which already does the per-column copy of the compression setting (implicitly).
What do you mean?
Even though the format supports different compressions per column range, currently from our API you can only set the compression for the entire RNTuple (through RNTupleWriteOptions).
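A minimal sketch of what that looks like from user code, assuming the v7 API around the time of this PR (the field and file names are made up): the compression is set once on RNTupleWriteOptions and applies to the whole RNTuple, with no per-column knob.

```cpp
#include <ROOT/RNTupleModel.hxx>
#include <ROOT/RNTupleWriteOptions.hxx>
#include <ROOT/RNTupleWriter.hxx>

using ROOT::Experimental::RNTupleModel;
using ROOT::Experimental::RNTupleWriteOptions;
using ROOT::Experimental::RNTupleWriter;

void WriteCompressed()
{
   auto model = RNTupleModel::Create();
   auto pt = model->MakeField<float>("pt");

   RNTupleWriteOptions opts;
   opts.SetCompression(101); // ZLIB level 1; one setting for the entire RNTuple

   auto writer = RNTupleWriter::Recreate(std::move(model), "ntpl", "out.root", opts);
   *pt = 1.0f;
   writer->Fill();
}
```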
Does that prevent just the user from customizing the compression settings or does it also prevent the proper implementation inside the merger? [related issue: we are likely to 'forget' to update the Merger when/if we expand the API to allow the per column customization]
f8ee145 to c79e5ba (compare)
Test Results: 18 files, 18 suites, 3d 21h 52m 54s ⏱️ Results for commit 879f6e7. ♻️ This comment has been updated with latest results.
main/src/hadd.cxx
Outdated
The commit message seems to have two title lines; is this intentional?
tree/ntuple/v7/src/RNTupleMerger.cxx
Outdated
int compression = kUnknownCompressionSettings;
if (firstSrcComp) {
   // user passed -ff or -fk: use the same compression as the first RNTuple we find in the sources.
   // (do nothing here, the compression will be fetched below)
} else if (!defaultComp) {
   // compression was explicitly passed by the user: use it.
   compression = outFile->GetCompressionSettings();
} else {
   // user passed no compression-related options: use default
   compression = RCompressionSetting::EDefaults::kUseGeneralPurpose;
   Info("RNTuple::Merge", "Using the default compression: %d", compression);
}
Can you remind me what the default is if I just do hadd out.root in1.root in2.root? From a user's perspective, I would not expect this to change the compression / recompress, but the code seems to suggest that I have to pass -ff or -fk to get "fast" merging?
tree/ntuple/v7/src/RNTupleMerger.cxx
Outdated
std::cerr << "Adding column " << info.fColumnName << "with id " << srcColumnId
          << " (phys: " << srcColumn.GetPhysicalId() << ")\n";
Is this debug output?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Correct. I thought I removed it but apparently not from all git timelines. Thanks for spotting it
if (srcFieldDesc.IsProjectedField())
   continue;
The commit message says to "add some extra verbose messages", but this is clearly a functional change (and should probably have a unit test, not just an integration test?)
Yes, I split the changes into multiple commits after the fact and this must have slipped into the wrong one. I can put it into the proper commit and add a unit test for it.
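As a rough starting point for such a test, a minimal sketch assuming the v7 projected-field API (RNTupleModel::AddProjectedField) as it existed around this PR; the field, ntuple, and file names are made up:

```cpp
#include <functional>
#include <memory>
#include <string>

#include <ROOT/RField.hxx>
#include <ROOT/RNTupleModel.hxx>
#include <ROOT/RNTupleWriter.hxx>

using namespace ROOT::Experimental;

// Writes an RNTuple whose model contains a projected (alias) field.
// Projected fields own no columns of their own, which is why the merger
// has to skip them instead of trying to merge their (non-existent) pages.
void WriteInputWithProjectedField(const std::string &path)
{
   auto model = RNTupleModel::Create();
   model->MakeField<float>("pt");

   // "ptAlias" is a projection onto the columns of "pt".
   auto alias = std::make_unique<RField<float>>("ptAlias");
   model->AddProjectedField(std::move(alias), [](const std::string &) { return "pt"; })
      .ThrowOnError();

   auto writer = RNTupleWriter::Recreate(std::move(model), "ntpl", path);
   writer->Fill();
}
```

A merge test could then write two such files and verify that merging succeeds and that the projected field still reads back as an alias of "pt" in the output.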
47efe21 to bbe3192 (compare)
Since the merger never operates on alias columns, logical and physical id are always the same
bbe3192 to 879f6e7 (compare)
I rebased this PR to not depend on #16949
Depends on #16949
More fixes to the merger, enough to successfully merge two RNTuples converted from CMS Open Data:
-ff flag). This is an important fix because right now we can write corrupted data when -ff is used (e.g. we can merge non-split columns as-is but write their type as split in the descriptor. This is a recoverable corruption using custom user code, but it makes the RNTuple return garbage data through the regular API and it's not easily identifiable).