[ntuple] Merger: fix merging RNTuples with projected fields and handling of the output file compression #16944

Open · wants to merge 3 commits into master

Conversation

@silverweed (Contributor) commented Nov 14, 2024

Depends on #16949

More fixes to the merger, enough to successfully merge two RNTuples converted from CMS Open Data:

  • use physical ids, not logical ids, in the APIs that require them
  • don't try to merge aliased columns; reconstruct the projections instead
  • fix how we handle the output file's compression settings for fast merging (hadd's -ff flag). This is an important fix: right now we can write corrupted data when -ff is used (e.g. merge non-split columns as-is but record their type as split in the descriptor). The corruption is recoverable with custom user code, but the RNTuple returns garbage data through the regular API and the problem is not easy to spot.
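
For reference, a minimal sketch of the kind of merge exercised here, written against TFileMerger (the class hadd is built on). The file names are placeholders and this is not the PR's test code:

#include <TFileMerger.h>

int main()
{
   // Programmatic equivalent of `hadd out.root in1.root in2.root`.
   // hadd's -ff flag additionally asks to reuse the compression of the first
   // input file, which is the case this PR fixes for RNTuple outputs.
   TFileMerger merger;
   merger.OutputFile("out.root");
   merger.AddFile("in1.root");
   merger.AddFile("in2.root");
   return merger.Merge() ? 0 : 1;
}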

@jblomer (Contributor) left a comment

Nice! Some smaller comments and I think we should have tests for the encountered issues before merging.

Comment on lines 111 to 113
auto readOpts = RNTupleReadOptions();
// disable the cluster cache so we can catch the exception that happens on LoadEntry
readOpts.SetClusterCache(RNTupleReadOptions::EClusterCache::kOff);
Contributor

Should this stay as is? Background reading may be beneficial for merging.

Contributor Author

Sorry, I forgot to remove that line (it was for debugging)

Comment on lines 201 to 244
DescriptorId_t fInputLogicalId;
DescriptorId_t fInputPhysicalId;
Contributor

I think we may not need the logical columns at all.

Member

Hm, I'm confused: didn't we recently change this so that alias columns (i.e. where the logical column ids matter) always have higher ids than the physical columns, which are what the merger should probably work on?

Contributor Author

Yes, the merger should only work on non-alias columns, so I will remove the distinction and just call it fInputId
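
(For context, a minimal sketch of the distinction being dropped here; it assumes ROOT::Experimental::RColumnDescriptor exposes GetLogicalId() next to the GetPhysicalId() used elsewhere in this PR. An alias column is one whose logical id differs from its physical id; those are the columns the merger should skip.)

#include <ROOT/RNTupleDescriptor.hxx>

// Hypothetical helper, not the PR's code: alias columns point at another
// column's physical data, so their logical and physical ids differ.
bool IsAliasColumn(const ROOT::Experimental::RColumnDescriptor &col)
{
   return col.GetLogicalId() != col.GetPhysicalId();
}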

Comment on lines 89 to 93
int firstFileCompression = kUnknownCompressionSettings;
while (const auto &pitr = itr()) {
   TFile *inFile = dynamic_cast<TFile *>(pitr);
   if (firstFileCompression == kUnknownCompressionSettings)
      firstFileCompression = inFile->GetCompressionSettings();
Contributor

Not sure about this one: since we already diverged the RNTuple compression from the TFile compression (default 505 for RNTuple vs default 101 for TFile), should we rather interpret this option as "look at the first RNTuple as a reference for the compression settings"?

Contributor Author

But technically there is no such thing as an "RNTuple compression setting", right? Each column range may in principle have a different compression setting, so which one should we pick?

@silverweed (Contributor Author), Nov 14, 2024

That said, I agree that if I do -ff when merging RNTuples, I wouldn't expect the output RNTuple to be compressed with 101 if the input was using the default 505 compression...

Member

> Each column range may in principle have a different compression setting, so which one should we pick?

As a user, I would expect each column of the output RNTuple to have the same compression setting as it has in the first input file (i.e. the output structure/metadata should be a 'clone' of the structure/metadata of the first input RNTuple).

Contributor Author

We might decide that's what we want eventually; I propose that initially we just assume that the first column has the same compression as every other column (since we currently don't expose an API to use different compression).

Member

> (since we currently don't expose an API to use different compression).

What do you mean?

Member

> We might decide that's what we want eventually

Hmm ... unless we have a strong reason not to, we ought to behave like TTree, which already does the per-column copy of the compression setting (implicitly).

Contributor Author

> What do you mean?

Even though the format supports different compression per column range, our API currently only lets you set the compression for the entire RNTuple (through RNTupleWriteOptions).
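
For illustration, a minimal sketch of that write-side API (field, RNTuple and file names are placeholders): the compression is a single setting applied to the whole RNTuple through RNTupleWriteOptions, with no per-column knob exposed.

#include <ROOT/RNTupleModel.hxx>
#include <ROOT/RNTupleWriteOptions.hxx>
#include <ROOT/RNTupleWriter.hxx>

using namespace ROOT::Experimental;

void WriteExample()
{
   auto model = RNTupleModel::Create();
   model->MakeField<float>("pt");

   RNTupleWriteOptions opts;
   opts.SetCompression(505); // one setting for the entire RNTuple (505 = zstd level 5, the current default)

   auto writer = RNTupleWriter::Recreate(std::move(model), "ntpl", "out.root", opts);
   // ... fill entries ...
}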

Member

Does that prevent just the user from customizing the compression settings, or does it also prevent a proper implementation inside the merger? [Related issue: we are likely to 'forget' to update the merger when/if we expand the API to allow per-column customization.]

@silverweed (Contributor Author) commented Nov 14, 2024

@jblomer since the fix to the compression became more involved than this small change, I opened a separate PR for it (#16949) and rebased this one onto it.

github-actions bot commented Nov 14, 2024

Test Results

18 files, 18 suites, 3d 21h 52m 54s ⏱️
2 679 tests: 2 679 ✅, 0 💤, 0 ❌
46 360 runs: 46 360 ✅, 0 💤, 0 ❌

Results for commit 879f6e7.

Member

The commit message seems to have two title lines, is this intentional?

Comment on lines 89 to 97
int compression = kUnknownCompressionSettings;
if (firstSrcComp) {
   // user passed -ff or -fk: use the same compression as the first RNTuple we find in the sources.
   // (do nothing here, the compression will be fetched below)
} else if (!defaultComp) {
   // compression was explicitly passed by the user: use it.
   compression = outFile->GetCompressionSettings();
} else {
   // user passed no compression-related options: use default
   compression = RCompressionSetting::EDefaults::kUseGeneralPurpose;
   Info("RNTuple::Merge", "Using the default compression: %d", compression);
}
Member

Can you remind me what the default is if I just do hadd out.root in1.root in2.root? From a user's perspective, I would not expect this to change the compression / recompress, but the code seems to suggest that I have to pass -ff or -fk to get "fast" merging?

Comment on lines 806 to 807
std::cerr << "Adding column " << info.fColumnName << "with id " << srcColumnId
          << " (phys: " << srcColumn.GetPhysicalId() << ")\n";
Member

Is this debug output?

Contributor Author

Correct. I thought I removed it but apparently not from all git timelines. Thanks for spotting it

Comment on lines +802 to +759
if (srcFieldDesc.IsProjectedField())
   continue;

Member

The commit message says to "add some extra verbose messages", but this is clearly a functional change (and should probably have a unit test, not just an integration test?)

Contributor Author

Yes, I split the changes into multiple commits after the fact and this must have slipped into the wrong one. I can put it into the proper commit and add a unit test for it.
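
For reference, a rough sketch of how an input RNTuple with a projected field could be set up for such a unit test. All names are placeholders, and the exact AddProjectedField signature (field plus a name-mapping lambda onto the source field) is assumed from the public RNTupleModel API:

#include <memory>
#include <string>

#include <ROOT/RField.hxx>
#include <ROOT/RNTupleModel.hxx>
#include <ROOT/RNTupleWriter.hxx>

using namespace ROOT::Experimental;

void MakeInputWithProjection(const char *fileName)
{
   auto model = RNTupleModel::Create();
   model->MakeField<float>("pt");
   // "pt_alias" is projected onto the columns of "pt" and stores no data of its own
   model->AddProjectedField(std::make_unique<RField<float>>("pt_alias"),
                            [](const std::string &) { return "pt"; })
      .ThrowOnError();
   auto writer = RNTupleWriter::Recreate(std::move(model), "ntpl", fileName);
   writer->Fill(); // a single default-valued entry is enough for a merger test
}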

@silverweed (Contributor Author)

I rebased this PR to not depend on #16949. Sorry for the confusion.
