[ntuple] Merger: fix merging RNTuples with projected fields and handling of the output file compression #16944

Open · wants to merge 3 commits into master

Conversation

@silverweed (Contributor) commented Nov 14, 2024

Depends on #16949

More fixes to the merger, enough to successfully merge two RNTuples converted from CMS Open Data:

  • use physical ids, not logical ids, in the APIs that require them
  • don't try to merge aliased columns; reconstruct the projections instead
  • fix how we handle the output file's compression settings for fast merging (hadd's -ff flag). This is an important fix: right now we can write corrupted data when -ff is used (e.g. merge non-split columns as-is but record their type as split in the descriptor). The corruption is recoverable with custom user code, but the RNTuple returns garbage data through the regular API and the problem is not easy to spot.
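
For reference, a minimal sketch of the kind of merge exercised here, written against TFileMerger (the class hadd is built on). The file names are placeholders and this is not the PR's test code:

#include <TFileMerger.h>

int main()
{
   // Programmatic equivalent of `hadd out.root in1.root in2.root`.
   // hadd's -ff flag additionally asks to reuse the compression of the first
   // input file, which is the case this PR fixes for RNTuple outputs.
   TFileMerger merger;
   merger.OutputFile("out.root");
   merger.AddFile("in1.root");
   merger.AddFile("in2.root");
   return merger.Merge() ? 0 : 1;
}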

@jblomer (Contributor) left a comment

Nice! Some smaller comments and I think we should have tests for the encountered issues before merging.

Comment on lines 111 to 113
auto readOpts = RNTupleReadOptions();
// disable the cluster cache so we can catch the exception that happens on LoadEntry
readOpts.SetClusterCache(RNTupleReadOptions::EClusterCache::kOff);
Contributor

Should this stay as is? Background reading may be beneficial for merging.

Contributor Author

Sorry, I forgot to remove that line (it was for debugging)

Comment on lines 201 to 244
DescriptorId_t fInputLogicalId;
DescriptorId_t fInputPhysicalId;
Contributor

I think we may not need the logical columns at all.

Member

Hm, I'm confused: didn't we recently change this so that alias columns (i.e. where the logical column ids matter) always have higher ids than the physical columns, which are what the merger should probably work on?

Contributor Author

Yes, the merger should only work on non-alias columns, so I will remove the distinction and just call it fInputId
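
(For context, a minimal sketch of the distinction being dropped here; it assumes ROOT::Experimental::RColumnDescriptor exposes GetLogicalId() next to the GetPhysicalId() used elsewhere in this PR. An alias column is one whose logical id differs from its physical id; those are the columns the merger should skip.)

#include <ROOT/RNTupleDescriptor.hxx>

// Hypothetical helper, not the PR's code: alias columns point at another
// column's physical data, so their logical and physical ids differ.
bool IsAliasColumn(const ROOT::Experimental::RColumnDescriptor &col)
{
   return col.GetLogicalId() != col.GetPhysicalId();
}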

Comment on lines 89 to 93
int firstFileCompression = kUnknownCompressionSettings;
while (const auto &pitr = itr()) {
   TFile *inFile = dynamic_cast<TFile *>(pitr);
   if (firstFileCompression == kUnknownCompressionSettings)
      firstFileCompression = inFile->GetCompressionSettings();
Contributor

Not sure about this one: since we already diverged the RNTuple compression from the TFile compression (default 505 for RNTuple vs default 101 for TFile), should we rather interpret this option as "look at the first RNTuple as a reference for the compression settings"?

Contributor Author

But technically there is no such thing as an "RNTuple compression setting", right? Each column range may in principle have a different compression setting, so which one should we pick?

@silverweed (Contributor Author), Nov 14, 2024

That said, I agree that if I do -ff when merging RNTuples, I wouldn't expect the output RNTuple to be compressed with 101 if the input was using the default 505 compression...

Member

> Each column range may in principle have a different compression setting, so which one should we pick?

As a user, I would expect each column of the output RNTuple to have the same compression setting as it has in the first input file (i.e. the output structure/metadata should be a 'clone' of the structure/metadata of the first input RNTuple).

Contributor Author

We might decide that's what we want eventually; I propose that initially we just assume that the first column has the same compression as every other column (since we currently don't expose an API to use different compression).

Member

> (since we currently don't expose an API to use different compression).

What do you mean?

Member

> We might decide that's what we want eventually

Hmm ... unless we have a strong reason not to, we ought to behave like TTree, which already does the per-column copy of the compression setting (implicitly).

Contributor Author

> What do you mean?

Even though the format supports different compression per column range, our API currently only lets you set the compression for the entire RNTuple (through RNTupleWriteOptions).
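
For illustration, a minimal sketch of that write-side API (field, RNTuple and file names are placeholders): the compression is a single setting applied to the whole RNTuple through RNTupleWriteOptions, with no per-column knob exposed.

#include <ROOT/RNTupleModel.hxx>
#include <ROOT/RNTupleWriteOptions.hxx>
#include <ROOT/RNTupleWriter.hxx>

using namespace ROOT::Experimental;

void WriteExample()
{
   auto model = RNTupleModel::Create();
   model->MakeField<float>("pt");

   RNTupleWriteOptions opts;
   opts.SetCompression(505); // one setting for the entire RNTuple (505 = zstd level 5, the current default)

   auto writer = RNTupleWriter::Recreate(std::move(model), "ntpl", "out.root", opts);
   // ... fill entries ...
}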

Member

Does that prevent just the user from customizing the compression settings, or does it also prevent a proper implementation inside the merger? [Related issue: we are likely to 'forget' to update the merger when/if we expand the API to allow per-column customization.]

@silverweed (Contributor Author) commented Nov 14, 2024

@jblomer since the fix to the compression became more involved than this small change, I opened a separate PR for it (#16949) and rebased this one onto it.

github-actions bot commented Nov 14, 2024

Test Results

18 files, 18 suites, 3d 21h 52m 54s ⏱️
2 679 tests: 2 679 ✅, 0 💤, 0 ❌
46 360 runs: 46 360 ✅, 0 💤, 0 ❌

Results for commit 879f6e7.

Member

The commit message seems to have two title lines, is this intentional?

Comment on lines 89 to 97
int compression = kUnknownCompressionSettings;
if (firstSrcComp) {
   // user passed -ff or -fk: use the same compression as the first RNTuple we find in the sources.
   // (do nothing here, the compression will be fetched below)
} else if (!defaultComp) {
   // compression was explicitly passed by the user: use it.
   compression = outFile->GetCompressionSettings();
} else {
   // user passed no compression-related options: use default
   compression = RCompressionSetting::EDefaults::kUseGeneralPurpose;
   Info("RNTuple::Merge", "Using the default compression: %d", compression);
}
Member

Can you remind me what the default is if I just do hadd out.root in1.root in2.root? From a user's perspective, I would not expect this to change the compression / recompress, but the code seems to suggest that I have to pass -ff or -fk to get "fast" merging?

Comment on lines 806 to 807
std::cerr << "Adding column " << info.fColumnName << "with id " << srcColumnId
          << " (phys: " << srcColumn.GetPhysicalId() << ")\n";
Member

Is this debug output?

Contributor Author

Correct. I thought I removed it but apparently not from all git timelines. Thanks for spotting it

Comment on lines +802 to +759
if (srcFieldDesc.IsProjectedField())
   continue;

Member

The commit message says to "add some extra verbose messages", but this is clearly a functional change (and should probably have a unit test, not just an integration test?)

Contributor Author

Yes, I split the changes into multiple commits after the fact and this must have slipped into the wrong one. I can put it into the proper commit and add a unit test for it.
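
For reference, a rough sketch of how an input RNTuple with a projected field could be set up for such a unit test. All names are placeholders, and the exact AddProjectedField signature (field plus a name-mapping lambda onto the source field) is assumed from the public RNTupleModel API:

#include <memory>
#include <string>

#include <ROOT/RField.hxx>
#include <ROOT/RNTupleModel.hxx>
#include <ROOT/RNTupleWriter.hxx>

using namespace ROOT::Experimental;

void MakeInputWithProjection(const char *fileName)
{
   auto model = RNTupleModel::Create();
   model->MakeField<float>("pt");
   // "pt_alias" is projected onto the columns of "pt" and stores no data of its own
   model->AddProjectedField(std::make_unique<RField<float>>("pt_alias"),
                            [](const std::string &) { return "pt"; })
      .ThrowOnError();
   auto writer = RNTupleWriter::Recreate(std::move(model), "ntpl", fileName);
   writer->Fill(); // a single default-valued entry is enough for a merger test
}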

@silverweed (Contributor Author)

I rebased this PR to not depend on #16949. Sorry for the confusion.
