Handle null values appropriately during segment reload for new derived columns #13212

yashmayya · 2024-05-23T15:19:28Z

Currently, if a transform function returns a null value for a document during segment reload for a new derived column, a NullPointerException is thrown and the segment reload fails.
This is a bug. If a transform function for a derived column returns null for a document during segment creation (as opposed to reload), the derived column's value is set to the field's default null value. If null handling is enabled for the table / column, a null vector index is also created and the doc ID is added to it (to support query-time null handling).
This patch brings the segment reload behavior in line with the segment creation behavior.
There's also some minor refactoring of BaseDefaultColumnHandler::createDerivedColumnV1Indices, which is becoming overly large and convoluted.
Additionally, if a transform function execution results in an error, the value for the derived column is set to null, provided that error on column build failure is set to false in the index loading config.
The first error is also logged, for instance:

WARN [BaseDefaultColumnHandler] [main] Caught 100000 exceptions while evaluating derived column: newWrongArgDateTruncDerivedColumn with function: dateTrunc('abcd', column1). The first input value tuple that led to an error is: [890282370]
...
Caused by: java.lang.IllegalArgumentException: 'abcd' is not a valid Timestamp field
...
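
The behavior described above can be sketched roughly as follows. This is a minimal, self-contained illustration and not the actual Pinot implementation: the class `DerivedColumnReloadSketch`, its `build` method, the `hundredOver` transform, and the use of `java.util.BitSet` in place of a real null vector index are all hypothetical names chosen for the example. It also assumes that a value nulled out because of an error follows the same path as a null returned by the transform function.

```java
import java.util.Arrays;
import java.util.BitSet;
import java.util.List;
import java.util.function.Function;

class DerivedColumnReloadSketch {
  /** Per-document values plus the doc IDs whose value was null (stand-in for a null vector index). */
  static final class Result {
    final Object[] values;
    final BitSet nullDocIds;
    final int numErrors;

    Result(Object[] values, BitSet nullDocIds, int numErrors) {
      this.values = values;
      this.nullDocIds = nullDocIds;
      this.numErrors = numErrors;
    }
  }

  /** Hypothetical transform: returns null for negative inputs and throws on zero. */
  static Object hundredOver(Object input) {
    int value = (Integer) input;
    return value < 0 ? null : Integer.valueOf(100 / value);
  }

  static Result build(List<?> inputs, Function<Object, Object> transform, Object defaultNullValue,
      boolean nullHandlingEnabled, boolean errorOnColumnBuildFailure) {
    Object[] values = new Object[inputs.size()];
    BitSet nullDocIds = new BitSet(inputs.size());
    int numErrors = 0;
    Object firstFailingInput = null;
    RuntimeException firstError = null;
    for (int docId = 0; docId < inputs.size(); docId++) {
      Object input = inputs.get(docId);
      Object output;
      try {
        output = transform.apply(input);
      } catch (RuntimeException e) {
        if (errorOnColumnBuildFailure) {
          throw e; // strict mode: the column build fails
        }
        if (numErrors == 0) {
          // Remember only the first failure so that it can be logged once at the end
          firstFailingInput = input;
          firstError = e;
        }
        numErrors++;
        output = null; // lenient mode: treat the failed evaluation as a null output
      }
      if (output == null) {
        // Fill with the default null value and, if enabled, record the doc ID as null
        output = defaultNullValue;
        if (nullHandlingEnabled) {
          nullDocIds.set(docId);
        }
      }
      values[docId] = output;
    }
    if (numErrors > 0) {
      System.err.println("Caught " + numErrors + " exceptions while evaluating derived column."
          + " First failing input: " + firstFailingInput + ", cause: " + firstError);
    }
    return new Result(values, nullDocIds, numErrors);
  }

  public static void main(String[] args) {
    // Integer.MIN_VALUE is used here only as an illustrative default null value
    Result result = build(Arrays.asList(4, -1, 0, 8), DerivedColumnReloadSketch::hundredOver,
        Integer.MIN_VALUE, true, false);
    System.out.println(Arrays.toString(result.values) + " nullDocIds=" + result.nullDocIds);
  }
}
```

In this sketch, passing `true` for `errorOnColumnBuildFailure` rethrows the first exception instead of filling in the default null value.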

codecov-commenter · 2024-05-23T15:58:05Z

Codecov Report

Attention: Patch coverage is 28.26087% with 99 lines in your changes missing coverage. Please review.

Project coverage is 62.11%. Comparing base (59551e4) to head (c7a116d).
Report is 742 commits behind head on master.

| Files | Patch % | Lines |
|---|---|---|
| ...loader/defaultcolumn/BaseDefaultColumnHandler.java | 26.86% | 88 Missing and 10 partials ⚠️ |
| ...egment/local/function/GroovyFunctionEvaluator.java | 50.00% | 1 Missing ⚠️ |

Additional details and impacted files

@@             Coverage Diff              @@
##             master   #13212      +/-   ##
============================================
+ Coverage     61.75%   62.11%   +0.36%     
+ Complexity      207      198       -9     
============================================
  Files          2436     2558     +122     
  Lines        133233   141034    +7801     
  Branches      20636    21887    +1251     
============================================
+ Hits          82274    87607    +5333     
- Misses        44911    46787    +1876     
- Partials       6048     6640     +592

| Flag | Coverage Δ |
|---|---|
| custom-integration1 | `<0.01% <0.00%> (-0.01%)` ⬇️ |
| integration | `<0.01% <0.00%> (-0.01%)` ⬇️ |
| integration1 | `<0.01% <0.00%> (-0.01%)` ⬇️ |
| integration2 | `0.00% <0.00%> (ø)` |
| java-11 | `62.05% <28.26%> (+0.34%)` ⬆️ |
| java-21 | `61.97% <28.26%> (+0.34%)` ⬆️ |
| skip-bytebuffers-false | `62.09% <28.26%> (+0.34%)` ⬆️ |
| skip-bytebuffers-true | `61.94% <28.26%> (+34.21%)` ⬆️ |
| temurin | `62.11% <28.26%> (+0.36%)` ⬆️ |
| unittests | `62.11% <28.26%> (+0.36%)` ⬆️ |
| unittests1 | `46.64% <0.00%> (-0.25%)` ⬇️ |
| unittests2 | `27.69% <28.26%> (-0.04%)` ⬇️ |

Flags with carried forward coverage won't be shown.

Jackie-Jiang · 2024-05-23T22:12:21Z

#12763 is doing a similar fix. Can you take a look at that one and see how we should proceed?

yashmayya · 2024-05-24T04:37:46Z

Ah damn, I hadn't seen that. I think the other PR has some issues though. Most importantly, it doesn't handle the case where the transform function returns null for some, but not all, docs (I think you've pointed that out as well). Also, it doesn't create the null vector index even if the table / column has null handling enabled. On the other hand, this PR only handles the null value case, whereas the other PR also handles the case where the transform function throws an exception (this is a very small change though).

yashmayya · 2024-07-11T13:31:17Z

@Jackie-Jiang @xiangfu0 to resolve this impasse, I've also added support in this PR for handling transform function errors (largely taken from Xiang's PR - #12763). We can add @xiangfu0 as a co-author (ref) if / when this PR is merged.

One minor difference here is that I'm only handling the case where the invocation itself results in an error, and not the case where the output value class is unsupported, which seems more like a hard error. Also, I've written some additional tests.
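
The distinction drawn here can be illustrated with a small sketch. It is only an assumption about the general shape of such code, not the PR's actual implementation: `ErrorScopeSketch`, `invokeLeniently` and `toStoredValue` are hypothetical names, and the set of accepted output classes is made up for the example. The point is that only the invocation is wrapped in a try/catch, while an unexpected output class still fails loudly.

```java
import java.util.function.Function;

class ErrorScopeSketch {
  /** Lenient invocation: a failure inside the transform function itself becomes a null output. */
  static Object invokeLeniently(Function<Object, Object> transform, Object input) {
    try {
      return transform.apply(input);
    } catch (RuntimeException e) {
      return null;
    }
  }

  /** Deliberately outside any try/catch: an unexpected output class is treated as a hard error. */
  static Object toStoredValue(Object output, Object defaultNullValue) {
    if (output == null) {
      return defaultNullValue;
    }
    if (output instanceof Number || output instanceof String || output instanceof byte[]) {
      return output;
    }
    throw new IllegalStateException("Unsupported output value class: " + output.getClass().getName());
  }

  public static void main(String[] args) {
    // Invocation error: swallowed, so the default null value (-1 here) is stored
    Function<Object, Object> failing = input -> {
      throw new IllegalArgumentException("bad argument");
    };
    System.out.println(toStoredValue(invokeLeniently(failing, 1), -1));

    // Unsupported output class: not swallowed
    Function<Object, Object> wrongType = input -> new Object();
    try {
      toStoredValue(invokeLeniently(wrongType, 1), -1);
    } catch (IllegalStateException e) {
      System.out.println("hard error: " + e.getMessage());
    }
  }
}
```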

Jackie-Jiang

Looks good in general!
@xiangfu0 can you also take a look?

Jackie-Jiang · 2024-07-17T22:12:55Z

.../apache/pinot/segment/local/segment/index/loader/defaultcolumn/BaseDefaultColumnHandler.java

+        }
+
+        if (outputValue == null) {
+          outputValue = fieldSpec.getDefaultNullValue();


Should we consider not setting outputValue here and just leaving it as null? In the handling that follows, we can simply check for null and fill in the default value.

yashmayya · 2024-07-18

I think it's better to handle it here in a single place rather than adding null handling in multiple branches for each type. Any particular reason for doing it that way instead?

Jackie-Jiang · 2024-08-13

Currently we rely on an instanceof check to handle the default value case. I was thinking of explicitly checking for null and always doing the type conversion for non-null values (closer to the current logic), but the new code can avoid unnecessary type conversions, which is even better.
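
The trade-off discussed in this thread can be sketched as follows. This is a hypothetical illustration rather than the code under review: `SinglePlaceNullHandlingSketch`, its `DataType` enum, and the `convert` helper are assumed names. The null is replaced by the default null value in one place, and each type branch then uses an `instanceof` check to skip the conversion when the value already has the target type, which is always the case for the substituted default.

```java
class SinglePlaceNullHandlingSketch {
  enum DataType { INT, LONG, STRING }

  /** Counts how often an actual type conversion was needed (for illustration only). */
  static int numConversions = 0;

  /** Hypothetical stand-in for a type conversion between output value types. */
  static Object convert(Object value, DataType dataType) {
    numConversions++;
    switch (dataType) {
      case INT:
        return value instanceof Number ? ((Number) value).intValue() : Integer.parseInt(value.toString());
      case LONG:
        return value instanceof Number ? ((Number) value).longValue() : Long.parseLong(value.toString());
      case STRING:
        return value.toString();
      default:
        throw new IllegalStateException("Unsupported data type: " + dataType);
    }
  }

  static Object toOutputValue(Object outputValue, DataType dataType, Object defaultNullValue) {
    if (outputValue == null) {
      outputValue = defaultNullValue; // the single place where null is handled
    }
    switch (dataType) {
      case INT:
        return outputValue instanceof Integer ? outputValue : convert(outputValue, dataType);
      case LONG:
        return outputValue instanceof Long ? outputValue : convert(outputValue, dataType);
      case STRING:
        return outputValue instanceof String ? outputValue : convert(outputValue, dataType);
      default:
        throw new IllegalStateException("Unsupported data type: " + dataType);
    }
  }

  public static void main(String[] args) {
    // The default null value already has the target type, so no conversion happens
    System.out.println(toOutputValue(null, DataType.INT, Integer.MIN_VALUE));
    // A String output for an INT column goes through the conversion
    System.out.println(toOutputValue("42", DataType.INT, Integer.MIN_VALUE));
    System.out.println("conversions: " + numConversions);
  }
}
```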

Jackie-Jiang · 2024-07-17T22:14:02Z

.../apache/pinot/segment/local/segment/index/loader/defaultcolumn/BaseDefaultColumnHandler.java

+   * @return the converted output value (either an Integer, an Integer[] or an int[])
+   */
+  private Object getIntOutputValue(Object outputValue, boolean isSingleValue, PinotDataType outputValueType,
+      Integer defaultNullValue, boolean dictionary) {


Pass in a primitive int for better performance. Same for the other methods.

yashmayya · 2024-07-18

The default null value obtained from the field spec is of the primitive wrapper type, and the value we're returning here is also an Object, since that's the element type of the outputValues array. Wouldn't using a primitive int here simply result in a redundant round of unboxing and autoboxing?

Jackie-Jiang · 2024-08-13

Good point. Only one code path uses the primitive type, so I guess we can keep the boxed value. In the future we may consider changing the object array to a primitive array to save memory, but that is out of scope for this PR.
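
The boxing argument can be seen in a small standalone sketch. The class and method names (`BoxingSketch`, `withBoxedDefault`, `withPrimitiveDefault`) are hypothetical and are not taken from the PR. When the default is passed as an `Integer` and returned as an `Object`, the same reference flows through untouched. When the parameter is a primitive `int`, a boxed caller value is unboxed at the call site and boxed again on return.

```java
class BoxingSketch {
  /** Boxed parameter: the default is returned as-is, with no boxing or unboxing. */
  static Object withBoxedDefault(Object outputValue, Integer defaultNullValue) {
    if (outputValue != null) {
      return outputValue;
    }
    return defaultNullValue;
  }

  /** Primitive parameter: returning it as an Object forces an autoboxing step. */
  static Object withPrimitiveDefault(Object outputValue, int defaultNullValue) {
    if (outputValue != null) {
      return outputValue;
    }
    return defaultNullValue; // autoboxed here via Integer.valueOf
  }

  public static void main(String[] args) {
    // Stand-in for a default null value that is already held as a wrapper object
    Integer defaultNullValue = Integer.valueOf(Integer.MIN_VALUE);

    Object boxed = withBoxedDefault(null, defaultNullValue);
    System.out.println("boxed path returns the same reference: " + (boxed == defaultNullValue));

    // The Integer is unboxed for the call and boxed again for the return value
    Object roundTripped = withPrimitiveDefault(null, defaultNullValue);
    System.out.println("primitive path returns an equal value: " + roundTripped.equals(defaultNullValue));
  }
}
```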

yashmayya marked this pull request as ready for review May 23, 2024 16:09

yashmayya force-pushed the derived-column-null-segment-reload branch from 4393cfd to e53524f Compare May 23, 2024 17:08

Jackie-Jiang added ingestion bugfix null support labels May 23, 2024

yashmayya added 2 commits July 11, 2024 16:52

Handle null values appropriately during segment reload for new derived columns

d0edec9

Further minor refactors

d9969ba

yashmayya force-pushed the derived-column-null-segment-reload branch from e53524f to 13d8a2c Compare July 11, 2024 13:31

yashmayya force-pushed the derived-column-null-segment-reload branch from 13d8a2c to 0331e65 Compare July 11, 2024 13:35

Fill derived column with the default null value on transform function errors

c7a116d

yashmayya force-pushed the derived-column-null-segment-reload branch from 0331e65 to c7a116d Compare July 11, 2024 13:38

yashmayya requested review from xiangfu0 and Jackie-Jiang July 11, 2024 13:39

Jackie-Jiang reviewed Jul 17, 2024

yashmayya requested a review from Jackie-Jiang August 8, 2024 17:05

Jackie-Jiang approved these changes Aug 13, 2024

Jackie-Jiang merged commit f65ce5d into apache:master Aug 13, 2024
20 checks passed

This was referenced Aug 13, 2024

[null support] Fill the derived column with the default null value when transform function failed to execute #12763

Closed

[null support] Transform function not handle null value #12762

Closed
