Handle null values appropriately during segment reload for new derived columns #13212

Merged

Collaborator

@yashmayya yashmayya commented May 23, 2024

  • Currently, if a transform function returns a null value for a document during segment reload for a new derived column, a NullPointerException is thrown and the segment reload fails.
  • This is a bug. If a transform function for a derived column returns null for a document during segment creation (as opposed to reload), the value for the derived column is set to the default null value for the field. In addition, if null handling is enabled for the table / column, a null vector index is created and the doc ID is added to it (to support query-time null handling).
  • This patch brings the segment reload behavior in line with the segment creation behavior described above.
  • There's also some minor refactoring of BaseDefaultColumnHandler::createDerivedColumnV1Indices, which is becoming overly large and convoluted.
  • Additionally, if a transform function execution results in an error, the value for the derived column is set to null, provided that error on column build failure is set to false in the index loading config.
  • The first error is also logged, for instance:
WARN [BaseDefaultColumnHandler] [main] Caught 100000 exceptions while evaluating derived column: newWrongArgDateTruncDerivedColumn with function: dateTrunc('abcd', column1). The first input value tuple that led to an error is: [890282370]
...
Caused by: java.lang.IllegalArgumentException: 'abcd' is not a valid Timestamp field
...
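
To make the behavior described above concrete, here is a minimal sketch of the per-document loop. It is not the code in BaseDefaultColumnHandler: the class, the `Result` holder, the `evaluate` method, and the list standing in for the null vector index are all hypothetical names invented for illustration, and it assumes that a document whose transform invocation fails is recorded as null in the same way as one whose transform returns null.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Hypothetical sketch: none of these names are Pinot APIs.
final class DerivedColumnReloadSketch {

  /** Outcome of evaluating the derived column for every document in a segment. */
  static final class Result {
    final Object[] values;                               // one value per doc ID
    final List<Integer> nullDocIds = new ArrayList<>();  // stand-in for a null vector index
    int errorCount;                                      // number of failed invocations
    Object firstFailingInput;                            // input that caused the first failure
    RuntimeException firstError;                         // first failure, kept for logging

    Result(int numDocs) {
      values = new Object[numDocs];
    }
  }

  static Result evaluate(Object[] inputs, Function<Object, Object> transform, Object defaultNullValue,
      boolean nullHandlingEnabled, boolean errorOnColumnBuildFailure) {
    Result result = new Result(inputs.length);
    for (int docId = 0; docId < inputs.length; docId++) {
      Object outputValue;
      try {
        outputValue = transform.apply(inputs[docId]);
      } catch (RuntimeException e) {
        if (errorOnColumnBuildFailure) {
          throw e;  // strict mode: fail the whole column build
        }
        if (result.errorCount++ == 0) {
          // Remember only the first failure so it can be logged once at the end.
          result.firstFailingInput = inputs[docId];
          result.firstError = e;
        }
        outputValue = null;  // lenient mode: treat the failed document as null
      }
      if (outputValue == null) {
        // Substitute the field's default null value and, if enabled, record the doc ID.
        outputValue = defaultNullValue;
        if (nullHandlingEnabled) {
          result.nullDocIds.add(docId);
        }
      }
      result.values[docId] = outputValue;
    }
    return result;
  }
}
```

With `errorOnColumnBuildFailure` set to false, a caller could log `errorCount`, `firstFailingInput`, and `firstError` once after the loop, which is roughly the shape of the WARN message shown above.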

@codecov-commenter

codecov-commenter commented May 23, 2024

Codecov Report

Attention: Patch coverage is 28.26087% with 99 lines in your changes missing coverage. Please review.

Project coverage is 62.11%. Comparing base (59551e4) to head (c7a116d).
Report is 742 commits behind head on master.

| Files | Patch % | Lines |
|---|---|---|
| ...loader/defaultcolumn/BaseDefaultColumnHandler.java | 26.86% | 88 Missing and 10 partials ⚠️ |
| ...egment/local/function/GroovyFunctionEvaluator.java | 50.00% | 1 Missing ⚠️ |
Additional details and impacted files
@@             Coverage Diff              @@
##             master   #13212      +/-   ##
============================================
+ Coverage     61.75%   62.11%   +0.36%     
+ Complexity      207      198       -9     
============================================
  Files          2436     2558     +122     
  Lines        133233   141034    +7801     
  Branches      20636    21887    +1251     
============================================
+ Hits          82274    87607    +5333     
- Misses        44911    46787    +1876     
- Partials       6048     6640     +592     
| Flag | Coverage Δ |
|---|---|
| custom-integration1 | <0.01% <0.00%> (-0.01%) ⬇️ |
| integration | <0.01% <0.00%> (-0.01%) ⬇️ |
| integration1 | <0.01% <0.00%> (-0.01%) ⬇️ |
| integration2 | 0.00% <0.00%> (ø) |
| java-11 | 62.05% <28.26%> (+0.34%) ⬆️ |
| java-21 | 61.97% <28.26%> (+0.34%) ⬆️ |
| skip-bytebuffers-false | 62.09% <28.26%> (+0.34%) ⬆️ |
| skip-bytebuffers-true | 61.94% <28.26%> (+34.21%) ⬆️ |
| temurin | 62.11% <28.26%> (+0.36%) ⬆️ |
| unittests | 62.11% <28.26%> (+0.36%) ⬆️ |
| unittests1 | 46.64% <0.00%> (-0.25%) ⬇️ |
| unittests2 | 27.69% <28.26%> (-0.04%) ⬇️ |


@yashmayya yashmayya marked this pull request as ready for review May 23, 2024 16:09
@yashmayya yashmayya force-pushed the derived-column-null-segment-reload branch from 4393cfd to e53524f Compare May 23, 2024 17:08
@Jackie-Jiang
Contributor

#12763 is doing a similar fix. Can you take a look at that one and see how we should proceed?

@yashmayya
Collaborator Author

Ah damn, I hadn't seen that. I think that the other PR has some issues though - most importantly, it doesn't handle the case where the transform function returns null for some, but not all docs (I think even you've pointed that out). Also, it doesn't create the null vector index even if the table / column has null handling enabled. On the other hand, this PR is only handling the null value case, whereas the other PR is also handling the case where the transform function throws an exception (this is a very small change though).

@yashmayya yashmayya force-pushed the derived-column-null-segment-reload branch from e53524f to 13d8a2c Compare July 11, 2024 13:31
@yashmayya
Collaborator Author

@Jackie-Jiang @xiangfu0 to resolve this impasse, I've also added support in this PR for handling transform function errors (largely taken from Xiang's PR - #12763). We can add @xiangfu0 as a co-author (ref) if / when this PR is merged.

One minor difference here is that I'm only handling the case where the invocation itself results in an error, and not when there is an unsupported output value class which seems more like a hard error. Also, I've written some additional tests.

@yashmayya yashmayya force-pushed the derived-column-null-segment-reload branch from 13d8a2c to 0331e65 Compare July 11, 2024 13:35
@yashmayya yashmayya force-pushed the derived-column-null-segment-reload branch from 0331e65 to c7a116d Compare July 11, 2024 13:38
Contributor

@Jackie-Jiang Jackie-Jiang left a comment

Looks good in general!
@xiangfu0 can you also take a look?

}

if (outputValue == null) {
  outputValue = fieldSpec.getDefaultNullValue();
Contributor

Should we consider not setting outputValue here and just leaving it as null? In the handling that follows, we can simply check for null and fill in the default value.

Collaborator Author

I think it's better to handle it here in a single place rather than adding null handling in multiple branches for each type? Any particular reason for doing it that way instead?

Contributor

Currently we rely on an instanceof check to handle the default value case. I was thinking of explicitly checking for null and always doing the type conversion for non-null values (closer to the current logic), but the new code avoids unnecessary type conversions, which is even better.
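
For readers following this thread, the approach being agreed on can be sketched roughly as follows: substitute the default null value once, up front, and let an instanceof check short-circuit the type conversion for values that already have the target type (which includes the substituted default). The class and method names here are hypothetical, and the string-parsing fallback is an assumption made only so the example is complete; the real method converts via PinotDataType.

```java
// Hypothetical sketch: not the actual getIntOutputValue implementation.
final class IntOutputValueSketch {

  /** Resolves a single-value INT output, handling null in exactly one place. */
  static Object resolve(Object outputValue, Integer defaultNullValue) {
    if (outputValue == null) {
      outputValue = defaultNullValue;  // the only null check in the whole path
    }
    return toInteger(outputValue);
  }

  /** Converts to Integer, skipping the conversion when the value already is one. */
  static Integer toInteger(Object value) {
    if (value instanceof Integer) {
      return (Integer) value;  // covers the substituted default: no conversion needed
    }
    if (value instanceof Number) {
      return ((Number) value).intValue();
    }
    // Assumed fallback for illustration only.
    return Integer.parseInt(value.toString().trim());
  }
}
```

Because the default is already an Integer, `resolve(null, defaultNullValue)` hands back the same object that was passed in, with no per-type null branch and no conversion.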

   * @return the converted output value (either an Integer, an Integer[] or an int[])
   */
  private Object getIntOutputValue(Object outputValue, boolean isSingleValue, PinotDataType outputValueType,
      Integer defaultNullValue, boolean dictionary) {
Contributor

Pass in a primitive int for better performance. Same for the other methods.

Collaborator Author

The default null value obtained from the field spec will be of the primitive object wrapper class type, and the value that we're returning here is also an Object since that's the type for the outputValues array. Won't using a primitive int here simply result in a redundant additional autoboxing and unboxing?

Contributor

Good point. Only one code path is using the primitive type, so I guess we can keep the boxed value. In the future we may consider changing the object array to primitive array to save memory, but that is out of the scope of this PR
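
The boxing trade-off discussed in this thread can be illustrated with a small sketch. The two hypothetical methods below fill an Object[] (standing in for the outputValues array) with a default value: one receives the boxed Integer and stores the reference as-is, the other receives a primitive int and therefore has to box on every assignment. The names are invented for illustration and do not appear in the PR.

```java
// Hypothetical sketch of the boxed vs. primitive parameter trade-off.
final class BoxedDefaultSketch {

  /** Boxed parameter: the same Integer reference is stored in every slot. */
  static Object[] fillBoxed(int numDocs, Integer defaultNullValue) {
    Object[] out = new Object[numDocs];
    for (int i = 0; i < numDocs; i++) {
      out[i] = defaultNullValue;  // plain reference copy, no boxing
    }
    return out;
  }

  /** Primitive parameter: the caller unboxes once, then each store boxes again. */
  static Object[] fillPrimitive(int numDocs, int defaultNullValue) {
    Object[] out = new Object[numDocs];
    for (int i = 0; i < numDocs; i++) {
      out[i] = defaultNullValue;  // compiles to Integer.valueOf(defaultNullValue)
    }
    return out;
  }
}
```

Both variants produce equal values; the difference is only the extra unbox/box round trip, which is why a primitive parameter would pay off mainly if the output array itself became a primitive array.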

@yashmayya yashmayya requested a review from Jackie-Jiang August 8, 2024 17:05