-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Handle null values appropriately during segment reload for new derived columns #13212
Handle null values appropriately during segment reload for new derived columns #13212
Conversation
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #13212 +/- ##
============================================
+ Coverage 61.75% 62.11% +0.36%
+ Complexity 207 198 -9
============================================
Files 2436 2558 +122
Lines 133233 141034 +7801
Branches 20636 21887 +1251
============================================
+ Hits 82274 87607 +5333
- Misses 44911 46787 +1876
- Partials 6048 6640 +592
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. |
4393cfd
to
e53524f
Compare
#12763 is doing similar fix. Can you take a look at that one and see how should we proceed? |
Ah damn, I hadn't seen that. I think that the other PR has some issues though - most importantly, it doesn't handle the case where the transform function returns null for some, but not all docs (I think even you've pointed that out). Also, it doesn't create the null vector index even if the table / column has null handling enabled. On the other hand, this PR is only handling the |
e53524f
to
13d8a2c
Compare
@Jackie-Jiang @xiangfu0 to resolve this impasse, I've also added support in this PR for handling transform function errors (largely taken from Xiang's PR - #12763). We can add @xiangfu0 as a co-author (ref) if / when this PR is merged. One minor difference here is that I'm only handling the case where the invocation itself results in an error, and not when there is an unsupported output value class which seems more like a hard error. Also, I've written some additional tests. |
13d8a2c
to
0331e65
Compare
0331e65
to
c7a116d
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good in general!
@xiangfu0 can you also take a look?
} | ||
|
||
if (outputValue == null) { | ||
outputValue = fieldSpec.getDefaultNullValue(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we consider not setting outputValue
, but just leave it as null
? In the following handling, we can simply check null
and fill with default value
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's better to handle it here in a single place rather than adding null
handling in multiple branches for each type? Any particular reason for doing it that way instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Currently we rely on instanceof
check to handle the case of default value. I was thinking to explicitly check for null
, and always do the type conversion for non-null value (closer to the current logic), but the new code can avoid unnecessary type conversions, which is even better
* @return the converted output value (either an Integer, an Integer[] or an int[]) | ||
*/ | ||
private Object getIntOutputValue(Object outputValue, boolean isSingleValue, PinotDataType outputValueType, | ||
Integer defaultNullValue, boolean dictionary) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pass in primitive int
for better performance. Same for other methods
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The default null value obtained from the field spec will be of the primitive object wrapper class type, and the value that we're returning here is also an Object
since that's the type for the outputValues
array. Won't using a primitive int
here simply result in a redundant additional autoboxing and unboxing?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point. Only one code path is using the primitive type, so I guess we can keep the boxed value. In the future we may consider changing the object array to primitive array to save memory, but that is out of the scope of this PR
null
value for a document during segment reload for a new derived column, it results in aNullPointerException
here causing the segment reload to fail.null
for a document during segment creation itself (as opposed to reload), the value for the derived column is set to the default null value for the field. Also, if null handling is enabled for the table / column, a null vector index is also created and the doc ID is added to it (to support query-timenull
handling).BaseDefaultColumnHandler::createDerivedColumnV1Indices
which is becoming overly large and convoluted.