-
Notifications
You must be signed in to change notification settings - Fork 68
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve struct pruning logic for Parquet Schema Filtering #2733
Improve struct pruning logic for Parquet Schema Filtering #2733
Conversation
Signed-off-by: Niranjan Artal <[email protected]>
Signed-off-by: Niranjan Artal <[email protected]>
build |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The changes look good to me. Are there more changes needed on the java side? There were issues there too, if I remember correctly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good work, @nartal1. LGTM! A couple of minor nits picked.
// We add the Struct to schema_map (with a placeholder for child count). | ||
// But we might remove it later if no children match. This ensures parent-before-children | ||
// ordering in schema_map. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👏
// No match was found so skip the child. | ||
// No match was found so skip the child |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Absolute nit: We are inconsistent with punctuation, in this change. Note that line#235 has a period, but we opt to remove the period here.
My vote would be to punctuate deliberately, and add the period back.
++schema_num_children[our_num_children_index]; | ||
// Record the current size of schema_map before processing the child | ||
// This helps determine if the child was retained after processing. | ||
std::size_t before_child_size = schema_map.size(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: We can get away with making this const
.
std::size_t before_child_size = schema_map.size(); | |
auto const before_child_size = schema_map.size(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @mythrocks for the review. Updated it. PTAL.
Signed-off-by: Niranjan Artal <[email protected]>
Thanks @revans2 for the review. For struct schema evolution issues, we don't need any changes on the java side. I tested with multiple nested structs and it works fine. We can enable the tests mentioned in the description once this PR is merged and we have a new jar including this. |
build |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Thanks for making the code changes suggested in the review.
This PR refines the logic used to prune Parquet schemas when filtering out columns that do not match the query. Previously, an entire struct could be added prematurely to the schema_map and remain there even if none of its children matched. This could result in an incorrect schema when reading Parquet files.
This pull request includes updates to the
NativeParquetJni.cpp
file to improve the handling of schema pruning in thecolumn_pruner
class. The changes focus on ensuring accurate child count updates and proper struct inclusion or exclusion based on the presence of matching children.Improvements to schema pruning logic:
schema_map
.schema_map
before and after processing each child. This helps determine if a child was retained and accurately updates the child count.schema_map
andschema_num_children
if none of their children are included, ensuring that only relevant structs are retained.I tested this PR with these issues - NVIDIA/spark-rapids#11619, NVIDIA/spark-rapids#11620, NVIDIA/spark-rapids#11621, NVIDIA/spark-rapids#11628 and NVIDIA/spark-rapids#11629. I don't see the test failures anymore. I will close those issues after enabling the tests which would require the PR in spark-rapids.