Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve struct pruning logic for Parquet Schema Filtering #2733

Merged

Conversation

nartal1
Copy link
Contributor

@nartal1 nartal1 commented Jan 7, 2025

This PR refines the logic used to prune Parquet schemas when filtering out columns that do not match the query. Previously, an entire struct could be added prematurely to the schema_map and remain there even if none of its children matched. This could result in an incorrect schema when reading Parquet files.

This pull request includes updates to the NativeParquetJni.cpp file to improve the handling of schema pruning in the column_pruner class. The changes focus on ensuring accurate child count updates and proper struct inclusion or exclusion based on the presence of matching children.

Improvements to schema pruning logic:

  • Updated the logic to save the current struct index and handle child count placeholders accurately. This ensures that parent-before-children ordering is maintained in schema_map.
  • Introduced a mechanism to track the size of schema_map before and after processing each child. This helps determine if a child was retained and accurately updates the child count.
  • Added logic to remove structs from schema_map and schema_num_children if none of their children are included, ensuring that only relevant structs are retained.

I tested this PR with these issues - NVIDIA/spark-rapids#11619, NVIDIA/spark-rapids#11620, NVIDIA/spark-rapids#11621, NVIDIA/spark-rapids#11628 and NVIDIA/spark-rapids#11629. I don't see the test failures anymore. I will close those issues after enabling the tests which would require the PR in spark-rapids.

@revans2
Copy link
Collaborator

revans2 commented Jan 7, 2025

build

revans2
revans2 previously approved these changes Jan 7, 2025
Copy link
Collaborator

@revans2 revans2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The changes look good to me. Are there more changes needed on the java side? There were issues there too, if I remember correctly.

mythrocks
mythrocks previously approved these changes Jan 7, 2025
Copy link
Collaborator

@mythrocks mythrocks left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good work, @nartal1. LGTM! A couple of minor nits picked.

Comment on lines +212 to +214
// We add the Struct to schema_map (with a placeholder for child count).
// But we might remove it later if no children match. This ensures parent-before-children
// ordering in schema_map.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👏

Comment on lines 235 to 253
// No match was found so skip the child.
// No match was found so skip the child
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Absolute nit: We are inconsistent with punctuation, in this change. Note that line#235 has a period, but we opt to remove the period here.

My vote would be to punctuate deliberately, and add the period back.

++schema_num_children[our_num_children_index];
// Record the current size of schema_map before processing the child
// This helps determine if the child was retained after processing.
std::size_t before_child_size = schema_map.size();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: We can get away with making this const.

Suggested change
std::size_t before_child_size = schema_map.size();
auto const before_child_size = schema_map.size();

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @mythrocks for the review. Updated it. PTAL.

@sameerz sameerz added the bug Something isn't working label Jan 7, 2025
Signed-off-by: Niranjan Artal <[email protected]>
@nartal1 nartal1 dismissed stale reviews from mythrocks and revans2 via a564c9e January 7, 2025 21:33
@nartal1
Copy link
Contributor Author

nartal1 commented Jan 7, 2025

Are there more changes needed on the java side? There were issues there too, if I remember correctly.

Thanks @revans2 for the review. For struct schema evolution issues, we don't need any changes on the java side. I tested with multiple nested structs and it works fine. We can enable the tests mentioned in the description once this PR is merged and we have a new jar including this.
I suppose there could still be some issues if there are maps or lists involved i.e empty structs within the maps/lists but have to do some testing to be certain.

@nartal1
Copy link
Contributor Author

nartal1 commented Jan 7, 2025

build

Copy link
Collaborator

@mythrocks mythrocks left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks for making the code changes suggested in the review.

@mythrocks mythrocks merged commit 260e9f3 into NVIDIA:branch-25.02 Jan 9, 2025
4 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants