Improve struct pruning logic for Parquet Schema Filtering #2733

nartal1 · 2025-01-07T02:06:59Z

This PR refines the logic used to prune Parquet schemas when filtering out columns that do not match the query. Previously, an entire struct could be added prematurely to the schema_map and remain there even if none of its children matched. This could result in an incorrect schema when reading Parquet files.

This pull request includes updates to the NativeParquetJni.cpp file to improve the handling of schema pruning in the column_pruner class. The changes focus on ensuring accurate child count updates and proper struct inclusion or exclusion based on the presence of matching children.

Improvements to schema pruning logic:

Updated the logic to save the current struct index and handle child count placeholders accurately. This ensures that parent-before-children ordering is maintained in schema_map.
Introduced a mechanism to track the size of schema_map before and after processing each child. This helps determine if a child was retained and accurately updates the child count.
Added logic to remove structs from schema_map and schema_num_children if none of their children are included, ensuring that only relevant structs are retained.
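
For readers following along, the three steps above can be sketched roughly as below. This is only an illustration under stated assumptions, not the code in NativeParquetJni.cpp: `FileNode`, `Wanted`, `Pruned`, `find_child`, `prune`, and `prune_example` are hypothetical names, and the real `column_pruner` works on the Parquet footer schema rather than on a small tree like this one. Only `schema_map`, `schema_num_children`, and `before_child_size` mirror names from the PR.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for one element of the file schema (not the real Parquet type).
struct FileNode {
  std::string name;
  int index;                       // position in the flattened file schema
  std::vector<FileNode> children;  // assumed empty for leaf columns
};

// Hypothetical stand-in for the schema requested by the query.
struct Wanted {
  std::string name;
  std::vector<Wanted> children;
};

struct Pruned {
  std::vector<int> schema_map;           // retained file indices, parent before children
  std::vector<int> schema_num_children;  // child counts, parallel to schema_map
};

// Returns the requested child with the given name, or nullptr if it was not requested.
Wanted const* find_child(Wanted const& wanted, std::string const& name)
{
  for (auto const& c : wanted.children) {
    if (c.name == name) { return &c; }
  }
  return nullptr;
}

void prune(FileNode const& node, Wanted const& wanted, Pruned& out)
{
  assert(out.schema_map.size() == out.schema_num_children.size());
  if (node.children.empty()) {  // leaf column: always kept once it is reached
    out.schema_map.push_back(node.index);
    out.schema_num_children.push_back(0);
    return;
  }
  // Step 1: add the struct first, with a placeholder child count, so that the
  // parent precedes its children in schema_map.
  auto const struct_pos = out.schema_map.size();
  out.schema_map.push_back(node.index);
  out.schema_num_children.push_back(0);

  for (auto const& child : node.children) {
    Wanted const* match = find_child(wanted, child.name);
    if (match == nullptr) { continue; }  // No match was found so skip the child.
    // Step 2: compare sizes before and after to see whether the child was retained.
    auto const before_child_size = out.schema_map.size();
    prune(child, *match, out);
    if (out.schema_map.size() > before_child_size) { ++out.schema_num_children[struct_pos]; }
  }

  // Step 3: no child survived, so drop the struct that was added in step 1.
  if (out.schema_num_children[struct_pos] == 0) {
    out.schema_map.resize(struct_pos);
    out.schema_num_children.resize(struct_pos);
  }
}

// Builds a small made-up schema: root{a{x, y}, b{z}, c} and requests root{a{x}, b{w}}.
Pruned prune_example()
{
  FileNode const file{"root", 0, {
    {"a", 1, {{"x", 2, {}}, {"y", 3, {}}}},
    {"b", 4, {{"z", 5, {}}}},
    {"c", 6, {}}}};
  Wanted const wanted{"root", {{"a", {{"x", {}}}}, {"b", {{"w", {}}}}}};
  Pruned out;
  prune(file, wanted, out);
  return out;
}
```

In this made-up example the struct `b` is requested, but its only requested field `w` does not exist in the file, so `b` is added with a placeholder and then removed again. The parent's child count is never incremented for it.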

I tested this PR against the following issues: NVIDIA/spark-rapids#11619, NVIDIA/spark-rapids#11620, NVIDIA/spark-rapids#11621, NVIDIA/spark-rapids#11628 and NVIDIA/spark-rapids#11629. The test failures no longer reproduce. I will close those issues after enabling the tests, which requires a follow-up PR in spark-rapids.

Signed-off-by: Niranjan Artal <[email protected]>


revans2 · 2025-01-07T14:20:26Z

build

revans2

The changes look good to me. Are there more changes needed on the java side? There were issues there too, if I remember correctly.

mythrocks

Good work, @nartal1. LGTM! A couple of minor nits picked.

mythrocks · 2025-01-07T18:28:57Z

src/main/cpp/src/NativeParquetJni.cpp

+    // We add the Struct to schema_map (with a placeholder for child count).
+    // But we might remove it later if no children match. This ensures parent-before-children
+    // ordering in schema_map.


mythrocks · 2025-01-07T18:36:00Z

src/main/cpp/src/NativeParquetJni.cpp

-        // No match was found so skip the child.
+        // No match was found so skip the child


Absolute nit: We are inconsistent with punctuation in this change. Note that line #235 has a period, but we opt to remove the period here.

My vote would be to punctuate deliberately, and add the period back.

mythrocks · 2025-01-07T18:43:14Z

src/main/cpp/src/NativeParquetJni.cpp

-        ++schema_num_children[our_num_children_index];
+        // Record the current size of schema_map before processing the child
+        // This helps determine if the child was retained after processing.
+        std::size_t before_child_size = schema_map.size();


Nit: We can get away with making this const.

Suggested change

-        std::size_t before_child_size = schema_map.size();
+        auto const before_child_size = schema_map.size();

Thanks @mythrocks for the review. Updated it. PTAL.

Signed-off-by: Niranjan Artal <[email protected]>

nartal1 · 2025-01-07T21:38:31Z

> Are there more changes needed on the java side? There were issues there too, if I remember correctly.

Thanks @revans2 for the review. For the struct schema evolution issues, we don't need any changes on the Java side. I tested with multiple nested structs and it works fine. We can enable the tests mentioned in the description once this PR is merged and a new jar that includes it is available.
I suppose there could still be some issues if maps or lists are involved, i.e. empty structs within the maps/lists, but I have to do some testing to be certain.

nartal1 · 2025-01-07T21:39:13Z

build

mythrocks

LGTM. Thanks for making the code changes suggested in the review.

nartal1 added 5 commits December 31, 2024 14:36

fix for nested type

5f717a9

Signed-off-by: Niranjan Artal <[email protected]>

Merge branch 'branch-25.02' of github.com:NVIDIA/spark-rapids-jni int…

aba257d

…o nested_fix

Improve Struct Pruning Logic for Parquet Schema Filtering

21fb745

Signed-off-by: Niranjan Artal <[email protected]>

Update comments

fa9907d

Update clang format

5d69536

revans2 previously approved these changes Jan 7, 2025


mythrocks previously approved these changes Jan 7, 2025


sameerz added the bug Something isn't working label Jan 7, 2025

addressed review comments

a564c9e

Signed-off-by: Niranjan Artal <[email protected]>

nartal1 dismissed stale reviews from mythrocks and revans2 via a564c9e January 7, 2025 21:33

mythrocks approved these changes Jan 9, 2025


mythrocks merged commit 260e9f3 into NVIDIA:branch-25.02 Jan 9, 2025
4 checks passed

nartal1 mentioned this pull request Jan 11, 2025

Enable tests in RapidsParquetSchemaPruningSuite NVIDIA/spark-rapids#11956

Open
