use a single row_count column during predicate pruning instead of one per column #14295

Open
wants to merge 2 commits into
base: main

Conversation

adriangb
Contributor

@adriangb adriangb commented Jan 25, 2025

Closes #13836

@github-actions github-actions bot added the optimizer (Optimizer rules) and core (Core DataFusion crate) labels Jan 25, 2025
@adriangb
Contributor Author

@alamb this seems to be a simple enough change that is almost self-contained: the only breaking change I see is in PredicateRewriter, which we recently introduced in #12850. I can't be sure, of course, but I'd guess we (Pydantic) are the only ones using it, and we'd be happy to absorb the breaking change. The alternative would be to add an option to PredicateRewriter to control this behavior, which requires more code, etc. Up to you 😄.

@adriangb
Contributor Author

I want to point out that this works because of how the RecordBatch is generated:

    for (column, statistics_type, stat_field) in required_columns.iter() {
        let column = Column::from_name(column.name());
        let data_type = stat_field.data_type();
        let num_containers = statistics.num_containers();
        let array = match statistics_type {
            StatisticsType::Min => statistics.min_values(&column),
            StatisticsType::Max => statistics.max_values(&column),
            StatisticsType::NullCount => statistics.null_counts(&column),
            StatisticsType::RowCount => statistics.row_counts(&column),
        };
        let array = array.unwrap_or_else(|| new_null_array(data_type, num_containers));
        if num_containers != array.len() {
            return internal_err!(
                "mismatched statistics length. Expected {}, got {}",
                num_containers,
                array.len()
            );
        }
        // cast statistics array to required data type (e.g. parquet
        // provides timestamp statistics as "Int64")
        let array = arrow::compute::cast(&array, data_type)?;
        fields.push(stat_field.clone());
        arrays.push(array);
    }

Since the batch is generated from the columns tracked by RequiredColumns, we can just rename the row count column internally with no consequences.

This should also save some work in creating the array, make scanning the record batch faster, etc.
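To illustrate the idea, here is a minimal sketch of how a single shared row count field could be tracked. It is not the actual DataFusion code: the `RequiredColumns` struct, its `stat_field_name` method, and the field naming scheme below are simplified, hypothetical stand-ins, and the sketch assumes the row count of a container is the same no matter which column requested it.

```rust
// Hypothetical, simplified stand-ins for the pruning types; the names and
// fields are illustrative assumptions, not the real DataFusion API.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum StatisticsType {
    Min,
    Max,
    NullCount,
    RowCount,
}

#[derive(Debug, Default)]
struct RequiredColumns {
    // (source column, statistic kind, name of the field in the stats batch)
    columns: Vec<(String, StatisticsType, String)>,
}

impl RequiredColumns {
    /// Returns the stats field name for `column`/`kind`, registering the
    /// field only the first time that name is requested.
    fn stat_field_name(&mut self, column: &str, kind: StatisticsType) -> String {
        let field_name = match kind {
            StatisticsType::Min => format!("{}_min", column),
            StatisticsType::Max => format!("{}_max", column),
            StatisticsType::NullCount => format!("{}_null_count", column),
            // Assumption: one shared field, because the row count of a
            // container is taken to be independent of the column.
            StatisticsType::RowCount => "row_count".to_string(),
        };
        if !self.columns.iter().any(|(_, _, name)| *name == field_name) {
            self.columns
                .push((column.to_string(), kind, field_name.clone()));
        }
        field_name
    }
}
```

With this shape, a predicate over columns `a` and `b` that needs null counts and row counts registers three fields (`a_null_count`, `b_null_count`, `row_count`) rather than four, so a loop like the one above builds one fewer array per extra column.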

@github-actions github-actions bot added the sqllogictest SQL Logic Tests (.slt) label Jan 25, 2025
@adriangb adriangb force-pushed the only-require-a-single-row-count-2 branch from adf5e7c to da0f264 Compare January 25, 2025 18:55
Development

Successfully merging this pull request may close these issues.

Why does PruningPredicate reference a row_count for each column?