
Extend predicate pushdown to discriminate between buckets created by update_dataset_from_ddf #276

Open
lr4d opened this issue Apr 9, 2020 · 0 comments

lr4d (Collaborator) commented Apr 9, 2020

Scenario

We use update_dataset_from_ddf with shuffle=True.
We bucket_by the column c1, with a specific n_buckets.

Imagine we now store the value of bucket_by and num_buckets alongside the Parquet metadata for each created partition.
Now a user performs a query with the predicate c1 == 2. When reading the Parquet metadata, we can do the following:

  • Since we know the minimum value of c1 from the Parquet statistics, we can compute hash(c1.min) % n_buckets.
  • We can compute the same hash for the value in the user's predicate, i.e. hash(2) % n_buckets.
  • By comparing these two values, we know whether we actually need to read the row group or not (see the sketch after this list).
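
A minimal sketch of this check, assuming the c1 min statistic is available per row group and that the same hash function used by the shuffle is available at read time; the names `bucket_id` and `row_group_matches_predicate` are hypothetical:

```python
# Sketch only: the real implementation would have to use the exact hash
# function applied by update_dataset_from_ddf's shuffle; Python's built-in
# hash() is a stand-in here.

def bucket_id(value, n_buckets, hash_func=hash):
    """Map a value of the bucket_by column to its logical bucket."""
    return hash_func(value) % n_buckets

def row_group_matches_predicate(c1_min, predicate_value, n_buckets):
    """Return True if the row group may contain rows matching c1 == predicate_value."""
    # If the buckets differ, no row in this row group can match the predicate,
    # so the row group can be skipped without reading its data pages.
    return bucket_id(c1_min, n_buckets) == bucket_id(predicate_value, n_buckets)

# Example: predicate c1 == 2, a row group whose c1 min statistic is 5.
if not row_group_matches_predicate(c1_min=5, predicate_value=2, n_buckets=8):
    print("skip this row group")
```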

Rationale

The key here is that to compute the identifier of each distinct logical partition created using bucket_by, we only need a single value from the columns passed to bucket_by: since every value in a logical partition hashes to the same bucket, that identifier is hash(c1.min) % n_buckets (or, equivalently, hash(c1.max) % n_buckets).
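
A quick check of the invariant this relies on, again using Python's hash() as a stand-in for the actual bucketing hash: if a partition contains exactly the values mapped to one bucket, then any single value taken from it (min, max, or otherwise) identifies that bucket.

```python
n_buckets = 4
buckets = {}
for v in range(100):
    # Group values the way a hash-based shuffle with n_buckets buckets would.
    buckets.setdefault(hash(v) % n_buckets, []).append(v)

for b, members in buckets.items():
    # min and max of each bucket recover the same bucket id as any member.
    assert hash(min(members)) % n_buckets == b
    assert hash(max(members)) % n_buckets == b
```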

Value

Performance: during predicate pushdown, we can discard row groups where hash(c1.min) % n_buckets does not match the bucket of the user-provided predicate value (hash(2) % n_buckets, assuming 2 is the user-provided value).
If more than one column is passed to bucket_by, we can no longer rely on the Parquet statistics, because the min/max statistics of different columns will not necessarily belong to the same row (illustrated below).
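
A hypothetical two-column example of why the statistics are insufficient in that case: the per-column min values of a row group need not come from the same row, so hashing e.g. (c1.min, c2.min) may describe a row that does not exist.

```python
# Rows of (c1, c2) inside one row group; invented data for illustration.
rows = [(1, 9), (4, 2)]

c1_min = min(c1 for c1, _ in rows)  # 1
c2_min = min(c2 for _, c2 in rows)  # 2

# (1, 2) is not an actual row, so hash((c1_min, c2_min)) % n_buckets would
# not reliably identify the bucket of any row in this row group.
assert (c1_min, c2_min) not in rows
```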

Comments

@fjetter :
ok, I get it. You’d still need to ensure backwards/forwards compat with the hash function AND collect the metadata up front

the idea about using the statistic to infer the bucket is good but we’d still need to have the prerequisites set up

this could be used to decide whether or not the file needs to be read but wouldn't help us for query planning. that would work similarly to predicate pushdown. The amount of infrastructure we'd need to set up to get the information to where we need it to be is quite big, though (collecting and passing the relevant information to the metapartitions during dispatch)
