
Extend predicate pushdown to discriminate between buckets created by update_dataset_from_ddf #276

Open
lr4d opened this issue Apr 9, 2020 · 0 comments

lr4d (Collaborator) commented Apr 9, 2020

Scenario

We use update_dataset_from_ddf with shuffle=True.
We bucket_by the column c1, with a specific n_buckets.

Imagine we now store the value of bucket_by and num_buckets alongside the Parquet metadata for each created partition.
Now a user performs a query with the predicate c1 == 2. When reading the Parquet metadata, we can do the following:

  • Since we know the minimum value of c1 from the Parquet statistics, we can compute hash(c1.min) % n_buckets.
  • We can compute the same hash for the value in the user's predicate, i.e. hash(2) % n_buckets.
  • By comparing these two values, we know whether we actually need to read the row group or not (see the sketch after this list).
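
A minimal sketch of this check, assuming the c1 min statistic is available per row group and that the same hash function used by the shuffle is available at read time; the names `bucket_id` and `row_group_matches_predicate` are hypothetical:

```python
# Sketch only: the real implementation would have to use the exact hash
# function applied by update_dataset_from_ddf's shuffle; Python's built-in
# hash() is a stand-in here.

def bucket_id(value, n_buckets, hash_func=hash):
    """Map a value of the bucket_by column to its logical bucket."""
    return hash_func(value) % n_buckets

def row_group_matches_predicate(c1_min, predicate_value, n_buckets):
    """Return True if the row group may contain rows matching c1 == predicate_value."""
    # If the buckets differ, no row in this row group can match the predicate,
    # so the row group can be skipped without reading its data pages.
    return bucket_id(c1_min, n_buckets) == bucket_id(predicate_value, n_buckets)

# Example: predicate c1 == 2, a row group whose c1 min statistic is 5.
if not row_group_matches_predicate(c1_min=5, predicate_value=2, n_buckets=8):
    print("skip this row group")
```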

Rationale

The key here is that to compute the identifier of each distinct logical partition created using bucket_by, we only need a single value from the columns passed to bucket_by: since every value in a logical partition hashes to the same bucket, that identifier is hash(c1.min) % n_buckets (or, equivalently, hash(c1.max) % n_buckets).
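
A quick check of the invariant this relies on, again using Python's hash() as a stand-in for the actual bucketing hash: if a partition contains exactly the values mapped to one bucket, then any single value taken from it (min, max, or otherwise) identifies that bucket.

```python
n_buckets = 4
buckets = {}
for v in range(100):
    # Group values the way a hash-based shuffle with n_buckets buckets would.
    buckets.setdefault(hash(v) % n_buckets, []).append(v)

for b, members in buckets.items():
    # min and max of each bucket recover the same bucket id as any member.
    assert hash(min(members)) % n_buckets == b
    assert hash(max(members)) % n_buckets == b
```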

Value

Performance: during predicate pushdown, we can discard row groups where hash(c1.min) % n_buckets does not match the bucket of the user-provided predicate value (hash(2) % n_buckets, assuming 2 is the user-provided value).
If more than one column is passed to bucket_by, we can no longer rely on the Parquet statistics, because the min/max statistics of different columns will not necessarily belong to the same row (illustrated below).
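
A hypothetical two-column example of why the statistics are insufficient in that case: the per-column min values of a row group need not come from the same row, so hashing e.g. (c1.min, c2.min) may describe a row that does not exist.

```python
# Rows of (c1, c2) inside one row group; invented data for illustration.
rows = [(1, 9), (4, 2)]

c1_min = min(c1 for c1, _ in rows)  # 1
c2_min = min(c2 for _, c2 in rows)  # 2

# (1, 2) is not an actual row, so hash((c1_min, c2_min)) % n_buckets would
# not reliably identify the bucket of any row in this row group.
assert (c1_min, c2_min) not in rows
```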

Comments

@fjetter :
ok, I get it. You’d still need to ensure backwards/forwards compat with the hash function AND collect the metadata up front

the idea about using the statistic to infer the bucket is good but we’d still need to have the prerequisites set up

this could be used to decide whether or not the file needs to be read but wouldn't help us for query planning. that would work similarly to predicate pushdown. The amount of infrastructure we'd need to set up to get the information to where we need it to be is quite big, though (collecting and passing the relevant information to the metapartitions during dispatch)
