Scenario
We use update_dataset_from_ddf with shuffle=True.
We bucket_by the column c1, with a specific n_buckets.
Imagine we now store the values of bucket_by and n_buckets alongside the Parquet metadata for each created partition.
Now a user performs a query with the predicate c1 == 2. When reading the Parquet metadata, we can do the following:
1. Since we know the minimum value of c1 from the Parquet statistics, we can compute hash(c1.min) % n_buckets.
2. We can apply the same hash function to the value the user provided in the query, i.e. hash(2) % n_buckets.
By comparing these two values, we know whether we actually need to read the row group or not.
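To make the pruning step concrete, here is a minimal sketch that reads the row-group statistics with pyarrow. The column name, N_BUCKETS, and the bucket_hash helper are illustrative placeholders; in a real implementation the hash would have to be exactly the one the shuffle used when the data was written.

```python
import pyarrow.parquet as pq

N_BUCKETS = 8          # hypothetical value stored alongside the dataset metadata
BUCKET_COLUMN = "c1"   # the column passed to bucket_by

def bucket_hash(value, n_buckets):
    # Stand-in for whatever hash function the shuffle used when writing;
    # it must be byte-for-byte identical to the writer's hash to be correct.
    return hash(value) % n_buckets

def row_groups_to_read(parquet_path, predicate_value, n_buckets=N_BUCKETS):
    """Return indices of row groups that may contain predicate_value."""
    target_bucket = bucket_hash(predicate_value, n_buckets)
    metadata = pq.ParquetFile(parquet_path).metadata
    keep = []
    for i in range(metadata.num_row_groups):
        row_group = metadata.row_group(i)
        stats = None
        for j in range(row_group.num_columns):
            column = row_group.column(j)
            if column.path_in_schema == BUCKET_COLUMN:
                stats = column.statistics
                break
        if stats is None or not stats.has_min_max:
            keep.append(i)  # no usable statistics -> must read to be safe
            continue
        # Every row in this row group belongs to the same bucket, so the
        # minimum value is representative of the whole group.
        if bucket_hash(stats.min, n_buckets) == target_bucket:
            keep.append(i)
    return keep
```

For the query c1 == 2 this would be called as row_groups_to_read(path, 2), and only the surviving row groups would be handed to the reader.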
Rationale
The key here is that, to calculate the identifier of each distinct logical partition created using bucket_by, we only need a single value from the columns passed to bucket_by: since every row in such a partition hashes to the same bucket, that unique logical partition identifier is simply hash(c1.min) % n_buckets (or, equivalently, hash(c1.max) % n_buckets).
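A toy illustration of this, using Python's built-in hash as a stand-in for the actual bucketing function:

```python
n_buckets = 8

def bucket_of(value):
    # Placeholder hash; the real implementation must match the writer's.
    return hash(value) % n_buckets

# Values that the shuffle would have routed into the same logical partition:
partition_values = [v for v in range(100) if bucket_of(v) == 3]

# Any single value from the partition recovers the bucket identifier.
assert bucket_of(min(partition_values)) == 3
assert bucket_of(max(partition_values)) == 3
assert all(bucket_of(v) == 3 for v in partition_values)
```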
Value
Performance: during predicate pushdown, we can discard any row group where hash(c1.min) % n_buckets does not match the bucket of the user-provided predicate value (hash(2) % n_buckets, assuming 2 is the user-provided value).
If more than one column is provided to bucket_by, we can no longer rely on the Parquet statistics: the min/max statistics of different columns will not necessarily belong to the same row, so the column-wise minima cannot be combined into a value tuple that actually occurs in the data.
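A small example of why the column-wise statistics are not enough in the multi-column case (values chosen for illustration only):

```python
# With bucket_by=["c1", "c2"], the bucket is a hash of the tuple (c1, c2).
rows = [(1, 9), (5, 2)]            # rows actually present in the row group

c1_min = min(r[0] for r in rows)   # 1, taken from the first row
c2_min = min(r[1] for r in rows)   # 2, taken from the second row

# The per-column minima combine into a tuple that never occurs in the data,
# so hashing it tells us nothing about the row group's bucket.
assert (c1_min, c2_min) not in rows
```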
Comments
@fjetter :
ok, I get it. You’d still need to ensure backwards/forwards compat with the hash function AND collect the metadata up front
the idea about using the statistic to infer the bucket is good but we’d still need to have the prerequisites set up
this could be used to decide whether or not the file needs to be read but wouldn't help us for query planning. That would work similarly to predicate pushdown. The amount of infrastructure we'd need to set up to get the information to where we need it to be is quite big, though (collecting and passing the relevant information to the metapartitions during dispatch)