Re-write predicate involving in operator to use disjunction of == terms
#325
Comments
For the benchmark, you should evaluate larger datasets: for these small datasets the overhead of the operations dominates, not the predicate evaluation. The most performance-critical part is probably not the index filtering but the filtering on the partitions themselves after loading data. In any case, you should be able to construct benchmarks using the filter_array_like function of kartothek.serialization, since this is where it matters most. I expect the rewrite to show a significant drawback when the value contains many elements, not just four. What's the motivation for rewriting this?
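The concern about many elements can be illustrated with a small timing sketch. This is plain Python, not kartothek's implementation: `mask_in` and `mask_eq_disjunction` are hypothetical helpers, and a realistic benchmark would go through filter_array_like on larger data, as suggested above. The sketch only shows that a disjunction of == terms makes one pass over the column per value, while a membership test makes a single pass.

```python
import timeit


def mask_in(column, values):
    """Single pass over the column with a set lookup (hypothetical helper)."""
    lookup = set(values)
    return [x in lookup for x in column]


def mask_eq_disjunction(column, values):
    """One pass per value, OR-ing the per-value equality masks (hypothetical helper)."""
    mask = [False] * len(column)
    for v in values:
        mask = [m or x == v for m, x in zip(mask, column)]
    return mask


# Synthetic column; size and value range are arbitrary assumptions.
column = [i % 1000 for i in range(20_000)]

for n_values in (4, 64):
    values = list(range(n_values))
    # Both strategies must select the same rows.
    assert mask_in(column, values) == mask_eq_disjunction(column, values)
    t_in = timeit.timeit(lambda: mask_in(column, values), number=3)
    t_eq = timeit.timeit(lambda: mask_eq_disjunction(column, values), number=3)
    print(f"{n_values} values: in={t_in:.4f}s, disjunction of ==: {t_eq:.4f}s")
```

Absolute timings from this sketch say nothing about kartothek itself; only the way the cost grows with the number of values is of interest.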
Sure.
Less performance-critical code to maintain. I also wonder how this would affect performance.
We are building predicates automatically from a dataframe of partitions. The naive approach resulted in predicates that are disjunctions of more than 1000 (sometimes 10000) conjunctions. Think of
[
    [
        ("a", "in", [f"value_{x}" for x in range(8)]),
        ("b", "in", [2012, 2013, 2014, 2015, 2016, 2017, 2018]),
        ("c", "in", [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]),
    ]
]
which would, if translated, result in 728 conjunctions with
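The figure of 728 follows from multiplying the lengths of the three value lists (8 × 7 × 13). A minimal check, using only the standard library and the predicate from the comment above:

```python
from itertools import product
from math import prod

predicate = [
    [
        ("a", "in", [f"value_{x}" for x in range(8)]),
        ("b", "in", [2012, 2013, 2014, 2015, 2016, 2017, 2018]),
        ("c", "in", [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]),
    ]
]

# Number of values per `in` term in the single conjunction.
sizes = [len(values) for _, _, values in predicate[0]]

# Each combination of one value per column becomes one conjunction of == terms.
n_conjunctions = prod(sizes)
combinations = list(product(*(values for _, _, values in predicate[0])))

assert n_conjunctions == len(combinations)
print(sizes, n_conjunctions)
```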
Problem description
We use the in operator internally in predicate parsing, but we can just re-write the predicates to use a disjunction of == terms, e.g.
[[('A', 'in', [1, 4, 9, 13])]] -> [[('A', '==', 1)], [('A', '==', 4)], [('A', '==', 9)], [('A', '==', 13)]]
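The rewrite in the example above could be sketched as follows. `expand_in_terms` is a hypothetical helper, not an existing kartothek function; it assumes predicates are given as a list of conjunctions, each a list of (column, operator, value) tuples. When one conjunction contains several in terms, every combination of their values becomes its own conjunction.

```python
from itertools import product


def expand_in_terms(predicates):
    """Rewrite every `in` term into a disjunction of `==` terms (hypothetical helper)."""
    expanded = []
    for conjunction in predicates:
        # For each term, collect the alternatives it can be replaced by.
        alternatives = []
        for column, op, value in conjunction:
            if op == "in":
                alternatives.append([(column, "==", v) for v in value])
            else:
                alternatives.append([(column, op, value)])
        # One new conjunction per combination of alternatives.
        expanded.extend(list(combo) for combo in product(*alternatives))
    return expanded


rewritten = expand_in_terms([[("A", "in", [1, 4, 9, 13])]])
print(rewritten)
```

Note that in this sketch an in term with an empty value list yields no conjunctions at all, so a caller would have to treat that case separately.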
We could implement this re-write when a user passes predicates involving in, before the predicates are evaluated. This seems to be as fast as or faster than our current evaluation of predicates in micro-benchmarks (see below).
Example code (ideally copy-pastable)