-
Notifications
You must be signed in to change notification settings - Fork 653
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
PERF-#7230: Don't preserve bad partition for merge
#7229
Conversation
b6a887e
to
5b9a236
Compare
Signed-off-by: Anatoly Myachev <[email protected]>
5b9a236
to
f93e88a
Compare
merge
with mock.patch.object( | ||
left._query_compiler, "repartition", return_value=return_value | ||
) as repartition: | ||
_ = left.merge(right) | ||
repartition.assert_called_once_with(axis=0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What do we test here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That in case of bad partitioning repartition
was called (when performing merge
operation).
if ( | ||
left._modin_frame._partitions.shape[0] < 0.3 * NPartitions.get() | ||
# to avoid empty partitions after repartition; can materialize index | ||
and len(left._modin_frame) > NPartitions.get() * MinPartitionSize.get() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this change slow down some of our benchmarks?
What do these changes do?
merge
preserves the row splitting, but sometimes it's better to repartition. One extremely bad case is when there is a sequence of heavyweight operations where inefficient partitioning persists from the first to the last. For example, sequence ofmerge
operations, where the left operand of the very first operation in the chain has only one partition. For example:Results: 5.07 sec (on main) vs 2.56 sec (in the PR)
flake8 modin/ asv_bench/benchmarks scripts/doc_checker.py
black --check modin/ asv_bench/benchmarks scripts/doc_checker.py
git commit -s
merge
#7230docs/development/architecture.rst
is up-to-date