PERF-#7230: Don't preserve bad partition for merge #7229

Merged
merged 1 commit into from
Apr 30, 2024


anmyachev
Collaborator

@anmyachev anmyachev commented Apr 28, 2024

What do these changes do?

merge preserves the row splitting, but sometimes it is better to repartition. An extremely bad case is a sequence of heavyweight operations in which inefficient partitioning persists from the first operation to the last, such as a chain of merge operations where the left operand of the very first merge has only one partition. For example:

import modin.pandas as pd
import numpy as np
from time import time
from modin.utils import execute
import modin.config as cfg

cfg.NPartitions.put(8)


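# Presumably a warm-up: run one small operation first so that engine
# start-up is not included in the timings below.
pd.DataFrame(np.arange(cfg.NPartitions.get() * cfg.MinPartitionSize.get())).to_numpy()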

for _ in range(3):
    np.random.seed(42)
    # small
    df1 = pd.DataFrame(np.random.randint(100, size=(32, 16)))
    assert df1._query_compiler._modin_frame._partitions.shape == (1, 1)

    # big
    df2 = pd.DataFrame(np.random.randint(100, size=(16_000, 16)))
    df3 = pd.DataFrame(np.random.randint(100, size=(200_000, 16)))

    start = time()

    res = df1.merge(df2, on=2)
    res = res.merge(df3, left_on='3_x', right_on=3)
    print(res.shape)
    execute(res)

    print(f"time: {time()-start}")

Results: 5.07 sec (on main) vs 2.56 sec (in the PR)
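
To illustrate why a single-partition left operand is costly across a chain, here is a toy model in plain Python. It is not Modin code: `toy_merge` and `toy_repartition` are hypothetical helpers, and the model only assumes that a row-wise merge processes each left partition against the whole right frame, so the output keeps the left operand's row splitting.

```python
def toy_merge(left_parts, right_rows):
    """Join each left partition with all right rows on the first column.

    The result has exactly as many partitions as the left operand.
    """
    return [
        [l + r for l in part for r in right_rows if l[0] == r[0]]
        for part in left_parts
    ]


def toy_repartition(parts, n):
    """Redistribute all rows into at most n roughly equal row partitions."""
    rows = [row for part in parts for row in part]
    size = max(1, -(-len(rows) // n))  # ceiling division, at least 1
    return [rows[i : i + size] for i in range(0, len(rows), size)]


# One small left partition, like df1 in the example above.
left = [[(i % 4,) for i in range(8)]]
right = [(i % 4, i) for i in range(40)]

first = toy_merge(left, right)
second = toy_merge(first, right)
# Both results still live in a single partition, however large they grow.
print(len(first), len(second))

# Repartitioning between the merges spreads the second merge over 4 partitions.
balanced = toy_repartition(first, 4)
second_balanced = toy_merge(balanced, right)
print(len(balanced), len(second_balanced))
```

In this model the second merge over `first` runs as one large task, while the merge over `balanced` splits the same rows into four independent tasks, which is the effect the PR is after.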

  • first commit message and PR title follow format outlined here

    NOTE: If you edit the PR title to match this format, you need to add another commit (even if it's empty) or amend your last commit for the CI job that checks the PR title to pick up the new PR title.

  • passes flake8 modin/ asv_bench/benchmarks scripts/doc_checker.py
  • passes black --check modin/ asv_bench/benchmarks scripts/doc_checker.py
  • signed commit with git commit -s
  • Resolves Don't preserve bad partition for merge #7230
  • tests added and passing
  • module layout described at docs/development/architecture.rst is up-to-date

@anmyachev anmyachev changed the title PERF-#0000: repartition if needed for merge PERF-#7230: repartition if needed for merge Apr 29, 2024
@anmyachev anmyachev changed the title PERF-#7230: repartition if needed for merge PERF-#7230: Don't preserve bad partition for merge Apr 29, 2024
@anmyachev anmyachev marked this pull request as ready for review April 29, 2024 13:02
Comment on lines +842 to +846
with mock.patch.object(
    left._query_compiler, "repartition", return_value=return_value
) as repartition:
    _ = left.merge(right)
repartition.assert_called_once_with(axis=0)
Collaborator


What do we test here?

Collaborator Author


That repartition is called when a merge is performed on a badly partitioned frame.
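
The testing pattern can be shown in isolation with the standard library only. This is a minimal sketch: `FakeCompiler` and `merge_with_repartition` are hypothetical stand-ins, not Modin classes or functions. The point is only that `mock.patch.object` swaps the method for a mock, and the mock records whether and how it was called.

```python
from unittest import mock


class FakeCompiler:
    """Hypothetical stand-in for a query compiler (not Modin's class)."""

    def repartition(self, axis=None):
        return self


def merge_with_repartition(compiler):
    """Hypothetical merge path that requests a row repartition first."""
    return compiler.repartition(axis=0)


compiler = FakeCompiler()
with mock.patch.object(
    compiler, "repartition", return_value=compiler
) as repartition:
    result = merge_with_repartition(compiler)

# Raises AssertionError unless repartition was called exactly once with axis=0.
repartition.assert_called_once_with(axis=0)
```

Setting `return_value` keeps the patched method returning a usable object, so the code under test can continue after the mocked call.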

if (
    left._modin_frame._partitions.shape[0] < 0.3 * NPartitions.get()
    # to avoid empty partitions after repartition; can materialize index
    and len(left._modin_frame) > NPartitions.get() * MinPartitionSize.get()
Collaborator


Does this change slow down some of our benchmarks?
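
For reference when weighing that cost, the two clauses visible in the condition under review can be read as a standalone predicate. This is a sketch under assumptions: `should_repartition` is a hypothetical name, the quoted snippet is truncated so the real condition may include more clauses, and the value 32 used for the minimum partition size is an assumption for illustration only.

```python
def should_repartition(n_row_partitions, n_rows, n_partitions, min_partition_size):
    """Mirror the two visible clauses of the condition quoted above."""
    # Far fewer row partitions than the configured number of partitions.
    too_few_partitions = n_row_partitions < 0.3 * n_partitions
    # Enough rows that repartitioning would not create empty partitions.
    enough_rows = n_rows > n_partitions * min_partition_size
    return too_few_partitions and enough_rows


N_PARTITIONS = 8          # as set via cfg.NPartitions.put(8) in the PR example
MIN_PARTITION_SIZE = 32   # assumed value, for illustration only

# A 32-row frame in one partition: too small to be worth splitting.
print(should_repartition(1, 32, N_PARTITIONS, MIN_PARTITION_SIZE))
# A few thousand rows still in one partition: a candidate for repartitioning.
print(should_repartition(1, 5000, N_PARTITIONS, MIN_PARTITION_SIZE))
# Four of eight partitions already: left alone by the first clause.
print(should_repartition(4, 5000, N_PARTITIONS, MIN_PARTITION_SIZE))
```

Under these assumptions the check itself is cheap apart from the row count, which the inline comment in the diff notes can materialize the index.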
