
Increasing value of OMP_NUM_THREADS reduces performance even when controlling for n_workers and threads_per_worker #8985

Obi-Wan opened this issue Jan 13, 2025 · 2 comments

Obi-Wan commented Jan 13, 2025

Description:

I am working with external libraries that rely on NumPy's internal parallelization for certain heavy operations (such as .dot between large matrices). I would like to distribute a number of these large calculations with distributed, but I am seeing poor performance. In particular, when I set OMP_NUM_THREADS (and related variables) to the desired number of threads, performance gets worse!
I make sure not to oversubscribe the CPU: I explicitly balance the number of OMP threads against the number of workers and worker threads, so that the total matches the number of cores on the system.
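Concretely, the accounting looks like this (a minimal sketch, assuming a 16-core machine as in the runs below):

# Hypothetical thread accounting on an assumed 16-core machine: the total number of
# busy threads is workers x threads-per-worker x OMP/BLAS threads per task.
n_cores = 16
n_workers = 1
threads_per_worker = 1
omp_num_threads = 16  # OMP_NUM_THREADS / MKL_NUM_THREADS / OPENBLAS_NUM_THREADS

assert n_workers * threads_per_worker * omp_num_threads <= n_cores  # no oversubscription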

The example below uses one worker with one thread.
The performance of distributed is already underwhelming with one thread (even compared to a ProcessPoolExecutor), and it gets progressively worse with more threads.

As a side note, the CPU utilization on the system does rise to the selected number of threads, despite the lower performance.
I am not aware of any other open issue on this or a similar subject.

Minimal Complete Verifiable Example:

(The commented-out lines were used to verify that the correct values of the variables were being picked up. And yes, tqdm should not be used for performance assessment, but it is just a convenience tool and the difference is clear enough.)

from distributed import Client, get_worker, LocalCluster
from mkl import get_max_threads
import numpy as np
from tqdm.auto import tqdm
from numpy.typing import NDArray
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor

NUM_THREADS = 16


def get_env(num_threads: int = NUM_THREADS) -> dict[str, str]:
    return {var: f"{num_threads}" for var in ["OMP_NUM_THREADS", "MKL_NUM_THREADS", "OPENBLAS_NUM_THREADS"]}


# A toy iterative solver dominated by BLAS matrix-vector products, i.e. the part
# that NumPy parallelizes internally via the OMP/MKL/OpenBLAS thread settings.
def op(M: NDArray, y: NDArray, ii: int) -> float:
    # try:
    #     print(f"{ii = } - {get_worker() = }, {get_max_threads() = }")
    # except:
    #     print(f"{ii = } - {get_max_threads() = }")
    x = np.zeros(5_000)
    n1 = np.abs(M).dot(np.ones_like(x))
    n2 = np.abs(M).T.dot(np.ones_like(y))
    for _ in range(1_000):
        x += M.T.dot((y - M.dot(x)) / n1) / n2
    return float(np.linalg.norm(y - M.dot(x)))


if __name__ == "__main__":
    M = np.random.randn(500, 5_000)
    y = np.random.randn(500)

    N_TRIES = 11

    res_f = [op(M, y, ii) for ii in tqdm(range(N_TRIES), desc="For loop")]

    with ThreadPoolExecutor(max_workers=1) as executor:
        futures = [executor.submit(op, M, y, ii) for ii in range(N_TRIES)]
        res_d = [f.result() for f in tqdm(futures, desc=f"ThreadPoolExecutor ({NUM_THREADS})", total=N_TRIES)]

    with ProcessPoolExecutor(max_workers=1) as executor:
        futures = [executor.submit(op, M, y, ii) for ii in range(N_TRIES)]
        res_d = [f.result() for f in tqdm(futures, desc=f"ProcessPoolExecutor ({NUM_THREADS})", total=N_TRIES)]

    with LocalCluster(n_workers=1, threads_per_worker=1) as cluster:
        with Client(cluster) as client:
            print(client.dashboard_link)
            M_dd = client.scatter(M, broadcast=True)
            y_dd = client.scatter(y, broadcast=True)
            futures = [client.submit(op, M_dd, y_dd, ii) for ii in range(N_TRIES)]
            res_d = [f.result() for f in tqdm(futures, desc="Distributed (1)", total=N_TRIES)]

    with LocalCluster(n_workers=1, threads_per_worker=1, env=get_env()) as cluster:
        with Client(cluster) as client:
            print(client.dashboard_link)
            M_dd = client.scatter(M, broadcast=True)
            y_dd = client.scatter(y, broadcast=True)
            futures = [client.submit(op, M_dd, y_dd, ii) for ii in range(N_TRIES)]
            res_d = [f.result() for f in tqdm(futures, desc=f"Distributed ({NUM_THREADS})", total=N_TRIES)]

Output of the script:

ThreadPoolExecutor (16): 100%|██████████████████████████████████████████████████████| 11/11 [00:02<00:00,  5.14it/s]
ProcessPoolExecutor (16): 100%|█████████████████████████████████████████████████████| 11/11 [00:03<00:00,  2.98it/s]
http://127.0.0.1:8787/status
Distributed (1): 100%|██████████████████████████████████████████████████████████████| 11/11 [02:30<00:00, 13.72s/it]
http://127.0.0.1:8787/status
Distributed (16): 100%|█████████████████████████████████████████████████████████████| 11/11 [03:16<00:00, 17.88s/it]

Environment:

  • Dask version: 2024.8.2
  • Python version: any from 3.10 to 3.12 inclusive
  • Operating System: Linux (Ubuntu 20.04)
  • Install method (conda, pip, source): conda

Obi-Wan commented Jan 13, 2025

Small update: I've done a few more tests.

With the following calls, I obtain the same performance as the plain for loop:

with LocalCluster(n_workers=1, threads_per_worker=1, processes=False) as cluster:

and

with LocalCluster(n_workers=1, threads_per_worker=1, worker_class=Worker) as cluster:

If instead I use

with LocalCluster(n_workers=1, threads_per_worker=1, worker_class=Worker, processes=True) as cluster:

I get better performance than with the nanny, but it is still much slower than the for loop (CPU utilization is around 1550%, so in principle all 16 cores are being used):

Distributed (16): 100%|██████████████████████████████████████| 11/11 [01:31<00:00,  8.36s/it]

So I am not really sure what is going on here. The nanny process seems to have an impact, but it is not the biggest factor.
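For reference, here is a consolidated sketch of the three LocalCluster variants above (reusing op, M, y, N_TRIES, and tqdm from the MCVE in the issue description; the only new name is the Worker import from distributed):

from distributed import Client, LocalCluster, Worker

# The three cluster variants tested above; labels are only used for the progress bars.
variants = {
    "processes=False": dict(processes=False),
    "worker_class=Worker": dict(worker_class=Worker),
    "worker_class=Worker, processes=True": dict(worker_class=Worker, processes=True),
}

for label, extra in variants.items():
    with LocalCluster(n_workers=1, threads_per_worker=1, **extra) as cluster:
        with Client(cluster) as client:
            M_dd = client.scatter(M, broadcast=True)
            y_dd = client.scatter(y, broadcast=True)
            futures = [client.submit(op, M_dd, y_dd, ii) for ii in range(N_TRIES)]
            res = [f.result() for f in tqdm(futures, desc=label, total=N_TRIES)]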

Using the nanny does not seem to be a problem when using SLURMCluster from dask_jobqueue.


Obi-Wan commented Jan 13, 2025

Yet another update: it seems that, to properly set OMP_NUM_THREADS (and related variables), they should be set through the nanny's pre-spawn environment:

from dask import config as dd_config
dd_config.set({"distributed.nanny.pre-spawn-environ": get_env()})

with LocalCluster(n_workers=1, threads_per_worker=1) as cluster:

This brings the performance with the nanny up to the same level as skipping the nanny (the third cluster configuration from the previous update):

with LocalCluster(n_workers=1, threads_per_worker=1, worker_class=Worker, processes=True) as cluster:

This means that the performance is still crippled, just a bit less so.
Update on this last point: it seems that the improvement comes from the variable MALLOC_TRIM_THRESHOLD_ no longer being set, since its default value was detrimental to performance.
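A minimal sketch of how one might check this and merge the thread settings on top of the nanny defaults instead of replacing them wholesale (reusing get_env() from the MCVE; this assumes the default pre-spawn environment is registered when distributed is imported and includes MALLOC_TRIM_THRESHOLD_, as observed above):

from dask import config as dd_config
from distributed import LocalCluster  # importing distributed registers its config defaults

# Inspect the nanny's default pre-spawn environment; here it is expected to contain
# MALLOC_TRIM_THRESHOLD_ alongside the OMP/MKL/OPENBLAS thread variables.
defaults = dd_config.get("distributed.nanny.pre-spawn-environ", {})
print(defaults)

# Merge the desired thread counts on top of the defaults and drop
# MALLOC_TRIM_THRESHOLD_ explicitly, as in the faster runs above.
environ = {**defaults, **get_env()}
environ.pop("MALLOC_TRIM_THRESHOLD_", None)
dd_config.set({"distributed.nanny.pre-spawn-environ": environ})

with LocalCluster(n_workers=1, threads_per_worker=1) as cluster:
    ...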

So, the ultimate question is: Why does using multi-threaded NumPy interfere so much with spawned dask processes?
