Increasing value of `OMP_NUM_THREADS` reduces performance even when controlling for `n_workers` and `threads_per_worker` #8985
Description:

I am working with external libraries that rely on NumPy's internal parallelization for certain heavy operations (like `.dot` between large matrices). I would like to distribute a number of these large calculations with `distributed`, but I encounter bad performance. In particular, when I raise `OMP_NUM_THREADS` (and related variables) to the desired number of threads, performance gets worse! I make sure not to oversubscribe the CPU: I explicitly balance the number of OMP threads against the number of workers/worker threads so that their product matches the number of cores on the system.

The example below uses one worker with one thread. The performance of `distributed` is already underwhelming with one thread (even compared to a `ProcessPoolExecutor`), but with more threads it gets progressively worse. As a side note, CPU utilization on the system does rise to the selected number of threads, despite the lower performance.

I am not aware of any other open issue on this subject or similar subjects.
Minimal Complete Verifiable Example:
(The commented lines were used to verify that the correct values of the variables were actually picked up. And yes, tqdm should not be used for performance assessment, but the difference is clear enough; it is just a convenience tool.)
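Since the original script is not reproduced here, the following is a minimal sketch of the kind of reproducer described above: `N`, `N_TASKS`, and `heavy_dot` are illustrative names, and the thread-count values are the knobs one would vary. Note that `OMP_NUM_THREADS` and related variables must be set before NumPy is imported, or the BLAS thread pool ignores them.

```python
# Hypothetical reproducer sketch; assumes dask.distributed is installed
# and NumPy is linked against a threaded BLAS (OpenMP/MKL/OpenBLAS).
import os

# Must happen before `import numpy` to take effect.
os.environ["OMP_NUM_THREADS"] = "1"        # vary: 1, 2, 4, ...
os.environ["MKL_NUM_THREADS"] = "1"
os.environ["OPENBLAS_NUM_THREADS"] = "1"

import time
from concurrent.futures import ProcessPoolExecutor

import numpy as np

N = 2000       # matrix size; large enough for BLAS threading to matter
N_TASKS = 8    # number of independent .dot calls to distribute


def heavy_dot(seed: int) -> float:
    """One large matrix product; returns a scalar so little data is shipped back."""
    rng = np.random.default_rng(seed)
    a = rng.standard_normal((N, N))
    b = rng.standard_normal((N, N))
    return float(a.dot(b).sum())


def time_process_pool() -> float:
    start = time.perf_counter()
    with ProcessPoolExecutor(max_workers=1) as pool:
        list(pool.map(heavy_dot, range(N_TASKS)))
    return time.perf_counter() - start


def time_distributed() -> float:
    # Imported here so the rest of the script runs without dask installed.
    from distributed import Client, LocalCluster

    with LocalCluster(n_workers=1, threads_per_worker=1) as cluster, \
            Client(cluster) as client:
        start = time.perf_counter()
        futures = client.map(heavy_dot, range(N_TASKS))
        client.gather(futures)
        return time.perf_counter() - start


if __name__ == "__main__":
    print(f"ProcessPoolExecutor: {time_process_pool():.2f}s")
    print(f"distributed:         {time_distributed():.2f}s")
```

With one worker and one thread each, both executors run the same sequence of matrix products, so any timing gap isolates `distributed`'s overhead from the computation itself.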
Output of the script:
Environment: