Severe performance implications of setting global OMP_NUM_THREADS=1 #185

Open
kftse-ust-hk opened this issue Mar 30, 2024 · 5 comments

kftse-ust-hk commented Mar 30, 2024

# Matching the behavior of torch.distributed.run: https://github.com/pytorch/pytorch/blob/v1.9.0/torch/distributed/run.py#L521-L532
if [ "${SLURM_STEP_NUM_TASKS:-1}" -gt "${SLURM_STEP_NUM_NODES:-1}" ] && ! grep -q "^OMP_NUM_THREADS=" "${ENROOT_ENVIRON}"; then
printf "OMP_NUM_THREADS=1\n" >> "${ENROOT_ENVIRON}"
fi

These lines are designed to "mimic" torch.distributed.run, but consider the following:

  • the variety of methods for running distributed training that do not depend on torch.distributed.run
  • data preprocessing on the fly with writeback / caching
  • a PyTorch container can end up never running torch.distributed.run
  • nproc reports 1 when OMP_NUM_THREADS=1, which does not reflect how many CPUs can actually be used

I would argue that

  1. The hook above is not a faithful match of the PyTorch behavior; the two pseudocode blocks below are very different.

    • export OMP_NUM_THREADS=1
      # do a bunch of stuff
      # run distributed training on some PyTorch-based framework

    • # do a bunch of stuff
      # run distributed training on some PyTorch-based framework
      if launched via torch.distributed.run and OMP_NUM_THREADS is unset: export OMP_NUM_THREADS=1
      
  2. OMP_NUM_THREADS=1 is a poor performance trade-off; various better trade-offs can be made, e.g. (see the sketch after this list):

    • OMP_DYNAMIC=true, which lets the OpenMP runtime decide how many threads to use (depending, for example, on the size of the data to process and the load of the system; in my experience, Intel's OpenMP runtime provides an acceptable trade-off if we don't want to think about how to utilize / preserve system resources)
    • calculate the value, or take it directly from the Slurm allocation:
      • OMP_NUM_THREADS=$((SLURM_JOB_CPUS_PER_NODE/SLURM_STEP_NUM_TASKS))
      • SLURM_CPUS_PER_TASK
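
A minimal sketch of the second option, reusing the structure of the hook quoted above (an illustration only, not a tested patch; it assumes Slurm exports SLURM_CPUS_PER_TASK for the step):

# Sketch: derive the thread count from the Slurm allocation instead of hard-coding 1,
# falling back to 1 only when Slurm does not provide SLURM_CPUS_PER_TASK.
if [ "${SLURM_STEP_NUM_TASKS:-1}" -gt "${SLURM_STEP_NUM_NODES:-1}" ] && ! grep -q "^OMP_NUM_THREADS=" "${ENROOT_ENVIRON}"; then
    printf "OMP_NUM_THREADS=%d\n" "${SLURM_CPUS_PER_TASK:-1}" >> "${ENROOT_ENVIRON}"
fi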

As there is no visible notification when torch.distributed.run is never used, I suggest removing these lines, or modifying them to a more sensible default that follows Slurm's resource allocation conventions.


flx42 commented Apr 1, 2024

Hello @kftse-ust-hk, this was discussed recently in #174

The same arguments still apply: internally, the feedback I got was that OMP_NUM_THREADS=1 is the safest option. But it's easy to override this value in your copy of the hook.
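
For example (a sketch only, not an official recommendation): since the hook skips its default whenever OMP_NUM_THREADS is already present in the container environment, and pyxis passes the host environment through by default, a job script can pre-set the value before launching; train.py is a placeholder for your own entry point.

# Export a value derived from the allocation so the hook's grep finds it and leaves
# it untouched (assumes --cpus-per-task was requested for the job).
export OMP_NUM_THREADS="${SLURM_CPUS_PER_TASK:-4}"
srun --container-image=nvcr.io#nvidia/pytorch:24.03-py3 python3 train.py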


kftse-ust-hk commented Apr 2, 2024

Hello @flx42, I would like to further bring to your attention:

  • the high percentage of affected jobs
  • the subtle but resource-wasting behavior
  • the inconsistency and confusion

caused by such a default, and to ask you to please reconsider the issue in that context.

There are quite a lot of ML researchers who have never written a line of C or OpenMP, have no idea what OMP_NUM_THREADS is, and do not know how their data loader + preprocessor (whose C++ implementation may well make use of OpenMP threading) might be affected by this default.

What users can feel is that training may be somewhat slower, and it is hard to quantify the difference without a comparison (e.g. against bare metal).

Many ML users are affected

We estimate that 15% of jobs on our NVIDIA SuperPOD-grade cluster silently had OMP_NUM_THREADS=1 set.

As we support three launch methods (bare metal, Apptainer, and enroot/pyxis), 15% of all jobs is a pretty high figure.

Subtle Resource Wasting

Clue 1: nproc reporting 1 regardless of how SLURM is set

According to man nproc (nproc --version: GNU coreutils 8.32):

nproc - print the number of processing units available
Print the number of processing units available to the current process, which may be less than the number of online processors

OMP_NUM_THREADS only governs OpenMP-spawned threads; it does not govern other threading / forking mechanisms (e.g. pthreads or subprocesses), nor proprietary library threading controlled by e.g. MKL_NUM_THREADS.

I am unable to evaluate how widely this nproc value propagates, but I suppose most users rely on nproc for the number of usable processors, instead of inspecting the taskset or cgroup of their shell PID.
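
For illustration, the discrepancy is easy to see from inside a job step (standard coreutils / util-linux tools assumed):

# nproc honors OMP_NUM_THREADS, so it reports 1 under this hook...
OMP_NUM_THREADS=1 nproc                      # -> 1
# ...while the CPU set actually granted by the scheduler can be much larger:
taskset -cp $$                               # affinity of the current shell
grep Cpus_allowed_list /proc/self/status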

Clue 2: Unable to fully utilize the "cores" users have requested and been allocated

We observe (without rigorous proof) that a PyTorch DataLoader with a large num_workers exhibits heavy context switching, with all workers summing to roughly 100% CPU.
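
A rough way to see the knock-on effect (a sketch; it assumes torch is importable and was built with OpenMP support, which is the usual case):

# With the injected default, PyTorch's intra-op thread pool typically collapses to one thread:
OMP_NUM_THREADS=1 python3 -c "import torch; print(torch.get_num_threads())"   # -> 1
python3 -c "import torch; print(torch.get_num_threads())"                     # -> usually the core count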

Inconsistency and confusion

I personally spent 4 hours testing different SLURM configurations, and all SLURM parameters / defaults, before finally concluding that the FROM base container image itself is part of the culprit.

And please note that this was already 2 months after I first suspected a performance issue, while performing a large-scale hands-on model training technical demonstration on both bare metal and containers as part of the system administration team.

Whether the variable gets injected depends on which base container the image is derived from, not on whether PyTorch is actually installed.
Here are 4 Dockerfiles that all result in PyTorch and NCCL being installed. I guess FROM ..../pytorch is a popular starting point among our users, and that contributes to the significant 15% of affected jobs.

Not working (OMP_NUM_THREADS=1 is silently injected):

FROM nvcr.io/nvidia/pytorch:24.03-py3
RUN pip install torch

All of the below are fine (no injection):

# Working
FROM nvcr.io/nvidia/cuda:12.3.2-devel-ubi8
RUN python3 -m pip install torch
# Working
FROM nvcr.io/nvidia/nvhpc:24.3-devel-cuda_multi-ubuntu22.04
RUN python3 -m pip install torch
# Working, the env is called PYTORCH_BUILD_VERSION
FROM nvcr.io/partners/gridai/pytorch-lightning:v1.4.0
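
For anyone who wants to reproduce this, a spot check along these lines should show whether a given image is affected (a sketch; it assumes the hook is installed site-wide, and the step needs more tasks than nodes for the hook's condition to trigger):

# Run a 2-task / 1-node step through pyxis and look for the injected value:
srun -N1 -n2 --container-image=nvcr.io#nvidia/pytorch:24.03-py3 \
     bash -c 'echo "rank ${SLURM_PROCID}: OMP_NUM_THREADS=${OMP_NUM_THREADS:-unset}"'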


flx42 commented Apr 2, 2024

do not know how their data loader + preprocessor (whose C++ implementation may well make use of OpenMP threading) might be affected by this default.

I'm curious, which data loader? As far as I know, the pytorch data loader is not impacted, and DALI isn't either.

nproc reporting 1 regardless of how SLURM is set

I really don't understand why nproc decided to do this. I'm not going to ask the coreutils maintainers to change it, but honestly I've never seen nproc used outside of:

$ make -j$(nproc)

I am unable to evaluate how widely this nproc value propagates, but I suppose most users rely on nproc for the number of usable processors, instead of inspecting the taskset or cgroup of their shell PID.

This is in fact a hard problem that is not handled correctly or fully by most applications / libraries; you can take a look at https://fosdem.org/2024/schedule/event/fosdem-2024-3033-libamicontained-a-low-level-library-for-reasoning-about-resource-restriction/
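
To illustrate, common ways of asking "how many CPUs do I have?" can all disagree inside a restricted job (illustrative commands only):

nproc                                                           # honors OMP_NUM_THREADS
getconf _NPROCESSORS_ONLN                                       # online CPUs, ignores affinity
python3 -c "import os; print(len(os.sched_getaffinity(0)))"     # size of the affinity mask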

We observe (without rigorous proof) that a PyTorch DataLoader with a large num_workers exhibits heavy context switching, with all workers summing to roughly 100% CPU.

If they sum to no more than 100%, it usually means that your job is actually restricted to one CPU core, either through CPU shares or through a CPU set. And if that's the case, then OMP_NUM_THREADS=1 is the best choice; it would be worse with more OpenMP threads (but again, I believe the pytorch data loader doesn't use OpenMP).
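
If in doubt, the CPU-shares side can be checked from inside the step as well (a sketch; cgroup v2 layout assumed, v1 uses cpu.cfs_quota_us / cpu.cfs_period_us instead):

cat "/sys/fs/cgroup$(cut -d: -f3 /proc/self/cgroup)/cpu.max"   # "max 100000" means no quota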


3XX0 commented Apr 2, 2024

such a default

BTW, this hook is not enabled by default; it's up to the administrator to understand the consequences of enabling it, and to modify it if needed for their environment.

@kftse-ust-hk

such a default

BTW, this hook is not enabled by default; it's up to the administrator to understand the consequences of enabling it, and to modify it if needed for their environment.

Unfortunately, it is delivered as-is to NVIDIA SuperPOD-grade machines, for reasons unknown to us.
