Severe performance implications of setting global OMP_NUM_THREADS=1 #185

Open
kftse-ust-hk opened this issue Mar 30, 2024 · 5 comments

kftse-ust-hk commented Mar 30, 2024

# Matching the behavior of torch.distributed.run: https://github.com/pytorch/pytorch/blob/v1.9.0/torch/distributed/run.py#L521-L532
if [ "${SLURM_STEP_NUM_TASKS:-1}" -gt "${SLURM_STEP_NUM_NODES:-1}" ] && ! grep -q "^OMP_NUM_THREADS=" "${ENROOT_ENVIRON}"; then
printf "OMP_NUM_THREADS=1\n" >> "${ENROOT_ENVIRON}"
fi

These lines are designed to "mimic" torch.distributed.run, but consider the following:

  • the variety of methods for running distributed training that do not depend on torch.distributed.run
  • data preprocessing on the fly with writeback / caching
  • a PyTorch container can end up never running torch.distributed.run
  • nproc reports 1 when OMP_NUM_THREADS=1, which does not reflect how many CPUs can actually be used

I would argue that

  1. The hook above is not a faithful match of the PyTorch behavior; the two pseudocode blocks below are very different.

    • export OMP_NUM_THREADS=1
      # do a bunch of stuff
      # run distributed training on some PyTorch-based framework

    • # do a bunch of stuff
      # run distributed training on some PyTorch-based framework
      if launched via torch.distributed.run and OMP_NUM_THREADS is unset: export OMP_NUM_THREADS=1
      
  2. OMP_NUM_THREADS=1 is a poor performance trade-off; various better trade-offs can be made, e.g. (see the sketch after this list):

    • OMP_DYNAMIC=true, which lets the OpenMP runtime decide how many threads to use (depending, for example, on the size of the data to process and the load of the system; in my experience, Intel's OpenMP runtime provides an acceptable trade-off if we don't want to think about how to utilize / preserve system resources)
    • calculate the value, or take it directly from the Slurm allocation:
      • OMP_NUM_THREADS=$((SLURM_JOB_CPUS_PER_NODE/SLURM_STEP_NUM_TASKS))
      • SLURM_CPUS_PER_TASK
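
A minimal sketch of the second option, reusing the structure of the hook quoted above (an illustration only, not a tested patch; it assumes Slurm exports SLURM_CPUS_PER_TASK for the step):

# Sketch: derive the thread count from the Slurm allocation instead of hard-coding 1,
# falling back to 1 only when Slurm does not provide SLURM_CPUS_PER_TASK.
if [ "${SLURM_STEP_NUM_TASKS:-1}" -gt "${SLURM_STEP_NUM_NODES:-1}" ] && ! grep -q "^OMP_NUM_THREADS=" "${ENROOT_ENVIRON}"; then
    printf "OMP_NUM_THREADS=%d\n" "${SLURM_CPUS_PER_TASK:-1}" >> "${ENROOT_ENVIRON}"
fi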

As there is no visible notification when torch.distributed.run is never used, I suggest removing these lines, or modifying them to a more sensible default that follows Slurm's resource allocation conventions.


flx42 commented Apr 1, 2024

Hello @kftse-ust-hk, this was discussed recently in #174

The same arguments still apply: internally, the feedback I got was that OMP_NUM_THREADS=1 is the safest option. But it's easy to override this value in your copy of the hook.
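
For example (a sketch only, not an official recommendation): since the hook skips its default whenever OMP_NUM_THREADS is already present in the container environment, and pyxis passes the host environment through by default, a job script can pre-set the value before launching; train.py is a placeholder for your own entry point.

# Export a value derived from the allocation so the hook's grep finds it and leaves
# it untouched (assumes --cpus-per-task was requested for the job).
export OMP_NUM_THREADS="${SLURM_CPUS_PER_TASK:-4}"
srun --container-image=nvcr.io#nvidia/pytorch:24.03-py3 python3 train.py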


kftse-ust-hk commented Apr 2, 2024

Hello @flx42, I would like to further bring to your attention:

  • the high percentage of affected jobs
  • the subtle but resource-wasting behavior
  • the inconsistency and confusion

caused by such a default, and to ask you to please reconsider the issue in that context.

There are quite a lot of ML researchers who have never written a line of C or OpenMP, have no idea what OMP_NUM_THREADS is, and do not know how their data loader + preprocessor (whose C++ implementation may well make use of OpenMP threading) might be affected by this default.

What users can feel is that training may be somewhat slower, and it is hard to quantify the difference without a comparison (e.g. against bare metal).

Many ML users are affected

We estimate that 15% of jobs on our NVIDIA SuperPOD-grade cluster silently had OMP_NUM_THREADS=1 set.

As we support three launch methods (bare metal, Apptainer, and enroot/pyxis), 15% of all jobs is a pretty high figure.

Subtle Resource Wasting

Clue 1: nproc reporting 1 regardless of how SLURM is set

According to man nproc (nproc --version: GNU coreutils 8.32):

nproc - print the number of processing units available
Print the number of processing units available to the current process, which may be less than the number of online processors

OMP_NUM_THREADS only governs OpenMP-spawned threads; it does not govern other threading / forking mechanisms (e.g. pthreads or subprocesses), nor proprietary library threading controlled by e.g. MKL_NUM_THREADS.

I am unable to evaluate how widely this nproc value propagates, but I suppose most users rely on nproc for the number of usable processors, instead of inspecting the taskset or cgroup of their shell PID.
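
For illustration, the discrepancy is easy to see from inside a job step (standard coreutils / util-linux tools assumed):

# nproc honors OMP_NUM_THREADS, so it reports 1 under this hook...
OMP_NUM_THREADS=1 nproc                      # -> 1
# ...while the CPU set actually granted by the scheduler can be much larger:
taskset -cp $$                               # affinity of the current shell
grep Cpus_allowed_list /proc/self/status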

Clue 2: Unable to fully utilize the "cores" users have requested and been allocated

We observe (without rigorous proof) that a PyTorch DataLoader with a large num_workers exhibits heavy context switching, with all workers summing to roughly 100% CPU.
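
A rough way to see the knock-on effect (a sketch; it assumes torch is importable and was built with OpenMP support, which is the usual case):

# With the injected default, PyTorch's intra-op thread pool typically collapses to one thread:
OMP_NUM_THREADS=1 python3 -c "import torch; print(torch.get_num_threads())"   # -> 1
python3 -c "import torch; print(torch.get_num_threads())"                     # -> usually the core count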

Inconsistency and confusion

I personally spent 4 hours testing different SLURM configurations, and all SLURM parameters / defaults, before finally concluding that the FROM base container image itself is part of the culprit.

And please note that this was already 2 months after I first suspected a performance issue, while performing a large-scale hands-on model training technical demonstration on both bare metal and containers as part of the system administration team.

Whether the variable gets injected depends on which base container the image is derived from, not on whether PyTorch is actually installed.
Here are 4 Dockerfiles that all result in PyTorch and NCCL being installed. I guess FROM ..../pytorch is a popular starting point among our users, and that contributes to the significant 15% of affected jobs.

Not working (OMP_NUM_THREADS=1 is silently injected):

FROM nvcr.io/nvidia/pytorch:24.03-py3
RUN pip install torch

All of the below are fine (no injection):

# Working
FROM nvcr.io/nvidia/cuda:12.3.2-devel-ubi8
RUN python3 -m pip install torch
# Working
FROM nvcr.io/nvidia/nvhpc:24.3-devel-cuda_multi-ubuntu22.04
RUN python3 -m pip install torch
# Working, the env is called PYTORCH_BUILD_VERSION
FROM nvcr.io/partners/gridai/pytorch-lightning:v1.4.0
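
For anyone who wants to reproduce this, a spot check along these lines should show whether a given image is affected (a sketch; it assumes the hook is installed site-wide, and the step needs more tasks than nodes for the hook's condition to trigger):

# Run a 2-task / 1-node step through pyxis and look for the injected value:
srun -N1 -n2 --container-image=nvcr.io#nvidia/pytorch:24.03-py3 \
     bash -c 'echo "rank ${SLURM_PROCID}: OMP_NUM_THREADS=${OMP_NUM_THREADS:-unset}"'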


flx42 commented Apr 2, 2024

do not know how their data loader + preprocessor (whose C++ implementation may well make use of OpenMP threading) might be affected by this default.

I'm curious, which data loader? As far as I know, the pytorch data loader is not impacted, and DALI isn't either.

nproc reporting 1 regardless of how SLURM is set

I really don't understand why nproc decided to do this. I'm not going to ask the coreutils maintainers to change it, but honestly I've never seen nproc used outside of:

$ make -j$(nproc)

I am unable to evaluate how widely this nproc value propagates, but I suppose most users rely on nproc for the number of usable processors, instead of inspecting the taskset or cgroup of their shell PID.

This is in fact a hard problem that is not handled correctly or fully by most applications / libraries; you can take a look at https://fosdem.org/2024/schedule/event/fosdem-2024-3033-libamicontained-a-low-level-library-for-reasoning-about-resource-restriction/
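
To illustrate, common ways of asking "how many CPUs do I have?" can all disagree inside a restricted job (illustrative commands only):

nproc                                                           # honors OMP_NUM_THREADS
getconf _NPROCESSORS_ONLN                                       # online CPUs, ignores affinity
python3 -c "import os; print(len(os.sched_getaffinity(0)))"     # size of the affinity mask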

We observe (without rigorous proof) that a PyTorch DataLoader with a large num_workers exhibits heavy context switching, with all workers summing to roughly 100% CPU.

If they sum to no more than 100%, it usually means that your job is actually restricted to one CPU core, either through CPU shares or through a CPU set. And if that's the case, then OMP_NUM_THREADS=1 is the best choice; it would be worse with more OpenMP threads (but again, I believe the pytorch data loader doesn't use OpenMP).
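
If in doubt, the CPU-shares side can be checked from inside the step as well (a sketch; cgroup v2 layout assumed, v1 uses cpu.cfs_quota_us / cpu.cfs_period_us instead):

cat "/sys/fs/cgroup$(cut -d: -f3 /proc/self/cgroup)/cpu.max"   # "max 100000" means no quota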


3XX0 commented Apr 2, 2024

such a default

BTW, this hook is not enabled by default; it's up to the administrator to understand the consequences of enabling it, and to modify it if needed for their environment.

@kftse-ust-hk

such a default

BTW, this hook is not enabled by default; it's up to the administrator to understand the consequences of enabling it, and to modify it if needed for their environment.

Unfortunately, it is delivered as-is to NVIDIA SuperPOD-grade machines, for reasons unknown to us.
