
Receiving error: The number of required GPUs exceeds the total number of available GPUs in the placement group #1044

Open
saurabhbbjain opened this issue Oct 22, 2024 · 2 comments
@saurabhbbjain

I am executing the ifeval_like_data.py file with 8 A100 GPUs and receiving the following error:

```
[10/21/24 06:53:04] ERROR ['distilabel.pipeline'] ❌ Failed to load step 'i_f_eval_kwargs_assignator_0':            local.py:302
                          Step load failed: The number of required GPUs exceeds the total number of available
                          GPUs in the placement group.
                          For further information visit 'https://distilabel.argilla.io/latest/api/pipeline/step_wrapper'
[10/21/24 06:53:05] ERROR ['distilabel.pipeline'] ❌ Failed to load step 'i_f_eval_instruction_id_list_assignator_0': local.py:302
                          Step load failed: The number of required GPUs exceeds the total number of available
                          GPUs in the placement group.
                          For further information visit 'https://distilabel.argilla.io/latest/api/pipeline/step_wrapper'
                    ERROR ['distilabel.pipeline'] ❌ Failed to load step 'magpie_generator_0':                      local.py:302
                          Step load failed: The number of required GPUs exceeds the total number of available
                          GPUs in the placement group.
                          For further information visit 'https://distilabel.argilla.io/latest/api/pipeline/step_wrapper'
                    ERROR ['distilabel.pipeline'] ❌ Failed to load all the steps of stage 0                        base.py:1201

*** SIGTERM received at time=1729518785 on cpu 126 ***
*** SIGTERM received at time=1729518785 on cpu 62 ***
*** SIGTERM received at time=1729518785 on cpu 195 ***
PC: @ 0x5a9437 (unknown) _PyEval_EvalFrameDefault
    @ 0x7ffff7e0f090 (unknown) (unknown)
    @ ... and at least 3 more frames
[2024-10-21 06:53:05,994 E 262 262] logging.cc:440: *** SIGTERM received at time=1729518785 on cpu 62 ***
[2024-10-21 06:53:05,994 E 262 262] logging.cc:440: PC: @ 0x5a9437 (unknown) _PyEval_EvalFrameDefault
[2024-10-21 06:53:05,994 E 262 262] logging.cc:440:     @ 0x7ffff7e0f090 (unknown) (unknown)
[2024-10-21 06:53:05,994 E 262 262] logging.cc:440:     @ ... and at least 3 more frames
PC: @ 0x5f9269 (unknown) _PyObject_GetMethod
PC: @ 0x5a96dc (unknown) _PyEval_EvalFrameDefault
    @ 0x7ffff7e0f090 72985216 (unknown)
    @ 0x7ffff7e0f090 (unknown) (unknown)
    @ ... and at least 4 more frames
[2024-10-21 06:53:05,994 E 260 260] logging.cc:440: *** SIGTERM received at time=1729518785 on cpu 126 ***
[2024-10-21 06:53:05,995 E 260 260] logging.cc:440: PC: @ 0x5a96dc (unknown) _PyEval_EvalFrameDefault
[2024-10-21 06:53:05,995 E 260 260] logging.cc:440:     @ 0x7ffff7e0f090 (unknown) (unknown)
[2024-10-21 06:53:05,995 E 260 260] logging.cc:440:     @ ... and at least 4 more frames
    @ 0x94eca0 (unknown) (unknown)
[2024-10-21 06:53:06,000 E 261 261] logging.cc:440: *** SIGTERM received at time=1729518785 on cpu 195 ***
[2024-10-21 06:53:06,000 E 261 261] logging.cc:440: PC: @ 0x5f9269 (unknown) _PyObject_GetMethod
[2024-10-21 06:53:06,004 E 261 261] logging.cc:440:     @ 0x7ffff7e0f090 72985216 (unknown)
[2024-10-21 06:53:06,009 E 261 261] logging.cc:440:     @ 0x94eca0 (unknown) (unknown)

╭──────────────────────────────── locals ────────────────────────────────╮
│              dataset = None                                            │
│             distiset = None                                            │
│     logging_handlers = None                                            │
│              manager = <multiprocessing.managers.SyncManager object   │
│                        at 0x7ffe41227f40>                              │
│        num_processes = 3                                               │
│           parameters = None                                            │
│                 pool = <distilabel.pipeline.local._NoDaemonPool        │
│                        state=TERMINATE pool_size=3>                    │
│                 self = <distilabel.pipeline.local.Pipeline object      │
│                        at 0x7ffe46a00df0>                              │
│   storage_parameters = None                                            │
│            use_cache = False                                           │
│  use_fs_to_pass_data = False                                           │
╰────────────────────────────────────────────────────────────────────────╯
RuntimeError: Failed to load all the steps. Could not run pipeline.
Exception in thread Thread-1:
Traceback (most recent call last):
  File "/usr/lib/python3.9/threading.py", line 980, in _bootstrap_inner
```

I cannot figure out why I am receiving this error despite providing 8 GPUs. I am using the Llama-3.2-1B-Instruct model.

@plaguss (Contributor) commented Oct 22, 2024

Hi @saurabhbbjain, how are you running the pipeline? It's a pipeline that uses Ray; we normally run these on a Slurm cluster that manages Ray, as shown here in the docs. I'll let @gabrielmbmb answer in case he knows how the pipeline was run.
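For context, the Slurm-managed-Ray pattern referred to above typically boils down to a batch script that starts a Ray runtime and then launches the pipeline script against it. A minimal hypothetical sketch (a config fragment; job name, port, and resource values are placeholders, not taken from the distilabel docs):

```shell
#!/bin/bash
#SBATCH --job-name=distilabel-pipeline   # placeholder job name
#SBATCH --nodes=1
#SBATCH --gres=gpu:8                     # request all 8 GPUs on the node

# Start a Ray head process on this node, run the pipeline, then shut Ray down.
ray start --head --port=6379
python ifeval_like_data.py
ray stop
```

On a single machine, though, none of this is needed: the plain `Pipeline` can manage the GPUs directly, which is what the next comment suggests.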

@gabrielmbmb (Member) commented Oct 22, 2024

Hi @saurabhbbjain, the original code uses 8 GPUs per step because it runs vLLM with tensor_parallel_size=8, and, as @plaguss mentions, it also uses RayPipeline, which you don't need in a single-machine setup. I've updated the pipeline to work with your setup (haven't tested):

  1. We remove the .ray() call so the pipeline uses Pipeline instead of RayPipeline.
  2. We update the tensor_parallel_size in all the vLLM instances of the pipeline: MagpieGenerator will use 4 GPUs and the other two steps 2 GPUs each.

https://gist.github.com/gabrielmbmb/2df9a1041a649783efb3c3cf0ffb1376
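The constraint behind the original error is simple arithmetic: the GPUs requested by all steps loaded in a stage must fit within the total available, and three steps at tensor_parallel_size=8 request 24 GPUs on an 8-GPU node. A tiny sketch of that budget check (a hypothetical helper for illustration, not part of distilabel):

```python
def fits_gpu_budget(step_gpus, total_gpus):
    """Return True if the per-step GPU requests fit in the available total."""
    return sum(step_gpus.values()) <= total_gpus

# Original config: three steps, each running vLLM with tensor_parallel_size=8.
original = {
    "magpie_generator_0": 8,
    "i_f_eval_kwargs_assignator_0": 8,
    "i_f_eval_instruction_id_list_assignator_0": 8,
}

# Suggested fix: 4 GPUs for the generator, 2 for each assignator (4 + 2 + 2 = 8).
fixed = {
    "magpie_generator_0": 4,
    "i_f_eval_kwargs_assignator_0": 2,
    "i_f_eval_instruction_id_list_assignator_0": 2,
}

print(fits_gpu_budget(original, 8))  # False: 24 GPUs requested, 8 available
print(fits_gpu_budget(fixed, 8))     # True: 8 GPUs requested, 8 available
```

The same reasoning applies to any stage: keep the sum of tensor_parallel_size values across concurrently loaded steps at or below the node's GPU count.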
