I am executing the `ifeval_like_data.py` file on 8 A100 GPUs and receiving the following error:
```
[10/21/24 06:53:04] ERROR ['distilabel.pipeline'] ❌ Failed to load step 'i_f_eval_kwargs_assignator_0': Step load failed: The number of required GPUs exceeds the total number of available GPUs in the placement group. For further information visit 'https://distilabel.argilla.io/latest/api/pipeline/step_wrapper'  local.py:302
[10/21/24 06:53:05] ERROR ['distilabel.pipeline'] ❌ Failed to load step 'i_f_eval_instruction_id_list_assignator_0': Step load failed: The number of required GPUs exceeds the total number of available GPUs in the placement group. For further information visit 'https://distilabel.argilla.io/latest/api/pipeline/step_wrapper'  local.py:302
ERROR ['distilabel.pipeline'] ❌ Failed to load step 'magpie_generator_0': Step load failed: The number of required GPUs exceeds the total number of available GPUs in the placement group. For further information visit 'https://distilabel.argilla.io/latest/api/pipeline/step_wrapper'  local.py:302
ERROR ['distilabel.pipeline'] ❌ Failed to load all the steps of stage 0  base.py:1201
*** SIGTERM received at time=1729518785 on cpu 126 ***
*** SIGTERM received at time=1729518785 on cpu 62 ***
*** SIGTERM received at time=1729518785 on cpu 195 ***
PC: @ 0x5a9437 (unknown) _PyEval_EvalFrameDefault
    @ 0x7ffff7e0f090 (unknown) (unknown)
    @ ... and at least 3 more frames
[2024-10-21 06:53:05,994 E 262 262] logging.cc:440: *** SIGTERM received at time=1729518785 on cpu 62 ***
[2024-10-21 06:53:05,994 E 262 262] logging.cc:440: PC: @ 0x5a9437 (unknown) _PyEval_EvalFrameDefault
[2024-10-21 06:53:05,994 E 262 262] logging.cc:440:     @ 0x7ffff7e0f090 (unknown) (unknown)
[2024-10-21 06:53:05,994 E 262 262] logging.cc:440:     @ ... and at least 3 more frames
PC: @ 0x5f9269 (unknown) _PyObject_GetMethod
PC: @ 0x5a96dc (unknown) _PyEval_EvalFrameDefault
    @ 0x7ffff7e0f090 72985216 (unknown)
    @ 0x7ffff7e0f090 (unknown) (unknown)
    @ ... and at least 4 more frames
[2024-10-21 06:53:05,994 E 260 260] logging.cc:440: *** SIGTERM received at time=1729518785 on cpu 126 ***
[2024-10-21 06:53:05,995 E 260 260] logging.cc:440: PC: @ 0x5a96dc (unknown) _PyEval_EvalFrameDefault
[2024-10-21 06:53:05,995 E 260 260] logging.cc:440:     @ 0x7ffff7e0f090 (unknown) (unknown)
[2024-10-21 06:53:05,995 E 260 260] logging.cc:440:     @ ... and at least 4 more frames
    @ 0x94eca0 (unknown) (unknown)
[2024-10-21 06:53:06,000 E 261 261] logging.cc:440: *** SIGTERM received at time=1729518785 on cpu 195 ***
[2024-10-21 06:53:06,000 E 261 261] logging.cc:440: PC: @ 0x5f9269 (unknown) _PyObject_GetMethod
[2024-10-21 06:53:06,004 E 261 261] logging.cc:440:     @ 0x7ffff7e0f090 72985216 (unknown)
[2024-10-21 06:53:06,009 E 261 261] logging.cc:440:     @ 0x94eca0 (unknown) (unknown)
locals:
    dataset             = None
    distiset            = None
    logging_handlers    = None
    manager             = <multiprocessing.managers.SyncManager object at 0x7ffe41227f40>
    num_processes       = 3
    parameters          = None
    pool                = <distilabel.pipeline.local._NoDaemonPool state=TERMINATE pool_size=3>
    self                = <distilabel.pipeline.local.Pipeline object at 0x7ffe46a00df0>
    storage_parameters  = None
    use_cache           = False
    use_fs_to_pass_data = False
RuntimeError: Failed to load all the steps. Could not run pipeline.
Exception in thread Thread-1:
Traceback (most recent call last):
  File "/usr/lib/python3.9/threading.py", line 980, in _bootstrap_inner
```
I cannot figure out why I am receiving this error despite providing 8 GPUs. I am using the Llama-3.2-1B-Instruct model.
Hi @saurabhbbjain, how are you running the pipeline? It's a pipeline using Ray; we normally run these on a Slurm cluster that manages Ray, as shown in the docs. I'll let @gabrielmbmb answer in case he has access to how the pipeline was run.
Hi @saurabhbbjain, the original code uses 8 GPUs per step because it runs vLLM with `tensor_parallel_size=8`, and, as @plaguss mentions, it also uses the `RayPipeline`, which you don't need in a single-machine setup. I've updated the pipeline to work with your setup (haven't tested it):
- Remove the `.ray()` call to use the `Pipeline` instead of the `RayPipeline`.
- Update the `tensor_parallel_size` in all the `vLLM`s of the pipeline: `MagpieGenerator` will use 4 GPUs, and the other two steps 2 GPUs each.
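The arithmetic behind the failure, and why the suggested split fits, can be sketched as follows (the helper below is hypothetical; distilabel/Ray performs an equivalent check internally when building the placement group):

```python
# Each step's vLLM reserves tensor_parallel_size GPUs, and the sum of the
# reservations across all steps must fit in the GPUs available on the machine.
def fits_placement_group(gpus_per_step, available_gpus):
    """Return True if all steps' GPU requests fit on the machine."""
    return sum(gpus_per_step) <= available_gpus

# Original pipeline: three steps, each with tensor_parallel_size=8,
# so 24 GPUs are requested but only 8 are available -> the error above.
print(fits_placement_group([8, 8, 8], 8))  # False

# Suggested split: MagpieGenerator on 4 GPUs, the other two steps on 2 each,
# for exactly 8 GPUs requested in total.
print(fits_placement_group([4, 2, 2], 8))  # True
```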