PP-related issues #771
Labels: bug, release_blocking
I found the issues below while debugging FSDP + CP + PP loss convergence. I used a seed checkpoint, on the debug model, with at most 8 GPUs.
- `training.deterministic`: NaN within 5 steps (gone after [Pipelining] Fix PP grad scaling pytorch#144352); see the sketch after this list for what this flag typically enables
- `training.mixed_precision_param = "float32"`: see the error log below
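As referenced in the first item, here is a minimal sketch of the usual PyTorch knobs a `training.deterministic`-style switch turns on. This is not torchtitan's actual implementation, just context for what that run had enabled:

```python
# Generic determinism setup; hypothetical helper, not torchtitan's own code.
import os
import torch

def enable_determinism(seed: int = 0) -> None:
    # cuBLAS needs a fixed workspace config for deterministic GEMMs.
    os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"
    torch.manual_seed(seed)
    # Error out on (or avoid) nondeterministic kernels.
    torch.use_deterministic_algorithms(True)
    # Disable cuDNN autotuning, which can pick different kernels run to run.
    torch.backends.cudnn.benchmark = False
```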
Error log:

```
Traceback (most recent call last):
  File "/home/lty/local/pytorch/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper
    return f(*args, **kwargs)
  File "/home/lty/local/torchtitan/train.py", line 287, in main
    pp_schedule.step(input_ids)
  File "/home/lty/local/pytorch/torch/distributed/pipelining/schedules.py", line 503, in step
    self._step_microbatches(args_split, kwargs_split, targets_split, losses)
  File "/home/lty/local/pytorch/torch/distributed/pipelining/schedules.py", line 671, in _step_microbatches
    self._initialize_stage(arg_mbs[0], kwarg_mbs[0])
  File "/home/lty/local/pytorch/torch/distributed/pipelining/schedules.py", line 473, in _initialize_stage
    self._stage._prepare_forward_infra(self._n_microbatches, args, kwargs)
  File "/home/lty/local/pytorch/torch/distributed/pipelining/stage.py", line 1421, in _prepare_forward_infra
    outputs = self._shape_inference(args, kwargs)
  File "/home/lty/local/pytorch/torch/distributed/pipelining/stage.py", line 1362, in _shape_inference
    outputs = self.submod(*args, **kwargs)
  File "/home/lty/local/pytorch/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/lty/local/pytorch/torch/nn/modules/module.py", line 1845, in _call_impl
    return inner()
  File "/home/lty/local/pytorch/torch/nn/modules/module.py", line 1793, in inner
    result = forward_call(*args, **kwargs)
  File "/home/lty/local/torchtitan/torchtitan/models/llama/model.py", line 442, in forward
    h = layer(h, self.freqs_cis)
  File "/home/lty/local/pytorch/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/lty/local/pytorch/torch/nn/modules/module.py", line 1845, in _call_impl
    return inner()
  File "/home/lty/local/pytorch/torch/nn/modules/module.py", line 1793, in inner
    result = forward_call(*args, **kwargs)
  File "/home/lty/local/torchtitan/torchtitan/models/llama/model.py", line 323, in forward
    h = x + self.attention(self.attention_norm(x), freqs_cis)
  File "/home/lty/local/pytorch/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/lty/local/pytorch/torch/nn/modules/module.py", line 1750, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/lty/local/torchtitan/torchtitan/models/llama/model.py", line 209, in forward
    output = F.scaled_dot_product_attention(xq, xk, xv, is_causal=True)
  File "/home/lty/local/pytorch/torch/distributed/tensor/experimental/_attention.py", line 907, in inner_fn
    output = target_fn(*args, **kwargs)
  File "/home/lty/local/pytorch/torch/_compile.py", line 32, in inner
    return disable_fn(*args, **kwargs)
  File "/home/lty/local/pytorch/torch/_dynamo/eval_frame.py", line 751, in _fn
    return fn(*args, **kwargs)
  File "/home/lty/local/pytorch/torch/distributed/tensor/_api.py", line 343, in __torch_dispatch__
    return DTensor._op_dispatcher.dispatch(
  File "/home/lty/local/pytorch/torch/distributed/tensor/_dispatch.py", line 164, in dispatch
    return self._custom_op_handlers[op_call](op_call, args, kwargs)  # type: ignore[operator]
  File "/home/lty/local/pytorch/torch/distributed/tensor/experimental/_attention.py", line 555, in _sdpa_handler
    local_results = _scaled_dot_product_ring_efficient_attention(
  File "/home/lty/local/pytorch/torch/distributed/tensor/experimental/_attention.py", line 239, in _scaled_dot_product_ring_efficient_attention
    raise NotImplementedError("compute_log_sumexp must be set")
NotImplementedError: compute_log_sumexp must be set
```
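For context on the `mixed_precision_param = "float32"` failure: the ring-attention CP handler needs each rank's attention logsumexp to combine partial results across ranks, but the mem-efficient SDPA kernel only computes it when asked, which (as far as I can tell) happens only when an input requires grad. PP shape inference apparently runs the forward without grads, so the handler sees `compute_log_sumexp=False` and raises. Below is a minimal sketch of that kernel behaviour, calling the aten op directly on a CUDA device; it is my own illustration, not a repro of this issue:

```python
# Illustration only: why the CP ring-attention path insists on compute_log_sumexp.
# Assumes a CUDA device; the mem-efficient SDPA kernel is CUDA-only.
import torch

q, k, v = (torch.randn(2, 4, 128, 64, device="cuda") for _ in range(3))

# Dispatch path taken when no input requires grad (e.g. under torch.no_grad()):
# the kernel skips logsumexp and returns an empty tensor for it, so there is
# nothing for context parallel to merge across ranks.
out, lse, _, _ = torch.ops.aten._scaled_dot_product_efficient_attention(
    q, k, v, None, compute_log_sumexp=False, dropout_p=0.0, is_causal=True
)
print(lse.numel())  # 0: no logsumexp was computed

# With compute_log_sumexp=True the kernel also returns a per-query logsumexp,
# which is what the ring-attention merge needs.
out, lse, _, _ = torch.ops.aten._scaled_dot_product_efficient_attention(
    q, k, v, None, compute_log_sumexp=True, dropout_p=0.0, is_causal=True
)
print(lse.shape)  # non-empty: one value per (batch, head, query position)
```

With bf16 mixed precision the flash backend (which always produces logsumexp) is presumably selected instead, which would explain why only the float32 run hits this.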