
[fix] SDP syncing buffers during gradient accumulation #1049

Closed
wants to merge 1 commit

Conversation

@blefaudeux (Contributor) commented Jul 29, 2022

What does this PR do?

Fixes #1041. I just had a minute or two, hoping that it's enough :)
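
For reference, a minimal sketch (not taken from this PR) of the gradient-accumulation pattern the linked issue describes, assuming fairscale's OSS optimizer and ShardedDataParallel wrapper with its no_sync() context manager; the model, loader, and hyperparameters are illustrative:

import torch
import torch.nn.functional as F
from fairscale.optim import OSS
from fairscale.nn.data_parallel import ShardedDataParallel as SDP

def train(model: torch.nn.Module, loader, accumulation_steps: int = 4):
    # OSS shards the optimizer state across ranks; SDP handles gradient reduction
    # and (by default) buffer broadcasting.
    optimizer = OSS(model.parameters(), optim=torch.optim.Adam, lr=1e-3)
    sdp_model = SDP(model, optimizer)

    for step, (inputs, targets) in enumerate(loader):
        if (step + 1) % accumulation_steps != 0:
            # Accumulation step: gradients stay local, and per the linked issue's
            # title, buffer syncing during these steps is what this fix targets.
            with sdp_model.no_sync():
                F.cross_entropy(sdp_model(inputs), targets).backward()
        else:
            # Boundary step: reduce gradients and take an optimizer step.
            F.cross_entropy(sdp_model(inputs), targets).backward()
            optimizer.step()
            sdp_model.zero_grad()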

Before submitting

  • Did you have fun?
    • Make sure you had fun coding 🙃
  • Did you read the contributor guideline?
  • Was this discussed/approved via a GitHub issue? (not needed for typos or doc improvements)
    • N/A
  • Did you make sure to update the docs?
    • N/A
  • Did you write any new necessary tests?
    • N/A
  • Did you update the changelog? (if needed)
    • N/A

PR review

Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in GitHub issues, there's a high chance it will not be merged.

@facebook-github-bot added the "CLA Signed" label (this label is managed by the Facebook bot; authors need to sign the CLA before a PR can be reviewed) on Jul 29, 2022
@blefaudeux (Contributor, Author) commented

cc @min-xu-ai

@min-xu-ai (Contributor) left a comment


lgtm!

@blefaudeux (Contributor, Author) commented

I haven't followed the CI changes on Fairscale, @min-xu-ai. It looks like the breakages are unrelated to this PR (though they do touch OSS) but are PyTorch-version dependent?

@min-xu-ai (Contributor) commented

I triggered a rerun. The CPU test failure seems unrelated, but the GPU test failure seems to be in OSS?

torch.multiprocessing.spawn.ProcessRaisedException: 

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/home/circleci/venv/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
    fn(i, *args)
  File "/home/circleci/fairscale/tests/optim/test_oss.py", line 956, in run_ddp_parity
    check_optimizer_equivalence(opt, change_train_graph=change_train_graph)
  File "/home/circleci/fairscale/tests/optim/test_oss.py", line 947, in check_optimizer_equivalence
    check_step()
  File "/home/circleci/fairscale/tests/optim/test_oss.py", line 912, in check_step
    loss_sharded_optim = cast(torch.Tensor, sharded_optimizer.step(closure=closure_sharded))
  File "/home/circleci/venv/lib/python3.9/site-packages/torch/optim/optimizer.py", line 109, in wrapper
    return func(*args, **kwargs)
  File "/home/circleci/fairscale/fairscale/optim/oss.py", line 232, in step
    loss = self.optim.step(closure=closure, **kwargs)  # type: ignore
  File "/home/circleci/venv/lib/python3.9/site-packages/torch/optim/optimizer.py", line 109, in wrapper
    return func(*args, **kwargs)
  File "/home/circleci/venv/lib/python3.9/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/home/circleci/venv/lib/python3.9/site-packages/torch/optim/adam.py", line 157, in step
    adam(params_with_grad,
  File "/home/circleci/venv/lib/python3.9/site-packages/torch/optim/adam.py", line 213, in adam
    func(params,
  File "/home/circleci/venv/lib/python3.9/site-packages/torch/optim/adam.py", line 255, in _single_tensor_adam
    assert not step_t.is_cuda, "If capturable=False, state_steps should not be CUDA tensors."
AssertionError: If capturable=False, state_steps should not be CUDA tensors.
change_train_graph = True, backend = 'nccl', broadcast_fp16 = False
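
This assertion is the PyTorch 1.12 Adam "capturable" check: it fires when the optimizer's "step" state tensors end up on CUDA while capturable=False, which can happen once the optimizer state has been moved to the GPU. A minimal workaround sketch (not part of this PR), reusing the sharded_optimizer variable and the .optim attribute visible in the traceback above:

# Assumption: sharded_optimizer is the fairscale OSS instance from the test above,
# and sharded_optimizer.optim is the wrapped torch.optim.Adam (per the traceback).
# Marking each param group as capturable lets Adam accept CUDA "step" tensors;
# a later PyTorch 1.12 patch release reportedly relaxes this check as well.
for group in sharded_optimizer.optim.param_groups:
    group["capturable"] = True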

@min-xu-ai (Contributor) commented

It is interesting that all the failures seem to be the same. What's the best way for me to run your branch? Do I git remote add your repo and branch?

@ruanslv (Contributor) commented Aug 4, 2022

These OSS tests are failing on the main branch (and in other unrelated PRs too). I agree that it's probably related to the recent PyTorch upgrade, as I don't remember them failing last week.

I'm seeing all of the failures from here in an FSDP PR (#1052), including the CPU one and the "test_parity3d_checkpoint_syncbn" one.

@min-xu-ai (Contributor) commented

Somehow this PR is hitting unrelated test errors. I am replacing it with #1075.

Thanks Ben!

@min-xu-ai min-xu-ai closed this Sep 23, 2022
Labels
CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

SDP sync_buffers() during gradient accumulation
4 participants