
[fix] SDP syncing buffers during gradient accumulation #1049

Closed
wants to merge 1 commit

Conversation

@blefaudeux (Contributor) commented Jul 29, 2022

What does this PR do?

Fixes #1041. I just had a minute or two, hoping that it's enough :)
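
For reference, a minimal sketch (not taken from this PR) of the gradient-accumulation pattern the linked issue describes, assuming fairscale's OSS optimizer and ShardedDataParallel wrapper with its no_sync() context manager; the model, loader, and hyperparameters are illustrative:

import torch
import torch.nn.functional as F
from fairscale.optim import OSS
from fairscale.nn.data_parallel import ShardedDataParallel as SDP

def train(model: torch.nn.Module, loader, accumulation_steps: int = 4):
    # OSS shards the optimizer state across ranks; SDP handles gradient reduction
    # and (by default) buffer broadcasting.
    optimizer = OSS(model.parameters(), optim=torch.optim.Adam, lr=1e-3)
    sdp_model = SDP(model, optimizer)

    for step, (inputs, targets) in enumerate(loader):
        if (step + 1) % accumulation_steps != 0:
            # Accumulation step: gradients stay local, and per the linked issue's
            # title, buffer syncing during these steps is what this fix targets.
            with sdp_model.no_sync():
                F.cross_entropy(sdp_model(inputs), targets).backward()
        else:
            # Boundary step: reduce gradients and take an optimizer step.
            F.cross_entropy(sdp_model(inputs), targets).backward()
            optimizer.step()
            sdp_model.zero_grad()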

Before submitting

  • Did you have fun?
    • Make sure you had fun coding 🙃
  • Did you read the contributor guideline?
  • Was this discussed/approved via a GitHub issue? (not needed for typos or doc improvements)
    • N/A
  • Did you make sure to update the docs?
    • N/A
  • Did you write any new necessary tests?
    • N/A
  • Did you update the changelog? (if needed)
    • N/A

PR review

Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in GitHub issues, there's a high chance it will not be merged.

@facebook-github-bot added the "CLA Signed" label (this label is managed by the Facebook bot; authors need to sign the CLA before a PR can be reviewed) on Jul 29, 2022
@blefaudeux (Contributor, Author) commented

cc @min-xu-ai

@min-xu-ai (Contributor) left a comment


lgtm!

@blefaudeux (Contributor, Author) commented

I haven't followed the CI changes on Fairscale, @min-xu-ai. It looks like the breakages are unrelated to this PR (though they do touch OSS) but are PyTorch-version dependent?

@min-xu-ai (Contributor) commented

I triggered a rerun. The CPU test failure seems unrelated, but the GPU test failure seems to be in OSS?

torch.multiprocessing.spawn.ProcessRaisedException: 

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/home/circleci/venv/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
    fn(i, *args)
  File "/home/circleci/fairscale/tests/optim/test_oss.py", line 956, in run_ddp_parity
    check_optimizer_equivalence(opt, change_train_graph=change_train_graph)
  File "/home/circleci/fairscale/tests/optim/test_oss.py", line 947, in check_optimizer_equivalence
    check_step()
  File "/home/circleci/fairscale/tests/optim/test_oss.py", line 912, in check_step
    loss_sharded_optim = cast(torch.Tensor, sharded_optimizer.step(closure=closure_sharded))
  File "/home/circleci/venv/lib/python3.9/site-packages/torch/optim/optimizer.py", line 109, in wrapper
    return func(*args, **kwargs)
  File "/home/circleci/fairscale/fairscale/optim/oss.py", line 232, in step
    loss = self.optim.step(closure=closure, **kwargs)  # type: ignore
  File "/home/circleci/venv/lib/python3.9/site-packages/torch/optim/optimizer.py", line 109, in wrapper
    return func(*args, **kwargs)
  File "/home/circleci/venv/lib/python3.9/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/home/circleci/venv/lib/python3.9/site-packages/torch/optim/adam.py", line 157, in step
    adam(params_with_grad,
  File "/home/circleci/venv/lib/python3.9/site-packages/torch/optim/adam.py", line 213, in adam
    func(params,
  File "/home/circleci/venv/lib/python3.9/site-packages/torch/optim/adam.py", line 255, in _single_tensor_adam
    assert not step_t.is_cuda, "If capturable=False, state_steps should not be CUDA tensors."
AssertionError: If capturable=False, state_steps should not be CUDA tensors.
change_train_graph = True, backend = 'nccl', broadcast_fp16 = False
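
This assertion is the PyTorch 1.12 Adam "capturable" check: it fires when the optimizer's "step" state tensors end up on CUDA while capturable=False, which can happen once the optimizer state has been moved to the GPU. A minimal workaround sketch (not part of this PR), reusing the sharded_optimizer variable and the .optim attribute visible in the traceback above:

# Assumption: sharded_optimizer is the fairscale OSS instance from the test above,
# and sharded_optimizer.optim is the wrapped torch.optim.Adam (per the traceback).
# Marking each param group as capturable lets Adam accept CUDA "step" tensors;
# a later PyTorch 1.12 patch release reportedly relaxes this check as well.
for group in sharded_optimizer.optim.param_groups:
    group["capturable"] = True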

@min-xu-ai (Contributor) commented

It is interesting that all the failures seem to be the same. What's the best way for me to run your branch? Do I git remote add your repo and branch?

@ruanslv (Contributor) commented Aug 4, 2022

These OSS tests are failing on the main branch (and in other unrelated PRs too). I agree that it's probably related to the recent PyTorch upgrade, as I don't remember them failing last week.

I'm seeing all of the failures from here in an FSDP PR (#1052), including the CPU one and the "test_parity3d_checkpoint_syncbn" one.

@min-xu-ai (Contributor) commented

Somehow this PR is hitting unrelated test errors. I am replacing it with #1075.

Thanks Ben!

@min-xu-ai min-xu-ai closed this Sep 23, 2022
Labels
CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

SDP sync_buffers() during gradient accumulation
4 participants