Fix backward pass on GPU in PyTorch interface #1426
Conversation
Hello. You may have forgotten to update the changelog!
if not jac_res.is_cuda:
    jac_res = jac_res.cuda()

vjp = dyv @ jac_res
🙌 thanks for fixing this @glassnotes!
Quick question, will this need to be fixed up on line 141 as well?
Oh, that's a really good question. When I ran the example, this was the section that actually threw the error. I've tested on both the transfer learning demo and the small example in the docs #1225, and both work with the fix. Do you know of a code example that would lead to the earlier part being called?
The other backward method is called for Hessian computations. That is, if you have
hess = torch.autograd.functional.hessian(cost_fn, params)
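For context, a runnable sketch of such a Hessian computation; the cost_fn below is an illustrative stand-in for a QNode-based cost, not code from this PR:

```python
import torch

# Stand-in cost function; in this PR's setting, cost_fn would call a
# torch-interface QNode.
def cost_fn(params):
    return torch.sum(torch.sin(params) ** 2)

params = torch.tensor([0.1, 0.2], requires_grad=True)

# Second derivatives route through the other backward code path
# mentioned above.
hess = torch.autograd.functional.hessian(cost_fn, params)
print(hess)  # 2x2 Hessian matrix
```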
Before I approve, I'd like to know whether we get the same errors when calculating the Hessian, as Josh mentioned.
I know this is a fairly small change, but I assume we still need a changelog for it?
Codecov Report
@@           Coverage Diff            @@
##           master    #1426   +/-   ##
========================================
  Coverage   98.23%   98.23%
========================================
  Files         160      160
  Lines       11966    11966
========================================
  Hits        11755    11755
  Misses        211      211
Continue to review full report at Codecov.
Changelog updated! Computing the Hessian went smoothly - I didn't have to update anything. It looks like there is some logic already in that section to keep things on the GPU (I'm genuinely confused why this doesn't happen for the backward pass though; I wonder if the actual issue is within the ctx.jacobian.apply() call itself).
Thanks @glassnotes! Looks good on my end. Approving with the caveat that I am unable to test it locally, as I do not have a GPU, so I recommend getting approval from someone who does.
# When using CUDA, dyv seems to remain on the GPU, while the result
# of jac_res is returned on CPU, even though the saved_tensors arguments are
# themselves on the GPU. Check whether this has happened, and move things
# back to the GPU if required.
if dyv.is_cuda or jac_res.is_cuda:
    if not dyv.is_cuda:
        dyv = torch.as_tensor(dyv, device=jac_res.get_device())
    if not jac_res.is_cuda:
        jac_res = torch.as_tensor(jac_res, device=dyv.get_device())
It occurs to me that we could factor this out as a utility function, so that this logic can be re-used where needed in the future:
def match_gpu_device(tensors):
    """Ensure that tensors are placed on the same device.

    This function checks whether any tensor within the input list of
    tensors is on a GPU. If this is the case, all tensors are moved
    to the location of the *first CUDA tensor in the list*.

    If no tensors are on a GPU, the tensor locations remain unchanged.

    Args:
        tensors (sequence[torch.Tensor]): list of input tensors

    Returns:
        list[torch.Tensor]: the tensors, all on a common device
    """
    device = None

    # Find the device index of the first CUDA tensor, if any.
    for t in tensors:
        if t.is_cuda:
            device = t.get_device()
            break

    if device is None:
        # No CUDA tensors; leave everything where it is.
        return list(tensors)

    # torch.as_tensor does not move a tensor in place, so the moved
    # tensors must be collected and returned to the caller.
    return [
        t if (t.is_cuda and t.get_device() == device) else torch.as_tensor(t, device=device)
        for t in tensors
    ]
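A hypothetical usage inside backward, assuming the corrected version above that returns the (possibly moved) tensors rather than modifying them in place; dyv and jac_res are the names from the diff earlier in this thread:

```python
# Move both operands to a common GPU device (if any) before the matmul.
dyv, jac_res = match_gpu_device([dyv, jac_res])
vjp = dyv @ jac_res
```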
Just a quick comment: I have successfully run the tutorial on my device (GTX 1060) running Arch Linux, CUDA 11.3, driver v465.31, and PyTorch 1.9.0+CUDA11.1 using this branch. PyTorch 1.8.1 fails with cuBLAS errors. Output is attached below, along with the prediction windows.
I can confirm the GPU was used during this demo, as the memory for the above process was allocated and freed. Happy to run more tests if needed.
Thanks @mlxd for checking that it executes :)
Thanks for figuring this out :)
Context: A user reported an error in the quantum transfer learning demo when running on the GPU: a line in the backward method of torch.py raised an error. The reason was discovered to be that the result of ctx.jacobian.apply() was on the CPU, even when the tensors it acted on were on the GPU.

Description of the Change: Updated the backward method in the torch interface to split the vector-Jacobian product into two steps (computing the Jacobian, then the matrix product) and check the device of the intermediate results. If at least one of them is on the GPU, the other is moved to the GPU if it is not already there.

Note: I needed to use torch 1.9 to test this, due to the issue described here.

Benefits: Running on the GPU with the torch interface should now work.

Possible Drawbacks: While this fix enables the demo to be run on the GPU, we cannot fully test it.

Related GitHub Issues: I believe this is the same issue reported in #1290, so merging this should enable us to close that.
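For reference, a minimal sketch of the kind of workload that exercises this code path. This is not from the PR itself; the circuit and cost are illustrative, and it assumes the standard PennyLane torch-interface API plus an available CUDA device:

```python
import pennylane as qml
import torch

dev = qml.device("default.qubit", wires=1)

@qml.qnode(dev, interface="torch")
def circuit(params):
    qml.RX(params[0], wires=0)
    qml.RY(params[1], wires=0)
    return qml.expval(qml.PauliZ(0))

# Parameters live on the GPU; previously, backward() failed here because
# the Jacobian from ctx.jacobian.apply() came back on the CPU.
params = torch.tensor([0.1, 0.2], requires_grad=True, device="cuda")

loss = (1 - circuit(params)) ** 2
loss.backward()  # exercises the fixed backward pass in torch.py
print(params.grad)
```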