
OOM in "def adapt" #21

Open
sunwoo76 opened this issue Oct 19, 2021 · 7 comments

@sunwoo76

Hi!

In the adapt function, which implements the inner loop of MAML, GPU memory usage progressively increases as num_adaptation_steps increases.

Eventually this makes the code crash.

I think the params should not accumulate: the model parameters only need to be copied for each task, so this growth in GPU memory does not seem right.

Is this problem caused by my code, or does it come from the library?

thanks:)

@Hugo101

Hugo101 commented Oct 20, 2021

Hi Sunshower76,

I have the same problem and the same concern here.

GPU memory progressively increases when I raise num_adaptation_steps. I agree with you that changing this parameter should not increase GPU memory usage.

I hope the author has a better solution.

@tristandeleu
Owner

Hi! When you say that the GPU memory progressively increases, is that over the course of training (e.g. it works fine at first, but crashes after a number of iterations), or is that at the very first training iteration?

If it is over the course of training this might be the sign of a memory leak (I don't know where it could come from though), but if it crashes at the very beginning of training because you increased num_adaptation_steps, this is expected: the amount of memory required in MAML to backpropagate through the gradient updates scales linearly with the number of gradient steps of adaptation (here num_adaptation_steps). You should be able to run with 5 steps of gradient descent for adaptation (num_adaptation_steps=5, matching the setting from the paper), but anything beyond that may indeed lead to OOM.

One option if you'd like to increase num_adaptation_steps is to run it with the first-order approximation (first_order=True), which doesn't suffer from the same problem.
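
For reference, below is a minimal, self-contained sketch of running the inner loop with the first-order approximation via torchmeta's gradient_update_parameters. The toy MetaSequential model, random data, and step size are placeholders for illustration only, not the repository's training script.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F
    from torchmeta.modules import MetaSequential, MetaLinear
    from torchmeta.utils.gradient_based import gradient_update_parameters

    # Toy meta-model and random data, purely for illustration.
    model = MetaSequential(MetaLinear(10, 32), nn.ReLU(), MetaLinear(32, 5))
    inputs, targets = torch.randn(8, 10), torch.randint(0, 5, (8,))

    params = None
    num_adaptation_steps = 10  # larger values stay affordable with first_order=True
    for step in range(num_adaptation_steps):
        logits = model(inputs, params=params)
        inner_loss = F.cross_entropy(logits, targets)
        # first_order=True computes the gradients with create_graph=False, so each
        # inner step's graph is freed instead of being kept for the outer backward
        # pass, and memory stays roughly constant across steps.
        params = gradient_update_parameters(model, inner_loss,
                                            params=params,
                                            step_size=0.4,
                                            first_order=True)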

@Hugo101

Hugo101 commented Oct 21, 2021

Hi tristandeleu, thanks for your reply.
On my side, when num_adaptation_steps for the inner loop is 5, it works. When I increase it to 10, the code crashes during the inner loop in the very first iteration (the attached trace shows it crashed at inner step 8).

++++++ At Inner Step 0:
++++++ At Inner Step 1:
++++++ At Inner Step 2:
++++++ At Inner Step 3:
++++++ At Inner Step 4:
++++++ At Inner Step 5:
++++++ At Inner Step 6:
++++++ At Inner Step 7:
++++++ At Inner Step 8:

Traceback (most recent call last):
  
  File "/data/maml.py", line 188, in train_iter
    sub_progress=num_batches)
  File "/data/maml.py", line 225, in get_outer_loss
    coef=self.coef, progress=progress, sub_progress=sub_progress)
  File "/data/maml/metalearners/maml.py", line 349, in adapt
    first_order=(not self.model.training) or first_order)
  File "/home/cxl173430/anaconda3/lib/python3.7/site-packages/torchmeta/utils/gradient_based.py", line 51, in gradient_update_parameters
    create_graph=not first_order)
  File "/home/cxl173430/anaconda3/lib/python3.7/site-packages/torch/autograd/__init__.py", line 228, in grad
    inputs, allow_unused, accumulate_grad=False)
RuntimeError: CUDA out of memory. Tried to allocate 76.00 MiB (GPU 9; 10.76 GiB total capacity; 9.37 GiB already allocated; 73.12 MiB free; 9.53 GiB reserved in total by PyTorch)

Previously, I assumed that the number of gradient updates had nothing to do with memory usage, as it would be in a regular DNN. However, this is meta-learning, which is a bilevel optimization, so that assumption may not hold. I need to think about this more carefully in light of your comments.

Thanks.
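
To make the memory argument concrete, here is a small standalone PyTorch sketch (a hypothetical toy example, unrelated to this repository's code) of why memory grows linearly with the number of inner steps when the graphs of the updates are retained, which is what second-order MAML requires:

    import torch

    # "Meta-parameters" we want to differentiate the whole adaptation with respect to.
    w = torch.randn(1000, 1000, requires_grad=True)
    x = torch.randn(64, 1000)

    params = w
    for step in range(10):
        inner_loss = (x @ params).pow(2).mean()
        # create_graph=True keeps this step's graph alive so the outer backward
        # pass can differentiate through the update; memory therefore grows
        # linearly with the number of inner steps.
        (grad,) = torch.autograd.grad(inner_loss, params, create_graph=True)
        params = params - 0.01 * grad

    # The outer loss backpropagates through all 10 inner updates at once.
    outer_loss = (x @ params).mean()
    outer_loss.backward()
    print(w.grad.norm())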

@sunwoo76
Author

@tristandeleu @Hugo101
Hi, thanks for your replies.

I changed the code as below in the adapt function:

    new_params = gradient_update_parameters(self.network, inner_loss, step_size=1e-4,
                                            params=params, first_order=True)

    if params is None:
        params = new_params
    else:
        # Explicitly free the previous step's parameters before replacing them.
        for key in list(params.keys()):
            del params[key]
        del params
        gc.collect()
        torch.cuda.empty_cache()
        params = new_params

I did this because I found that the ids of the variable params and of the output of gradient_update_parameters are different.
I don't think this is the main issue, though.
In my case, setting first_order=True is enough.

@Hugo101

Hugo101 commented Oct 21, 2021

Hi @sunshower76 , I checked the difference you mentioned between params and new_params (the output of gradient_update_parameters). The difference is that:

  • params contains the batch-normalization parameters
  • new_params does not

As you mentioned, this is not the main issue.

I think this issue also discussed the batch-normalization behavior: #19 (comment)

@brando90

brando90 commented Nov 5, 2021


Note that I have mostly closed the issues you reference. For meta-learning, batch statistics should be used during evaluation, so the model should stay in .train() mode and BatchNorm should not track running statistics. It's subtle and confusing, and I recommend reading those issues and the extended discussions and links they point to in order to understand BN.
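
In code terms, that configuration might look like the following minimal sketch (the MetaConv2d/MetaBatchNorm2d block is illustrative, not this repository's exact model):

    import torch.nn as nn
    from torchmeta.modules import MetaSequential, MetaConv2d, MetaBatchNorm2d

    block = MetaSequential(
        MetaConv2d(3, 64, kernel_size=3, padding=1),
        # track_running_stats=False: always normalize with the current batch's
        # statistics instead of accumulating running estimates.
        MetaBatchNorm2d(64, momentum=1.0, track_running_stats=False),
        nn.ReLU(),
    )

    # Keep the block in train() mode even during evaluation, so BatchNorm uses
    # batch statistics rather than (non-existent) running statistics.
    block.train()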

@Hugo101

Hugo101 commented Nov 6, 2021

Hi @brando90 , thanks for your comments. I share your confusion regarding BN in the current MAML code. Thanks for the reference from Stack Overflow.
