
OOM in "def adapt" #21

Open
sunwoo76 opened this issue Oct 19, 2021 · 7 comments

@sunwoo76

Hi!

In the adapt function, which implements the inner loop of MAML, GPU memory usage progressively increases as num_adaptation_steps increases.

Eventually this makes the code crash.

I think the params should not accumulate: the model parameters only need to be copied for each task, so this growth in GPU memory does not seem right.

Is this problem caused by my code, or does it come from the library?

thanks:)

@Hugo101

Hugo101 commented Oct 20, 2021

Hi Sunshower76,

I have the same problem and the same concern here.

GPU memory progressively increases when I raise num_adaptation_steps. I agree with you that changing this parameter should not increase GPU memory usage.

I hope the author has a better solution.

@tristandeleu
Owner

Hi! When you say that the GPU memory progressively increases, is that over the course of training (e.g. it works fine at first, but crashes after a number of iterations), or is that at the very first training iteration?

If it is over the course of training this might be the sign of a memory leak (I don't know where it could come from though), but if it crashes at the very beginning of training because you increased num_adaptation_steps, this is expected: the amount of memory required in MAML to backpropagate through the gradient updates scales linearly with the number of gradient steps of adaptation (here num_adaptation_steps). You should be able to run with 5 steps of gradient descent for adaptation (num_adaptation_steps=5, matching the setting from the paper), but anything beyond that may indeed lead to OOM.

One option if you'd like to increase num_adaptation_steps is to run it with the first-order approximation (first_order=True), which doesn't suffer from the same problem.
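
For reference, below is a minimal, self-contained sketch of running the inner loop with the first-order approximation via torchmeta's gradient_update_parameters. The toy MetaSequential model, random data, and step size are placeholders for illustration only, not the repository's training script.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F
    from torchmeta.modules import MetaSequential, MetaLinear
    from torchmeta.utils.gradient_based import gradient_update_parameters

    # Toy meta-model and random data, purely for illustration.
    model = MetaSequential(MetaLinear(10, 32), nn.ReLU(), MetaLinear(32, 5))
    inputs, targets = torch.randn(8, 10), torch.randint(0, 5, (8,))

    params = None
    num_adaptation_steps = 10  # larger values stay affordable with first_order=True
    for step in range(num_adaptation_steps):
        logits = model(inputs, params=params)
        inner_loss = F.cross_entropy(logits, targets)
        # first_order=True computes the gradients with create_graph=False, so each
        # inner step's graph is freed instead of being kept for the outer backward
        # pass, and memory stays roughly constant across steps.
        params = gradient_update_parameters(model, inner_loss,
                                            params=params,
                                            step_size=0.4,
                                            first_order=True)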

@Hugo101

Hugo101 commented Oct 21, 2021

Hi tristandeleu, thanks for your reply.
On my side, when num_adaptation_steps for the inner loop is 5, it works. When I increase it to 10, the code crashes during the inner loop in the very first iteration (the attached trace shows it crashed at inner step 8).

++++++ At Inner Step 0:
++++++ At Inner Step 1:
++++++ At Inner Step 2:
++++++ At Inner Step 3:
++++++ At Inner Step 4:
++++++ At Inner Step 5:
++++++ At Inner Step 6:
++++++ At Inner Step 7:
++++++ At Inner Step 8:

Traceback (most recent call last):
  
  File "/data/maml.py", line 188, in train_iter
    sub_progress=num_batches)
  File "/data/maml.py", line 225, in get_outer_loss
    coef=self.coef, progress=progress, sub_progress=sub_progress)
  File "/data/maml/metalearners/maml.py", line 349, in adapt
    first_order=(not self.model.training) or first_order)
  File "/home/cxl173430/anaconda3/lib/python3.7/site-packages/torchmeta/utils/gradient_based.py", line 51, in gradient_update_parameters
    create_graph=not first_order)
  File "/home/cxl173430/anaconda3/lib/python3.7/site-packages/torch/autograd/__init__.py", line 228, in grad
    inputs, allow_unused, accumulate_grad=False)
RuntimeError: CUDA out of memory. Tried to allocate 76.00 MiB (GPU 9; 10.76 GiB total capacity; 9.37 GiB already allocated; 73.12 MiB free; 9.53 GiB reserved in total by PyTorch)

Previously, I assumed that the number of gradient updates had nothing to do with memory usage, as it would be in a regular DNN. However, this is meta-learning, which is a bilevel optimization, so that assumption may not hold. I need to think about this more carefully in light of your comments.

Thanks.
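
To make the memory argument concrete, here is a small standalone PyTorch sketch (a hypothetical toy example, unrelated to this repository's code) of why memory grows linearly with the number of inner steps when the graphs of the updates are retained, which is what second-order MAML requires:

    import torch

    # "Meta-parameters" we want to differentiate the whole adaptation with respect to.
    w = torch.randn(1000, 1000, requires_grad=True)
    x = torch.randn(64, 1000)

    params = w
    for step in range(10):
        inner_loss = (x @ params).pow(2).mean()
        # create_graph=True keeps this step's graph alive so the outer backward
        # pass can differentiate through the update; memory therefore grows
        # linearly with the number of inner steps.
        (grad,) = torch.autograd.grad(inner_loss, params, create_graph=True)
        params = params - 0.01 * grad

    # The outer loss backpropagates through all 10 inner updates at once.
    outer_loss = (x @ params).mean()
    outer_loss.backward()
    print(w.grad.norm())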

@sunwoo76
Author

@tristandeleu @Hugo101
Hi, thanks for your replies.

I changed the code as below in the adapt function:

    new_params = gradient_update_parameters(self.network, inner_loss, step_size=1e-4,
                                            params=params, first_order=True)

    if params is None:
        params = new_params
    else:
        # Explicitly free the previous step's parameters before replacing them.
        for key in list(params.keys()):
            del params[key]
        del params
        gc.collect()
        torch.cuda.empty_cache()
        params = new_params

I did this because I found that the ids of the variable params and of the output of gradient_update_parameters are different.
I don't think this is the main issue, though.
In my case, setting first_order=True is enough.

@Hugo101

Hugo101 commented Oct 21, 2021

Hi @sunshower76 , I checked the difference you mentioned between params and new_params (the output of gradient_update_parameters). The difference is that:

  • params contains the batch-normalization parameters
  • new_params does not

As you mentioned, this is not the main issue.

I think this issue also discussed the batch-normalization behavior: #19 (comment)

@brando90

brando90 commented Nov 5, 2021


Note that I have mostly closed the issues you reference. For meta-learning, batch statistics should be used during evaluation, so the model should stay in .train() mode and BatchNorm should not track running statistics. It's subtle and confusing, and I recommend reading those issues and the extended discussions and links they point to in order to understand BN.
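
In code terms, that configuration might look like the following minimal sketch (the MetaConv2d/MetaBatchNorm2d block is illustrative, not this repository's exact model):

    import torch.nn as nn
    from torchmeta.modules import MetaSequential, MetaConv2d, MetaBatchNorm2d

    block = MetaSequential(
        MetaConv2d(3, 64, kernel_size=3, padding=1),
        # track_running_stats=False: always normalize with the current batch's
        # statistics instead of accumulating running estimates.
        MetaBatchNorm2d(64, momentum=1.0, track_running_stats=False),
        nn.ReLU(),
    )

    # Keep the block in train() mode even during evaluation, so BatchNorm uses
    # batch statistics rather than (non-existent) running statistics.
    block.train()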

@Hugo101

Hugo101 commented Nov 6, 2021

Hi @brando90 , thanks for your comments. I share your confusion regarding BN in the current MAML code. Thanks for the reference from Stack Overflow.
