Hi, I have a feeling that the layerwise optimizer, by creating numerous networks, is not freeing past networks and is using more GPU memory than it should. I'm having a heck of a time doing layerwise training.
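(The original report included the network definition and the pretraining call at this point; they were not captured here. Below is a minimal sketch of the kind of setup implied by the log further down, with the layer names and sizes read off that log, assuming a theanets 0.6-style API and an autoencoder-shaped model. The specifics are assumptions, not the actual stft-theanet.py.)

import numpy as np
import theanets

# Sketch only: layer names/sizes come from the log below
# ("in -> hid1 -> hid2 -> hid3 -> lwout", 2730 -> 1025); whether the
# original model was an Autoencoder or a Regressor is not shown.
net = theanets.Autoencoder(
    layers=(1025,
            dict(name='hid1', size=2730),
            dict(name='hid2', size=2730),
            dict(name='hid3', size=2730),
            1025))

# Stand-ins for the STFT training/validation frames.
train = np.random.randn(50000, 1025).astype('float32')
valid = np.random.randn(5000, 1025).astype('float32')

# Greedy layerwise pretraining with the hyperparameters shown in the
# downhill log (learning_rate=0.001, momentum=0.9, RMSProp updates).
net.train(train, valid,
          algo='layerwise',
          learning_rate=1e-3,
          momentum=0.9)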
I get the following error: after training layers hid1 and hid2, once it tries to train hid3 it borks at validation.
I 2015-09-08 12:26:42 downhill.base:402 patience elapsed!
I 2015-09-08 12:26:42 theanets.layers.base:303 layer Feedforward "lwout": (hid3:out)2730 -> 1025, linear, 2799275 parameters
I 2015-09-08 12:26:42 theanets.trainer:250 layerwise: training in -> hid1 -> hid2 -> hid3 -> lwout
I 2015-09-08 12:26:43 downhill.base:378 -- patience = 6
I 2015-09-08 12:26:43 downhill.base:379 -- validate_every = 10
I 2015-09-08 12:26:43 downhill.base:380 -- min_improvement = 0.1
I 2015-09-08 12:26:43 downhill.base:381 -- max_gradient_norm = 0
I 2015-09-08 12:26:43 downhill.base:382 -- max_gradient_elem = 0
I 2015-09-08 12:26:43 downhill.base:383 -- learning_rate = 0.001
I 2015-09-08 12:26:43 downhill.base:384 -- momentum = 0.9
I 2015-09-08 12:26:43 downhill.base:385 -- nesterov = False
I 2015-09-08 12:26:43 downhill.adaptive:220 -- rms_halflife = 14
I 2015-09-08 12:26:43 downhill.adaptive:221 -- rms_regularizer = 1e-08
I 2015-09-08 12:26:43 downhill.base:112 compiling evaluation function
I 2015-09-08 12:26:43 downhill.base:118 compiling RMSProp function
Error allocating 11193000 bytes of device memory (out of memory). Driver report 966656 bytes free and 4294246400 bytes total
Traceback (most recent call last):
File "stft-theanet.py", line 62, in <module>
momentum=0.9)
File "build/bdist.linux-x86_64/egg/theanets/graph.py", line 400, in train
File "build/bdist.linux-x86_64/egg/theanets/graph.py", line 376, in itertrain
File "build/bdist.linux-x86_64/egg/theanets/trainer.py", line 253, in itertrain
File "build/bdist.linux-x86_64/egg/theanets/trainer.py", line 66, in itertrain
File "/usr/local/lib/python2.7/dist-packages/downhill/base.py", line 388, in iterate
self._compile()
File "/usr/local/lib/python2.7/dist-packages/downhill/base.py", line 119, in _compile
updates = list(self._updates) + list(self._get_updates())
File "/usr/local/lib/python2.7/dist-packages/downhill/base.py", line 134, in _get_updates
for var, expr in self._get_updates_for(param, grad):
File "/usr/local/lib/python2.7/dist-packages/downhill/adaptive.py", line 226, in _get_upda
tes_for
g2_tm1 = shared_like(param, 'g2_ewma')
File "/usr/local/lib/python2.7/dist-packages/downhill/util.py", line 45, in shared_like
File "/usr/local/lib/python2.7/dist-packages/theano/compile/sharedvalue.py", line 208, in
shared
allow_downcast=allow_downcast, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/theano/sandbox/cuda/var.py", line 203, in flo
at32_shared_constructor
deviceval = type_support_filter(value, type.broadcastable, False, None)
MemoryError: ('Error allocating 11193000 bytes of device memory (out of memory).', "you migh
t consider using 'theano.shared(..., borrow=True)'")
Yet if I just do regular (non-layerwise) training, it works fine. It does use a lot of GPU memory; it's a big network and I have a lot of training examples.
My theory is that shared variables and the like are not being freed appropriately. I was looking at the code: new layers are being created for each stage, but I cannot tell how much sharing or copying is being done.
Yes, I wouldn't be surprised. theanets doesn't try to do any memory management at all, so it's up to Python/Theano to clean up things that have disappeared from the active set. There's probably a bunch that could be done within theanets to help with this, though.
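(For context on why the usage grows: the log shows that each layerwise stage builds a fresh temporary "lwout" decoding layer and compiles a fresh RMSProp function, which allocates new per-parameter accumulators on the GPU; the failing 11,193,000-byte allocation is exactly a 2730 × 1025 float32 buffer, i.e. it appears to be the g2_ewma accumulator for the new lwout weight matrix. One possible workaround, sketched below as an untested assumption rather than a recommended theanets recipe, is to do the greedy stages by hand in shallow throwaway autoencoders, copy the learned parameters into the full model, and explicitly drop each throwaway model before building the next one. Autoencoder.encode, Network.find, and the 'w'/'b' parameter names are assumed to behave as in the theanets 0.6 docs.)

import gc
import numpy as np
import theanets

# Hypothetical manual alternative to algo='layerwise' (untested sketch):
# pretrain one hidden layer at a time, copy its parameters into the full
# model, and release the throwaway graph (and its RMSProp accumulators)
# before the next stage.
full = theanets.Autoencoder(
    layers=(1025,
            dict(name='hid1', size=2730),
            dict(name='hid2', size=2730),
            dict(name='hid3', size=2730),
            1025))

stage_train, stage_valid = train, valid  # the arrays from the sketch above
for name in ('hid1', 'hid2', 'hid3'):
    n_in = stage_train.shape[1]
    shallow = theanets.Autoencoder(
        layers=(n_in, dict(name=name, size=2730), n_in))
    shallow.train(stage_train, stage_valid,
                  algo='rmsprop', learning_rate=1e-3, momentum=0.9)

    # Copy the pretrained weights/bias into the corresponding layer of the
    # full network (find() is assumed to return the shared parameter).
    for p in ('w', 'b'):
        full.find(name, p).set_value(shallow.find(name, p).get_value())

    # Feed the next stage the hidden representation of this one
    # (encode() is assumed to return the hidden-layer activations).
    stage_train = shallow.encode(stage_train)
    stage_valid = shallow.encode(stage_valid)

    # Drop the throwaway model so Theano/CUDA memory can actually be freed.
    del shallow
    gc.collect()

# Finish with ordinary fine-tuning of the whole stack.
full.train(train, valid, algo='rmsprop', learning_rate=1e-3, momentum=0.9)

Splitting the stages this way would also make it possible to run each stage in a separate process, if the leak turns out to live inside Theano itself rather than in theanets.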