This repository has been archived by the owner on Jan 3, 2023. It is now read-only.

training process is killed because OOM #48

Open
fancyerii opened this issue Aug 29, 2017 · 4 comments
@fancyerii

fancyerii commented Aug 29, 2017

I have been training neon on the LibriSpeech data, but the process always gets killed because of OOM. My machine has 24 GB of memory and a GeForce GTX 1070 card with 8 GB of memory.

I found this message via dmesg:
[3017506.733819] Out of memory: Kill process 25635 (python) score 974 or sacrifice child
[3017506.736861] Killed process 25635 (python) total-vm:55518724kB, anon-rss:23902876kB, file-rss:154436kB

Is neon leaking memory, or does it require more memory to train?

The command I run is:
python train.py --manifest train:/bigdata/lili/deepspeech/librispeech/train-clean-100/train-manifest.csv --manifest val:/bigdata/lili/deepspeech/librispeech/train-clean-100/val-manifest.csv -e 20 -z 16 -s models -b gpu

@gardenia22

Try reducing the batch size.
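For example, if -z is the batch size flag (which is what the command above suggests), the same run with a smaller batch would be:

python train.py --manifest train:/bigdata/lili/deepspeech/librispeech/train-clean-100/train-manifest.csv --manifest val:/bigdata/lili/deepspeech/librispeech/train-clean-100/val-manifest.csv -e 20 -z 8 -s models -b gpu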

@fancyerii
Author

fancyerii commented Sep 4, 2017

I changed batch_size to 8, but the process is still killed.
[3256824.391743] Killed process 9666 (python) total-vm:53893188kB, anon-rss:23892380kB, file-rss:152808kB

It uses too much memory.

@Neuroschemata
Contributor

I suspect the source of the problem is unrelated to the model size. With the default parameters using the command you posted above, I get the following:

batch size    GPU memory footprint
32            6949 MB
16            3915 MB
8             2415 MB

So your 8GB GPU has the capacity to handle a batch size of up to 32.
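(If you want to confirm the GPU-side numbers on your own machine, watching nvidia-smi while training runs, e.g. watch -n 1 nvidia-smi, should show a similar footprint.)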

@fancyerii
Author

So what's wrong? From /var/log it seems this Python process used 23892380 kB (23 GB) of CPU memory (not GPU memory).

[3256824.391743] Killed process 9666 (python) total-vm:53893188kB, anon-rss:23892380kB, file-rss:152808kB
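
For what it's worth, here is a minimal sketch for watching the host-side memory of the training process (it assumes the psutil package is installed; watch_rss is just a hypothetical helper name):

# Minimal sketch (assumes the psutil package is installed): periodically print
# the resident set size (RSS) of a running process, e.g. the train.py process,
# to confirm that host memory (not GPU memory) is what keeps growing.
import sys
import time

import psutil

def watch_rss(pid, interval=10):
    # Print the RSS of the given pid every `interval` seconds until it exits.
    proc = psutil.Process(pid)
    while proc.is_running():
        rss_mb = proc.memory_info().rss / (1024.0 * 1024.0)
        print("pid %d rss: %.1f MB" % (pid, rss_mb))
        time.sleep(interval)

if __name__ == "__main__":
    watch_rss(int(sys.argv[1]))

Running it against the PID of train.py and seeing the printed RSS climb steadily toward 24 GB would point to host-side memory growth rather than GPU memory pressure.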
