
NaN after training for a while #52

Closed
jameshball opened this issue Feb 23, 2023 · 10 comments

Comments

@jameshball

Hi!

I'm having an issue training with the basic model provided in the README. After training on the LibriSpeech dataset for about 20 epochs, I start getting NaN losses returned from the model, and when sampling and saving to a file I randomly get silent audio.

I had a go at debugging but couldn't really pin down the issue; the earliest NaN I could find in the forward pass was at the input to the ResNet block. Not sure if this is helpful, but I've attached my debug output here: output.log. The prints were just me checking in various forward functions whether the inputs and outputs were NaN, so I didn't isolate any specific lines.
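
For anyone else digging into this, here's a minimal sketch of how the first module producing non-finite values could be located with forward hooks instead of scattered prints (the `model` variable and hook wiring are illustrative, not taken from the actual train.py):

```python
import torch

def register_nan_hooks(model: torch.nn.Module):
    """Attach a forward hook to every submodule that flags the first
    module whose output contains NaN or Inf values."""
    def make_hook(name):
        def hook(module, inputs, output):
            outs = output if isinstance(output, (tuple, list)) else (output,)
            for t in outs:
                if torch.is_tensor(t) and not torch.isfinite(t).all():
                    raise RuntimeError(f"Non-finite values produced by module: {name}")
        return hook

    for name, module in model.named_modules():
        module.register_forward_hook(make_hook(name))

# torch.autograd.set_detect_anomaly(True) can similarly pinpoint the op that
# produces NaN gradients in the backward pass, at the cost of slower training.
```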

My training script is pretty short and, as far as I can tell, I'm not doing anything particularly unusual that would cause this. You can have a look here: https://github.com/jameshball/audio-diffusion/blob/master/train.py

Also, here's a snippet of my training output where the loss turns to NaN: nan.txt

I should also be able to follow up with a Google Drive link to download the checkpoint so you can test it more easily. You might need to modify train.py to remove some wandb calls and just load the checkpoint from disk, but that should be straightforward. Alternatively, I also get NaNs when just sampling from the model here: https://github.com/jameshball/audio-diffusion/blob/master/sample.py
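
One quick sanity check before sampling would be to scan the saved checkpoint for non-finite weights, which would tell you whether the NaNs are baked into the parameters or only appear during the forward pass. A minimal sketch, assuming a plain state dict (the file name and checkpoint layout below are placeholders, adjust for however train.py saves it):

```python
import torch

# Hypothetical path and layout; adjust to match how train.py actually saves checkpoints.
checkpoint = torch.load("checkpoint.pt", map_location="cpu")
state_dict = checkpoint.get("model", checkpoint) if isinstance(checkpoint, dict) else checkpoint

bad_keys = [
    name for name, tensor in state_dict.items()
    if torch.is_tensor(tensor) and not torch.isfinite(tensor).all()
]
print("Parameters containing NaN/Inf:", bad_keys or "none")
```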

Please let me know if there's anything I can help with as this would be great to fix!

James

@jameshball (Author)

Closing as I haven't experienced the issue since and have also made this repo private. Thanks for the library!!

Tinglok commented Mar 16, 2023

Hi @jameshball , thanks for sharing the code! I wonder if you changed anything when you fixed this issue.

@jameshball (Author)

I didn't change anything :/ I just reran training and haven't experienced it since, so this could probably be reopened if you've hit it too.

Tinglok commented Mar 16, 2023

Oh cool, thanks for clarifying!

fred-dev commented Sep 19, 2023

Did anyone else get this NaN loss? Do either of you know how many iterations it took before you hit it? It happened for me at 840 iterations. I tried to go back and restart training from the last checkpoint before that point but ended up with NaN again. Was anyone who had this issue able to restart from an earlier checkpoint and move past it?

**Update:** I went back to earlier checkpoints and hit NaN at exactly the same place on several occasions. I am winding back further, but this may mean there is an error in the model early on that does not express itself until a number of iterations later.
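
For what it's worth, a common mitigation when NaN appears at a reproducible step is to clip gradients and skip updates on non-finite losses, so a single bad batch can't poison the weights. A minimal sketch, assuming a training loop like the one in the README where the wrapper returns the loss directly (`model`, `optimizer`, and `batch` are placeholders):

```python
import torch

loss = model(batch)  # diffusion wrapper returns the training loss

if torch.isfinite(loss):
    loss.backward()
    # Clipping bounds the update size; 1.0 is an arbitrary starting point.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
else:
    # Skip the step entirely rather than backpropagating NaN/Inf.
    print("Skipping non-finite loss at this step")

optimizer.zero_grad(set_to_none=True)
```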

@0417keito

@fred-dev What was the quality of the sample output from the model that produced the NaN loss? Did it generate reasonable speech data?

fred-dev commented Dec 5, 2023

@0417keito I was not using voice, but a general waveform dataset. The generation was OK, though not nearly as good as the published results. I never worked out how to actually avoid the NaN loss. I wonder if anyone has successfully used this outside of the publishing group?

@fmiotello

Hey @fred-dev! I'm also having this NaN loss problem. Did you manage to solve it?

@fred-dev

> Hey @fred-dev! I'm also having this NaN loss problem. Did you manage to solve it?

Nope. I got a bit further using the version of this repo that matches the publication. In the end I switched to working with stable-audio-tools.
