
NaN after training for a while #52

Closed
jameshball opened this issue Feb 23, 2023 · 10 comments

Comments

@jameshball

Hi!

I'm having an issue training with the basic model provided in the README. After training on the LibriSpeech dataset for about 20 epochs, I start getting NaN losses returned from the model, and when sampling and saving to a file I randomly get silent audio.

I had a go at debugging but couldn't really pin down the issue; the earliest NaN I could find in the forward pass was at the input to the ResNet block. Not sure if this is helpful, but I've attached my debug output here: output.log. The prints were just me checking in various forward functions whether the inputs and outputs were NaN, so I didn't isolate any specific lines.
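
For anyone else digging into this, here's a minimal sketch of how the first module producing non-finite values could be located with forward hooks instead of scattered prints (the `model` variable and hook wiring are illustrative, not taken from the actual train.py):

```python
import torch

def register_nan_hooks(model: torch.nn.Module):
    """Attach a forward hook to every submodule that flags the first
    module whose output contains NaN or Inf values."""
    def make_hook(name):
        def hook(module, inputs, output):
            outs = output if isinstance(output, (tuple, list)) else (output,)
            for t in outs:
                if torch.is_tensor(t) and not torch.isfinite(t).all():
                    raise RuntimeError(f"Non-finite values produced by module: {name}")
        return hook

    for name, module in model.named_modules():
        module.register_forward_hook(make_hook(name))

# torch.autograd.set_detect_anomaly(True) can similarly pinpoint the op that
# produces NaN gradients in the backward pass, at the cost of slower training.
```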

My training script is pretty short and, as far as I can tell, I'm not doing anything particularly unusual that would cause this. You can have a look here: https://github.com/jameshball/audio-diffusion/blob/master/train.py

Also, here's a snippet of my training output where the loss turns to NaN: nan.txt

I should also be able to follow up with a Google Drive link to download the checkpoint so you can test it more easily. You might need to modify train.py to remove some wandb calls and just load the checkpoint from disk, but that should be straightforward. Alternatively, I also get NaNs when just sampling from the model here: https://github.com/jameshball/audio-diffusion/blob/master/sample.py
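
One quick sanity check before sampling would be to scan the saved checkpoint for non-finite weights, which would tell you whether the NaNs are baked into the parameters or only appear during the forward pass. A minimal sketch, assuming a plain state dict (the file name and checkpoint layout below are placeholders, adjust for however train.py saves it):

```python
import torch

# Hypothetical path and layout; adjust to match how train.py actually saves checkpoints.
checkpoint = torch.load("checkpoint.pt", map_location="cpu")
state_dict = checkpoint.get("model", checkpoint) if isinstance(checkpoint, dict) else checkpoint

bad_keys = [
    name for name, tensor in state_dict.items()
    if torch.is_tensor(tensor) and not torch.isfinite(tensor).all()
]
print("Parameters containing NaN/Inf:", bad_keys or "none")
```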

Please let me know if there's anything I can help with as this would be great to fix!

James

@jameshball (Author)

Closing as I haven't experienced the issue since and have also made this repo private. Thanks for the library!!

Tinglok commented Mar 16, 2023

Hi @jameshball , thanks for sharing the code! I wonder if you changed anything when you fixed this issue.

@jameshball (Author)

I didn't change anything :/ I just reran training and haven't experienced it since, so this could probably be reopened if you've hit it too.

Tinglok commented Mar 16, 2023

Oh cool, thanks for clarifying!

fred-dev commented Sep 19, 2023

Did anyone else get this NaN loss? Do either of you know how many iterations it took before you hit it? It happened for me at 840 iterations. I tried to go back and restart training from the last checkpoint before that point but ended up with NaN again. Was anyone who had this issue able to restart from an earlier checkpoint and move past it?

**Update:** I went back to earlier checkpoints and hit NaN at exactly the same place on several occasions. I am winding back further, but this may mean there is an error in the model early on that does not express itself until a number of iterations later.
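
For what it's worth, a common mitigation when NaN appears at a reproducible step is to clip gradients and skip updates on non-finite losses, so a single bad batch can't poison the weights. A minimal sketch, assuming a training loop like the one in the README where the wrapper returns the loss directly (`model`, `optimizer`, and `batch` are placeholders):

```python
import torch

loss = model(batch)  # diffusion wrapper returns the training loss

if torch.isfinite(loss):
    loss.backward()
    # Clipping bounds the update size; 1.0 is an arbitrary starting point.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
else:
    # Skip the step entirely rather than backpropagating NaN/Inf.
    print("Skipping non-finite loss at this step")

optimizer.zero_grad(set_to_none=True)
```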

@0417keito

@fred-dev What was the quality of the sample output from the model that produced the NaN loss? Did it generate reasonable speech data?

fred-dev commented Dec 5, 2023

@0417keito I was not using voice, but a general waveform dataset. The generation was OK, though not nearly as good as the published results. I never worked out how to actually avoid the NaN loss. I wonder if anyone has successfully used this outside of the publishing group?

@fmiotello

Hey @fred-dev! I'm also having this NaN loss problem. Did you manage to solve it?

@fred-dev

> Hey @fred-dev! I'm also having this NaN loss problem. Did you manage to solve it?

Nope. I got a bit further using the version of this repo that matches the publication. In the end I switched to working with stable-audio-tools.
