NaN after training for a while #52
Comments
Closing as I haven't experienced the issue since and have also made this repo private. Thanks for the library!!
Hi @jameshball, thanks for sharing the code! I wonder if you changed anything when you fixed this issue.
I didn't change anything :/ I just reran and haven't experienced it since, so this could probably be reopened if you've experienced it too.
Oh cool, thanks for clarifying!
Did anyone else get this NaN loss? Does anyone know how many iterations it took to appear? It happened for me at 840 iterations. I tried restarting training from the last checkpoint before this but ended up with NaN again. Was anyone who had this issue able to restart from an earlier checkpoint and move past it? **Update:** I went back to earlier checkpoints and hit NaN at exactly the same place on several occasions. I am winding back further, but it may be that there is an error in the model early on that does not express itself until a number of iterations later.
@fred-dev What was the quality of the samples from the model that produced the NaN loss? Did it generate reasonable speech data?
@0417keito I was not using voice, but a waveform dataset. The generation was OK, though not nearly as good as the published results. I never worked out how to actually avoid the NaN loss. I wonder if anyone other than the publishers has successfully used this?
Hey @fred-dev! I'm also having this NaN loss problem. Did you manage to solve it?
Nope. I got a bit further using the version of this repo that matches the publication. In the end I switched to work with stable audio tools.
Hi!
I'm having an issue training with the basic model provided in the README. After training on the LibriSpeech dataset for about 20 epochs, I start getting NaN losses returned from the model, and when sampling and saving to a file I just get silent audio randomly.
I had a go at debugging but couldn't really find the issue, other than that the first NaN I could find in the forward pass was the input to the ResNet block. Not sure if this is helpful, but I've added my debug output here: output.log. The prints were just me testing in various forward functions whether the inputs and outputs were NaN, but I didn't isolate any lines.
My training script is pretty short and I don't think I'm doing anything particularly weird that would cause this! You can have a look here: https://github.com/jameshball/audio-diffusion/blob/master/train.py
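For anyone wanting to repeat the per-module NaN checks described above without editing each forward function by hand, here is a minimal sketch using PyTorch forward hooks. The helper names (`has_nonfinite`, `add_nan_hooks`) and the toy `nn.Sequential` model are made up for illustration and are not part of this library; the NaN weight is injected deliberately so the check has something to find.

```python
import torch
from torch import nn


def has_nonfinite(t):
    """True if t is a tensor containing NaN or Inf."""
    return isinstance(t, torch.Tensor) and not bool(torch.isfinite(t).all())


def add_nan_hooks(model):
    """Attach forward hooks to every submodule (hypothetical helper).

    Returns (report, handles). report collects (module name, "input" or
    "output") pairs in the order the submodules finish their forward pass.
    """
    report, handles = [], []

    def make_hook(name):
        def hook(module, inputs, output):
            if any(has_nonfinite(x) for x in inputs):
                report.append((name, "input"))
            outputs = output if isinstance(output, (tuple, list)) else (output,)
            if any(has_nonfinite(x) for x in outputs):
                report.append((name, "output"))
        return hook

    for name, module in model.named_modules():
        if name:  # skip the root module, whose name is ""
            handles.append(module.register_forward_hook(make_hook(name)))
    return report, handles


# Toy stand-in model; the NaN weight below is injected on purpose.
model = nn.Sequential(nn.Linear(4, 4), nn.ReLU(), nn.Linear(4, 2))
with torch.no_grad():
    model[2].weight.fill_(float("nan"))

report, handles = add_nan_hooks(model)
model(torch.randn(3, 4))
for h in handles:
    h.remove()  # detach the hooks once debugging is done

print(report)
```

Because a hook fires when its module finishes, a submodule that reports a bad "output" without a bad "input" is the likely origin; a module that only reports a bad "input" received the NaN from upstream.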
Also, here's a snippet of my training output where the loss turns to NaN: nan.txt
I should also be able to follow up with a Google Drive link to download the checkpoint so you can test it more easily. You might need to modify train.py to remove some wandb calls and just load the checkpoint from disk, but that should be straightforward. Alternatively, I also get NaNs when just sampling from the model here: https://github.com/jameshball/audio-diffusion/blob/master/sample.py
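When loading a checkpoint from disk like this, it may help to first check whether the saved weights themselves already contain NaNs, since that would explain NaNs both in training and in sampling. Below is a rough sketch; `nonfinite_entries` is a hypothetical helper, and the small `nn.Linear` layer with an injected NaN only stands in for a real checkpoint. For an actual file you would pass in the state dict obtained from `torch.load(path, map_location="cpu")`, whose exact layout depends on how the checkpoint was saved.

```python
import torch
from torch import nn


def nonfinite_entries(state_dict):
    """Return the names of floating-point tensors that contain NaN or Inf."""
    bad = []
    for name, value in state_dict.items():
        if torch.is_tensor(value) and value.is_floating_point():
            if not bool(torch.isfinite(value).all()):
                bad.append(name)
    return bad


# Stand-in for a loaded checkpoint: a small layer with one NaN injected.
layer = nn.Linear(3, 2)
with torch.no_grad():
    layer.bias[0] = float("nan")

bad = nonfinite_entries(layer.state_dict())
print(bad)
```

An empty list would suggest the weights are still finite at that checkpoint and the NaN arises later, for example in the optimizer state or in a particular batch.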
Please let me know if there's anything I can help with as this would be great to fix!
James