
ValueError: cannot convert float NaN to integer in loss_string += [f"Avg Loss {species}: {round(np.average(epoch_ave_losses[species]))}"] #68

Open
micheladallangelo opened this issue Aug 15, 2024 · 2 comments


@micheladallangelo

Hi Yanay,

I'm running the tutorial /Vignettes/frog_zebrafish_embryogenesis/Train SATURN.ipynb after successfully running the dataloader notebook.

However, when I run the command, I get this error:

Pretraining...
0%| | 0/200 [01:01<?, ?it/s]
Traceback (most recent call last):
File "/Users/mdallang/unil/workspace/SATURN/Vignettes/frog_zebrafish_embryogenesis/../../train-saturn.py", line 1072, in
trainer(args)
File "/Users/mdallang/unil/workspace/SATURN/Vignettes/frog_zebrafish_embryogenesis/../../train-saturn.py", line 654, in trainer
pretrain_model = pretrain_saturn(pretrain_model, pretrain_loader, optim_pretrain,
File "/Users/mdallang/unil/workspace/SATURN/Vignettes/frog_zebrafish_embryogenesis/../../train-saturn.py", line 258, in pretrain_saturn
loss_string += [f"Avg Loss {species}: {round(np.average(epoch_ave_losses[species]))}"]
ValueError: cannot convert float NaN to integer

I changed the path to the protein embeddings and the torch.device line in the script train-saturn.py (I replaced cuda with mps, since I have a Mac). Besides this, I didn't make any other changes.
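
For context, the change amounts to something like this (just a sketch with a CPU fallback, not the exact line in train-saturn.py):

import torch

# Sketch: prefer CUDA, fall back to Apple's MPS backend, otherwise use the CPU.
if torch.cuda.is_available():
    device = torch.device("cuda")
elif torch.backends.mps.is_available():
    device = torch.device("mps")
else:
    device = torch.device("cpu")
print(f"Using device: {device}")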

I tried to debug it by adding a line of code that prints the loss values for each species at every epoch, before the average is calculated: print(f"Epoch {epoch} - {species} losses: {epoch_ave_losses[species]}").

What I get is:

Pretraining...
0%| | 0/10 [00:00<?, ?it/s]Epoch 1 - frog losses: [nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan
nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan
nan nan nan nan])
Epoch 1 - zebrafish losses: [nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan
nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan
nan nan nan nan])
Epoch 1: L1 Loss 0.0 Rank Loss 9.588663101196289: 10%| | 1/10 [01:03<09:33, 63.^C

There are a lot of NaN values for all the following epochs too, and I don't know if that is normal. On the other hand, the Rank Loss is a real number and changes every epoch. So I was wondering why the average loss per epoch (epoch_ave_losses) is calculated at all, and whether I can just skip this line of code.

I tried to use np.nan_to_num to replace the NaNs with zero before calculating the average:
clean_losses = np.nan_to_num(epoch_ave_losses[species], nan=0.0)
loss_string += [f"Avg Loss {species}: {round(np.average(clean_losses))}"]

The code then runs, but I end up with another error at the STARTING METRIC TRAINING step:

Traceback (most recent call last):
File "/Users/mdallang/unil/workspace/SATURN/Vignettes/frog_zebrafish_embryogenesis/../../train-saturn.py", line 1079, in
trainer(args)
File "/Users/mdallang/unil/workspace/SATURN/Vignettes/frog_zebrafish_embryogenesis/../../train-saturn.py", line 799, in trainer
train(metric_model, loss_func, mining_func, device,
File "/Users/mdallang/unil/workspace/SATURN/Vignettes/frog_zebrafish_embryogenesis/../../train-saturn.py", line 131, in train
loss.backward()
File "/opt/miniconda3/envs/saturn/lib/python3.10/site-packages/torch/_tensor.py", line 525, in backward
torch.autograd.backward(
File "/opt/miniconda3/envs/saturn/lib/python3.10/site-packages/torch/autograd/init.py", line 267, in backward
_engine_run_backward(
File "/opt/miniconda3/envs/saturn/lib/python3.10/site-packages/torch/autograd/graph.py", line 744, in _engine_run_backward
return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
RuntimeError: Function 'NativeLayerNormBackward0' returned nan values in its 0th output.

I imagine all of this is due to these NaN values?
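
In case it helps, this is the kind of quick check I was planning to add to see where the NaNs first appear (the tensor name at the end is just a placeholder, not an actual variable from train-saturn.py):

import torch

# Sketch of a quick NaN counter; `some_tensor` is a placeholder for whatever
# feeds the metric model (embeddings, per-batch losses, etc.).
def count_nans(name, tensor):
    n_nan = torch.isnan(tensor).sum().item()
    print(f"{name}: {n_nan} NaN values out of {tensor.numel()}")

count_nans("pretrained embeddings", some_tensor)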

I hope the problem is clear. I went through all the open issues to see if someone had already raised a similar one before asking, but I didn't find any.

Thank you,
Michela

@Yanay1
Collaborator

Yanay1 commented Aug 15, 2024

Hi Michela,

There shouldn't be any NaNs during training; it's possible that the change to MPS caused this, but I am not sure.

Any kind of NaN during training is really concerning and will probably completely throw off the model.
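
If it helps to narrow it down, a guard along these lines in the training loop would fail fast at the first NaN loss (just a sketch; loss and epoch are the variables already used in train-saturn.py's loop):

import torch

# Sketch: stop as soon as a loss turns NaN so the offending epoch/batch is obvious.
if torch.isnan(loss).any():
    raise RuntimeError(f"NaN loss in epoch {epoch}, inspect this batch's inputs")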

@micheladallangelo
Author

Hi Yanay,

Thank you for answering. I ran the model on CPU and the epoch loss is fine: there are no NaN values anymore. So you were right, the problem was the change to MPS, even though I don't understand the reason behind it :)

If it's too slow, I will run it on the cluster where we have CUDA installed!
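
In case it's useful to anyone else, a toy comparison like this (not the SATURN model, just a Linear + LayerNorm block) could show whether the MPS backend itself produces NaNs on a given setup:

import torch
import torch.nn as nn

# Toy check: run the same block on CPU and on MPS and see whether either
# backend produces NaNs in the forward pass.
torch.manual_seed(0)
x = torch.randn(8, 256)
block = nn.Sequential(nn.Linear(256, 256), nn.LayerNorm(256))

print("NaNs on CPU:", torch.isnan(block(x)).any().item())

if torch.backends.mps.is_available():
    out_mps = block.to("mps")(x.to("mps"))
    print("NaNs on MPS:", torch.isnan(out_mps).any().item())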
