Error while I run training #1

hnnam0906 · 2024-10-24T01:46:09Z

I get the following error when running the training command from the README:
"python run.py --graph_size 50 --problem mtsp --run_name 'mtsp50' --agent_min 2 --agent_max 10"

assert not torch.isnan(log_p).any()
AssertionError
(raised in the _get_log_p method)

As far as I can tell, all of the log_p values are NaN, which triggers the assertion. Could you please help investigate and fix the issue?
Thanks
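One way to narrow down why every entry of log_p is NaN is to check what feeds the log softmax, row by row. The sketch below is framework-free plain Python, not the repository's code, and it rests on assumptions the thread does not confirm: infeasible actions are masked to -inf before normalization (common in attention-based routing decoders), and the helper names masked_log_softmax and nan_cause are hypothetical. It shows that a max-shifted log softmax turns the whole row into NaN when the row contains a NaN, contains +inf, or has every action masked. In PyTorch the same checks could be written with torch.isnan, torch.isinf and mask.all(dim=-1).

```python
import math

NEG_INF = float("-inf")


def masked_log_softmax(logits, mask):
    """Max-shifted log softmax over one row; mask[i] is True for an infeasible action."""
    # Assumption: infeasible actions are set to -inf before normalization.
    x = [NEG_INF if m else v for v, m in zip(logits, mask)]
    top = max(x)
    # Shifting by the max keeps exp() from overflowing for large finite values,
    # but -inf - (-inf) and inf - inf are NaN, which then spreads to the whole row.
    shifted = [v - top for v in x]
    lse = math.log(sum(math.exp(s) for s in shifted))
    return [s - lse for s in shifted]


def nan_cause(logits, mask):
    """Return a (hypothetical) label for why a row gives NaN log-probs, or None."""
    x = [NEG_INF if m else v for v, m in zip(logits, mask)]
    if any(math.isnan(v) for v in x):
        return "nan in logits"
    if any(v == float("inf") for v in x):
        return "+inf in logits"
    if all(v == NEG_INF for v in x):
        return "every action masked"
    return None


rows = [
    ([1.0, 2.0, 3.0], [False, False, True]),            # healthy row
    ([1.0, 2.0, 3.0], [True, True, True]),              # nothing left to choose
    ([1.0, float("inf"), 3.0], [False, False, False]),  # an exploding score
]
for logits, mask in rows:
    out = masked_log_softmax(logits, mask)
    print(any(math.isnan(v) for v in out), nan_cause(logits, mask))
```

If the failing rows report "every action masked", the masking logic (for example, the per-agent feasibility rules) is the place to look; if they report "+inf in logits" or "nan in logits", the values were already broken before the log softmax.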

hnnam0906 · 2024-10-25T03:10:59Z

Hi @Leaveson, @hyeonahkimm,

I hope you are doing well.
Your solution looks really nice and I would like to use it. Could you please investigate and fix the issue above so that I can train the model myself?

Thanks

hnnam0906 · 2024-10-29T19:27:15Z

Hi @Leaveson, @hyeonahkimm,

Could you help solve the issue above?
I look forward to hearing from you.

Thanks.

hyeonahkimm · 2024-10-29T20:46:56Z

Hi @hnnam0906,

When I trained the model with the same script, the error did not occur before the first epoch finished. However, I suspect it may be caused by numerical instability in the log softmax operation: when the values passed to it become too large, the operation can produce NaN values. Please try clipping these values before the log softmax function, and let me know if you still run into the same problem.

Best,
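The clipping suggested above can be sketched as follows. This is plain Python for illustration, not the repository's code; it assumes tanh clipping (each raw score s becomes c * tanh(s)) with a hypothetical default c = 10.0, applied before masking and the log softmax. The helper names tanh_clip, log_softmax and clipped_log_p are hypothetical. In PyTorch the corresponding calls would be torch.tanh and torch.log_softmax(..., dim=-1).

```python
import math

NEG_INF = float("-inf")


def tanh_clip(scores, c=10.0):
    """Squash raw scores into [-c, c]; c = 10.0 is an assumed default."""
    # math.tanh(inf) == 1.0, so even an exploding +inf score becomes c.
    return [c * math.tanh(s) for s in scores]


def log_softmax(x):
    """Stable log softmax: shift by the max so exp() never overflows."""
    top = max(x)
    shifted = [v - top for v in x]
    lse = math.log(sum(math.exp(s) for s in shifted))
    return [s - lse for s in shifted]


def clipped_log_p(scores, mask, c=10.0):
    """Clip first, then mask infeasible actions (mask[i] True) to -inf, then normalize."""
    clipped = tanh_clip(scores, c)
    masked = [NEG_INF if m else v for v, m in zip(clipped, mask)]
    return log_softmax(masked)


raw = [1e300, float("inf"), 2.0]
mask = [False, False, False]
print(log_softmax(raw))          # every entry is nan: the +inf score poisons the row
print(clipped_log_p(raw, mask))  # finite log-probabilities
```

Clipping bounds every unmasked score to [-c, c], so a single exploding score can no longer turn the row into NaN; the trade-off is that the ratio between any two unmasked probabilities is capped at exp(2c). It does not help when every action in a row is masked or when the scores are already NaN, so those cases still need to be ruled out separately.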
