Float division by zero on CUDA memory outage #234

Open
Linus-XZX opened this issue Jun 17, 2024 · 0 comments

Linus-XZX commented Jun 17, 2024

After CUDA runs out of memory, the loss calculation and summary fail with a float division by zero error. It seems that the batch-skipping functionality is not entirely working.
(Screenshot taken at batch size 16.)
image
Conda and pip envs are as follows.
pip_env.txt
conda_env.txt

While using a smaller batch size is a valid workaround, any help will be appreciated here.

Edit: At batch size 4 the skip seems to work properly, but it interferes with validation, possibly because batches of size 1 are skipped.
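
For illustration, here is a minimal sketch of what I suspect is happening. It assumes the epoch loss is averaged over the batches that were actually processed, so that if every batch is skipped the divisor is zero. The names `mean_loss`, `run_epoch`, and `step` are hypothetical and not taken from this project's code, and `MemoryError` stands in for the CUDA out-of-memory error so the sketch runs without a GPU.

```python
def mean_loss(total_loss: float, counted_batches: int) -> float:
    """Average the loss over processed batches; return NaN if none ran."""
    if counted_batches == 0:
        # Every batch was skipped: report NaN instead of dividing by zero.
        return float("nan")
    return total_loss / counted_batches


def run_epoch(batches, step):
    """Call `step` on each batch, skipping batches that run out of memory.

    `step` is a hypothetical callable that returns the batch loss as a float.
    MemoryError is a stand-in for a CUDA out-of-memory error.
    """
    total, counted, skipped = 0.0, 0, 0
    for batch in batches:
        try:
            total += step(batch)
            counted += 1
        except MemoryError:
            skipped += 1
    return mean_loss(total, counted), counted, skipped
```

With a guard like this, an epoch in which all batches are skipped would report a NaN loss together with the skip count rather than crashing in the summary. Whether that matches the actual training loop here is an assumption on my part.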

@Linus-XZX Linus-XZX changed the title Float division by zero on memory outage Float division by zero on CUDA memory outage Jun 17, 2024