Issues when running yogo test
#153
Comments
Also running out of memory with

Interestingly, when I lower the batch size in the
Created a branch for this, thanks @zbarry! First two bugs are easy fixes, but I also ran into an error when running:
File "/home/paul.lebel/Documents/github/yogo/yogo/data/yogo_dataloader.py", line 215, in get_dataloader
rank = torch.distributed.get_rank()
File "/home/paul.lebel/.conda/envs/yogo_conda/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1746, in get_rank
default_pg = _get_default_group()
File "/home/paul.lebel/.conda/envs/yogo_conda/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1008, in _get_default_group
raise ValueError(
ValueError: Default process group has not been initialized, please make sure to call init_process_group.
@Axel-Jacobsen, should it be checking for
Of course, the base problem is that
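The traceback points at an unconditional torch.distributed.get_rank() call in get_dataloader. Below is a minimal sketch of the kind of guard being discussed; the helper name is hypothetical and this is not YOGO's actual code:

```python
import torch.distributed as dist

def get_rank_or_zero() -> int:
    # Single-process runs never call init_process_group(), so get_rank()
    # would raise the ValueError above; fall back to rank 0 in that case.
    if dist.is_available() and dist.is_initialized():
        return dist.get_rank()
    return 0
```

With a guard like this, single-GPU test runs behave as rank 0 while multi-process runs still see their real rank.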
PyTorch definitely raises ValueError here 🤔
Update on the CUDA OOM error - I lowered my test dataset size to just 10 images down from ~6k, and I'm still getting OOMs during the torchmetrics calculation:
Adding
Back from vacation - addressing this now!
Also, how many classes do you have?
If you could post a link to download your dataset, if it's public, that would be very helpful 😁
I have a hunch that the multiclass precision-recall metrics are just making a tonne of bins, requiring a tonne of memory. From the traceback above,
I'm fixing these at 500 in
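For context, a small sketch of why fixing the threshold count bounds memory in torchmetrics; the class count and tensors here are made up, not taken from YOGO:

```python
import torch
from torchmetrics.classification import MulticlassPrecisionRecallCurve

num_classes = 7  # placeholder; use your dataset's class count

# thresholds=None (the default) keeps every prediction in memory, so state
# grows with dataset size; an integer uses a fixed grid of thresholds and a
# constant-size confusion-matrix buffer instead.
pr_curve = MulticlassPrecisionRecallCurve(num_classes=num_classes, thresholds=500)

preds = torch.rand(32, num_classes)             # fake per-class scores
target = torch.randint(0, num_classes, (32,))   # fake labels
pr_curve.update(preds, target)
precision, recall, thresholds = pr_curve.compute()
```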
Hey @Axel-Jacobsen - thanks for following up! I will get around to this more completely tomorrow (before I myself head out on vacation, haha), but for now at least some answers:
Sweet! Thank you. I'm waiting for a GPU rental service to approve me 🙄 so I'm a bit delayed w/ reproducing issues. But! Hopefully I'll be able to finally start fixing these issues soon.
Hi! I ended up figuring out what was happening here - there were too many thresholds on
Need to manually specify --wandb to avoid an error:

(this is fixed by force-including --wandb in the run command)

Models trained without normalization error out during test:

Think this line should be replaced with normalize_images=cfg.get("normalize_images", False)
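A minimal sketch of why .get() matters here, assuming cfg is the dict-like config saved with an older model that may predate the normalize_images key:

```python
# Older runs may not have logged "normalize_images" at all, so indexing with
# cfg["normalize_images"] raises a KeyError; .get() with a False default keeps
# those models loadable and testable.
normalize_images = cfg.get("normalize_images", False)
```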