Issues when running yogo test #153

Open
zbarry opened this issue Jun 12, 2024 · 13 comments · Fixed by #154

Comments

@zbarry

zbarry commented Jun 12, 2024

Need to manually specify --wandb to avoid an error:


Traceback (most recent call last):
  File "/opt/conda/bin/yogo", line 8, in <module>
    sys.exit(main())
  File "/home/zachary/yogo/yogo/__main__.py", line 18, in main
    do_model_test(args)
  File "/home/zachary/yogo/yogo/utils/test_model.py", line 116, in do_model_test
    test_model(args)
  File "/home/zachary/yogo/yogo/utils/test_model.py", line 43, in test_model
    log_to_wandb = args.wandb or len(args.wandb_resume_id) > 0
TypeError: object of type 'NoneType' has no len()

(this can be worked around by explicitly including --wandb in the run command)
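
A minimal sketch of one possible guard (my suggestion only, not necessarily what #154 does), assuming wandb_resume_id defaults to None when the flag is omitted:

# Sketch only: treat a missing --wandb-resume-id as "don't resume"
# instead of calling len() on None.
log_to_wandb = bool(args.wandb) or bool(args.wandb_resume_id)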

Models trained without normalization error out during test:

~~wandb snip~~
Traceback (most recent call last):
  File "/opt/conda/bin/yogo", line 8, in <module>
    sys.exit(main())
  File "/home/zachary/yogo/yogo/__main__.py", line 18, in main
    do_model_test(args)
  File "/home/zachary/yogo/yogo/utils/test_model.py", line 116, in do_model_test
    test_model(args)
  File "/home/zachary/yogo/yogo/utils/test_model.py", line 71, in test_model
    normalize_images=cfg["normalize_images"],
KeyError: 'normalize_images'

I think this line should be replaced with normalize_images=cfg.get("normalize_images", False).
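
A minimal sketch of that fallback in context (assuming cfg is the dict loaded from the checkpoint, and that older checkpoints simply predate the normalize_images key):

# Sketch only: default to False for checkpoints saved before the
# normalize_images option existed, so older models can still be tested.
normalize_images=cfg.get("normalize_images", False),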

@zbarry
Author

zbarry commented Jun 12, 2024

Also running out of memory with yogo test - I guess this is a batch size thing? (though it's not possible to specify one with the test command). Curious that it was able to train just fine with the given configuration, though!

loading dataset: 100%|████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:01<00:00,  1.02s/it]
loading test dataset: 100%|███████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  8.32it/s]
Traceback (most recent call last):
  File "/opt/conda/bin/yogo", line 8, in <module>
    sys.exit(main())
  File "/home/zachary/yogo/yogo/__main__.py", line 18, in main
    do_model_test(args)
  File "/home/zachary/yogo/yogo/utils/test_model.py", line 116, in do_model_test
    test_model(args)
  File "/home/zachary/yogo/yogo/utils/test_model.py", line 90, in test_model
    test_metrics = Trainer.test(
  File "/opt/conda/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/zachary/yogo/yogo/train.py", line 493, in test
    test_metrics.update(outputs.detach(), labels.detach())
  File "/opt/conda/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/zachary/yogo/yogo/metrics.py", line 157, in update
    self.prediction_metrics.update(fps[:, 5:], fls[:, 5:].squeeze().long())
  File "/opt/conda/lib/python3.10/site-packages/torchmetrics/collections.py", line 220, in update
    m.update(*args, **m_kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torchmetrics/metric.py", line 492, in wrapped_func
    raise err
  File "/opt/conda/lib/python3.10/site-packages/torchmetrics/metric.py", line 482, in wrapped_func
    update(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torchmetrics/classification/precision_recall_curve.py", line 368, in update
    state = _multiclass_precision_recall_curve_update(
  File "/opt/conda/lib/python3.10/site-packages/torchmetrics/functional/classification/precision_recall_curve.py", line 486, in _multiclass_precision_recall_curve_update
    return update_fn(preds, target, num_classes, thresholds)
  File "/opt/conda/lib/python3.10/site-packages/torchmetrics/functional/classification/precision_recall_curve.py", line 507, in _multiclass_precision_recall_curve_update_vectorized
    bins = _bincount(unique_mapping.flatten(), minlength=4 * num_classes * len_t)
  File "/opt/conda/lib/python3.10/site-packages/torchmetrics/utilities/data.py", line 204, in _bincount
    mesh = torch.arange(minlength, device=x.device).repeat(len(x), 1)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 425.94 GiB (GPU 0; 14.58 GiB total capacity; 309.89 MiB already allocated; 12.92 GiB free; 1.40 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
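
For a rough sense of scale (my own back-of-envelope, not anything from yogo itself): the failing line builds a (len(x), minlength) int64 mesh, so the 425.94 GiB request corresponds to tens of billions of elements in a single temporary tensor, which matches the identical allocation reported below at a batch size of 4:

# Back-of-envelope only, assuming int64 elements (8 bytes each):
requested_bytes = 425.94 * 1024**3   # the allocation from the traceback
elements = requested_bytes / 8
print(f"{elements:.1e}")             # ~5.7e10 elements for one temporary tensor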

@zbarry
Author

zbarry commented Jun 12, 2024

Interestingly, when I lower the batch size in the test_model.test_model function to 4, it errors slightly differently:

loading dataset: 100%|████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:01<00:00,  1.02s/it]
loading test dataset: 100%|███████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  8.11it/s]
Traceback (most recent call last):
  File "/opt/conda/bin/yogo", line 8, in <module>
    sys.exit(main())
  File "/home/zachary/yogo/yogo/__main__.py", line 18, in main
    do_model_test(args)
  File "/home/zachary/yogo/yogo/utils/test_model.py", line 116, in do_model_test
    test_model(args)
  File "/home/zachary/yogo/yogo/utils/test_model.py", line 90, in test_model
    test_metrics = Trainer.test(
  File "/opt/conda/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/zachary/yogo/yogo/train.py", line 493, in test
    test_metrics.update(outputs.detach(), labels.detach())
  File "/opt/conda/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/zachary/yogo/yogo/metrics.py", line 157, in update
    self.prediction_metrics.update(fps[:, 5:], fls[:, 5:].squeeze().long())
  File "/opt/conda/lib/python3.10/site-packages/torchmetrics/collections.py", line 220, in update
    m.update(*args, **m_kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torchmetrics/metric.py", line 492, in wrapped_func
    raise err
  File "/opt/conda/lib/python3.10/site-packages/torchmetrics/metric.py", line 482, in wrapped_func
    update(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torchmetrics/classification/precision_recall_curve.py", line 368, in update
    state = _multiclass_precision_recall_curve_update(
  File "/opt/conda/lib/python3.10/site-packages/torchmetrics/functional/classification/precision_recall_curve.py", line 486, in _multiclass_precision_recall_curve_update
    return update_fn(preds, target, num_classes, thresholds)
  File "/opt/conda/lib/python3.10/site-packages/torchmetrics/functional/classification/precision_recall_curve.py", line 507, in _multiclass_precision_recall_curve_update_vectorized
    bins = _bincount(unique_mapping.flatten(), minlength=4 * num_classes * len_t)
  File "/opt/conda/lib/python3.10/site-packages/torchmetrics/utilities/data.py", line 204, in _bincount
    mesh = torch.arange(minlength, device=x.device).repeat(len(x), 1)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 425.94 GiB (GPU 0; 14.58 GiB total capacity; 309.89 MiB already allocated; 12.92 GiB free; 1.40 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

@paul-lebel
Collaborator

Created a branch for this, thanks @zbarry!

The first two bugs are easy fixes, but I also ran into an error when running yogo test <path to pth> <path to dataset defn>:

 File "/home/paul.lebel/Documents/github/yogo/yogo/data/yogo_dataloader.py", line 215, in get_dataloader
    rank = torch.distributed.get_rank()
  File "/home/paul.lebel/.conda/envs/yogo_conda/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1746, in get_rank
    default_pg = _get_default_group()
  File "/home/paul.lebel/.conda/envs/yogo_conda/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1008, in _get_default_group
    raise ValueError(
ValueError: Default process group has not been initialized, please make sure to call init_process_group.

@Axel-Jacobsen, should it be checking for ValueError here instead? The PyTorch docs do say it should be a RuntimeError, though...

Of course, the underlying problem is that init_process_group is never called in test_model.py?
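
One defensive pattern that would sidestep this (a sketch under my assumptions, not necessarily what the linked PR does) is to ask torch.distributed whether a process group exists before calling get_rank():

import torch.distributed as dist

def get_rank_or_zero() -> int:
    # Fall back to rank 0 when no process group has been initialized,
    # e.g. a single-process `yogo test` run.
    if dist.is_available() and dist.is_initialized():
        return dist.get_rank()
    return 0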

@paul-lebel paul-lebel linked a pull request Jun 13, 2024 that will close this issue
@paul-lebel
Collaborator

PyTorch definitely raises ValueError here 🤔

@zbarry
Author

zbarry commented Jun 17, 2024

Update on the CUDA OOM error - I lowered my test dataset size to just 10 images, down from ~6k, and I'm still getting OOMs during the torchmetrics calculation:

Traceback (most recent call last):
  File "/opt/conda/bin/yogo", line 8, in <module>
    sys.exit(main())
  File "/home/zachary/yogo/yogo/__main__.py", line 18, in main
    do_model_test(args)
  File "/home/zachary/yogo/yogo/utils/test_model.py", line 116, in do_model_test
    test_model(args)
  File "/home/zachary/yogo/yogo/utils/test_model.py", line 90, in test_model
    test_metrics = Trainer.test(
  File "/opt/conda/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/zachary/yogo/yogo/train.py", line 493, in test
    test_metrics.update(outputs.detach(), labels.detach())
  File "/opt/conda/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/zachary/yogo/yogo/metrics.py", line 158, in update
    self.prediction_metrics.update(fps[:, 5:], fls[:, 5:].squeeze().long())
  File "/opt/conda/lib/python3.10/site-packages/torchmetrics/collections.py", line 220, in update
    m.update(*args, **m_kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torchmetrics/metric.py", line 492, in wrapped_func
    raise err
  File "/opt/conda/lib/python3.10/site-packages/torchmetrics/metric.py", line 482, in wrapped_func
    update(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torchmetrics/classification/precision_recall_curve.py", line 368, in update
    state = _multiclass_precision_recall_curve_update(
  File "/opt/conda/lib/python3.10/site-packages/torchmetrics/functional/classification/precision_recall_curve.py", line 486, in _multiclass_precision_recall_curve_update
    return update_fn(preds, target, num_classes, thresholds)
  File "/opt/conda/lib/python3.10/site-packages/torchmetrics/functional/classification/precision_recall_curve.py", line 507, in _multiclass_precision_recall_curve_update_vectorized
    bins = _bincount(unique_mapping.flatten(), minlength=4 * num_classes * len_t)
  File "/opt/conda/lib/python3.10/site-packages/torchmetrics/utilities/data.py", line 204, in _bincount
    mesh = torch.arange(minlength, device=x.device).repeat(len(x), 1)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 26.02 GiB (GPU 0; 14.58 GiB total capacity; 29.48 MiB already allocated; 14.28 GiB free; 56.00 MiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

Adding print(fps, fls, fps.shape, fls.shape) just before self.prediction_metrics.update(fps[:, 5:], fls[:, 5:].squeeze().long()) in yogo/metrics.py gives:

tensor([[ 0.7696,  0.6382,  0.8076,  ...,  0.8735,  2.6758, -2.5820],
        [ 0.4599,  0.7504,  0.4981,  ...,  0.8848,  2.7930, -2.8516],
        [ 0.0086,  0.0440,  0.0473,  ...,  0.6221,  2.8320, -3.1641],
        ...,
        [ 0.9490,  0.9010,  0.9870,  ...,  0.8540,  2.5430, -2.8633],
        [ 0.2406,  0.9212,  0.2791,  ...,  0.5068,  2.4668, -2.8008],
        [ 0.3581,  0.9416,  0.3961,  ...,  0.6982,  2.5469, -3.0234]],
       device='cuda:0') tensor([[1.0000, 0.8255, 0.0111, 0.8634, 0.0491, 0.0000],
        [1.0000, 0.4667, 0.0153, 0.5046, 0.0532, 0.0000],
        [1.0000, 0.0097, 0.0431, 0.0477, 0.0810, 0.0000],
        ...,
        [1.0000, 0.9481, 0.9023, 0.9861, 0.9403, 0.0000],
        [1.0000, 0.2449, 0.9204, 0.2829, 0.9583, 0.0000],
        [1.0000, 0.3593, 0.9398, 0.3972, 0.9778, 0.0000]], device='cuda:0') torch.Size([873, 7]) torch.Size([873, 6])

@Axel-Jacobsen
Collaborator

back from vacation - addressing this now!

@Axel-Jacobsen
Collaborator

OK, looks like an issue with torchmetrics. I've found them to be finicky.

@zbarry, it would be very helpful to know some characteristics of your dataset - in #150, you mention your images are 512x512. Roughly how many objects per image are you expecting?

@Axel-Jacobsen
Collaborator

Also, how many classes do you have?

@Axel-Jacobsen
Collaborator

If you could post a link to download your dataset, if it's public, that would be very helpful 😁

@Axel-Jacobsen
Collaborator

I have a hunch that the multiclass precision recall metrics are just making a tonne of bins, requiring a tonne of memory. From the traceback above,

...
  File "/opt/conda/lib/python3.10/site-packages/torchmetrics/functional/classification/precision_recall_curve.py", line 486, in _multiclass_precision_recall_curve_update
    return update_fn(preds, target, num_classes, thresholds)
  File "/opt/conda/lib/python3.10/site-packages/torchmetrics/functional/classification/precision_recall_curve.py", line 507, in _multiclass_precision_recall_curve_update_vectorized
    bins = _bincount(unique_mapping.flatten(), minlength=4 * num_classes * len_t)
...

I'm capping the thresholds at 500 in add-multiclass-pr-thresholds-limit. Perhaps that'll fix it?
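
For reference, torchmetrics exposes this cap directly through the thresholds argument; a minimal sketch with assumed numbers (num_classes=2 taken from the dataset described below, 500 from the branch name):

from torchmetrics.classification import MulticlassPrecisionRecallCurve

# thresholds=None keeps every prediction score for an exact curve, while an
# integer bins the curve at that many evenly spaced thresholds, which bounds
# the size of the temporaries built during update().
pr_curve = MulticlassPrecisionRecallCurve(num_classes=2, thresholds=500)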

@zbarry
Author

zbarry commented Jun 20, 2024

Hey @Axel-Jacobsen - thanks for following up! I will get around to this more completely tomorrow (before I myself head out on vacation, haha), but for now at least some answers:

  • Did some digging around in the source code myself and also came to the hypothesis that there were simply too many bins being created.
  • This is a 2-class problem (really just one class "cell" and an unused second class as a placeholder)
  • I think we're expecting around 100-300 cells per image (possibly significantly more in edge cases).
  • I can ask about sharing some data; tomorrow I'd like to try your thresholds idea to see if that simply fixes it!

@Axel-Jacobsen
Collaborator

Sweet! Thank you. I'm waiting for a GPU rental service to approve me 🙄 so I'm a bit delayed w/ reproducing issues. But! Hopefully I'll be able to finally start fixing these issues soon.

@zbarry
Author

zbarry commented Jul 17, 2024

Hi! I ended up figuring out what was happening here - there were too many thresholds on MulticlassROC, which was causing a huge increase in memory consumption. I reduced the count way down, and the OOM issue went away.
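
For anyone hitting the same thing, the relevant knob is the thresholds argument on MulticlassROC; a small illustrative example with made-up inputs (873 predictions and 2 classes, matching the shapes printed earlier):

import torch
from torchmetrics.classification import MulticlassROC

roc = MulticlassROC(num_classes=2, thresholds=100)  # 100 is arbitrary; pick what you need
preds = torch.randn(873, 2).softmax(dim=-1)         # stand-in prediction scores
target = torch.randint(0, 2, (873,))                # stand-in class labels
roc.update(preds, target)
fpr, tpr, thresholds = roc.compute()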
