Issues when running yogo test
#153
Comments
Also running out of memory with

Interestingly, when I lower the batch size in the
Created a branch for this, thanks @zbarry! First two bugs are easy fixes, but I also ran into an error when running:
File "/home/paul.lebel/Documents/github/yogo/yogo/data/yogo_dataloader.py", line 215, in get_dataloader
rank = torch.distributed.get_rank()
File "/home/paul.lebel/.conda/envs/yogo_conda/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1746, in get_rank
default_pg = _get_default_group()
File "/home/paul.lebel/.conda/envs/yogo_conda/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1008, in _get_default_group
raise ValueError(
ValueError: Default process group has not been initialized, please make sure to call init_process_group.
@Axel-Jacobsen, should it be checking for
Of course, the base problem is that
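The traceback points at an unconditional torch.distributed.get_rank() call in get_dataloader. Below is a minimal sketch of the kind of guard being discussed; the helper name is hypothetical and this is not YOGO's actual code:

```python
import torch.distributed as dist

def get_rank_or_zero() -> int:
    # Single-process runs never call init_process_group(), so get_rank()
    # would raise the ValueError above; fall back to rank 0 in that case.
    if dist.is_available() and dist.is_initialized():
        return dist.get_rank()
    return 0
```

With a guard like this, single-GPU test runs behave as rank 0 while multi-process runs still see their real rank.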
PyTorch definitely raises ValueError here 🤔
Update on the CUDA OOM error - I lowered my test dataset size to just 10 images down from ~6k, and I'm still getting OOMs during the torchmetrics calculation:
Adding
Back from vacation - addressing this now!
Also, how many classes do you have?
If you could post a link to download your dataset, if it's public, that would be very helpful 😁
I have a hunch that the multiclass precision-recall metrics are just making a tonne of bins, requiring a tonne of memory. From the traceback above,
I'm fixing these at 500 in
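For context, a small sketch of why fixing the threshold count bounds memory in torchmetrics; the class count and tensors here are made up, not taken from YOGO:

```python
import torch
from torchmetrics.classification import MulticlassPrecisionRecallCurve

num_classes = 7  # placeholder; use your dataset's class count

# thresholds=None (the default) keeps every prediction in memory, so state
# grows with dataset size; an integer uses a fixed grid of thresholds and a
# constant-size confusion-matrix buffer instead.
pr_curve = MulticlassPrecisionRecallCurve(num_classes=num_classes, thresholds=500)

preds = torch.rand(32, num_classes)             # fake per-class scores
target = torch.randint(0, num_classes, (32,))   # fake labels
pr_curve.update(preds, target)
precision, recall, thresholds = pr_curve.compute()
```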
Hey @Axel-Jacobsen - thanks for following up! I will get around to this more completely tomorrow (before I myself head out on vacation, haha), but for now at least some answers:
Sweet! Thank you. I'm waiting for a GPU rental service to approve me 🙄 so I'm a bit delayed w/ reproducing issues. But! Hopefully I'll be able to finally start fixing these issues soon.
Hi! I ended up figuring out what was happening here - there were too many thresholds on
Need to manually specify --wandb to avoid an error:

(this is fixed by force-including --wandb in the run command)

Models trained without normalization error out during test:

Think this line should be replaced with normalize_images=cfg.get("normalize_images", False)
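A minimal sketch of why .get() matters here, assuming cfg is the dict-like config saved with an older model that may predate the normalize_images key:

```python
# Older runs may not have logged "normalize_images" at all, so indexing with
# cfg["normalize_images"] raises a KeyError; .get() with a False default keeps
# those models loadable and testable.
normalize_images = cfg.get("normalize_images", False)
```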