[BUG] FileNotFoundError from checkpoint_saver.py. #2407

Closed
meg-huggingface opened this issue Jan 14, 2025 · 7 comments


meg-huggingface commented Jan 14, 2025

Describe the bug
After creating checkpoints 0, 1, and 2, pytorch-image-models/timm/utils/checkpoint_saver.py hits a FileNotFoundError, apparently trying to move a checkpoint file that doesn't exist. This stops training.

To Reproduce
Steps to reproduce the behavior:

  1. Using 4xL40S
  2. Using variables, e.g., random_num=0.0, subset=1over64
  3. ./train.sh 4 --dataset hfds/datacomp/imagenet-1k-random-{random_num}-{subset} --log-wandb --experiment ImageNetTraining{random_num}-{subset} --model seresnet34 --sched cosine --epochs 150 --warmup-epochs 5 --lr 0.4 --reprob 0.5 --remode pixel --batch-size 256 --amp -j 4

Expected behavior
The appropriate files, with the right filenames, are found when saving checkpoints, and training continues.

Traceback

Traceback (most recent call last):
  File "/app/pytorch-image-models/train.py", line 1236, in <module>
    main()
  File "/app/pytorch-image-models/train.py", line 959, in main
    best_metric, best_epoch = saver.save_checkpoint(epoch, metric=latest_metric)
  File "/app/pytorch-image-models/timm/utils/checkpoint_saver.py", line 125, in save_checkpoint
    self._replace(tmp_save_path, last_save_path)
  File "/app/pytorch-image-models/timm/utils/checkpoint_saver.py", line 73, in _replace
    os.replace(src, dst)
FileNotFoundError: [Errno 2] No such file or directory: './output/train/ImageNetTraining0.0-frac-1over64/tmp.pth.tar' -> './output/train/ImageNetTraining0.0-frac-1over64/last.pth.tar'

Desktop (please complete the following information):

  • OS: Hugging Face Spaces (Ubuntu, IIRC) FROM docker.io/library/python:3.9@sha256:c1b677ce9a2c118bfaa135709704f66a9beb19f5f21112b88265acf566c87cbb
  • This repository version: Current repo when running git clone https://github.com/huggingface/pytorch-image-models.git
  • PyTorch version w/ CUDA/cuDNN: whatever is currently specified in https://github.com/huggingface/pytorch-image-models/blob/main/requirements.txt (torch>=1.7)

Additional context
This must have been introduced very recently; I had been running the code here successfully until last week, and I re-clone every time I use it.

meg-huggingface added the bug label Jan 14, 2025
meg-huggingface added a commit to meg-huggingface/pytorch-image-models that referenced this issue Jan 14, 2025
@rwightman (Collaborator)

@meg-huggingface hmm, that's odd. I was just monkeying around with the saver to allow it to work on filesystems that don't support hardlinks. However, I tested single-node and multi-node on a local drive, and I tested Google Colab saving to local storage and to a mounted Google Drive. No issues.

So it's running in HF Spaces? I've actually never tried that. I feel this might be related to the filesystem or permissions in that environment. Were there any other preceding logs about failures? Can you list the contents of the experiment output folders from the Space? Is there enough free space?

@rwightman
Copy link
Collaborator

It is related to my change too, since it used to work; the question is why...

deb9895

Most of the change was related to falling back to copy if hardlinks cannot be used. However, I did change os.rename -> os.replace... though on Linux that shouldn't have changed anything, and it is more appropriate for Windows, I believe.
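For reference, a minimal illustration of the os.rename vs os.replace semantics in question; this is throwaway example code, not timm code, and the file names are just placeholders:

import os
import pathlib
import tempfile

# On POSIX, both os.rename and os.replace map to rename(2): the move is atomic
# and an existing destination file is silently overwritten. On Windows,
# os.rename raises if the destination exists, while os.replace overwrites it,
# which is why os.replace is the more portable choice.
d = pathlib.Path(tempfile.mkdtemp())
(d / "tmp.pth.tar").write_bytes(b"new checkpoint")
(d / "last.pth.tar").write_bytes(b"old checkpoint")

os.replace(d / "tmp.pth.tar", d / "last.pth.tar")
assert (d / "last.pth.tar").read_bytes() == b"new checkpoint"

# Either function raises FileNotFoundError when the source (or the destination
# directory) is missing, which matches the traceback above: tmp.pth.tar did
# not exist at the moment of the swap.
try:
    os.replace(d / "tmp.pth.tar", d / "last.pth.tar")
except FileNotFoundError as e:
    print(e)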

rwightman (Collaborator) commented Jan 14, 2025

Also, this isn't trying to delete an old file; at least, that call stack indicates it's failing during the save & swap part. To avoid corrupting the previously saved 'last' checkpoint if there is a crash or an out-of-space error at the wrong moment, the new checkpoint is first saved to tmp.pth.tar. Then, assuming that didn't throw, it's supposed to rename that temp file to last.pth.tar. It should then go on to hard link that to a numbered checkpoint-{epoch}.pth.tar and possibly best.pth.tar.

So, it's failing at a weird spot. That suggests either that the save to the temp file failed and the file doesn't exist right after the save returned, or that somehow the mechanics of os.rename vs os.replace are different and one either suppressed the error, avoided a race, or was imbued with magic powers. I used to use os.rename to move temp -> last; now I use os.replace.
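A rough sketch of the save-and-swap flow described above, for readers following along; this is an illustration of the pattern, not the actual checkpoint_saver.py code, and the helper name and arguments are made up:

import os
import shutil

import torch


def save_and_swap(state: dict, output_dir: str, epoch: int):
    # Hypothetical helper illustrating the described flow; not timm's implementation.
    tmp_path = os.path.join(output_dir, "tmp.pth.tar")
    last_path = os.path.join(output_dir, "last.pth.tar")
    epoch_path = os.path.join(output_dir, f"checkpoint-{epoch}.pth.tar")

    # 1. Write the new checkpoint to a temp file so a crash or out-of-space
    #    error at this point cannot corrupt the existing last.pth.tar.
    torch.save(state, tmp_path)

    # 2. Atomically swap the temp file into place as last.pth.tar. This is the
    #    os.replace call that raised FileNotFoundError in the traceback above,
    #    i.e. tmp.pth.tar was not visible when the swap was attempted.
    os.replace(tmp_path, last_path)

    # 3. Hard link (or copy, on filesystems without hardlink support) the new
    #    last.pth.tar to a numbered per-epoch checkpoint.
    if os.path.exists(epoch_path):
        os.remove(epoch_path)
    try:
        os.link(last_path, epoch_path)
    except OSError:
        shutil.copy2(last_path, epoch_path)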

rwightman (Collaborator) commented Jan 14, 2025

@meg-huggingface I tried the following: I created a 4xL40S Space w/ a Docker Jupyter Notebook config, went to the shell, and in the default app folder ran:

pip3 install torch torchvision torchaudio
git clone https://github.com/huggingface/pytorch-image-models
cd pytorch-image-models
pip install -r requirements.txt
pip install datasets wandb
<wandb login>
./distributed_train.sh 4 --dataset hfds/timm/mini-imagenet --log-wandb --experiment ImageNetTraining-0.1-test --model seresnet34 --sched cosine --epochs 30 --warmup-epochs 5 --lr 0.4 --reprob 0.5 --remode pixel --batch-size 256 --amp -j 4 --sched-on-updates --warmup-prefix --num-classes 100

It ran to completion, the saver was using hardlinks, and the saved checkpoints look correct. The dataset change shouldn't have any impact.

A note on my arg differences: for backward-compat reasons I didn't update some scheduler-related config defaults that have different 'best practice' values now... I prefer to run with --sched-on-updates --warmup-prefix to update the scheduler with each optimizer step (instead of once per epoch) and to avoid a discontinuity at the transitions from warmup -> decay and decay -> cooldown/end.

I did release a new timm version (1.0.13) recently that includes the checkpoint saving update. Is it possible the train.py script on your Space is out of sync with the library version being used? Though I did check to try to prevent that from being an issue with the changes made...

~/app/pytorch-image-models$ ls output/train/ImageNetTraining-0.1-test/ -ltri
total 2018120
5906643511 -rw-r--r-- 1 user user      2609 Jan 15 00:08 args.yaml
5906644672 -rw-r--r-- 1 user user 172211426 Jan 15 00:14 checkpoint-25.pth.tar
5906644673 -rw-r--r-- 1 user user 172211426 Jan 15 00:14 checkpoint-26.pth.tar
5906644677 -rw-r--r-- 1 user user 172211426 Jan 15 00:15 checkpoint-27.pth.tar
5906644675 -rw-r--r-- 1 user user 172211426 Jan 15 00:15 checkpoint-28.pth.tar
5906644674 -rw-r--r-- 1 user user 172211426 Jan 15 00:15 checkpoint-29.pth.tar
5906644676 -rw-r--r-- 1 user user 172211426 Jan 15 00:15 checkpoint-30.pth.tar
5906644679 -rw-r--r-- 1 user user 172211426 Jan 15 00:15 checkpoint-31.pth.tar
5906643515 -rw-r--r-- 1 user user 172211426 Jan 15 00:16 checkpoint-32.pth.tar
5906644678 -rw-r--r-- 1 user user 172211426 Jan 15 00:16 checkpoint-33.pth.tar
5906643513 -rw-r--r-- 1 user user      3399 Jan 15 00:16 summary.csv
5906643519 -rw-r--r-- 3 user user 172211426 Jan 15 00:16 model_best.pth.tar
5906643519 -rw-r--r-- 3 user user 172211426 Jan 15 00:16 last.pth.tar
5906643519 -rw-r--r-- 3 user user 172211426 Jan 15 00:16 checkpoint-34.pth.tar
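As a side note on the listing above, the link count of 3 and the shared inode (5906643519) on checkpoint-34.pth.tar, model_best.pth.tar, and last.pth.tar are what working hardlinks look like. A quick way to verify the same thing from Python, with the experiment path below as a placeholder:

import os

# Placeholder experiment directory; adjust to your own output folder.
out_dir = "output/train/ImageNetTraining-0.1-test"
last = os.stat(os.path.join(out_dir, "last.pth.tar"))
best = os.stat(os.path.join(out_dir, "model_best.pth.tar"))

print(last.st_nlink)               # 3: last, model_best, and checkpoint-34 share one file
print(last.st_ino == best.st_ino)  # True: same inode, so no extra disk space is used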

@meg-huggingface (Author)

Thanks so much. After reading this, I tried another version of my data and everything ran to completion with no problem (the checkpoints & naming were handled appropriately).
I'm wondering if there was a temporary issue on Spaces and it was just a strange coincidence that it corresponded to what you had been working on.
I'm now rerunning with the same version of the data that caused the error for this example, and will circle back with an update.

rwightman (Collaborator) commented Jan 15, 2025

@meg-huggingface k, sounds good.. behaviour really shouldn't depend on the dataset.

It's possible that it was a filesystem hiccup, a loss of consistency (the first write wasn't seen by the subsequent rename/replace). That shouldn't happen on a normal filesystem, but with crazy cloud storage architectures, possibly under load or in abnormal circumstances? It's also possible there's a corner case, when out of space, where an error isn't logged/propagated on a failed write...
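If it does recur, one way to narrow it down would be a defensive wrapper around the swap that flushes the write and verifies the temp file before replacing, so a silently failed or lazily flushed write surfaces with more context. This is only a debugging sketch built on the assumptions above, not something timm currently does, and the function name is made up:

import os

import torch


def save_then_swap(state: dict, tmp_path: str, last_path: str):
    # Write and explicitly flush/fsync so a lazily failing write errors out here
    # rather than showing up later as a missing source file.
    with open(tmp_path, "wb") as f:
        torch.save(state, f)
        f.flush()
        os.fsync(f.fileno())

    # Sanity checks for the suspected failure modes before the swap.
    if not os.path.exists(tmp_path):
        raise RuntimeError(f"{tmp_path} missing right after save (filesystem consistency issue?)")
    if os.path.getsize(tmp_path) == 0:
        raise RuntimeError(f"{tmp_path} is empty after save (out of disk space?)")

    os.replace(tmp_path, last_path)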

FYI, when I was fiddling with timm in Spaces for the first time I noticed that the CPU power is a bit on the weak side for image preprocessing. It might benefit from Pillow-SIMD, so I created a fork of the jupyterlab space with some updates:

  • Python 3.12
  • Uninstall pillow, build & install pillow-simd
  • Pre-install timm, wandb, datasets (though you can still check out from git, obviously)

https://huggingface.co/spaces/rwightman/jupyterlab-timm

@rwightman (Collaborator)

@meg-huggingface have you run into this again? Can this be closed?
