[BUG] FileNotFoundError from checkpoint_saver.py. #2407

Closed
meg-huggingface opened this issue Jan 14, 2025 · 7 comments


meg-huggingface commented Jan 14, 2025

Describe the bug
After creating checkpoints 0, 1, and 2, pytorch-image-models/timm/utils/checkpoint_saver.py hits a FileNotFoundError, apparently trying to move a checkpoint file that doesn't exist. This stops training.

To Reproduce
Steps to reproduce the behavior:

  1. Using 4xL40S
  2. Using variables, e.g., random_num=0.0, subset=1over64
  3. ./train.sh 4 --dataset hfds/datacomp/imagenet-1k-random-{random_num}-{subset} --log-wandb --experiment ImageNetTraining{random_num}-{subset} --model seresnet34 --sched cosine --epochs 150 --warmup-epochs 5 --lr 0.4 --reprob 0.5 --remode pixel --batch-size 256 --amp -j 4

Expected behavior
The appropriate files, with the right filenames, are found when saving checkpoints, and training continues.

Traceback

Traceback (most recent call last):
  File "/app/pytorch-image-models/train.py", line 1236, in <module>
    main()
  File "/app/pytorch-image-models/train.py", line 959, in main
    best_metric, best_epoch = saver.save_checkpoint(epoch, metric=latest_metric)
  File "/app/pytorch-image-models/timm/utils/checkpoint_saver.py", line 125, in save_checkpoint
    self._replace(tmp_save_path, last_save_path)
  File "/app/pytorch-image-models/timm/utils/checkpoint_saver.py", line 73, in _replace
    os.replace(src, dst)
FileNotFoundError: [Errno 2] No such file or directory: './output/train/ImageNetTraining0.0-frac-1over64/tmp.pth.tar' -> './output/train/ImageNetTraining0.0-frac-1over64/last.pth.tar'

Desktop (please complete the following information):

  • OS: Hugging Face Spaces (Ubuntu, IIRC) FROM docker.io/library/python:3.9@sha256:c1b677ce9a2c118bfaa135709704f66a9beb19f5f21112b88265acf566c87cbb
  • This repository version: Current repo when running git clone https://github.com/huggingface/pytorch-image-models.git
  • PyTorch version w/ CUDA/cuDNN: whatever is currently specified in https://github.com/huggingface/pytorch-image-models/blob/main/requirements.txt (torch>=1.7)

Additional context
This must have been introduced very recently; I had been running the code here successfully until last week, and I re-clone every time I use it.

meg-huggingface added the bug label Jan 14, 2025
meg-huggingface added a commit to meg-huggingface/pytorch-image-models that referenced this issue Jan 14, 2025
@rwightman (Collaborator)

@meg-huggingface hmm, that's odd. I was just monkeying around with the saver to allow it to work on filesystems that don't support hardlinks. However, I tested single-node and multi-node on a local drive, and I tested Google Colab saving to local storage and to a mounted Google Drive. No issues.

So it's running in HF Spaces? I've actually never tried that. I feel this might be related to the filesystem or permissions in that environment. Were there any other preceding logs about failures? Can you list the contents of the experiment output folders from the Space? Is there enough free space?

@rwightman
Copy link
Collaborator

It is related to my change too, since it used to work; the question is why...

deb9895

Most of the change was related to falling back to copy if hardlinks cannot be used. However, I did change os.rename -> os.replace... though on Linux that shouldn't have changed anything, and it is more appropriate for Windows, I believe.
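For reference, a minimal illustration of the os.rename vs os.replace semantics in question; this is throwaway example code, not timm code, and the file names are just placeholders:

import os
import pathlib
import tempfile

# On POSIX, both os.rename and os.replace map to rename(2): the move is atomic
# and an existing destination file is silently overwritten. On Windows,
# os.rename raises if the destination exists, while os.replace overwrites it,
# which is why os.replace is the more portable choice.
d = pathlib.Path(tempfile.mkdtemp())
(d / "tmp.pth.tar").write_bytes(b"new checkpoint")
(d / "last.pth.tar").write_bytes(b"old checkpoint")

os.replace(d / "tmp.pth.tar", d / "last.pth.tar")
assert (d / "last.pth.tar").read_bytes() == b"new checkpoint"

# Either function raises FileNotFoundError when the source (or the destination
# directory) is missing, which matches the traceback above: tmp.pth.tar did
# not exist at the moment of the swap.
try:
    os.replace(d / "tmp.pth.tar", d / "last.pth.tar")
except FileNotFoundError as e:
    print(e)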

rwightman (Collaborator) commented Jan 14, 2025

Also, this isn't trying to delete an old file; at least, that call stack indicates it's failing during the save & swap part. To avoid corrupting the previously saved 'last' checkpoint if there is a crash or an out-of-space error at the wrong moment, the new checkpoint is first saved to tmp.pth.tar. Then, assuming that didn't throw, it's supposed to rename that temp file to last.pth.tar. It should then go on to hard link that to a numbered checkpoint-{epoch}.pth.tar and possibly best.pth.tar.

So, it's failing at a weird spot. That suggests either that the save to the temp file failed and the file doesn't exist right after the save returned, or that somehow the mechanics of os.rename vs os.replace are different and one either suppressed the error, avoided a race, or was imbued with magic powers. I used to use os.rename to move temp -> last; now I use os.replace.
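A rough sketch of the save-and-swap flow described above, for readers following along; this is an illustration of the pattern, not the actual checkpoint_saver.py code, and the helper name and arguments are made up:

import os
import shutil

import torch


def save_and_swap(state: dict, output_dir: str, epoch: int):
    # Hypothetical helper illustrating the described flow; not timm's implementation.
    tmp_path = os.path.join(output_dir, "tmp.pth.tar")
    last_path = os.path.join(output_dir, "last.pth.tar")
    epoch_path = os.path.join(output_dir, f"checkpoint-{epoch}.pth.tar")

    # 1. Write the new checkpoint to a temp file so a crash or out-of-space
    #    error at this point cannot corrupt the existing last.pth.tar.
    torch.save(state, tmp_path)

    # 2. Atomically swap the temp file into place as last.pth.tar. This is the
    #    os.replace call that raised FileNotFoundError in the traceback above,
    #    i.e. tmp.pth.tar was not visible when the swap was attempted.
    os.replace(tmp_path, last_path)

    # 3. Hard link (or copy, on filesystems without hardlink support) the new
    #    last.pth.tar to a numbered per-epoch checkpoint.
    if os.path.exists(epoch_path):
        os.remove(epoch_path)
    try:
        os.link(last_path, epoch_path)
    except OSError:
        shutil.copy2(last_path, epoch_path)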

rwightman (Collaborator) commented Jan 14, 2025

@meg-huggingface I tried the following: I created a 4xL40S Space w/ a Docker Jupyter Notebook config, went to the shell, and in the default app folder ran:

pip3 install torch torchvision torchaudio
git clone https://github.com/huggingface/pytorch-image-models
cd pytorch-image-models
pip install -r requirements.txt
pip install datasets wandb
<wandb login>
./distributed_train.sh 4 --dataset hfds/timm/mini-imagenet --log-wandb --experiment ImageNetTraining-0.1-test --model seresnet34 --sched cosine --epochs 30 --warmup-epochs 5 --lr 0.4 --reprob 0.5 --remode pixel --batch-size 256 --amp -j 4 --sched-on-updates --warmup-prefix --num-classes 100

It ran to completion, the saver was using hardlinks, and the saved checkpoints look correct. The dataset change shouldn't have any impact.

A note on my arg differences: for backward-compat reasons I didn't update some scheduler-related config defaults that have different 'best practice' values now... I prefer to run with --sched-on-updates --warmup-prefix to update the scheduler with each optimizer step (instead of once per epoch) and to avoid a discontinuity at the transitions from warmup -> decay and decay -> cooldown/end.

I did release a new timm version (1.0.13) recently that includes the checkpoint saving update. Is it possible the train.py script on your Space is out of sync with the library version being used? Though I did check to try to prevent that from being an issue with the changes made...

~/app/pytorch-image-models$ ls output/train/ImageNetTraining-0.1-test/ -ltri
total 2018120
5906643511 -rw-r--r-- 1 user user      2609 Jan 15 00:08 args.yaml
5906644672 -rw-r--r-- 1 user user 172211426 Jan 15 00:14 checkpoint-25.pth.tar
5906644673 -rw-r--r-- 1 user user 172211426 Jan 15 00:14 checkpoint-26.pth.tar
5906644677 -rw-r--r-- 1 user user 172211426 Jan 15 00:15 checkpoint-27.pth.tar
5906644675 -rw-r--r-- 1 user user 172211426 Jan 15 00:15 checkpoint-28.pth.tar
5906644674 -rw-r--r-- 1 user user 172211426 Jan 15 00:15 checkpoint-29.pth.tar
5906644676 -rw-r--r-- 1 user user 172211426 Jan 15 00:15 checkpoint-30.pth.tar
5906644679 -rw-r--r-- 1 user user 172211426 Jan 15 00:15 checkpoint-31.pth.tar
5906643515 -rw-r--r-- 1 user user 172211426 Jan 15 00:16 checkpoint-32.pth.tar
5906644678 -rw-r--r-- 1 user user 172211426 Jan 15 00:16 checkpoint-33.pth.tar
5906643513 -rw-r--r-- 1 user user      3399 Jan 15 00:16 summary.csv
5906643519 -rw-r--r-- 3 user user 172211426 Jan 15 00:16 model_best.pth.tar
5906643519 -rw-r--r-- 3 user user 172211426 Jan 15 00:16 last.pth.tar
5906643519 -rw-r--r-- 3 user user 172211426 Jan 15 00:16 checkpoint-34.pth.tar
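As a side note on the listing above, the link count of 3 and the shared inode (5906643519) on checkpoint-34.pth.tar, model_best.pth.tar, and last.pth.tar are what working hardlinks look like. A quick way to verify the same thing from Python, with the experiment path below as a placeholder:

import os

# Placeholder experiment directory; adjust to your own output folder.
out_dir = "output/train/ImageNetTraining-0.1-test"
last = os.stat(os.path.join(out_dir, "last.pth.tar"))
best = os.stat(os.path.join(out_dir, "model_best.pth.tar"))

print(last.st_nlink)               # 3: last, model_best, and checkpoint-34 share one file
print(last.st_ino == best.st_ino)  # True: same inode, so no extra disk space is used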

@meg-huggingface (Author)

Thanks so much. After reading this, I tried another version of my data and everything ran to completion with no problem (the checkpoints & naming were handled appropriately).
I'm wondering if there was a temporary issue on Spaces and it was just a strange coincidence that it corresponded to what you had been working on.
I'm now rerunning with the same version of the data that caused the error for this example, and will circle back with an update.

rwightman (Collaborator) commented Jan 15, 2025

@meg-huggingface k, sounds good.. behaviour really shouldn't depend on the dataset.

It's possible that it was a filesystem hiccup, a loss of consistency (the first write wasn't seen by the subsequent rename/replace). That shouldn't happen on a normal filesystem, but with crazy cloud storage architectures, possibly under load or in abnormal circumstances? It's also possible there's a corner case, when out of space, where an error isn't logged/propagated on a failed write...
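If it does recur, one way to narrow it down would be a defensive wrapper around the swap that flushes the write and verifies the temp file before replacing, so a silently failed or lazily flushed write surfaces with more context. This is only a debugging sketch built on the assumptions above, not something timm currently does, and the function name is made up:

import os

import torch


def save_then_swap(state: dict, tmp_path: str, last_path: str):
    # Write and explicitly flush/fsync so a lazily failing write errors out here
    # rather than showing up later as a missing source file.
    with open(tmp_path, "wb") as f:
        torch.save(state, f)
        f.flush()
        os.fsync(f.fileno())

    # Sanity checks for the suspected failure modes before the swap.
    if not os.path.exists(tmp_path):
        raise RuntimeError(f"{tmp_path} missing right after save (filesystem consistency issue?)")
    if os.path.getsize(tmp_path) == 0:
        raise RuntimeError(f"{tmp_path} is empty after save (out of disk space?)")

    os.replace(tmp_path, last_path)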

FYI, when I was fiddling with timm in Spaces for the first time I noticed that the CPU power is a bit on the weak side for image preprocessing. It might benefit from Pillow-SIMD, so I created a fork of the jupyterlab space with some updates:

  • Python 3.12
  • Uninstall pillow, build & install pillow-simd
  • Pre-install timm, wandb, datasets (though you can still check out from git, obviously)

https://huggingface.co/spaces/rwightman/jupyterlab-timm

@rwightman (Collaborator)

@meg-huggingface have you run into this again? Can this be closed?
