[BUG] FileNotFoundError from checkpoint_saver.py. #2407
Comments
@meg-huggingface hmm, that's odd. I was just monkeying around with the saver to allow it to work on filesystems that don't support hardlinks. However, I tested single-node and multi-node on a local drive, and I tested Google Colab saving to local storage and to a mounted Google Drive. No issues. So it's running in HF Spaces? I've actually never tried that. I feel this might be related to the filesystem or to permissions in that environment. Were there any other preceding logs about failures? Can you list the contents of the experiment output folders from the Space? Is there enough free space?
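For reference, a minimal Python sketch of one way to gather that info from inside the Space. The `output/train` path is an assumption based on timm's default output layout; adjust it to wherever the Space actually writes checkpoints:

```python
import os
import shutil

# Assumed default output location for timm's train.py ("./output/train/<experiment>").
out_dir = "output/train"

for exp in sorted(os.listdir(out_dir)):
    exp_path = os.path.join(out_dir, exp)
    if not os.path.isdir(exp_path):
        continue
    print(exp)
    for name in sorted(os.listdir(exp_path)):
        size = os.path.getsize(os.path.join(exp_path, name))
        print(f"  {name}: {size / 1e6:.1f} MB")

# Report free space on the volume holding the output folder.
total, used, free = shutil.disk_usage(out_dir)
print(f"free space: {free / 1e9:.1f} GB of {total / 1e9:.1f} GB")
```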
It is likely related to my change too, since it used to work; the question is why... Most of the change was about falling back to a copy when hardlinks can't be used. However, I did change os.rename -> os.replace. On Linux that shouldn't have changed anything, and os.replace is more appropriate for Windows, I believe.
Also, this isn't trying to delete an old file; at least, that callstack indicates it's during the save & swap part. To avoid corrupting the previous 'last' checkpoint if there's a crash or out-of-space error at the wrong moment, the new checkpoint is first saved to a temporary file and then swapped into place as 'last'. So it's failing at a weird spot. It suggests that the save to temp failed and the file didn't exist right after the save returned, or that somehow the mechanics of os.rename vs os.replace differ and one either suppressed the error, avoided a race, or was imbued with magic powers. I used to use os.rename to move temp -> last, and now use os.replace.
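For clarity, here is a minimal sketch of the save-and-swap and hardlink-fallback pattern being described (function names and paths are illustrative; this is not timm's actual checkpoint_saver code):

```python
import os
import shutil
import torch

def save_and_swap(state, tmp_path, last_path):
    # Write the new checkpoint to a temp file first, so a crash or out-of-space
    # error mid-write can't corrupt the existing 'last' checkpoint.
    torch.save(state, tmp_path)
    # Swap the temp file into place. os.replace overwrites an existing destination
    # on all platforms; os.rename would raise on Windows if last_path already exists.
    os.replace(tmp_path, last_path)

def duplicate_last(last_path, dst_path):
    # Hardlink the 'last' checkpoint to its final name when the filesystem allows it,
    # falling back to a plain copy otherwise.
    if os.path.exists(dst_path):
        os.remove(dst_path)
    try:
        os.link(last_path, dst_path)
    except OSError:
        shutil.copy2(last_path, dst_path)
```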
@meg-huggingface I tried the following: created a 4xL40S Space w/ a Docker Jupyter Notebook config, went to the shell, and in the default app folder ran:
It ran to completion, the saver was using hardlinks, and the saved checkpoints look correct. The dataset change shouldn't have any impact. A note on my arg differences: for backward-compat reasons I didn't update some scheduler-related config defaults that now have different 'best practice' values... I prefer to run with the newer values. I did release a new timm version (1.0.13) recently that includes the checkpoint saving update. Is it possible the train.py script on your Space is out of sync with the lib version being used? Though I did check to try to prevent that being an issue with the changes made...
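A quick sanity check for the version-mismatch question, run from the same environment the Space uses for training:

```python
import timm
print(timm.__version__)  # 1.0.13 or later includes the updated checkpoint saver
```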
Thanks so much. After reading this, I tried another version of my data and everything ran to completion, no problem (with the checkpoints & naming being handled appropriately).
@meg-huggingface k, sounds good... behaviour really shouldn't depend on the dataset. It's possible it was a filesystem hiccup, a loss of consistency (the first write wasn't seen by the subsequent rename/replace). That shouldn't happen on a normal filesystem, but with crazy cloud storage architectures, possibly under load or in abnormal circumstances? It's also possible there's a corner case when out of space where an error isn't logged/propagated on a failed write... FYI, when I was fiddling with timm in Spaces for the first time I noticed the CPU power is a bit on the weak side for image preprocessing. It might benefit from Pillow-SIMD, so I created a fork of the jupyterlab space with some updates.
@meg-huggingface have you run into this again? Can this be closed?
Describe the bug
After creating checkpoints 0, 1, and 2, pytorch-image-models/timm/utils/checkpoint_saver.py hits a FileNotFoundError, apparently trying to move a checkpoint that doesn't exist. This stops the training.
To Reproduce
Steps to reproduce the behavior:
```
random_num=0.0
subset=1over64
./train.sh 4 --dataset hfds/datacomp/imagenet-1k-random-${random_num}-${subset} --log-wandb --experiment ImageNetTraining${random_num}-${subset} --model seresnet34 --sched cosine --epochs 150 --warmup-epochs 5 --lr 0.4 --reprob 0.5 --remode pixel --batch-size 256 --amp -j 4
```
Expected behavior
The appropriate files, with the right filenames, are found when saving checkpoints, and training continues.
Traceback
Desktop (please complete the following information):
git clone https://github.com/huggingface/pytorch-image-models.git
https://github.com/huggingface/pytorch-image-models/blob/main/requirements.txt
torch>=1.7
Additional context
This must have been introduced very recently; I had been successfully running the code here until last week, and I re-clone every time I use it.