No Checkpoint found after running run_128_B.sh #23

Open
wd255 opened this issue Jul 13, 2024 · 0 comments
Comments

wd255 commented Jul 13, 2024

Because of #17, I turned off the sanity check in the Trainer code to skip validation.

Training finished, and the tfevent log files I got look correct, but:

  1. Validation failed, as expected, because of #17 (Training phase sanity check fails by loading "../../data/imagenet/val" as an image).
  2. I find nothing in my checkpoint path (it's an existing directory on my machine, set here: https://github.com/TencentARC/Open-MAGVIT2/blob/main/configs/imagenet_lfqgan_128_B.yaml#L15). The run totaled around 16000 steps.

The potential causes I can think of are

  1. Validation failed, and checkpoint saving is part of validation or depends on it.
  2. The total number of steps was too low to trigger a checkpoint.
  3. I'm not configuring the checkpoint save path correctly.
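
To help rule out cause 3, here is a minimal sketch that recursively searches a directory for checkpoint-like files, in case they were written somewhere other than the configured path (for example, under the log directory next to the tfevent files). The function name `find_checkpoints` is hypothetical, and the `.ckpt`, `.pt`, and `.pth` suffixes are an assumption based on common PyTorch conventions, not something confirmed for this repository.

```python
from pathlib import Path


def find_checkpoints(root, suffixes=(".ckpt", ".pt", ".pth")):
    """Return checkpoint-like files under `root`, newest first.

    The suffix list is an assumption (common PyTorch conventions);
    adjust it if the project saves checkpoints under another extension.
    """
    root = Path(root)
    if not root.is_dir():
        raise NotADirectoryError(f"{root} is not an existing directory")
    # rglob walks every subdirectory, so nested run folders are covered.
    hits = [p for p in root.rglob("*") if p.is_file() and p.suffix in suffixes]
    # Sort by modification time so the most recent checkpoint comes first.
    return sorted(hits, key=lambda p: p.stat().st_mtime, reverse=True)
```

Calling `find_checkpoints` on the repository root or on the directory that holds the tfevent files would show whether any checkpoint was saved at all. An empty result everywhere would point toward causes 1 or 2 rather than a misconfigured path.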

Has anyone encountered this error?
