requeue compatibility #74

Open
dennisbrookner opened this issue Nov 9, 2022 · 2 comments
@dennisbrookner

When submitting a job to a partition such as Harvard's gpu_requeue, the job sometimes gets killed and requeued. It would be desirable for careless to continue where it left off in this case, rather than starting over! E.g. some flag could be added to the careless call that means, "before starting, inspect the contents of the out directory for a partial run, and if you find one, continue from there."

I have no idea how easy or hard this would be to implement (or if it exists already?). If it does exist, amazing, and if not, I figured I would mention it. I was kind of assuming that this would be the default behavior, and I was a little bummed when my job was killed and started over!

@kmdalton
Member

i have often thought that i should implement model checkpointing. for a variety of reasons, this has historically been challenging to do. however, as of version 0.2.3, it is possible to save and load structure factors and scale parameters. it would not be overly painful to implement a flag that writes the parameters to disk every so often (something like 1,000 training steps seems an okay default). one could then use the --scale-file and --structure-factor-file flags to resume the job. i will note that some state will be lost in the optimizer; i have no idea if that is a material concern.
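
as a rough sketch (nothing careless currently ships), the periodic write could look like a small keras callback along these lines. the class name, the save_every default, and the save_weights() call are placeholders for whatever careless would actually dump to disk; only the idea of saving every ~1,000 training steps comes from the paragraph above.

```python
import tensorflow as tf

# rough sketch only -- not part of careless. the class name, the
# save_every default, and the use of save_weights() stand in for
# whatever state careless would actually write on a checkpoint.
class PeriodicCheckpoint(tf.keras.callbacks.Callback):
    def __init__(self, out_dir, save_every=1000):
        super().__init__()
        self.out_dir = out_dir
        self.save_every = save_every
        self._step = 0

    def on_train_batch_end(self, batch, logs=None):
        self._step += 1
        if self._step % self.save_every == 0:
            # a real implementation would write the scale parameters and
            # structure factors in the same format that --scale-file and
            # --structure-factor-file expect when resuming.
            self.model.save_weights(f"{self.out_dir}/step_{self._step}.weights.h5")
```

resuming would then just be a matter of rerunning careless with --scale-file and --structure-factor-file pointed at the most recent dump (minus whatever optimizer state is lost, as noted above).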

definitely a good suggestion. i need to think about it more.

@kmdalton kmdalton self-assigned this Nov 10, 2022
@kmdalton
Member

This would require a lot of work to do in a satisfying way, but the process is pretty much what I've been going through over on the abismal serialization branch. Essentially every layer and model needs the following three methods:

  • get_config()
  • from_config(cls, config)
  • build(input_shape)

and should be decorated with the @tfk.saving.register_keras_serializable(package="careless") decorator.

it can be tricky to get this stuff right, but a few pointers (a minimal sketch follows after this list):

  • for very simple layers you can just set self.built=True in the constructor like i did here.
  • in get_config you can use the keras serializer to handle objects that you have implemented with the above methods (and the deserializer in from_config). see this serialization example and corresponding deserialization example.
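
To make the pattern concrete, here is a rough sketch of a serializable layer. ScaleShift and its scale/shift arguments are invented for illustration and are not taken from the abismal branch; only the decorator and the get_config / from_config / build trio come from the pointers above.

```python
from tensorflow import keras as tfk

# illustrative only -- ScaleShift and its scale/shift arguments are made up;
# the pattern (the decorator plus get_config / from_config / build) is what
# the pointers above describe.
@tfk.saving.register_keras_serializable(package="careless")
class ScaleShift(tfk.layers.Layer):
    def __init__(self, scale=1.0, shift=0.0, **kwargs):
        super().__init__(**kwargs)
        self.scale = scale
        self.shift = shift

    def build(self, input_shape):
        # no trainable weights here; for a layer this simple one could
        # instead just set self.built = True in the constructor.
        super().build(input_shape)

    def call(self, inputs):
        return inputs * self.scale + self.shift

    def get_config(self):
        # plain python values go straight into the config dict; nested
        # objects made serializable the same way can be passed through
        # tfk.saving.serialize_keras_object here.
        config = super().get_config()
        config.update({"scale": self.scale, "shift": self.shift})
        return config

    @classmethod
    def from_config(cls, config):
        # mirror image of get_config; nested objects would go through
        # tfk.saving.deserialize_keras_object before the constructor call.
        return cls(**config)
```

A layer registered this way survives a tfk.saving.serialize_keras_object / deserialize_keras_object round trip, which is the property that loading a saved model relies on.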
