rfc21: add offline job state #306

Open: wants to merge 1 commit into base: master
5 changes: 4 additions & 1 deletion data/spec_21/states.dot
@@ -12,7 +12,7 @@ digraph states {
DEPEND;
PRIORITY;
SCHED;
-RUN;
+{rank=same; RUN; OFFLINE;}
CLEANUP;
}

@@ -25,6 +25,9 @@ digraph states {

SCHED -> PRIORITY [label="flux-restart"]

RUN -> OFFLINE [xlabel="disconnect"]
OFFLINE -> RUN [xlabel="reconnect"]

edge [weight=0 color="red"];

DEPEND -> CLEANUP [label="exception"];
171 changes: 85 additions & 86 deletions data/spec_21/states.svg
43 changes: 41 additions & 2 deletions spec_21.rst
@@ -110,6 +110,14 @@ RUN
job shells have been started, and a ``finish`` event once all the job shells
have exited. The state transitions to CLEANUP.

OFFLINE
The job was started, but the job manager has lost track of it due
to an error (for example, a system crash). The job manager is
attempting to reconnect to the running job. A ``disconnect``
event is logged on transition into this state. A ``reconnect``
event is logged once tracking has been reestablished, and the
state transitions back to RUN.

CLEANUP
The job has completed or an exception has occurred. Under normal termination,
the job manager waits for notification from the exec service that job
@@ -133,10 +141,10 @@ PENDING
The job is in DEPEND, PRIORITY, or SCHED states.

RUNNING
-The job is in RUN or CLEANUP states.
+The job is in RUN, OFFLINE, or CLEANUP states.

ACTIVE
-The job is in DEPEND, PRIORITY, SCHED, RUN, or CLEANUP states.
+The job is in DEPEND, PRIORITY, SCHED, RUN, OFFLINE, or CLEANUP states.
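
The virtual state membership after this change can be sketched as a small lookup. This is only an illustration under stated assumptions: the function name ``virtual_states`` and the set constants are hypothetical, not part of the RFC or of any Flux API, and only the states listed in this hunk are covered.

```python
# Hypothetical illustration of the virtual states described above.
# Only the states named in this hunk are included.
PENDING = frozenset({"DEPEND", "PRIORITY", "SCHED"})
RUNNING = frozenset({"RUN", "OFFLINE", "CLEANUP"})
ACTIVE = PENDING | RUNNING


def virtual_states(state):
    """Return the names of the virtual states that contain `state`."""
    names = []
    if state in PENDING:
        names.append("PENDING")
    if state in RUNNING:
        names.append("RUNNING")
    if state in ACTIVE:
        names.append("ACTIVE")
    return names
```

With these sets, a job in OFFLINE counts as both RUNNING and ACTIVE, the same as a job in RUN.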


Exceptions
@@ -391,6 +399,37 @@ status
{"timestamp":1552594348.0,"name":"epilog-finish","context":{"description":"/usr/sbin/job-epilog.sh", "status":0}}


Disconnect Event
^^^^^^^^^^^^^^^^

The job manager has lost track of a running job.

The following keys are OPTIONAL in the event context object:

id
(long long) job ID

Example:

.. code:: json

{"timestamp":1636747761.5495925,"name":"disconnect","context":{"id":341835776000}}
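
A consumer might check a ``disconnect`` entry against the description above roughly as follows. This is a minimal sketch: ``check_disconnect`` is a hypothetical helper, and treating "(long long)" as "any JSON integer" is an assumption made for the example.

```python
import json


def check_disconnect(line):
    """Return True if `line` is a disconnect event matching the text above.

    Hypothetical helper; not part of any Flux API.
    """
    event = json.loads(line)
    if event.get("name") != "disconnect":
        return False
    context = event.get("context", {})
    if not isinstance(context, dict):
        return False
    if "id" not in context:
        return True  # the id key is OPTIONAL
    job_id = context["id"]
    # Assumption: accept any integer. bool is a subclass of int in
    # Python, so it is excluded explicitly.
    return isinstance(job_id, int) and not isinstance(job_id, bool)
```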


Reconnect Event
^^^^^^^^^^^^^^^

The job manager has reconnected to the job shells.

The context SHALL be empty.

Example:

.. code:: json

{"timestamp":1636747761.827836,"name":"reconnect"}
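
The RUN/OFFLINE round trip driven by these two events can be sketched by replaying the example eventlog entries. The ``TRANSITIONS`` table and the ``replay`` function are hypothetical names for illustration. Ignoring an event that has no matching transition is an assumption of this sketch, not something the text specifies.

```python
import json

# Transitions added by this change: (current state, event name) -> next state.
TRANSITIONS = {
    ("RUN", "disconnect"): "OFFLINE",
    ("OFFLINE", "reconnect"): "RUN",
}

# The two example entries from the event descriptions above.
EVENTLOG = [
    '{"timestamp":1636747761.5495925,"name":"disconnect","context":{"id":341835776000}}',
    '{"timestamp":1636747761.827836,"name":"reconnect"}',
]


def replay(state, eventlog_lines):
    """Apply each event in order and return the resulting state."""
    for line in eventlog_lines:
        event = json.loads(line)
        # Assumption: events with no matching transition leave the state as is.
        state = TRANSITIONS.get((state, event["name"]), state)
    return state
```

Replaying only the first entry from RUN ends in OFFLINE, and replaying both returns the job to RUN.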


Free Event
^^^^^^^^^^
