Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Suspended workflows don't respect activeDeadlineSeconds #13722

Open
jakkubu opened this issue Oct 7, 2024 · 1 comment
Open

Suspended workflows don't respect activeDeadlineSeconds #13722

jakkubu opened this issue Oct 7, 2024 · 1 comment
Labels
area/suspend-resume Suspending and resuming workflows type/bug

Comments

@jakkubu
Copy link

jakkubu commented Oct 7, 2024

What happened? What did you expect to happen?

Reproduce:

  1. Start workflow
  2. Suspend during1st sleep
  3. Wait until 1st sleep finishes (sleep for 40s) and activedeadlineSeconds passed (100s)
  4. Workflow is in running state

Expectation 1: In step 4 workflow should fail after activeDeadlineSeconds pass.

I can imagine that this may be expected behaviour. This is questionable, and in our use case doesn't make sense. However in such case the following steps should succeed:

  1. Resume workflow after previous steps
  2. Next step fail

Expectation 2: In step 6 activeDeadlineSeconds should not fail and the deadline timer should be stopped for effective suspend time (workflow in suspend and running steps done).

Again: Expectation 1 is what would be most logical for me. In such case Expectation 2 would make no sense.

Version(s)

v3.5.11, v3.6.0-rc1, c9b093cf1844fad034f6b5a6a73a353bcfed5c7fe53ca10b4bd7f11b6483cd25

Paste a minimal workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflows that uses private images.

apiVersion: argoproj.io/v1alpha1
kind: WorkflowTemplate
metadata:
  name: sleep
spec:
  templates:
    - name: job-sleep
      steps:
        - - name: sleep1
            template: sleep
            arguments:
              parameters:
                - name: duration
                  value: '{{workflow.parameters.duration}}'
        - - name: sleep2
            template: sleep
            arguments:
              parameters:
                - name: duration
                  value: '{{workflow.parameters.duration}}'
    - name: sleep
      inputs:
        parameters:
          - name: duration
      container:
        image: ubuntu
        command:
          - sleep
        args:
          - '{{inputs.parameters.duration}}'
        imagePullPolicy: IfNotPresent
  entrypoint: job-sleep
  arguments:
    parameters:
      - name: duration
        value: '40'
  activeDeadlineSeconds: 100

Logs from the workflow controller

time="2024-10-07T14:05:55.719Z" level=info msg="Processing workflow" Phase= ResourceVersion=650257 namespace=argo-test workflow=sleep-jb6ft
time="2024-10-07T14:05:55.725Z" level=info msg="Task-result reconciliation" namespace=argo-test numObjs=0 workflow=sleep-jb6ft
time="2024-10-07T14:05:55.725Z" level=info msg="Updated phase  -> Running" namespace=argo-test workflow=sleep-jb6ft
time="2024-10-07T14:05:55.725Z" level=warning msg="Node was nil, will be initialized as type Skipped" namespace=argo-test workflow=sleep-jb6ft
time="2024-10-07T14:05:55.726Z" level=info msg="was unable to obtain node for , letting display name to be nodeName" namespace=argo-test workflow=sleep-jb6ft
time="2024-10-07T14:05:55.726Z" level=info msg="Steps node sleep-jb6ft initialized Running" namespace=argo-test workflow=sleep-jb6ft
time="2024-10-07T14:05:55.726Z" level=info msg="StepGroup node sleep-jb6ft-2735383173 initialized Running" namespace=argo-test workflow=sleep-jb6ft
time="2024-10-07T14:05:55.726Z" level=warning msg="Node was nil, will be initialized as type Skipped" namespace=argo-test workflow=sleep-jb6ft
time="2024-10-07T14:05:55.726Z" level=info msg="Pod node sleep-jb6ft-4036592383 initialized Pending" namespace=argo-test workflow=sleep-jb6ft
time="2024-10-07T14:05:55.732Z" level=info msg="Created pod: sleep-jb6ft[0].sleep1 (sleep-jb6ft-sleep-4036592383)" namespace=argo-test workflow=sleep-jb6ft
time="2024-10-07T14:05:55.732Z" level=info msg="Workflow step group node sleep-jb6ft-2735383173 not yet completed" namespace=argo-test workflow=sleep-jb6ft
time="2024-10-07T14:05:55.732Z" level=info msg="TaskSet Reconciliation" namespace=argo-test workflow=sleep-jb6ft
time="2024-10-07T14:05:55.732Z" level=info msg=reconcileAgentPod namespace=argo-test workflow=sleep-jb6ft
time="2024-10-07T14:05:55.739Z" level=info msg="Workflow update successful" namespace=argo-test phase=Running resourceVersion=650261 workflow=sleep-jb6ft
time="2024-10-07T14:06:05.736Z" level=info msg="Processing workflow" Phase=Running ResourceVersion=650261 namespace=argo-test workflow=sleep-jb6ft
time="2024-10-07T14:06:05.736Z" level=info msg="Task-result reconciliation" namespace=argo-test numObjs=1 workflow=sleep-jb6ft
time="2024-10-07T14:06:05.737Z" level=info msg="node changed" namespace=argo-test new.message= new.phase=Running new.progress=0/1 nodeID=sleep-jb6ft-4036592383 old.message= old.phase=Pending old.progress=0/1 workflow=sleep-jb6ft
time="2024-10-07T14:06:05.738Z" level=info msg="Workflow step group node sleep-jb6ft-2735383173 not yet completed" namespace=argo-test workflow=sleep-jb6ft
time="2024-10-07T14:06:05.738Z" level=info msg="TaskSet Reconciliation" namespace=argo-test workflow=sleep-jb6ft
time="2024-10-07T14:06:05.738Z" level=info msg=reconcileAgentPod namespace=argo-test workflow=sleep-jb6ft
time="2024-10-07T14:06:05.746Z" level=info msg="Workflow update successful" namespace=argo-test phase=Running resourceVersion=650293 workflow=sleep-jb6ft
time="2024-10-07T14:06:15.755Z" level=info msg="Processing workflow" Phase=Running ResourceVersion=650304 namespace=argo-test workflow=sleep-jb6ft
time="2024-10-07T14:06:15.757Z" level=info msg="Task-result reconciliation" namespace=argo-test numObjs=1 workflow=sleep-jb6ft
time="2024-10-07T14:06:15.757Z" level=info msg="node unchanged" namespace=argo-test nodeID=sleep-jb6ft-4036592383 workflow=sleep-jb6ft
time="2024-10-07T14:06:15.757Z" level=info msg="workflow suspended" namespace=argo-test workflow=sleep-jb6ft
time="2024-10-07T14:06:15.767Z" level=info msg="Workflow update successful" namespace=argo-test phase=Running resourceVersion=650311 workflow=sleep-jb6ft
time="2024-10-07T14:06:25.768Z" level=info msg="Processing workflow" Phase=Running ResourceVersion=650311 namespace=argo-test workflow=sleep-jb6ft
time="2024-10-07T14:06:25.769Z" level=info msg="Task-result reconciliation" namespace=argo-test numObjs=1 workflow=sleep-jb6ft
time="2024-10-07T14:06:25.770Z" level=info msg="node unchanged" namespace=argo-test nodeID=sleep-jb6ft-4036592383 workflow=sleep-jb6ft
time="2024-10-07T14:06:25.770Z" level=info msg="workflow suspended" namespace=argo-test workflow=sleep-jb6ft
time="2024-10-07T14:06:25.779Z" level=info msg="Workflow update successful" namespace=argo-test phase=Running resourceVersion=650311 workflow=sleep-jb6ft
time="2024-10-07T14:06:48.264Z" level=info msg="Processing workflow" Phase=Running ResourceVersion=650311 namespace=argo-test workflow=sleep-jb6ft
time="2024-10-07T14:06:48.266Z" level=info msg="Task-result reconciliation" namespace=argo-test numObjs=1 workflow=sleep-jb6ft
time="2024-10-07T14:06:48.266Z" level=info msg="node changed" namespace=argo-test new.message= new.phase=Succeeded new.progress=0/1 nodeID=sleep-jb6ft-4036592383 old.message= old.phase=Running old.progress=0/1 workflow=sleep-jb6ft
time="2024-10-07T14:06:48.266Z" level=info msg="workflow suspended" namespace=argo-test workflow=sleep-jb6ft
time="2024-10-07T14:06:48.279Z" level=info msg="Workflow update successful" namespace=argo-test phase=Running resourceVersion=650386 workflow=sleep-jb6ft
time="2024-10-07T14:06:48.285Z" level=info msg="cleaning up pod" action=labelPodCompleted key=argo-test/sleep-jb6ft-sleep-4036592383/labelPodCompleted
time="2024-10-07T14:06:58.285Z" level=info msg="Processing workflow" Phase=Running ResourceVersion=650386 namespace=argo-test workflow=sleep-jb6ft
time="2024-10-07T14:06:58.286Z" level=info msg="Task-result reconciliation" namespace=argo-test numObjs=1 workflow=sleep-jb6ft
time="2024-10-07T14:06:58.287Z" level=info msg="workflow suspended" namespace=argo-test workflow=sleep-jb6ft
time="2024-10-07T14:06:58.296Z" level=info msg="Workflow update successful" namespace=argo-test phase=Running resourceVersion=650386 workflow=sleep-jb6ft
time="2024-10-07T14:09:00.975Z" level=info msg="Processing workflow" Phase=Running ResourceVersion=650631 namespace=argo-test workflow=sleep-jb6ft
time="2024-10-07T14:09:00.976Z" level=info msg="Task-result reconciliation" namespace=argo-test numObjs=1 workflow=sleep-jb6ft
time="2024-10-07T14:09:00.977Z" level=info msg="Step group node sleep-jb6ft-2735383173 successful" namespace=argo-test workflow=sleep-jb6ft
time="2024-10-07T14:09:00.977Z" level=info msg="node sleep-jb6ft-2735383173 phase Running -> Succeeded" namespace=argo-test workflow=sleep-jb6ft
time="2024-10-07T14:09:00.977Z" level=info msg="node sleep-jb6ft-2735383173 finished: 2024-10-07 14:09:00.977251964 +0000 UTC" namespace=argo-test workflow=sleep-jb6ft
time="2024-10-07T14:09:00.977Z" level=info msg="StepGroup node sleep-jb6ft-3742187408 initialized Running" namespace=argo-test workflow=sleep-jb6ft
time="2024-10-07T14:09:00.977Z" level=info msg="SG Outbound nodes of sleep-jb6ft-4036592383 are [sleep-jb6ft-4036592383]" namespace=argo-test workflow=sleep-jb6ft
time="2024-10-07T14:09:00.977Z" level=warning msg="Node was nil, will be initialized as type Skipped" namespace=argo-test workflow=sleep-jb6ft
time="2024-10-07T14:09:00.977Z" level=info msg="Pod node sleep-jb6ft-3329728253 initialized Pending" namespace=argo-test workflow=sleep-jb6ft
time="2024-10-07T14:09:00.977Z" level=info msg="Workflow step group node sleep-jb6ft-3742187408 not yet completed" namespace=argo-test workflow=sleep-jb6ft
time="2024-10-07T14:09:00.977Z" level=info msg="TaskSet Reconciliation" namespace=argo-test workflow=sleep-jb6ft
time="2024-10-07T14:09:00.977Z" level=info msg=reconcileAgentPod namespace=argo-test workflow=sleep-jb6ft
time="2024-10-07T14:09:00.991Z" level=info msg="Workflow update successful" namespace=argo-test phase=Running resourceVersion=650647 workflow=sleep-jb6ft
time="2024-10-07T14:10:06.051Z" level=info msg="Processing workflow" Phase=Running ResourceVersion=650647 namespace=argo-test workflow=sleep-jb6ft
time="2024-10-07T14:10:06.052Z" level=info msg="Task-result reconciliation" namespace=argo-test numObjs=1 workflow=sleep-jb6ft
time="2024-10-07T14:10:06.052Z" level=info msg="Workflow pod is missing" namespace=argo-test nodeName="sleep-jb6ft[1].sleep2" nodePhase=Pending recentlyStarted=false workflow=sleep-jb6ft
time="2024-10-07T14:10:06.052Z" level=info msg="node sleep-jb6ft-3329728253 phase Pending -> Failed" namespace=argo-test workflow=sleep-jb6ft
time="2024-10-07T14:10:06.052Z" level=info msg="node sleep-jb6ft-3329728253 message: Step exceeded its deadline" namespace=argo-test workflow=sleep-jb6ft
time="2024-10-07T14:10:06.052Z" level=info msg="node sleep-jb6ft-3329728253 finished: 2024-10-07 14:10:06.052487716 +0000 UTC" namespace=argo-test workflow=sleep-jb6ft
time="2024-10-07T14:10:06.052Z" level=info msg="SG Outbound nodes of sleep-jb6ft-4036592383 are [sleep-jb6ft-4036592383]" namespace=argo-test workflow=sleep-jb6ft
time="2024-10-07T14:10:06.053Z" level=info msg="Step group node sleep-jb6ft-3742187408 deemed failed: child 'sleep-jb6ft-3329728253' failed" namespace=argo-test workflow=sleep-jb6ft
time="2024-10-07T14:10:06.053Z" level=info msg="node sleep-jb6ft-3742187408 phase Running -> Failed" namespace=argo-test workflow=sleep-jb6ft
time="2024-10-07T14:10:06.053Z" level=info msg="node sleep-jb6ft-3742187408 message: child 'sleep-jb6ft-3329728253' failed" namespace=argo-test workflow=sleep-jb6ft
time="2024-10-07T14:10:06.053Z" level=info msg="node sleep-jb6ft-3742187408 finished: 2024-10-07 14:10:06.05323005 +0000 UTC" namespace=argo-test workflow=sleep-jb6ft
time="2024-10-07T14:10:06.053Z" level=info msg="step group sleep-jb6ft-3742187408 was unsuccessful: child 'sleep-jb6ft-3329728253' failed" namespace=argo-test workflow=sleep-jb6ft
time="2024-10-07T14:10:06.053Z" level=info msg="Outbound nodes of sleep-jb6ft-3329728253 is [sleep-jb6ft-3329728253]" namespace=argo-test workflow=sleep-jb6ft
time="2024-10-07T14:10:06.053Z" level=info msg="Outbound nodes of sleep-jb6ft is [sleep-jb6ft-3329728253]" namespace=argo-test workflow=sleep-jb6ft
time="2024-10-07T14:10:06.053Z" level=info msg="node sleep-jb6ft phase Running -> Failed" namespace=argo-test workflow=sleep-jb6ft
time="2024-10-07T14:10:06.053Z" level=info msg="node sleep-jb6ft message: child 'sleep-jb6ft-3329728253' failed" namespace=argo-test workflow=sleep-jb6ft
time="2024-10-07T14:10:06.053Z" level=info msg="node sleep-jb6ft finished: 2024-10-07 14:10:06.053333091 +0000 UTC" namespace=argo-test workflow=sleep-jb6ft
time="2024-10-07T14:10:06.053Z" level=info msg="TaskSet Reconciliation" namespace=argo-test workflow=sleep-jb6ft
time="2024-10-07T14:10:06.053Z" level=info msg=reconcileAgentPod namespace=argo-test workflow=sleep-jb6ft
time="2024-10-07T14:10:06.053Z" level=info msg="Updated phase Running -> Failed" namespace=argo-test workflow=sleep-jb6ft
time="2024-10-07T14:10:06.053Z" level=info msg="Updated message  -> child 'sleep-jb6ft-3329728253' failed" namespace=argo-test workflow=sleep-jb6ft
time="2024-10-07T14:10:06.053Z" level=info msg="Marking workflow completed" namespace=argo-test workflow=sleep-jb6ft
time="2024-10-07T14:10:06.065Z" level=info msg="Workflow update successful" namespace=argo-test phase=Failed resourceVersion=650750 workflow=sleep-jb6ft

Logs from in your workflow's wait container

time="2024-10-07T14:05:57.186Z" level=info msg="Using executor retry strategy" Duration=1s Factor=1.6 Jitter=0.5 Steps=5
time="2024-10-07T14:05:57.186Z" level=info msg="Executor initialized" deadline="2024-10-07 14:07:35 +0000 UTC" includeScriptOutput=false namespace=argo-test podName=sleep-jb6ft-sleep-4036592383 templateName=sleep version="&Version{Version:v3.6.0-rc2,BuildDate:2024-10-03T07:50:54Z,GitCommit:5310c396ee1e9f4e5575f62ef5340a86540d8679,GitTag:v3.6.0-rc2,GitTreeState:clean,GoVersion:go1.23.2,Compiler:gc,Platform:linux/arm64,}"
time="2024-10-07T14:05:57.191Z" level=info msg="Starting deadline monitor"
time="2024-10-07T14:06:38.246Z" level=info msg="Main container completed" error="<nil>"
time="2024-10-07T14:06:38.246Z" level=info msg="No Script output reference in workflow. Capturing script output ignored"
time="2024-10-07T14:06:38.246Z" level=info msg="No output parameters"
time="2024-10-07T14:06:38.246Z" level=info msg="No output artifacts"
time="2024-10-07T14:06:38.258Z" level=info msg="Alloc=9689 TotalAlloc=15638 Sys=22613 NumGC=4 Goroutines=8"
time="2024-10-07T14:06:38.260Z" level=info msg="Deadline monitor stopped"
time="2024-10-07T14:06:38.260Z" level=info msg="stopping progress monitor (context done)" error="context canceled"
@agilgur5 agilgur5 added the area/suspend-resume Suspending and resuming workflows label Oct 7, 2024
@tczhao
Copy link
Member

tczhao commented Oct 24, 2024

This is an edge case that isn't handled in the current code.
the current workflow-controller will only validate the deadline for a suspended workflow if a running suspend node exists.

I believe we need to have a deadline checking logic here and mark workflow phase fail

if woc.ShouldSuspend() {
woc.log.Info("workflow suspended")
return
}

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/suspend-resume Suspending and resuming workflows type/bug
Projects
None yet
Development

No branches or pull requests

3 participants