
Task retried in same workdir possibly due to failed trace retrieval #5641

Open · jchorl opened this issue Jan 5, 2025 · 2 comments

jchorl (Contributor) commented Jan 5, 2025

Bug report

I ran Nextflow twice with -resume, and both runs failed. However, one task (call it a) succeeded in the first run, and the second run re-ran that task in the same workdir. This was a surprise because:

  1. The task succeeded, so I would have expected it to be properly cached
  2. If the cache could not be used, I'd expect it to use a new workdir

Steps to reproduce the problem

N/A

Program output

I spelunked through the logs and I believe this bit is the culprit:

Jan-05 17:35:03.985 [TaskFinalizer-3] DEBUG nextflow.processor.TaskHandler - Unable to get task trace record -- cause: 
com.amazonaws.AbortedException: 
	at com.amazonaws.internal.SdkFilterInputStream.abortIfNeeded(SdkFilterInputStream.java:61)
	at com.amazonaws.internal.SdkFilterInputStream.read(SdkFilterInputStream.java:89)
	at java.base/sun.nio.cs.StreamDecoder.readBytes(StreamDecoder.java:350)
	at java.base/sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:393)
	at java.base/sun.nio.cs.StreamDecoder.lockedRead(StreamDecoder.java:217)
	at java.base/sun.nio.cs.StreamDecoder.read(StreamDecoder.java:171)
	at java.base/java.io.InputStreamReader.read(InputStreamReader.java:188)
	at java.base/java.io.BufferedReader.read1(BufferedReader.java:223)
	at java.base/java.io.BufferedReader.implRead(BufferedReader.java:314)
	at java.base/java.io.BufferedReader.read(BufferedReader.java:296)
	at java.base/java.io.Reader.read(Reader.java:265)
	at org.codehaus.groovy.runtime.IOGroovyMethods.getText(IOGroovyMethods.java:862)
	at org.apache.groovy.nio.extensions.NioExtensions.getText(NioExtensions.java:400)
	at nextflow.trace.TraceRecord.parseTraceFile(TraceRecord.groovy:422)
	at org.codehaus.groovy.vmplugin.v8.IndyInterface.fromCache(IndyInterface.java:321)
	at nextflow.processor.TaskHandler.getTraceRecord(TaskHandler.groovy:217)
	at nextflow.cloud.aws.batch.AwsBatchTaskHandler.super$2$getTraceRecord(AwsBatchTaskHandler.groovy)
	at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:103)
	at java.base/java.lang.reflect.Method.invoke(Method.java:580)
	at org.codehaus.groovy.reflection.CachedMethod.invoke(CachedMethod.java:343)
	at groovy.lang.MetaMethod.doMethodInvoke(MetaMethod.java:328)
	at groovy.lang.MetaClassImpl.doInvokeMethod(MetaClassImpl.java:1333)
	at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1088)
	at org.codehaus.groovy.runtime.ScriptBytecodeAdapter.invokeMethodOnSuperN(ScriptBytecodeAdapter.java:146)
	at org.codehaus.groovy.runtime.ScriptBytecodeAdapter.invokeMethodOnSuper0(ScriptBytecodeAdapter.java:166)
	at nextflow.cloud.aws.batch.AwsBatchTaskHandler.getTraceRecord(AwsBatchTaskHandler.groovy:862)
	at nextflow.processor.TaskHandler.safeTraceRecord(TaskHandler.groovy:160)
	at nextflow.Session.notifyTaskComplete(Session.groovy:1067)
	at nextflow.processor.TaskPollingMonitor.finalizeTask(TaskPollingMonitor.groovy:691)
	at nextflow.processor.TaskPollingMonitor.safeFinalizeTask(TaskPollingMonitor.groovy:678)
	at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:103)
	at java.base/java.lang.reflect.Method.invoke(Method.java:580)
	at org.codehaus.groovy.reflection.CachedMethod.invoke(CachedMethod.java:343)
	at groovy.lang.MetaMethod.doMethodInvoke(MetaMethod.java:328)
	at groovy.lang.MetaClassImpl.doInvokeMethod(MetaClassImpl.java:1333)
	at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1088)
	at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1007)
	at org.codehaus.groovy.runtime.InvokerHelper.invokePogoMethod(InvokerHelper.java:645)
	at org.codehaus.groovy.runtime.InvokerHelper.invokeMethod(InvokerHelper.java:628)
	at org.codehaus.groovy.runtime.InvokerHelper.invokeMethodSafe(InvokerHelper.java:82)
	at nextflow.processor.TaskPollingMonitor$_checkTaskStatus_lambda8.doCall(TaskPollingMonitor.groovy:668)
	at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:572)
	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:317)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
	at java.base/java.lang.Thread.run(Thread.java:1583)

Looking through that code path, this makes a bit of sense. Here we can see where the trace record is grabbed and put in the cache:

final trace = handler.safeTraceRecord()
cache.putTaskAsync(handler, trace)

Well, the AbortedException is not one of the exceptions ignored here: https://github.com/nextflow-io/nextflow/blob/cc0163ac0f994f43d6195c4819880a1ac1ad149b/modules/nextflow/src/main/groovy/nextflow/processor/TaskHandler.groovy#L230C54-L239
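For context, the exception handling at that link looks roughly like this (a paraphrased sketch based on the linked getTraceRecord code and the log message quoted further below, so exact details may differ):

// TaskHandler.getTraceRecord, exception handling around the trace-file read (paraphrased sketch)
try {
    record.parseTraceFile(file)          // reads .command.trace from the task workdir
}
catch( NoSuchFileException e ) {
    // a missing trace file is silently ignored
}
catch( IOException e ) {
    log.debug "[WARN] Cannot read trace file: $file -- Cause: ${e.message}"
}

com.amazonaws.AbortedException is a RuntimeException, not an IOException, so neither catch clause applies.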

This means the exception bubbles up and causes safeTraceRecord to return null:
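Again a paraphrased sketch, consistent with the "Unable to get task trace record" line in the log above:

// TaskHandler.safeTraceRecord (paraphrased sketch)
TraceRecord safeTraceRecord() {
    try {
        return getTraceRecord()
    }
    catch( Throwable e ) {
        log.debug "Unable to get task trace record -- cause: ${e.message}", e
        return null
    }
}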

So as best I can tell, failing to read the trace caused the record not to get cached.

That aligns with subsequent log lines:

Jan-05 17:35:03.990 [TaskFinalizer-3] DEBUG nextflow.trace.ReportObserver - WARN: Unable to find trace record for task id=16
Jan-05 17:35:03.990 [TaskFinalizer-3] DEBUG nextflow.trace.TimelineObserver - Profile warn: Unable to find record for task_run with id: 16

It would be super helpful to know why retrieving the trace file failed. I don't see anything in the logs explaining why we got an AbortedException: was it a timeout, exhausted retries, something else? FWIW I have max_error_retry=5 and upload_retry_sleep=10000.
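For reference, a minimal sketch of where those values sit, assuming they correspond to the aws.client settings maxErrorRetry and uploadRetrySleep in nextflow.config:

// nextflow.config (sketch; assumes these map to the aws.client scope)
aws {
    client {
        maxErrorRetry    = 5      // retries for failed AWS API calls
        uploadRetrySleep = 10000  // milliseconds to sleep between upload retry attempts
    }
}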

Unfortunately I cannot tell whether a trace file was staged out. I don't see anything in the logs to indicate it wasn't staged out. However, the next invocation of the task, which ran in the same workdir, did write a .command.trace, and I don't have versioning enabled, so it could have overwritten an existing file, or there may not have been an existing file at all.

Notably, though, I don't see "Cannot read trace file" in my logs, so this almost certainly did not get called:

log.debug "[WARN] Cannot read trace file: $file -- Cause: ${e.message}"

Is it critical for getTraceRecord to populate trace data from the task into the record? Could we just blanket ignore all exceptions when calling parseTraceFile?

Alternatively, could safeTraceRecord return the record on a best-effort basis instead of null? Can it know whether that is safe to do?
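To illustrate, here is a minimal sketch (not a proposed patch) of what a blanket catch around parseTraceFile might look like; anything already collected in the record is kept, and only the trace-file fields are skipped:

// Sketch only: a best-effort variant of the parseTraceFile call in getTraceRecord.
// Catching Exception rather than only NoSuchFileException/IOException would also
// swallow AWS SDK errors such as AbortedException, so the partially populated
// record would still be returned and cached.
try {
    record.parseTraceFile(file)
}
catch( Exception e ) {
    log.debug "[WARN] Cannot read trace file: $file -- Cause: ${e.message}"
}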

Thanks for looking!

Environment

Running on Batch with workdir in S3.

  • Nextflow version: 24.10.3
  • Java version: 21 (I think)
  • Operating system: Linux
  • Bash version: N/A

Additional context

N/A

pditommaso (Member) commented:

Very likely this is related to PR #5600.

bentsherman (Member) commented:

I think #5600 will at least prevent the retry from using the same task directory, but it is still a mystery why retrieving the trace record failed.

It could be that the task failed to write the .command.trace, but I suspect you would have seen something in the task error log in that case. It would be good to figure out the root cause before we add any logic to ignore the error, which will be easier to do once #5600 is merged.
