Bug report
I ran Nextflow twice with `-resume`, and both runs failed. However, the first run had a task `a` succeed, and then the second run re-ran `a` in the same workdir. This was a surprise because:

- The task succeeded, so I would have expected it to be properly cached.
- If the cache could not be used, I'd expect it to use a new workdir.
Steps to reproduce the problem
N/A
Program output
I spelunked through the logs and I believe this bit is the culprit:
```
Jan-05 17:35:03.985 [TaskFinalizer-3] DEBUG nextflow.processor.TaskHandler - Unable to get task trace record -- cause:
com.amazonaws.AbortedException:
at com.amazonaws.internal.SdkFilterInputStream.abortIfNeeded(SdkFilterInputStream.java:61)
at com.amazonaws.internal.SdkFilterInputStream.read(SdkFilterInputStream.java:89)
at java.base/sun.nio.cs.StreamDecoder.readBytes(StreamDecoder.java:350)
at java.base/sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:393)
at java.base/sun.nio.cs.StreamDecoder.lockedRead(StreamDecoder.java:217)
at java.base/sun.nio.cs.StreamDecoder.read(StreamDecoder.java:171)
at java.base/java.io.InputStreamReader.read(InputStreamReader.java:188)
at java.base/java.io.BufferedReader.read1(BufferedReader.java:223)
at java.base/java.io.BufferedReader.implRead(BufferedReader.java:314)
at java.base/java.io.BufferedReader.read(BufferedReader.java:296)
at java.base/java.io.Reader.read(Reader.java:265)
at org.codehaus.groovy.runtime.IOGroovyMethods.getText(IOGroovyMethods.java:862)
at org.apache.groovy.nio.extensions.NioExtensions.getText(NioExtensions.java:400)
at nextflow.trace.TraceRecord.parseTraceFile(TraceRecord.groovy:422)
at org.codehaus.groovy.vmplugin.v8.IndyInterface.fromCache(IndyInterface.java:321)
at nextflow.processor.TaskHandler.getTraceRecord(TaskHandler.groovy:217)
at nextflow.cloud.aws.batch.AwsBatchTaskHandler.super$2$getTraceRecord(AwsBatchTaskHandler.groovy)
at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:103)
at java.base/java.lang.reflect.Method.invoke(Method.java:580)
at org.codehaus.groovy.reflection.CachedMethod.invoke(CachedMethod.java:343)
at groovy.lang.MetaMethod.doMethodInvoke(MetaMethod.java:328)
at groovy.lang.MetaClassImpl.doInvokeMethod(MetaClassImpl.java:1333)
at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1088)
at org.codehaus.groovy.runtime.ScriptBytecodeAdapter.invokeMethodOnSuperN(ScriptBytecodeAdapter.java:146)
at org.codehaus.groovy.runtime.ScriptBytecodeAdapter.invokeMethodOnSuper0(ScriptBytecodeAdapter.java:166)
at nextflow.cloud.aws.batch.AwsBatchTaskHandler.getTraceRecord(AwsBatchTaskHandler.groovy:862)
at nextflow.processor.TaskHandler.safeTraceRecord(TaskHandler.groovy:160)
at nextflow.Session.notifyTaskComplete(Session.groovy:1067)
at nextflow.processor.TaskPollingMonitor.finalizeTask(TaskPollingMonitor.groovy:691)
at nextflow.processor.TaskPollingMonitor.safeFinalizeTask(TaskPollingMonitor.groovy:678)
at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:103)
at java.base/java.lang.reflect.Method.invoke(Method.java:580)
at org.codehaus.groovy.reflection.CachedMethod.invoke(CachedMethod.java:343)
at groovy.lang.MetaMethod.doMethodInvoke(MetaMethod.java:328)
at groovy.lang.MetaClassImpl.doInvokeMethod(MetaClassImpl.java:1333)
at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1088)
at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1007)
at org.codehaus.groovy.runtime.InvokerHelper.invokePogoMethod(InvokerHelper.java:645)
at org.codehaus.groovy.runtime.InvokerHelper.invokeMethod(InvokerHelper.java:628)
at org.codehaus.groovy.runtime.InvokerHelper.invokeMethodSafe(InvokerHelper.java:82)
at nextflow.processor.TaskPollingMonitor$_checkTaskStatus_lambda8.doCall(TaskPollingMonitor.groovy:668)
at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:572)
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:317)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
at java.base/java.lang.Thread.run(Thread.java:1583)
```
Looking through that codepath, this makes a bit of sense. Here we can see where it grabs the trace record and puts it in the cache: nextflow/modules/nextflow/src/main/groovy/nextflow/Session.groovy, lines 1069 to 1070 at commit cc0163a.
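For readers without the source open, that step looks roughly like this; this is a paraphrase from memory of `Session.notifyTaskComplete`, not a verbatim copy of the lines referenced above:

```groovy
// Paraphrased sketch of nextflow.Session#notifyTaskComplete (not verbatim):
// the handler's trace record is fetched "safely", then persisted to the
// cache DB used by -resume; a null record is what breaks caching here
void notifyTaskComplete( TaskHandler handler ) {
    final trace = handler.safeTraceRecord()   // returns null on any error
    cache.putTaskAsync(handler, trace)        // store the record for -resume
}
```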
The `AbortedException` is not one of the exceptions ignored there: https://github.com/nextflow-io/nextflow/blob/cc0163ac0f994f43d6195c4819880a1ac1ad149b/modules/nextflow/src/main/groovy/nextflow/processor/TaskHandler.groovy#L230C54-L239, which means it would bubble up and cause `safeTraceRecord` (nextflow/modules/nextflow/src/main/groovy/nextflow/processor/TaskHandler.groovy, line 179 at commit cc0163a) to return `null`.

So as best I can tell, failing to read the trace caused the record not to get cached.
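For reference, `safeTraceRecord` is shaped roughly like this (again paraphrased rather than verbatim, though the DEBUG message matches the log output above):

```groovy
// Paraphrased sketch of TaskHandler#safeTraceRecord: any Throwable raised
// while building the record -- including com.amazonaws.AbortedException,
// which is a RuntimeException -- is logged at DEBUG and swallowed, and the
// caller gets null instead of a trace record
TraceRecord safeTraceRecord() {
    try {
        return getTraceRecord()
    }
    catch( Throwable e ) {
        log.debug "Unable to get task trace record -- cause: ${e.message ?: e}", e
        return null
    }
}
```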
That aligns with subsequent log lines:
```
Jan-05 17:35:03.990 [TaskFinalizer-3] DEBUG nextflow.trace.ReportObserver - WARN: Unable to find trace record for task id=16
Jan-05 17:35:03.990 [TaskFinalizer-3] DEBUG nextflow.trace.TimelineObserver - Profile warn: Unable to find record for task_run with id: 16
```
It would be super helpful to know why retrieving the trace file failed. I don't see anything in the logs about why we got an `AbortedException` - was it a timeout, retries exhausted, something else? FWIW I have `max_error_retry = 5` and `upload_retry_sleep = 10000`.
Unfortunately I cannot tell whether a trace file was staged out; I don't see anything in the logs to indicate it wasn't. However, the next invocation of the task that ran in the same workdir did write a `.command.trace`, and I don't have versioning enabled, so it could have overwritten an existing file, or there may never have been one.
Notably though, I don't see `Cannot read trace file` in my logs, so almost certainly the warning at nextflow/modules/nextflow/src/main/groovy/nextflow/processor/TaskHandler.groovy, line 238 at commit cc0163a, did not get logged:
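My reconstruction of that parse step inside `getTraceRecord` (a paraphrase, not verbatim) only handles the two checked I/O cases:

```groovy
// Paraphrased fragment of the parse step in TaskHandler#getTraceRecord:
// only NoSuchFileException and IOException are caught locally, so an
// AbortedException (an unchecked RuntimeException from the AWS SDK)
// skips both handlers and propagates up to safeTraceRecord's catch-all
try {
    record.parseTraceFile(file)
}
catch( NoSuchFileException e ) {
    // the trace file may legitimately be missing -- silently ignored
}
catch( IOException e ) {
    log.debug "[WARN] Cannot read trace file: $file -- Cause: ${e.message}"
}
```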
Is it critical for `getTraceRecord` to populate trace data from the task into the record? Could we just blanket-ignore all exceptions when calling `parseTraceFile`?
Alternatively, could `safeTraceRecord` best-effort return the record instead of `null`? Can it know when it is safe to do so?
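To make the blanket-ignore idea concrete, this is the kind of change I mean; it's a hypothetical sketch of the suggestion, not existing Nextflow code:

```groovy
// Hypothetical: widen the catch so a failed trace-file read degrades the
// record (missing runtime metrics) instead of discarding it entirely and
// losing the cache entry for an otherwise-successful task
try {
    record.parseTraceFile(file)
}
catch( Throwable e ) {
    log.debug "[WARN] Cannot read trace file: $file -- Cause: ${e.message ?: e}"
}
```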
Thanks for looking!
Environment
Running on AWS Batch with the workdir in S3.
Nextflow version: 24.10.3
Java version: 21 (I think)
Operating system: Linux
Bash version: N/A
Additional context
N/A
I think #5600 will at least prevent the retry from using the same task directory, but it's still a mystery why the trace record failed.
Could be the task failed to write the `.command.trace`, but I suspect you would have seen something in the task error log in that case. It would be good to figure out the root cause before we add any logic to ignore the error, which will be easier to do once #5600 is merged.