
Task retried in same workdir possibly due to failed trace retrieval #5641

Open · jchorl opened this issue Jan 5, 2025 · 2 comments

jchorl (Contributor) commented Jan 5, 2025

Bug report

I ran Nextflow twice with -resume, and both runs failed. However, one task (call it a) succeeded in the first run, and the second run re-ran that task in the same workdir. This was a surprise because:

  1. The task succeeded, so I would have expected it to be properly cached
  2. If the cache could not be used, I'd expect it to use a new workdir

Steps to reproduce the problem

N/A

Program output

I spelunked through the logs and I believe this bit is the culprit:

Jan-05 17:35:03.985 [TaskFinalizer-3] DEBUG nextflow.processor.TaskHandler - Unable to get task trace record -- cause: 
com.amazonaws.AbortedException: 
	at com.amazonaws.internal.SdkFilterInputStream.abortIfNeeded(SdkFilterInputStream.java:61)
	at com.amazonaws.internal.SdkFilterInputStream.read(SdkFilterInputStream.java:89)
	at java.base/sun.nio.cs.StreamDecoder.readBytes(StreamDecoder.java:350)
	at java.base/sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:393)
	at java.base/sun.nio.cs.StreamDecoder.lockedRead(StreamDecoder.java:217)
	at java.base/sun.nio.cs.StreamDecoder.read(StreamDecoder.java:171)
	at java.base/java.io.InputStreamReader.read(InputStreamReader.java:188)
	at java.base/java.io.BufferedReader.read1(BufferedReader.java:223)
	at java.base/java.io.BufferedReader.implRead(BufferedReader.java:314)
	at java.base/java.io.BufferedReader.read(BufferedReader.java:296)
	at java.base/java.io.Reader.read(Reader.java:265)
	at org.codehaus.groovy.runtime.IOGroovyMethods.getText(IOGroovyMethods.java:862)
	at org.apache.groovy.nio.extensions.NioExtensions.getText(NioExtensions.java:400)
	at nextflow.trace.TraceRecord.parseTraceFile(TraceRecord.groovy:422)
	at org.codehaus.groovy.vmplugin.v8.IndyInterface.fromCache(IndyInterface.java:321)
	at nextflow.processor.TaskHandler.getTraceRecord(TaskHandler.groovy:217)
	at nextflow.cloud.aws.batch.AwsBatchTaskHandler.super$2$getTraceRecord(AwsBatchTaskHandler.groovy)
	at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:103)
	at java.base/java.lang.reflect.Method.invoke(Method.java:580)
	at org.codehaus.groovy.reflection.CachedMethod.invoke(CachedMethod.java:343)
	at groovy.lang.MetaMethod.doMethodInvoke(MetaMethod.java:328)
	at groovy.lang.MetaClassImpl.doInvokeMethod(MetaClassImpl.java:1333)
	at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1088)
	at org.codehaus.groovy.runtime.ScriptBytecodeAdapter.invokeMethodOnSuperN(ScriptBytecodeAdapter.java:146)
	at org.codehaus.groovy.runtime.ScriptBytecodeAdapter.invokeMethodOnSuper0(ScriptBytecodeAdapter.java:166)
	at nextflow.cloud.aws.batch.AwsBatchTaskHandler.getTraceRecord(AwsBatchTaskHandler.groovy:862)
	at nextflow.processor.TaskHandler.safeTraceRecord(TaskHandler.groovy:160)
	at nextflow.Session.notifyTaskComplete(Session.groovy:1067)
	at nextflow.processor.TaskPollingMonitor.finalizeTask(TaskPollingMonitor.groovy:691)
	at nextflow.processor.TaskPollingMonitor.safeFinalizeTask(TaskPollingMonitor.groovy:678)
	at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:103)
	at java.base/java.lang.reflect.Method.invoke(Method.java:580)
	at org.codehaus.groovy.reflection.CachedMethod.invoke(CachedMethod.java:343)
	at groovy.lang.MetaMethod.doMethodInvoke(MetaMethod.java:328)
	at groovy.lang.MetaClassImpl.doInvokeMethod(MetaClassImpl.java:1333)
	at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1088)
	at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1007)
	at org.codehaus.groovy.runtime.InvokerHelper.invokePogoMethod(InvokerHelper.java:645)
	at org.codehaus.groovy.runtime.InvokerHelper.invokeMethod(InvokerHelper.java:628)
	at org.codehaus.groovy.runtime.InvokerHelper.invokeMethodSafe(InvokerHelper.java:82)
	at nextflow.processor.TaskPollingMonitor$_checkTaskStatus_lambda8.doCall(TaskPollingMonitor.groovy:668)
	at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:572)
	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:317)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
	at java.base/java.lang.Thread.run(Thread.java:1583)

Looking through that code path, this makes a bit of sense. Here we can see where the trace record is grabbed and put in the cache:

final trace = handler.safeTraceRecord()
cache.putTaskAsync(handler, trace)

Well, the AbortedException is not one of the exceptions ignored here: https://github.com/nextflow-io/nextflow/blob/cc0163ac0f994f43d6195c4819880a1ac1ad149b/modules/nextflow/src/main/groovy/nextflow/processor/TaskHandler.groovy#L230C54-L239
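For context, the exception handling at that link looks roughly like this (a paraphrased sketch based on the linked getTraceRecord code and the log message quoted further below, so exact details may differ):

// TaskHandler.getTraceRecord, exception handling around the trace-file read (paraphrased sketch)
try {
    record.parseTraceFile(file)          // reads .command.trace from the task workdir
}
catch( NoSuchFileException e ) {
    // a missing trace file is silently ignored
}
catch( IOException e ) {
    log.debug "[WARN] Cannot read trace file: $file -- Cause: ${e.message}"
}

com.amazonaws.AbortedException is a RuntimeException, not an IOException, so neither catch clause applies.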

This means the exception bubbles up and causes safeTraceRecord to return null:
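Again a paraphrased sketch, consistent with the "Unable to get task trace record" line in the log above:

// TaskHandler.safeTraceRecord (paraphrased sketch)
TraceRecord safeTraceRecord() {
    try {
        return getTraceRecord()
    }
    catch( Throwable e ) {
        log.debug "Unable to get task trace record -- cause: ${e.message}", e
        return null
    }
}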

So as best I can tell, failing to read the trace caused the record not to get cached.

That aligns with subsequent log lines:

Jan-05 17:35:03.990 [TaskFinalizer-3] DEBUG nextflow.trace.ReportObserver - WARN: Unable to find trace record for task id=16
Jan-05 17:35:03.990 [TaskFinalizer-3] DEBUG nextflow.trace.TimelineObserver - Profile warn: Unable to find record for task_run with id: 16

It would be super helpful to know why retrieving the trace file failed. I don't see anything in the logs explaining why we got an AbortedException: was it a timeout, exhausted retries, something else? FWIW I have max_error_retry=5 and upload_retry_sleep=10000.
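For reference, a minimal sketch of where those values sit, assuming they correspond to the aws.client settings maxErrorRetry and uploadRetrySleep in nextflow.config:

// nextflow.config (sketch; assumes these map to the aws.client scope)
aws {
    client {
        maxErrorRetry    = 5      // retries for failed AWS API calls
        uploadRetrySleep = 10000  // milliseconds to sleep between upload retry attempts
    }
}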

Unfortunately I cannot tell whether a trace file was staged out. I don't see anything in the logs to indicate it wasn't staged out. However, the next invocation of the task, which ran in the same workdir, did write a .command.trace, and I don't have versioning enabled, so it could have overwritten an existing file, or there may not have been an existing file at all.

Notably, though, I don't see "Cannot read trace file" in my logs, so this almost certainly did not get called:

log.debug "[WARN] Cannot read trace file: $file -- Cause: ${e.message}"

Is it critical for getTraceRecord to populate trace data from the task into the record? Could we just blanket ignore all exceptions when calling parseTraceFile?

Alternatively, could safeTraceRecord return the record on a best-effort basis instead of null? Can it know whether that is safe to do?
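To illustrate, here is a minimal sketch (not a proposed patch) of what a blanket catch around parseTraceFile might look like; anything already collected in the record is kept, and only the trace-file fields are skipped:

// Sketch only: a best-effort variant of the parseTraceFile call in getTraceRecord.
// Catching Exception rather than only NoSuchFileException/IOException would also
// swallow AWS SDK errors such as AbortedException, so the partially populated
// record would still be returned and cached.
try {
    record.parseTraceFile(file)
}
catch( Exception e ) {
    log.debug "[WARN] Cannot read trace file: $file -- Cause: ${e.message}"
}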

Thanks for looking!

Environment

Running on Batch with workdir in S3.

  • Nextflow version: 24.10.3
  • Java version: 21 (I think)
  • Operating system: Linux
  • Bash version: N/A

Additional context

N/A

pditommaso (Member) commented:

Very likely this is related to PR #5600.

bentsherman (Member) commented:

I think #5600 will at least prevent the retry from using the same task directory, but it is still a mystery why retrieving the trace record failed.

It could be that the task failed to write the .command.trace, but I suspect you would have seen something in the task error log in that case. It would be good to figure out the root cause before we add any logic to ignore the error, which will be easier to do once #5600 is merged.
