
Remote exec is not resilient to remote build farm worker deaths #18319

Open
Tracked by #19904
jmmv opened this issue May 4, 2023 · 9 comments
Labels
team-Remote-Exec Issues and PRs for the Execution (Remote) team type: bug

Comments

@jmmv
Contributor

jmmv commented May 4, 2023

Description of the bug:

We are observing random build breakages when our deployment of Build Barn suffers from instability.

To further diagnose this, I've been injecting manual failures into our Build Barn (by forcibly terminating individual server processes while a build is ongoing) to see how Bazel reacts. In the vast majority of cases, Bazel correctly detects the failure and retries the failed action or falls back to local execution (provided --remote_local_fallback is enabled).

However, there seems to be one case from which Bazel cannot recover. If I terminate a remote Build Barn worker while a long-running action is being executed on it, Bazel immediately fails the build with:

ERROR: .../test/BUILD:1:8: Executing genrule //test:foo failed: (Exit -1): bash failed: error executing command (from target //test:foo) /bin/bash -c 'source external/bazel_tools/tools/genrule/genrule-setup.sh; sleep 120; touch bazel-out/aarch64-fastbuild/bin/test/foo.txt'
Action details (uncached result): blobs/sha256/historical_execute_response/50faea922bc3495d9e18c3928b471949651b5adc66243a8e3f2790c02d72c434-814/

Neither remote retries nor the local fallback kick in, and the build fails for what should be a retryable error.

I'm not sure whether this is something that ought to be fixed at the Bazel level or at the Build Barn level. I suspect this is a Build Barn issue based on what I'll explain below, but I'm not completely sure whether that's the case or whether Bazel could do better, which is why I'm starting here. cc @EdSchouten

What's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.

  1. Create a genrule that sleeps for 2 minutes (sleep 120 ; touch $@).
  2. Configure Bazel to use Build Barn and set --remote_local_fallback.
  3. Build the genrule.
  4. Kill the worker process while the genrule is executing.

Observe Bazel immediately abort the build as described above.
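For reference, the genrule from step 1 could look like this in test/BUILD (a minimal sketch matching the //test:foo target and foo.txt output seen in the error message above):

    genrule(
        name = "foo",
        outs = ["foo.txt"],
        cmd = "sleep 120; touch $@",
    )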

Which operating system are you running Bazel on?

N/A

What is the output of bazel info release?

bazel-6.1.1

If bazel info release returns development version or (@non-git), tell us how you built Bazel.

No response

What's the output of git remote get-url origin; git rev-parse master; git rev-parse HEAD ?

No response

Have you found anything relevant by searching the web?

No response

Any other information, logs, or outputs that you want to share?

One detail that caught my attention in the above error message is Exit -1. Process exit codes are 8 bits and should not be negative. And if we try to run a genrule that does exit -1, Bazel will (correctly) claim (Exit 255) in the failure.

But Bazel (and the RBE protocol) use 32-bit signed integers to propagate exit codes. So I suspect the -1 is coming from Build Barn itself or from the Bazel remote execution code – and if that’s true, we could potentially use this to discern between real action failures and infrastructure failures, and implement retries or local execution fallback.


Bazel’s remote execution module seems to propagate the ExitCode of the remote worker verbatim in the SpawnResult instances, so I do not think Bazel is generating this fictitious -1. It has to come from Build Barn.

The only place where I see this -1 originating is in the pkg/runner/local_runner.go file in bb-remote-execution, where the result of cmd.ProcessState.ExitCode() is propagated as the exit code of the action result. The Go documentation has the following to say about this function:

    ExitCode returns the exit code of the exited process, or -1 if the process hasn't exited or was terminated by a signal.

So, there is the -1 right there.

However… this is in the “local runner” code, which I understand runs within the worker. Forcibly terminating the worker process wouldn’t give the worker code a chance to produce and return the -1, unless the worker caught a graceful termination signal and tried to do something with it. But the bb-remote-execution code does not seem to catch any signal… so this is very strange.


This all means that we cannot use -1 as a discriminating factor for worker deaths. A genrule that does kill -9 $$ is also reported as Exit -1 from the Build Farm due to the above explanation from Go. (But, interestingly, running this same genrule locally reports Killed instead of Exit -1 because, well, the code in Bazel is doing the right thing regarding exit codes—unlike Go’s simplistic view of the world.)
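For illustration, here is a minimal Go sketch (not Build Barn code; it only exercises the standard os/exec behavior quoted above) showing that a child killed by a signal yields -1 from ExitCode():

    package main

    import (
        "fmt"
        "os/exec"
    )

    func main() {
        // The child shell kills itself with SIGKILL, mimicking the `kill -9 $$` genrule.
        cmd := exec.Command("sh", "-c", "kill -9 $$")
        err := cmd.Run() // returns an *exec.ExitError because the child was signaled

        fmt.Println("error:", err)                             // prints "signal: killed"
        fmt.Println("exit code:", cmd.ProcessState.ExitCode()) // prints -1, indistinguishable from a vanished worker
    }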

@sgowroji sgowroji added the team-Remote-Exec Issues and PRs for the Execution (Remote) team label May 5, 2023
@wilwell wilwell removed the untriaged label May 9, 2023
@tjgq
Contributor

tjgq commented May 10, 2023

I think our stance is that Bazel should fall back on transient or infrastructure failures, and fail otherwise. In order for that to happen, the remote protocol must be able to distinguish the two kinds of failure. bazelbuild/remote-apis#244 seems relevant.

@jmmv
Contributor Author

jmmv commented May 10, 2023

bazelbuild/remote-apis#244 would not fix this. If I'm reading it correctly, that change distinguishes between process exit codes and signals, which is a fine thing to do but is unrelated to this problem. For example: if an action invokes the compiler and the compiler crashes, it is good to report that to the user as a crash instead of a numeric exit code (graceful exit), but retrying a compiler failure is likely going to result in another crash. However, if the action failed because the worker crashed (maybe the worker timed out saving the output to the cache, and the timeout isn't gracefully handled in the worker code), then that is an infrastructure failure and a retry is warranted because the retry might work.

There is a fine line between the two though. For example, consider what happens if the remote workers have tight memory constraints. In such a situation, running a very memory-hungry compiler action on them could result in the compiler being killed. While this is not the compiler's fault (it's the worker who decided to kill the process), it's still not an infrastructure failure because a retry will result in the same crash.

@tjgq
Contributor

tjgq commented May 11, 2023

Yes, I agree that bazelbuild/remote-apis#244 as it currently stands isn't sufficient; I was saying more generally that the process termination reason must contain enough information for Bazel to decide whether to retry and/or fall back.

jmmv added a commit to Snowflake-Labs/bazel that referenced this issue May 24, 2023
This adds logic to treat remote actions that terminate due to a signal as
retryable errors, assuming that such terminations are caused by a worker
crash.

Because this is a hack to paper over a current Remote Build deficiency,
and because this heuristic may be wrong, this feature is hidden behind a
new --experimental_remote_exit_signals_are_transient_errors flag.

Mitigates bazelbuild#18319.
@jmmv
Contributor Author

jmmv commented May 24, 2023

What would you think about something like Snowflake-Labs@c3e8c47 to deal with this until the Remote Build protocol and its implementations are adjusted to handle it properly?

@tjgq tjgq mentioned this issue Oct 20, 2023
sfc-gh-mgalindo added a commit to Snowflake-Labs/bazel that referenced this issue Jun 10, 2024
    This adds logic to treat remote actions that terminate due to a signal as
    retryable errors, assuming that such terminations are caused by a worker
    crash.

    Because this is a hack to paper over a current Remote Build deficiency,
    and because this heuristic may be wrong, this feature is hidden behind a
    new --snowflake_remote_exit_signals_are_transient_errors flag.

    Mitigates bazelbuild#18319.

Author: Julio Merino <[email protected]>
Date:   Wed May 24 07:12:43 2023 -0700

Thank you for contributing to the Bazel repository! This issue has been marked as stale since it has not had any activity in the last 1+ years. It will be closed in the next 90 days unless any other activity occurs. If you think this issue is still relevant and should stay open, please post any comment here and the issue will no longer be marked as stale.

@github-actions github-actions bot added the stale Issues or PRs that are stale (no activity for 30 days) label Jul 28, 2024
@EdSchouten
Contributor

This all means that we cannot use -1 as a discriminating factor for worker deaths.

And this is intentional: if a worker dies, you should simply make sure to set ExecuteResponse.status to something meaningful.
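For illustration only, a minimal Go sketch (not actual Buildbarn code; completeAction and runnerDied are hypothetical) of what reporting a worker death through ExecuteResponse.status could look like, using the remote-apis Go bindings:

    package worker

    import (
        rpcstatus "google.golang.org/genproto/googleapis/rpc/status"
        "google.golang.org/grpc/codes"

        remoteexecution "github.com/bazelbuild/remote-apis/build/bazel/remote/execution/v2"
    )

    // completeAction (hypothetical) builds the ExecuteResponse for a finished action.
    // If the runner died underneath the action, a non-OK status marks the failure as
    // an infrastructure error that clients such as Bazel can retry, instead of a
    // fabricated exit code of -1 in the ActionResult.
    func completeAction(result *remoteexecution.ActionResult, runnerDied bool) *remoteexecution.ExecuteResponse {
        if runnerDied {
            return &remoteexecution.ExecuteResponse{
                Status: &rpcstatus.Status{
                    Code:    int32(codes.Unavailable),
                    Message: "worker terminated while executing the action",
                },
            }
        }
        return &remoteexecution.ExecuteResponse{Result: result}
    }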

@jmmv
Contributor Author

jmmv commented Aug 2, 2024

Because if a worker dies, you should simply make sure to set ExecuteResponse.status to something meaningful.

Who is "you" here?

@EdSchouten
Contributor

Your infrastructure. Buildbarn, Buildfarm, Buildgrid, whatever you're using.

@EdSchouten
Contributor

EdSchouten commented Aug 2, 2024

If I terminate a remote Build Barn worker while a long-running action is being executed on it, Bazel will immediately fail the build with [...]

Question: how are you terminating this worker? Make sure you only send a termination signal to bb_worker/bb_runner itself. Do NOT send it to any of the processes that bb_runner spawns that belong to the build action. If you are using tini to launch bb_runner, do NOT pass in the -g flag:

buildbarn/bb-deployments@e000c76

@github-actions github-actions bot removed the stale Issues or PRs that are stale (no activity for 30 days) label Aug 3, 2024