Add ProcessTerminationResult, for better alignment with POSIX #244

EdSchouten · 2023-03-08T14:57:52Z

POSIX issue 7, 2018 edition, section 3.303 "Process Termination" states
the following:

> There are two kinds of process termination:
>
> 1. Normal termination occurs by a return from main(), when requested
>    with the exit(), _exit(), or _Exit() functions; or when the last
>    thread in the process terminates by returning from its start
>    function, by calling the pthread_exit() function, or through
>    cancellation.
>
> 2. Abnormal termination occurs when requested by the abort() function
>    or when some signals are received.

A parent process can tell these two cases apart. Unfortunately, REv2
currently doesn't allow this, as we only provide an exit code.
Implementations such as Buildbarn tend to set the exit code for abnormal
termination to 128+signum, meaning that if a segmentation fault occurs
(SIGSEGV, signal 11 on Linux), an action terminates with exit code 139.
This tends to be fairly confusing for non-expert users.
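
To illustrate why the 128+signum convention is confusing, here is a minimal sketch of the guesswork a client has to do today. The function name is hypothetical and not part of REv2 or Buildbarn; it only reverses the arithmetic described above.

```python
from typing import Optional


def signal_from_shell_style_exit_code(exit_code: int) -> Optional[int]:
    """Guess the signal number hidden in a 128+signum style exit code.

    Returns None when the code does not look like an encoded signal.
    This is only a heuristic: a process that calls exit(139) itself is
    indistinguishable from one that was killed by signal 11.
    """
    if 128 < exit_code < 256:
        return exit_code - 128
    return None


print(signal_from_shell_style_exit_code(139))  # 11
print(signal_from_shell_style_exit_code(1))    # None
```

The heuristic cannot tell an encoded signal from a genuine exit code in the same range, which is the ambiguity a structured result would remove.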

The goal of this change is to make all of this less confusing. It adds a
new message type named ProcessTerminationResult, which contains the
termination result in a more structured way. It can now be either an
exit code or a signal. If it turns out that other operating systems
provide even more ways, we can add additional oneof cases.

For the exit code in case of normal termination I decided to use an
int64 instead of int32, because POSIX allows you to get the full exit
code back using APIs like waitid(). On ILP64 systems, the exit code may
thus be 64 bits.

For the signal in case of abnormal termination I decided to use a
string, because signal numbers are not covered by POSIX. Everybody
always assumes that 9 is SIGKILL, but POSIX only documents this as a
convention for the command line flags of the kill utility, not for the
actual SIG* constants. Adding an enumeration would also be unwise, as
operating systems are free to add their own signals that are not
covered by POSIX (e.g., SIGINFO on BSD).
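
As a rough illustration of reporting a signal by name rather than by number, the sketch below runs a child process and classifies its termination. It assumes a POSIX-like system, where Python's subprocess module reports death by signal N as a return code of -N. The function name and the ("exited", ...) / ("signaled", ...) tuples are illustrative only and do not mirror the actual message layout in this PR.

```python
import signal
import subprocess
import sys
from typing import Sequence, Tuple, Union


def run_and_classify(argv: Sequence[str]) -> Tuple[str, Union[int, str]]:
    """Run a child process and report how it terminated.

    Returns ("exited", exit_code) for normal termination and
    ("signaled", signal_name) for abnormal termination.
    """
    completed = subprocess.run(argv)
    if completed.returncode < 0:
        # On POSIX, subprocess reports death by signal N as returncode -N.
        signum = -completed.returncode
        try:
            # Let the local platform translate the number into a name.
            name = signal.Signals(signum).name
        except ValueError:
            # Signal not known to Python on this platform.
            name = f"signal {signum}"
        return ("signaled", name)
    return ("exited", completed.returncode)


if __name__ == "__main__":
    normal = [sys.executable, "-c", "import sys; sys.exit(3)"]
    killed = [
        sys.executable,
        "-c",
        "import os, signal; os.kill(os.getpid(), signal.SIGKILL)",
    ]
    print(run_and_classify(normal))
    print(run_and_classify(killed))
```

Because the number-to-name translation happens on the machine that ran the process, the consumer of the result never has to know which numbering that platform uses.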

Fixes: #240


EdSchouten · 2023-03-15T16:01:03Z

Hi @EricBurnett @juergbi,

Could you folks please take a look at the latest version of this PR? I think I have addressed your feedback:

- The exit_code field is no longer marked as deprecated. Instead, each of the oneof arms of ProcessTerminationResult.reason now documents how its value maps to exit_code.
- What @juergbi brought up during yesterday's meeting is correct: on Windows, exit codes may also be exception values. But what's interesting is that they seemingly don't live in a distinct space from exit codes. A process terminating because of an exception will under the hood merely call TerminateProcess() with the exception value as the exit code. This means that the Signaled arm in the oneof will remain exclusively used by POSIX-like systems.

Thanks!
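
As a small worked example of exception values sharing the exit code space on Windows: the NTSTATUS value for an access violation is 0xC0000005, and a process that dies from one typically reports that value as its exit code. The sketch below only shows how the same 32 bits read as an unsigned value and as a signed int32. How a worker maps it onto exit_code is not specified here and is left as an assumption.

```python
import struct

# NTSTATUS value for an access violation (STATUS_ACCESS_VIOLATION).
STATUS_ACCESS_VIOLATION = 0xC0000005


def as_int32(value: int) -> int:
    """Reinterpret an unsigned 32-bit value as a signed 32-bit integer."""
    return struct.unpack("<i", struct.pack("<I", value))[0]


print(STATUS_ACCESS_VIOLATION)            # 3221225477
print(as_int32(STATUS_ACCESS_VIOLATION))  # -1073741819
```

Either way the value is still just an exit code, which is why no separate oneof arm is needed for it.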

EricBurnett · 2023-03-15T17:16:10Z

build/bazel/remote/execution/v2/remote_execution.proto

@@ -1312,6 +1323,57 @@ message OutputSymlink {
  NodeProperties node_properties = 4;
 }

+// The termination result of a process, as reported by the operating
+// system to the parent process (e.g., using wait*() on POSIX-like
+// systems, or GetExitCodeProcess() on Microsoft Windows).


I'd suggest removing or weakening the statement "as reported by the operating system...". I think this structure of exit_code + process_termination_result would actually extend to other concepts in the future.

For example, a pattern I expect to need at some point is "flake due to intermediate on-worker infrastructure": there is a logical test whose result we care about, but also some intermediate scaffolding that's been injected to run it, and we'd like to distinguish flakes due to the scaffolding itself (which don't necessarily imply a problem with the test) from flakes of the test.

(To be clear, I don't want to extend your PR to cover this; that can be a future discussion. I'd just say that the "process termination result" need not be only OS-provided information.)

EricBurnett · 2023-03-15T17:34:30Z

build/bazel/remote/execution/v2/remote_execution.proto

+    string signal_name = 1;
+  }
+
+  oneof reason {


How would you feel about not using a oneof and simply having a set of optional fields? signal_name implying the lack of exit_code makes sense in this particular case, but I'm not sure every future result is going to follow the pattern of being mutually exclusive. For example, if we also provided info from the syslog or Windows Event Log so you could know why the OS killed the process, you'd get both a SIGKILL and some indication that it was killed due to OOM, too many handles, or whatever.

So e.g.

message ProcessTerminationResult {
  //...
  int64 exit_code = 1;
  //...
  string signal_name = 2;
  //...
  string human_readable_explanation_for_signal = 3;
}

With comments detailing the relationship between the various fields.

mostynb · 2023-03-16

Would that open the door to problems with default values? I'm thinking about exit_code in particular.

EricBurnett · 2023-03-16

Hmm, good point. For many fields the empty value is trivially uninteresting (e.g. string signal, empty==missing), but you're right about exit_code.

A few options:

1. Do nothing, since exit_code can be distinguished fairly readily: the only case of concern is when it's 0, where we shouldn't actually need to distinguish whether it's set or not. (Do we really need to provide a ProcessTerminationResult just to provide a 0 exit_code? Given that the top-level field already exists, it seems just as reasonable to omit the whole message entirely in that case. So "0 means unset" for exit_code is just fine, since it introduces no real ambiguity in practice.)

2. Do something special for exit_code, since so far it's the only one that has an ambiguous default. A second "has_exit_code" field would do, for example.

3. If you know a set that will definitely be mutually exclusive, a oneof suffices, and then we can use optional fields for the rest. So e.g. exit_code and signal_name within the oneof, but with the intent for augmenting fields (like the human_readable_explanation_for_signal example) to be optionals outside of it. Which I guess is...what you already have, just with the understanding that maybe not everything will go in the oneof?

4. Make these structs with a single field instead of raw fields, since you can check for the presence of a struct. E.g.

   message ProcessTerminationResult {
     message ExitCode {
       //...
       int64 code = 1;
     }
     ExitCode exit_code = 1;
     message Signal {
       //...
       string name = 1;
       string human_readable_explanation_for_signal = 2;
     }
     Signal signal = 2;
   }

I think I vote for (1): doesn't really feel like there's "real" ambiguity being introduced in this case, even for exit_code.
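
To make option (1) concrete, here is a minimal sketch that models the flat layout with proto3-style defaults, where 0 and the empty string stand in for "unset". The dataclass and the describe() helper are hypothetical stand-ins for generated protobuf code, not the actual API.

```python
from dataclasses import dataclass


@dataclass
class ProcessTerminationResult:
    # Hypothetical flat layout: proto3-style defaults, so 0 and ""
    # cannot be distinguished from "field not set".
    exit_code: int = 0
    signal_name: str = ""


def describe(result: ProcessTerminationResult) -> str:
    """Interpret a result under the "0 means unset" reading of option (1)."""
    if result.signal_name:
        return f"terminated by {result.signal_name}"
    # exit_code == 0 may mean "unset" or "exited with 0"; both readings
    # are treated as a clean exit here.
    return f"exited with code {result.exit_code}"


print(describe(ProcessTerminationResult()))
print(describe(ProcessTerminationResult(exit_code=2)))
print(describe(ProcessTerminationResult(signal_name="SIGKILL")))
```

An empty message and an explicit exit code of 0 produce the same description, which is the sense in which the ambiguity is harmless under this reading.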

bduffany

+1 for this change; our impl currently returns -1 for exit code in this case, mostly as a consequence of using exec.Command(...).Wait() in Golang which returns -1 whenever the process doesn't exit(). It would be nice to surface more details as to whether it's an abort() or something else.

bduffany · 2023-04-12T13:35:43Z

build/bazel/remote/execution/v2/remote_execution.proto

+    // Abnormal process termination occurred.
+    //
+    // On POSIX-like systems, this is achieved by calling abort(), or by
+    // receiving a signal for which no signal handler is installed.


nit: SIGKILL (e.g. OOM-kill) and SIGSTOP (probably less common in practice) in particular can't be handled, so maybe this wording is slightly clearer:

Suggested change

-    // receiving a signal for which no signal handler is installed.
+    // receiving a signal that is not handled by the process.

bduffany · 2024-02-28T17:22:03Z

build/bazel/remote/execution/v2/remote_execution.proto

@@ -1156,6 +1156,17 @@ message ActionResult {
  // The exit code of the command.
  int32 exit_code = 4;

+  // An optional reason the process launched by this action terminated,


Since this is optional, it seems that checking ActionResult.exit_code == 0 is still how the client should check whether an execution was successful or not, rather than inspecting this new field.

To avoid confusion now that this additional field is being added (which also contains exit code information), should the ActionResult.exit_code field be further specified so that it must be non-zero if the action did not succeed?

EdSchouten requested a review from bergsieker as a code owner March 8, 2023 14:57

EdSchouten force-pushed the eschouten/20230308-termination branch 3 times, most recently from 25af301 to e3cce78 Compare March 15, 2023 15:52

EdSchouten force-pushed the eschouten/20230308-termination branch from e3cce78 to 19fd68b Compare March 15, 2023 15:54

EricBurnett requested changes Mar 15, 2023


bduffany reviewed Apr 12, 2023


tjgq mentioned this pull request May 10, 2023

Remote exec is not resilient to remote build farm worker deaths bazelbuild/bazel#18319

Open

bduffany reviewed Feb 28, 2024

