Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

When network bootstrap fails, job runs in duplicate #1767

Open
elliottslaughter opened this issue Oct 4, 2024 · 16 comments
Open

When network bootstrap fails, job runs in duplicate #1767

elliottslaughter opened this issue Oct 4, 2024 · 16 comments

Comments

@elliottslaughter
Copy link
Contributor

I built Legion with UCX. For some reason the network bootstrap is failing, but my job still runs:

$ srun ./build/bin/circuit 
/scratch/eslaught/legion-test-ucx/runtime/realm/ucx/bootstrap/bootstrap_loader.cc:60: NULL value Bootstrap unable to load 'realm_ucp_bootstr
ap_mpi.so'                                                            
        realm_ucp_bootstrap_mpi.so: cannot open shared object file: No such file or directory
/scratch/eslaught/legion-test-ucx/runtime/realm/ucx/bootstrap/bootstrap_loader.cc:60: NULL value Bootstrap unable to load 'realm_ucp_bootstr
ap_mpi.so'                                                                                                                                  
        realm_ucp_bootstrap_mpi.so: cannot open shared object file: No such file or directory
[0 - 7fc5a97da3c0]    0.000000 {5}{ucp}: bootstrap_loader_init failed                                                                       
[0 - 7fc5a97da3c0]    0.000000 {5}{ucp}: failed to bootstrap ucp
[0 - 7fc5a97da3c0]    0.000000 {5}{ucp}: failed to create UCP network module
[0 - 7fc5820a6c80]    0.234238 {3}{circuit}: circuit settings: loops=2 pieces=4 nodes/piece=1024 wires/piece=1024 pct_in_piece=95 seed=12345
[0 - 7fc5820a6c80]    0.249111 {3}{circuit}: Initializing circuit simulation...
[0 - 7fc5820a6c80]    0.415783 {3}{circuit}: Finished initializing simulation...     
Starting main simulation loop                                         
SUCCESS!                                                              
ELAPSED TIME =   1.994 s                                                                                                                    
GFLOPS =   3.944 GFLOPS                                                                                                                     
[0 - 7fc5820a6c80]    2.410444 {3}{circuit}: simulation complete - destroying regions
[0 - 7fee552c43c0]    0.000000 {5}{ucp}: bootstrap_loader_init failed                                                                       
[0 - 7fee552c43c0]    0.000000 {5}{ucp}: failed to bootstrap ucp
[0 - 7fee552c43c0]    0.000000 {5}{ucp}: failed to create UCP network module
[0 - 7fee2db90c80]    0.238475 {3}{circuit}: circuit settings: loops=2 pieces=4 nodes/piece=1024 wires/piece=1024 pct_in_piece=95 seed=12345
[0 - 7fee2db90c80]    0.253128 {3}{circuit}: Initializing circuit simulation...
[0 - 7fee2db90c80]    0.424531 {3}{circuit}: Finished initializing simulation...     
Starting main simulation loop                                         
SUCCESS!
ELAPSED TIME =   2.556 s
GFLOPS =   3.077 GFLOPS
[0 - 7fee2db90c80]    2.981113 {3}{circuit}: simulation complete - destroying regions

I'm not sure this is the behavior we want. If the user requested networking and it fails to load (for any reason), we should fail hard and fast, and not continue to run anyway.

@elliottslaughter
Copy link
Contributor Author

Just FYI, we've seen this in the wild. @syamajala installed cuNumeric with GASNet from the Conda packages, but the GASNet wrapper was not included, and instead of getting a hard error it was just a warning, which we initially missed in our testing. We spent some time chasing down an OOM condition that turned out to be because we were running the job in duplicate, which was ultimately a waste of time since the memory usage would have been fine if we'd known the network had failed to initialize.

CC @manopapad

@eddy16112
Copy link
Contributor

It is possible that the application is built with multiple network modules, so even if ucx is failed, we may fall back to try other networks.

@elliottslaughter
Copy link
Contributor Author

Could we do something like keep two counters:

int num_networks_attempted = 0;
int num_networks_initialized = 0;

And if num_networks_attempted > 0 && num_networks_initialized == 0 then throw an error?

Basically if we attempt any networks, at least one should succeed.

@eddy16112
Copy link
Contributor

Yeah, that is doable. @muraj @apryakhin Let me know what do you think.

@elliottslaughter
Copy link
Contributor Author

The main thing we'll lose if we do this is omnibus builds, where you build every possible option at build time and then enable only what is available at runtime.

If that's a goal, perhaps we could add a flag like:

-ll:networks <N>

And then the condition becomes num_networks_initialized >= N. That is, the flag essentially says "Please make sure we successfully initialize at least this many networks."

Depending on what we expect the most common use case to be, we can pick the default to be either:

  1. 0: i.e., never fail by default even if we built with a network
  2. 1 if we built with at least one network, otherwise 0: so that we catch failures where the user expected to have a network but it didn't work

I'd personally expect omnibus builds to not be the most common deployment option for Legion, but I could be convinced otherwise if other people have opinions.

@manopapad
Copy link
Contributor

There is an option -ll:networks already, that expects a list of networks to try in order, or the special "none".

IMHO if Realm was built with any network, then at runtime it should try all its built-in networks until it finds one that works. If no network works, then it should fail w/o falling back to non-networked execution (i.e. the "none" network option should not be considered by default, unless explicitly passed by the user in -ll:networks).

FWIW in Legate we're currently doing separate builds for UCX and GASNetEx, but we'd like to be doing omnibus builds. We are prepared to pass -ll:networks none explicitly when doing single-node runs (and can leave -ll:networks undefined in multi-node runs).

@elliottslaughter
Copy link
Contributor Author

I would be fine with @manopapad's proposal.

@eddy16112
Copy link
Contributor

The question is should we fall back to the none network option? Currently, we always fall back without throw a failure.

@lightsighter
Copy link
Contributor

So I complained about this to @streichler and @SeyedMir a while ago. The reason that was given to me before was that Realm wanted to try and make progress and run even if modules like UCX or GASNet didn't load correctly, that way a binary could be portable to multiple machines. I'm not sure if that is a property that we still want to maintain or not.

@elliottslaughter
Copy link
Contributor Author

That is exactly what I was talking about with omnibus builds.

Pros:

  • With an omnibus build, you can build every possible option and run everywhere. If the network fails it just doesn't load.

Cons:

To me, I think it's worthwhile to assert by default with the option to fall back to -ll:networks none explicitly if the user knows there are no networks they want to use. In practice, omnibus builds are most common in projects like Legate that wrap Legion themselves and can access higher-level information about what's expected (e.g., whether a network is really present or not). I don't think average Legion users are going to switch to omnibus builds, so in the most common case we're just making it harder for them to get reliable errors about what's going on.

So my proposal is that (unless the user explicitly passes -ll:networks none) we always assert if Realm attempts at least one network, and zero networks initialize successfully.

@muraj
Copy link

muraj commented Oct 7, 2024

@elliottslaughter I'm not sure I agree, and most of my disagreement comes not from whether we error in this case or not, but more how we handle an error like this. The main issue I think is, any assert we throw is so deep in the stack that the user has no idea what it is or how to fix it. Say UCX doesn't load because the bootstrap failed, as mentioned earlier in this thread. The user running legate has no idea what that means, it just sees a crash, sees librealm.so in the callstack, and complains to the realm folks that things are broken. We need an interface that will indicate to the user what failed and why, and if legate or something higher up in the stack can fix it somehow, it needs the opportunity to do so. Say for some reason legate's application doesn't need to do cross-rank talk for some reason and therefore doesn't care about networking, but the installed "omnibus" installation has networking enabled. In this case why should the application crash, and why would the user "have" to specify -ll:networks none in order to bypass this, especially if during the development of the application where the developers did not compile realm with networks, did they not have to specify this argument then? It creates confusion in our interface to have different defaults in different versions of our library that presents a bad user experience.

Given that, I would rather have a way for an application to specify that they require a network available rather than they do not require a network, and to fail out of Runtime::init with some error code that would allow the application to recover or at least handle that error in a way it deemed natural (log to logger before returning it's own error code for example, or more likely, call assert itself).

It is my opinion that asserts should only be used in Realm for Realm bugs, things that should not happen unless the Realm developers screwed up. It should not be used for error handling user created issues. This is because in addition to allowing the caller an opportunity to handle the error itself, we also now have a way to sort out what are true Realm bugs and what are just user errors, and we can more conventionally test our validation and user error handling is working in CI without having to write things like death tests, which are expensive and notoriously difficult to write.

@elliottslaughter
Copy link
Contributor Author

Let's leave aside the issue of how errors are reported for another issue, as that's tangential to whether or not a network initialization failure should be an error. I wrote assert above as a short-hand, not as implementation advice.

It is my contention that:

  1. Frameworks like Legate can add their own wrappers based on high-level information that may be available to them. That is corroborated by @manopapad's comment in When network bootstrap fails, job runs in duplicate #1767 (comment). Therefore, users of such frameworks are abstracted away from the Legion command line and already do not have to think about -ll:networks and the like. So let's leave this use case aside.
  2. Almost every other remaining Legion user will NOT be using an omnibus build. This is not the standard way to use Legion. Therefore, since the frameworks can already wrap Legion to do what they want, I think that the usage of a vanilla Legion setup should always be geared towards the remaining users, who as I said will not be using omnibus builds.
  3. Most Legion users build exactly what they need and therefore, the set of built modules is a fair estimation of what they expect to work. Therefore, in default configurations of Legion I think it's fair to error out by default when e.g. the network fails to initialize.
  4. Whether we configure errors in Legion or Realm is mostly orthogonal to this, but please realize that vanilla applications don't really have any more context than Realm does in this case. It is not sane or reasonable to ask end-users to write a lot of error handling code, and even if we did they wouldn't know the first thing about how to do it and why. In most cases the user of the application is the same person who built it, and usually the only available context is "I built this configuration of Legion and I expected it to work." Most of our users are not more sophisticated than that.

If you disagree with the above points let's get that straightened out before diving into exactly how to fix this.

@eddy16112
Copy link
Contributor

Say for some reason legate's application doesn't need to do cross-rank talk for some reason and therefore doesn't care about networking, but the installed "omnibus" installation has networking enabled. In this case why should the application crash, and why would the user "have" to specify -ll:networks none in order to bypass this

If user DO NOT specify -ll:networks none, The MPI_Init used by network modules may crash. If we compile Legion with network modules enabled, and do not run it with mpirun, It may crash (depends on how MPI is configured) even if we only need one process. In this case, we will need to explicitly specify -ll:networks none to not initialize network modules.

Going back to this issue. what is the use case of having an application continue to progress without abort, if the user want to run it with multiple processes, but all network modules fail to initialize?

@muraj
Copy link

muraj commented Oct 8, 2024

Therefore, users of such frameworks are abstracted away from the Legion command line and already do not have to think about -ll:networks and the like. So let's leave this use case aside

But how those frameworks program -ll:networks is part of the issue here I think, so it's worth talking about. Let's say that the framework doesn't care if it has a network or not, for whatever reason (purely hypothetical, bare with me). In this case, if we go forward with the current plan, the framework would need to somehow detect that Legion/Realm was built with a network module (currently this means it would need to know about every network module, iterate through them, and see if Realm's Runtime::get_module<>() returns non-zero), then if it was, set -ll:networks none before calling Runtime::init(), otherwise they can get an error. That's a lot of hoops to opt out of feature just so the application doesn't crash. All I'm asking is that whatever we end up doing, it should not change based on what we are compiled with, and I would prefer not to crash (at least, not inside Realm) when the system is not configured properly.

Almost every other remaining Legion user will NOT be using an omnibus build

I'm sorry, but we need to get to a place where this is not true. Users should not be expected to build everything from source every time they want to use our library. This makes it really difficult to keep up with testing if we have to support a thousand different compilers, platforms, build configurations, and different dependency versions. While this may be the case today, we need to move in a direction where people will start using binary packages of legion and realm, preferably from the operating system's package manager.

In most cases the user of the application is the same person who built it, and usually the only available context is "I built this configuration of Legion and I expected it to work." Most of our users are not more sophisticated than that.

I'll comment on this as I think it's important, but we can continue this in another discussion. I agree with you in that 'most of our users are not more sophisticated than that'. In fact, I would take it to another level, and I'll give you an example; Many of our users complain about the warnings we currently display because they don't understand them in the context of their application, but their application still runs. Crashes from something like an assert are just cryptic messages that don't allow them to run their application, violating the "I expected it to work" part of your statement without much guidance on their part as to why it doesn't work. We're starting with the assumption that "if it compiles, it should work", and I believe that is an incorrect assumption. In my opinion, the initial assumption we should make is that nothing works until proven otherwise, either via documentation or proper error handling, and applications meant to work on a variety of system configurations should be robust to these issues. I don't care how we report the errors, be it error codes, c++ exceptions (though please keep those header only for ABI compat), some callback, whatever. Just allow the next higher level some mechanism to handle it and give a less cryptic error or log the issue in their own way. Don't crash and expect user to understand (or know how to figure out) why.

If user DO NOT specify -ll:networks none, The MPI_Init used by network modules may crash.

We could configure this so that it does not crash with MPI_ERRORS_ARE_FATAL=0 and setting up an error handler with the MPI communicator. Realm could potentially set this as part of the network module init. I honestly don't understand why MPI has error codes and yet has this functionality other than to support lazy developers.

@eddy16112
Copy link
Contributor

We could configure this so that it does not crash with MPI_ERRORS_ARE_FATAL=0 and setting up an error handler with the MPI communicator. Realm could potentially set this as part of the network module init.

The MPI_Init may not come from realm, it could come from network libraries such as gasnet. Here is what I get with gasnetex module after setting MPI_ERRORS_ARE_FATAL=0

(base) weiwu@prm-dgx-02:~/scratch/legion-master/build-gasnetex/bin$ MPI_ERRORS_ARE_FATAL=0 ./memspeed 
*** An error occurred in MPI_Init_thread
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[prm-dgx-02:110313] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!

@muraj
Copy link

muraj commented Oct 8, 2024

@eddy16112 Sorry, it's not an environment variable, MPI_ERRORS_ARE_FATAL is a handler you register with MPI_Errorhandler_set. But apparently you can't call this before MPI_Init, my bad.

For openmpi, we can instead use the envvar OMPI_MCA_mpi_initial_errhandler=mpi_errors_return to get MPI_Init to return an error rather than crash. I am not sure what the MPICH and other implementations' equivalent, and apparently the use for this environment variable is new for MPI_Init, as v4.1.4 just defaults to the fatal error handler...

https://github.com/open-mpi/ompi/blob/main/ompi/errhandler/errhandler.c#L100
https://github.com/open-mpi/ompi/blob/v4.1.4/ompi/errhandler/errhandler_invoke.c#L46C9-L46C47

Oh well, I still think it's a good idea to try to attempt to handle this appropriately, regardless of what our dependencies end up doing. It seems like openmpi is at least getting the hint this is a bad idea as well and is making attempts to remedy the situation by either falling back to assuming that mpirun -n 1 was called, or returning an error.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants