When network bootstrap fails, job runs in duplicate #1767
Just FYI, we've seen this in the wild. @syamajala installed cuNumeric with GASNet from the Conda packages, but the GASNet wrapper was not included, and instead of getting a hard error it was just a warning, which we initially missed in our testing. We spent some time chasing down an OOM condition that turned out to be because we were running the job in duplicate, which was ultimately a waste of time since the memory usage would have been fine if we'd known the network had failed to initialize. CC @manopapad |
It is possible that the application is built with multiple network modules, so even if UCX fails, we may fall back to trying other networks. |
Could we do something like keep two counters: int num_networks_attempted = 0;
int num_networks_initialized = 0; And if Basically if we attempt any networks, at least one should succeed. |
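For illustration, the two-counter idea could be sketched roughly as below. This is not Realm's actual bootstrap code: `NetworkInit`, `BootstrapResult`, and `bootstrap_networks` are hypothetical names, and the failure condition (at least one attempt, zero successes) is an assumption taken from the sentence "if we attempt any networks, at least one should succeed".

```cpp
#include <cassert>
#include <functional>
#include <vector>

// Hypothetical stand-in for a network module's init hook: returns true on success.
using NetworkInit = std::function<bool()>;

struct BootstrapResult {
  int num_networks_attempted = 0;
  int num_networks_initialized = 0;

  // Acceptable if nothing was attempted, or if at least one attempt succeeded.
  bool ok() const {
    return num_networks_attempted == 0 || num_networks_initialized > 0;
  }
};

// Tries every module in turn and records how many were attempted and how many came up.
BootstrapResult bootstrap_networks(const std::vector<NetworkInit>& modules) {
  BootstrapResult r;
  for (const auto& init : modules) {
    ++r.num_networks_attempted;
    if (init()) ++r.num_networks_initialized;
  }
  // Internal invariant only: we can never initialize more than we attempted.
  assert(r.num_networks_initialized <= r.num_networks_attempted);
  return r;
}
```

A caller would check `ok()` after the loop and refuse to start the job when it returns false, rather than continuing as independent single-process copies.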
Yeah, that is doable. @muraj @apryakhin Let me know what you think. |
The main thing we'll lose if we do this is omnibus builds, where you build every possible option at build time and then enable only what is available at runtime. If that's a goal, perhaps we could add a flag like:
And then the condition becomes Depending on what we expect the most common use case to be, we can pick the default to be either:
I'd personally expect omnibus builds to not be the most common deployment option for Legion, but I could be convinced otherwise if other people have opinions. |
There is an option IMHO if Realm was built with any network, then at runtime it should try all its built-in networks until it finds one that works. If no network works, then it should fail w/o falling back to non-networked execution (i.e. the "none" network option should not be considered by default, unless explicitly passed by the user in FWIW in Legate we're currently doing separate builds for UCX and GASNetEx, but we'd like to be doing omnibus builds. We are prepared to pass |
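A rough sketch of the policy described in this comment, under stated assumptions: `NetworkModule`, `Selection`, and `select_network` are hypothetical names, and `none_explicitly_allowed` stands in for whatever mechanism the user would use to opt into the "none" network. It is not how Realm is implemented.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <vector>

// Hypothetical description of one compiled-in network module.
struct NetworkModule {
  std::string name;
  std::function<bool()> try_init;  // returns true if the module came up
};

// Outcome of the selection: ok == false means startup should fail.
struct Selection {
  bool ok;
  std::string network;
};

// Try each built-in network in order and use the first one that works.
// Fall back to "none" only if no networks were built in, or if the user
// explicitly allowed it (assumption: some opt-in exists for this).
Selection select_network(const std::vector<NetworkModule>& built_in,
                         bool none_explicitly_allowed) {
  for (const auto& m : built_in) {
    assert(m.try_init && "every module must provide an init hook");
    if (m.try_init()) return {true, m.name};
  }
  if (built_in.empty() || none_explicitly_allowed) return {true, "none"};
  return {false, ""};
}
```

With this shape, an omnibus build that ships both UCX and GASNetEx would pick whichever comes up first, and a build where every module fails would stop instead of silently running non-networked.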
I would be fine with @manopapad's proposal. |
The question is should we fall back to the |
So I complained about this to @streichler and @SeyedMir a while ago. The reason that was given to me before was that Realm wanted to try and make progress and run even if modules like UCX or GASNet didn't load correctly, that way a binary could be portable to multiple machines. I'm not sure if that is a property that we still want to maintain or not. |
That is exactly what I was talking about with omnibus builds. Pros:
Cons:
To me, I think it's worthwhile to assert by default with the option to fall back to So my proposal is that (unless the user explicitly passes |
@elliottslaughter I'm not sure I agree, and most of my disagreement comes not from whether we error in this case or not, but from how we handle an error like this. The main issue, I think, is that any assert we throw is so deep in the stack that the user has no idea what it is or how to fix it. Say UCX doesn't load because the bootstrap failed, as mentioned earlier in this thread. The user running legate has no idea what that means; they just see a crash, see librealm.so in the callstack, and complain to the realm folks that things are broken. We need an interface that will indicate to the user what failed and why, and if legate or something higher up in the stack can fix it somehow, it needs the opportunity to do so. Say legate's application doesn't need to do cross-rank talk for some reason and therefore doesn't care about networking, but the installed "omnibus" installation has networking enabled. In this case why should the application crash, and why would the user "have" to specify Given that, I would rather have a way for an application to specify that it requires a network, rather than that it does not require one, and to fail out of Runtime::init with some error code that would allow the application to recover or at least handle that error in a way it deems natural (log to a logger before returning its own error code, for example, or more likely, call assert itself). It is my opinion that asserts should only be used in Realm for Realm bugs, things that should not happen unless the Realm developers screwed up. They should not be used for handling user-created errors.
This is because, in addition to giving the caller an opportunity to handle the error itself, we would also have a way to separate true Realm bugs from user errors, and we could test in CI that our validation and user error handling work without having to write things like death tests, which are expensive and notoriously difficult to write. |
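To make the suggestion concrete, here is a minimal sketch of an init path that reports a missing network through a status code instead of asserting. Everything here is hypothetical (`InitStatus`, `InitOptions`, `InitResult`, `init_runtime`); it does not reflect the real `Runtime::init` signature, only the shape of the idea.

```cpp
#include <cassert>
#include <string>

// Hypothetical status codes an init call could return.
enum class InitStatus { Ok, NetworkUnavailable };

// Hypothetical options: the application opts in to requiring a network.
struct InitOptions {
  bool require_network = false;
};

struct InitResult {
  InitStatus status;
  std::string message;  // human-readable reason, empty on success
};

// Returns an error instead of crashing when a required network is missing.
InitResult init_runtime(const InitOptions& opts, int networks_initialized) {
  // Assert is reserved for internal invariants: a negative count would be a runtime bug.
  assert(networks_initialized >= 0);
  if (opts.require_network && networks_initialized == 0) {
    return {InitStatus::NetworkUnavailable,
            "no network module could be initialized"};
  }
  return {InitStatus::Ok, ""};
}
```

The layer above can then decide what to do with `NetworkUnavailable`: log it in its own terms, retry with different settings, or abort. The same path is also easy to cover with an ordinary unit test rather than a death test.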
Let's leave aside the issue of how errors are reported for another issue, as that's tangential to whether or not a network initialization failure should be an error. I wrote It is my contention that:
If you disagree with the above points let's get that straightened out before diving into exactly how to fix this. |
If the user DOES NOT specify Going back to this issue: what is the use case for having an application continue without aborting if the user wants to run it with multiple processes, but all network modules fail to initialize? |
But how those frameworks program -ll:networks is part of the issue here I think, so it's worth talking about. Let's say that the framework doesn't care if it has a network or not, for whatever reason (purely hypothetical, bear with me). In this case, if we go forward with the current plan, the framework would need to somehow detect that Legion/Realm was built with a network module (currently this means it would need to know about every network module, iterate through them, and see if Realm's
I'm sorry, but we need to get to a place where this is not true. Users should not be expected to build everything from source every time they want to use our library. This makes it really difficult to keep up with testing if we have to support a thousand different compilers, platforms, build configurations, and different dependency versions. While this may be the case today, we need to move in a direction where people will start using binary packages of legion and realm, preferably from the operating system's package manager.
I'll comment on this as I think it's important, but we can continue this in another discussion. I agree with you in that 'most of our users are not more sophisticated than that'. In fact, I would take it to another level, and I'll give you an example: many of our users complain about the warnings we currently display because they don't understand them in the context of their application, but their application still runs. Crashes from something like an assert are just cryptic messages that don't allow them to run their application, violating the "I expected it to work" part of your statement without giving them much guidance as to why it doesn't work. We're starting with the assumption that "if it compiles, it should work", and I believe that is an incorrect assumption. In my opinion, the initial assumption we should make is that nothing works until proven otherwise, either via documentation or proper error handling, and applications meant to work on a variety of system configurations should be robust to these issues. I don't care how we report the errors, be it error codes, C++ exceptions (though please keep those header-only for ABI compatibility), some callback, whatever. Just allow the next higher level some mechanism to handle it and give a less cryptic error or log the issue in its own way. Don't crash and expect the user to understand (or know how to figure out) why.
We could configure this so that it does not crash with MPI_ERRORS_ARE_FATAL=0 and setting up an error handler with the MPI communicator. Realm could potentially set this as part of the network module init. I honestly don't understand why MPI has error codes and yet has this functionality other than to support lazy developers. |
The MPI_Init may not come from realm, it could come from network libraries such as gasnet. Here is what I get with gasnetex module after setting
|
@eddy16112 Sorry, it's not an environment variable; MPI_ERRORS_ARE_FATAL is a handler you register with MPI_Errhandler_set. But apparently you can't call this before MPI_Init, my bad. For openmpi, we can instead use the envvar https://github.com/open-mpi/ompi/blob/main/ompi/errhandler/errhandler.c#L100 Oh well, I still think it's a good idea to try to handle this appropriately, regardless of what our dependencies end up doing. It seems like openmpi is at least getting the hint that this is a bad idea as well and is making attempts to remedy the situation by either falling back to assuming that |
I built Legion with UCX. For some reason the network bootstrap is failing, but my job still runs:
I'm not sure this is the behavior we want. If the user requested networking and it fails to load (for any reason), we should fail hard and fast, and not continue to run anyway.