Suggestion: improve error messaging with more context #6392

Open
kkier opened this issue Oct 25, 2024 · 0 comments

kkier commented Oct 25, 2024

Reference #6389 and #6391:

In this case, the error on the node side was:

Oct 24 19:44:54 tuolumne17 flux[120979]: broker.crit[1]: tuolumne1 (rank 0) sent disconnect control message
Oct 24 19:44:54 tuolumne17 flux[120979]: broker.info[1]: shutdown: run->cleanup 37.6744s
Oct 24 19:44:54 tuolumne17 flux[120979]: broker.info[1]: cleanup-none: cleanup->shutdown 0.042097ms
Oct 24 19:44:54 tuolumne17 flux[120979]: broker.err[1]: state-machine.monitor: No route to host

So of course I ended up down a rabbit hole checking the switch configs, firewalls, interface configs, etc.

My impression is that some of our error messages report symptoms ("No route to host" in this case) without getting into causes or giving as much context as I'd wish for. Another example I just noticed is from imp kill: "flux-imp: Fatal: kill: failed to initialize pid info: No such file or directory", which doesn't tell me what file or directory it was looking for, or why. In the example at the top of this post, an error message saying that rank 0 saw two nodes trying to be rank 1, and that this node was one of them, would have been huge.
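
As an illustration of the difference between reporting a symptom and reporting context, here is a minimal Python sketch. It is not Flux code: `read_pid_info`, the `proc_root` parameter, and the choice of the `cgroup` file are all assumptions made for the example. The point is only that the message names the path that was being opened and hints at a likely cause, rather than passing along the bare errno text.

```python
import os


def read_pid_info(pid, proc_root="/proc"):
    """Hypothetical helper: read a per-process file, adding context on failure.

    The file name ("cgroup") is an assumption for illustration only.
    """
    path = os.path.join(proc_root, str(pid), "cgroup")
    try:
        with open(path) as f:
            return f.read()
    except OSError as e:
        # Report the operation, the path, and a likely cause instead of
        # only the bare strerror ("No such file or directory").
        raise RuntimeError(
            f"kill: failed to initialize pid info for pid {pid}: "
            f"cannot read {path}: {e.strerror} "
            f"(has the process already exited?)"
        ) from e
```

With a message shaped like this, the reader at least knows which file was missing and can check whether the process was still alive.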

Even if those errors only show up on rank 0 (for example, "rejecting connection from cluster1234 (rank 1): rank 1 already references cluster1235", or similar), they'll still help with troubleshooting, since we'd know which nodes aren't connecting.
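
A rough sketch of the check behind a message like that, written in Python purely for illustration. `RankRegistry` and `register` are hypothetical names, and this is not how the broker actually tracks peers. It only shows that once rank 0 knows which host already holds a rank, naming both hosts in the rejection costs almost nothing.

```python
class RankRegistry:
    """Hypothetical stand-in for rank 0's view of which host holds each rank."""

    def __init__(self):
        self._host_by_rank = {}

    def register(self, hostname, rank):
        """Return (accepted, log_message) for a connection attempt."""
        current = self._host_by_rank.get(rank)
        if current is not None and current != hostname:
            # Name both the rejected host and the host already holding the rank.
            return False, (
                f"rejecting connection from {hostname} (rank {rank}): "
                f"rank {rank} already references {current}"
            )
        self._host_by_rank[rank] = hostname
        return True, f"accepted connection from {hostname} (rank {rank})"


registry = RankRegistry()
registry.register("cluster1235", 1)
accepted, message = registry.register("cluster1234", 1)
print(message)
# prints: rejecting connection from cluster1234 (rank 1): rank 1 already references cluster1235
```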
