Interested in elastic deployments on Slurm #6213
Comments
Currently, a Flux instance will continue running if it loses a broker rank (node) that is not deemed critical. Critical ranks include rank 0 (the first node of the instance) and any interior nodes, since they route messages. To make a Flux instance as resilient as possible to removed or lost nodes, you can configure the overlay as a flat tree by setting the tbon.topo broker attribute with a large kary number, e.g.

Dynamically adding nodes, however, is an area of ongoing work which has unfortunately taken a backseat to other important development recently.

One potential way to accomplish this would be to set up a bootstrap configuration with all possible nodes when first launching a Flux instance. All nodes not currently running a Flux broker would appear as down in the instance. When new nodes are added, the Flux broker could use this configuration to join the instance.

There's also a solution proposed in #5184, but it was never merged. This method allows starting an instance with a nominal size larger than the number of brokers actually running, and then adding arbitrary nodes up to that size. Nodes that are missing initially have placeholder hostnames, and new brokers join by using ssh to connect to the existing rank 0 broker.

If there is interest in this second method, feel free to test on that branch, though we should rebase the PR first to ensure you're working with the most recent version of Flux.
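To illustrate why a large kary number leaves fewer critical ranks, here is a minimal sketch. It assumes a heap-style k-ary numbering in which the children of rank r are ranks k*r+1 through k*r+k; that numbering and the helper name `critical_ranks` are assumptions for illustration, not taken from Flux itself.

```python
def critical_ranks(size: int, k: int) -> list[int]:
    """Return rank 0 plus every rank that has at least one child.

    Assumes a heap-style k-ary tree: children of rank r are
    k*r + 1 .. k*r + k (illustrative numbering, not Flux's API).
    """
    if size < 1 or k < 1:
        raise ValueError("size and k must be positive")
    critical = {0}  # rank 0 is always critical
    for rank in range(size):
        first_child = k * rank + 1
        if first_child < size:  # rank has a child, so it routes messages
            critical.add(rank)
    return sorted(critical)


if __name__ == "__main__":
    # Binary tree over 16 brokers: half of the ranks are routers.
    print(critical_ranks(16, 2))
    # Fanout >= size - 1 gives a flat tree: only rank 0 is critical.
    print(critical_ranks(16, 15))
```

With a fanout of at least size - 1, every other broker is a leaf under rank 0, so losing any one of them should not take down the instance.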
Hi @grondo thanks for the info!
I understand. We've all been there 😄 Here are my thoughts so far:
My reading of this is that regardless of whether we try creating a bootstrap configuration or the solution in #5184, we need to make sure that rank 0 keeps running. I.e., the whole cluster is limited by the wallclock limit of the original job that started the Flux cluster, right? In other words, we cannot "hand off" the broker to another Slurm job? I'll have to think about how #5184 works, and then I would be happy to do some testing. Based on your description, something gives me pause:
We discourage using ssh at scale. Is there a way we can make the new brokers communicate with the rank 0 broker via TCP? Cheers,
Yes, that or something like it is definitely the long term plan. Ideally resources can be dynamically added without needing placeholders, but that requires some engineering that hasn't been planned yet unfortunately.
Yes, that is something that could work now, with a bit of work to start/stop brokers with the correct arguments and config, etc.
It is possible to shut down a Flux instance and restart it, but I think currently there would be trouble moving the broker to a different node. One short-term solution might be to keep rank 0 on a login node or similar where the broker could keep running. That rank could be excluded from running jobs, much like is done for a system instance of Flux where rank 0 runs on a management node.
Well, all options are open at this time; I think #5184 is more of a proof of concept. Part of the reason brokers in this approach have to ssh back to the existing instance is so that keys and configuration can be shared. It is difficult to imagine how to do a secure key exchange over TCP without already having keys. SSH is nice and simple (if passwordless ssh is enabled for the user), and the hope is that only a few nodes at a time would be joining in this solution. But other solutions are possible, I'm sure! @garlick is off this week, but I'm sure he'll have some input here when he returns.
Just curious, what kind of scale might we be talking about for #5184? Adding 1K nodes all at once? More? When adding a broker there are actually two connections involved in the bootstrap:
It is 1) where adding N nodes creates N connections to one sshd, and I don't think that really needs to be secured. So maybe there is a way to do that without ssh? 2) will respect the overlay fanout, although a caveat is that adding a layer to the overlay tree does promote a set of nodes to a critical router role, even as it improves scalability.
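The fanout trade-off mentioned above can be made concrete with a rough sketch. It assumes the same heap-style k-ary numbering (parent of rank r is (r - 1) // k), which is an illustrative assumption and may not match how Flux assigns ranks; `overlay_shape` is a hypothetical helper.

```python
def overlay_shape(size: int, k: int) -> tuple[int, int, int]:
    """Return (tree height, number of router ranks, max children per broker).

    Assumes heap-style numbering: parent of rank r is (r - 1) // k.
    This is an illustrative model, not Flux's implementation.
    """
    if size < 1 or k < 1:
        raise ValueError("size and k must be positive")
    # A rank is a router if it has at least one child.
    routers = sum(1 for r in range(size) if k * r + 1 < size)
    # Height is the number of hops from the last rank up to rank 0.
    height, rank = 0, size - 1
    while rank > 0:
        rank = (rank - 1) // k
        height += 1
    max_children = min(k, size - 1)
    return height, routers, max_children


if __name__ == "__main__":
    # 1025 brokers: rank 0 plus 1K added nodes.
    for k in (1024, 32):
        print(k, overlay_shape(1025, k))
```

In this model a flat tree keeps rank 0 as the only router but gives it 1024 direct children, while a fanout of 32 caps every broker at 32 children at the cost of 32 router ranks and one extra layer.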
Thinking about it a bit more, that first handshake really does need to be authenticated to the instance owner, or it becomes a trivial DoS target. The handshake itself is lightweight, so I would expect performance to be dominated by ssh connection establishment. I would think it would sort itself out pretty quickly if, say, 1K brokers were added at once. Maybe we could continue as designed and see if we have a real-life problem there.
Just checking, there are no keys exchanged in that initial handshake? Sorry if I was misinformed and said there were when that wasn't true.
Oh wait, I forgot something important about the current/proposed design! The two handshakes use the same rank 0 broker connection, even though the first one is an RPC to rank 0 and the second one is an RPC to the parent rank. Thank you for calling me out on that one. I missed that fact in my initial review of the code. So as far as ssh connections go, there is just the one to rank 0. Sorry about that!
I'm interested in dynamically adding and removing nodes in a Flux deployment running in Slurm. I'm aware that similar functionality exists for K8s (https://flux-framework.org/flux-operator/). Since Slurm is the primary scheduler for NERSC's current and next HPC resources, we're interested in running Flux within an elastic Slurm allocation.
Happy to continue the discussion that I started with @milroy on Slack (https://llnl-performance.slack.com/archives/CBNV6RG8Y/p1723574723197099), and happy to contribute time and test things on Perlmutter.
Tagging @namehta4 who's interested in this also.