SSHExecutor fails to create a connection for sublattices and large workflows #13

Open
venkatBala opened this issue May 25, 2022 · 3 comments
Labels
bug 🐛 Something isn't working improvements / runtime Improvement in runtime efficiency

Comments

@venkatBala
Contributor

When multiple workflows have been dispatched, the SSHExecutor fails to create SSH connections. This leaves the node/workflow hanging without ever completing.

@venkatBala venkatBala added bug 🐛 Something isn't working improvements / runtime Improvement in runtime efficiency labels May 25, 2022
@venkatBala
Contributor Author

Can the SSHExecutor be made to reuse an already open SSH connection instead of creating a new one each time?
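
One way the reuse idea could look, as a rough sketch only: a small thread-safe cache that hands back an existing connection for a given host and opens a new one only when none exists or the old one has died. `ConnectionCache`, `connect`, and `is_alive` are hypothetical names for illustration; they are not part of the SSHExecutor, and the real connection object would come from whatever SSH library the plugin uses.

```python
import threading
from typing import Callable, Dict, Generic, Hashable, TypeVar

C = TypeVar("C")


class ConnectionCache(Generic[C]):
    """Reuse one connection per key (e.g. a (hostname, username) tuple).

    Hypothetical helper: `connect` opens a new connection for a key and
    `is_alive` reports whether a cached connection is still usable.
    """

    def __init__(
        self,
        connect: Callable[[Hashable], C],
        is_alive: Callable[[C], bool],
    ) -> None:
        self._connect = connect
        self._is_alive = is_alive
        self._connections: Dict[Hashable, C] = {}
        self._lock = threading.Lock()

    def get(self, key: Hashable) -> C:
        # The lock is held while connecting, so concurrent tasks open
        # connections one at a time instead of all at once.
        with self._lock:
            conn = self._connections.get(key)
            if conn is None or not self._is_alive(conn):
                conn = self._connect(key)
                self._connections[key] = conn
            return conn
```

Serializing connection attempts behind the lock is a deliberate trade-off here: it is slower for the first task on each host, but it avoids a burst of simultaneous handshakes against the same server.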

@cjao
Contributor

cjao commented May 26, 2022

To elaborate, large workflows frequently fail midway with `ERROR -- Error reading SSH protocol banner`. The current SSH plugin is a working proof of concept but is not really suitable for large workflows. The SSH server can easily get overloaded, and at a minimum the plugin should implement some logic to retry execution in this scenario.
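
A minimal sketch of what such retry logic might look like, assuming the connection attempt can be wrapped in a zero-argument callable. `retry_with_backoff` and its parameters are hypothetical and not part of the plugin. The delay doubles after each failure and includes random jitter, so that many tasks failing at the same moment do not all retry at the same moment.

```python
import random
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    operation: Callable[[], T],
    retriable: Tuple[Type[BaseException], ...] = (OSError,),
    max_attempts: int = 5,
    base_delay: float = 1.0,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call operation() until it succeeds or max_attempts is reached."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except retriable:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            # Exponential backoff capped at max_delay, scaled by a random
            # factor in [0.5, 1.0] to spread out simultaneous retries.
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(delay * random.uniform(0.5, 1.0))
    raise AssertionError("unreachable")
```

The `retriable` tuple would need to include whatever exception the underlying SSH library raises for the banner error; with paramiko that would presumably be `paramiko.SSHException`, but which library the plugin uses is not stated in this thread.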

@cjao
Contributor

cjao commented Sep 29, 2022

Rather than holding the connection open for the entire task, we should probably poll periodically, as we do with Slurm.
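
As a rough illustration of the polling approach (not how the Slurm plugin actually does it, which is not shown in this thread): the task is started remotely, the connection is closed, and a short-lived check runs at a fixed interval until the task reports completion. `poll_until_done` and `is_done` are hypothetical names; `is_done` stands in for whatever brief remote check is used, such as testing for a result file.

```python
import time
from typing import Callable, Optional


def poll_until_done(
    is_done: Callable[[], bool],
    interval: float = 5.0,
    timeout: Optional[float] = None,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> None:
    """Call is_done() every `interval` seconds until it returns True.

    Each call is expected to open its own short-lived connection, so no
    connection is held while the remote task runs.
    """
    deadline = None if timeout is None else clock() + timeout
    while True:
        if is_done():
            return
        if deadline is not None and clock() >= deadline:
            raise TimeoutError("remote task did not finish before the timeout")
        sleep(interval)
```

The cost of this design is latency of up to one `interval` in noticing completion; the benefit is that a long-running task occupies a server connection only for the duration of each check.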


2 participants