Feature description
Any submission involving multiple nodes that executes more than one operation will not parallelize correctly using backgrounding. For example, if a node has 24 cores and we submit 72 operations, our script generation will correctly request 3 nodes with 24 tasks per node. However, each operation is executed by running the normal command and backgrounding it, and I don't believe there is any way for these backgrounded processes to be distributed across nodes. As a result, the 25th operation will simply oversubscribe one of the cores on the current node.
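For concreteness, here is a sketch of what the generated script effectively does today (the operation command is illustrative, not the exact output of the script generator):

```bash
#!/bin/bash
#SBATCH --nodes=3
#SBATCH --ntasks-per-node=24

# Every operation is forked as a plain backgrounded shell command.
# Backgrounded processes can only run on the node executing this script,
# so operations 25-72 oversubscribe the first node's 24 cores while the
# other two allocated nodes sit idle.
for i in $(seq 0 71); do
    # Illustrative command; the real generated line will differ.
    python project.py exec my_operation "job_$i" &
done
wait
```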
Proposed solution
I believe that using an appropriate launch prefix such as ibrun or srun should resolve this issue. Rather than only using these commands for operations that individually require MPI, we may need to use them for any multi-node job. @joaander may be able to provide additional commentary on potential solutions.
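A minimal sketch of this approach on a SLURM system is below. The flags and the operation command are illustrative only; the exact options needed to get correct placement are likely site and SLURM-version dependent:

```bash
#!/bin/bash
#SBATCH --nodes=3
#SBATCH --ntasks-per-node=24

# Prefixing each operation with srun asks SLURM to place the job step in
# one of the 72 allocated task slots, which may be on any of the three
# nodes, instead of forking it locally on the first node.
# (--exclusive keeps concurrent job steps from sharing CPUs; the exact
# flags required vary by SLURM version and site configuration.)
for i in $(seq 0 71); do
    srun --nodes=1 --ntasks=1 --exclusive \
        python project.py exec my_operation "job_$i" &
done
wait
```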
Any solution will be environment dependent. In principle, srun is supposed to solve this, but in most of the environments where we have tested it, it fails for one reason or another.