Feature description
Any submission involving multiple nodes that executes more than one operation will not parallelize correctly using backgrounding. For example, if a node has 24 cores and we submit 72 operations, our script generation will correctly request 3 nodes with 24 tasks per node. However, each operation is executed by running the normal command and backgrounding it, and I don't believe there is any way for these backgrounded processes to be distributed across nodes. As a result, the 25th operation will simply oversubscribe one of the cores on the current node.
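For concreteness, here is a sketch of what the generated script effectively does today (the operation command is illustrative, not the exact output of the script generator):

```bash
#!/bin/bash
#SBATCH --nodes=3
#SBATCH --ntasks-per-node=24

# Every operation is forked as a plain backgrounded shell command.
# Backgrounded processes can only run on the node executing this script,
# so operations 25-72 oversubscribe the first node's 24 cores while the
# other two allocated nodes sit idle.
for i in $(seq 0 71); do
    # Illustrative command; the real generated line will differ.
    python project.py exec my_operation "job_$i" &
done
wait
```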
Proposed solution
I believe that using an appropriate launch prefix such as ibrun or srun should resolve this issue. Rather than only using these commands for operations that individually require MPI, we may need to use them for any multi-node job. @joaander may be able to provide additional commentary on potential solutions.
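A minimal sketch of this approach on a SLURM system is below. The flags and the operation command are illustrative only; the exact options needed to get correct placement are likely site and SLURM-version dependent:

```bash
#!/bin/bash
#SBATCH --nodes=3
#SBATCH --ntasks-per-node=24

# Prefixing each operation with srun asks SLURM to place the job step in
# one of the 72 allocated task slots, which may be on any of the three
# nodes, instead of forking it locally on the first node.
# (--exclusive keeps concurrent job steps from sharing CPUs; the exact
# flags required vary by SLURM version and site configuration.)
for i in $(seq 0 71); do
    srun --nodes=1 --ntasks=1 --exclusive \
        python project.py exec my_operation "job_$i" &
done
wait
```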
Any solution will be environment dependent. In principle, srun is supposed to solve this, but in most of the environments where we have tested it, it fails for one reason or another.