Incorrect SLURM scripts produced with omp_num_threads>1 #777

Comments
There are definitely bugs and landmines in the submission logic. We would need to brainstorm and come up with the minimum number of settings that support all the workflows we claim to support. If we introduce changes to the directives' schema, we also need to decide what options to support.

Part of the reason (not a defense, just an explanation of design intent) for the directives schema was to provide a simple set of options that we would handle appropriately. The other major reason is that it was developed piecemeal as needed. This particular bug was introduced in #722, my bad.
I understand: setting these per rank is incompatible with heterogeneous jobs, and they would need to be disallowed. I don't know if it is even possible to craft a single SLURM job submission that requests different numbers of cores for different ranks. My problem with the "simple" set of options is that it is difficult to predict which simple inputs will give the correct complex output, for example when attempting to discover the intended `mem_per_rank` from the simpler inputs.

We could potentially support both aggregate (`np`, `nranks`, `ngpus`, `memory`) and per node (`nranks`, `cpus_per_rank`, `mem_per_rank`, `gpus_per_rank`) operations (they would be mutually exclusive). This unfortunately makes the template scripts more complex. There may be better options.
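As a purely illustrative sketch (these keys follow the names listed above and are not an implemented flow API), the two mutually exclusive styles might be expressed as:

```python
# Hypothetical illustration only: directive names follow the ones listed in
# the comment above and are not an implemented flow API.

# Aggregate style: totals for the whole operation.
aggregate_style = {"np": 8, "nranks": 4, "ngpus": 2, "memory": "64g"}

# Per-rank style: nranks plus resources specified per rank.
per_rank_style = {
    "nranks": 4,
    "cpus_per_rank": 2,
    "gpus_per_rank": 1,
    "mem_per_rank": "16g",
}
```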
signac has long promised effortless bundling and parallel job execution (see e.g. #270, glotzerlab/signac-docs#157). In practice, this has only worked on a small subset of systems (none of which are in active use by the Glotzer group). Many users assume that it works and don't bother checking whether their jobs are using all the resources they requested. I think in this refactor the submit environments should disallow parallel bundle execution on systems where it does not work.

I would summarize the possible use-cases as:

For single-node jobs, example:
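The example originally posted here is not preserved in this extract; the following is a minimal hypothetical sketch of a threaded, single-node operation that needs no MPI launcher (the exact decorator API depends on the flow version):

```python
# Hypothetical sketch; the example originally posted here is not shown in
# this extract, and the exact decorator API depends on the flow version.
from flow import FlowProject, directives


class Project(FlowProject):
    pass


@Project.operation
@directives(omp_num_threads=16)  # threaded work on one node, no MPI launcher
def analyze(job):
    ...
```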
MPI (or another launcher) is required to make multi-node jobs possible (MPI is also possible on a single node). Parallel bundles will almost always be disabled in this case, unless extensive testing has been done to prove that they work correctly on a given system. Example:
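As above, the original example is not preserved; a hypothetical sketch of the MPI-driven case might be:

```python
# Hypothetical sketch; the example originally posted here is not shown in
# this extract. MPI-driven case: 4 ranks with 2 OpenMP threads per rank,
# started through the launcher in the generated submission script.
from flow import FlowProject, directives


class Project(FlowProject):
    pass


@Project.operation
@directives(nranks=4, omp_num_threads=2)
def simulate(job):
    ...
```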
This provides simple and scalable syntax well-adapted to the two different classes of use-cases.

The only case removed (which is essentially non-functional now, except possibly on stampede2) is heterogeneous MPI-driven jobs. Any thoughts or suggestions?

One of the pain points I have with the current system is that these settings map to the actual job request only indirectly.
So the resource logic would look something like this:
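The snippet originally posted here is not shown in this extract; one hedged reading of such logic, using the directive names discussed in this thread, might be:

```python
# Hypothetical sketch only; this is not flow's implementation, just one
# reading of the logic discussed in this thread.
def slurm_resource_request(directives):
    """Map a directives dict to SLURM options for a homogeneous job."""
    nranks = directives.get("nranks", 0)
    threads = directives.get("omp_num_threads", 1)
    if nranks > 0:
        # MPI-driven job: one SLURM task per rank, threads via --cpus-per-task.
        return {"--ntasks": nranks, "--cpus-per-task": threads}
    # Single-node job without a launcher: one task that owns the requested CPUs.
    return {"--ntasks": 1, "--cpus-per-task": directives.get("np", threads)}
```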
Also, to make sure I have the proposed changes clear: am I understanding correctly, @joaander?
Yes, that is my current proposal. Happy to discuss in person and refine these ideas. I wasn't thinking that the launcher would override the other directives.

When I think about more general support for other launchers: the term "rank" already assumes MPI, so as long as the word is "rank" it does not make sense to apply it to launchers that are not MPI. I'd be happy to use a different nomenclature; SLURM uses the word "task".
Duplicate of #785.
Description

flow incorrectly computes `np = nranks * omp_num_threads` and then sets `--ntasks=np` in the SLURM submission script.

To reproduce

Then add the following to project.py:
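The snippet is not preserved in this extract; a hypothetical reproducer that sets both `nranks` and `omp_num_threads` would be along these lines (4 ranks are implied by the output described below, while 2 threads per rank is an assumption):

```python
# Hypothetical reproducer; the original snippet is not shown in this extract
# and the exact decorator API depends on the flow version. 4 MPI ranks are
# implied by the report below; omp_num_threads=2 is an assumption.
from flow import FlowProject, directives


class Project(FlowProject):
    pass


@Project.operation
@directives(nranks=4, omp_num_threads=2)
def compute(job):
    ...


if __name__ == "__main__":
    Project().main()
```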
Error output
The resulting job script is:
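The script originally attached here is not reproduced in this extract. Under the mapping described above, and assuming `nranks=4` with `omp_num_threads=2`, the relevant part of the generated request would look roughly like:

```bash
#!/bin/bash
# Illustrative sketch only; the original script is not shown in this extract.
# flow computes np = nranks * omp_num_threads = 8 and emits it as the task
# count, so SLURM sees 8 tasks rather than 4 tasks with 2 CPUs each.
#SBATCH --ntasks=8
```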
A job definition like this does not provide the desired behavior. When tested interactively, all ranks are run on the same node, while each rank should be on a separate node.
Correct job definition:
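The exact definition is not shown here either; a hedged sketch of the kind of request described in the next sentence, with assumed values, is:

```bash
#!/bin/bash
# Illustrative sketch only; the values are assumptions, not the reporter's
# exact script. --ntasks carries the MPI rank count and --cpus-per-task the
# per-rank CPU count, as described in the sentence that follows.
#SBATCH --ntasks=4
#SBATCH --cpus-per-task=2
```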
When `ntasks` is the number of MPI tasks and `--cpus-per-task` is set appropriately, the 4 MPI ranks are assigned to different nodes.

System configuration

Please complete the following information:
Solution

Fixing this requires passing information on ranks and CPUs per rank separately into the submit templates. It also requires that jobs using MPI are homogeneous (there is unused code in flow to verify this).

I would prefer not to hack in a quick fix for this bug. The current flow directives `np`, `ntasks`, `ngpus`, `memory`, etc. map to the actual job request only in a very indirect way. I would strongly prefer a solution that implements new directives that provide the user a more direct route to setting `ntasks`, `cpus-per-task`, `gpus-per-task`, and `mem-per-cpu`. I am considering possible schemas for `directives` to achieve this without breaking existing jobs too much. Users will need a way to indicate whether to launch jobs with the parallel runner (e.g. `mpirun`) or not. For complete generality, we should also allow jobs to request more than one node and yet still launch without `mpirun` (e.g. for when users are running a tool that uses some other library for internode communication).
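As one illustration only (none of these keys are settled or implemented; the names mirror the SLURM options mentioned above), a directives schema along those lines might look like:

```python
# Hypothetical directives schema, sketched for discussion only; none of these
# keys are an implemented flow API.
directives = {
    "launcher": "mpi",     # or None to run without mpirun/srun
    "ntasks": 4,           # MPI ranks        -> --ntasks
    "cpus_per_task": 2,    # threads per rank -> --cpus-per-task
    "gpus_per_task": 1,    # GPUs per rank    -> --gpus-per-task
    "mem_per_cpu": "4G",   # memory per CPU   -> --mem-per-cpu
    "nodes": 2,            # optional explicit node count, even without MPI
}
```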