
Specifying mpi executable/binary doesn't carry over into a batch script? #287

Closed · sbryngelson opened this issue Jan 1, 2024 · 9 comments · Fixed by #307
Labels: enhancement (New feature or request)

@sbryngelson (Member) commented:

Running the following on Bridges2:

$ ./mfc.sh run examples/1D_sodshocktube/case.py -t pre_process simulation -e batch -N 1 -n 1 -w "00:10:00" -p RM-small -# "shb-test" -b mpirun -a phy210041p

gives in stdout:

mfc: OK > (venv) Entered the Python virtual environment.

[MFC ASCII-art banner trimmed] [Linux]
-----------------------------------------
--jobs 1
--mpi
--no-gpu
--no-debug
--targets pre_process and simulation
-----------------------------------------------------------
$ ./mfc.sh [build, run, test, clean, count, packer] --help

Run
  Acquiring /ocean/projects/phy210041p/bryngel/MFC/examples/1D_sodshocktube/case.py...
  Configuration:
    Input               /ocean/projects/phy210041p/bryngel/MFC/examples/1D_sodshocktube/case.py
    Job Name      (-#)  shb-test
    Engine        (-e)  batch
    Nodes         (-N)  1
    Tasks (/node) (-n)  1
    Walltime      (-w)  00:10:00
    Partition     (-p)  RM-small
    Account       (-a)  phy210041p
    Email         (-@)

  Generating input files for pre_process...

    Generating pre_process.inp.
      INFO: Forwarded 32/50 parameters.

  Generating input files for simulation...

    Generating simulation.inp.
      INFO: Forwarded 32/50 parameters.

  Generating syscheck/include/case.fpp.
    INFO: Custom case.fpp file is up to date.

  $ cmake --build /ocean/projects/phy210041p/bryngel/MFC/build/no-debug_no-gpu_mpi/syscheck --target syscheck -j 1 --config Release

[100%] Built target syscheck

  $ cmake --install /ocean/projects/phy210041p/bryngel/MFC/build/no-debug_no-gpu_mpi/syscheck

-- Install configuration: "Release"
-- Installing: /jet/home/bryngel/bryngel-startup-proj/MFC/build/install/no-debug_no-gpu_mpi/bin/syscheck
  Generating pre_process/include/case.fpp.
    INFO: Custom case.fpp file is up to date.

  $ cmake --build /ocean/projects/phy210041p/bryngel/MFC/build/no-debug_no-gpu_mpi/pre_process --target pre_process -j 1 --config Release

[100%] Built target pre_process

  $ cmake --install /ocean/projects/phy210041p/bryngel/MFC/build/no-debug_no-gpu_mpi/pre_process

-- Install configuration: "Release"
-- Installing: /jet/home/bryngel/bryngel-startup-proj/MFC/build/install/no-debug_no-gpu_mpi/bin/pre_process
  Generating fftw/include/case.fpp.
    INFO: Custom case.fpp file is up to date.

  $ cmake --build /ocean/projects/phy210041p/bryngel/MFC/build/dependencies/fftw --target fftw -j 1 --config Release

[100%] Built target fftw

  $ cmake --install /ocean/projects/phy210041p/bryngel/MFC/build/dependencies/fftw

-- Install configuration: "Release"
  Generating simulation/include/case.fpp.
    INFO: Custom case.fpp file is up to date.

  $ cmake --build /ocean/projects/phy210041p/bryngel/MFC/build/no-debug_no-gpu_mpi/simulation --target simulation -j 1 --config Release

[100%] Built target simulation

  $ cmake --install /ocean/projects/phy210041p/bryngel/MFC/build/no-debug_no-gpu_mpi/simulation

-- Install configuration: "Release"
-- Installing: /jet/home/bryngel/bryngel-startup-proj/MFC/build/install/no-debug_no-gpu_mpi/bin/simulation
  Detected the SLURM queue system.
  Running syscheck, pre_process, and simulation:
    > Generating batch file...
    > Evaluating template file...
    > > Warning: email was not specified. Thus, any line it appears on will be discarded.
    > Writing batch file...

    $ sbatch shb-test.sh

Submitted batch job 21429657
    INFO: Batch file submitted! Please check your queue system for the job status.
    INFO: If an error occurs, please check the generated batch file and error logs for more information.
    INFO: You can modify the template batch file to your needs.
mfc: (venv) Exiting the Python virtual environment.

This seems fine. But when one checks the generated SLURM submission file, one finds (excerpt):

    if command -v srun > /dev/null 2>&1; then
        srun                                   \
            --nodes           1          \
            --ntasks-per-node 1 \
             "$binpath"

        #>
        #> srun --mpi=pmix                 \
        #>       "$binpath"
        #>
    else
        mpirun                         \
            -np 1 \
             "$binpath"
    fi

which ends up calling srun (since it is available), even though we specified mpirun via -b mpirun. This job fails. If I delete the srun lines from the generated slurm submission file shb-test.sh and keep the mpirun lines, it completes fine.
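A minimal sketch of how the generated script could honor the requested launcher instead of probing PATH. Note that MFC_LAUNCHER and launch() are hypothetical names for illustration, not MFC's actual template variables; the function prints the command it would run rather than executing it:

```shell
#!/bin/sh
# Hypothetical sketch: have the toolchain render the user's -b choice into the
# script, falling back to PATH probing only when no launcher was requested.
MFC_LAUNCHER="mpirun"   # would be filled in at generation time from `-b mpirun`

launch() {
    binpath="$1"
    case "$MFC_LAUNCHER" in
        srun)   echo "srun --nodes 1 --ntasks-per-node 1 $binpath" ;;
        mpirun) echo "mpirun -np 1 $binpath" ;;
        *)      # current behavior: prefer srun when it is on PATH
            if command -v srun > /dev/null 2>&1; then
                echo "srun --nodes 1 --ntasks-per-node 1 $binpath"
            else
                echo "mpirun -np 1 $binpath"
            fi ;;
    esac
}

launch /path/to/simulation
```

With this shape, `-b mpirun` would select the mpirun branch even on systems where srun is present but unusable.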

I do notice that the --help gives:

  -b {jsrun,srun,mpirun,mpiexec,N/A}, --binary {jsrun,srun,mpirun,mpiexec,N/A}
                        (Interactive) Override MPI execution binary (default: None)

This seems to suggest that -b only takes effect in interactive mode, which matches the behavior above.

If so, this seems like something we need to change (not assigning blame, just noting it here so we are aware).

@henryleberre can you confirm the above makes sense?

Tagging the relevant folks using Bridges2: @anshgupta1234 @wilfonba @JRChreim

@sbryngelson (Member, Author) commented:

Note that, somehow, @JRChreim issues ./mfc.sh run ../PerfTests/WeakScaling/3D-Ref2-GPU/ETNRM6.py -t pre_process simulation -e batch -p GPU -N 2 -n 8 -w "01:00:00" -# "16CT3" -b mpirun

and gets

for binpath in '/ocean/projects/phy230019p/jrchreim/MFCMerge-DeleteAfterMerging/MFC-GPU/build/instal$

    echo -e ":) Running $binpath:"

        mpirun                         \
            -np 16 \
             "$binpath"

done

code=$?

which is notably different from my case.

@wilfonba (Collaborator) commented Jan 1, 2024:

When I use batch mode on Phoenix, which also requires mpirun, I always have to modify the batch script. There's an if statement in the default batch script, but it doesn't seem to work on Phoenix. Also relevant is #240, because I have to manually change the #SBATCH directives at the top for it to work on Phoenix.

@sbryngelson (Member, Author) commented:

> When I use batch mode on Phoenix, which also requires MPIRun, I always have to modify the batch script. There's an if statement in the default batch script, but it doesn't seem to work on Phoenix. Also relevant is #240 because I have to manually change the #SBATCH stuff at the top for it to work on Phoenix.

Good to know. The if statement doesn't work because it's checking whether srun exists on the $PATH... which it does; it just isn't the executable you want to use.
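To illustrate the point: `command -v` only answers whether a name is on the PATH, not whether it is the right launcher for the job. A do-nothing stub named srun is enough to satisfy the check (a sketch; the stub is of course artificial):

```shell
#!/bin/sh
# Install a stub `srun` on PATH; `command -v srun` then succeeds even though
# the stub is not a usable launcher.
tmpdir=$(mktemp -d)
printf '#!/bin/sh\nexit 0\n' > "$tmpdir/srun"
chmod +x "$tmpdir/srun"
PATH="$tmpdir:$PATH"

if command -v srun > /dev/null 2>&1; then
    found=yes
else
    found=no
fi
echo "srun on PATH: $found"   # a batch script probing PATH would pick srun here
rm -rf "$tmpdir"
```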

@sbryngelson (Member, Author) commented:

Relevant function:

def __create_batch_file(self, system: queues.QueueSystem, targets: typing.List[MFCTarget]):

@JRChreim (Contributor) commented Jan 1, 2024:

@sbryngelson I too had to manually change the SLURM submission file (or change it inside the MFC code). It was failing before, and -b mpirun seems to work only for interactive sessions.

@henryleberre (Member) commented:

@sbryngelson This is related to #240. The current expectation is that one modifies the template/ file so that it works on their system. It is somewhat unreasonable to assume this should just work everywhere given how eccentric many systems are.

I agree that it isn't obvious that -b mpirun does not work in Batch scripts if you don't check the help menu:

henryleberre:~/dev/MFC $ ./mfc.sh run -h | grep binary
  -b {jsrun,srun,mpirun,mpiexec,N/A}, --binary {jsrun,srun,mpirun,mpiexec,N/A}
                        (Interactive) Override MPI execution binary (default:

The reason for the srun check you are talking about is that it is more probable you would want to use srun if it is present; mpirun is the generic fallback.

@sbryngelson (Member, Author) commented:

Yeah, I've come to understand the current state better now. I think #240 is a good idea, especially because it makes it obvious to the user that a template exists at all. Right now, no one really seems to know this, and it would obviously become annoying to re-create the same template every time you clone or fork MFC. We could presumably keep a small set of such scripts in a directory that people can add to (just like mfc.sh load).

@henryleberre (Member) commented:

I agree this is probably the best path forward. I have had to make quite a few changes for benchmarking, so once that is done it should be rather straightforward to get this working.

I imagine something like:

$ ./mfc.sh run case.py -e batch -c bridges2
.. template is toolchain/templates/bridges2.sh

$ ./mfc.sh run case.py -e batch -c ~/my_system.sh
.. template is ~/my_system.sh
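The proposed -c lookup could be sketched as follows: if the argument names an existing file, use it directly; otherwise resolve it against the built-in template directory. resolve_template is a hypothetical helper name, and the toolchain/templates/ path simply follows the comment above:

```shell
#!/bin/sh
# Sketch of the proposed -c lookup (resolve_template is an illustrative name):
# an existing file wins; otherwise fall back to a named built-in template.
resolve_template() {
    if [ -f "$1" ]; then
        echo "$1"                           # e.g. -c ~/my_system.sh
    else
        echo "toolchain/templates/$1.sh"    # e.g. -c bridges2
    fi
}

resolve_template bridges2
```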

@sbryngelson (Member, Author) commented:

Sounds good to me. I will leave this open as a tracking item for this improvement.

@sbryngelson sbryngelson linked a pull request Jan 3, 2024 that will close this issue
@sbryngelson sbryngelson added the enhancement New feature or request label Jan 4, 2024
henryleberre added a commit to henryleberre/ChemMFC that referenced this issue Jan 6, 2024 (and 26 further commits through Jan 14, 2024)
@henryleberre henryleberre linked a pull request Jan 14, 2024 that will close this issue
4 participants