MPICH with NVIDIA Compilers #7178

Open
aruhela opened this issue Oct 18, 2024 · 9 comments

aruhela commented Oct 18, 2024

Hi MPICH team,

I have built MPICH with the NVIDIA compilers (nvc, nvc++, nvfortran) on the TACC Vista machine. Although srun works, the mpiexec job launcher fails with the following errors. Any suggestions?

i615-001gg$ mpiexec -np 16 -ppn 2 ./namd3_mpi_smp_fftw3 +ppn 71 +pemap 1-71,73-143 +commap 0,72 stmv.namd
[proxy:[email protected]] created hwloc xml file /tmp/hydra_hwloc_xmlfile_QmhOmh
[proxy:[email protected]] created hwloc xml file /tmp/hydra_hwloc_xmlfile_kYI4Ja
[proxy:[email protected]] created hwloc xml file /tmp/hydra_hwloc_xmlfile_7fPRik
[proxy:[email protected]] created hwloc xml file /tmp/hydra_hwloc_xmlfile_bjz7BQ
[proxy:[email protected]] created hwloc xml file /tmp/hydra_hwloc_xmlfile_LGXVSr
[proxy:[email protected]] created hwloc xml file /tmp/hydra_hwloc_xmlfile_4GtuuA
[proxy:[email protected]] created hwloc xml file /tmp/hydra_hwloc_xmlfile_ud3CVC
[proxy:[email protected]] created hwloc xml file /tmp/hydra_hwloc_xmlfile_uKHjRx
[proxy:[email protected]] cache_put_flush (proxy/pmip_pmi.c:183): assert (s) failed
[proxy:[email protected]] cache_put_flush (proxy/pmip_pmi.c:183): assert (s) failed
[proxy:[email protected]] cache_put_flush (proxy/pmip_pmi.c:183): assert (s) failed
[proxy:[email protected]] cache_put_flush (proxy/pmip_pmi.c:183): assert (s) failed
[proxy:[email protected]] cache_put_flush (proxy/pmip_pmi.c:183): assert (s) failed
[proxy:[email protected]] cache_put_flush (proxy/pmip_pmi.c:183): assert (s) failed
[proxy:[email protected]] cache_put_flush (proxy/pmip_pmi.c:183): assert (s) failed
[proxy:[email protected]] cache_put_flush (proxy/pmip_pmi.c:183): assert (s) failed
Abort(878831119) on node 2: Fatal error in internal_Init_thread: Other MPI error, error stack:
internal_Init_thread(49255)...: MPI_Init_thread(argc=0xfffff342b99c, argv=0xfffff342b990, required=1, provided=0xfffff342b988) failed
MPII_Init_thread(265).........:
MPIR_init_comm_world(34)......:
MPIR_Comm_commit(800).........:
MPIR_Comm_commit_internal(585):
MPID_Comm_commit_pre_hook(151):
MPIDI_world_pre_init(640).....:
MPIDI_UCX_init_world(263).....:
initial_address_exchange(79)..:
MPIDU_bc_table_create(153)....:
MPIR_pmi_allgather_shm(690)...:
get_ex_segs(431)..............:
(unknown)(): Other MPI error
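
The exact configure invocation was not posted; a representative reconstruction (hypothetical beyond the compilers named above, the install prefix visible in the launch log later in this thread, and the ch4:ucx device implied by the MPIDI_UCX_init_world frame in the error stack) would be:

# Hypothetical build sketch; only the compilers, the install prefix,
# and the use of the UCX device are confirmed by this thread.
CC=nvc CXX=nvc++ FC=nvfortran ./configure \
    --prefix=/scratch/projects/compilers/nvidia24/mpich/4.2.3_cpu \
    --with-device=ch4:ucx
make -j && make install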

hzhou (Contributor) commented Oct 18, 2024

Which version of MPICH is this? Could you try the latest release?

aruhela (Author) commented Oct 18, 2024

It's the latest version, 4.2.3.

hzhou (Contributor) commented Oct 18, 2024

Could you add the -v -l options to mpiexec and upload the console log?

aruhela (Author) commented Oct 19, 2024

Here is the log file:
run.log

The main error is:
[[email protected]] Launch arguments: /usr/bin/srun -N 8 -n 8 --input none --external-launcher /scratch/projects/compilers/nvidia24/mpich/4.2.3_cpu/bin/hydra_pmi_proxy --control-port i615-001.vista.tacc.utexas.edu:45341 --debug --rmk slurm --launcher slurm --demux poll --pgid 0 --retries 10 --usize -2 --pmi-port 0 --gpus-per-proc -2 --gpu-subdevs-per-proc -2 --proxy-id -1
[proxy:[email protected]] HYDU_create_process (lib/utils/launch.c:73): execvp error on file 1 (No such file or directory)

hzhou (Contributor) commented Oct 19, 2024

Could you try the following?

mpiexec -v -np 16 -ppn 2 ./namd3_mpi_smp_fftw3 +ppn 71 +pemap 1-71,73-143 +commap 0,72 stmv.namd

aruhela (Author) commented Oct 19, 2024

Hui, here is the log.

run2.log

aruhela (Author) commented Dec 21, 2024

Any update on this ticket?

hzhou (Contributor) commented Dec 21, 2024

Sorry for the neglect. Could you try the newest MPICH release, 4.3.0rc1 (https://www.mpich.org/downloads/), and if it still fails, upload the run log?
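
For reference, MPICH release tarballs typically follow this layout on mpich.org (the exact URL for the rc tarball is assumed from the usual download naming):

# Assumed download URL pattern for the 4.3.0rc1 tarball.
wget https://www.mpich.org/static/downloads/4.3.0rc1/mpich-4.3.0rc1.tar.gz
tar xf mpich-4.3.0rc1.tar.gz && cd mpich-4.3.0rc1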

natshineman commented Jan 17, 2025

Hi @hzhou, we have seen a similar issue on Vista while working on MVAPICH.

From my testing, it appears to be related to the --enable-fast=ndebug configure flag. Manually setting --enable-fast=02,alwaysinline instead of --enable-fast=all resolves the issue. However, it comes at the cost of significantly reduced small-message intra-node performance, so it is not a usable workaround for us.

It looks to me like the NVIDIA compiler performs some kind of unwanted optimization that leads to this issue when NDEBUG is defined. Any thoughts on where to look?

Edit:
It looks like I had a typo in my configure line and set 02 (zero-two), not O2. ndebug was not at fault; it was the O2 optimizations. That also explains why performance was impacted, and makes much more sense.
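
To spell out the distinction (illustrative configure lines; MPICH's --enable-fast takes a comma-separated option list, where O2 selects the compiler optimization level):

# Typo: "02" (zero-two) is not a valid O<level> token, so -O2 is
# evidently not applied; the assert failure goes away, but
# small-message performance drops.
./configure --enable-fast=02,alwaysinline

# Intended: letter O. With -O2 applied by the NVIDIA compiler,
# the cache_put_flush assert failure reappears.
./configure --enable-fast=O2,alwaysinline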
