corrupted double-linked list, BAD TERMINATION, exit code 134: mpich on GitHub Actions broken for 2 weeks #7256

Open
JohannesBuchner opened this issue Jan 5, 2025 · 12 comments

@JohannesBuchner

Within the last 4 weeks, the change from libmpich12 4.2.0-5build3 to 4.0-3 on GitHub Actions has broken my CI runs.

With no code change in the repository:

A run from 4 weeks ago:
https://github.com/JohannesBuchner/PyMultiNest/actions/runs/12173409593/job/33953705818

A run from 2 weeks ago:
https://github.com/JohannesBuchner/PyMultiNest/actions/runs/12425695793/job/34692833393

The error is:

corrupted double-linked list

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   PID 7143 RUNNING AT fv-az1075-278
=   EXIT CODE: 134
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
YOUR APPLICATION TERMINATED WITH THE EXIT STRING: Terminated (signal 15)
This typically refers to a problem with your application.
Please see the FAQ page for debugging suggestions

The MPI code is in Fortran, built against MPICH, and then loaded as a library from Python. Python is run with mpiexec -np 4.

Should I disable CI testing with mpich?

@raffenet
Contributor

raffenet commented Jan 5, 2025

Within the last 4 weeks, the change from libmpich12 4.2.0-5build3 to 4.0-3 on GitHub Actions has broken my CI runs.

Just to clarify: what operating system are you using? Is MPICH installed from the distro package manager?

@raffenet
Contributor

raffenet commented Jan 5, 2025

Additionally, if you could provide the test code or a minimal reproducer, it would help us track down the actual problem. While we cannot directly fix bugs in distro packages, we can suggest a different version to use for testing.

@JohannesBuchner
Author

JohannesBuchner commented Jan 5, 2025

The top of the log linked above shows the operating system setup (Ubuntu 22.04.5); the libraries are installed with sudo apt-get update && sudo apt-get -y --no-install-recommends install -y -qq libblas{3,-dev} liblapack{3,-dev} libatlas-base-dev cmake build-essential git gfortran

The further install steps are listed in https://github.com/JohannesBuchner/PyMultiNest/actions/runs/12173409593/workflow but boil down to:

  • sudo apt-get install -qq mpich libmpich-dev python3-mpi4py
  • mamba install -c conda-forge --file conda-requirements.txt corner coveralls pytest toml
  • pip install --user mpi4py
  • git clone https://github.com/JohannesBuchner/MultiNest
  • mkdir -p MultiNest/build; pushd MultiNest/build; cmake .. && make && popd
  • pip install --user pymultinest
  • LD_LIBRARY_PATH=MultiNest/lib/:${LD_LIBRARY_PATH} mpiexec -np 4 python pymultinest_demo_minimal.py
    where pymultinest_demo_minimal.py is this file: https://github.com/JohannesBuchner/PyMultiNest/blob/master/pymultinest_demo_minimal.py

I have not tested it, but cd ..; mkdir chains; mpiexec -np 4 bin/eggboxC (or any of the other example programs) may also reproduce it.

Maybe you can set up GitHub Actions for this repository as well.

@hzhou
Contributor

hzhou commented Jan 6, 2025

Following your steps, I got:

(base) hzhou [build]$ cmake ..
-- Detected gfortran, adding -ffree-line-length-none compiler flag.
-- Detected gfortran >= 10, adding -std=legacy compiler flag.
-- Detected gfortran-10+, adding -w -fallow-argument-mismatch compiler flag.
CMake Error in /home/hzhou/MultiNest/build/CMakeFiles/CMakeTmp/CMakeLists.txt:
  Imported target "MPI::MPI_CXX" includes non-existent path

    "/usr/lib/x86_64-linux-gnu/openmpi/include"

  in its INTERFACE_INCLUDE_DIRECTORIES.  Possible reasons include:

  * The path was deleted, renamed, or moved to another location.

  * An install or uninstall procedure did not complete successfully.

  * The installation package was faulty and references files it does not
  provide.



CMake Error in /home/hzhou/MultiNest/build/CMakeFiles/CMakeTmp/CMakeLists.txt:
  Imported target "MPI::MPI_CXX" includes non-existent path

    "/usr/lib/x86_64-linux-gnu/openmpi/include"

  in its INTERFACE_INCLUDE_DIRECTORIES.  Possible reasons include:

  * The path was deleted, renamed, or moved to another location.

  * An install or uninstall procedure did not complete successfully.

  * The installation package was faulty and references files it does not
  provide.



CMake Error at /usr/share/cmake-3.22/Modules/FindMPI.cmake:1264 (try_compile):
  Failed to generate test project build system.
Call Stack (most recent call first):
  /usr/share/cmake-3.22/Modules/FindMPI.cmake:1315 (_MPI_try_staged_settings)
  /usr/share/cmake-3.22/Modules/FindMPI.cmake:1638 (_MPI_check_lang_works)
  src/CMakeLists.txt:95 (FIND_PACKAGE)


-- Configuring incomplete, errors occurred!
See also "/home/hzhou/MultiNest/build/CMakeFiles/CMakeOutput.log".
See also "/home/hzhou/MultiNest/build/CMakeFiles/CMakeError.log".

I am not very familiar with CMake or its FindMPI module. Any idea how to work around this?

@JohannesBuchner
Author

Can you try running:

update-alternatives --list mpi|grep mpich| xargs -rt sudo update-alternatives --set mpi
update-alternatives --list mpirun|grep mpich | xargs -rt sudo update-alternatives --set mpirun

@hzhou
Contributor

hzhou commented Jan 7, 2025

That worked.

Running mpiexec -np 4 python pymultinest_demo_minimal.py got me this:

Attempting to use an MPI routine (internal_Comm_rank) before initializing or after finalizing MPICH

Missing MPI_Init?

And running cd ..; mkdir chains; mpiexec -np 4 bin/eggboxC gives me:

 MultiNest Warning: no resume file found, starting from scratch
 *****************************************************
 MultiNest v3.10
 Copyright Farhan Feroz & Mike Hobson
 Release Jul 2015

 no. of live points = 1000
 dimensionality =    2
 *****************************************************
 Starting MultiNest
At line 233 of file /home/hzhou/MultiNest/src/nested.F90 (unit = 55)
Fortran runtime error: Cannot open file 'chains/eggboxC-ev.dat': No such file or directory

@JohannesBuchner
Author

Huh, I have not seen that before. Maybe you need to change pymultinest_demo_minimal.py to add init_MPI=True after verbose=True.
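Roughly, the changed call would look like this (the prior/likelihood below are simplified placeholders, not the demo's actual eggbox functions):

import os
import pymultinest

def prior(cube, ndim, nparams):
    # map the unit cube to the parameter range (placeholder transformation)
    for i in range(ndim):
        cube[i] = cube[i] * 10.0

def loglike(cube, ndim, nparams, lnew=0.0):
    # toy log-likelihood (placeholder)
    return -sum(cube[i] ** 2 for i in range(ndim))

os.makedirs("chains", exist_ok=True)  # MultiNest writes its chains/1-* output here
pymultinest.run(loglike, prior, 2,
                resume=False,
                verbose=True,
                init_MPI=True)  # the added keyword, so MPI gets initialized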

For the second, are you sure chains/ is a folder that exists and can be written into?

@hzhou
Contributor

hzhou commented Jan 7, 2025

Thanks! Both runs work following your suggestions. However, I don't see the segfaults.

My mpichversion shows:

MPICH Version:          4.0
MPICH Release date:     Fri Jan 21 10:42:29 CST 2022
MPICH Device:           ch4:ofi
MPICH configure:        --build=x86_64-linux-gnu --prefix=/usr --includedir=${prefix}/include --mandir=${prefix}/share/man --infodir=${prefix}/share/info --sysconfdir=/etc --localstatedir=/var --disable-option-checking --disable-silent-rules --libdir=${prefix}/lib/x86_64-linux-gnu --runstatedir=/run --disable-maintainer-mode --disable-dependency-tracking --with-libfabric=/usr --with-slurm=/usr --with-device=ch4:ofi --with-pm=hydra --with-hwloc-prefix=/usr --with-wrapper-dl-type=none --enable-shared --without-yaksa --prefix=/usr --enable-fortran=all --disable-rpath --disable-wrapper-rpath --sysconfdir=/etc/mpich --libdir=/usr/lib/x86_64-linux-gnu --includedir=/usr/include/x86_64-linux-gnu/mpich --docdir=/usr/share/doc/mpich CPPFLAGS= CFLAGS= CXXFLAGS= FFLAGS=-O2 -ffile-prefix-map=/build/mpich-0xgrG5/mpich-4.0=. -flto=auto -ffat-lto-objects -flto=auto -ffat-lto-objects -fstack-protector-strong -fallow-invalid-boz -fallow-argument-mismatch FCFLAGS=-O2 -ffile-prefix-map=/build/mpich-0xgrG5/mpich-4.0=. -flto=auto -ffat-lto-objects -flto=auto -ffat-lto-objects -fstack-protector-strong -fallow-invalid-boz -fallow-argument-mismatch BASH_SHELL=/bin/bash
MPICH CC:       gcc  -g -O2 -ffile-prefix-map=/build/mpich-0xgrG5/mpich-4.0=. -flto=auto -ffat-lto-objects -flto=auto -ffat-lto-objects -fstack-protector-strong -Wformat -Werror=format-security  -O2
MPICH CXX:      g++  -g -O2 -ffile-prefix-map=/build/mpich-0xgrG5/mpich-4.0=. -flto=auto -ffat-lto-objects -flto=auto -ffat-lto-objects -fstack-protector-strong -Wformat -Werror=format-security -O2
MPICH F77:      gfortran -O2 -ffile-prefix-map=/build/mpich-0xgrG5/mpich-4.0=. -flto=auto -ffat-lto-objects -flto=auto -ffat-lto-objects -fstack-protector-strong  -fallow-invalid-boz -fallow-argument-mismatch -g -O2 -ffile-prefix-map=/build/mpich-0xgrG5/mpich-4.0=. -flto=auto -ffat-lto-objects -flto=auto -ffat-lto-objects -fstack-protector-strong -cpp  -fallow-invalid-boz -fallow-argument-mismatch -O2
MPICH FC:       gfortran -O2 -ffile-prefix-map=/build/mpich-0xgrG5/mpich-4.0=. -flto=auto -ffat-lto-objects -flto=auto -ffat-lto-objects -fstack-protector-strong  -fallow-invalid-boz -fallow-argument-mismatch -g -O2 -ffile-prefix-map=/build/mpich-0xgrG5/mpich-4.0=. -flto=auto -ffat-lto-objects -flto=auto -ffat-lto-objects -fstack-protector-strong -cpp  -fallow-invalid-boz -fallow-argument-mismatch -O2
MPICH Custom Information:

@hzhou
Contributor

hzhou commented Jan 7, 2025

I noticed the numbers in my run do not match the ones in the log. Mine starts with:

 MultiNest Warning: no resume file found, starting from scratch
 *****************************************************
 MultiNest v3.10
 Copyright Farhan Feroz & Mike Hobson
 Release Jul 2015

 no. of live points =  400
 dimensionality =    2
 *****************************************************
 Starting MultiNest
 generating live points
 live points generated, starting sampling
Acceptance Rate:                        0.986842
Replacements:                                450
Total Samples:                               456
Nested Sampling ln(Z):                  0.158934
Importance Nested Sampling ln(Z):     234.734213 +/-  0.998902
Acceptance Rate:                        0.967118
Replacements:                                500
Total Samples:                               517
Nested Sampling ln(Z):                  5.610221
Importance Nested Sampling ln(Z):     236.110960 +/-  0.808323
Acceptance Rate:                        0.941781
Replacements:                                550
Total Samples:                               584
Nested Sampling ln(Z):                 11.274311
Importance Nested Sampling ln(Z):     236.352423 +/-  0.639686
...

Are the testing data randomly generated?

I also noticed this in the test log:


Run mpiexec -np 4 python pymultinest_demo_minimal.py
 MultiNest Warning: no resume file found, starting from scratch
 *****************************************************
 MultiNest v3.10
 Copyright Farhan Feroz & Mike Hobson
 Release Jul 2015

 no. of live points =  400
 dimensionality =    2
 *****************************************************
 Starting MultiNest
 generating live points
 live points generated, starting sampling
Acceptance Rate:                        0.991189
Replacements:                                450
Total Samples:                               454
Nested Sampling ln(Z):                  0.681179
Importance Nested Sampling ln(Z):     235.386496 +/-  0.689609
 *****************************************************
 MultiNest v3.10
At line 460 of file /home/runner/work/PyMultiNest/PyMultiNest/MultiNest/src/nested.F90 (unit = 57, file = 'chains/1-phys_live.points')
 Copyright Farhan Feroz & Mike Hobson
Fortran runtime error: End of file
 Release Jul 2015
 ...

Is this "end of file" error suspicious?

@JohannesBuchner
Author

Yes, there is randomness to it (it is a Monte Carlo algorithm).

Yes, the end-of-file error is very strange; I have not seen it before. It seems it cannot read/write to the file system?
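A quick way to check would be something like this before the run (just a diagnostic sketch, not part of the demo):

import os
import tempfile

os.makedirs("chains", exist_ok=True)
# try creating a temporary file inside chains/ to confirm it is writable
with tempfile.NamedTemporaryFile(dir="chains") as f:
    f.write(b"test")
print("chains/ exists and is writable")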

@hzhou
Contributor

hzhou commented Jan 7, 2025

Is there a disk space limit for the GitHub runner containers? The binary size of MPICH 4.0 grew significantly compared to previous versions, so it is possible that the test ran out of disk space.
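One way to check would be to print the free disk space on the runner before and after the test step, for example (a diagnostic sketch, not something in the current workflow):

import shutil

# shutil.disk_usage reports total/used/free bytes for the filesystem containing the path
total, used, free = shutil.disk_usage(".")
print("total %.1f GB, used %.1f GB, free %.1f GB" % (total / 1e9, used / 1e9, free / 1e9))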

@JohannesBuchner
Author

I don't know.
