Test segmentation fault with version 0.3.27 on RISC-V #4719

Closed
bedroge opened this issue May 28, 2024 · 27 comments · Fixed by #4733

Comments

@bedroge

bedroge commented May 28, 2024

When trying to install OpenBLAS 0.3.27 with EasyBuild on a Starfive VisionFive 2 RISC-V development board, I'm getting:

 Real BLAS Test Program Results


 Test of subprogram number  1             SDOT 
                                    ----- PASS -----

 Test of subprogram number  2            SAXPY 
                                    ----- PASS -----

 Test of subprogram number  3            SROTG 
                                    ----- PASS -----

 Test of subprogram number  4             SROT 
                                    ----- PASS -----

 Test of subprogram number  5            SCOPY 
                                    ----- PASS -----

 Test of subprogram number  6            SSWAP 
                                    ----- PASS -----

 Test of subprogram number  7            SNRM2 
                                    ----- PASS -----

 Test of subprogram number  8            SASUM 
                                    ----- PASS -----

 Test of subprogram number  9            SSCAL 
                                    ----- PASS -----

 Test of subprogram number 10            ISAMAX
                                    ----- PASS -----

 Test of subprogram number 11            SROTMG
                                    ----- PASS -----

 Test of subprogram number 12            SROTM 
                                    ----- PASS -----

 Test of subprogram number 13            SDSDOT
                                    ----- PASS -----

Program received signal SIGSEGV: Segmentation fault - invalid memory reference.

Backtrace for this error:

Program received signal SIGABRT: Process abort signal.

Backtrace for this error:

Program received signal SIGABRT: Process abort signal.

Backtrace for this error:

Program received signal SIGTRAP: Trace/breakpoint trap.

Backtrace for this error:
make[1]: *** [Makefile:48: level1] Error 127
make[1]: Leaving directory '/nvme/eb/build/OpenBLAS/0.3.27/GCC-13.3.0/OpenBLAS-0.3.27/test'
make: *** [Makefile:167: tests] Error 2

cc @SebastianAchilles who encountered the same error on a SiFive HiFive Unmatched board.

@martin-frbg
Collaborator

Strange, as the CPU on the VisionFive 2 does not provide RVV support (as far as I know), so the build should be using the "generic" plain C kernels. Which compiler and build options did you use?

@bedroge
Author

bedroge commented May 28, 2024

It's using GCC 13.3.0 with -O2 -ftree-vectorize -march=rv64gc -mabi=lp64d -fno-math-errno.

@martin-frbg
Collaborator

Not reproducible with GCC 13.2.0 and default build options (which include -march=rv64imafdc -mabi=lp64d)

@bedroge
Author

bedroge commented May 28, 2024

I've just started a build with EasyBuild and GCC 13.2.0 as well, let's see if that does work.

@martin-frbg
Collaborator

I'm guessing it may be either the -ftree-vectorize or the way you are passing the build options... I trust that you are using identical versions of gcc and gfortran

@bedroge
Author

bedroge commented May 28, 2024

Using GCC 13.2.0 instead of 13.3.0 didn't make a difference, so I'll try changing those flags (the same ones did work fine with version 0.3.24).

@martin-frbg
Collaborator

I also did not get a segfault with all the flags that you mentioned (including -ftree-vectorize) added to the COMMON_OPT line in Makefile.rule. The only caveat is that I am currently testing with the develop branch instead of 0.3.27, but I do not recall any change that would make a difference here. Checking out 0.3.27 now for a repeat of the test; this may take some time as I'm using the shared hardware of the GCC Compile Farm.

@martin-frbg
Collaborator

Still not reproducible with 0.3.27. I think there must be something else peculiar to your build or hardware. (Note that SDSDOT is actually the last test run in sblat1; is it this test or one of the others in the test folder that appears to be failing for you?)

@bedroge
Author

bedroge commented May 31, 2024

I've now tried it without EasyBuild, and instead just used the GCC of my OS:

user@starfive:/nvme/openblas/OpenBLAS-0.3.27$ gcc -v | grep version
Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/usr/lib/gcc/riscv64-linux-gnu/12/lto-wrapper
Target: riscv64-linux-gnu
Configured with: ../src/configure -v --with-pkgversion='Debian 12.2.0-10' --with-bugurl=file:///usr/share/doc/gcc-12/README.Bugs --enable-languages=c,ada,c++,go,d,fortran,objc,obj-c++,m2 --prefix=/usr --with-gcc-major-version-only --program-suffix=-12 --program-prefix=riscv64-linux-gnu- --enable-shared --enable-linker-build-id --libexecdir=/usr/lib --without-included-gettext --enable-threads=posix --libdir=/usr/lib --enable-nls --enable-clocale=gnu --enable-libstdcxx-debug --enable-libstdcxx-time=yes --with-default-libstdcxx-abi=new --enable-gnu-unique-object --disable-libitm --disable-libquadmath --disable-libquadmath-support --enable-plugin --enable-default-pie --with-system-zlib --enable-libphobos-checking=release --with-target-system-zlib=auto --enable-objc-gc=auto --enable-multiarch --disable-werror --disable-multilib --with-arch=rv64gc --with-abi=lp64d --enable-checking=release --build=riscv64-linux-gnu --host=riscv64-linux-gnu --target=riscv64-linux-gnu
Thread model: posix
Supported LTO compression algorithms: zlib zstd
gcc version 12.2.0 (Debian 12.2.0-10) 

user@starfive:/nvme/openblas/OpenBLAS-0.3.27$ gfortran -v 
Using built-in specs.
COLLECT_GCC=gfortran
COLLECT_LTO_WRAPPER=/usr/lib/gcc/riscv64-linux-gnu/12/lto-wrapper
Target: riscv64-linux-gnu
Configured with: ../src/configure -v --with-pkgversion='Debian 12.2.0-10' --with-bugurl=file:///usr/share/doc/gcc-12/README.Bugs --enable-languages=c,ada,c++,go,d,fortran,objc,obj-c++,m2 --prefix=/usr --with-gcc-major-version-only --program-suffix=-12 --program-prefix=riscv64-linux-gnu- --enable-shared --enable-linker-build-id --libexecdir=/usr/lib --without-included-gettext --enable-threads=posix --libdir=/usr/lib --enable-nls --enable-clocale=gnu --enable-libstdcxx-debug --enable-libstdcxx-time=yes --with-default-libstdcxx-abi=new --enable-gnu-unique-object --disable-libitm --disable-libquadmath --disable-libquadmath-support --enable-plugin --enable-default-pie --with-system-zlib --enable-libphobos-checking=release --with-target-system-zlib=auto --enable-objc-gc=auto --enable-multiarch --disable-werror --disable-multilib --with-arch=rv64gc --with-abi=lp64d --enable-checking=release --build=riscv64-linux-gnu --host=riscv64-linux-gnu --target=riscv64-linux-gnu
Thread model: posix
Supported LTO compression algorithms: zlib zstd
gcc version 12.2.0 (Debian 12.2.0-10) 

And I compiled OpenBLAS using similar commands as EasyBuild would normally use:

make -j 4 shared  BINARY='64'  CC='gcc'  FC='gfortran'  MAKE_NB_JOBS='-1'  USE_OPENMP='1'  USE_THREAD='1'  CFLAGS='-O2'
make tests  BINARY='64'  CC='gcc'  FC='gfortran'  MAKE_NB_JOBS='-1'  USE_OPENMP='1'  USE_THREAD='1'

The latter fails with:

make[1]: Entering directory '/nvme/openblas/OpenBLAS-0.3.27/test'
gfortran -O2 -Wall -frecursive -fno-optimize-sibling-calls -fopenmp -fPIC -march=rv64imafdc -mabi=lp64d -static -fno-tree-vectorize -c sblat1.f  -o sblat1.o
gfortran -O2 -Wall -frecursive -fno-optimize-sibling-calls -fopenmp  -march=rv64imafdc -mabi=lp64d -static -fno-tree-vectorize  -o sblat1 sblat1.o ../libopenblas_riscv64_genericp-r0.3.27.a -lm -lpthread -lgfortran -lm -lpthread -lgfortran -L/usr/lib/gcc/riscv64-linux-gnu/12 -L/lib/riscv64-linux-gnu -L/usr/lib/riscv64-linux-gnu  -lpthread -lc -latomic 
/usr/bin/ld: /usr/lib/gcc/riscv64-linux-gnu/12/libgfortran.a(fpu.o): in function `.L0 ':
(.text._gfortrani_set_fpu_trap_exceptions+0x82): warning: fedisableexcept is not implemented and will always fail
/usr/bin/ld: (.text._gfortrani_set_fpu_trap_exceptions+0x70): warning: feenableexcept is not implemented and will always fail
/usr/bin/ld: (.text._gfortrani_get_fpu_trap_exceptions+0x4): warning: fegetexcept is not implemented and will always fail
/usr/bin/ld: /usr/lib/gcc/riscv64-linux-gnu/12/libgomp.a(oacc-profiling.o): in function `.L0 ':
(.text+0x7da): warning: Using 'dlopen' in statically linked applications requires at runtime the shared libraries from the glibc version used for linking
gfortran -O2 -Wall -frecursive -fno-optimize-sibling-calls -fopenmp -fPIC -march=rv64imafdc -mabi=lp64d -static -fno-tree-vectorize -c dblat1.f  -o dblat1.o
gfortran -O2 -Wall -frecursive -fno-optimize-sibling-calls -fopenmp  -march=rv64imafdc -mabi=lp64d -static -fno-tree-vectorize  -o dblat1 dblat1.o ../libopenblas_riscv64_genericp-r0.3.27.a -lm -lpthread -lgfortran -lm -lpthread -lgfortran -L/usr/lib/gcc/riscv64-linux-gnu/12 -L/lib/riscv64-linux-gnu -L/usr/lib/riscv64-linux-gnu  -lpthread -lc -latomic 
/usr/bin/ld: /usr/lib/gcc/riscv64-linux-gnu/12/libgfortran.a(fpu.o): in function `.L0 ':
(.text._gfortrani_set_fpu_trap_exceptions+0x82): warning: fedisableexcept is not implemented and will always fail
/usr/bin/ld: (.text._gfortrani_set_fpu_trap_exceptions+0x70): warning: feenableexcept is not implemented and will always fail
/usr/bin/ld: (.text._gfortrani_get_fpu_trap_exceptions+0x4): warning: fegetexcept is not implemented and will always fail
/usr/bin/ld: /usr/lib/gcc/riscv64-linux-gnu/12/libgomp.a(oacc-profiling.o): in function `.L0 ':
(.text+0x7da): warning: Using 'dlopen' in statically linked applications requires at runtime the shared libraries from the glibc version used for linking
gfortran -O2 -Wall -frecursive -fno-optimize-sibling-calls -fopenmp -fPIC -march=rv64imafdc -mabi=lp64d -static -fno-tree-vectorize -c cblat1.f  -o cblat1.o
gfortran -O2 -Wall -frecursive -fno-optimize-sibling-calls -fopenmp  -march=rv64imafdc -mabi=lp64d -static -fno-tree-vectorize  -o cblat1 cblat1.o ../libopenblas_riscv64_genericp-r0.3.27.a -lm -lpthread -lgfortran -lm -lpthread -lgfortran -L/usr/lib/gcc/riscv64-linux-gnu/12 -L/lib/riscv64-linux-gnu -L/usr/lib/riscv64-linux-gnu  -lpthread -lc -latomic 
/usr/bin/ld: /usr/lib/gcc/riscv64-linux-gnu/12/libgfortran.a(fpu.o): in function `.L0 ':
(.text._gfortrani_set_fpu_trap_exceptions+0x82): warning: fedisableexcept is not implemented and will always fail
/usr/bin/ld: (.text._gfortrani_set_fpu_trap_exceptions+0x70): warning: feenableexcept is not implemented and will always fail
/usr/bin/ld: (.text._gfortrani_get_fpu_trap_exceptions+0x4): warning: fegetexcept is not implemented and will always fail
/usr/bin/ld: /usr/lib/gcc/riscv64-linux-gnu/12/libgomp.a(oacc-profiling.o): in function `.L0 ':
(.text+0x7da): warning: Using 'dlopen' in statically linked applications requires at runtime the shared libraries from the glibc version used for linking
gfortran -O2 -Wall -frecursive -fno-optimize-sibling-calls -fopenmp -fPIC -march=rv64imafdc -mabi=lp64d -static -fno-tree-vectorize -c zblat1.f  -o zblat1.o
gfortran -O2 -Wall -frecursive -fno-optimize-sibling-calls -fopenmp  -march=rv64imafdc -mabi=lp64d -static -fno-tree-vectorize  -o zblat1 zblat1.o ../libopenblas_riscv64_genericp-r0.3.27.a -lm -lpthread -lgfortran -lm -lpthread -lgfortran -L/usr/lib/gcc/riscv64-linux-gnu/12 -L/lib/riscv64-linux-gnu -L/usr/lib/riscv64-linux-gnu  -lpthread -lc -latomic 
/usr/bin/ld: /usr/lib/gcc/riscv64-linux-gnu/12/libgfortran.a(fpu.o): in function `.L0 ':
(.text._gfortrani_set_fpu_trap_exceptions+0x82): warning: fedisableexcept is not implemented and will always fail
/usr/bin/ld: (.text._gfortrani_set_fpu_trap_exceptions+0x70): warning: feenableexcept is not implemented and will always fail
/usr/bin/ld: (.text._gfortrani_get_fpu_trap_exceptions+0x4): warning: fegetexcept is not implemented and will always fail
/usr/bin/ld: /usr/lib/gcc/riscv64-linux-gnu/12/libgomp.a(oacc-profiling.o): in function `.L0 ':
(.text+0x7da): warning: Using 'dlopen' in statically linked applications requires at runtime the shared libraries from the glibc version used for linking
OPENBLAS_NUM_THREADS=1 OMP_NUM_THREADS=1 ./sblat1
 Real BLAS Test Program Results


 Test of subprogram number  1             SDOT 
                                    ----- PASS -----

 Test of subprogram number  2            SAXPY 
                                    ----- PASS -----

 Test of subprogram number  3            SROTG 
                                    ----- PASS -----

 Test of subprogram number  4             SROT 
                                    ----- PASS -----

 Test of subprogram number  5            SCOPY 
                                    ----- PASS -----

 Test of subprogram number  6            SSWAP 
                                    ----- PASS -----

 Test of subprogram number  7            SNRM2 
                                    ----- PASS -----

 Test of subprogram number  8            SASUM 
                                    ----- PASS -----

 Test of subprogram number  9            SSCAL 
                                    ----- PASS -----

 Test of subprogram number 10            ISAMAX
                                    ----- PASS -----

 Test of subprogram number 11            SROTMG
                                    ----- PASS -----

 Test of subprogram number 12            SROTM 
                                    ----- PASS -----

 Test of subprogram number 13            SDSDOT
                                    ----- PASS -----

Program received signal SIGSEGV: Segmentation fault - invalid memory reference.

Backtrace for this error:
Segmentation fault
make[1]: *** [Makefile:48: level1] Error 139
make[1]: Leaving directory '/nvme/openblas/OpenBLAS-0.3.27/test'
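
For readers comparing the two logs: make reports `Error 139` here but `Error 127` in the EasyBuild run. By shell convention, a status above 128 means the process was killed by signal (status - 128). The helper below is a minimal sketch of that arithmetic; the name `decode_status` is made up for illustration and is not part of OpenBLAS or make.

```shell
# Decode a shell exit status. Values above 128 conventionally mean
# "terminated by signal (status - 128)"; anything else is a plain exit code.
# decode_status is a hypothetical helper name used only in this sketch.
decode_status() {
    status=$1
    if [ "$status" -gt 128 ]; then
        echo "signal $((status - 128))"
    else
        echo "exit $status"
    fi
}

decode_status 139   # status reported for ./sblat1 in the log above
decode_status 127   # status reported in the first (EasyBuild) log
```

The first call prints `signal 11`, which is SIGSEGV on Linux and matches the "Segmentation fault" message. The second prints `exit 127`, which is an ordinary exit code rather than a signal.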

@bedroge
Author

bedroge commented May 31, 2024

I've just tried the exact same thing for version 0.3.26, and that does work fine.

@martin-frbg
Collaborator

Rerunning my build in the gcc compile farm now - maybe it is the USE_OPENMP=1 that makes the difference, I did not notice that in your initial report. (There have been some OpenMP-related changes in 0.3.27, but on the other hand they should be affecting all platforms)

@bedroge
Author

bedroge commented May 31, 2024

It looks like that's indeed causing the failures. I've tried the same thing now with 0.3.27 and USE_OPENMP=0, and then everything works fine.

@martin-frbg
Collaborator

Reproduced, but gdb's backtraces only lead to the implementation of the CLOSE() function in the Fortran runtime library. (I guess this could still hint at a memory management problem, but sadly that gccfarm machine does not appear to have valgrind installed). The C-only openblas_utest and openblas_utest_ext run without errors.

@martin-frbg
Collaborator

Bisecting now, but this will probably take me until tomorrow due to real life.

@martin-frbg
Collaborator

martin-frbg commented Jun 3, 2024

Bisected to

bef47917bd72f35c151038fee0cf485445476863 is the first bad commit
commit bef47917bd72f35c151038fee0cf485445476863
Author: Heller Zheng <[email protected]>
Date:   Tue Nov 15 00:06:25 2022 -0800

    Initial version for riscv sifive x280

not sure yet if that is actually true (RISCV64_GENERIC does not use any of the *_rvv.c kernel files added by that PR, but compiler flags for TARGET=RISCV64_GENERIC were introduced at the same time as well). Also this used to live on the risc-v branch for more than a year, so I imagine it could just be the merge in general that went wrong.
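
For anyone repeating the bisection, here is a rough sketch of how a step could be scored for `git bisect run`. It is not necessarily how the bisection above was done. `git bisect run` treats exit code 0 as good, 125 as "skip this commit", and other codes from 1 to 127 as bad; codes above 127 abort the bisection, so a raw segfault status such as 139 has to be mapped down. The function name `bisect_verdict` and the driver outline in the comments are assumptions made for illustration.

```shell
# Map the outcome of one bisection step to an exit code that
# `git bisect run` understands: 0 = good, 125 = skip, 1 = bad.
# bisect_verdict is a hypothetical helper; its two arguments are the exit
# statuses of the build and of the test run for the commit under test.
bisect_verdict() {
    build_status=$1
    test_status=$2
    if [ "$build_status" -ne 0 ]; then
        echo 125    # commit does not build: ask git bisect to skip it
    elif [ "$test_status" -eq 0 ]; then
        echo 0      # tests pass: good commit
    else
        echo 1      # any test failure, including a crash (e.g. 139): bad
    fi
}

# Hypothetical driver script body for `git bisect run`:
#   make clean >/dev/null 2>&1
#   make -j4 USE_OPENMP=1 >/dev/null 2>&1; b=$?
#   OPENBLAS_NUM_THREADS=1 OMP_NUM_THREADS=1 ./test/sblat1 >/dev/null 2>&1; t=$?
#   exit "$(bisect_verdict "$b" "$t")"
```

Collapsing every failing status to 1 keeps the script usable even when the test binary is run directly and dies from a signal.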

@martin-frbg
Collaborator

Seems to be the static linking of libgfortran imposed by that PR (in Makefile.riscv64) that is causing the segfault. I do not recall a reason being given for why it must be -static @HellerZheng ?
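
A quick way to tell from a build log whether a given link step is affected is to look for a bare `-static` token on the compiler command line, as seen in the gfortran lines of the failing log earlier in this thread. The sketch below does only that; `has_static_flag` is a hypothetical helper name, and it inspects a logged command line rather than the Makefile itself.

```shell
# Return success (0) if a logged compiler command line contains the exact
# token "-static". Related options such as -static-libgfortran do not count.
# has_static_flag is a hypothetical helper used only in this sketch.
has_static_flag() {
    for word in $1; do
        if [ "$word" = "-static" ]; then
            return 0
        fi
    done
    return 1
}

# Abbreviated from the failing log in this thread:
line='gfortran -O2 -fopenmp -march=rv64imafdc -mabi=lp64d -static -fno-tree-vectorize -o sblat1 sblat1.o'
if has_static_flag "$line"; then
    echo "link line uses -static"
fi
```

Matching whole tokens avoids false positives from options that merely start with `-static`.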

@HellerZheng
Contributor

Hi Martin, it was added a long time ago, before I got involved in this project. I guess it was copied from the C910V case.

@martin-frbg
Collaborator

Thank you for the quick response - I guess it may have helped with cross-compilation in the early days, but now it is probably best to remove it. (I will try to check the C910V case as soon as I get a rvv-0.71 capable gcc fork to build on my hardware)

@bedroge
Author

bedroge commented Jun 4, 2024

I've tried the fix from #4733, and it does indeed solve the segmentation fault. I'm not sure if it's in any way related to this issue, but I do see quite a lot of failing LAPACK tests:

                        -->   LAPACK TESTING SUMMARY  <--
SUMMARY                 nb test run     numerical error         other error  
================        ===========     =================       ================  
REAL                    1561872         1       (0.000%)        0       (0.000%)        
DOUBLE PRECISION        1329105         36885   (2.775%)        0       (0.000%)        
COMPLEX                 1025645         1       (0.000%)        0       (0.000%)        
COMPLEX16               1030797         0       (0.000%)        0       (0.000%)        

--> ALL PRECISIONS      4947419         36887   (0.746%)        0       (0.000%)        

@martin-frbg
Collaborator

Hmm, that looks seriously weird (although without knowing the magnitude of the individual errors it could still be harmless - as it is meant to be the internal testsuite of the reference implementation, it normally assumes the unoptimized reference BLAS to be used). Lots of deviations for one particular precision does look suspect, I'll see if I can reproduce this in the gcc compile farm.

@martin-frbg
Collaborator

gcc12 build only has the single errors in REAL and COMPLEX, now trying gcc13

@martin-frbg
Collaborator

wait, if you are building 0.3.27 the test errors are almost certainly coming from the lapack testsuite bug that has since been fixed by PR #4647

@bedroge
Author

bedroge commented Jun 5, 2024

Awesome, that solved it:

                        -->   LAPACK TESTING SUMMARY  <--
SUMMARY                 nb test run     numerical error         other error
================        ===========     =================       ================
REAL                    1561872         1       (0.000%)        0       (0.000%)
DOUBLE PRECISION        1570470         0       (0.000%)        0       (0.000%)
COMPLEX                 1025645         1       (0.000%)        0       (0.000%)
COMPLEX16               1030797         0       (0.000%)        0       (0.000%)

--> ALL PRECISIONS      5188784         2       (0.000%)        0       (0.000%)

Thanks a lot for the quick replies and solutions!

@peterzhuamazon

Hi @martin-frbg ,

We saw a similar issue on AL2 on x86_64 with 0.3.27:
opensearch-project/opensearch-build#5226 (comment)

I tried the latest 0.3.28 and am still having issues.

Using gcc 13.2 (manually compiled with ../configure --enable-languages=all --prefix=/usr/local --disable-multilib --disable-bootstrap) with these flags:

make USE_OPENMP=1 FC=gfortran DYNAMIC_ARCH=1 CXX=g++

Thanks.

@martin-frbg
Collaborator

@peterzhuamazon not sure your case is related in any way - what hardware is it running/crashing on ? (If in doubt, you can use export OPENBLAS_VERBOSE=2 with DYNAMIC_ARCH builds to have them tell you the cpu they detected). And would it be possible for you to test a checkout of the current develop branch rather than 0.3.28 ?
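
A small sketch of how the `OPENBLAS_VERBOSE=2` suggestion might be used from a shell. It assumes the library announces the detected CPU on a line of the form `Core: <name>`; that format, the helper name `detected_core`, and the program name `./my_program` are all assumptions made for illustration, so check them against the actual output of your build.

```shell
# Extract the detected core name from OpenBLAS verbose output read on stdin.
# Assumes a line of the form "Core: <name>" (an assumption, see above).
# detected_core is a hypothetical helper used only in this sketch.
detected_core() {
    sed -n 's/^Core: *//p' | head -n 1
}

# Typical use; ./my_program is a placeholder for any binary linked against
# a DYNAMIC_ARCH build of OpenBLAS. The redirections send stderr into the
# pipe and discard the program's normal output:
#   OPENBLAS_VERBOSE=2 ./my_program 2>&1 >/dev/null | detected_core
```

If nothing is printed, the line format probably differs from the assumed one, and inspecting the raw verbose output directly is the safer option.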

@peterzhuamazon

peterzhuamazon commented Jan 9, 2025

Hi @martin-frbg, upon testing, the develop branch (a588ea9) compiled with no errors at all.
But it is a moving target, so I probably need to find out which commit fixed it, or whether there will be a new OpenBLAS release soon.

Ignore: actually that is compiled on gcc10.

@peterzhuamazon

It failed at the same place with the SkylakeX core on x86_64:
opensearch-project/opensearch-build#5226 (comment)
