Python Segfault on MacOS #1720

Closed
drpjm opened this issue Feb 7, 2024 · 38 comments

Comments

@drpjm

drpjm commented Feb 7, 2024

Description

I was working through the robotics textbook section on MAP estimation with multiple sensors. When computing the unnormalized posterior from a DiscreteConditional likelihood, I get a segfault.

This is running in Python 3.11.6 on macOS 14.3. macOS reports the following:


Thread 0 Crashed::  Dispatch queue: com.apple.main-thread
0   libgtsam.4.2.0.dylib          	       0x106085dd8 gtsam::DecisionTree<unsigned long long, double>::Choice::choose(unsigned long long const&, unsigned long) const + 96
1   libgtsam.4.2.0.dylib          	       0x10609f210 gtsam::DiscreteConditional::likelihood(gtsam::DiscreteValues const&) const + 220
2   libgtsam.4.2.0.dylib          	       0x10609f6d4 gtsam::DiscreteConditional::likelihood(unsigned long) const + 136
3   gtsam.cpython-311-darwin.so   	       0x106accfa8 0x106980000 + 1363880
4   gtsam.cpython-311-darwin.so   	       0x106996400 0x106980000 + 91136
5   Python                        	       0x1056425b8 cfunction_call + 60
6   Python                        	       0x1055f78f4 _PyObject_MakeTpCall + 128
7   Python                        	       0x1056d5984 _PyEval_EvalFrameDefault + 42108
8   Python                        	       0x1056ca8c4 PyEval_EvalCode + 168
9   Python                        	       0x1057213f0 run_eval_code_obj + 84
10  Python                        	       0x105721354 run_mod + 112
11  Python                        	       0x105721194 pyrun_file + 148
12  Python                        	       0x105720be4 _PyRun_SimpleFileObject + 268
13  Python                        	       0x10572057c _PyRun_AnyFileObject + 216
14  Python                        	       0x10573d164 pymain_run_file_obj + 220
15  Python                        	       0x10573caa4 pymain_run_file + 72
16  Python                        	       0x10573c384 Py_RunMain + 704
17  Python                        	       0x10573d4c0 Py_BytesMain + 40
18  dyld                          	       0x1864f10e0 start + 2360

Steps to reproduce

I am running this in a Python script, not a Jupyter notebook. I have a conductivity sensor based on the DiscreteConditional in the robotics textbook in Chapter 2.4.4.

The segfault occurs when I run something similar to the example in Chapter 2.4.10.

posterior = conductivity_factor * detector_factor * weight_factor * category_prior

Expected behavior

I would expect the posterior to be computed without crashing when multiplying out the likelihood factors and the prior. When I use a DecisionTreeFactor to represent a continuous sensor model, this crash does not occur. So it appears that there is a problem with the DiscreteConditional Python object when using the * operator. It looks like it happens for any combination of the DiscreteConditional or DecisionTreeFactor.

Environment

Python 3.11.6, Mac OSX 14.3 with Apple silicon (M2)

@ProfFan
Collaborator

ProfFan commented Mar 17, 2024

Hi @drpjm is this from PyPI or compiled from main?

@drpjm
Author

drpjm commented Apr 26, 2024

@ProfFan I tried to build from source and also use PyPI.

@dellaert
Member

Coming very late to this conversation. I did most of the book using Python 3.9, and there all tests succeed. But I am seeing segfaults with 3.12. I will try 3.10 and then 3.11, and see whether I can track down the issue.

@dellaert
Member

Python 3.10 works (at least all tests pass without segfault)

@dellaert
Member

OK, repro with Python 3.11.9:

(py311) (gtbook) FranksVrdantMac:build dellaert$ make python-test
[ 17%] Built target cephes-gtsam
[ 32%] Built target metis-gtsam
[ 76%] Built target gtsam
[ 76%] Built target gtsam_unstable_header
[ 76%] Built target pybind_wrap_gtsam_unstable
[ 85%] Built target gtsam_unstable
[ 85%] Built target gtsam_unstable_py
[ 85%] Built target gtsam_header
[ 91%] Built target pybind_wrap_gtsam
[100%] Built target gtsam_py
Segmentation fault

@dellaert
Member

@ProfFan @varunagrawal any ideas? Maybe we need to upgrade pybind?

@ProfFan
Collaborator

ProfFan commented Aug 17, 2024

Might need to run the thing within LLDB and see what is happening
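
As a lighter-weight complement to LLDB, Python's standard `faulthandler` module can print the Python-level traceback at the moment of a fatal signal, which at least shows which Python call triggered the crash. The sketch below is not from this thread; the helper name `enable_crash_traceback` and the log file name are hypothetical.

```python
import faulthandler


def enable_crash_traceback(log_path="crash_traceback.log"):
    """Write Python tracebacks to log_path if the process gets a fatal signal.

    The helper name and default path are illustrative assumptions.
    """
    # The file must stay open for the lifetime of the process, because
    # faulthandler writes to its file descriptor from the signal handler.
    log_file = open(log_path, "w")
    # Covers SIGSEGV, SIGFPE, SIGABRT, SIGBUS and SIGILL.
    faulthandler.enable(file=log_file, all_threads=True)
    return log_file
```

Calling `enable_crash_traceback()` at the top of the failing script, before importing the extension module, should be enough. Without editing the script, the same handler can be turned on with `python -X faulthandler script.py` or by setting the `PYTHONFAULTHANDLER` environment variable; in that case the traceback goes to stderr.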

@dellaert
Member

Would you be willing to upgrade pybind and give it a try?

@dellaert
Member

I forget exactly where to do it. Please put me on the review so I can do it the next time.

@varunagrawal
Collaborator

I had upgraded Pybind11 2 months ago

borglab/wrap#166

I'll take a closer look later today.

@varunagrawal
Collaborator

varunagrawal commented Aug 17, 2024

My quick recommendation would be to try upgrading to numpy 2.0.0? IIRC there is backwards compatibility with numpy V1, but the symptoms described indicate that maybe numpy 2.0.0 is already being used and it's the latest gtsam python build that needs to be used.

@drpjm can you please report your numpy version here? You can get it with pip show numpy
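
For anyone reporting on this issue, a short script can gather the interpreter, platform, and package versions in one go. This is only a sketch: the function name `report_versions` is made up for illustration, and it assumes the packages are registered under the distribution names `numpy` and `gtsam`.

```python
import platform
from importlib import metadata


def report_versions(packages=("numpy", "gtsam")):
    """Return interpreter, platform, and installed package versions as a dict."""
    info = {
        "python": platform.python_version(),
        "platform": platform.platform(),
        "machine": platform.machine(),  # e.g. arm64 on Apple silicon
    }
    for name in packages:
        try:
            info[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # The distribution is not installed in this environment.
            info[name] = "not installed"
    return info


if __name__ == "__main__":
    for key, value in report_versions().items():
        print(f"{key}: {value}")
```

Running it inside the same environment that segfaults matters, since conda and system interpreters can carry different numpy builds.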

@dellaert
Member

Cool, thanks @varunagrawal. Could you also tell me the PR where this version of wrap was then included into GTSAM? (Submodule or subtree? I forget.)

@varunagrawal
Collaborator

Here you go: #1773

@varunagrawal
Collaborator

@drpjm I re-ran the current version of S24_sorter_perception.ipynb of the book with the latest version of GTSAM and I am unable to reproduce the issue.
You mention you are running this in a script. Can you please share the script?

@varunagrawal
Collaborator

Haven't heard back from @drpjm so I will close this for now since I can't reproduce this. If you're still having issues, please feel free to reopen.

@drpjm
Author

drpjm commented Aug 23, 2024

@varunagrawal I've been very busy and had to track down the code that segfaults. I can add you as a collaborator to try it out. I was using numpy 1.26.2 when the script was written; I just tested it again and it produced a segfault.

@dellaert
Member

Wait, @varunagrawal - I have reproduced the segfaults with Python 3.11.9, so I'm re-opening.

@dellaert dellaert reopened this Aug 24, 2024
@dellaert
Member

I am running with numpy 2.0.1, still segfaults:

(py311) $ /Users/dellaert/mambaforge/envs/py311/bin/python /Users/dellaert/git/github/python/gtsam/tests/test_Factors.py
.Segmentation fault: 11
(py311) $ pip show numpy | grep Version
Version: 2.0.1

@dellaert
Member

@ProfFan or @varunagrawal, with lldb I get the output below, which is mildly useless. I get unnamed symbols even when compiling GTSAM with Debug. Is that flag propagated correctly to wrap?

(lldb) run
Process 53926 launched: '/Users/dellaert/mambaforge/envs/py311/bin/python' (arm64)
.Process 53926 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0x0)
    frame #0: 0x0000000000000000
error: memory read failed for 0x0
Target 0: (python) stopped.
(lldb) bt
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0x0)
  * frame #0: 0x0000000000000000
    frame #1: 0x0000000101b95ac8 gtsam.cpython-311-darwin.so`___lldb_unnamed_symbol11562 + 396
    frame #2: 0x0000000101f2bb9c gtsam.cpython-311-darwin.so`___lldb_unnamed_symbol31238 + 128
    frame #3: 0x000000010199c6ec gtsam.cpython-311-darwin.so`___lldb_unnamed_symbol1632 + 4756
    frame #4: 0x00000001000b807c python`cfunction_call + 124
    frame #5: 0x0000000100060788 python`_PyObject_MakeTpCall + 332
    frame #6: 0x0000000100162930 python`_PyEval_EvalFrameDefault + 46152
    frame #7: 0x0000000100166f0c python`_PyEval_Vector + 184
    frame #8: 0x00000001000643a4 python`method_vectorcall + 520
    frame #9: 0x0000000100164c60 python`_PyEval_EvalFrameDefault + 55160
    frame #10: 0x0000000100166f0c python`_PyEval_Vector + 184
    frame #11: 0x00000001000609f4 python`_PyObject_FastCallDictTstate + 320
    frame #12: 0x0000000100061884 python`_PyObject_Call_Prepend + 176
    frame #13: 0x00000001000db2c0 python`slot_tp_call + 172
    frame #14: 0x0000000100060788 python`_PyObject_MakeTpCall + 332
    frame #15: 0x0000000100162930 python`_PyEval_EvalFrameDefault + 46152
    frame #16: 0x0000000100166f0c python`_PyEval_Vector + 184
    frame #17: 0x00000001000643a4 python`method_vectorcall + 520
    frame #18: 0x0000000100164c60 python`_PyEval_EvalFrameDefault + 55160
    frame #19: 0x0000000100166f0c python`_PyEval_Vector + 184
    frame #20: 0x00000001000609f4 python`_PyObject_FastCallDictTstate + 320
    frame #21: 0x0000000100061884 python`_PyObject_Call_Prepend + 176
    frame #22: 0x00000001000db2c0 python`slot_tp_call + 172
    frame #23: 0x0000000100060788 python`_PyObject_MakeTpCall + 332
    frame #24: 0x0000000100162930 python`_PyEval_EvalFrameDefault + 46152
    frame #25: 0x0000000100166f0c python`_PyEval_Vector + 184
    frame #26: 0x00000001000643a4 python`method_vectorcall + 520
    frame #27: 0x0000000100164c60 python`_PyEval_EvalFrameDefault + 55160
    frame #28: 0x0000000100166f0c python`_PyEval_Vector + 184
    frame #29: 0x00000001000609f4 python`_PyObject_FastCallDictTstate + 320
    frame #30: 0x0000000100061884 python`_PyObject_Call_Prepend + 176
    frame #31: 0x00000001000db2c0 python`slot_tp_call + 172
    frame #32: 0x0000000100060788 python`_PyObject_MakeTpCall + 332
    frame #33: 0x0000000100162930 python`_PyEval_EvalFrameDefault + 46152
    frame #34: 0x0000000100166f0c python`_PyEval_Vector + 184
    frame #35: 0x00000001000609f4 python`_PyObject_FastCallDictTstate + 320
    frame #36: 0x0000000100061884 python`_PyObject_Call_Prepend + 176
    frame #37: 0x00000001000dc8d8 python`slot_tp_init + 196
    frame #38: 0x00000001000d4ed0 python`type_call + 464
    frame #39: 0x0000000100060788 python`_PyObject_MakeTpCall + 332
    frame #40: 0x0000000100162930 python`_PyEval_EvalFrameDefault + 46152
    frame #41: 0x0000000100156518 python`PyEval_EvalCode + 220
    frame #42: 0x00000001001bc4fc python`run_mod + 144
    frame #43: 0x00000001001bbf5c python`_PyRun_SimpleFileObject + 1260
    frame #44: 0x00000001001bb01c python`_PyRun_AnyFileObject + 240
    frame #45: 0x00000001001e1b30 python`Py_RunMain + 3100
    frame #46: 0x00000001001e2988 python`pymain_main + 1252
    frame #47: 0x0000000100003958 python`main + 56
    frame #48: 0x000000018ee420e0 dyld`start + 2360

@dellaert
Member

OK, after blasting away all my libraries, I have symbols:

test_Factors fails with this, immediately:

* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0x0)
  * frame #0: 0x0000000000000000
    frame #1: 0x000000010347fcc4 gtsam.cpython-311-darwin.so`void boost::archive::detail::common_oarchive<boost::archive::binary_oarchive>::save_override<gtsam::PinholePose<gtsam::Cal3Fisheye> const>(gtsam::PinholePose<gtsam::Cal3Fisheye> const&) + 13944
    frame #2: 0x0000000103e68a58 gtsam.cpython-311-darwin.so`gtsam::NonlinearISAM::getFactorsUnsafe() const + 919916

and test_Cal3Fisheye fails with this:

    frame #1: 0x0000000103cefcc4 gtsam.cpython-311-darwin.so`void boost::archive::detail::common_oarchive<boost::archive::binary_oarchive>::save_override<gtsam::PinholePose<gtsam::Cal3Fisheye> const>(gtsam::PinholePose<gtsam::Cal3Fisheye> const&) + 13944
    frame #2: 0x0000000103dab060 gtsam.cpython-311-darwin.so`gtsam::Pose2::wedge(double, double, double) + 13968
    frame #3: 0x0000000103daadd4 gtsam.cpython-311-darwin.so`gtsam::Pose2::wedge(double, double, double) + 13316
    frame #4: 0x0000000103daac5c gtsam.cpython-311-darwin.so`gtsam::Pose2::wedge(double, double, double) + 12940
    frame #5: 0x0000000103daac14 gtsam.cpython-311-darwin.so`gtsam::Pose2::wedge(double, double, double) + 12868
    frame #6: 0x00000001037bcba4 gtsam.cpython-311-darwin.so`pybind11::error_already_set::restore() + 56908

Both seem to be related to Boost serialization!

@varunagrawal
Collaborator

I set up a 3.11.9 environment on my M1 Mac and I am again not able to repro. :( All tests pass here. Could it be the way Boost is installed? Mine is via Homebrew.

@ProfFan
Collaborator

ProfFan commented Aug 25, 2024

Let me see what I can do, from what I see Python is from mambaforge, 3.11.

@ProfFan
Collaborator

ProfFan commented Aug 25, 2024

Can't reproduce the crash on develop. This is with boost 1.86 (Homebrew), Python 3.11 on conda-forge and numpy 2.0.

However the PyPI version does crash. @dellaert Did you reproduce the crash with develop?

@dellaert
Member

dellaert commented Aug 25, 2024 via email

@dellaert
Member

Here is one possible issue. cmake says:

 pybind11_DIR                    */opt/homebrew/share/cmake/pybind11

so it does not seem to pick up on the pybind included with wrap...

@ProfFan
Collaborator

ProfFan commented Aug 25, 2024

> Here is one possible issue. cmake says:
>
>  pybind11_DIR                    */opt/homebrew/share/cmake/pybind11
>
> so it does not seem to pick up on the pybind included with wrap...

brew info pybind11
==> pybind11: stable 2.13.5 (bottled)

Should be up-to-date enough? What does this say on your computer?

@dellaert
Member

dellaert commented Aug 25, 2024

==> pybind11: stable 2.13.5 (bottled)

But, pybind11 is included in wrap, so the bigger issue is: why does our cmake not use that one. It should not pick up on the brew one, right?

@varunagrawal
Collaborator

Interesting.
Mine says

//Value Computed by CMake
pybind11_BINARY_DIR:STATIC=/Users/varunagrawal/borglab/gtsam/build/python/pybind11

//The directory containing a CMake configuration file for pybind11.
pybind11_DIR:PATH=pybind11_DIR-NOTFOUND

//Value Computed by CMake
pybind11_IS_TOP_LEVEL:STATIC=OFF

//Value Computed by CMake
pybind11_SOURCE_DIR:STATIC=/Users/varunagrawal/borglab/gtsam/wrap/pybind11

which possibly explains the issue.
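
The comparison above can be automated by reading `CMakeCache.txt` from the build directory and checking whether `pybind11_DIR` resolved to an externally installed package. The following is a rough sketch, not part of the GTSAM build; the function names are hypothetical, and the sample text reuses the cache entries quoted in this comment.

```python
def read_cmake_cache(text):
    """Parse CMakeCache.txt content into {name: (type, value)}."""
    entries = {}
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blank lines and the two comment styles used in the cache file.
        if not line or line.startswith(("#", "//")):
            continue
        key, sep, value = line.partition("=")
        if not sep:
            continue
        name, _, kind = key.partition(":")
        entries[name] = (kind, value)
    return entries


def pybind11_origin(entries):
    """Say whether pybind11_DIR points at an external pybind11 install."""
    _, value = entries.get("pybind11_DIR", ("", ""))
    if value and not value.endswith("-NOTFOUND"):
        return f"external: {value}"
    return "not found externally (bundled copy may be in use)"


SAMPLE = """\
//The directory containing a CMake configuration file for pybind11.
pybind11_DIR:PATH=pybind11_DIR-NOTFOUND
//Value Computed by CMake
pybind11_SOURCE_DIR:STATIC=/Users/varunagrawal/borglab/gtsam/wrap/pybind11
"""

if __name__ == "__main__":
    print(pybind11_origin(read_cmake_cache(SAMPLE)))
```

In practice the text would come from something like `open("build/CMakeCache.txt").read()`, where the `build` path is an assumption about the local checkout. An `external:` result pointing at a Homebrew path would match the configuration reported earlier in this thread.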

@varunagrawal
Collaborator

I made a PR since this is easy to fix via CMake. @dellaert can you please try it out?

@dellaert
Member

I'll try. In the meantime I'm also trying to create an M1 CI run, to see if the issue is reproducible on github runners

@dellaert
Member

@drpjm that PR #1812 fixed segfaults on my system. Please check it out and/or close this issue?

@dellaert
Member

Thanks @varunagrawal !

@drpjm
Author

drpjm commented Aug 26, 2024

@dellaert Would I compile from source or install with pip?

@dellaert
Member

Build from source.
ps if you have a minimal repro script I’d love to try it.

@ProfFan
Collaborator

ProfFan commented Aug 26, 2024

> Interesting. Mine says
>
> //Value Computed by CMake
> pybind11_BINARY_DIR:STATIC=/Users/varunagrawal/borglab/gtsam/build/python/pybind11
>
> //The directory containing a CMake configuration file for pybind11.
> pybind11_DIR:PATH=pybind11_DIR-NOTFOUND
>
> //Value Computed by CMake
> pybind11_IS_TOP_LEVEL:STATIC=OFF
>
> //Value Computed by CMake
> pybind11_SOURCE_DIR:STATIC=/Users/varunagrawal/borglab/gtsam/wrap/pybind11
>
> which possibly explains the issue.

I still wonder why this fixed the issue. pybind11 is header-only and these variables look totally legit to me...
Also, I have the same config as @dellaert and cannot reproduce.

@dellaert
Member

dellaert commented Aug 26, 2024 via email

@varunagrawal
Collaborator

Closing as complete.

@diplodocuslongus

Well, I still have the segfault on Linux (Pop!_OS 22.04).
I've tried numpy 2.1.2 (the current default for my Python) and 2.0.0, with Python 3.10.12.

The python/CMakeLists.txt of the latest stable release, GTSAM 4.2, which I am currently using, specifies the path to the pybind11 provided with GTSAM, the same as in PR #1812.

$ py CustomFactorExample.py 
Simulated car trajectory: [0.0, 10.0, 20.0, 30.0, 40.0]
unknowns =  ['x0', 'x1', 'x2', 'x3', 'x4']
Segmentation fault (core dumped)
(gtsam_42) gonze:examples/  $ py PreintegrationExample.py 
Segmentation fault (core dumped)

(I haven't tried all the examples).

I have however successfully been able to run examples if I use the following combination (had to match the python + matplotlib + numpy versions and relative dependencies):

  • numpy 1.22.1 (latest early version compatible with python 3.10)
  • matplotlib 3.8 (version compatible with numpy 1.22.1)

This issue seems to be all about macOS, but it should apply to Linux as well. Yet I still get the segfault unless I downgrade as mentioned.
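
To check all the examples rather than a few by hand, each script can be run in its own subprocess so that a segfault in one does not stop the others. This is a sketch under the assumption of a POSIX system, where a negative return code means the child was killed by a signal; `describe_exit` and `run_scripts` are hypothetical helper names.

```python
import signal
import subprocess
import sys


def describe_exit(returncode):
    """Translate a subprocess return code into a short status string."""
    if returncode == 0:
        return "ok"
    if returncode < 0:
        # On POSIX, -N means the child was terminated by signal N.
        try:
            return f"killed by {signal.Signals(-returncode).name}"
        except ValueError:
            return f"killed by signal {-returncode}"
    return f"exited with code {returncode}"


def run_scripts(paths, timeout=600):
    """Run each script with the current interpreter and collect its status."""
    results = {}
    for path in paths:
        proc = subprocess.run(
            [sys.executable, path], capture_output=True, timeout=timeout
        )
        results[path] = describe_exit(proc.returncode)
    return results
```

For example, `run_scripts(["CustomFactorExample.py", "PreintegrationExample.py"])` would report `killed by SIGSEGV` for any script that crashes the way the two above do, which makes it easier to compare results across numpy versions.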
