[BUG]: In some situations, py::cast returns null instead of raising an exception #4099

ezyang · 2022-07-30T14:13:06Z

Required prerequisites

Make sure you've read the documentation. Your issue may be addressed there.
Search the issue tracker and Discussions to verify that this hasn't already been reported. +1 or comment there if it has.
Consider asking first in the Gitter chat room or in a Discussion.

Problem description

In certain situations I hit the error: TypeError: Unregistered type : c10::intrusive_ptr<c10::SymIntNodeImpl, c10::detail::intrusive_target_default_null_type<c10::SymIntNodeImpl> > when I use py::cast on a custom smart pointer. Ordinarily I expect this to work (and in other contexts it does), but sometimes it does not. Additionally, the documentation specifies that py::cast should always raise an exception on cast failure, but I observe that it instead returns a nullptr and sets the Python error state without actually raising an exception.

I wasn't able to extract a short repro; I do have a full repro but it involves compiling a giant project, LMK if you're interested. The triggering code looks like:

      auto py_symint = py::cast(si.toSymIntNodeImpl()).release().ptr();
      if (!py_symint) throw python_error();

where toSymIntNodeImpl returns a c10::intrusive_ptr<c10::SymIntNodeImpl>. py_symint is null and a Python error is set after calling py::cast. Here is the backtrace at this point:

#0  pybind11::detail::type_caster_generic::src_and_type (src=0x7fffffffad88, 
    cast_type=..., rtti_type=0x0)
    at /data/users/ezyang/pytorch-tmp/cmake/../third_party/pybind11/include/pybind11/detail/type_caster_base.h:788
#1  0x00007fffdf7c16d9 in pybind11::detail::type_caster_base<c10::intrusive_ptr<c10::SymIntNodeImpl, c10::detail::intrusive_target_default_null_type<c10::SymIntNodeImpl> > >::src_and_type (src=0x7fffffffad88)
    at /data/users/ezyang/pytorch-tmp/cmake/../third_party/pybind11/include/pybind11/detail/type_caster_base.h:948
#2  0x00007fffdf7c15af in pybind11::detail::type_caster_base<c10::intrusive_ptr<c10::SymIntNodeImpl, c10::detail::intrusive_target_default_null_type<c10::SymIntNodeImpl> > >::cast (
    src=0x7fffffffad88, policy=pybind11::return_value_policy::move, parent=...)
    at /data/users/ezyang/pytorch-tmp/cmake/../third_party/pybind11/include/pybind11/detail/type_caster_base.h:952
#3  0x00007fffdf7c1570 in pybind11::detail::type_caster_base<c10::intrusive_ptr<c10::SymIntNodeImpl, c10::detail::intrusive_target_default_null_type<c10::SymIntNodeImpl> > >::cast (
    src=..., parent=...)
    at /data/users/ezyang/pytorch-tmp/cmake/../third_party/pybind11/include/pybind11/detail/type_caster_base.h:923
#4  0x00007fffdf7c0b7d in pybind11::cast<c10::intrusive_ptr<c10::SymIntNodeImpl, c10::detail::intrusive_target_default_null_type<c10::SymIntNodeImpl> >, 0> (value=..., 
    policy=pybind11::return_value_policy::move, parent=...)
    at /data/users/ezyang/pytorch-tmp/cmake/../third_party/pybind11/include/pybind11/cast.h:1067
#5  0x00007fffdfc609df in THPSize_NewFromSymSizes (self_=...)
    at /data/users/ezyang/pytorch-tmp/torch/csrc/Size.cpp:60

Stepping through the rest of the execution, lack of type info means type_caster_generic::cast short circuits:

pybind11::detail::type_caster_generic::cast (_src=0x0, policy=pybind11::return_value_policy::move, parent=..., tinfo=0x0,
    copy_constructor=0x7fffdf7c1770 <pybind11::detail::type_caster_base<c10::intrusive_ptr<c10::SymIntNodeImpl, c10::detail::intrusive_target_default_null_type<c10::SymIntNodeImpl> > >::make_copy_constructor<c10::intrusive_ptr<c10::SymIntNodeImpl, c10::detail::intrusive_target_default_null_type<c10::SymIntNodeImpl> >, void>(c10::intrusive_ptr<c10::SymIntNodeImpl, c10::detail::intrusive_target_default_null_type<c10::SymIntNodeImpl> > const*)::{lambda(void const*)#1}::__invoke(void const*)>,
    move_constructor=0x7fffdf7c19f0 <pybind11::detail::type_caster_base<c10::intrusive_ptr<c10::SymIntNodeImpl, c10::detail::intrusive_target_default_null_type<c10::SymIntNodeImpl> > >::make_move_constructor<c10::intrusive_ptr<c10::SymIntNodeImpl, c10::detail::intrusive_target_default_null_type<c10::SymIntNodeImpl> >, void>(c10::intrusive_ptr<c10::SymIntNodeImpl, c10::detail::intrusive_target_default_null_type<c10::SymIntNodeImpl> > const*)::{lambda(void const*)#1}::__invoke(void const*)>,
    existing_holder=0x0)
    at /data/users/ezyang/pytorch-tmp/cmake/../third_party/pybind11/include/pybind11/detail/type_caster_base.h:515
515             if (!tinfo) { // no type info: error will be set already
(gdb)
516                 return handle();
(gdb)

but then nothing seems to detect that the handle is empty, and so this null handle ends up being returned all the way up to the caller.
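
One way to make a failure like this loud at the call site is to check the returned handle and turn an empty one into an exception. The sketch below is a self-contained model in plain C++, not pybind11 code: `Handle`, `generic_cast`, and `checked_cast` are hypothetical stand-ins that only mirror the short circuit at type_caster_base.h:515-516 shown above.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for a pybind11-style handle: it wraps a raw pointer
// that stays null when no Python object was produced.
struct Handle {
    void* ptr = nullptr;
    Handle() = default;
    explicit Handle(void* p) : ptr(p) {}
    explicit operator bool() const { return ptr != nullptr; }
};

// Models the short circuit from the gdb session: with no type info the
// generic caster hands back an empty handle instead of throwing.
inline Handle generic_cast(void* src, const void* tinfo) {
    if (!tinfo) {
        return Handle{};  // "no type info: error will be set already"
    }
    return Handle{src};
}

// Hypothetical wrapper: converts an empty handle into a C++ exception so the
// failure cannot travel silently up the call stack.
inline Handle checked_cast(void* src, const void* tinfo, const std::string& type_name) {
    Handle h = generic_cast(src, tinfo);
    if (!h) {
        throw std::runtime_error("cast failed: unregistered type " + type_name);
    }
    return h;
}

// Usage: a missing type record now surfaces as an exception.
inline bool example_missing_type_throws() {
    int object = 0;
    try {
        checked_cast(&object, nullptr, "SomeType");
    } catch (const std::runtime_error&) {
        return true;
    }
    return false;
}
```

The explicit null check in the triggering code above does the same job by hand; a wrapper of this shape would only centralize it.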

c10::intrusive_ptr is a shared_ptr-like class that does intrusive refcounting. It was declared to be a holder type with

torch/csrc/utils/pybind.h:PYBIND11_DECLARE_HOLDER_TYPE(T, c10::intrusive_ptr<T>, true);

I also interposed the type info registration mechanism, and observed that SymIntNodeImpl was registered, but not c10::intrusive_ptr<SymIntNodeImpl>. The workaround, in this case, is to explicitly dereference the intrusive_ptr before passing it to py::cast, but this is error-prone and it would be nice to find the root cause of the issue.
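
The registration observation can be pictured with a small model. This is a sketch under assumptions, not pybind11's actual data structures: `TypeRegistry` is a hypothetical map keyed by `std::type_index`, the Python name "SymIntNode" is made up, and `std::shared_ptr` stands in for `c10::intrusive_ptr`. A lookup keyed on the smart pointer type misses even though the pointee is registered, which is consistent with the dereference workaround succeeding.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <typeindex>
#include <typeinfo>
#include <unordered_map>

struct SymIntNodeImpl {};  // stand-in for the bound C++ class

// Hypothetical registry: maps a C++ type to the name of its Python class.
class TypeRegistry {
public:
    template <typename T>
    void add(const std::string& python_name) {
        types_[std::type_index(typeid(T))] = python_name;
    }

    // Returns nullptr when T was never registered (compare tinfo=0x0 above).
    template <typename T>
    const std::string* find() const {
        auto it = types_.find(std::type_index(typeid(T)));
        return it == types_.end() ? nullptr : &it->second;
    }

private:
    std::unordered_map<std::type_index, std::string> types_;
};

// Usage: only the pointee is registered, so a lookup on the pointer type fails.
inline bool example_pointer_lookup_misses() {
    TypeRegistry registry;
    registry.add<SymIntNodeImpl>("SymIntNode");
    const bool pointee_found = registry.find<SymIntNodeImpl>() != nullptr;
    const bool pointer_found = registry.find<std::shared_ptr<SymIntNodeImpl>>() != nullptr;
    return pointee_found && !pointer_found;
}
```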

This is on pybind11 aa304c9

Reproducible example code

No response

ezyang · 2022-07-31T03:05:07Z

It turns out that this can be triggered by an ODR violation. Compile in debug mode with no optimizations, and have one py::cast call with the PYBIND11_DECLARE_HOLDER_TYPE declaration in scope and another without it. Both calls instantiate the same template for the same type, so the linker keeps only one of the two definitions. Half of the time, the non-holder definition of py::cast will clobber the holder one, and then you can end up with py::cast returning null.

I'm not sure what pybind11 is supposed to do here. I suppose it would be nice if, even in this situation, pybind11 raised an error rather than returning null. That's not much consolation though...
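
A single file cannot exhibit the ODR violation itself, since that needs two translation units that disagree. The sketch below only shows the ingredient: a function template whose behavior depends on whether a specialization is visible. All names (`is_holder`, `cast_path`, `UndeclaredPtr`) are hypothetical, and `std::shared_ptr` stands in for a pointer type that was declared as a holder.

```cpp
#include <cassert>
#include <memory>

// Primary template: by default a type is not treated as a holder.
template <typename T>
struct is_holder {
    static constexpr bool value = false;
};

// Stand-in for what a holder declaration provides: a specialization that
// changes the answer for one smart pointer family.
template <typename T>
struct is_holder<std::shared_ptr<T>> {
    static constexpr bool value = true;
};

// A smart pointer for which no holder specialization exists. This is what a
// translation unit that forgot the declaring header would effectively see.
template <typename T>
struct UndeclaredPtr {
    T* p;
};

enum class CastPath { Holder, Generic };

// The path taken is fixed at instantiation time by the specializations that
// are visible. If two translation units instantiated cast_path for the same
// pointer type with different visibility, they would emit two different
// bodies under one symbol name, and the linker would keep just one of them.
template <typename T>
CastPath cast_path(const T&) {
    return is_holder<T>::value ? CastPath::Holder : CastPath::Generic;
}

// Usage: the declared pointer type takes the holder path, the other does not.
inline bool example_paths_differ() {
    return cast_path(std::make_shared<int>(1)) == CastPath::Holder &&
           cast_path(UndeclaredPtr<int>{nullptr}) == CastPath::Generic;
}
```

In an optimized build the call is often inlined, so each translation unit keeps its own behavior; an unoptimized debug build tends to make the single surviving definition visible.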

This will make the pointer type a single word, which is important for packing it into an int64_t.

This time, this diff doesn't segfault when you build with DEBUG mode; more details at pybind/pybind11#4099

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

We define specializations for pybind11-defined templates (in particular, PYBIND11_DECLARE_HOLDER_TYPE) and consequently it is important that these specializations *always* be #include'd when making use of pybind11 templates whose behavior depends on these specializations, otherwise we can cause an ODR violation.

The easiest way to ensure that all the specializations are always loaded is to designate a header (in this case, torch/csrc/utils/pybind.h) that ensures the specializations are defined, and then add a lint to ensure this header is included whenever pybind11 headers are included.

The existing grep linter didn't have enough knobs to do this conveniently, so I added some features. I'm open to suggestions for how to structure the features better. The main changes:

- Added an --allowlist-pattern flag, which turns off the grep lint if some other line exists. This is used to stop the grep lint from complaining about pybind11 includes if the utils include already exists.
- Added a --match-first-only flag, which lets grep only match against the first matching line. This is because, even if there are multiple includes that are problematic, I only need to fix one of them. We don't /really/ need this, but when I was running lintrunner -a to fix up the preexisting codebase it was annoying without this, as the lintrunner overall driver fails if there are multiple edits on the same file.

I excluded any files that didn't otherwise have a dependency on torch/ATen; this was mostly caffe2 and the valgrind wrapper compat bindings.

Note the grep replacement is kind of crappy, but clang-tidy lint cleaned it up in most cases.

See also pybind/pybind11#4099

Signed-off-by: Edward Z. Yang <ezyang@fb.com>
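
The behavior of the two flags described above can be sketched roughly as follows. This is a hypothetical C++ model written for illustration, not the project's grep linter, whose implementation is not shown in this thread: `LintOptions`, `lint_lines`, `pybind_include_lint`, and both regular expressions are assumptions.

```cpp
#include <cassert>
#include <cstddef>
#include <regex>
#include <string>
#include <vector>

// Hypothetical options mirroring the two flags from the commit message.
struct LintOptions {
    std::regex pattern;             // lines matching this are flagged
    std::regex allowlist_pattern;   // if any line matches, the file is skipped
    bool match_first_only = false;  // report only the first flagged line
};

// Returns the 1-based numbers of the lines that would be reported.
inline std::vector<std::size_t> lint_lines(const std::vector<std::string>& lines,
                                           const LintOptions& opts) {
    std::vector<std::size_t> hits;
    // Allowlist: one matching line anywhere in the file turns the lint off.
    for (const auto& line : lines) {
        if (std::regex_search(line, opts.allowlist_pattern)) {
            return hits;
        }
    }
    for (std::size_t i = 0; i < lines.size(); ++i) {
        if (std::regex_search(lines[i], opts.pattern)) {
            hits.push_back(i + 1);
            if (opts.match_first_only) {
                break;  // one report per file is enough to prompt a fix
            }
        }
    }
    return hits;
}

// Assumed patterns: flag direct pybind11 includes unless the helper header
// is also included somewhere in the file.
inline LintOptions pybind_include_lint(bool match_first_only) {
    LintOptions opts;
    opts.pattern = std::regex(R"(#include\s*<pybind11/)");
    opts.allowlist_pattern = std::regex(R"(#include\s*<torch/csrc/utils/pybind\.h>)");
    opts.match_first_only = match_first_only;
    return opts;
}

// Usage: a file with two direct includes yields one report in first-only mode.
inline bool example_first_only_reports_once() {
    const std::vector<std::string> file = {
        "#include <pybind11/pybind11.h>",
        "#include <pybind11/stl.h>",
    };
    return lint_lines(file, pybind_include_lint(true)).size() == 1;
}
```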

Skylion007 · 2022-08-01T18:00:28Z

@rwgk has been looking into fixing/detecting these potential ODR violations, so pinging him.

rwgk · 2022-08-01T19:36:45Z

Did you already run the full reproducer with sanitizers (valgrind, asan, msan)?

It's not certain that the problem here is related to work I did, although it could be. Reading the comments here didn't trigger any great ideas on my end.

If you post the instructions for the full reproducer and it's not too cumbersome, I could maybe give it a try.

The holder code has strictly speaking "plenty of UB potential": #2672 (comment)

Are the C-style and reinterpret casts a problem here or not? I don't know.

> Has been looking into fixing/detecting these potential ODR violations so pinging him.

The ODR guard work I did recently is very specific to detecting type_caster ODR violations: PR #4022

If you want to give it a try, git clone smart_holder instead of master and use -DPYBIND11_ENABLE_TYPE_CASTER_ODR_GUARD or simply edit pybind11/detail/descr.h to hard-wire the #define there. Prefer an optimized build for this if you can. (The ODR guard is known to not work in debug mode with one specific compiler. All other compilers work in any mode.)

Fixing the ODR is a different question.

ezyang · 2022-08-01T20:10:24Z

I ran the full reproducer with ASAN, but I didn't try Valgrind or MSAN. I think I've got a pretty decent handle on the root cause, so I can try to make a minimal reproducer. To reproduce in full, check out pytorch/pytorch@7be44f8, do a build with DEBUG=1 python setup.py develop (you might also need USE_GOLD_LINKER=1, not sure), and then run python test/test_dynamic_shapes.py. This will segfault in Python; to reveal the pybind11 error, patch in pytorch/pytorch#82519, which adds a check to see if py::cast is returning null.

ezyang · 2022-08-01T20:11:48Z

To fix the ODR, I added a lint rule to our project to force all includes of pybind.h to go through a helper header, which ensures that all the specializations are in scope whenever we use pybind.h. This is viable for us but maybe not for a monorepo setting.

rwgk · 2022-08-01T20:37:21Z

> I think I've got a pretty decent handle on the root cause so I can try to make a minimal reproducer.

Sounds promising, I'll wait for that. (I cannot easily run msan in a general Linux environment. It's much more practical with a reproducer that I can reasonably easily run in our corp dev environment.)

rwgk · 2022-08-01T20:40:25Z

> To fix the ODR, I added a lint rule to our project to force all includes of pybind.h to go through a helper header, which ensures that all the specializations are in scope whenever we use pybind.h. This is viable for us but maybe not for a monorepo setting.

Maybe you have covered this already: PYBIND11_MAKE_OPAQUE needs to be used consistently as well (link-level visibility), another thing to watch out for.

ezyang · 2022-08-01T21:39:48Z

nice one. We don't have any uses of it in our codebase yet but I can easily add it to our lint haha

ezyang added the triage New bug, unverified label Jul 30, 2022

ezyang mentioned this issue Jul 31, 2022

Change SymIntNode into an intrusive pointer pytorch/pytorch#82548

Closed

ezyang mentioned this issue Jul 31, 2022

Add a lint rule for torch/csrc/util/pybind.h include pytorch/pytorch#82552

Closed

Skylion007 added bug help wanted and removed triage New bug, unverified labels Aug 1, 2022

rwgk mentioned this issue Feb 10, 2023

FWD pybind11 google/pybind11clif#4099

Closed
