[BUG]: In some situations, py::cast returns null instead of raising an exception #4099

Open
3 tasks done
ezyang opened this issue Jul 30, 2022 · 8 comments

Comments


ezyang commented Jul 30, 2022

Required prerequisites

Problem description

Under certain conditions I hit the error: TypeError: Unregistered type : c10::intrusive_ptr<c10::SymIntNodeImpl, c10::detail::intrusive_target_default_null_type<c10::SymIntNodeImpl> > when I call py::cast on a custom smart pointer. Ordinarily, I expect this to work (and in other contexts, it does work), but sometimes it does not. Additionally, the documentation specifies that py::cast should always raise an exception upon cast failure, but I observe that it instead returns a null pointer and sets the Python error indicator, without actually raising an exception.

I wasn't able to extract a short repro; I do have a full repro but it involves compiling a giant project, LMK if you're interested. The triggering code looks like:

      auto py_symint = py::cast(si.toSymIntNodeImpl()).release().ptr();
      if (!py_symint) throw python_error();

where toSymIntNodeImpl returns a c10::intrusive_ptr<c10::SymIntNodeImpl>. py_symint is null and a Python error is set after calling py::cast. Here is the backtrace at this point:

#0  pybind11::detail::type_caster_generic::src_and_type (src=0x7fffffffad88, 
    cast_type=..., rtti_type=0x0)
    at /data/users/ezyang/pytorch-tmp/cmake/../third_party/pybind11/include/pybind11/detail/type_caster_base.h:788
#1  0x00007fffdf7c16d9 in pybind11::detail::type_caster_base<c10::intrusive_ptr<c10::SymIntNodeImpl, c10::detail::intrusive_target_default_null_type<c10::SymIntNodeImpl> > >::src_and_type (src=0x7fffffffad88)
    at /data/users/ezyang/pytorch-tmp/cmake/../third_party/pybind11/include/pybind11/detail/type_caster_base.h:948
#2  0x00007fffdf7c15af in pybind11::detail::type_caster_base<c10::intrusive_ptr<c10::SymIntNodeImpl, c10::detail::intrusive_target_default_null_type<c10::SymIntNodeImpl> > >::cast (
    src=0x7fffffffad88, policy=pybind11::return_value_policy::move, parent=...)
    at /data/users/ezyang/pytorch-tmp/cmake/../third_party/pybind11/include/pybind11/detail/type_caster_base.h:952
#3  0x00007fffdf7c1570 in pybind11::detail::type_caster_base<c10::intrusive_ptr<c10::SymIntNodeImpl, c10::detail::intrusive_target_default_null_type<c10::SymIntNodeImpl> > >::cast (
    src=..., parent=...)
    at /data/users/ezyang/pytorch-tmp/cmake/../third_party/pybind11/include/pybind11/detail/type_caster_base.h:923
#4  0x00007fffdf7c0b7d in pybind11::cast<c10::intrusive_ptr<c10::SymIntNodeImpl, c10::detail::intrusive_target_default_null_type<c10::SymIntNodeImpl> >, 0> (value=..., 
    policy=pybind11::return_value_policy::move, parent=...)
    at /data/users/ezyang/pytorch-tmp/cmake/../third_party/pybind11/include/pybind11/cast.h:1067
#5  0x00007fffdfc609df in THPSize_NewFromSymSizes (self_=...)
    at /data/users/ezyang/pytorch-tmp/torch/csrc/Size.cpp:60

Stepping through the rest of the execution, lack of type info means type_caster_generic::cast short circuits:

pybind11::detail::type_caster_generic::cast (_src=0x0, policy=pybind11::return_value_policy::move, parent=...,
    tinfo=0x0,
    copy_constructor=0x7fffdf7c1770 <pybind11::detail::type_caster_base<c10::intrusive_ptr<c10::SymIntNodeImpl, c10::detail::intrusive_target_default_null_type<c10::SymIntNodeImpl> > >::make_copy_constructor<c10::intrusive_ptr<c10::SymIntNodeImpl, c10::detail::intrusive_target_default_null_type<c10::SymIntNodeImpl> >, void>(c10::intrusive_ptr<c10::SymIntNodeImpl, c10::detail::intrusive_target_default_null_type<c10::SymIntNodeImpl> > const*)::{lambda(void const*)#1}::__invoke(void const*)>,
    move_constructor=0x7fffdf7c19f0 <pybind11::detail::type_caster_base<c10::intrusive_ptr<c10::SymIntNodeImpl, c10::detail::intrusive_target_default_null_type<c10::SymIntNodeImpl> > >::make_move_constructor<c10::intrusive_ptr<c10::SymIntNodeImpl, c10::detail::intrusive_target_default_null_type<c10::SymIntNodeImpl> >, void>(c10::intrusive_ptr<c10::SymIntNodeImpl, c10::detail::intrusive_target_default_null_type<c10::SymIntNodeImpl> > const*)::{lambda(void const*)#1}::__invoke(void const*)>,
    existing_holder=0x0)
    at /data/users/ezyang/pytorch-tmp/cmake/../third_party/pybind11/include/pybind11/detail/type_caster_base.h:515
515             if (!tinfo) { // no type info: error will be set already
(gdb)
516                 return handle();
(gdb)

but then nothing seems to detect that the handle is empty, so this null handle ends up being returned all the way up to the caller of py::cast.
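The manual null check in the triggering code above can be wrapped in a helper so that no call site forgets it. The following is a minimal sketch in plain C++, not pybind11 code: Handle, ConversionState, cast_or_null and checked are all hypothetical stand-ins that only model "return an empty handle and leave an error behind" versus "throw".

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for a nullable object handle. It only models the
// "may be empty" property; it is NOT the pybind11::handle type.
struct Handle {
    void* ptr = nullptr;
    explicit operator bool() const { return ptr != nullptr; }
};

// Hypothetical side-channel error state, playing the role of an error that
// has been set but not yet raised.
struct ConversionState {
    std::string pending_error;
};

// Hypothetical conversion that fails silently: on failure it records an
// error and returns an empty handle instead of throwing.
inline Handle cast_or_null(int* value, bool type_is_registered, ConversionState& state) {
    if (!type_is_registered) {
        state.pending_error = "Unregistered type";
        return Handle{};  // the caller has to notice this
    }
    return Handle{value};
}

// Wrapper that converts the silent failure into an exception.
inline Handle checked(Handle h, const ConversionState& state) {
    if (!h) {
        throw std::runtime_error(state.pending_error.empty() ? "cast returned null"
                                                             : state.pending_error);
    }
    return h;
}
```

With such a wrapper, a call like checked(cast_or_null(&x, false, state), state) throws instead of handing an empty handle to code that will later dereference it.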

c10::intrusive_ptr is a shared_ptr-like class that does intrusive refcounting. It is declared to be a holder type with:

torch/csrc/utils/pybind.h:PYBIND11_DECLARE_HOLDER_TYPE(T, c10::intrusive_ptr<T>, true);
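For readers unfamiliar with intrusive refcounting, here is a minimal sketch of the idea. It is not the c10 implementation: RefCounted, IntrusivePtr and Node are hypothetical names, and the counter is deliberately not thread-safe. The point is only that the count lives inside the pointee, so the smart pointer itself stores nothing but a raw pointer.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>

// Hypothetical base class: the reference count is stored in the object.
class RefCounted {
public:
    virtual ~RefCounted() = default;
    std::size_t use_count() const { return refs_; }
    void retain() { ++refs_; }
    // Returns true when the last reference has just been dropped.
    bool release() { return --refs_ == 0; }
private:
    std::size_t refs_ = 0;  // not atomic; a real implementation would need atomics
};

// Hypothetical smart pointer for types derived from RefCounted.
template <class T>
class IntrusivePtr {
public:
    IntrusivePtr() = default;
    explicit IntrusivePtr(T* p) : p_(p) { if (p_) p_->retain(); }
    IntrusivePtr(const IntrusivePtr& o) : p_(o.p_) { if (p_) p_->retain(); }
    IntrusivePtr(IntrusivePtr&& o) noexcept : p_(o.p_) { o.p_ = nullptr; }
    // Copy-and-swap: the by-value parameter releases the old pointee.
    IntrusivePtr& operator=(IntrusivePtr o) noexcept { std::swap(p_, o.p_); return *this; }
    ~IntrusivePtr() { if (p_ && p_->release()) delete p_; }

    T* get() const { return p_; }
    T& operator*() const { return *p_; }
    T* operator->() const { return p_; }
private:
    T* p_ = nullptr;
};

struct Node : RefCounted {
    int value = 0;
};
```

Copying an IntrusivePtr<Node> bumps the count stored in the Node, and the Node is deleted when the last copy goes away.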

I also interposed on the type info registration mechanism and observed that SymIntNodeImpl was registered, but c10::intrusive_ptr<SymIntNodeImpl> was not. The workaround in this case is to explicitly dereference the intrusive_ptr before passing it to py::cast, but this is error-prone, and it would be nice to root-cause the issue.
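The registration observation can be illustrated with a toy registry. This is a sketch under stated assumptions, not pybind11's internals: Registry, lookup_generic, lookup_holder and SymNode are hypothetical, and std::shared_ptr stands in for c10::intrusive_ptr. It shows why a lookup keyed on the holder type misses while a lookup on the dereferenced object hits.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <typeindex>
#include <typeinfo>
#include <unordered_map>

// Hypothetical registry: maps a C++ type to the name of its bound class.
using Registry = std::unordered_map<std::type_index, std::string>;

// Generic lookup: knows nothing about holders, so a holder is just another
// type that was never registered. Returns nullptr on a miss.
template <class T>
const std::string* lookup_generic(const Registry& reg, const T&) {
    auto it = reg.find(std::type_index(typeid(T)));
    return it == reg.end() ? nullptr : &it->second;
}

// Holder-aware lookup: unwraps the holder and looks up the pointee type.
template <class T>
const std::string* lookup_holder(const Registry& reg, const std::shared_ptr<T>& holder) {
    return holder ? lookup_generic(reg, *holder) : nullptr;
}

// Placeholder for the bound class.
struct SymNode {
    int value = 0;
};
```

With only SymNode registered, lookup_generic on the holder returns nullptr, while lookup_generic on the dereferenced object (the workaround) and lookup_holder (the intended path) both find the entry.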

This is on pybind11 aa304c9

Reproducible example code

No response

@ezyang ezyang added the triage New bug, unverified label Jul 30, 2022
Author

ezyang commented Jul 31, 2022

It turns out that this can be triggered by an ODR violation. Compile in debug mode with no optimizations, and have one py::cast call with the PYBIND11_DECLARE_HOLDER_TYPE declaration in scope and another without it. Half of the time, the non-holder instantiation of py::cast will clobber the holder-aware one, and then you can end up with py::cast returning null.
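A single-file sketch of the mechanism follows. A real ODR violation needs two translation units, so the two namespaces tu_a and tu_b below merely emulate them; is_holder, describe_cast and Widget are hypothetical names, and std::shared_ptr stands in for the custom holder. The same function template body takes a different path depending on whether the specialization was visible when it was instantiated.

```cpp
#include <cassert>
#include <memory>
#include <string>

struct Widget {};

// Emulates a translation unit that included the header with the holder
// specialization.
namespace tu_a {
template <class T>
struct is_holder { static constexpr bool value = false; };
template <class T>
struct is_holder<std::shared_ptr<T>> { static constexpr bool value = true; };

template <class T>
std::string describe_cast(const T&) {
    return is_holder<T>::value ? "holder path" : "generic path";
}
}  // namespace tu_a

// Emulates a translation unit with identical code that did NOT include the
// specialization.
namespace tu_b {
template <class T>
struct is_holder { static constexpr bool value = false; };

template <class T>
std::string describe_cast(const T&) {
    return is_holder<T>::value ? "holder path" : "generic path";
}
}  // namespace tu_b
```

In a real program both instantiations would carry the same name, and the linker would keep only one of them. In an unoptimized build calls are less likely to be inlined, so every caller may end up in whichever copy survived, which is consistent with the "half of the time" behavior described above.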

I'm not sure what pybind11 is supposed to do here. I suppose it would be nice if, even in this situation, pybind11 raised an error rather than returning null. That's not much consolation though...

ezyang added a commit to pytorch/pytorch that referenced this issue Jul 31, 2022
This will make the pointer type a single word, which is important
for packing it into an int64_t

This time, this diff doesn't segfault when you build with DEBUG mode; more details at pybind/pybind11#4099

Signed-off-by: Edward Z. Yang <ezyangfb.com>

[ghstack-poisoned]
ezyang added a commit to pytorch/pytorch that referenced this issue Jul 31, 2022
We define specializations for pybind11 defined templates
(in particular, PYBIND11_DECLARE_HOLDER_TYPE) and consequently
it is important that these specializations *always* be #include'd
when making use of pybind11 templates whose behavior depends on
these specializations, otherwise we can cause an ODR violation.

The easiest way to ensure that all the specializations are always
loaded is to designate a header (in this case, torch/csrc/utils/pybind.h)
that ensures the specializations are defined, and then add a lint
to ensure this header is included whenever pybind11 headers are
included.

The existing grep linter didn't have enough knobs to do this
conveniently, so I added some features.  I'm open to suggestions
for how to structure the features better.  The main changes:

- Added an --allowlist-pattern flag, which turns off the grep lint
  if some other line exists.  This is used to stop the grep
  lint from complaining about pybind11 includes if the util
  include already exists.

- Added --match-first-only flag, which lets grep only match against
  the first matching line.  This is because, even if there are multiple
  includes that are problematic, I only need to fix one of them.
  We don't /really/ need this, but when I was running lintrunner -a
  to fixup the preexisting codebase it was annoying without this,
  as the lintrunner overall driver fails if there are multiple edits
  on the same file.

I excluded any files that didn't otherwise have a dependency on
torch/ATen, this was mostly caffe2 and the valgrind wrapper compat
bindings.

Note the grep replacement is kind of crappy, but clang-tidy lint
cleaned it up in most cases.

See also pybind/pybind11#4099

Signed-off-by: Edward Z. Yang <[email protected]>

[ghstack-poisoned]
ezyang added a commit to pytorch/pytorch that referenced this issue Jul 31, 2022
ghstack-source-id: c466358a96edcd76dacd1c4785c1ba10fdc6e819
Pull Request resolved: #82552
pytorchmergebot pushed a commit to pytorch/pytorch that referenced this issue Aug 1, 2022
Pull Request resolved: #82548
Approved by: https://github.com/albanD
pytorchmergebot pushed a commit to pytorch/pytorch that referenced this issue Aug 1, 2022
Pull Request resolved: #82552
Approved by: https://github.com/albanD
@Skylion007 Skylion007 added bug help wanted and removed triage New bug, unverified labels Aug 1, 2022
Collaborator

Skylion007 commented Aug 1, 2022

@rwgk has been looking into fixing/detecting these potential ODR violations, so pinging him.

Collaborator

rwgk commented Aug 1, 2022

Did you already run the full reproducer with sanitizers (valgrind, asan, msan)?

It's not certain that the problem here is related to work I did, although it could be. Reading the comments here didn't trigger any great ideas on my end.

If you post the instructions for the full reproducer and it's not too cumbersome, I could maybe give it a try.

The holder code has strictly speaking "plenty of UB potential": #2672 (comment)

Are the C-style and reinterpret casts a problem here or not? I don't know.

Has been looking into fixing/detecting these potential ODR violations so pinging him.

The ODR guard work I did recently is very specific to detecting type_caster ODR violations: PR #4022

If you want to give it a try, git clone smart_holder instead of master and use -DPYBIND11_ENABLE_TYPE_CASTER_ODR_GUARD or simply edit pybind11/detail/descr.h to hard-wire the #define there. Prefer an optimized build for this if you can. (The ODR guard is known to not work in debug mode with one specific compiler. All other compilers work in any mode.)
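To give a rough idea of what a runtime ODR guard can look like, here is a small sketch. It is an assumption-laden illustration and not a description of how PR #4022 works: OdrGuard and its note method are hypothetical. The guard remembers where the converter for each type was defined and complains when the same type shows up with a different definition site.

```cpp
#include <cassert>
#include <map>
#include <stdexcept>
#include <string>

// Hypothetical guard: records the definition site of the converter used for
// each type and rejects a second, different site for the same type.
class OdrGuard {
public:
    void note(const std::string& type_name, const std::string& location) {
        auto result = seen_.emplace(type_name, location);
        const bool already_known = !result.second;
        if (already_known && result.first->second != location) {
            throw std::logic_error("ODR violation for " + type_name + ": " +
                                   result.first->second + " vs " + location);
        }
    }

private:
    std::map<std::string, std::string> seen_;
};
```

Registering the same type twice from the same location is harmless; registering it from two different locations throws.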

Fixing the ODR is a different question.

Author

ezyang commented Aug 1, 2022

I ran the full reproducer with ASAN, but I didn't try Valgrind or MSAN. I think I've got a pretty decent handle on the root cause so I can try to make a minimal reproducer. To fully reproduce, check out pytorch/pytorch@7be44f8, do a build with DEBUG=1 python setup.py develop (you might also need USE_GOLD_LINKER=1, not sure), and then run python test/test_dynamic_shapes.py. This will segfault in Python; to reveal the pybind11 error, patch in pytorch/pytorch#82519, which adds a check to see if py::cast is returning null.

Author

ezyang commented Aug 1, 2022

To fix the ODR, I added a lint rule to our project to force all includes of pybind.h to go through a helper header, which ensures that all the specializations are in scope whenever we use pybind.h. This is viable for us but maybe not for a monorepo setting.
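The core of such a lint check is small. The sketch below is not the PyTorch grep linter; first_violation is a hypothetical function that only models the two behaviors named in the commit message above: an allowlist pattern that silences the check for the whole file, and reporting only the first matching line.

```cpp
#include <cassert>
#include <cstddef>
#include <regex>
#include <string>
#include <vector>

// Returns the zero-based index of the first line matching `pattern`, or -1
// if there is no such line or if any line in the file matches `allowlist`.
inline int first_violation(const std::vector<std::string>& lines,
                           const std::regex& pattern,
                           const std::regex& allowlist) {
    int first = -1;
    for (std::size_t i = 0; i < lines.size(); ++i) {
        if (std::regex_search(lines[i], allowlist)) {
            return -1;  // the helper header is included somewhere: file is clean
        }
        if (first < 0 && std::regex_search(lines[i], pattern)) {
            first = static_cast<int>(i);  // remember only the first offender
        }
    }
    return first;
}
```

A caller would pass a pattern matching direct pybind11 includes and an allowlist pattern matching the include of the helper header. The exact regular expressions are up to the project.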

Collaborator

rwgk commented Aug 1, 2022

I think I've got a pretty decent handle on the root cause so I can try to make a minimal reproducer.

Sounds promising, I'll wait for that. (I cannot easily run msan in a general Linux environment. It's much more practical with a reproducer that I can reasonably easily run in our corp dev environment.)

Collaborator

rwgk commented Aug 1, 2022

To fix the ODR, I added a lint rule to our project to force all includes of pybind.h to go through a helper header, which ensures that all the specializations are in scope whenever we use pybind.h. This is viable for us but maybe not for a monorepo setting.

Maybe you have covered this already: PYBIND11_MAKE_OPAQUE needs to be used consistently as well (link-level visibility), another thing to watch out for.

Author

ezyang commented Aug 1, 2022

nice one. We don't have any uses of it in our codebase yet but I can easily add it to our lint haha

facebook-github-bot pushed a commit to pytorch/pytorch that referenced this issue Aug 2, 2022
Pull Request resolved: #82548
Approved by: https://github.com/albanD

Test Plan: contbuild & OSS CI, see https://hud.pytorch.org/commit/pytorch/pytorch/50e8abbcadde70c6eb6ef932d3e6cc3fa26a5cd7

Reviewed By: osalpekar

Differential Revision: D38322453

Pulled By: ezyang

fbshipit-source-id: 63b05bd97604da85b5cae2d7a6e7ff6880cd4ae5