Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[PROF-11003] Fix allocation profiling + otel tracing causing Ruby crash #4240

Merged
merged 7 commits into from
Jan 3, 2025

Conversation

ivoanjo
Copy link
Member

@ivoanjo ivoanjo commented Dec 19, 2024

What does this PR do?

This PR is a follow-up to #4195 and fixes a similar issue to that earlier PR.

Specifically, when the profiler is trying to read from the opentelemetry ("otel") context, we would try to look up a value from a Ruby hash using rb_hash_lookup(...). This innocent-looking method would call #hash on the key being used, which meant that a Ruby method could start running in the middle of a profiler sample.

When combined with allocation profiling, we had reports of Ruby crashing with:

[BUG] unexpected situation - recordd:1 current:0

-- C level backtrace information -------------------------------------------
ruby(rb_print_backtrace+0x11) [0x55ba03ccf90f] vm_dump.c:820
ruby(rb_vm_bugreport) vm_dump.c:1151
ruby(bug_report_end+0x0) [0x55ba03e91607] error.c:1042
ruby(rb_bug_without_die) error.c:1042
ruby(die+0x0) [0x55ba03ac0998] error.c:1050
ruby(rb_bug) error.c:1052
ruby(disallow_reentry+0x0) [0x55ba03ab6dcc] vm_sync.c:226
ruby(rb_ec_vm_lock_rec_check+0x1a) [0x55ba03cb17aa] eval_intern.h:144
ruby(rb_ec_tag_state) eval_intern.h:155
ruby(rb_vm_exec) vm.c:2484
ruby(vm_invoke_proc+0x201) [0x55ba03cb62b1] vm.c:1509
ruby(rb_vm_invoke_proc+0x33) [0x55ba03cb65d3] vm.c:1728
ruby(thread_do_start_proc+0x176) [0x55ba03c63516] thread.c:598
ruby(thread_do_start+0x12) [0x55ba03c648a2] thread.c:615
ruby(thread_start_func_2) thread.c:672
ruby(nt_start+0x107) [0x55ba03c65137] thread_pthread.c:2187
/lib/x86_64-linux-gnu/libpthread.so.0(start_thread+0xd9) [0x7ff360b66609]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x43) [0x7ff360a70353]

This PR fixes this by replacing the rb_hash_lookup with a native-level linear iteration on the hash contents, which allows us to skip invoking any kind of Ruby methods.

Additionally, this PR:

  • Adds a checker mechanism to detect these kinds of problem conditions when executing our test suite. This mechanism was able to flag both the issue in [PROF-11003] Fix unsafe initialization when using profiler with otel tracing #4195 as well as the one in rb_hash_lookup

  • Adds a tweak to avoid allocation sampling while Ruby is raising an exception, as in my testing this was another way of very easily causing Ruby to crash or misbehave

  • Fixes a potential issue with static vs dynamic Ruby symbols when reading from the otel context code.

Motivation:

Fix the crash.

Change log entry

Yes. Fix allocation profiling + otel tracing causing Ruby crash

Additional Notes:

As a reminder, reading from the opentelemetry context is disabled by default, so only customers setting DD_PROFILING_PREVIEW_OTEL_CONTEXT_ENABLED to true/both/only could be affected by this issue.

Also, the crash was only triggered when allocation profiling was in use.

This PR may make a bit more sense to read commit-by-commit.

How to test the change?

This PR adds code coverage + a new checker to find the fixed issues.

This checker is used to detect accidental thread scheduling switching
points happening during profiling sampling.

See the bigger comment in unsafe_api_calls_check.h .

I was able to check that this checker correctly triggers for the bug
in #4195, and also the bug I'm going to fix next, which is the
use of `rb_hash_lookup` in the otel context reading code.
`rb_hash_lookup` calls `#hash` on the key being looked up so it's safe
to use unless during sampling.

This can cause the same issue as we saw in #4195 leading to

```
[BUG] unexpected situation - recordd:1 current:0

-- C level backtrace information -------------------------------------------
ruby(rb_print_backtrace+0x11) [0x55ba03ccf90f] vm_dump.c:820
ruby(rb_vm_bugreport) vm_dump.c:1151
ruby(bug_report_end+0x0) [0x55ba03e91607] error.c:1042
ruby(rb_bug_without_die) error.c:1042
ruby(die+0x0) [0x55ba03ac0998] error.c:1050
ruby(rb_bug) error.c:1052
ruby(disallow_reentry+0x0) [0x55ba03ab6dcc] vm_sync.c:226
ruby(rb_ec_vm_lock_rec_check+0x1a) [0x55ba03cb17aa] eval_intern.h:144
ruby(rb_ec_tag_state) eval_intern.h:155
ruby(rb_vm_exec) vm.c:2484
ruby(vm_invoke_proc+0x201) [0x55ba03cb62b1] vm.c:1509
ruby(rb_vm_invoke_proc+0x33) [0x55ba03cb65d3] vm.c:1728
ruby(thread_do_start_proc+0x176) [0x55ba03c63516] thread.c:598
ruby(thread_do_start+0x12) [0x55ba03c648a2] thread.c:615
ruby(thread_start_func_2) thread.c:672
ruby(nt_start+0x107) [0x55ba03c65137] thread_pthread.c:2187
/lib/x86_64-linux-gnu/libpthread.so.0(start_thread+0xd9) [0x7ff360b66609]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x43) [0x7ff360a70353]
```
During my experiments to reproduce issues around allocation profiling,
I've noted that the VM is in an especially delicate state during
exception raising, so let's just decline to sample in this situation.
It occurs to me that if a symbol is dynamic, we were causing it to
become a static symbol (e.g. making it never be able to be garbage
collected).

This can be very bad! And also, we know the symbol we're looking for
must be a static symbol because if nothing else, our initialization
caused it to become a static symbol.

Thus, if we see a dynamic symbol, we can stop there, since by
definition it won't be the symbol we're looking after.

This is... really awkward to add a specific unit test for, so
I've just relied on our existing test coverage to show that this
has not affected the correctness of our otel code.
@ivoanjo ivoanjo requested review from a team as code owners December 19, 2024 15:29
@github-actions github-actions bot added core Involves Datadog core libraries profiling Involves Datadog profiling labels Dec 19, 2024
Copy link

github-actions bot commented Dec 19, 2024

Thank you for updating Change log entry section 👏

Visited at: 2024-12-19 17:13:07 UTC

@codecov-commenter
Copy link

codecov-commenter commented Dec 19, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 97.74%. Comparing base (e8a6307) to head (442690b).

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #4240      +/-   ##
==========================================
- Coverage   97.74%   97.74%   -0.01%     
==========================================
  Files        1355     1355              
  Lines       82366    82404      +38     
  Branches     4230     4230              
==========================================
+ Hits        80512    80544      +32     
- Misses       1854     1860       +6     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

}

static void check_for_unsafe_api_calls(DDTRACE_UNUSED void *_unused) {
if (inside_unsafe_context) rb_bug(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a hard crash right? I see how it would make issues more visible but won't it also lead the profiler to crash client apps more often than what may be caused by some "racy" issues? Are we trading off potential profiling accuracy with increased crashing or would the unsafe calls potentially manifest in issues more dangerous than a simple crash (e.g. a data corruption)?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right, but there's a key detail: I've only placed calls to debug_enter_unsafe_context and debug_leave_unsafe_context in methods that are in the testing_module, so this code never gets exercised in the "normal" in-production profiling path.

But re-reading the diff this is a subtle detail, so I'll push a commit to document and make this more clear that these calls should not be added outside of testing code paths.

(I did consider introducing an option to enable this behavior + adding it also to the normal profiling path, so we could tell folks that ran into issues "try with this setting"; but to be honest I wasn't sure it was worth the time spent into it and additional complexity, given we already have some coverage for this in the tests + it's not particularly something that has happened in the past.)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added big scary comment in 7e49324

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've only placed calls to debug_enter_unsafe_context and debug_leave_unsafe_context in methods that are in the testing_module, so this code never gets exercised in the "normal" in-production profiling path.

Ah I did wonder if that was true given the first few places you added it to had the "don't use this anywhere else" comment. But some of the other methods didn't and I was lazy to check where they were registered or called from :shame:

@@ -236,7 +236,7 @@ def safely_synchronize
rescue ThreadError => e
logger_without_components.error(
'Detected deadlock during datadog initialization. ' \
'Please report this at https://github.com/DataDog/dd-trace-rb/blob/master/CONTRIBUTING.md#found-a-bug' \
'Please report this at https://github.com/datadog/dd-trace-rb/blob/master/CONTRIBUTING.md#found-a-bug' \
Copy link
Contributor

@AlexJF AlexJF Dec 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: Intended? I mean it works but the official casing is the previous.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I usually try to lower-case whenever I see it linked like this, trying to convince the world it's not DataDog ;)

@datadog-datadog-prod-us1
Copy link
Contributor

datadog-datadog-prod-us1 bot commented Dec 19, 2024

Datadog Report

Branch report: ivoanjo/prof-11003-fix-profiler-otel-hash-lookup
Commit report: 442690b
Test service: dd-trace-rb

✅ 0 Failed, 22130 Passed, 1476 Skipped, 5m 28.92s Total Time

@p-datadog
Copy link
Member

Is the iteration limit for hash lookup added in this PR customer visible? Seems to me that it would be? If yes should it be noted somewhere?

@ivoanjo
Copy link
Member Author

ivoanjo commented Jan 2, 2025

Is the iteration limit for hash lookup added in this PR customer visible? Seems to me that it would be? If yes should it be noted somewhere?

It's a good call-out but... I'm somewhat hesitant to introduce too many knobs of this kind if we can avoid it.

This whole feature itself is still under preview (and off by default), so my thinking is: let's see if any customer reports any issue with the current hardcoded value for this setting, and in that case I'll work with them do decide what value would be needed. After that, we can decide if we bump the default, or do something else.

Does this sound reasonable, or are you not convinced yet?

@ivoanjo
Copy link
Member Author

ivoanjo commented Jan 2, 2025

Ok so the failure in the benchmarks is expected and an unfortunate effect of how the benchmarks are currently ran. In particular, our harness tries to run the benchmarks from this branch with the main branch, which is an interesting idea BUT forces the benchmark code to be backwards-compatible, which is really annoying for some changes such as this one.

The master branch is currently red due to unrelated issues; my current plan for this PR is to wait until master is green again, then merge it into this branch, and then merge it once everything is green other than the benchmarks.

@ivoanjo ivoanjo merged commit b1400d5 into master Jan 3, 2025
335 of 336 checks passed
@ivoanjo ivoanjo deleted the ivoanjo/prof-11003-fix-profiler-otel-hash-lookup branch January 3, 2025 09:55
@github-actions github-actions bot added this to the 2.9.0 milestone Jan 3, 2025
@ivoanjo ivoanjo added bug Involves a bug otel OpenTelemetry-related changes and removed core Involves Datadog core libraries labels Jan 3, 2025
@ivoanjo
Copy link
Member Author

ivoanjo commented Jan 3, 2025

Update: After merging to master, CI is green as expected:
image

@p-datadog p-datadog mentioned this pull request Jan 15, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Involves a bug otel OpenTelemetry-related changes profiling Involves Datadog profiling
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants