[PROF-11003] Fix allocation profiling + otel tracing causing Ruby crash #4240

ivoanjo · 2024-12-19T15:29:09Z

What does this PR do?

This PR is a follow-up to #4195 and fixes a similar issue to that earlier PR.

Specifically, when the profiler is trying to read from the opentelemetry ("otel") context, we would try to look up a value from a Ruby hash using rb_hash_lookup(...). This innocent-looking method would call #hash on the key being used, which meant that a Ruby method could start running in the middle of a profiler sample.

When combined with allocation profiling, we had reports of Ruby crashing with:

[BUG] unexpected situation - recordd:1 current:0

-- C level backtrace information -------------------------------------------
ruby(rb_print_backtrace+0x11) [0x55ba03ccf90f] vm_dump.c:820
ruby(rb_vm_bugreport) vm_dump.c:1151
ruby(bug_report_end+0x0) [0x55ba03e91607] error.c:1042
ruby(rb_bug_without_die) error.c:1042
ruby(die+0x0) [0x55ba03ac0998] error.c:1050
ruby(rb_bug) error.c:1052
ruby(disallow_reentry+0x0) [0x55ba03ab6dcc] vm_sync.c:226
ruby(rb_ec_vm_lock_rec_check+0x1a) [0x55ba03cb17aa] eval_intern.h:144
ruby(rb_ec_tag_state) eval_intern.h:155
ruby(rb_vm_exec) vm.c:2484
ruby(vm_invoke_proc+0x201) [0x55ba03cb62b1] vm.c:1509
ruby(rb_vm_invoke_proc+0x33) [0x55ba03cb65d3] vm.c:1728
ruby(thread_do_start_proc+0x176) [0x55ba03c63516] thread.c:598
ruby(thread_do_start+0x12) [0x55ba03c648a2] thread.c:615
ruby(thread_start_func_2) thread.c:672
ruby(nt_start+0x107) [0x55ba03c65137] thread_pthread.c:2187
/lib/x86_64-linux-gnu/libpthread.so.0(start_thread+0xd9) [0x7ff360b66609]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x43) [0x7ff360a70353]

This PR fixes this by replacing the rb_hash_lookup with a native-level linear iteration on the hash contents, which allows us to skip invoking any kind of Ruby methods.

Additionally, this PR:

Adds a checker mechanism to detect these kinds of problem conditions when executing our test suite. This mechanism was able to flag both the issue in [PROF-11003] Fix unsafe initialization when using profiler with otel tracing #4195 as well as the one in rb_hash_lookup
Adds a tweak to avoid allocation sampling while Ruby is raising an exception, as in my testing this was another way of very easily causing Ruby to crash or misbehave
Fixes a potential issue with static vs dynamic Ruby symbols when reading from the otel context code.

Motivation:

Fix the crash.

Change log entry

Yes. Fix allocation profiling + otel tracing causing Ruby crash

Additional Notes:

As a reminder, reading from the opentelemetry context is disabled by default, so only customers setting DD_PROFILING_PREVIEW_OTEL_CONTEXT_ENABLED to true/both/only could be affected by this issue.

Also, the crash was only triggered when allocation profiling was in use.

This PR may make a bit more sense to read commit-by-commit.

How to test the change?

This PR adds code coverage + a new checker to find the fixed issues.

This checker is used to detect accidental thread scheduling switching points happening during profiling sampling. See the bigger comment in unsafe_api_calls_check.h . I was able to check that this checker correctly triggers for the bug in #4195, and also the bug I'm going to fix next, which is the use of `rb_hash_lookup` in the otel context reading code.

`rb_hash_lookup` calls `#hash` on the key being looked up so it's safe to use unless during sampling. This can cause the same issue as we saw in #4195 leading to ``` [BUG] unexpected situation - recordd:1 current:0 -- C level backtrace information ------------------------------------------- ruby(rb_print_backtrace+0x11) [0x55ba03ccf90f] vm_dump.c:820 ruby(rb_vm_bugreport) vm_dump.c:1151 ruby(bug_report_end+0x0) [0x55ba03e91607] error.c:1042 ruby(rb_bug_without_die) error.c:1042 ruby(die+0x0) [0x55ba03ac0998] error.c:1050 ruby(rb_bug) error.c:1052 ruby(disallow_reentry+0x0) [0x55ba03ab6dcc] vm_sync.c:226 ruby(rb_ec_vm_lock_rec_check+0x1a) [0x55ba03cb17aa] eval_intern.h:144 ruby(rb_ec_tag_state) eval_intern.h:155 ruby(rb_vm_exec) vm.c:2484 ruby(vm_invoke_proc+0x201) [0x55ba03cb62b1] vm.c:1509 ruby(rb_vm_invoke_proc+0x33) [0x55ba03cb65d3] vm.c:1728 ruby(thread_do_start_proc+0x176) [0x55ba03c63516] thread.c:598 ruby(thread_do_start+0x12) [0x55ba03c648a2] thread.c:615 ruby(thread_start_func_2) thread.c:672 ruby(nt_start+0x107) [0x55ba03c65137] thread_pthread.c:2187 /lib/x86_64-linux-gnu/libpthread.so.0(start_thread+0xd9) [0x7ff360b66609] /lib/x86_64-linux-gnu/libc.so.6(clone+0x43) [0x7ff360a70353] ```

During my experiments to reproduce issues around allocation profiling, I've noted that the VM is in an especially delicate state during exception raising, so let's just decline to sample in this situation.

It occurs to me that if a symbol is dynamic, we were causing it to become a static symbol (e.g. making it never be able to be garbage collected). This can be very bad! And also, we know the symbol we're looking for must be a static symbol because if nothing else, our initialization caused it to become a static symbol. Thus, if we see a dynamic symbol, we can stop there, since by definition it won't be the symbol we're looking after. This is... really awkward to add a specific unit test for, so I've just relied on our existing test coverage to show that this has not affected the correctness of our otel code.

github-actions · 2024-12-19T15:29:26Z

Thank you for updating Change log entry section 👏

^{Visited at: 2024-12-19 17:13:07 UTC}

codecov-commenter · 2024-12-19T15:32:08Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 97.74%. Comparing base (e8a6307) to head (442690b).

Additional details and impacted files

@@            Coverage Diff             @@
##           master    #4240      +/-   ##
==========================================
- Coverage   97.74%   97.74%   -0.01%     
==========================================
  Files        1355     1355              
  Lines       82366    82404      +38     
  Branches     4230     4230              
==========================================
+ Hits        80512    80544      +32     
- Misses       1854     1860       +6

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

AlexJF · 2024-12-19T16:20:47Z

ext/datadog_profiling_native_extension/unsafe_api_calls_check.c

+}
+
+static void check_for_unsafe_api_calls(DDTRACE_UNUSED void *_unused) {
+  if (inside_unsafe_context) rb_bug(


This is a hard crash right? I see how it would make issues more visible but won't it also lead the profiler to crash client apps more often than what may be caused by some "racy" issues? Are we trading off potential profiling accuracy with increased crashing or would the unsafe calls potentially manifest in issues more dangerous than a simple crash (e.g. a data corruption)?

You're right, but there's a key detail: I've only placed calls to debug_enter_unsafe_context and debug_leave_unsafe_context in methods that are in the testing_module, so this code never gets exercised in the "normal" in-production profiling path.

But re-reading the diff this is a subtle detail, so I'll push a commit to document and make this more clear that these calls should not be added outside of testing code paths.

(I did consider introducing an option to enable this behavior + adding it also to the normal profiling path, so we could tell folks that ran into issues "try with this setting"; but to be honest I wasn't sure it was worth the time spent into it and additional complexity, given we already have some coverage for this in the tests + it's not particularly something that has happened in the past.)

Added big scary comment in 7e49324

I've only placed calls to debug_enter_unsafe_context and debug_leave_unsafe_context in methods that are in the testing_module, so this code never gets exercised in the "normal" in-production profiling path.

Ah I did wonder if that was true given the first few places you added it to had the "don't use this anywhere else" comment. But some of the other methods didn't and I was lazy to check where they were registered or called from :shame:

AlexJF · 2024-12-19T16:21:34Z

lib/datadog/core/configuration.rb

@@ -236,7 +236,7 @@ def safely_synchronize
          rescue ThreadError => e
            logger_without_components.error(
              'Detected deadlock during datadog initialization. ' \
-              'Please report this at https://github.com/DataDog/dd-trace-rb/blob/master/CONTRIBUTING.md#found-a-bug' \
+              'Please report this at https://github.com/datadog/dd-trace-rb/blob/master/CONTRIBUTING.md#found-a-bug' \


Nit: Intended? I mean it works but the official casing is the previous.

I usually try to lower-case whenever I see it linked like this, trying to convince the world it's not DataDog ;)

datadog-datadog-prod-us1 · 2024-12-19T17:40:59Z

Datadog Report

Branch report: ivoanjo/prof-11003-fix-profiler-otel-hash-lookup
Commit report: 442690b
Test service: dd-trace-rb

✅ 0 Failed, 22130 Passed, 1476 Skipped, 5m 28.92s Total Time

p-datadog · 2024-12-19T19:37:01Z

Is the iteration limit for hash lookup added in this PR customer visible? Seems to me that it would be? If yes should it be noted somewhere?

ivoanjo · 2025-01-02T11:55:32Z

Is the iteration limit for hash lookup added in this PR customer visible? Seems to me that it would be? If yes should it be noted somewhere?

It's a good call-out but... I'm somewhat hesitant to introduce too many knobs of this kind if we can avoid it.

This whole feature itself is still under preview (and off by default), so my thinking is: let's see if any customer reports any issue with the current hardcoded value for this setting, and in that case I'll work with them do decide what value would be needed. After that, we can decide if we bump the default, or do something else.

Does this sound reasonable, or are you not convinced yet?

ivoanjo · 2025-01-02T11:59:43Z

Ok so the failure in the benchmarks is expected and an unfortunate effect of how the benchmarks are currently ran. In particular, our harness tries to run the benchmarks from this branch with the main branch, which is an interesting idea BUT forces the benchmark code to be backwards-compatible, which is really annoying for some changes such as this one.

The master branch is currently red due to unrelated issues; my current plan for this PR is to wait until master is green again, then merge it into this branch, and then merge it once everything is green other than the benchmarks.

…lookup

ivoanjo · 2025-01-03T10:21:42Z

Update: After merging to master, CI is green as expected:

ivoanjo added 5 commits December 19, 2024 14:19

Avoid trying to sample allocations when VM is raising exception

eeed3a9

During my experiments to reproduce issues around allocation profiling, I've noted that the VM is in an especially delicate state during exception raising, so let's just decline to sample in this situation.

Update tests with new signatures for test methods

31997bf

ivoanjo requested review from a team as code owners December 19, 2024 15:29

github-actions bot added core Involves Datadog core libraries profiling Involves Datadog profiling labels Dec 19, 2024

AlexJF approved these changes Dec 19, 2024

View reviewed changes

Document that unsafe api calls checker is only for test code

7e49324

marcotc approved these changes Jan 2, 2025

View reviewed changes

Merge branch 'master' into ivoanjo/prof-11003-fix-profiler-otel-hash-…

442690b

…lookup

ivoanjo merged commit b1400d5 into master Jan 3, 2025
335 of 336 checks passed

ivoanjo deleted the ivoanjo/prof-11003-fix-profiler-otel-hash-lookup branch January 3, 2025 09:55

github-actions bot added this to the 2.9.0 milestone Jan 3, 2025

ivoanjo added bug Involves a bug otel OpenTelemetry-related changes and removed core Involves Datadog core libraries labels Jan 3, 2025

p-datadog mentioned this pull request Jan 15, 2025

Bump to version 2.9.0 #4293

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[PROF-11003] Fix allocation profiling + otel tracing causing Ruby crash #4240

[PROF-11003] Fix allocation profiling + otel tracing causing Ruby crash #4240

ivoanjo commented Dec 19, 2024 •

edited

Loading

github-actions bot commented Dec 19, 2024 •

edited

Loading

codecov-commenter commented Dec 19, 2024 •

edited

Loading

AlexJF Dec 19, 2024

ivoanjo Dec 19, 2024

ivoanjo Dec 19, 2024

AlexJF Dec 19, 2024

AlexJF Dec 19, 2024 •

edited

Loading

ivoanjo Dec 19, 2024

datadog-datadog-prod-us1 bot commented Dec 19, 2024 •

edited

Loading

p-datadog commented Dec 19, 2024

ivoanjo commented Jan 2, 2025

ivoanjo commented Jan 2, 2025

ivoanjo commented Jan 3, 2025

[PROF-11003] Fix allocation profiling + otel tracing causing Ruby crash #4240

[PROF-11003] Fix allocation profiling + otel tracing causing Ruby crash #4240

Conversation

ivoanjo commented Dec 19, 2024 • edited Loading

github-actions bot commented Dec 19, 2024 • edited Loading

codecov-commenter commented Dec 19, 2024 • edited Loading

Codecov Report

AlexJF Dec 19, 2024

Choose a reason for hiding this comment

ivoanjo Dec 19, 2024

Choose a reason for hiding this comment

ivoanjo Dec 19, 2024

Choose a reason for hiding this comment

AlexJF Dec 19, 2024

Choose a reason for hiding this comment

AlexJF Dec 19, 2024 • edited Loading

Choose a reason for hiding this comment

ivoanjo Dec 19, 2024

Choose a reason for hiding this comment

datadog-datadog-prod-us1 bot commented Dec 19, 2024 • edited Loading

Datadog Report

p-datadog commented Dec 19, 2024

ivoanjo commented Jan 2, 2025

ivoanjo commented Jan 2, 2025

ivoanjo commented Jan 3, 2025

ivoanjo commented Dec 19, 2024 •

edited

Loading

github-actions bot commented Dec 19, 2024 •

edited

Loading

codecov-commenter commented Dec 19, 2024 •

edited

Loading

AlexJF Dec 19, 2024 •

edited

Loading

datadog-datadog-prod-us1 bot commented Dec 19, 2024 •

edited

Loading