chore(telemetry): integration exception tracking #11732
base: main
Conversation
Benchmarks: Benchmark execution time: 2025-01-31 01:58:43. Comparing candidate commit bf97f39 in PR branch. Found 0 performance improvements and 0 performance regressions! Performance is the same for 394 metrics, 2 unstable metrics.
Collect and dedupe ddtrace.contrib logs, and send them to telemetry.
Report only an error or an exception with a stack trace. Added tags and stack trace (without redaction)
Add count
Force-pushed from b11966f to ec8f7ca
ddtrace/internal/logger.py (outdated):

    class _TelemetryConfig:
It looks like we are introducing telemetry-specific logic into a logging source. Can we try to see if there is a different design that allows keeping the two separate, please?
Not really "introducing", since some of this was already there to capture errors, and this change just extends it to exception tracking.
Alternatively, we would have to duplicate all the logging calls in the contrib modules just to have exception tracking, which would be easy to forget to add and would just introduce code duplication in the instrumentation code.
I'll consider adding a separate telemetry logger if you think that's a better solution. It will probably need to be in the same package, because my attempt to put it in a telemetry package ended with
ImportError: cannot import name 'get_logger' from partially initialized module 'ddtrace.internal.logger' (most likely due to circular import)
I have introduced DDTelemetryLogger to separate concerns. Please let me know what you think about it.
Great, thanks. I think we really need to move all telemetry-related code to the already existing telemetry sources. For instance, we already parse DD_INSTRUMENTATION_TELEMETRY_ENABLED in dd-trace-py/ddtrace/settings/config.py (lines 509 to 510 in 0e31457):

    self._telemetry_enabled = _get_config("DD_INSTRUMENTATION_TELEMETRY_ENABLED", True, asbool)
    self._telemetry_heartbeat_interval = _get_config("DD_TELEMETRY_HEARTBEAT_INTERVAL", 60, float)
Thank you for the feedback! While I agree with the general concern about coupling software components, I would appreciate some clarification and guidance on how the proposed improvements can be implemented effectively. My previous attempts to achieve this didn’t succeed, so your input would be invaluable.
Could you elaborate on what you mean by "all telemetry-related code"? Moving DDTelemetryLogger to the telemetry module isn't straightforward because it is tightly coupled with DDLogger. Its primary functionality revolves around logging: extracting exceptions and passing them to the telemetry module. As a result, its logic and state are more closely tied to the logger than to telemetry itself.
Regarding the configuration, this is indeed a trade-off. Moving it to the telemetry module would result in circular dependency issues during initialization. Any suggestions on how to address these challenges while keeping the codebase clean and decoupled would be greatly appreciated.
Hey Yury,
In ddtrace/contrib/, we define 0 error logs, 49 warning logs, and 118 debug logs (GitHub search). This accounts for only a small fraction of the errors that occur.
In most cases, when ddtrace instrumentation fails at runtime, an unhandled exception is raised. These exceptions are not captured by ddtrace loggers.
If an exception escapes a user's application and reaches the Python interpreter, it will be captured by the TelemetryWriter exception hook. Currently, this hook only captures startup errors, but it could be extended to capture exceptions raised during runtime.
Rather than defining a ddtrace logger primarily for debug logs, we could capture critical runtime integration errors directly using the telemetry exception hook. This approach decouples the telemetry writer from the core loggers and ensures that one error per failed process is captured, eliminating the need for rate limiting.
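Concretely, a process-level hook along these lines could forward uncaught exceptions to telemetry; this is a minimal sketch, and report_uncaught_to_telemetry is a stand-in rather than the real TelemetryWriter interface:

    import sys
    import traceback

    def report_uncaught_to_telemetry(message, stack_trace):
        # Stand-in: the real implementation would enqueue a telemetry error/log record
        pass

    _original_excepthook = sys.excepthook

    def telemetry_excepthook(exc_type, exc_value, exc_tb):
        stack_trace = "".join(traceback.format_exception(exc_type, exc_value, exc_tb))
        report_uncaught_to_telemetry("%s: %s" % (exc_type.__name__, exc_value), stack_trace)
        _original_excepthook(exc_type, exc_value, exc_tb)  # keep the default behaviour

    sys.excepthook = telemetry_excepthook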
Would this approach capture the errors you're concerned about?
Additionally, I’m a big fan of using telemetry metrics where possible. Metrics are easier to send and ingest, have lower cardinality, and are generally simpler to monitor and analyze. While a metric wouldn’t provide the context of tracebacks, it would be valuable if we could define telemetry metrics to track integration health.
Thanks for taking a look and sharing your thoughts, Munir!
I appreciate your suggestion; it makes perfect sense and would complement this effort well. Extending the telemetry exception hook to capture runtime errors in addition to startup errors would indeed provide valuable insight and ensure that critical errors are visible to us. It would be interesting to hear how the telemetry exception hook would need to be modified to do this, as I thought it already covered this.
However, I think this is a slightly different goal than the one addressed in this PR. I think that reporting caught exceptions in our instrumentation can still be valuable, even though most caught exceptions in the contrib code are currently logged at the debug level. While this approach ensures that they remain largely invisible to customers (which makes sense), these exceptions can still be very useful to us internally, particularly in identifying and improving potentially broken integration code.
Without this functionality, we remain unaware of the problems associated with the caught exceptions that this PR is intended to address. The primary consumer of this data would be our team, not end users. While uncaught exceptions are visible to users, caught exceptions, though less severe, can provide us with actionable insights to improve the product, and that is the idea behind this change. I hope this clarifies the intent and need behind the proposed changes.
I agree that capturing uncaught exceptions would be a nice follow-up task. That said, the functionality defined in the RFC that's currently landing in all 8 tracers is mostly around capturing errors sent via manual logger messages.
Force-pushed from 9c9296f to a817bdf
Datadog Report: Branch report: ✅ 0 Failed, 1598 Passed, 0 Skipped, 27m 41.42s Total duration (14m 10.88s time saved)
Force-pushed from a817bdf to 3c1acc5
We just need some more tests. Aside from that this looks good to me!
munir: add telemetry log handler
Force-pushed from 3c1acc5 to bf97f39
logging.StreamHandler doesn't seem to be the right alternative for the problem
    self.telemetry_writer.add_error(1, record.msg % record.args, full_file_name, record.lineno)

    # Capture errors logged in the ddtrace integrations
    if record.name.startswith("ddtrace.contrib"):
Are we okay with the possible performance implications of using a handler instead of a logger, since this check is now done for every logging message? Especially if we allow intercepting DEBUG logs.
This handler is only registered if instrumentation telemetry and telemetry log collection are enabled.
The handler is also only registered on the root ddtrace logger and is only used by the dd-trace-py library. Any loggers configured by users or external libraries should be unaffected.
The main performance hit is that this handler will be called every time the ddtrace library generates a log.
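Roughly, the handler-based approach amounts to something like the following minimal sketch; it is not the PR's actual code, and send_to_telemetry is a stand-in for the real TelemetryWriter API:

    import logging
    import traceback

    def send_to_telemetry(level, message, stack_trace=None):
        # Stand-in: the real handler hands records to the TelemetryWriter
        pass

    class ContribTelemetryHandler(logging.Handler):
        """Forward ERROR records (or records carrying exc_info) from ddtrace.contrib loggers."""

        def emit(self, record):
            # Only integration loggers are of interest here
            if not record.name.startswith("ddtrace.contrib"):
                return
            # Report errors, plus lower-level records that carry an exception
            if record.levelno < logging.ERROR and not record.exc_info:
                return
            stack_trace = None
            if record.exc_info:
                stack_trace = "".join(traceback.format_exception(*record.exc_info))
            send_to_telemetry(record.levelname, record.getMessage(), stack_trace)

    # Attached only to the root "ddtrace" logger, so user and third-party loggers are unaffected
    logging.getLogger("ddtrace").addHandler(ContribTelemetryHandler())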
    def emit(self, record):
        # type: (logging.LogRecord) -> None
        if record.levelno >= logging.ERROR:
If we don't capture debug logs, we miss many caught exceptions that we might want to know about.
Here I am refactoring an existing feature. I moved all telemetry code from the internal logger to this module.
This feature is used to track whether the ddtrace library was successfully configured. It records the last error emitted on application start-up (for the whole ddtrace library, not just integrations). It shouldn't capture debug logs. We can move this refactor to another PR to avoid confusion.
    from ddtrace.internal.telemetry.constants import TELEMETRY_LOG_LEVEL


    class DDTelemetryLogHandler(logging.StreamHandler):
I'm afraid it won't intercept debug logs if the current log level is higher than that.
That's true. This one is a bit tricky. On the one hand, I think we should respect the configured log level and only emit debug logs to instrumentation telemetry if DD_TRACE_DEBUG is enabled. On the other hand, I see the value of always capturing integration errors regardless of the log level.
I think a better solution is to update the log level used in ddtrace integrations. If an integration hits an error, it should emit an error log (not a debug log with a stack trace).
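For example (hypothetical integration code; the logger name and patch_example_library are illustrative stand-ins):

    import logging

    log = logging.getLogger("ddtrace.contrib.example")  # hypothetical integration logger

    def patch_example_library():
        # Stand-in for a real patching step that fails at runtime
        raise RuntimeError("patching failed")

    try:
        patch_example_library()
    except Exception:
        # Instead of: log.debug("failed to patch example library", exc_info=True)
        log.error("failed to patch example library", exc_info=True)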
        else TELEMETRY_LOG_LEVEL.DEBUG
    )
    # Only collect telemetry for logs with a traceback
    stack_trace = self._format_stack_trace(record.exc_info)
Without rate limiting, walking the stack can have bad performance implications as well.
Yup, this isn't a polished solution. I just wanted to illustrate how we could use a handler instead of creating a new logger. The deduplication of log messages happens in the telemetry writer. We should refactor this logic (e.g. hash log records using only the filename + line number + error message, and check whether a log record has already been queued before generating a stack trace).
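A rough sketch of that refactor could look like this (illustrative only, not the PR's code):

    import logging

    _seen_records = set()

    def should_report(record):
        # type: (logging.LogRecord) -> bool
        """Return True only the first time a (filename, line, message) triple is seen."""
        key = (record.pathname, record.lineno, record.msg)
        if key in _seen_records:
            return False
        _seen_records.add(key)
        return True

    # In the handler, the stack trace would only be formatted when should_report(record)
    # returns True, so the traceback work is skipped for repeated log records.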
    from ddtrace.internal.telemetry.constants import TELEMETRY_LOG_LEVEL


    class DDTelemetryLogHandler(logging.StreamHandler):
Sending every error to telemetry is not ideal, as it can easily be overwhelmed in case of repetitive errors. The RFC specifically asks to implement dedupe logic for that.
The deduping logic is already built into the telemetry writer. The TelemetryWriter.add_log(..) operation adds log records to a set. Each log record is hashed using the log message, stack trace, and log level, so all the log messages in a telemetry payload should be unique.
This change is part of the cross-tracer Integration Exception Tracking initiative.
It's intended to capture integration-related errors and report them to the telemetry backend. It does this by implementing a DDTelemetryLogger that is used by the instrumentation code in the ddtrace.contrib package, and that only catches logs with level error or higher, or lower-level logs that have an exception attached.
The motivation for having DDTelemetryLogger is that the logger is the natural way to report integration-specific errors. This eliminates the need to duplicate this logic and helps ensure that such exceptions are not forgotten when reporting to telemetry. Here are some numbers about the ddtrace.contrib package log statements: …
To avoid overloading the telemetry backend, and also to minimize the impact of traversing and formatting the traceback, DDTelemetryLogger introduces a rate limiter that will not report the same error more than once per 60-second heartbeat interval.
For the traceback, it does some processing, including: …
Jira ticket: AIDM-389
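A minimal sketch of the per-error rate limiting described in the description above (illustrative; the real DDTelemetryLogger may implement it differently):

    import time

    _HEARTBEAT_INTERVAL_S = 60.0
    _last_reported = {}  # error key -> monotonic timestamp of the last report

    def allow_report(error_key):
        """Allow at most one report per error key per heartbeat interval."""
        now = time.monotonic()
        last = _last_reported.get(error_key)
        if last is not None and now - last < _HEARTBEAT_INTERVAL_S:
            return False
        _last_reported[error_key] = now
        return True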
Checklist
Reviewer Checklist