chore(telemetry): integration exception tracking #11732
base: main
Changes from all commits
6905806
c409143
1a2ac4f
ec8f7ca
b132c4e
a6a8ab1
c735c60
606fcb8
daab422
f613c49
bf97f39
@@ -0,0 +1,67 @@

import logging
import os
import traceback

from ddtrace.internal.telemetry.constants import TELEMETRY_LOG_LEVEL


class DDTelemetryLogHandler(logging.StreamHandler):
Review comment on this line: Sending every error to telemetry is not ideal, as it can easily be overwhelmed in the case of repetitive errors. The RFC specifically asks to implement dedupe logic for that.

Reply: The dedupe logic is already built into the telemetry writer. The TelemetryWriter.add_log(...) operation adds log records to a set. Each log record is hashed using the log message, stack trace, and log level, so all log messages in a telemetry payload should be unique.

Reply: That's right. It's probably still not ideal, because we're missing a counter that can indicate how often the error has occurred, but that's a minor issue as long as it doesn't overwhelm the backend.
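To make the dedupe behaviour described in that thread concrete, here is a minimal sketch of set-based log deduplication, assuming records are keyed on (level, message, stack trace). The names DedupingLogBuffer, LogRecordKey, and flush are illustrative only, not the actual TelemetryWriter API.

```python
from typing import NamedTuple, Optional, Set


class LogRecordKey(NamedTuple):
    level: str
    message: str
    stack_trace: Optional[str]


class DedupingLogBuffer:
    """Illustrative stand-in for the set-based dedupe inside the telemetry writer."""

    def __init__(self) -> None:
        self._records: Set[LogRecordKey] = set()

    def add_log(self, level: str, message: str, stack_trace: Optional[str] = None) -> None:
        # Set semantics provide the dedupe: identical (level, message, stack_trace)
        # combinations hash to the same key and are stored only once per payload.
        self._records.add(LogRecordKey(level, message, stack_trace))

    def flush(self) -> list:
        # Drain the buffer. Note there is no per-record counter, which is the
        # limitation pointed out in the thread above.
        payload, self._records = list(self._records), set()
        return payload
```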
    CWD = os.getcwd()

    def __init__(self, telemetry_writer):
        self.telemetry_writer = telemetry_writer
        super().__init__()

    def emit(self, record):
        # type: (logging.LogRecord) -> None
        if record.levelno >= logging.ERROR:
Review comment on this line: If we don't capture debug logs, we miss many more caught exceptions that we might want to know about.

Reply: Here I am refactoring an existing feature; I moved all telemetry code from the internal logger into this module. This feature is used to track whether the ddtrace library was successfully configured: it records the last error emitted on application startup (for the whole ddtrace library, not just integrations). It shouldn't capture debug logs. We can move this refactor to another PR to avoid confusion.

Reply: I probably overlooked that it doesn't affect the add_log functionality below. I thought it was exiting the method at first glance, but now it looks fine.
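As a rough sketch of the "last startup error" behaviour described in that reply (an assumption about the intent, not the real TelemetryWriter.add_error implementation), only the most recent error would be retained and shipped with the app-started telemetry event; StartupErrorTracker and app_started_payload are made-up names.

```python
class StartupErrorTracker:
    """Illustrative stand-in for the startup-error tracking described above."""

    def __init__(self):
        self._last_error = None  # only the most recent startup error is kept

    def add_error(self, code, message, filename, lineno):
        self._last_error = {"code": code, "message": message, "file": filename, "line": lineno}

    def app_started_payload(self):
        # Reported once when the app-started telemetry event is flushed.
        return {"error": self._last_error}
```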
            # Capture start up errors
            full_file_name = os.path.join(record.pathname, record.filename)
            self.telemetry_writer.add_error(1, record.msg % record.args, full_file_name, record.lineno)

        # Capture errors logged in the ddtrace integrations
        if record.name.startswith("ddtrace.contrib"):
Review comment on this line: Are we okay with the possible performance implications of using a handler instead of a logger, since this check is now done for every logging message? Especially if we allow intercepting DEBUG logs.

Reply: This handler is only registered if instrumentation telemetry and telemetry log collection are enabled. The handler is also only registered on the root ddtrace logger and is only used in the dd-trace-py library. Any loggers configured by users or external libraries should be unaffected. The main performance hit is that this handler will be called every time the ddtrace library generates a log.

Reply: Which should be on by default. Yes, that was my expectation. I assume that DDLogger's rate limiter will naturally limit this, so it should probably be fine.
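A minimal sketch of the conditional registration described in that reply, using DDTelemetryLogHandler from this diff; the function name and flag parameters are assumptions for illustration, not the actual dd-trace-py configuration API.

```python
import logging


def configure_telemetry_log_handler(telemetry_writer, telemetry_enabled, log_collection_enabled):
    # Attach the handler only when telemetry and log collection are both enabled,
    # and only to the root "ddtrace" logger, so user and third-party loggers are
    # untouched. The flag parameters here are illustrative.
    if not (telemetry_enabled and log_collection_enabled):
        return  # no handler registered, so no per-record overhead for ddtrace logs
    logging.getLogger("ddtrace").addHandler(DDTelemetryLogHandler(telemetry_writer))
```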
            telemetry_level = (
                TELEMETRY_LOG_LEVEL.ERROR
                if record.levelno >= logging.ERROR
                else TELEMETRY_LOG_LEVEL.WARNING
                if record.levelno == logging.WARNING
                else TELEMETRY_LOG_LEVEL.DEBUG
            )
            # Only collect telemetry for logs with a traceback
            stack_trace = self._format_stack_trace(record.exc_info)
Review comment on this line: Without rate limiting, walking the stack can have bad performance implications as well.

Reply: Yup, this isn't a polished solution. I just wanted to illustrate how we could use a handler instead of creating a new logger. The deduplication of log messages happens in the telemetry writer. We should refactor this logic (e.g. hash log records using only the filename + line number + error message, and check whether a log record has already been queued before generating a stack trace).

Reply: I see. Perhaps the traceback rendering logic can then be moved out to the telemetry write method?
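A sketch of the refactor floated in that reply, under hypothetical names (DedupedTelemetryLogHandler, _report_integration_error, _seen_log_keys): dedupe on a cheap key first and only render the traceback for records that have not been queued yet.

```python
from ddtrace.internal.telemetry.constants import TELEMETRY_LOG_LEVEL


class DedupedTelemetryLogHandler(DDTelemetryLogHandler):
    """Hypothetical variant of the handler in this diff: dedupe before formatting."""

    def __init__(self, telemetry_writer):
        super().__init__(telemetry_writer)
        self._seen_log_keys = set()  # assumed handler-local dedupe state

    def _report_integration_error(self, record):
        # Cheap dedupe key: no traceback walk is needed to compute it.
        dedupe_key = (record.pathname, record.lineno, record.getMessage())
        if dedupe_key in self._seen_log_keys:
            return  # already queued once; skip the expensive traceback rendering
        self._seen_log_keys.add(dedupe_key)
        stack_trace = self._format_stack_trace(record.exc_info)
        if stack_trace is not None:
            self.telemetry_writer.add_log(
                TELEMETRY_LOG_LEVEL.ERROR,
                record.msg,
                tags={"lib_language": "python"},
                stack_trace=stack_trace,
            )
```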
            if stack_trace is not None:
                # Report only exceptions with a stack trace
                self.telemetry_writer.add_log(
                    telemetry_level,
                    record.msg,
                    # Do we need to set this tag? Should we allow telemetry intake to infer this value?
                    tags={"lib_language": "python"},
                    stack_trace=stack_trace,
                )

    def _format_stack_trace(self, exc_info):
        if exc_info is None:
            return None
        exc_type, exc_value, exc_traceback = exc_info
        if exc_traceback:
            tb = traceback.extract_tb(exc_traceback)
            formatted_tb = ["Traceback (most recent call last):"]
            for filename, lineno, funcname, srcline in tb:
                if self._should_redact(filename):
                    formatted_tb.append("  <REDACTED>")
                else:
                    relative_filename = self._format_file_path(filename)
                    formatted_line = f'  File "{relative_filename}", line {lineno}, in {funcname}\n    {srcline}'
                    formatted_tb.append(formatted_line)
            formatted_tb.append(f"{exc_type.__module__}.{exc_type.__name__}: {exc_value}")
            return "\n".join(formatted_tb)

    def _should_redact(self, filename):
        return "ddtrace" not in filename

    def _format_file_path(self, filename):
        try:
            return os.path.relpath(filename, start=self.CWD)
        except ValueError:
            return filename
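As a quick illustration of the redaction behaviour, the snippet below (an assumed way to exercise the handler in isolation, not a test from this PR) formats an exception raised outside of ddtrace; the non-ddtrace frame is collapsed to <REDACTED>.

```python
import sys

# Assumes DDTelemetryLogHandler (from the diff above) is importable here; passing
# telemetry_writer=None is fine because only _format_stack_trace is exercised.
handler = DDTelemetryLogHandler(telemetry_writer=None)

try:
    raise ValueError("boom")
except ValueError:
    print(handler._format_stack_trace(sys.exc_info()))

# Expected shape of the output: the frame from this script is redacted because its
# path does not contain "ddtrace", and the exception type/message are appended:
#
#   Traceback (most recent call last):
#     <REDACTED>
#   builtins.ValueError: boom
```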
Review comment: I'm afraid it won't intercept debug logs if the current log level is higher than that.

Reply: That's true. This one is a bit tricky. On one hand, I think we should respect the configured log level and only emit debug logs to instrumentation telemetry if DD_TRACE_DEBUG is enabled. On the other hand, I see the value of always capturing integration errors regardless of the log level. I think a better solution is to update the log level used in ddtrace integrations: if an integration hits an error, it should emit an error log (not a debug log with a stack trace).

Reply: I think the problem with that approach is that it exposes an error to users who may not be able to do anything about it, so it just pollutes the logs. In some other tracers, even if we don't show an error to the user, we still report it to telemetry so we can decide what to do with it. Think of it as a signal to us, not to the user. The idea is that we can report all caught exceptions and use the logger as an interface to get them to telemetry so we can act on them.
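The first point can be seen with a self-contained snippet (illustrative names only, no ddtrace imports): a handler's emit() only runs for records that pass the logger's effective level, so attaching a telemetry handler cannot recover DEBUG records when the logger is configured at a higher level.

```python
import logging


class CountingHandler(logging.Handler):
    """Stand-in for a telemetry handler; it just counts the records it receives."""

    def __init__(self):
        super().__init__()
        self.seen = 0

    def emit(self, record):
        self.seen += 1


log = logging.getLogger("ddtrace.contrib.example")  # illustrative logger name
log.setLevel(logging.INFO)  # typical non-debug configuration
log.propagate = False       # keep the example's output self-contained
counter = CountingHandler()
log.addHandler(counter)

log.debug("caught exception in integration")  # dropped before any handler runs
log.error("integration failed")               # reaches the handler

print(counter.seen)  # -> 1: the DEBUG record never made it to emit()
```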