chore(telemetry): integration exception tracking #11732

Open · wants to merge 11 commits into base: main
8 changes: 0 additions & 8 deletions ddtrace/internal/logger.py
@@ -134,14 +134,6 @@ def handle(self, record):
        :param record: The log record being logged
        :type record: ``logging.LogRecord``
        """
        if record.levelno >= logging.ERROR:
            # avoid circular import
            from ddtrace.internal import telemetry

            # currently we only have one error code
            full_file_name = os.path.join(record.pathname, record.filename)
            telemetry.telemetry_writer.add_error(1, record.msg % record.args, full_file_name, record.lineno)

        # If rate limiting has been disabled (`DD_TRACE_LOGGING_RATE=0`) then apply no rate limit
        # If the logging is in debug, then do not apply any limits to any log
        if not self.rate_limit or self.getEffectiveLevel() == logging.DEBUG:
67 changes: 67 additions & 0 deletions ddtrace/internal/telemetry/logging.py
@@ -0,0 +1,67 @@
import logging
import os
import traceback

from ddtrace.internal.telemetry.constants import TELEMETRY_LOG_LEVEL


class DDTelemetryLogHandler(logging.StreamHandler):
Author:

I'm afraid it won't intercept debug logs if the current log level is higher than that.

Contributor:

That's true. This one is a bit tricky. On one hand, I think we should respect the configured log level and only emit debug logs to instrumentation telemetry if dd_trace_debug is enabled. On the other hand, I see the value of always capturing integration errors regardless of the log level.

I think a better solution is to update the log levels used in ddtrace integrations. If an integration hits an error, it should emit an error log (not a debug log with a stack trace).
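A minimal sketch of the direction suggested above; the logger name and the _apply_patches helper are hypothetical stand-ins, not code from this PR:

```python
import logging

log = logging.getLogger("ddtrace.contrib.example")  # hypothetical integration logger


def _apply_patches():
    # Stand-in for a real patching step that fails.
    raise RuntimeError("library changed its internals")


def patch():
    try:
        _apply_patches()
    except Exception:
        # Suggested direction: report integration failures at ERROR level with the
        # traceback attached, rather than as a DEBUG log that is easy to miss.
        log.error("failed to patch example integration", exc_info=True)


patch()
```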

Author:

I think the problem with that approach is that it exposes an error to users who may not be able to do anything about it, so it just pollutes the logs. In some other tracers, even if we don't show an error to the user, we still report it to telemetry so we can decide what to do with it. Think of it as a signal to us, not to the user. The idea is that we can report all caught exceptions and use the logger as an interface to get them to telemetry so we can act on them.

Author (@ygree, Jan 31, 2025):

Sending every error to telemetry is not ideal, as it can easily be overwhelmed in case of repetitive errors. The RFC specifically asks to implement dedupe logic for that.

Contributor:

The deduping logic is already built into the telemetry writer. The TelemetryWriter.add_log(..) operation adds log records to a set. Each log record is hashed using the log message, stack trace, and log level, so all the log messages in a telemetry payload should be unique.
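As a rough illustration of that dedupe idea (this is not the actual TelemetryWriter internals; the class and method names below are made up):

```python
class LogDeduper:
    """Illustrative only: collect telemetry log entries, dropping exact duplicates."""

    def __init__(self):
        self._logs = set()

    def add_log(self, level, message, stack_trace=""):
        # Keyed on (level, message, stack trace): identical records collapse into one entry.
        self._logs.add((level, message, stack_trace))

    def flush(self):
        logs, self._logs = self._logs, set()
        return [{"level": lvl, "message": msg, "stack_trace": st} for lvl, msg, st in logs]


dedupe = LogDeduper()
dedupe.add_log("ERROR", "integration failed", "Traceback ...")
dedupe.add_log("ERROR", "integration failed", "Traceback ...")  # duplicate, silently collapsed
assert len(dedupe.flush()) == 1
```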

Author:

That's right. It's probably not ideal because we're missing a counter that can indicate how often the error has occurred, but that's a minor issue as long as it doesn't overwhelm the backend.

    CWD = os.getcwd()

    def __init__(self, telemetry_writer):
        self.telemetry_writer = telemetry_writer
        super().__init__()

    def emit(self, record):
        # type: (logging.LogRecord) -> None
        if record.levelno >= logging.ERROR:
Author:

If we don't capture debug logs, we miss a lot more possible caught exceptions that we might want to know about.

Contributor:

Here I am refactoring an existing feature. I moved all the telemetry code from the internal logger to this module.

This feature is used to track whether the ddtrace library was successfully configured. It records the last error emitted on application startup (for the whole ddtrace library, not just integrations). It shouldn't capture debug logs. We can move this refactor to another PR to avoid confusion.

Author:

I probably overlooked that this doesn't affect the add_log functionality below; at first glance I thought it was exiting the method, but now it looks fine.

            # Capture start up errors
            full_file_name = os.path.join(record.pathname, record.filename)
            self.telemetry_writer.add_error(1, record.msg % record.args, full_file_name, record.lineno)

        # Capture errors logged in the ddtrace integrations
        if record.name.startswith("ddtrace.contrib"):
Author:

Are we okay with the possible performance implications of using a handler instead of a logger, given that this check is now done for every log message? Especially if we allow intercepting DEBUG logs.

Contributor:

This handler is only registered if instrumentation telemetry and telemetry log collection are enabled.

The handler is also only registered on the root ddtrace logger and is only used in the dd-trace-py library. Any loggers configured by users or external libraries should be unaffected.

The main performance hit is that this handler will be called every time the ddtrace library generates a log.
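To make the scoping concrete, here is a small, self-contained illustration (a toy handler, not the PR's code) of why only records from the ddtrace logger hierarchy reach a handler attached to the "ddtrace" logger:

```python
import logging


class CountingHandler(logging.Handler):
    """Toy handler that records which loggers' messages reach the 'ddtrace' logger."""

    def __init__(self):
        super().__init__()
        self.seen = []

    def emit(self, record):
        self.seen.append(record.name)


handler = CountingHandler()
logging.getLogger("ddtrace").addHandler(handler)

logging.getLogger("ddtrace.contrib.flask").error("integration error")  # propagates up to 'ddtrace'
logging.getLogger("myapp").error("user error")  # unrelated hierarchy, never reaches the handler

print(handler.seen)  # ['ddtrace.contrib.flask']
```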

Author (@ygree, Jan 31, 2025):

> This handler is only registered if instrumentation telemetry and telemetry log collection are enabled.

Which should be on by default.

> The handler is also only registered on the root ddtrace logger and is only used in the dd-trace-py library. Any loggers configured by users or external libraries should be unaffected.

Yes, that was my expectation.

> The main performance hit is that this handler will be called every time the ddtrace library generates a log.

I assume that DDLogger's rate limiter will naturally limit this, so it should probably be fine.

            telemetry_level = (
                TELEMETRY_LOG_LEVEL.ERROR
                if record.levelno >= logging.ERROR
                else TELEMETRY_LOG_LEVEL.WARNING
                if record.levelno == logging.WARNING
                else TELEMETRY_LOG_LEVEL.DEBUG
            )
            # Only collect telemetry for logs with a traceback
            stack_trace = self._format_stack_trace(record.exc_info)
Author:

Without rate limiting, walking the stack can have bad performance implications as well.

Contributor:

Yup, this isn't a polished solution. I just wanted to illustrate how we could use a handler instead of creating a new logger. The deduplication of log messages happens in the telemetry writer. We should refactor this logic (e.g. hash log records using only the filename + line number + error message, and check whether a log record has already been queued before generating a stack trace).
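A sketch of what that refactor could look like (hypothetical names; the point is only to show the keying and the early exit before any traceback work):

```python
import logging


class SeenRecordFilter:
    """Illustrative sketch of the refactor proposed above: key records by origin + message
    so the expensive traceback formatting can be skipped for repeats."""

    def __init__(self):
        self._seen = set()

    def is_new(self, record):
        key = (record.pathname, record.lineno, record.msg)
        if key in self._seen:
            return False
        self._seen.add(key)
        return True


dedupe = SeenRecordFilter()
rec = logging.LogRecord("ddtrace.contrib.example", logging.ERROR, "patch.py", 42, "patch failed", None, None)
assert dedupe.is_new(rec) is True
assert dedupe.is_new(rec) is False  # same call site + message: skip formatting the traceback
```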

Author:

I see. Perhaps the traceback rendering logic can then be moved out to the telemetry writer?

            if stack_trace is not None:
                # Report only exceptions with a stack trace
                self.telemetry_writer.add_log(
                    telemetry_level,
                    record.msg,
                    # Do we need to set this tag? Should we allow telemetry intake to infer this value?
                    tags={"lib_language": "python"},
                    stack_trace=stack_trace,
                )

    def _format_stack_trace(self, exc_info):
        if exc_info is None:
            return None
        exc_type, exc_value, exc_traceback = exc_info
        if exc_traceback:
            tb = traceback.extract_tb(exc_traceback)
            formatted_tb = ["Traceback (most recent call last):"]
            for filename, lineno, funcname, srcline in tb:
                if self._should_redact(filename):
                    formatted_tb.append(" <REDACTED>")
                else:
                    relative_filename = self._format_file_path(filename)
                    formatted_line = f' File "{relative_filename}", line {lineno}, in {funcname}\n {srcline}'
                    formatted_tb.append(formatted_line)
            formatted_tb.append(f"{exc_type.__module__}.{exc_type.__name__}: {exc_value}")
            return "\n".join(formatted_tb)

    def _should_redact(self, filename):
        return "ddtrace" not in filename

    def _format_file_path(self, filename):
        try:
            return os.path.relpath(filename, start=self.CWD)
        except ValueError:
            return filename
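For completeness, a rough usage sketch of the handler with a stub writer (the stub and the logger name are invented; it assumes the module added in this PR is importable):

```python
import logging

from ddtrace.internal.telemetry.logging import DDTelemetryLogHandler  # module added in this PR


class StubWriter:
    """Stand-in telemetry writer that just records what the handler would send."""

    def __init__(self):
        self.logs = []
        self.errors = []

    def add_log(self, level, message, tags=None, stack_trace=None):
        self.logs.append((level, message, stack_trace))

    def add_error(self, code, message, filename, lineno):
        self.errors.append((code, message, filename, lineno))


writer = StubWriter()
log = logging.getLogger("ddtrace.contrib.example")
log.addHandler(DDTelemetryLogHandler(writer))
log.propagate = False  # keep the example quiet

try:
    raise ValueError("boom")
except ValueError:
    log.error("integration failed", exc_info=True)

# Frames outside of ddtrace files are collapsed to "<REDACTED>" by _should_redact,
# so only library frames would end up in the telemetry payload.
print(writer.logs)
```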
8 changes: 8 additions & 0 deletions ddtrace/internal/telemetry/writer.py
@@ -39,6 +39,7 @@
from .data import get_host_info
from .data import get_python_config_vars
from .data import update_imported_dependencies
from .logging import DDTelemetryLogHandler
from .metrics import CountMetric
from .metrics import DistributionMetric
from .metrics import GaugeMetric
@@ -68,6 +69,10 @@ class _TelemetryConfig:
    INSTALL_TYPE = os.environ.get("DD_INSTRUMENTATION_INSTALL_TYPE", None)
    INSTALL_TIME = os.environ.get("DD_INSTRUMENTATION_INSTALL_TIME", None)
    FORCE_START = asbool(os.environ.get("_DD_INSTRUMENTATION_TELEMETRY_TESTS_FORCE_APP_STARTED", "false"))
    LOG_COLLECTION_ENABLED = TELEMETRY_ENABLED and os.getenv("DD_TELEMETRY_LOG_COLLECTION_ENABLED", "true").lower() in (
        "true",
        "1",
    )


class LogData(dict):
@@ -222,6 +227,8 @@ def __init__(self, is_periodic=True, agentless=None):
        # Force app started for unit tests
        if _TelemetryConfig.FORCE_START:
            self._app_started()
        if _TelemetryConfig.LOG_COLLECTION_ENABLED:
            getLogger("ddtrace").addHandler(DDTelemetryLogHandler(self))

    def enable(self):
        # type: () -> bool
@@ -504,6 +511,7 @@ def add_log(self, level, message, stack_trace="", tags=None):
data["tags"] = ",".join(["%s:%s" % (k, str(v).lower()) for k, v in tags.items()])
if stack_trace:
data["stack_trace"] = stack_trace
# Logs are hashed using the message, level, tags, and stack_trace. This should prevent duplicatation.
self._logs.add(data)

def add_gauge_metric(self, namespace: TELEMETRY_NAMESPACE, name: str, value: float, tags: MetricTagType = None):