chore(telemetry): integration exception tracking #11732
Open: ygree wants to merge 11 commits into main from ygree/integration-exception-tracking
+75 −8
Commits (11):
- 6905806 Integration Exception Tracking (ygree)
- c409143 Integration Exception Tracking (ygree)
- 1a2ac4f Integration Exception Tracking (ygree)
- ec8f7ca Fix format (ygree)
- b132c4e Extract DDTelemetryLogger (ygree)
- a6a8ab1 Implement _format_stack_trace to replace absolute file path with rela… (ygree)
- c735c60 Redact non-ddtrace stack frames from being set to telemetry. (ygree)
- 606fcb8 Merge branch 'main' into ygree/integration-exception-tracking (ygree)
- daab422 Fix format (ygree)
- f613c49 Merge branch 'main' into ygree/integration-exception-tracking (ygree)
- bf97f39 munir: move telemetry logger to a handler (mabdinur)
Conversations
It looks like we are introducing telemetry-specific logic into a logging source. Can we look for a different design that keeps the two separate, please?
Not really "introducing": some of this was already there to capture errors, and this change just extends it to exception tracking.
Alternatively, we would have to duplicate all the logging calls in the contrib modules just to get exception tracking, which is easy to forget to add and just introduces code duplication in the instrumentation code.
I'll consider adding a separate telemetry logger if you think that's a better solution. It will probably need to live in the same package, because my attempt to put it in a telemetry package ended with circular import errors.
I have introduced DDTelemetryLogger to separate concerns. Please let me know what you think about it.
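For context, a minimal sketch of the handler-based shape this could take (a later commit moves the logic into a logging handler). The names here, including record_exception_to_telemetry and TelemetryLogHandler, are hypothetical stand-ins, not the actual ddtrace API:

```python
import logging
import traceback


def record_exception_to_telemetry(message, stack_trace):
    """Hypothetical stand-in for whatever the telemetry writer exposes."""


class TelemetryLogHandler(logging.Handler):
    """Forward log records that carry exception info to telemetry.

    Sketch only: the real DDTelemetryLogger / handler in this PR may differ
    in naming, filtering, and in how it redacts non-ddtrace stack frames.
    """

    def emit(self, record):
        if record.exc_info is None:
            return
        # Format the traceback; the PR additionally rewrites absolute paths
        # to relative ones and drops non-ddtrace frames before sending.
        stack_trace = "".join(traceback.format_exception(*record.exc_info))
        record_exception_to_telemetry(record.getMessage(), stack_trace)


# Attaching the handler to the ddtrace logger picks up the existing
# log.debug(..., exc_info=True) calls in contrib code without duplicating them.
logging.getLogger("ddtrace").addHandler(TelemetryLogHandler())
```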
Great, thanks. I think we really need to move all telemetry-related code to the already existing telemetry sources. For instance, we already parse DD_INSTRUMENTATION_TELEMETRY_ENABLED in dd-trace-py/ddtrace/settings/config.py (lines 509 to 510 in 0e31457).
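For reference, the parsing being pointed at is roughly of this shape. This is a simplified sketch, not the actual lines from config.py; the helper name _asbool and the default value are assumptions here:

```python
import os


def _asbool(value):
    # Simplified truthy-string helper, similar in spirit to the one ddtrace uses.
    return str(value).strip().lower() in ("1", "true", "yes", "on")


# Roughly the shape of the existing check in settings/config.py.
telemetry_enabled = _asbool(os.getenv("DD_INSTRUMENTATION_TELEMETRY_ENABLED", "true"))
```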
Thank you for the feedback! While I agree with the general concern about coupling software components, I would appreciate some clarification and guidance on how the proposed improvements can be implemented effectively. My previous attempts to achieve this didn't succeed, so your input would be invaluable.
Could you elaborate on what you mean by "all telemetry-related code"? Moving DDTelemetryLogger to the telemetry module isn't straightforward because it is tightly coupled with DDLogger. Its primary functionality revolves around logging: extracting exceptions and passing them to the telemetry module. As a result, its logic and state are more closely tied to the logger than to telemetry itself.
Regarding the configuration, this is indeed a trade-off. Moving it to the telemetry module would result in circular dependency issues during initialization. Any suggestions on how to address these challenges while keeping the codebase clean and decoupled would be greatly appreciated.
Hey Yury,
In ddtrace/contrib/, we define 0 error logs, 49 warning logs, and 118 debug logs (GitHub search). This accounts for only a small fraction of the errors that occur.
In most cases, when ddtrace instrumentation fails at runtime, an unhandled exception is raised. These exceptions are not captured by ddtrace loggers.
If an exception escapes a user's application and reaches the Python interpreter, it will be captured by the TelemetryWriter exception hook. Currently, this hook only captures startup errors, but it could be extended to capture exceptions raised during runtime.
Rather than defining a ddtrace logger primarily for debug logs, we could capture critical runtime integration errors directly using the telemetry exception hook. This approach decouples the telemetry writer from the core loggers and ensures that one error per failed process is captured, eliminating the need for rate limiting.
Would this approach capture the errors you're concerned about?
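A rough sketch of how an exception hook could be extended along these lines; report_integration_error is a hypothetical helper, and the real TelemetryWriter hook wiring differs:

```python
import sys
import traceback


def report_integration_error(stack_trace):
    """Hypothetical telemetry call; stands in for the TelemetryWriter's reporting."""


_original_excepthook = sys.excepthook


def _telemetry_excepthook(exc_type, exc_value, exc_traceback):
    # Only report exceptions whose traceback passes through ddtrace frames,
    # so unrelated application errors are not sent to telemetry.
    frames = traceback.extract_tb(exc_traceback)
    if any("ddtrace" in frame.filename for frame in frames):
        report_integration_error(
            "".join(traceback.format_exception(exc_type, exc_value, exc_traceback))
        )
    # Always defer to the previous hook so normal error output is unchanged.
    _original_excepthook(exc_type, exc_value, exc_traceback)


sys.excepthook = _telemetry_excepthook
```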
Additionally, I’m a big fan of using telemetry metrics where possible. Metrics are easier to send and ingest, have lower cardinality, and are generally simpler to monitor and analyze. While a metric wouldn’t provide the context of tracebacks, it would be valuable if we could define telemetry metrics to track integration health.
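To make the metrics idea concrete, a hedged sketch of the kind of counter that could track integration health; add_count_metric is a hypothetical helper, not a specific ddtrace API:

```python
def add_count_metric(namespace, name, value, tags):
    """Hypothetical stand-in for a telemetry count-metric API."""


def record_integration_error(integration, error_type):
    # Low-cardinality counter: one series per (integration, error type),
    # without shipping full tracebacks.
    add_count_metric(
        "tracers",
        "integration_errors",
        1,
        tags={"integration": integration, "error_type": error_type},
    )


# For example, from an except block inside an instrumented integration:
# record_integration_error("flask", "AttributeError")
```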
Thanks for taking a look and sharing your thoughts, Munir!
I appreciate your suggestion; it makes perfect sense and would complement this effort well. Extending the telemetry exception hook to capture runtime errors in addition to startup errors would indeed provide valuable insight and ensure that critical errors are visible to us. I'd be interested to hear how the telemetry exception hook would need to be modified to do this, as I thought it already covered this case.
However, I think this is a slightly different goal from the one addressed in this PR. Reporting caught exceptions in our instrumentation can still be valuable, even though most caught exceptions in the contrib code are currently logged at the debug level. While that keeps them largely invisible to customers (which makes sense), these exceptions can still be very useful to us internally, particularly for identifying and fixing potentially broken integration code.
Without this functionality, we remain unaware of the problems behind these caught exceptions, which is what this PR is intended to address. The primary consumer of this data would be our team, not end users. Uncaught exceptions are visible to users; caught exceptions, while less severe, can still give us actionable insights to improve the product, and that is the idea behind this change. I hope this clarifies the intent and need behind the proposed changes.
I agree that capturing uncaught exceptions would be a nice follow-up task. That said, the functionality defined in the RFC that's currently landing in all 8 tracers is mostly about capturing errors sent via manual logger messages.