-
Notifications
You must be signed in to change notification settings - Fork 10
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
telemetry worker: flush data after stops, and two other fixes #515
Conversation
BenchmarksComparisonBenchmark execution time: 2024-12-06 11:15:26 Comparing candidate commit cbce086 in PR branch Found 0 performance improvements and 0 performance regressions! Performance is the same for 51 metrics, 2 unstable metrics. CandidateCandidate benchmark detailsGroup 1
Group 2
Group 3
Group 4
Group 5
Group 6
Group 7
Group 8
Group 9
Group 10
Group 11
Group 12
BaselineOmitted due to size. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These booleans are woefully undocumented and I'm not completely sure about the expected life cycle, but from reading the code it looks like your interpretation is correct.
I think that it might be better to leave the worker as started
if it's restartable
and leave the checks in place.
@@ -296,7 +293,9 @@ impl TelemetryWorker { | |||
self.log_err(&e); | |||
} | |||
self.data.started = false; | |||
self.deadlines.clear_pending(); | |||
if !self.config.restartable { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wouldn't it be enough to only include self.data.started = false;
inside the if statement as well, and leave the exit early checks?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess it depends. Do we want Start, Stop, Stop to generate two stops? Because that's what PHP ends up generating. The second stop is a noop, but if I moved the assignment self.data.started = false
under the condition if !self.config.restartable
, then the stops would be effective
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the exit early checks are still in the code, then how would the second Stop
be effective?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From what I understand, what you're proposing is that I reset started = false
only if !restartable
. In that case started
would stay true forever once there is a start. So the early check
if !self.data.started {
return BREAK;
}
would never be hit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, you're right. I was confusing the early checks.
There is something in the logic that feels a bit broken. The FlushData
is also protected by this !self.data.started
check. Should it work after a restartable stop?
The things that Stop
does, should they happen when the first request ends, or when the worker stops?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should [FlushData] work after a restartable stop?
My guess is yes, otherwise there is no way to send the data that's collected after the stop (build_observability_batch
is called only from the handlers of Stop and FlushData), at least not without an intervening start+stop.
The things that Stop does, should they happen when the first request ends, or when the worker stops?
That is a good point. The final flush of the metrics should happen when the worker stops, not when handling Stop. I guess at some point they were the same, but then the restart thing was introduced. But regardless, once we have a way to send metrics after a Stop, for that happen periodic flushes should still happen. So FlushData shouldn't be skipped or unscheduled after a Stop.
4f63a34
to
251bafa
Compare
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #515 +/- ##
==========================================
+ Coverage 71.08% 71.12% +0.04%
==========================================
Files 301 301
Lines 44528 44533 +5
==========================================
+ Hits 31651 31673 +22
+ Misses 12877 12860 -17
|
447c409
to
a7d11b0
Compare
9c34541
to
e458f75
Compare
@bwoebi Can you look at this? I need it for DataDog/dd-trace-php#2735 |
e458f75
to
b018033
Compare
Telemetry workers are functionally dead after a Stop lifecycle action, provided there's no intervening Start. While AddPoint actions are still processed, their data is never flushed, since the Stop action handler unschedules FlushMetrics and FlushData actions. PHP sends a Stop action at the end of every request via ddog_sidecar_telemetry_end(), but a Start action is only generated just after a telemetry worker is spawned. It is not clear to me whether the intention is to a Start/Stop pair on every PHP requests (where Stop flushes the metrics) or if the intention is to to have only such a pair in the first request, with the Stop event generated by ddog_sidecar_telemetry_end() effectively a noop. It would appear, judging by [this comment](#391): > Also allow the telemetry worker to have a mode where it's continuing execution after a start-stop cycle, otherwise it won't send any more metrics afterwards. that the intention is to keep sending metrics after a Start/Stop pair. In that case: * The Stop action handler should not unschedule FlushData and FlushMetrics events and * FlushData, if called outside a Start-Stop pair, should not be a noop. Finally: swap the order in which FlushData and FlushMetrics are scheduled so that FlushMetrics runs first and therefore its generated data can be sent by the next FlushData.
EnqueuedTelemetryData has actions stored directly in an "actions" field, but also dependencies, configurations and integrations in separate fields; extract_telemetry_actions() converts from the latter into proper TelemetryActions. However, this method was being called only from a branch of register_service_and_flush_queued_actions() that would be hit when RuntimeInfo::apps does not already contain the key corresponding to the (service, env) in question (so, essentially, deps/confs/integrations would be lost on all but the first call).
b018033
to
b439363
Compare
Added two more commits to fix two other telemetry-related problems. Fixes the tests in https://github.com/DataDog/dd-trace-php/blob/glopes/waf-telemetry/appsec/tests/integration/src/test/groovy/com/datadog/appsec/php/integration/TelemetryTests.groovy |
ddtelemetry/src/worker/mod.rs
Outdated
let mut payloads = Vec::new(); | ||
|
||
if self.data.dependencies.flush_not_empty() { | ||
payloads.push(data::Payload::AppDependenciesLoaded( | ||
data::AppDependenciesLoaded { | ||
dependencies: self.data.dependencies.unflushed().cloned().collect(), | ||
dependencies: self.data.dependencies.drain_unflushed().cloned().collect(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't these already be removed on a successful send by the call to removed_flushed
? Yeah the method name has a d
too much.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've added a new commit that uses the existing method.
25bdce5
to
67da02b
Compare
async move { | ||
// This is different from the non-functional: | ||
// match self_arc.read().await.send_payload(&payload).await { ... } | ||
// presumably because the temp read guard would live till end of match |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Interesting find
Telemetry workers are functionally dead after a Stop lifecycle action, provided there's no intervening Start. While AddPoint actions are still processed, their data is never flushed, since the Stop action handler unschedules FlushMetrics and FlushData actions.
PHP sends a Stop action at the end of every request via ddog_sidecar_telemetry_end(), but a Start action is only generated just after a telemetry worker is spawned. With no more Start actions generated, no metrics can effectively be sent after the first Stop.
It is not clear to me whether the intention is to have a Start/Stop pair on every PHP request (where Stop flushes the metrics) or if the intention is to to have only such a pair in the first request, with the Stop event generated by ddog_sidecar_telemetry_end() effectively a noop. It would appear, judging by this
comment:
that the intention is to keep sending metrics after a Start/Stop pair. It also makes more sense, insofar as data is flushed only on the interval, rather than after every request via Stop. In that case:
Finally: swap the order in which FlushData and FlushMetrics are scheduled so that FlushMetrics runs first and therefore its generated data can be sent by the next FlushData.