You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Is your feature request related to a problem? Please describe.
The problem is that if you specify an EMAThroughputSampler, any late arriving spans will not be accounted for, leading to the Sampler sampling more than it should be doing.
Describe the solution you'd like
I would like the late arriving spans to be accounted for, so that the EMAThroughputSampler does what it says on the tin :)
Describe alternatives you've considered
There are no alternatives that I can think of, besides ignoring the problem and hope it isn't going to be a major issue. But because the late arriving spans can be very unpredictable, because it can be a mix of application decisions and telemetry pipeline congestion, it is pretty much impossible to plan for.
Additional context
I have taken a quick look at what I think would be required to implement this and the thing would seem to be that a reference to the sampler that was used should be included in the TraceSentCache so that when TraceSentRecord.count is called, it can call a count method on the Sampler if that is available.
If you want to make it pretty you probably want to add a count or countSpan method to the Sampler interface and have it be a no-op for all samplers where the span count doesn't make sense and implement it in EMAThroughputSampler
I am happy to take a stab at things, but Go is very much not on my list of language I am comfortable enough with.
The text was updated successfully, but these errors were encountered:
This is interesting, but the changes are fairly far-reaching. It would help justify the work if you could futher explain the use case. In other words, what style and shape of telemetry + configuration is causing this to be enough of a problem that you think it needs fixing? What scale of distortion are you seeing?
As you can see in the screenshot from a usage query grouped by `meta.refinery.reason`, there are a lot of late arriving spans for some categories.
And that is a mix of spans actually async behaviour where the spans ends after the root span has ended and delays in the telemetry pipeline.
This particular pipeline is now running via Kafka, so daemonset collectors send through the telemetry to Kafka, which get picked up by another set of collectors who read from Kafka and forward to Refinery.
But the pattern was very similar to when it was daemonset collectors straight to Refinery. The Refinery instances were also likely to be delaying the messages by a few seconds under those circumstances.
It is a bit hard to estimate, but at this point I think it is down to about 10-20% distortion, it was well over 50% before the Kafka change.
And while it is a bit of work, I can certainly see that, are there any major obstacles that you can see? I would be happy to take a stab at it while I have some spare time.
Is your feature request related to a problem? Please describe.
The problem is that if you specify an EMAThroughputSampler, any late arriving spans will not be accounted for, leading to the Sampler sampling more than it should be doing.
Describe the solution you'd like
I would like the late arriving spans to be accounted for, so that the EMAThroughputSampler does what it says on the tin :)
Describe alternatives you've considered
There are no alternatives that I can think of, besides ignoring the problem and hope it isn't going to be a major issue. But because the late arriving spans can be very unpredictable, because it can be a mix of application decisions and telemetry pipeline congestion, it is pretty much impossible to plan for.
Additional context
I have taken a quick look at what I think would be required to implement this and the thing would seem to be that a reference to the sampler that was used should be included in the TraceSentCache so that when
TraceSentRecord.count
is called, it can call acount
method on the Sampler if that is available.refinery/collect/cache/traceSentCache.go
Lines 21 to 22 in 4d856c7
But that means we also need to change the EMAThroughputSampler to add a
count
method that only increases the count for that key: https://github.com/honeycombio/dynsampler-go/blob/48090f35ac362ffe3186e82501f552bd5f06eaca/emathroughput.go#L255If you want to make it pretty you probably want to add a
count
orcountSpan
method to the Sampler interface and have it be a no-op for all samplers where the span count doesn't make sense and implement it inEMAThroughputSampler
I am happy to take a stab at things, but Go is very much not on my list of language I am comfortable enough with.
The text was updated successfully, but these errors were encountered: