-
Notifications
You must be signed in to change notification settings - Fork 714
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix VirtualThread pinning with RealSpan and zipkin2.reporter by switching from synchronized to ReentrantLock #1427
Conversation
Rerun tracing benchmarks on current branch (
|
On this change (
|
|
…hing from synchronized to ReentrantLock Signed-off-by: Andriy Redko <[email protected]>
@@ -139,17 +142,30 @@ final class RealSpan extends Span { | |||
} | |||
|
|||
@Override public void finish(long timestamp) { | |||
synchronized (state) { | |||
lock.lock(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@codefromthecrypt please correct me if I am doing something silly here, but it looks like operations over pendingSpans
do not depend on state
and could be synchronized over a separate lock
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
state is a MutableSpan looked up by tracecontext. The problem with us having to remove all synchronized is now we have a much more complex sitation. I mean I am assuming it is really very required and still have a hard time believing it is.
Before the object was its own lock, guarded by itself. Now we are externalizing it even though the object (MutableSpan) can be mutated via various callers.
So, I think we need to focus on what is the pattern suggested for migrating to ReentrantLock from synchronized, when in synchronized we can use the object itself as the lock. Be careful to not expose API surface when thinking this through!
One way is to make sure that any locked access is using PendingSpan (internal) and moving the lock there. At least there, it has 1-1 relationship to what it is guarding (a mutable span)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One way is to make sure that any locked access is using PendingSpan (internal) and moving the lock there. At least there, it has 1-1 relationship to what it is guarding (a mutable span)
Got it, thanks @codefromthecrypt . To elaborate a bit on the issue, synchronized
over state
is fine and non-issue with virtual thread (it is super short and I didn't change it expect here) but pendingSpans.xxx
methods have unpredictable latency and may hold synchronized
for a looong time - this is an issue.
Before the object was its own lock, guarded by itself. Now we are externalizing it even though the object (MutableSpan) can be mutated via various callers.
That's I think is one of the problems: we cannot reliably guard MutableSpan
unless all accesses to it are guarded as well (not necessarily internally). But thanks a lot for highlighting context
<-> mutablespan link, I will look further.
So, I think we need to focus on what is the pattern suggested for migrating to ReentrantLock from synchronized, when in synchronized we can use the object itself as the lock. Be careful to not expose API surface when thinking this through!
ReentrantLock
sadly does not sadly help us much here since synchronized(state)
in brute force would be equivalent to something like state.lock.lock(); ....
- exactly the problem you mentioned, API surface will be harmed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why don't we solve for the issue then and leave the rest alone?
So, the only reason why state is held for a long time (microseconds) is when it is being sized for encoding in the async reporter, which is in another project, or someone is doing large amount of work tagging (bug).
One option could be that when a configuration is present, use copy mode which will be faster than microseconds (faster than sizing the span as you don't need to check encoded size of strings etc).
Also, the reported stack trace is inherently inefficient as it is copying the span from brave's type to zipkin's type first (ZipkinSpanHandler
). There's a brave-native encoder out for years, but for some reason wasn't queued for micrometer until I personally added it in 1.3 (`https://github.com/micrometer-metrics/tracing/releases/tag/v1.3.0-M2
java.base/java.util.concurrent.locks.ReentrantLock$Sync.lock(Unknown Source)
java.base/java.util.concurrent.locks.ReentrantLock.lock(Unknown Source)
zipkin2.reporter.ByteBoundedQueue.offer(ByteBoundedQueue.java:51)
zipkin2.reporter.AsyncReporter$BoundedAsyncReporter.report(AsyncReporter.java:259)
zipkin2.reporter.brave.ConvertingSpanReporter.report(ConvertingSpanReporter.java:44)
zipkin2.reporter.brave.ConvertingSpanReporter.report(ConvertingSpanReporter.java:29)
zipkin2.reporter.brave.ZipkinSpanHandler.end(ZipkinSpanHandler.java:148)
brave.internal.handler.NoopAwareSpanHandler$CompositeSpanHandler.end(NoopAwareSpanHandler.java:107)
brave.internal.handler.NoopAwareSpanHandler.end(NoopAwareSpanHandler.java:57)
Finally, I really think we owe it to define what is expensive blocking and so we can measure it. We are assuming this is expensive just because some message ended up in the log of a suboptimal setup.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
tracing in micrometer is very slow due to all the frameworks used to the point where overhead vs other options is milliseconds worse, not microseconds. Staring at this issue as if it holistically changes things is highly suspicious and a questionable reason to introduce a lot of risk. I think the most narrow solution to the problem, including muting log messages is a decent thing to consider.
e.g. measure this and you'll see hello world requests with millis extra overhead. https://github.com/openzipkin/brave-example/tree/master/webflux6-micrometer
fyi I'm going to unsubscribe from this topic as I feel like I don't want to watch us overcompensate for a "problem" put out of context. This is not about you, rather my frustration of being the recipient of work on something I don't even use, despite the number of paid employees of tracing that exist. Here's the summary of suggestions:
Maybe, we can change the AsyncReporter to handle this optionally https://github.com/openzipkin/zipkin-reporter-java/blob/master/brave/src/main/java/zipkin2/reporter/brave/AsyncZipkinSpanHandler.java#L212 Here, instead of doing "report" inline (which sizes the span), we can make a copy and return false (meaning this reporter must be the end of the line, which it already should be. After the copy is done (still in the lock), put it into a thread (maybe the existing reporter thread) to size or drop as needed. There are benchmarks in that project, so you can see how long the lock is being held for. Of course if a tracer adds a ton of tags it may be a lot longer with the current approach, but not as long as actually serializing or sending the span. EOF and good luck! |
Closes #1426
TODO: