When "NATSSlowConsumerException" happens the Connection keeps throwing this exception and never restores its state #814
Comments
Slow consumer is a legitimate problem that cannot really be addressed without code changes. I would suggest switching to the simplification API, which uses pull under the covers. Pull better manages the flow of messages. But if you want to stick with a push-based consumer, one thing you can try is flow control.
I understand that it is a legitimate error, and I am fine with the exception being thrown. My problem is that the connection never gets back into a state in which it resumes accepting new messages, not even when there are no pending messages anymore (see log line on here: Basically, you can wait for hours without doing anything, and when you then publish something again the connection will still throw that exception, because nothing clears the exception out of the variable lastEx. We will look into the simplification API to see if that addresses our actual problem for now.
I hate to push for this, but any thoughts? We are trying to use the new API as you suggested. For the legacy / old flow, though, when we debugged the library this really looks like a problem in the library, as it never hits the code that should recover from this error.
Do you have a reproducible sample for me to work with?
I can share the solution if that helps; it contains the console app that is shown above. With the help of some breakpoints you will then find that in the method And having We did investigate adding a flow where this could be fixed so that we could offer you a PR, but to be honest we lacked the knowledge of the inner workings to come to a nice solution.
@davesmits Maybe different last exceptions for outgoing versus incoming messages on a shared connection?
We've encountered the same issue in our system, and our investigation led us to the same conclusion: when the slow consumer exception is thrown, it never recovers. To address this, we've implemented a solution where we exit our app with exit code 1, allowing the orchestrator to initiate a new instance.
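The workaround described in this comment could look roughly like the sketch below. It is a language-neutral illustration in Python, not code against the NATS .NET client: `SlowConsumerError`, `run_worker`, and `publish_once` are hypothetical names, and it assumes the orchestrator restarts the process on a non-zero exit code.

```python
import sys


class SlowConsumerError(Exception):
    """Hypothetical stand-in for the client's NATSSlowConsumerException."""


def run_worker(publish_once, max_iterations):
    """Call publish_once repeatedly and return a process exit code.

    Returns 1 as soon as the slow consumer error shows up, so the caller
    can terminate and let the orchestrator start a fresh instance.
    Returns 0 if all iterations complete without that error.
    """
    for _ in range(max_iterations):
        try:
            publish_once()
        except SlowConsumerError:
            print("slow consumer detected, giving up on this connection",
                  file=sys.stderr)
            return 1
    return 0
```

A process entry point would then end with `sys.exit(run_worker(publish_once, max_iterations))`. This trades in-process recovery for a full restart, which sidesteps the stuck error state without touching the library.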
@scottf And if we store it in a different variable, what would be the right place to close it? Throw it once in publish and restart after it has been thrown one time?
You would have to track whether it's an incoming message or an outgoing message, since the connection can have both.
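One way to picture the idea of separate last exceptions for incoming and outgoing traffic is the toy model below. It is a sketch in Python, not the NATS .NET client's code: `SplitErrorConnection`, `last_incoming_ex`, `last_outgoing_ex`, and the pending limit are all hypothetical. Clearing the incoming error once the pending queue is empty is an assumption taken from the expectation stated in the issue, not a confirmed design.

```python
class SlowConsumerError(Exception):
    """Hypothetical stand-in for the client's NATSSlowConsumerException."""


class SplitErrorConnection:
    """Toy connection that tracks incoming and outgoing errors separately."""

    def __init__(self, pending_limit):
        self.pending_limit = pending_limit
        self.pending = []             # delivered but not yet handled
        self.sent = []                # successfully published messages
        self.last_incoming_ex = None  # set when a subscriber falls behind
        self.last_outgoing_ex = None  # set only by publish-side failures

    def deliver(self, msg):
        """Incoming message; drop it and record the error when over the limit."""
        if len(self.pending) >= self.pending_limit:
            self.last_incoming_ex = SlowConsumerError("slow consumer")
            return False
        self.pending.append(msg)
        return True

    def next_message(self):
        """Subscriber handles one message."""
        msg = self.pending.pop(0)
        if not self.pending:
            # Assumption: an empty queue means the consumer has caught up.
            self.last_incoming_ex = None
        return msg

    def publish(self, msg):
        """Outgoing message; only the outgoing error can block it."""
        if self.last_outgoing_ex is not None:
            raise self.last_outgoing_ex
        self.sent.append(msg)
```

In this model a slow subscriber no longer poisons `publish`, and the incoming error disappears once the backlog is handled. Where a real client would surface the incoming error to the application is left open here.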
What version were you using?
Server: 2.9.1
Client:
What environment was the server running in?
Kubernetes 1.27.3 Linux AKS
Is this defect reproducible?
Run this; notice that once the NATSSlowConsumerException happens, it keeps happening.
We tried to debug the library itself and noticed that in Connection.cs, in processSlowConsumer, lastEx is set. But lastEx only gets cleared after a reconnect, which in our opinion shouldn't be required in this scenario and which the library also does not attempt.
We did try to add an additional method to reset lastEx, but we didn't find a good place to call it: in our scenario there is only one producer, which is the one that ends up in this state. No new messages get sent, so nothing would ever trigger a reset.
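The behaviour described above can be pictured with a small toy model. This is not the reporter's reproduction and not the library's code; it is a Python sketch built only from the description in this issue. `ToyConnection`, `pending_limit`, `deliver`, and `drain_pending` are hypothetical names, and `last_ex` stands in for lastEx.

```python
class SlowConsumerError(Exception):
    """Hypothetical stand-in for the client's NATSSlowConsumerException."""


class ToyConnection:
    """Heavily simplified model of a connection with one shared last error."""

    def __init__(self, pending_limit):
        self.pending_limit = pending_limit
        self.pending = []    # delivered but not yet handled
        self.sent = []       # successfully published messages
        self.last_ex = None  # models the lastEx field

    def deliver(self, msg):
        """Incoming message for a subscription on this connection."""
        if len(self.pending) >= self.pending_limit:
            # Models processSlowConsumer: drop the message, remember the error.
            self.last_ex = SlowConsumerError("slow consumer")
            return
        self.pending.append(msg)

    def drain_pending(self):
        """The subscriber catches up; last_ex is deliberately left untouched."""
        self.pending.clear()

    def publish(self, msg):
        """Outgoing message; fails for as long as last_ex is set."""
        if self.last_ex is not None:
            raise self.last_ex
        self.sent.append(msg)

    def reconnect(self):
        """The only place in this model where the stored error is cleared."""
        self.last_ex = None
```

After one overflow in `deliver`, every `publish` raises, even once `drain_pending` has emptied the queue. Only `reconnect` restores normal operation, which mirrors the state the issue reports.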
Reproduction:
Given the capability you are leveraging, describe your expectation?
That the exception gets resolved when there are no pending messages anymore
Given the expectation, what is the defect you are observing?
The exception happens and there is no way to resume normal operations except closing the connection and restarting the connection / subscription.