[BUG] Message sending in ServiceBus queue occasionally fails with error "Cannot send a message when request response channel is disposed" or "Timeout on blocking read" #43877
Thank you for your feedback. Tagging and routing to the team member best able to assist.
Hello @s-vivien, thanks for reaching out. Could you share the following information:
After checking these, I'll let you know if further details are needed.
Hello,
It looks like one of the dependencies is overriding the reactor-core:3.4.41 that azure-core:1.54.1 brings, replacing it with 3.6.10. Could you either explicitly pin reactor-core to 3.4.41, or find the dependency that brings in 3.6.10 and use a version of that dependency that relies on 3.4.41? The second option is preferable because we've seen cases where the version loaded at runtime (or packaged) differs from the explicitly pinned one (and from the one listed in the dependency tree). (We're currently looking into upgrading from 3.4.41 to 3.7.x but ran into a few odd issues related to timers.) So, as a first step, let's see if this intermittent issue persists once the dependency versions are aligned.
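For reference, one common way to pin a transitive version in Maven is a `dependencyManagement` entry; this is only a sketch (the coordinates `io.projectreactor:reactor-core` are the real ones, but where the entry lives depends on your build layout):

```xml
<!-- Sketch: force reactor-core to the version azure-core 1.54.1 expects. -->
<dependencyManagement>
  <dependencies>
    <dependency>
      <groupId>io.projectreactor</groupId>
      <artifactId>reactor-core</artifactId>
      <version>3.4.41</version>
    </dependency>
  </dependencies>
</dependencyManagement>
```

After adding this, `mvn dependency:tree` should show 3.4.41 being selected; as noted above, it is still worth verifying which version actually ends up packaged and loaded at runtime.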
Also, internally the library will indeed retry on the "Cannot send a message when request response channel is disposed" event; this event is represented by the exception type RequestResponseChannelClosedException, and the sender's retry on this exception happens here. It is possible that the retry or the associated recovery was affected by CPU throttling. From the dependency tree, cores seem to be shared across the thread pools associated with HTTP, database, WebSocket, and Service Bus/Event Hubs. When experimenting with 3.4.41 as discussed in the last comment, I would temporarily bump up the cores (for example, 4 vCPUs) and observe. It is probably worth referring to the first section of the writeup "Average CPU usage can be misleading". Also, the Microsoft Java Team recommends at least 2 vCPUs when running Java apps on hosts like App Service.
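To illustrate the retry-on-transient-error pattern described above, here is a minimal, hypothetical sketch in plain Java. None of these names come from the Azure SDK: `RetriableException` merely stands in for a transient failure such as `RequestResponseChannelClosedException`, and the backoff policy is a simple exponential one for illustration.

```java
import java.time.Duration;
import java.util.concurrent.Callable;

// Sketch only: a generic retry wrapper, NOT the SDK's implementation.
public final class RetrySketch {
    /** Stand-in for a transient error like RequestResponseChannelClosedException. */
    public static class RetriableException extends RuntimeException {
        public RetriableException(String msg) { super(msg); }
    }

    /** Retries op on RetriableException with exponential backoff. */
    public static <T> T withRetry(Callable<T> op, int maxRetries, Duration baseDelay)
            throws Exception {
        RuntimeException last = null;
        for (int attempt = 0; attempt <= maxRetries; attempt++) {
            try {
                return op.call();
            } catch (RetriableException e) {
                last = e;
                if (attempt == maxRetries) break;
                // Exponential backoff: baseDelay * 2^attempt.
                Thread.sleep(baseDelay.toMillis() << attempt);
            }
        }
        throw last;
    }
}
```

The point of the sketch is that a retry like this only helps while the thread driving the recovery is actually scheduled; under heavy CPU throttling, even retried work can stall, which is the scenario discussed below.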
Thanks for the additional information. Yes, let's align the dependencies so we can avoid any unknowns. Looking at the graph, the host often reaches ~93-95% CPU and stays in that range for 5-6 hours. I'm not sure the timelines across these graphs are in sync with respect to reporting time, but if we look at the period where there were send errors, there seems to be an overlap between it and that high-CPU-utilization period. It is possible that, for some reason, the CPU load indirectly causes the connection to be lost (it could be transient errors as well). The error we're seeing, "disposed-channel", is a side effect of losing the connection.

When the current AMQP connection is lost, the single IO-Thread associated with that connection is responsible for cleaning up internal state and resources, then initiating the retry for a new connection. If this IO-Thread is throttled (i.e., not made runnable) due to CPU load, it may not progress effectively. Since there is no fairness in which thread gets picked to run, timer threads can still fire while the IO-Thread lags. The fact that the library eventually resumes sending suggests that the IO-Thread completes its work eventually and a new IO-Thread takes charge. While the IO-Thread of a disconnected connection is halted or lagging like this, any send calls made during that time wait for a new active connection to be ready, creating a queue of waiting tasks that can time out if a new connection is not available within the configured timeout period.

From my understanding, the sustained ~93-95% CPU peak is concerning. Based on my reading on this topic and discussions I've seen internally, a safe target upper range is 60-70%. I'm not an expert in this area, and it is case by case, but you may want to explore having multiple instances share the traffic behind a front end to balance the load, so CPU usage stays in a reasonable range and all components (Redis, HTTP, Cosmos, Service Bus / Event Hubs, etc.) get their fair share.
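The "queue of waiting tasks behind a stalled IO-Thread" behavior can be demonstrated with a small, self-contained sketch (not Azure SDK code): a single-thread executor models the connection's lone IO-Thread, a long-running task stands in for CPU starvation, and a "send" queued behind it exceeds its caller-side timeout.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Sketch: one worker thread plays the role of the connection's IO-Thread.
public final class StalledWorkerDemo {
    /** Returns true if a task queued behind a stall times out on the caller side. */
    public static boolean sendTimesOutBehindStall() throws Exception {
        ExecutorService ioThread = Executors.newSingleThreadExecutor();
        try {
            // The stall: occupies the only thread for 300 ms
            // (stand-in for the IO-Thread not being scheduled).
            ioThread.submit(() -> {
                try { Thread.sleep(300); } catch (InterruptedException ignored) { }
            });
            // A "send" queued behind the stall, with a 50 ms caller timeout.
            Future<String> send = ioThread.submit(() -> "sent");
            try {
                send.get(50, TimeUnit.MILLISECONDS);
                return false;
            } catch (TimeoutException e) {
                return true; // The queued send did not complete in time.
            }
        } finally {
            ioThread.shutdownNow();
        }
    }
}
```

The send itself is trivial; it times out purely because the single thread ahead of it is busy, which mirrors how sends can fail while the real IO-Thread is starved even though the broker is healthy.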
Furthermore, it is worth examining garbage collection (GC) events to determine whether they are contributing to the CPU spikes. If the application components generate a significant amount of garbage during traffic and memory is insufficient, frequent major GC events can occur and impede overall performance. Reviewing this can help decide whether the heap needs to be tuned.

Edit: I tried to find more information about App Service CPU metrics, but it is unclear whether this environment is also susceptible to CPU throttling with no visible correlation to average CPU usage (as noted in the writeup linked in my previous comment).
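On JDK 9+, GC events can be captured with the standard unified-logging flag; this is a generic example (the log file path, decorators, and jar name are placeholders, not anything specific to this app):

```shell
# Write GC events with timestamps to gc.log (JDK 9+ unified logging).
java -Xlog:gc*:file=gc.log:time,uptime,level,tags -jar app.jar
```

The resulting log shows pause durations and frequencies, which is usually enough to tell whether major GC activity lines up with the CPU spikes.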
After aligning the versions and tuning the CPU and/or memory if required, if the "channel-closed" error still occurs under a reasonable CPU and memory range, could you collect SDK DEBUG logs for the 20 minutes before and 10 minutes after the first "channel-closed" error? Enabling and collecting logs from only one app instance is sufficient. The instructions to collect the logs can be found here (note: "AMQP transport logs" are NOT required, only SDK logs). A correctly formatted log line will look like the example below (i.e., with the expected columns).
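If the app uses Logback, a sketch of a configuration that enables DEBUG for just the Service Bus and core AMQP loggers might look like the following; the appender, file name, and pattern are illustrative choices, not requirements:

```xml
<!-- Sketch: DEBUG for Azure Service Bus / AMQP loggers only. -->
<configuration>
  <appender name="FILE" class="ch.qos.logback.core.FileAppender">
    <file>sdk-debug.log</file>
    <encoder>
      <pattern>%d{HH:mm:ss.SSS} [%thread] %-5level %logger{36} - %msg%n</pattern>
    </encoder>
  </appender>
  <logger name="com.azure.messaging.servicebus" level="DEBUG"/>
  <logger name="com.azure.core.amqp" level="DEBUG"/>
  <root level="INFO">
    <appender-ref ref="FILE"/>
  </root>
</configuration>
```

Scoping DEBUG to these two loggers keeps the log volume manageable while still capturing the connection-recovery activity around the error.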
Describe the bug
My application is deployed on Azure (App Service). This application regularly sends messages to a ServiceBus queue. Most of the time it works fine, but sometimes, for an hour or so, most of the send attempts fail with one of the following errors (see stack traces).
When it fails, the client does not retry sending the message.
Exception or Stack Trace
or
Code Snippet
Expected behavior
I would expect the message to be sent, or at least retried in case of failure.
Setup (please complete the following information):
com.azure:azure-messaging-servicebus:7.17.7