502 error if request to /api/queues/<vhost> with quorum queues without all nodes online #5134
-
I'm using a RabbitMQ 3.8.25 and Erlang 24.1.5 cluster with these plugins enabled:
Our cluster has 4 nodes and we have multiple vhosts. When we turn off 1 node, the Queues tab in the Management UI can no longer be accessed and the /api/queues/ API responds with a 502 error. If I switch to another vhost without any quorum queues, everything works properly. I tried to find some useful logs, but is there any way to filter error logs from the management plugin only?
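For anyone reproducing this, a minimal sketch of the failing request; the host, port, credentials, and the vhost name `my-vhost` are assumptions, not values from the report:

```shell
# Query the queues of one vhost via the management API.
# Host, port, credentials, and the vhost name "my-vhost" are placeholders.
# Note: the default vhost "/" must be percent-encoded as %2F in the URL.
curl -sf -u guest:guest \
  "http://localhost:15672/api/queues/my-vhost" \
  || echo "request failed (e.g. HTTP 502 while a node is down)"
```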
-
We cannot suggest much without any logs. The fact that your cluster has 4 nodes does not mean that said quorum queues have replicas on three nodes; they can be replicated to just two. Replica management is explicit. RabbitMQ 3.8 is out of general support and goes completely out of support in a few weeks. Consider upgrading to 3.10.
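Since replica management is explicit, it is worth checking where each quorum queue's replicas actually live. A hedged sketch using real CLI commands; the queue and vhost names are placeholders:

```shell
# Show the Raft membership and state of one quorum queue.
# "my-queue" and "my-vhost" are placeholder names.
rabbitmq-queues quorum_status "my-queue" --vhost "my-vhost"

# Or list replica membership for every queue in the vhost,
# including which members are currently online.
rabbitmqctl list_queues -p "my-vhost" name type leader members online
```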
-
Hi @michaelklishin, we upgraded to 3.12 and today we are facing much the same problem: some APIs take a lot longer than usual to respond. Everything becomes fast again when we start the stopped node back up. Our cluster has 6 nodes and I tried to find error/warning logs related to the problem, but no luck (we don't have debug logging enabled). Regarding this case, I properly shrank the stopped node before shutting it down. My service is still working properly, so it seems like only the management interface has the problem. These are the only error logs found during the API request.
There are some other error logs, but they do not occur during the time we make the API request:
FYI, not every node causes the problem when shut down. Some nodes can be shut down easily and cause nothing in the cluster. Some nodes (ones that have been in the cluster for quite a long time) will cause the slow API responses when they become unreachable from the cluster.
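For context, the "shrink" step mentioned above can be done with the `rabbitmq-queues` CLI, which moves quorum queue membership off a node before it is taken out of service; the node names below are placeholders:

```shell
# Remove this node's quorum queue replicas before decommissioning it,
# so the remaining members keep a healthy quorum.
# "rabbit@old-node" and "rabbit@new-node" are placeholder node names.
rabbitmq-queues shrink rabbit@old-node

# If a replacement node was added, replicas can be placed on it:
rabbitmq-queues grow rabbit@new-node all
```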
What you are looking at has been discussed in a lot of detail in #9522. Some nodes do not host queue replicas and others do.
The recommendation from that discussion, and from at least one similar one, is universal: use Prometheus for monitoring. It does not exhibit this behavior because nodes do not try to contact their peers to collect metrics, which means it won't wait for an operation on an unavailable node to time out. Using a Prometheus-compatible external monitoring tool has plenty of other benefits.
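As a starting point, a sketch of switching to per-node Prometheus scraping; the `rabbitmq_prometheus` plugin ships with modern RabbitMQ and serves metrics on port 15692 by default, and the hostname below is a placeholder:

```shell
# Enable the built-in Prometheus endpoint on each node.
# Each node then serves only its own metrics on port 15692,
# so no cross-node calls (and no waiting on a down peer) are involved.
rabbitmq-plugins enable rabbitmq_prometheus

# Verify locally; "localhost" is a placeholder for the node's hostname.
curl -s "http://localhost:15692/metrics" | head
```

Prometheus would then scrape each node's 15692 endpoint directly, rather than going through the management API of a single node.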
What could be done in the short term was done as part of #9874 for 3.13.0. There is one other change relevant to #9522 that we have in mind, but it's secondary to finishing 3.13 and then working…