502 error if request to /api/queues/<vhost> with quorum queues without all nodes online #5134
-
I'm using a RabbitMQ 3.8.25 and Erlang 24.1.5 cluster with these plugins enabled:
Our cluster has 4 nodes and we have multiple vhosts. When we turn off 1 node, the Queues tab in the Management UI can no longer be accessed and the /api/queues/ API responds with a 502 error. If I switch to another vhost without any quorum queues, everything works properly. I tried to find some useful logs, but is there any way to filter error logs from the management plugin only?
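For anyone reproducing this, a minimal sketch of the failing request; the host, port, credentials, and the vhost name `my-vhost` are assumptions, not values from the report:

```shell
# Query the queues of one vhost via the management API.
# Host, port, credentials, and the vhost name "my-vhost" are placeholders.
# Note: the default vhost "/" must be percent-encoded as %2F in the URL.
curl -sf -u guest:guest \
  "http://localhost:15672/api/queues/my-vhost" \
  || echo "request failed (e.g. HTTP 502 while a node is down)"
```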
-
We cannot suggest much without any logs. The fact that your cluster has 4 nodes does not mean that said quorum queues have replicas on three nodes; they can be replicated to just two. Replica management is explicit. RabbitMQ 3.8 is out of general support and goes completely out of support in a few weeks. Consider upgrading to 3.10.
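Since replica management is explicit, it is worth checking where each quorum queue's replicas actually live. A hedged sketch using real CLI commands; the queue and vhost names are placeholders:

```shell
# Show the Raft membership and state of one quorum queue.
# "my-queue" and "my-vhost" are placeholder names.
rabbitmq-queues quorum_status "my-queue" --vhost "my-vhost"

# Or list replica membership for every queue in the vhost,
# including which members are currently online.
rabbitmqctl list_queues -p "my-vhost" name type leader members online
```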
-
Hi @michaelklishin, we upgraded to 3.12 and today we are facing much the same problem: some APIs take a lot longer than usual to respond. Everything becomes fast again when we start the stopped node back up. Our cluster has 6 nodes and I tried to find error/warning logs related to the problem, but no luck (we don't have debug logging enabled). Regarding this case, I properly shrank the stopped node before shutting it down. My service is still working properly, so it seems like only the management interface has the problem. These are the only error logs found during the API request.
There are some other error logs, but they do not occur during the time we make the API request:
FYI, not every node causes the problem when shut down. Some nodes can be shut down easily and cause nothing in the cluster. Some nodes (ones that have been in the cluster for quite a long time) will cause the slow API responses when they become unreachable from the cluster.
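For context, the "shrink" step mentioned above can be done with the `rabbitmq-queues` CLI, which moves quorum queue membership off a node before it is taken out of service; the node names below are placeholders:

```shell
# Remove this node's quorum queue replicas before decommissioning it,
# so the remaining members keep a healthy quorum.
# "rabbit@old-node" and "rabbit@new-node" are placeholder node names.
rabbitmq-queues shrink rabbit@old-node

# If a replacement node was added, replicas can be placed on it:
rabbitmq-queues grow rabbit@new-node all
```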
What you are looking at has been discussed in a lot of detail in #9522. Some nodes do not host queue replicas and others do.
The recommendation from that discussion, and from at least one similar one, is universal: use Prometheus for monitoring. It does not exhibit this behavior because nodes do not try to contact their peers to collect metrics, which means it won't wait for an operation on an unavailable node to time out. Using a Prometheus-compatible external monitoring tool has plenty of other benefits.
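As a starting point, a sketch of switching to per-node Prometheus scraping; the `rabbitmq_prometheus` plugin ships with modern RabbitMQ and serves metrics on port 15692 by default, and the hostname below is a placeholder:

```shell
# Enable the built-in Prometheus endpoint on each node.
# Each node then serves only its own metrics on port 15692,
# so no cross-node calls (and no waiting on a down peer) are involved.
rabbitmq-plugins enable rabbitmq_prometheus

# Verify locally; "localhost" is a placeholder for the node's hostname.
curl -s "http://localhost:15692/metrics" | head
```

Prometheus would then scrape each node's 15692 endpoint directly, rather than going through the management API of a single node.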
What could be done in the short term was done as part of #9874 for 3.13.0. There is one other change relevant to #9522 that we have in mind, but it's secondary to finishing 3.13 and then working…