Replies: 3 comments 2 replies
-
This belongs to a discussion. If some cluster nodes are stopped when a quorum queue or stream tries to place its initial replicas (3 by default, and the supported minimum), the declaration will eventually fail one way or another because there won't be enough replicas online to form a Raft cluster. Quorum queues are not designed for a churn of 50 queues per second combined with an insufficient number of replicas online. Queue churn and short-lived queues are explicitly mentioned in the When Not to Use Quorum Queues section of the docs.

For the parts of your topology that experience that kind of churn, use non-replicated classic queues v2 (CQv2): they are significantly cheaper to set up and tear down, and require only a single online node to be successfully declared. Using quorum queues or streams in environments with high churn does not make much sense and won't let you benefit from any of their data safety characteristics when the queues (streams) are short-lived. A lifespan of 10s means these are short-lived queues.

This is a good example of why RabbitMQ has adopted the concept of different queue types with different design goals and characteristics (later extended to streams). Some of them explicitly do not target high-churn scenarios.

Moving to five nodes may or may not change much, because when the initial replica placement fails, none of Raft's usual data protection characteristics apply. If a (Raft, so a single QQ or stream) cluster cannot be formed, the party that creates them (an application) may choose to retry or not. Again, the recommendation is not to use five nodes; it is to use non-replicated CQv2s for the churning part of your topology, and QQs or streams for the mostly stable (static) part that would benefit from data replication and Raft-based recovery.
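The queue-type choice described above can be sketched as a small helper that builds the optional arguments passed to `queue.declare`. This is a minimal sketch, not RabbitMQ's API: `declare_args` is a hypothetical helper name, and the broker connection itself is omitted; the `x-queue-version` and `x-queue-type` arguments are the documented RabbitMQ optional queue arguments.

```python
def declare_args(queue_type: str) -> dict:
    """Build the optional arguments for an AMQP queue.declare call.

    Sketch only: the helper name is hypothetical; the argument keys
    are RabbitMQ's documented optional queue arguments.
    """
    if queue_type == "classic-v2":
        # Non-replicated classic queue v2: cheap to create and delete,
        # needs only a single online node at declaration time.
        # Suited to high-churn, short-lived queues.
        return {"x-queue-version": 2}
    if queue_type == "quorum":
        # Quorum queue: Raft-replicated, needs enough replicas online
        # to form a Raft cluster when its initial replicas are placed.
        # Suited to long-lived queues that need data safety.
        return {"x-queue-type": "quorum"}
    raise ValueError(f"unknown queue type: {queue_type}")


print(declare_args("classic-v2"))
print(declare_args("quorum"))
```

With a client library such as pika, the returned dict would be passed as the `arguments` parameter of `channel.queue_declare`; the churning part of the topology would use the `"classic-v2"` arguments and the stable part the `"quorum"` ones.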
-
Someone has suggested that this part of the guide on upgrading RabbitMQ should be mentioned here, because it can affect Raft-based features specifically, and it's not unheard of to see this approach used without much consideration for the online quorum and node identity.
-
If you can provide full logs, we could take a look and see if there is anything that can easily be improved here, but @michaelklishin is correct: quorum queues were never designed to be used in this manner. They are designed for long-lived queues that need data safety and good availability. For queue churn scenarios, classic queues should be used. That said, if an improvement can be identified for this case, we may well make it.
-
Describe the bug
While doing performance tests, we noticed that quorum queues often end up in a down state during a rolling cluster restart while simultaneously creating and deleting quorum queues at a high rate (>50 queues/second).
We observed this exception in the log:
Reproduction steps
Expected behavior
Queues do not end up in a down state during high rate queue creation/deletion usage pattern.
Additional context
We were using a performance test: