Replies: 3 comments 2 replies
-
This belongs to a discussion. If some cluster nodes are stopped when a quorum queue or stream tries to place its initial replicas (3 by default, and the supported minimum), the declaration will eventually fail one way or another because there won't be enough replicas online to form a Raft cluster. Quorum queues are not designed for a churn of 50 queues per second combined with an insufficient number of replicas online. Queue churn and short-lived queues are explicitly mentioned in the When Not to Use Quorum Queues section of the docs.

For the parts of your topology that experience that kind of churn, use non-replicated classic queues v2 (CQv2): they are significantly cheaper to set up and tear down, and require only a single online node to be successfully declared. Using quorum queues or streams in environments with high churn does not make much sense and won't let you benefit from any of their data safety characteristics when the queues (streams) are short-lived. A lifespan of 10s means these are short-lived queues.

This is a good example of why RabbitMQ has adopted the concept of different queue types with different design goals and characteristics (later extended to streams). Some of them explicitly do not target high-churn scenarios.

Moving to five nodes may or may not change much, because when the initial replica placement fails, none of Raft's usual data protection characteristics apply. If a (Raft, so a single QQ or stream) cluster cannot be formed, the party that creates them (an application) may choose to retry or not. Again, the recommendation is not to use five nodes; it is to use non-replicated CQv2s for the churning part of your topology, and QQs or streams for the mostly stable (static) part that would benefit from data replication and Raft-based recovery.
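The queue-type choice described above can be sketched as a small helper that builds the optional arguments passed to `queue.declare`. This is a minimal sketch, not RabbitMQ's API: `declare_args` is a hypothetical helper name, and the broker connection itself is omitted; the `x-queue-version` and `x-queue-type` arguments are the documented RabbitMQ optional queue arguments.

```python
def declare_args(queue_type: str) -> dict:
    """Build the optional arguments for an AMQP queue.declare call.

    Sketch only: the helper name is hypothetical; the argument keys
    are RabbitMQ's documented optional queue arguments.
    """
    if queue_type == "classic-v2":
        # Non-replicated classic queue v2: cheap to create and delete,
        # needs only a single online node at declaration time.
        # Suited to high-churn, short-lived queues.
        return {"x-queue-version": 2}
    if queue_type == "quorum":
        # Quorum queue: Raft-replicated, needs enough replicas online
        # to form a Raft cluster when its initial replicas are placed.
        # Suited to long-lived queues that need data safety.
        return {"x-queue-type": "quorum"}
    raise ValueError(f"unknown queue type: {queue_type}")


print(declare_args("classic-v2"))
print(declare_args("quorum"))
```

With a client library such as pika, the returned dict would be passed as the `arguments` parameter of `channel.queue_declare`; the churning part of the topology would use the `"classic-v2"` arguments and the stable part the `"quorum"` ones.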
-
Someone has suggested that this part of the guide on upgrading RabbitMQ should be mentioned here, because it can affect Raft-based features specifically, and it's not unheard of to see this approach used without much consideration for the online quorum and node identity.
-
If you can provide full logs, we could take a look and see if there is anything that can easily be improved here, but @michaelklishin is correct: quorum queues were never designed to be used in this manner. They are designed for long-lived queues that need data safety and good availability. For queue churn scenarios, classic queues should be used. That said, if an improvement can be identified for this case, we may well make it.
-
Describe the bug
While doing performance tests, we noticed that quorum queues often end up in a down state during a rolling cluster restart while simultaneously creating and deleting quorum queues at a high rate (>50 queues/second).
We observed this exception in the log:
Reproduction steps
Expected behavior
Queues do not end up in a down state during high rate queue creation/deletion usage pattern.
Additional context
We were using a performance test: