race condition between on_node_down and queue.declare #9804
-
Describe the bug
In a three-node RabbitMQ cluster with nodes A, B, and C, both B and C execute on_node_down(A) after node A goes down. on_node_down() removes non-HA classic queues that were declared on node A. While such a queue (Q) is not yet deleted, any request to it is rejected with a 'suspended by supervisor' message, but once one node completes the deletion of Q, a client may declare it again. This introduces a race condition between on_node_down(A) running on multiple nodes and the re-declaration of Q, which can result in the queue being silently deleted from Mnesia while a client is consuming from it. As a side effect, after the channel to such a silently deleted queue is closed and the queue is re-declared, the queue is deleted again after the x-expires timeout, because the x-expires timer starts once the channel is closed.
Reproduction steps
Expected behavior
Queues not silently disappearing after declaration.
Additional context
We have a stable reproduction of this issue with RabbitMQ 3.8.16 on Erlang/OTP 23 and OpenStack/oslo-messaging as a client.
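For concreteness, here is a minimal sketch of the kind of declaration involved (an assumption on our part: a pika-style channel object is used, and the helper name, queue name, and 10-minute TTL are illustrative, not taken from the report):

```python
# Sketch (assumption): a non-HA classic queue declared with x-expires,
# the setup described in the report. `channel` is any pika-style channel.
def declare_expiring_queue(channel, name, ttl_ms=600_000):
    # x-expires: the broker deletes the queue once it has been "unused"
    # (no online consumers, no basic.get, no re-declaration) for ttl_ms.
    # Closing the consuming channel starts that timer, which is why the
    # re-declared queue in the report expires again after the channel closes.
    return channel.queue_declare(
        queue=name,
        arguments={"x-expires": ttl_ms},
    )
```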
Replies: 3 comments
-
RabbitMQ 3.8 reached EOL well over a year ago. This is a known behavior and there is no short-term solution for non-mirrored queue types. RabbitMQ cannot know whether a queue declaration is coming in the near future. Even if it were to delay the cleanup of non-mirrored/transient queues, the same race would still exist after that initial delay. This is a race condition between two operations that cannot be synchronized, because they are initiated by two (or more) different applications on different hosts. So there are a few options:
The same problem exists for connections with a very large number of exclusive queues.
The point is that for as long as transient queues exist and can be deleted in response to a client-initiated event of any sort, this situation cannot be avoided.
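The interleaving behind the report can be replayed deterministically in a simplified model (an assumption on our part: the dict below stands in for Mnesia, and each node's cleanup works from the snapshot of queues it took when node A died; this is not broker code):

```python
# Deterministic replay of the race: two nodes run on_node_down(A) from
# stale snapshots while a client re-declares the queue in between.
registry = {"Q": "node_a"}        # non-HA classic queue homed on node A

snapshot_b = dict(registry)       # B and C both observe A going down
snapshot_c = dict(registry)

# Node B's on_node_down(A): delete queues that were homed on A.
for q, home in snapshot_b.items():
    if home == "node_a":
        registry.pop(q, None)

# The client sees Q is gone and re-declares it on a surviving node.
registry["Q"] = "node_b"

# Node C's on_node_down(A) now runs from its *stale* snapshot and deletes
# the freshly re-declared Q as well: the silent disappearance.
for q, home in snapshot_c.items():
    if home == "node_a":
        registry.pop(q, None)

print("Q" in registry)   # False: the client's queue vanished underneath it
```

The two cleanup loops and the re-declaration come from independent processes on different hosts, which is why no ordering between them can be enforced.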
-
If you use queue expiration, you must be OK with queues being deleted at some point. Channels can run into an exception and be closed, or all consumers on them can be cancelled (an online consumer is what keeps the TTL from taking effect). If you cannot accept queue deletion by the TTL mechanism in such cases, do not use queue TTL: you very explicitly tell RabbitMQ "if this queue is unused at some point, delete it". The definition of "used" is simple and specific: the queue has online consumers. You could probably overprovision consumers and use Single Active Consumer to avoid parallel processing, but that sounds like the wrong thing to do. If you cannot afford queue transience, use quorum queues with three replicas.
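Assuming a pika-style channel again, the suggested switch to quorum queues amounts to one extra declaration argument (x-queue-type is the documented argument; the helper name is illustrative):

```python
# Sketch (assumption): declaring the queue as a quorum queue so it has
# replicas on multiple nodes and is not deleted when one node goes down.
def declare_quorum_queue(channel, name):
    return channel.queue_declare(
        queue=name,
        durable=True,                          # quorum queues must be durable
        arguments={"x-queue-type": "quorum"},
    )
```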
-
One more option would be to develop a non-replicated queue type that always migrates between nodes, even if that means losing data when its hosting node goes down: always choosing availability over consistency, so that deleting such queues when their home node is down would not be necessary by design. It would be a reasonable feature to add, but we won't start working on it until 4.0 ships in 2024. Right now a lot of other things are much more important, and shipping them would benefit far more deployments.