Support vMotion when RabbitMQ is hosted on virtual machines in vSphere #10701
-
Is your feature request related to a problem? Please describe.

We're encountering issues with our RabbitMQ clusters hosted on virtual servers, particularly during virtual machine migration (vMotion in VMware vSphere). The RabbitMQ documentation advises against suspending VMs during operation, which aligns very well with our experiences 😅.

Describe the solution you'd like

One potential approach could involve exploring the integration of vMotion Notifications introduced in vSphere 8 into RabbitMQ. These notifications, as highlighted in the vSphere 8 release notes, aim to improve interoperability with applications sensitive to latency or clustered environments. By considering the feasibility of detecting " Upon receiving a "
Once the "
Describe alternatives you've considered

As a workaround, we've resorted to restricting VM migration by deactivating vMotion and DRS on specific hosts. However, this isn't ideal, as it requires dedicating separate hardware solely to RabbitMQ in a virtualized environment.

Another option we've explored is developing a custom tool to manage RabbitMQ instances during VM migrations. This tool would detect vMotion events and adjust RabbitMQ instances accordingly. However, integrating this functionality directly into RabbitMQ would not only address our needs but also benefit other vSphere users facing similar challenges, streamlining deployment and maintenance on this virtualization platform.

Additional context

We understand that integrating RabbitMQ into specific virtualized environments may not be a primary focus for the RabbitMQ product 😃, but there's potential for synergies between RabbitMQ and VMware's virtualization technologies that could enhance the overall ecosystem and benefit users leveraging these platforms.

https://core.vmware.com/blog/whats-new-vsphere-8-vmotion
Replies: 3 comments 2 replies
-
@jonnepmyra there is nothing vSphere-specific in RabbitMQ. vMotion side effects have been discussed many times on our team; below is a reasonably brief explanation of what can be (in fact, has been) and what cannot be improved.

The effects of vMotion are very difficult to cope with for many distributed data services. A moved node disappears for the rest of the cluster, but the node itself is not aware of it. This is a very curious case of what Team RabbitMQ calls a partial network partition, where node A detects unavailability of B but B can still communicate just fine with A, or, in the case of vMotion, does not observe the failure for other reasons (it is "frozen", or "stunned" in vMotion terms). Certain features and implementation details in RabbitMQ 3.x have a very hard time dealing with that.
So even in 4.x, vMotion cannot be side-effect free for RabbitMQ and, I would argue, for pretty much any distributed data service. Raft provides enough safety guarantees for this case, but what vMotion does will be treated as a leader failure or as a follower falling behind and having to "re-join" (reinstall the missing portion of its log from the current leader). And in 3.x, certain features are not guaranteed to recover safely from a partial network partition of this kind. But at least in 4.x the effects of vMotion will be mostly safe.

How RabbitMQ can "integrate" with vMotion events is an open-ended question, and doing so likely won't yield any improvements, because a "frozen" node is a form of failure in many distributed system algorithms.
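To make the "frozen node" asymmetry concrete, here is a toy simulation. It is a sketch, not how RabbitMQ or Raft detect failures: the `Node` class, the `simulate` function, the heartbeat timeout, and the assumption that the stunned node's clock simply stops are all made up for illustration. The stun length is exaggerated so that it exceeds the timeout.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    clock: float = 0.0  # the node's own view of elapsed time, in seconds
    last_seen: dict = field(default_factory=dict)  # peer -> local time of last heartbeat


def simulate(stun_start, stun_end, total=10.0, tick=1.0, timeout=3.0):
    """Toy model: node B is frozen while stun_start <= wall clock < stun_end.

    Returns (a_suspected_b, b_suspected_a).
    Assumption: B's clock does not advance while it is frozen.
    """
    a, b = Node("A"), Node("B")
    a.last_seen["B"] = 0.0
    b.last_seen["A"] = 0.0
    a_suspected_b = False
    b_suspected_a = False
    wall = 0.0
    while wall < total:
        wall += tick
        b_frozen = stun_start <= wall < stun_end
        a.clock = wall  # A keeps running the whole time
        if not b_frozen:
            b.clock += tick  # B only experiences time while it runs
            # Heartbeats flow in both directions while both nodes run.
            a.last_seen["B"] = a.clock
            b.last_seen["A"] = b.clock
        if a.clock - a.last_seen["B"] > timeout:
            a_suspected_b = True
        if b.clock - b.last_seen["A"] > timeout:
            b_suspected_a = True
    return a_suspected_b, b_suspected_a


if __name__ == "__main__":
    print(simulate(stun_start=2.0, stun_end=8.0))
```

In this model A times out on B during a long stun, while B never notices anything: from B's point of view no time passed without a heartbeat. That one-sided view is the partial partition described above.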
-
Starting and stopping nodes in response to events can cause more trouble than it is worth:

RabbitMQ nodes pausing themselves in reaction to certain events (certain peer failures) is a feature in RabbitMQ 3.x that we would really like to remove compl…

Modern versions have a maintenance mode that can be activated in response to a vMotion event. It won't change much: some replicas …

With a range of improvements planned for 4.x, all of those effects will be less of a problem, or maybe won't affect most deployments at all. My point is that reacting to vMotion events per se does not change the safety and side effects equation much.
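For anyone who still wants to experiment with the maintenance mode idea, a minimal sketch could look like the following. `rabbitmq-upgrade drain` and `rabbitmq-upgrade revive` are the standard CLI commands for entering and leaving maintenance mode. Everything else is an assumption: the event names `migration_start` and `migration_end` are hypothetical, and how a vMotion notification actually reaches the guest is not covered here. As noted above, this does not change the safety equation much.

```python
import subprocess

# Hypothetical event names mapped to real RabbitMQ CLI commands.
# "migration_start"/"migration_end" are placeholders for whatever
# a vMotion notification listener would emit.
COMMANDS = {
    "migration_start": ["rabbitmq-upgrade", "drain"],   # enter maintenance mode
    "migration_end": ["rabbitmq-upgrade", "revive"],    # leave maintenance mode
}


def handle_migration_event(event, run=subprocess.run):
    """Run the maintenance-mode command for a migration event.

    Returns the command that was run, or None for unknown events.
    `run` is injectable so the handler can be tested without RabbitMQ.
    """
    cmd = COMMANDS.get(event)
    if cmd is None:
        return None
    run(cmd, check=True)  # raises CalledProcessError if the CLI fails
    return cmd
```

A listener would call `handle_migration_event("migration_start")` before the stun and `handle_migration_event("migration_end")` after the VM resumes. Whether draining can complete within the window a vMotion notification allows is something to measure, not assume.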
-
I was about to suggest doing this. If your tool works as expected, sharing the code with the RabbitMQ community would be a great way to give back.
Starting and stopping nodes in response to events can cause more trouble than it will be worth:
RabbitMQ nodes pausing themselves in reaction to certain events (certain peer failures) is a feature
in RabbitMQ 3.x that we would really like to remove compl…