Quorum queue crash during rolling upgrade to 3.13 (install_snapshot_rpc) #11442
Replies: 2 comments
-
Indeed, there is nothing prohibiting leaders from sending out snapshots in the new format. The opposite direction works: followers will accept both old and new. Perhaps the best fix is to make the leader send both values under different keys.
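To make the "both values under different keys" idea concrete, here is a minimal Python model of it. This is only a sketch: ra itself is written in Erlang, the key name `cluster_members`, the function names, and the shape of the per-member info are all hypothetical, and none of this is ra's actual snapshot format or the fix that was eventually chosen.

```python
# Illustrative model only: ra is written in Erlang, and the key name
# "cluster_members" below is hypothetical, not ra's actual snapshot format.


def encode_snapshot_meta(index, term, members):
    """New-code leader: send the member set in both shapes.

    `members` maps a server id to its per-member info (the new, map-shaped form).
    """
    return {
        "index": index,
        "term": term,
        # Old shape: a plain list of server ids, readable by followers
        # that have not been upgraded yet.
        "cluster": sorted(members),
        # New shape under a separate (hypothetical) key; old followers
        # never look at it.
        "cluster_members": dict(members),
    }


def old_follower_members(meta):
    """Old-code follower: only understands a list under "cluster"."""
    cluster = meta["cluster"]
    if not isinstance(cluster, list):
        raise TypeError("old follower cannot handle a non-list cluster field")
    return cluster


def new_follower_members(meta):
    """New-code follower: prefers the map, falls back to the old list."""
    if "cluster_members" in meta:
        return dict(meta["cluster_members"])
    cluster = meta["cluster"]
    if isinstance(cluster, dict):
        return dict(cluster)
    # Old list shape: no per-member info is available, so use an empty placeholder.
    return {server_id: {} for server_id in cluster}
```

In this model an old follower keeps reading the list it has always understood, while an upgraded follower picks up the richer map, so a leader running new code can serve both during a rolling upgrade.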
-
We've seen that there were some leader abdications during this rolling upgrade. We were wondering if and how this could be related to the generation of the snapshot at such an unfortunate moment, and more precisely, why did 01 jump to term 13? The logs below combine the logs of the 3 nodes. They've been simplified and marked after the timestamp with the corresponding node number.
-
During a rolling upgrade from 3.12.13 to 3.13.2 (Erlang 26.2.5) of a 3-node cluster, a quorum queue follower crashed with the following, when
noproc
I tried `delete_member` then `add_member` on 03, but it crashed again the same way.

I suspect the reason is that PR ra#375 changed the `cluster` field from a list to a map that is included in the `#install_snapshot_rpc`. In this case the RPC message is sent from 01 running the new code to a follower on 03 running the old code, which cannot handle a map.

Is this really a bug in rabbitmq/ra?
Is there any workaround to push through an upgrade? Maybe using `rabbit_quorum_queue:force_shrink_member_to_current_member/1` to temporarily shrink the queue to the new node 01, then add members 02 and 03 after the rolling upgrade has finished?

Tagging @illotum as you have a deep understanding of this part of the code (btw thanks for the great feature of non-voters).