This is an edited and updated version of the Ophan issue report raised by @buck06191 as https://github.com/guardian/ophan/pull/3603. I'm raising it here because I think it potentially affects all users of elasticsearch-node-rotation, and because I wanted to make some amendments to the description of the issue.
Pre-conditions for node-rotation to cause issue
The Elasticsearch cluster ASG must already be slightly imbalanced: specifically, its size must not be a clean multiple of the number of usable Availability Zones, but instead 1 plus that:
ASG_SIZE = 1 + (n * NUM_AZs)
Here, every AZ has n nodes, apart from the overprovisioned AZ, which has n+1.
Additionally, the overprovisioned AZ must contain the oldest node - ie the one that will be targeted for rotation. (If every node is equally likely to be the oldest, the probability of this is (n + 1) / ASG_SIZE, which works out as (NUM_AZs + ASG_SIZE - 1)/(NUM_AZs * ASG_SIZE) if you want to geek out)
So if you're using 2 Availability Zones (eg 'B' & 'C', perhaps to reduce Regional Data Transfer cost), the issue can arise if your ASG size is an odd number before rotation (ie not a multiple of 2).
Although we would generally always try to avoid imbalances like this, they can arise if a previous node-rotation failed, leaving an extra node.
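As a rough illustration of the size pre-condition, here is a minimal Python sketch. The function names `is_at_risk` and `az_layout` are hypothetical and not part of elasticsearch-node-rotation; the sketch only encodes ASG_SIZE = 1 + (n * NUM_AZs) and the per-AZ node counts that follow from it.

```python
def is_at_risk(asg_size: int, num_azs: int) -> bool:
    """True when ASG_SIZE = 1 + (n * NUM_AZs) for some whole number n.

    With a single AZ there is nothing to balance, so that case is never at risk.
    """
    return num_azs >= 2 and asg_size % num_azs == 1


def az_layout(asg_size: int, num_azs: int) -> list[int]:
    """Node count per AZ for an at-risk ASG, overprovisioned AZ first."""
    if not is_at_risk(asg_size, num_azs):
        raise ValueError("ASG size is not 1 + (n * NUM_AZs)")
    n = (asg_size - 1) // num_azs
    # One AZ holds n + 1 nodes, every other AZ holds n.
    return [n + 1] + [n] * (num_azs - 1)


if __name__ == "__main__":
    print(is_at_risk(3, 2), az_layout(3, 2))
    print(is_at_risk(4, 2))
```

A check like this could run before a rotation starts, to flag clusters that are already carrying an extra node.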
Severity of imbalance caused by node-rotation
After node-rotation, the imbalance is exacerbated from 1 excess node in the original overprovisioned zone to 2 excess nodes in that same zone, with probability 1 / NUM_AZs - eg it's a 50/50 chance if you're only using 2 Availability Zones.
The ASG will then attempt to kill a node in the overprovisioned AZ, threatening the stability of the cluster. If you have a system like Ophan, where we additionally have Elasticsearch Node Protection, the only unprotected node will be the empty old one; otherwise there's an n/(n+1) chance that a populated Data Node will be killed.
How does the imbalance occur?
The Add Node step removes the oldest node (which belongs to the overprovisioned AZ) from the ASG
...this balances the ASG...
The ASG sees that it has one node less than Desired Capacity - and that the ASG is currently perfectly balanced. There is now a 1 / NUM_AZs chance it will choose to create a new instance in the AZ which was the original overprovisioned AZ, and we'll assume that happens.
ReattachOldInstance runs, reattaching the old node to the ASG. The ASG is now seriously imbalanced, with the overprovisioned zone having 2 more instances than the other zones.
The ASG will now attempt to randomly kill a node, as described above.
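The sequence above can be traced with a small Python sketch that tracks node counts per AZ. This is only a model of the steps as described here, not the Step Function's code; `simulate_rotation` and `imbalance` are hypothetical names, and the AZ chosen by the ASG for the new instance is passed in as a parameter rather than picked at random.

```python
def simulate_rotation(az_counts: dict[str, int], oldest_az: str,
                      new_node_az: str) -> list[tuple[str, dict[str, int]]]:
    """Return the per-AZ node counts after each step of the current flow."""
    counts = dict(az_counts)
    history = [("start", dict(counts))]

    # AddNode detaches the oldest node from the ASG.
    counts[oldest_az] -= 1
    history.append(("AddNode detaches oldest node", dict(counts)))

    # The ASG is now below Desired Capacity and launches a replacement.
    counts[new_node_az] += 1
    history.append(("ASG launches new node", dict(counts)))

    # ReattachOldInstance puts the old node back into the ASG.
    counts[oldest_az] += 1
    history.append(("ReattachOldInstance", dict(counts)))
    return history


def imbalance(counts: dict[str, int]) -> int:
    """Difference between the largest and smallest AZ."""
    return max(counts.values()) - min(counts.values())


if __name__ == "__main__":
    for step, counts in simulate_rotation({"B": 2, "C": 1}, "B", "B"):
        print(f"{step}: {counts} (imbalance {imbalance(counts)})")
```

With 2 AZs starting at B=2, C=1, the unlucky case (new node lands in B) ends at B=3, C=1, while the lucky case (new node lands in C) ends balanced at B=2, C=2.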
Suggested solution
Currently the Step Function, slightly simplified, looks like this:
GetOldestNode
AddNode - detaches the node from the ASG, triggering instantiation of a new node
ReattachOldInstance - reattaches the old node to the ASG, bringing ASG up to transition size
MigrateShards & ShardMigrationCheck ... takes a long time
RemoveNode - kills the old node, reduces ASG to the size it was when we started
Fix: Defer selection of Old node
The fix would be to defer selecting the node to remove until after the new node has been allocated, ie:
AddNode - simply add 1 to the desired size of the cluster, and wait for the new node to come up
GetOldNode - select the oldest node but only from the most-provisioned AZs
MigrateShards & ShardMigrationCheck ... takes a long time
RemoveNode - kill the old node
It's only after you know where the ASG brought up the new node that you know which AZ you want to reduce in size.
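A minimal Python sketch of the proposed GetOldNode selection is below. It is an illustration only: the `Node` type and `get_old_node` function are hypothetical, and a real implementation would take its instance list from the ASG rather than from an in-memory list.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Node:
    instance_id: str
    az: str
    launch_time: datetime


def get_old_node(nodes: list[Node]) -> Node:
    """Pick the oldest node, considering only the most-provisioned AZs."""
    if not nodes:
        raise ValueError("no nodes to choose from")
    per_az = Counter(node.az for node in nodes)
    largest = max(per_az.values())
    # Several AZs can tie for largest; all of them are eligible.
    candidates = [node for node in nodes if per_az[node.az] == largest]
    return min(candidates, key=lambda node: node.launch_time)


if __name__ == "__main__":
    nodes = [
        Node("i-b1", "B", datetime(2023, 1, 2)),
        Node("i-b2", "B", datetime(2023, 1, 3)),
        Node("i-b3", "B", datetime(2023, 6, 1)),  # freshly added node
        Node("i-c1", "C", datetime(2023, 1, 1)),  # oldest overall
    ]
    print(get_old_node(nodes).instance_id)
```

In this example the oldest node overall sits in AZ C, but removing it would leave B with 3 nodes and C with none, so the sketch picks the oldest node in B instead. When the AZs are balanced after AddNode, every AZ is eligible and the oldest node overall is chosen.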