Node-Rotation can introduce a 2-instance AZ imbalance on an ASG that is not sized as a multiple of the available AZs #50

rtyley · 2020-01-14T13:05:48Z

This is an edited/updated version of the Ophan issue report raised by @buck06191 as https://github.com/guardian/ophan/pull/3603 - I'm raising it here as I think it's a potential issue that affects all users of elasticsearch-node-rotation and also wanted to make some amendments to the description of the issue.

Pre-conditions for node-rotation to cause issue

The Elasticsearch cluster ASG must already be slightly imbalanced: its size must not be a clean multiple of the number of usable Availability Zones, but exactly 1 more than such a multiple:

ASG_SIZE = 1 + (n * NUM_AZs)

Here, every AZ has n nodes, apart from the overprovisioned AZ, which has n+1.
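As a quick illustration of the size pre-condition, here is a minimal Python sketch. The function names `az_layout` and `meets_size_precondition` are made up for this example and are not part of elasticsearch-node-rotation; the sketch also assumes the ASG spreads instances as evenly as it can across zones.

```python
def az_layout(asg_size: int, num_azs: int) -> list[int]:
    """Spread asg_size instances as evenly as possible over num_azs zones."""
    base, extra = divmod(asg_size, num_azs)
    # The first `extra` zones each carry one instance more than the rest.
    return [base + 1 if i < extra else base for i in range(num_azs)]


def meets_size_precondition(asg_size: int, num_azs: int) -> bool:
    """True when ASG_SIZE = 1 + (n * NUM_AZs) for some whole number n."""
    return asg_size % num_azs == 1


if __name__ == "__main__":
    # 7 instances over 3 AZs: one zone holds n+1 = 3, the others n = 2.
    print(az_layout(7, 3), meets_size_precondition(7, 3))
    # 6 instances over 3 AZs: perfectly balanced, so the pre-condition fails.
    print(az_layout(6, 3), meets_size_precondition(6, 3))
```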

Additionally, the overprovisioned AZ must contain the oldest node - ie the one that will be targeted for rotation. (Assuming the oldest node is equally likely to be any node in the ASG, the probability of this is (n + 1)/ASG_SIZE, ie (NUM_AZs + ASG_SIZE - 1)/(NUM_AZs * ASG_SIZE), if you want to geek out)
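For anyone who wants to check that probability by counting, here is a small sketch. It assumes the oldest node is equally likely to be any node in the ASG, so the chance it sits in the overprovisioned AZ is simply that zone's share of the nodes, (n + 1)/ASG_SIZE. The function name is hypothetical.

```python
from fractions import Fraction


def prob_oldest_in_overprovisioned(asg_size: int, num_azs: int) -> Fraction:
    """Chance that the oldest node lives in the AZ holding n+1 nodes.

    Assumes ASG_SIZE = 1 + (n * NUM_AZs) and that every node is equally
    likely to be the oldest one.
    """
    if asg_size % num_azs != 1:
        raise ValueError("ASG size must be 1 more than a multiple of NUM_AZs")
    n = (asg_size - 1) // num_azs
    # The overprovisioned AZ holds n + 1 of the asg_size nodes.
    return Fraction(n + 1, asg_size)


if __name__ == "__main__":
    print(prob_oldest_in_overprovisioned(3, 2))  # 2 AZs, 3 nodes
    print(prob_oldest_in_overprovisioned(7, 3))  # 3 AZs, 7 nodes
```

Substituting n = (ASG_SIZE - 1)/NUM_AZs gives the same value in terms of the ASG size and zone count alone: (NUM_AZs + ASG_SIZE - 1)/(NUM_AZs * ASG_SIZE).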

So if you're using 2 Availability Zones (eg 'B' & 'C', perhaps to reduce Regional Data Transfer cost), the issue can arise if your ASG size is an odd number before rotation, ie not a multiple of 2.

Although we would generally try to avoid an imbalance like this, one can arise if a previous node-rotation failed, leaving an extra node.

Severity of imbalance caused by node-rotation

After node-rotation, the imbalance is exacerbated from 1 excess node in the original overprovisioned zone to 2 excess nodes in that same zone, with probability 1 / NUM_AZs - eg it's a 50/50 chance if you're only using 2 Availability Zones.

The ASG will then attempt to kill a node in the overprovisioned AZ, threatening the stability of the cluster. If you have a system like Ophan, where we additionally have Elasticsearch Node Protection, the only unprotected node will be the empty old one. Otherwise, there's an n/(n+1) chance that a populated Data Node will be killed.

How does the imbalance occur?

The Add Node step removes the oldest node (which belongs to the overprovisioned AZ) from the ASG
...this balances the ASG...
The ASG sees that it has one node less than Desired Capacity - and that the ASG is currently perfectly balanced. There is now a 1 / NUM_AZs chance it will choose to create a new instance in the AZ which was the original overprovisioned AZ, and we'll assume that happens.
ReattachOldInstance runs, reattaching the old node to the ASG. The ASG is now seriously imbalanced, with the overprovisioned zone having 2 more instances than the other zones.
The ASG will now attempt to randomly kill a node, as described above.
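The sequence above can be walked through with a small model. This is only a sketch under stated assumptions: it tracks per-AZ instance counts rather than real ASG state, it treats AZ 0 as the overprovisioned zone holding the oldest node, and it assumes (as described above) that a balanced ASG picks the AZ for the replacement instance uniformly at random. The function name `rotation_outcomes` is invented for this example.

```python
from fractions import Fraction


def rotation_outcomes(n: int, num_azs: int) -> dict[int, Fraction]:
    """Map the final AZ spread (max - min instances per AZ) to its probability."""
    # Starting point: AZ 0 holds n+1 nodes (including the oldest), the rest n.
    start = [n + 1] + [n] * (num_azs - 1)

    # Add Node step: the oldest node is detached from AZ 0, balancing the ASG.
    detached = start.copy()
    detached[0] -= 1

    outcomes: dict[int, Fraction] = {}
    # The ASG launches a replacement; each AZ is assumed equally likely.
    for az in range(num_azs):
        counts = detached.copy()
        counts[az] += 1
        # ReattachOldInstance: the old node returns to AZ 0.
        counts[0] += 1
        spread = max(counts) - min(counts)
        outcomes[spread] = outcomes.get(spread, Fraction(0)) + Fraction(1, num_azs)
    return outcomes


if __name__ == "__main__":
    # 2 AZs, 3 nodes before rotation: half the time the spread ends up at 2.
    print(rotation_outcomes(1, 2))
    # 3 AZs, 7 nodes before rotation: a spread of 2 occurs a third of the time.
    print(rotation_outcomes(2, 3))
```

A spread of 2 corresponds to the case where the replacement lands in the original overprovisioned AZ, which is the situation where the ASG goes on to kill a node.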

Suggested solution

Fix: Defer selection of Old node
