Node-Rotation can introduce a 2-instance AZ imbalance on an ASG that is not sized as a multiple of the available AZs #50

Open
rtyley opened this issue Jan 14, 2020 · 0 comments


rtyley commented Jan 14, 2020

This is an edited/updated version of the Ophan issue report raised by @buck06191 as https://github.com/guardian/ophan/pull/3603 - I'm raising it here as I think it's a potential issue that affects all users of elasticsearch-node-rotation and also wanted to make some amendments to the description of the issue.

Pre-conditions for node-rotation to cause issue

The Elasticsearch cluster ASG must already be slightly imbalanced, specifically it must have a size which is not a clean multiple of the number of usable Availability Zones, but instead 1 plus that:

ASG_SIZE = 1 + (n * NUM_AZs)

Here, every AZ has n nodes, apart from the overprovisioned AZ, which has n+1.
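As a rough illustration, this pre-condition can be checked from per-AZ instance counts. This is only a sketch: the function name `is_one_over_balanced` and the mapping-of-counts input are assumptions made for the example, not part of elasticsearch-node-rotation.

```python
from typing import Mapping


def is_one_over_balanced(az_counts: Mapping[str, int]) -> bool:
    """True when every AZ holds n nodes except a single AZ holding n + 1."""
    counts = sorted(az_counts.values())
    if len(counts) < 2:
        # With fewer than two AZs there is nothing to be imbalanced against.
        return False
    n = counts[0]
    # All AZs but the largest must hold n; the largest must hold exactly n + 1.
    return all(c == n for c in counts[:-1]) and counts[-1] == n + 1


print(is_one_over_balanced({"B": 2, "C": 1}))  # ASG_SIZE = 3 = 1 + (1 * 2)
print(is_one_over_balanced({"B": 2, "C": 2}))  # ASG_SIZE = 4, a clean multiple of 2
```

The first call describes the vulnerable shape (one AZ with a single extra node), the second a balanced ASG.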

Additionally, the overprovisioned AZ must contain the oldest node - ie the one that will be targeted for rotation. (Assuming every node is equally likely to be the oldest, the probability of this is (n + 1)/ASG_SIZE, which works out as (NUM_AZs + ASG_SIZE - 1)/(NUM_AZs * ASG_SIZE), if you want to geek out)
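For a concrete number, the chance can be worked out by counting nodes directly. The sketch below assumes every node is equally likely to be the oldest, which is an assumption of the example rather than something the ASG guarantees; the function name is made up for illustration.

```python
from fractions import Fraction


def p_oldest_in_overprovisioned_az(num_azs: int, n: int) -> Fraction:
    """Chance the oldest node sits in the AZ holding n + 1 nodes.

    Assumes each node is equally likely to be the oldest.
    """
    asg_size = 1 + n * num_azs
    # n + 1 of the asg_size nodes live in the overprovisioned AZ.
    return Fraction(n + 1, asg_size)


# 2 AZs with n = 1: the ASG has 3 nodes, 2 of them in the overprovisioned AZ.
print(p_oldest_in_overprovisioned_az(num_azs=2, n=1))
```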

So if you're using 2 Availability Zones (eg 'B' & 'C', perhaps to reduce Regional Data Transfer cost), the issue can arise if your ASG size before rotation is not a multiple of 2, ie it's an odd number.

Although we would generally try to avoid an imbalance like this, one can arise if a previous node-rotation failed, leaving an extra node.

Severity of imbalance caused by node-rotation

After node-rotation, the imbalance is exacerbated from 1 excess node in the original overprovisioned zone to 2 excess nodes in that same zone, with probability 1 / NUM_AZs - eg it's a 50/50 chance if you're only using 2 Availability Zones.

The ASG will then attempt to kill a node in the overprovisioned AZ, threatening the stability of the cluster. If you have a system like Ophan, where we additionally have Elasticsearch Node Protection, the only unprotected node will be the empty old one. Otherwise, there's an n/(n+1) chance that a populated Data Node will be killed.

How does the imbalance occur?

  1. The Add Node step removes the oldest node (which belongs to the overprovisioned AZ) from the ASG
  2. ...this balances the ASG...
  3. The ASG sees that it has one node less than Desired Capacity - and that the ASG is currently perfectly balanced. There is now a 1 / NUM_AZs chance it will choose to create a new instance in the AZ which was the original overprovisioned AZ, and we'll assume that happens.
  4. ReattachOldInstance runs, reattaching the old node to the ASG. The ASG is now seriously imbalanced, with the overprovisioned zone having 2 more instances than the other zones.
  5. The ASG will now attempt to randomly kill a node, as described above.
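The steps above can be traced with a small model of the per-AZ counts. This is a sketch under stated assumptions: it models the ASG as launching the replacement in any one of the least-populated AZs (the real placement logic belongs to AWS and is not reproduced here), and `rotation_outcomes` is a hypothetical helper, not a function from elasticsearch-node-rotation.

```python
from typing import Dict, List, Tuple


def spread(counts: Dict[str, int]) -> int:
    """Difference between the fullest and emptiest AZ."""
    return max(counts.values()) - min(counts.values())


def rotation_outcomes(az_counts: Dict[str, int], oldest_az: str) -> List[Tuple[str, int]]:
    """Return (AZ of new node, spread after reattach) for each possible launch AZ."""
    after_detach = dict(az_counts)
    after_detach[oldest_az] -= 1  # step 1: oldest node detached from the ASG

    # Step 3 (assumption): the new node may land in any least-populated AZ.
    fewest = min(after_detach.values())
    candidates = [az for az, c in after_detach.items() if c == fewest]

    outcomes = []
    for az in candidates:
        final = dict(after_detach)
        final[az] += 1         # new instance launched
        final[oldest_az] += 1  # step 4: old node reattached
        outcomes.append((az, spread(final)))
    return outcomes


# 2 AZs, 3 nodes, oldest node in the overprovisioned AZ 'B'.
for new_az, imbalance in rotation_outcomes({"B": 2, "C": 1}, oldest_az="B"):
    print(f"new node in {new_az}: spread after reattach = {imbalance}")
```

With two AZs, one of the two possible launch locations ends with a spread of 2 and the other with a balanced ASG, which matches the 1 / NUM_AZs chance described in the list.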

Suggested solution

Currently the Step Function looks like this:

[image: diagram of the current Step Function]

...slightly simplified, this is:

  1. GetOldestNode
  2. AddNode - detaches the node from the ASG, triggering instantiation of a new node
  3. ReattachOldInstance - reattaches the old node to the ASG, bringing ASG up to transition size
  4. MigrateShards & ShardMigrationCheck ... takes a long time
  5. RemoveNode - kills the old node, reduces ASG to the size it was when we started

Fix: Defer selection of Old node

The fix would be to defer selecting the node to remove until after the new node has been allocated, ie:

  1. AddNode - simply add 1 to the desired size of the cluster, and wait for it to come up
  2. GetOldNode - select the oldest node but only from the most-provisioned AZs
  3. MigrateShards & ShardMigrationCheck ... takes a long time
  4. RemoveNode - kill the old node

It's only after you know where the ASG brought up the new node that you know which AZ you want to reduce in size.
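The deferred selection could look something like the sketch below. The `Node` type, its fields, the instance ids and the function name `pick_node_to_remove` are all hypothetical, chosen for illustration; they are not the project's actual data model.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime
from typing import List


@dataclass(frozen=True)
class Node:
    instance_id: str  # hypothetical ids, for illustration only
    az: str
    launch_time: datetime


def pick_node_to_remove(nodes: List[Node]) -> Node:
    """Pick the oldest node, but only from the most-provisioned AZ(s)."""
    counts = Counter(node.az for node in nodes)
    most = max(counts.values())
    # Only nodes in an AZ that currently holds the most instances are eligible.
    candidates = [node for node in nodes if counts[node.az] == most]
    return min(candidates, key=lambda node: node.launch_time)


# State after AddNode: the new node happened to land in 'B'.
nodes = [
    Node("i-c1", "C", datetime(2019, 1, 1)),    # oldest overall, but 'C' is not overfull
    Node("i-b1", "B", datetime(2019, 2, 1)),
    Node("i-b2", "B", datetime(2019, 3, 1)),
    Node("i-new", "B", datetime(2020, 1, 14)),
]
print(pick_node_to_remove(nodes).instance_id)
```

Here the globally oldest node sits in 'C', yet the sketch selects the oldest node in 'B', since removing a node from 'B' is what reduces the spread between AZs.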
