If ASG max not >= desired + 1, successive runs of the step function will keep adding instances #51

amyhughes opened this issue Mar 20, 2020
This was an issue for the ELK master nodes back in December, where it was reported that the number of master ES nodes had reached 26.

Unfortunately the recorded executions don't go back that far, so I've sketched out below how this could have happened.

### Happy execution path

Initial ASG state
Desired: 3
Max: 4
Actual: 3
Detached: 0

Initial ES cluster state
number_of_nodes: 3

| Step | Resulting ASG state | Resulting ES cluster state | Step description |
| --- | --- | --- | --- |
| getOldestNode | Desired: 3<br>Max: 4<br>Actual: 3<br>Detached: 0 | number_of_nodes: 3 | Checks there are no running executions<br>Gets instances in ASG<br>Finds oldest instance<br>Gets ES node info for oldest instance |
| clusterStatusCheck | Desired: 3<br>Max: 4<br>Actual: 3<br>Detached: 0 | number_of_nodes: 3 | Checks there are no “Unhealthy” instances in ASG<br>Gets ES cluster status |
| statusIsGreen | Desired: 3<br>Max: 4<br>Actual: 3<br>Detached: 0 | number_of_nodes: 3 | Checks cluster status is “green” |
| addNode | Desired: 3<br>Max: 4<br>Actual: 3<br>Detached: 1 | number_of_nodes: 4 | Disables rebalancing on cluster<br>Detaches instance from ASG (ShouldDecrementDesiredCapacity: false → EC2 launches new instance)<br>Sets expected_cluster_size = number_of_nodes + 1 |
| clusterSizeCheck | Desired: 3<br>Max: 4<br>Actual: 3<br>Detached: 1 | number_of_nodes: 4 | Gets cluster health<br>Checks nodes = expected_cluster_size<br>Returns oldest and newest nodes |
| reattachOldInstance | Desired: 4<br>Max: 4<br>Actual: 4<br>Detached: 0 | number_of_nodes: 4 | Re-attaches oldest instance to ASG |
| migrateShards | Desired: 4<br>Max: 4<br>Actual: 4<br>Detached: 0 | number_of_nodes: 4 | Excludes oldest node from shard allocation<br>Enables rebalancing on cluster |
| shardMigrationCheck | Desired: 4<br>Max: 4<br>Actual: 4<br>Detached: 0 | number_of_nodes: 4 | Returns when the oldest node has no documents, the cluster status is green, and no shards are relocating |
| removeNode | Desired: 3<br>Max: 4<br>Actual: 3<br>Detached: 0 | number_of_nodes: 3 | Terminates instance in ASG (ShouldDecrementDesiredCapacity: true) |
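The happy-path state transitions above can be sketched as a small model. This is purely illustrative Python, assuming instance replacement is instantaneous; `AsgModel` and its methods are hypothetical names, not the real step function code:

```python
# Minimal model of the ASG state during a node rotation.
# "actual" = instances currently in the ASG; detached instances still
# run (and stay in the ES cluster) but no longer count towards the ASG.
class AsgModel:
    def __init__(self, desired: int, max_size: int, actual: int):
        self.desired = desired
        self.max_size = max_size
        self.actual = actual
        self.detached = 0

    def detach_instance(self) -> None:
        # ShouldDecrementDesiredCapacity=False: desired is unchanged, so
        # EC2 Auto Scaling launches a replacement; "actual" stays level.
        self.detached += 1

    def attach_instance(self) -> None:
        # AttachInstances raises the desired capacity by one and fails
        # if that would exceed the group's max size.
        if self.desired + 1 > self.max_size:
            raise RuntimeError("max-size = desired: cannot re-attach")
        self.desired += 1
        self.actual += 1
        self.detached -= 1

    def terminate_instance(self) -> None:
        # ShouldDecrementDesiredCapacity=True
        self.desired -= 1
        self.actual -= 1

asg = AsgModel(desired=3, max_size=4, actual=3)
asg.detach_instance()     # addNode
asg.attach_instance()     # reattachOldInstance
asg.terminate_instance()  # removeNode
assert (asg.desired, asg.actual, asg.detached) == (3, 3, 0)
```

With max = desired + 1 the cycle completes and the ASG returns to its initial state.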

### Unhappy execution path

If the step function stops after ‘reattachOldInstance’ but before ‘removeNode’ completes, the cluster and ASG are left in a state where the next run adds a node but never terminates one: the ‘reattachOldInstance’ step fails (max size = desired capacity), so the run aborts before reaching ‘removeNode’. The issue therefore compounds, with each subsequent run adding yet another node. This is illustrated below.

Initial ASG state
Desired: 4
Max: 4
Actual: 4
Detached: 0

Initial ES cluster state
number_of_nodes: 4

| Step | Resulting ASG state | Resulting ES cluster state | Step description |
| --- | --- | --- | --- |
| getOldestNode | Desired: 4<br>Max: 4<br>Actual: 4<br>Detached: 0 | number_of_nodes: 4 | Checks there are no running executions<br>Gets instances in ASG<br>Finds oldest instance<br>Gets ES node info for oldest instance |
| clusterStatusCheck | Desired: 4<br>Max: 4<br>Actual: 4<br>Detached: 0 | number_of_nodes: 4 | Checks there are no “Unhealthy” instances in ASG<br>Gets ES cluster status |
| statusIsGreen | Desired: 4<br>Max: 4<br>Actual: 4<br>Detached: 0 | number_of_nodes: 4 | Checks cluster status is “green” |
| addNode | Desired: 4<br>Max: 4<br>Actual: 4<br>Detached: 1 | number_of_nodes: 5 | Disables rebalancing on cluster<br>Detaches instance from ASG (ShouldDecrementDesiredCapacity: false → EC2 launches new instance)<br>Sets expected_cluster_size = number_of_nodes + 1 |
| clusterSizeCheck | Desired: 4<br>Max: 4<br>Actual: 4<br>Detached: 1 | number_of_nodes: 5 | Gets cluster health<br>Checks nodes = expected_cluster_size<br>Returns oldest and newest nodes |
| reattachOldInstance | Desired: 4<br>Max: 4<br>Actual: 4<br>Detached: 1 | number_of_nodes: 5 | **FAILS** to re-attach oldest instance to ASG: max size = desired capacity |
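The failure comes down to the AttachInstances precondition: attaching instances increases the group's desired capacity by the number attached, and AWS rejects the call if that would exceed the max size. A minimal sketch (`can_reattach` is a hypothetical helper, not part of the real code):

```python
def can_reattach(desired: int, max_size: int, n_instances: int = 1) -> bool:
    # Mirrors the AttachInstances precondition: attaching n instances
    # raises desired capacity by n, which must not exceed max size.
    return desired + n_instances <= max_size

assert can_reattach(3, 4)      # happy path: 3 + 1 <= 4, re-attach succeeds
assert not can_reattach(4, 4)  # unhappy path: 4 + 1 > 4, the step fails
```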

### Fix

We suggest adding an initial step to the process which checks that the ASG max size is at least desired capacity + 1 (equivalently, max > desired). Strictly the process only needs one instance of headroom, but the check should also allow for projects where max is set higher than desired + 1. If this check fails, no other steps should be executed.
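A sketch of the suggested pre-check, assuming the ASG description follows the boto3 `describe_auto_scaling_groups` response shape (`MaxSize`, `DesiredCapacity`); `has_spare_capacity` is a hypothetical name:

```python
def has_spare_capacity(asg_description: dict) -> bool:
    """True if the ASG can absorb one re-attached instance,
    i.e. max >= desired + 1 (equivalently, max > desired)."""
    return asg_description["MaxSize"] >= asg_description["DesiredCapacity"] + 1

# Run as the first step of the state machine; abort the whole
# execution if it returns False, before any instance is detached.
assert has_spare_capacity({"MaxSize": 4, "DesiredCapacity": 3})
assert not has_spare_capacity({"MaxSize": 4, "DesiredCapacity": 4})
```

Failing fast here leaves both the ASG and the ES cluster untouched, so no orphaned detached instances accumulate.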
