This was an issue for the ELK master nodes back in December, where it was reported that the number of master ES nodes had reached 26.

Unfortunately the recorded executions don't go back this far, so I've sketched out how this could happen.

Happy execution path

Initial ASG state
Desired: 3
Max: 4
Actual: 3
Detached: 0

Initial ES cluster state
number_of_nodes: 3

| Step | ASG state | ES cluster state | Actions |
| --- | --- | --- | --- |
| (initial step) | Desired: 3, Max: 4, Actual: 3, Detached: 0 | number_of_nodes: 3 | Checks no running executions; gets instances in ASG; finds oldest instance; gets ES node info for oldest instance |
| clusterStatusCheck | Desired: 3, Max: 4, Actual: 3, Detached: 0 | number_of_nodes: 3 | Checks no "Unhealthy" instances in ASG; gets ES cluster status |
| statusIsGreen | Desired: 3, Max: 4, Actual: 3, Detached: 0 | number_of_nodes: 3 | Checks cluster status is "green" |
| addNode | Desired: 3, Max: 4, Actual: 3, Detached: 1 | number_of_nodes: 4 | Disables rebalancing on cluster; detaches instance from ASG (ShouldDecrementDesiredCapacity: false, so EC2 launches a new instance); sets expected_cluster_size = number_of_nodes + 1 |
| clusterSizeCheck | Desired: 3, Max: 4, Actual: 3, Detached: 1 | number_of_nodes: 4 | Gets cluster health; checks nodes = expected_cluster_size; returns oldest and newest nodes |
| reattachOldInstance | Desired: 4, Max: 4, Actual: 4, Detached: 0 | number_of_nodes: 4 | Re-attaches oldest instance to ASG |
| migrateShards | Desired: 4, Max: 4, Actual: 4, Detached: 0 | number_of_nodes: 4 | Excludes oldest node from shard allocation; enables rebalancing on cluster |
| shardMigrationCheck | Desired: 4, Max: 4, Actual: 4, Detached: 0 | number_of_nodes: 4 | Returns when the oldest node has no documents, the cluster status is green, and relocating shards is 0 |
| removeNode | Desired: 3, Max: 4, Actual: 3, Detached: 0 | number_of_nodes: 3 | Terminates instance in ASG (ShouldDecrementDesiredCapacity: true) |
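For reference, here is a minimal sketch of roughly what the addNode step in the table above does, assuming a Python Lambda using boto3 and the ES cluster settings API. The ASG name, ES endpoint, and function name are illustrative, not taken from the actual implementation.

```python
# Rough sketch of the addNode step, assuming a Python Lambda with boto3
# and the Elasticsearch settings API. Names and endpoints are illustrative.
import boto3
import requests

ASG_NAME = "es-master-asg"              # assumed ASG name
ES_ENDPOINT = "http://localhost:9200"   # assumed ES endpoint

def add_node(oldest_instance_id, number_of_nodes):
    # Disable shard rebalancing so shards stay put while the new node joins
    requests.put(
        f"{ES_ENDPOINT}/_cluster/settings",
        json={"transient": {"cluster.routing.rebalance.enable": "none"}},
    ).raise_for_status()

    # Detach the oldest instance without decrementing desired capacity,
    # which causes the ASG to launch a replacement instance
    boto3.client("autoscaling").detach_instances(
        InstanceIds=[oldest_instance_id],
        AutoScalingGroupName=ASG_NAME,
        ShouldDecrementDesiredCapacity=False,
    )

    # The detached instance stays in the ES cluster, so once the
    # replacement joins we expect one extra node
    return {"expected_cluster_size": number_of_nodes + 1}
```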
Unhappy execution path

If the step function stops after 'reattachOldInstance' but before 'removeNode' completes, the cluster and ASG are left in a state where the next run of the step function adds a node but never terminates one. The issue is compounded by the fact that the 'reattachOldInstance' step fails in this case, meaning the next run adds yet another node. This is illustrated below.
Initial ASG state
Desired: 4
Max: 4
Actual: 4
Detached: 0

Initial ES cluster state
number_of_nodes: 4

| Step | ASG state | ES cluster state | Actions |
| --- | --- | --- | --- |
| (initial step) | Desired: 4, Max: 4, Actual: 4, Detached: 0 | number_of_nodes: 4 | Checks no running executions; gets instances in ASG; finds oldest instance; gets ES node info for oldest instance |
| clusterStatusCheck | Desired: 4, Max: 4, Actual: 4, Detached: 0 | number_of_nodes: 4 | Checks no "Unhealthy" instances in ASG; gets ES cluster status |
| statusIsGreen | Desired: 4, Max: 4, Actual: 4, Detached: 0 | number_of_nodes: 4 | Checks cluster status is "green" |
| addNode | Desired: 4, Max: 4, Actual: 4, Detached: 1 | number_of_nodes: 5 | Disables rebalancing on cluster; detaches instance from ASG (ShouldDecrementDesiredCapacity: false, so EC2 launches a new instance); sets expected_cluster_size = number_of_nodes + 1 |
| clusterSizeCheck | Desired: 4, Max: 4, Actual: 4, Detached: 1 | number_of_nodes: 5 | Gets cluster health; checks nodes = expected_cluster_size; returns oldest and newest nodes |
| reattachOldInstance | Desired: 4, Max: 4, Actual: 4, Detached: 1 | number_of_nodes: 5 | FAILS to re-attach oldest instance to ASG: max size = desired |
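The failure in the last row corresponds to the AttachInstances call being rejected: attaching an instance raises the ASG's desired capacity by one, which is not possible when desired already equals max. A sketch of how this might surface, assuming boto3; the error code handling and function name are assumptions, not taken from the actual implementation.

```python
import boto3
from botocore.exceptions import ClientError

def reattach_old_instance(instance_id, asg_name):
    # AttachInstances increments the ASG's desired capacity by one,
    # so it is rejected when DesiredCapacity already equals MaxSize
    try:
        boto3.client("autoscaling").attach_instances(
            InstanceIds=[instance_id],
            AutoScalingGroupName=asg_name,
        )
    except ClientError as err:
        # In the unhappy path above (Desired: 4, Max: 4) the call fails,
        # the step errors out, and the old instance is left detached
        if err.response["Error"]["Code"] == "ValidationError":
            raise RuntimeError("reattachOldInstance failed: max size = desired") from err
        raise
```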
Fix
We suggest adding an initial step to the process which checks that the ASG max size is greater than the desired capacity (strictly, max only needs to be desired + 1, but the check should allow for projects where max is set to a value greater than this). If this step fails, no other steps should be executed.
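A minimal sketch of what that initial check could look like, assuming a Python Lambda with boto3; the function name and the way the failure is reported are illustrative.

```python
import boto3

def check_asg_headroom(asg_name):
    """Proposed first step: fail fast unless the ASG can absorb the
    re-attached instance later in the run (i.e. max > desired)."""
    asg = boto3.client("autoscaling")
    response = asg.describe_auto_scaling_groups(AutoScalingGroupNames=[asg_name])
    group = response["AutoScalingGroups"][0]

    if group["MaxSize"] <= group["DesiredCapacity"]:
        # Abort the whole execution; continuing from this state is what
        # produces the runaway node count described above
        raise RuntimeError(
            f"ASG {asg_name}: max size ({group['MaxSize']}) must be greater "
            f"than desired capacity ({group['DesiredCapacity']})"
        )
    return {"max_size": group["MaxSize"], "desired_capacity": group["DesiredCapacity"]}
```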