If ASG max not >= desired + 1, successive runs of the step function will keep adding instances #51

amyhughes opened this issue Mar 20, 2020
This was an issue for the ELK master nodes back in December, where it was reported that the number of master ES nodes had reached 26.

Unfortunately the recorded executions don't go back that far, so I've sketched out below how this could have happened.

### Happy execution path

Initial ASG state
Desired: 3
Max: 4
Actual: 3
Detached: 0

Initial ES cluster state
number_of_nodes: 3

| Step | Resulting ASG state | Resulting ES cluster state | Step description |
| --- | --- | --- | --- |
| getOldestNode | Desired: 3<br>Max: 4<br>Actual: 3<br>Detached: 0 | number_of_nodes: 3 | Checks there are no running executions<br>Gets instances in ASG<br>Finds oldest instance<br>Gets ES node info for oldest instance |
| clusterStatusCheck | Desired: 3<br>Max: 4<br>Actual: 3<br>Detached: 0 | number_of_nodes: 3 | Checks there are no “Unhealthy” instances in ASG<br>Gets ES cluster status |
| statusIsGreen | Desired: 3<br>Max: 4<br>Actual: 3<br>Detached: 0 | number_of_nodes: 3 | Checks cluster status is “green” |
| addNode | Desired: 3<br>Max: 4<br>Actual: 3<br>Detached: 1 | number_of_nodes: 4 | Disables rebalancing on cluster<br>Detaches instance from ASG (ShouldDecrementDesiredCapacity: false → EC2 launches new instance)<br>Sets expected_cluster_size = number_of_nodes + 1 |
| clusterSizeCheck | Desired: 3<br>Max: 4<br>Actual: 3<br>Detached: 1 | number_of_nodes: 4 | Gets cluster health<br>Checks nodes = expected_cluster_size<br>Returns oldest and newest nodes |
| reattachOldInstance | Desired: 4<br>Max: 4<br>Actual: 4<br>Detached: 0 | number_of_nodes: 4 | Re-attaches oldest instance to ASG |
| migrateShards | Desired: 4<br>Max: 4<br>Actual: 4<br>Detached: 0 | number_of_nodes: 4 | Excludes oldest node from shard allocation<br>Enables rebalancing on cluster |
| shardMigrationCheck | Desired: 4<br>Max: 4<br>Actual: 4<br>Detached: 0 | number_of_nodes: 4 | Returns when the oldest node has no documents, the cluster status is green, and no shards are relocating |
| removeNode | Desired: 3<br>Max: 4<br>Actual: 3<br>Detached: 0 | number_of_nodes: 3 | Terminates instance in ASG (ShouldDecrementDesiredCapacity: true) |
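The happy-path state transitions above can be sketched as a small model. This is purely illustrative Python, assuming instance replacement is instantaneous; `AsgModel` and its methods are hypothetical names, not the real step function code:

```python
# Minimal model of the ASG state during a node rotation.
# "actual" = instances currently in the ASG; detached instances still
# run (and stay in the ES cluster) but no longer count towards the ASG.
class AsgModel:
    def __init__(self, desired: int, max_size: int, actual: int):
        self.desired = desired
        self.max_size = max_size
        self.actual = actual
        self.detached = 0

    def detach_instance(self) -> None:
        # ShouldDecrementDesiredCapacity=False: desired is unchanged, so
        # EC2 Auto Scaling launches a replacement; "actual" stays level.
        self.detached += 1

    def attach_instance(self) -> None:
        # AttachInstances raises the desired capacity by one and fails
        # if that would exceed the group's max size.
        if self.desired + 1 > self.max_size:
            raise RuntimeError("max-size = desired: cannot re-attach")
        self.desired += 1
        self.actual += 1
        self.detached -= 1

    def terminate_instance(self) -> None:
        # ShouldDecrementDesiredCapacity=True
        self.desired -= 1
        self.actual -= 1

asg = AsgModel(desired=3, max_size=4, actual=3)
asg.detach_instance()     # addNode
asg.attach_instance()     # reattachOldInstance
asg.terminate_instance()  # removeNode
assert (asg.desired, asg.actual, asg.detached) == (3, 3, 0)
```

With max = desired + 1 the cycle completes and the ASG returns to its initial state.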

### Unhappy execution path

If the step function stops after ‘reattachOldInstance’ but before ‘removeNode’ completes, the cluster and ASG are left in a state where the next run adds a node but never terminates one: the ‘reattachOldInstance’ step fails (max size = desired capacity), so the run aborts before reaching ‘removeNode’. The issue therefore compounds, with each subsequent run adding yet another node. This is illustrated below.

Initial ASG state
Desired: 4
Max: 4
Actual: 4
Detached: 0

Initial ES cluster state
number_of_nodes: 4

| Step | Resulting ASG state | Resulting ES cluster state | Step description |
| --- | --- | --- | --- |
| getOldestNode | Desired: 4<br>Max: 4<br>Actual: 4<br>Detached: 0 | number_of_nodes: 4 | Checks there are no running executions<br>Gets instances in ASG<br>Finds oldest instance<br>Gets ES node info for oldest instance |
| clusterStatusCheck | Desired: 4<br>Max: 4<br>Actual: 4<br>Detached: 0 | number_of_nodes: 4 | Checks there are no “Unhealthy” instances in ASG<br>Gets ES cluster status |
| statusIsGreen | Desired: 4<br>Max: 4<br>Actual: 4<br>Detached: 0 | number_of_nodes: 4 | Checks cluster status is “green” |
| addNode | Desired: 4<br>Max: 4<br>Actual: 4<br>Detached: 1 | number_of_nodes: 5 | Disables rebalancing on cluster<br>Detaches instance from ASG (ShouldDecrementDesiredCapacity: false → EC2 launches new instance)<br>Sets expected_cluster_size = number_of_nodes + 1 |
| clusterSizeCheck | Desired: 4<br>Max: 4<br>Actual: 4<br>Detached: 1 | number_of_nodes: 5 | Gets cluster health<br>Checks nodes = expected_cluster_size<br>Returns oldest and newest nodes |
| reattachOldInstance | Desired: 4<br>Max: 4<br>Actual: 4<br>Detached: 1 | number_of_nodes: 5 | **FAILS** to re-attach oldest instance to ASG: max size = desired capacity |
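The failure comes down to the AttachInstances precondition: attaching instances increases the group's desired capacity by the number attached, and AWS rejects the call if that would exceed the max size. A minimal sketch (`can_reattach` is a hypothetical helper, not part of the real code):

```python
def can_reattach(desired: int, max_size: int, n_instances: int = 1) -> bool:
    # Mirrors the AttachInstances precondition: attaching n instances
    # raises desired capacity by n, which must not exceed max size.
    return desired + n_instances <= max_size

assert can_reattach(3, 4)      # happy path: 3 + 1 <= 4, re-attach succeeds
assert not can_reattach(4, 4)  # unhappy path: 4 + 1 > 4, the step fails
```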

### Fix

We suggest adding an initial step to the process which checks that the ASG max size is at least desired capacity + 1 (equivalently, max > desired). Strictly the process only needs one instance of headroom, but the check should also allow for projects where max is set higher than desired + 1. If this check fails, no other steps should be executed.
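A sketch of the suggested pre-check, assuming the ASG description follows the boto3 `describe_auto_scaling_groups` response shape (`MaxSize`, `DesiredCapacity`); `has_spare_capacity` is a hypothetical name:

```python
def has_spare_capacity(asg_description: dict) -> bool:
    """True if the ASG can absorb one re-attached instance,
    i.e. max >= desired + 1 (equivalently, max > desired)."""
    return asg_description["MaxSize"] >= asg_description["DesiredCapacity"] + 1

# Run as the first step of the state machine; abort the whole
# execution if it returns False, before any instance is detached.
assert has_spare_capacity({"MaxSize": 4, "DesiredCapacity": 3})
assert not has_spare_capacity({"MaxSize": 4, "DesiredCapacity": 4})
```

Failing fast here leaves both the ASG and the ES cluster untouched, so no orphaned detached instances accumulate.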
