Falling behind when there are too many nodes to refresh... #87

Open
rtyley opened this issue Nov 17, 2021 · 3 comments


@rtyley
Member

rtyley commented Nov 17, 2021

Our team currently has two big-ish Elasticsearch clusters - Elasticsearch 6 & Elasticsearch 7, and that ends up being a lot of ES nodes - about 50 nodes or so. Our current node rotation schedule is failing to keep up:

image

This is because:

  • Even with "indices.recovery.max_bytes_per_sec" : "256mb" set, migrating all the data off a node can take nearly an hour.
  • Ideally, we won't be auto-running node rotations during business hours (eg because it can interfere with manual developer work on the cluster)
  • When node-rotation is automatically run, it's nice if it happens just before developers start work so they can be around to look at any fallout relatively soon.
  • The above means our RotationCronExpression is currently cron(10 4-10 ? * MON-FRI *) (try hourly from 4am to 10am on weekdays), but this offers only 35 rotations per week (7 runs a day × 5 weekdays), which is not enough to keep up with 50 nodes.
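
As a back-of-envelope check of that shortfall, here is a small sketch. It is not part of the project: the function name is made up for illustration, and it assumes each scheduled run rotates at most one node and that every node should be rotated about once a week.

```python
def weekly_firings(first_hour: int, last_hour: int, days_per_week: int) -> int:
    """Count how often an hourly cron fires per week over an inclusive hour range."""
    return (last_hour - first_hour + 1) * days_per_week


# cron(10 4-10 ? * MON-FRI *): hours 4..10 inclusive, Monday to Friday
firings = weekly_firings(first_hour=4, last_hour=10, days_per_week=5)

node_count = 50  # approximate figure from the description above
shortfall = node_count - firings  # nodes left unrotated each week (assumes one rotation per run)

print(f"{firings} rotations/week, short by {shortfall} nodes")
```

Widening the hour range or adding weekend days raises the first number, but each change has to be rechecked against the slowest migration time.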

We can widen that rotation period, but precisely scaling cron schedules is quite fiddly (eg currently, we have to be very careful to make sure they are never more frequent than the slowest possible migration). It would be nice to have a better way to scale this...

Let's respect ageThresholdInDays - don't stop until all nodes are younger

Thanks to #68, we now have the ageThresholdInDays parameter (for Ophan, it is 7 days). At the moment it just means:

Don't rotate any node that is younger than ageThresholdInDays

How about if instead it meant:

The Step Function will not terminate until all nodes are younger than ageThresholdInDays

...then, rather than scheduling the Step Function to run multiple times a day, we could just cron it to run once per day.

How can the ENR Step Function achieve that?

A few options:

  • Just loop around the whole function, not terminating until the ageThresholdInDays condition has been met (ie every node is younger than the threshold).
  • Rotate more nodes at a time - all the nodes older than ageThresholdInDays!
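
To make the first option concrete, here is a minimal sketch of the looping behaviour in plain Python rather than as Step Function states. Everything in it is hypothetical: `Node`, `oldest_stale_node`, `rotate_until_fresh` and the `rotate` callback are illustrative names, not the project's actual code, and it assumes `rotate` returns a freshly launched replacement node.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Callable, List, Optional


@dataclass
class Node:
    node_id: str
    launched_at: datetime


def oldest_stale_node(nodes: List[Node], age_threshold_days: int, now: datetime) -> Optional[Node]:
    """Return the oldest node that has reached the age threshold, or None if all are younger."""
    cutoff = now - timedelta(days=age_threshold_days)
    stale = [n for n in nodes if n.launched_at <= cutoff]
    return min(stale, key=lambda n: n.launched_at) if stale else None


def rotate_until_fresh(
    nodes: List[Node],
    age_threshold_days: int,
    now: datetime,
    rotate: Callable[[Node], Node],  # hypothetical: replaces one node, returns the new one
) -> int:
    """Rotate one node at a time until every node is younger than the threshold."""
    rotations = 0
    # Bound the loop by the node count so a misbehaving `rotate` cannot spin forever.
    for _ in range(len(nodes)):
        target = oldest_stale_node(nodes, age_threshold_days, now)
        if target is None:
            break  # terminating condition: all nodes are younger than the threshold
        replacement = rotate(target)
        nodes = [n for n in nodes if n.node_id != target.node_id] + [replacement]
        rotations += 1
    return rotations
```

In a real Step Function, the loop would more likely be a Choice state that jumps back to the start while a stale node remains, with the check recomputed against the live cluster on each pass.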
@rtyley rtyley changed the title from "Coping with falling behind when there are too many nodes to refresh" to "Falling behind when there are too many nodes to refresh..." Nov 17, 2021
@twrichards
Contributor

How about we add another step to the end of the process, which checks whether there are further nodes to rotate (that meet the criteria) and, if so, kicks off another execution of the step function itself? That way each rotation has its own step function invocation, which makes debugging easier.

@twrichards
Contributor

This proposed behaviour could be configurable via the input event, perhaps with an optional maxRotations (defaulting to one if not present). The scheduled input could then set it to, say, 10 in your case (given 5 weekdays, that allows up to 50 rotations per week).
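
A rough sketch of how that final "should we go again?" step might decide, assuming the chained-invocation idea above. `maxRotations` is the proposed (not yet existing) input field; `rotationsCompleted` and the function name are purely hypothetical bookkeeping invented for this example.

```python
from typing import Any, Dict, Optional


def next_invocation_input(event: Dict[str, Any], more_nodes_to_rotate: bool) -> Optional[Dict[str, Any]]:
    """Return the input for the next chained execution, or None to stop.

    Called at the end of an execution that has just completed one rotation.
    """
    max_rotations = int(event.get("maxRotations", 1))  # proposed field, defaults to one
    completed = int(event.get("rotationsCompleted", 0)) + 1  # hypothetical counter

    if not more_nodes_to_rotate or completed >= max_rotations:
        return None  # nothing left to do, or this run's budget is used up

    # Carry the original event forward so each execution sees the same settings.
    return {**event, "maxRotations": max_rotations, "rotationsCompleted": completed}
```

With no `maxRotations` in the event this always returns None, so existing schedules would keep their current one-rotation-per-run behaviour.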

@twrichards
Contributor

Also, this seems to be a dupe of #34.
