Falling behind when there are too many nodes to refresh... #87

rtyley · 2021-11-17T12:40:34Z

Our team currently has two big-ish Elasticsearch clusters (Elasticsearch 6 & Elasticsearch 7), and that ends up being a lot of ES nodes: about 50 or so. Our current node rotation schedule is failing to keep up.

This is because:

- Even with "indices.recovery.max_bytes_per_sec" : "256mb" set, migrating all the data off a node can take nearly an hour.
- Ideally, we wouldn't auto-run node rotations during business hours (e.g. because it can interfere with manual developer work on the cluster).
- When node-rotation is automatically run, it's nice if it happens just before developers start work, so they can be around to look at any fallout relatively soon.

The above means our RotationCronExpression is currently cron(10 4-10 ? * MON-FRI *) (try hourly from 4am to 10am on weekdays), but this offers only 35 rotations per week, which is not enough to keep up with 50 nodes.
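
To make the shortfall concrete, here is a rough back-of-envelope check in Python. It assumes each scheduled run rotates exactly one node and that every node should be replaced once per 7-day window (the `ageThresholdInDays` mentioned later in this issue); the function names are made up for illustration.

```python
import math


def weekly_slots(hours_per_day: int, days_per_week: int) -> int:
    """Cron firings per week, assuming one firing per listed hour."""
    return hours_per_day * days_per_week


def rotations_needed_per_week(node_count: int, age_threshold_days: int) -> int:
    """Rotations per week needed so that no node outlives the threshold."""
    return math.ceil(node_count * 7 / age_threshold_days)


# cron(10 4-10 ? * MON-FRI *): hours 4..10 inclusive is 7 firings a day, 5 days a week
slots = weekly_slots(hours_per_day=len(range(4, 11)), days_per_week=5)
needed = rotations_needed_per_week(node_count=50, age_threshold_days=7)
shortfall = needed - slots

print(f"slots={slots} needed={needed} shortfall={shortfall}")
```

Under those assumptions the schedule falls 15 rotations short every week, so the backlog of old nodes only grows.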

We can widen that rotation period, but precisely scaling cron schedules is quite fiddly (eg currently, we have to be very careful to make sure they are never more frequent than the slowest possible migration). It would be nice to have a better way to scale this...

Let's respect `ageThresholdInDays` - don't stop until all nodes are younger

Thanks to #68, we now have the ageThresholdInDays parameter (for Ophan, it is 7 days). At the moment it just means:

Don't rotate any node that is younger than ageThresholdInDays

How about if instead it meant:

The Step Function will not terminate until all nodes are younger than ageThresholdInDays

...then, rather than scheduling the Step Function to run multiple times a day, we could just cron it to run once per day.

How can the ENR Step Function achieve that?

A few options:

- Just loop around the whole function, not terminating until the ageThresholdInDays condition has been met.
- Rotate more nodes at a time: all the nodes older than ageThresholdInDays!
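
As a rough illustration of the first option, the loop might look something like the sketch below. This is not the ENR Step Function itself: `Node`, `oldest_stale_node`, `rotate_until_fresh` and the `rotate` callback are all hypothetical names, and the real rotation (migrating shards, terminating the instance) is replaced by an in-memory stand-in.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Callable, List, Optional


@dataclass(frozen=True)
class Node:
    node_id: str           # hypothetical identifier
    launched_at: datetime  # when the instance was started


def oldest_stale_node(nodes: List[Node], age_threshold_days: int,
                      now: datetime) -> Optional[Node]:
    """Return the oldest node whose age is at least the threshold, or None."""
    cutoff = now - timedelta(days=age_threshold_days)
    stale = [n for n in nodes if n.launched_at <= cutoff]
    return min(stale, key=lambda n: n.launched_at) if stale else None


def rotate_until_fresh(nodes: List[Node], age_threshold_days: int,
                       rotate: Callable[[List[Node], Node], List[Node]],
                       now_fn: Callable[[], datetime]) -> int:
    """Rotate one node at a time until none is older than the threshold.

    Assumes `rotate` always returns a cluster in which the chosen node has
    been replaced by a fresh one; otherwise this loop would never finish.
    """
    rotations = 0
    while True:
        node = oldest_stale_node(nodes, age_threshold_days, now_fn())
        if node is None:
            return rotations
        nodes = rotate(nodes, node)
        rotations += 1


# In-memory stand-in for a real rotation, purely for demonstration.
now = datetime(2021, 11, 17, tzinfo=timezone.utc)
cluster = [
    Node("a", now - timedelta(days=10)),
    Node("b", now - timedelta(days=8)),
    Node("c", now - timedelta(days=1)),
]


def fake_rotate(nodes: List[Node], old: Node) -> List[Node]:
    return [n for n in nodes if n != old] + [Node(old.node_id + "-new", now)]


rotated = rotate_until_fresh(cluster, 7, fake_rotate, lambda: now)
print(rotated)  # 2: nodes "a" and "b" are replaced, "c" is left alone
```

The second option would amount to acting on the whole `stale` list at once rather than only its oldest member.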


twrichards · 2022-01-17T22:05:12Z

How about we add another step to the end of the process, which checks whether there are further nodes to rotate (that meet the criteria) and, if so, kicks off another instance of itself (the step function)? That way each rotation has its own step function invocation, making debugging easier.

twrichards · 2022-01-17T22:10:28Z

This proposed behaviour could be configurable via the input event, perhaps with an optional maxRotations (defaults to one, if not present). The schedule's input could then set it to, say, 10 in your case (given 5 weekdays).
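
A minimal sketch of how that final step might decide whether to chain another execution, assuming `maxRotations` counts the current rotation as well. The function name and the `eligible_nodes_remaining` argument are hypothetical, and actually starting the next execution is left out.

```python
from typing import Optional


def next_execution_input(event: dict, eligible_nodes_remaining: int) -> Optional[dict]:
    """Return the input event for a follow-up execution, or None to stop.

    `maxRotations` is treated as the budget including the rotation that has
    just finished, and defaults to 1 so that an event without it behaves as a
    single rotation.
    """
    remaining = event.get("maxRotations", 1) - 1
    if remaining <= 0 or eligible_nodes_remaining == 0:
        return None
    # Pass everything else through unchanged, with the budget decremented.
    return {**event, "maxRotations": remaining}


print(next_execution_input({}, eligible_nodes_remaining=5))
print(next_execution_input({"maxRotations": 10}, eligible_nodes_remaining=5))
print(next_execution_input({"maxRotations": 10}, eligible_nodes_remaining=0))
```

Because the budget is carried in the event and decremented on each hop, the chain stops on its own even if old nodes keep appearing.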

twrichards · 2022-01-17T22:13:57Z

also, this seems to be a dupe of #34

rtyley changed the title ~~Coping with falling behind when there are too many nodes to refresh~~ Falling behind when there are too many nodes to refresh... Nov 17, 2021
