Our team currently has two big-ish Elasticsearch clusters - Elasticsearch 6 & Elasticsearch 7, and that ends up being a lot of ES nodes - about 50 nodes or so. Our current node rotation schedule is failing to keep up:
This is because:
Even with "indices.recovery.max_bytes_per_sec" : "256mb" set, migrating all the data off a node can take nearly an hour.
Ideally, we wouldn't auto-run node rotations during business hours (e.g. because they can interfere with manual developer work on the cluster).
When node-rotation is automatically run, it's nice if it happens just before developers start work so they can be around to look at any fallout relatively soon.
The above means our RotationCronExpression is currently cron(10 4-10 ? * MON-FRI *) (try hourly between 4am and 10am on weekdays), but this offers only 35 rotations per week, which is not enough to keep up with 50 nodes.
We can widen that rotation period, but precisely scaling cron schedules is quite fiddly (eg currently, we have to be very careful to make sure they are never more frequent than the slowest possible migration). It would be nice to have a better way to scale this...
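To make the shortfall concrete, here is a rough sketch of the arithmetic behind that schedule. It assumes one node is rotated per scheduled run; the function and variable names are purely illustrative and not part of the project.

```python
def weekly_rotation_slots(first_hour: int, last_hour: int, days_per_week: int) -> int:
    """Count the runs per week for an hourly window on the given number of days."""
    # The cron hour range 4-10 is inclusive, so it covers 7 distinct hours.
    hours_per_day = last_hour - first_hour + 1
    return hours_per_day * days_per_week


# cron(10 4-10 ? * MON-FRI *): hours 4 to 10, five weekdays.
slots = weekly_rotation_slots(first_hour=4, last_hour=10, days_per_week=5)

node_count = 50  # approximate figure from the description above
shortfall = node_count - slots  # nodes that cannot be rotated within the week

print(f"{slots} slots per week, {shortfall} nodes short")
```

Widening the hour range or adding days raises the slot count, but each extra slot still has to be at least as long as the slowest migration.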
Let's respect ageThresholdInDays - don't stop until all nodes are younger
Thanks to #68, we now have the ageThresholdInDays parameter (for Ophan, it is 7 days). At the moment it just means:
Don't rotate any node that is younger than ageThresholdInDays
How about if instead it meant:
The Step Function will not terminate until all nodes are younger than ageThresholdInDays
...then, rather than scheduling the Step Function to run multiple times a day, we could just cron it to run once per day.
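The difference between the two meanings can be sketched as two small predicates. This is only an illustration: the `Node` type, its fields, and both function names are hypothetical, and treating a node whose age equals the threshold as "too old" is an assumption.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import List


@dataclass
class Node:
    instance_id: str        # hypothetical field names, for illustration only
    launch_time: datetime


def is_too_old(node: Node, age_threshold_in_days: int, now: datetime) -> bool:
    """A node is too old once its age reaches the threshold (assumed inclusive)."""
    return now - node.launch_time >= timedelta(days=age_threshold_in_days)


def eligible_for_rotation(nodes: List[Node], age_threshold_in_days: int, now: datetime) -> List[Node]:
    """Current meaning: only nodes at or over the threshold may be rotated."""
    return [n for n in nodes if is_too_old(n, age_threshold_in_days, now)]


def should_keep_rotating(nodes: List[Node], age_threshold_in_days: int, now: datetime) -> bool:
    """Proposed meaning: keep going while any node is still too old."""
    return any(is_too_old(n, age_threshold_in_days, now) for n in nodes)


now = datetime(2021, 11, 17, 4, 10, tzinfo=timezone.utc)
nodes = [
    Node("node-a", now - timedelta(days=9)),
    Node("node-b", now - timedelta(days=2)),
]
stale = eligible_for_rotation(nodes, 7, now)
keep_going = should_keep_rotating(nodes, 7, now)
```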
How can the ENR Step Function achieve that?
A few options:
Just loop around the whole function, not terminating until the ageThresholdInDays condition has been met.
Rotate more nodes at a time - all the nodes older than ageThresholdInDays!
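Both options can be sketched in a few lines. The example below is a simulation under stated assumptions, not the Step Function itself: node ages are plain numbers in days, `rotate` is a hypothetical callback standing in for a full rotation, and the iteration cap is an added safety guard that the proposal does not mention.

```python
from typing import Callable, Dict, List


def nodes_over_threshold(ages: Dict[str, float], threshold_days: float) -> List[str]:
    """Option 2's batch: every node at or over the threshold, oldest first."""
    old = [node_id for node_id, age in ages.items() if age >= threshold_days]
    return sorted(old, key=lambda node_id: ages[node_id], reverse=True)


def rotate_until_all_young(
    ages: Dict[str, float],
    threshold_days: float,
    rotate: Callable[[str], None],
    max_iterations: int = 100,  # guard against looping forever if rotations fail
) -> List[str]:
    """Option 1: rotate one node at a time until none is over the threshold."""
    rotated: List[str] = []
    for _ in range(max_iterations):
        old = nodes_over_threshold(ages, threshold_days)
        if not old:
            break
        oldest = old[0]
        rotate(oldest)
        rotated.append(oldest)
    return rotated


# Simulated cluster: node id -> age in days.
ages = {"a": 9.0, "b": 3.0, "c": 8.0}


def fake_rotate(node_id: str) -> None:
    """Stand-in for a real rotation: replace the node with a brand-new one."""
    del ages[node_id]
    ages[node_id + "-new"] = 0.0


batch = nodes_over_threshold(ages, 7)            # what option 2 would rotate at once
rotated = rotate_until_all_young(ages, 7, fake_rotate)
remaining = nodes_over_threshold(ages, 7)
```

Option 1 keeps each rotation sequential, so a single run takes as long as all the migrations combined; option 2 shortens the run but moves more data off the cluster at the same time.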
rtyley changed the title from "Coping with falling behind when there are too many nodes to refresh" to "Falling behind when there are too many nodes to refresh..." on Nov 17, 2021
How about we add another step to the end of the process, which checks whether there are further nodes to rotate (that meet the criteria) and, if so, kicks off another instance of itself (the step function)? That way each rotation has its own step function invocation, which makes debugging easier.
This proposed behaviour could be configurable via the input event, perhaps with an optional maxRotations (defaulting to one if not present). The schedule could then set it to, say, 10 in your case (given 5 weekdays).
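A minimal sketch of what that final step might decide, assuming the input event carries the suggested optional maxRotations. The `rotationsCompleted` counter and the function name are hypothetical additions for illustration; actually starting the next execution (for example through the Step Functions StartExecution API) is left out.

```python
from typing import Optional


def next_execution_input(event: dict, nodes_still_eligible: int) -> Optional[dict]:
    """Return the input for a follow-up execution, or None if the chain should stop."""
    max_rotations = event.get("maxRotations", 1)  # defaults to one, as suggested
    # Hypothetical counter: includes the rotation this execution just finished.
    completed = event.get("rotationsCompleted", 0) + 1

    if nodes_still_eligible > 0 and completed < max_rotations:
        # Pass the counter along so the chain stops at maxRotations.
        return {**event, "rotationsCompleted": completed}
    return None


# Default behaviour: a single rotation, no follow-up execution.
default_next = next_execution_input({}, nodes_still_eligible=5)

# With maxRotations set to 10, the first execution hands over to a second one.
chained_next = next_execution_input({"maxRotations": 10}, nodes_still_eligible=5)
```

Carrying the counter in the event keeps each execution stateless, which fits the goal of one rotation per invocation for easier debugging.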