Reduce cluster rebalance time #55

davidfurey · 2020-11-17T17:16:12Z

We have noticed that the Ophan cluster spends many hours rebalancing after a node rotation.

When a node is vacated using node exclusion, Elasticsearch moves its shards away, but that doesn't mean they all go to the new node. The old node quickly passes its shards on to other nodes and is then terminated, but a lot of rebalancing follows while Elasticsearch works to satisfy its balancing heuristics. This happens at a slow pace to minimize the impact on cluster performance.
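
For reference, a rough sketch of the kind of settings body that node exclusion involves, sent to `PUT /_cluster/settings`. The helper name is hypothetical, and whether the node rotator excludes by `_name` (rather than `_ip` or `_id`) or uses `transient` rather than `persistent` settings is an assumption here, not something confirmed in this issue.

```python
from typing import Dict


def build_exclusion_settings(node_name: str) -> Dict:
    """Build a body for PUT /_cluster/settings that excludes one node.

    Assumption: exclusion by node name via a transient setting. The
    real node rotator may use a different attribute or setting scope.
    """
    return {
        "transient": {
            # Elasticsearch moves shards off any node matching this value,
            # but it chooses the destination nodes itself.
            "cluster.routing.allocation.exclude._name": node_name
        }
    }
```

Note that this setting only says where shards must not live; it gives no control over where they end up, which is why the follow-up rebalancing happens.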

Elastic have suggested that another option would be to move the shards from the old node to the new node using Cluster Reroute. As all shards are moved directly to the new node, this should cause minimal rebalancing, if any.
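
A minimal sketch of what that might look like, assuming the shard list comes from `GET /_cat/shards?format=json` (rows with `index`, `shard`, `state` and `node` fields) and the result is sent to `POST /_cluster/reroute`. The function name and the choice to skip shards that are not `STARTED` are illustrative assumptions, not how the node rotator currently works.

```python
from typing import Dict, List


def build_reroute_body(shards: List[Dict], old_node: str, new_node: str) -> Dict:
    """Build a body for POST /_cluster/reroute that moves every started
    shard on old_node to new_node using explicit `move` commands."""
    commands = [
        {
            "move": {
                "index": s["index"],
                # _cat JSON output returns the shard number as a string
                "shard": int(s["shard"]),
                "from_node": old_node,
                "to_node": new_node,
            }
        }
        for s in shards
        # Assumption: only move shards that are fully started on the old node
        if s.get("node") == old_node and s.get("state") == "STARTED"
    ]
    return {"commands": commands}
```

Elasticsearch can still reject a move command (for example if the target node fails an allocation decider), so the response would need checking before the old node is terminated.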

davidfurey · 2020-11-17T17:16:59Z

@jacobwinch interested in what your thoughts are on this potential change to the node rotator.

jacobwinch · 2020-11-18T09:14:12Z

@davidfurey - I agree that the extended period of rebalancing is undesirable; it was always our intention to migrate all data from the old node onto the newest node. Perhaps this happened to work OK on the smaller clusters that we were testing with, or perhaps we just never fulfilled this requirement correctly... either way I definitely think the suggestion from you/Elastic would be an improvement on the current behaviour.

@tomrf1 might have thoughts on this too? (Perhaps your memory is better than mine!)

tomrf1 · 2020-11-18T09:24:06Z

My memory of this is not great...
But I do remember us spending a lot of time making sure the shards go from the old node to the new node without any further re-allocation. Perhaps something has changed since?

Cluster Reroute is new to me, but that sounds like exactly what we want!

davidfurey assigned jacobwinch Nov 17, 2020

rtyley mentioned this issue Jan 5, 2021

Only Data nodes should require a 'shard-migration'/'no-documents' check for shutdown #60

Closed
