Discourage rebalance, warn against stopping it #1298
Conversation
A few suggestions and comments for consideration.
I'll take another look after others have had their say.
source/operations/concepts.rst
@@ -149,12 +149,14 @@ For more information on write preference calculation logic, see :ref:`Writing Files <minio-writing-files>`.
Rebalancing data across all pools after an expansion is an expensive operation that requires scanning the entire deployment and moving objects between pools.
This may take a long time to complete depending on the amount of data to move.

Starting with MinIO Client version RELEASE.2022-11-07T23-47-39Z, you can manually initiate a rebalancing operation across all server pools using :mc:`mc admin rebalance`.
MinIO does not recommend manual rebalancing.
Can we remove the recommendation text?
@kannappanr you mean we should say it's ok to manually rebalance?
Should we instead use a cautionary statement that this should only be used in consultation with MinIO Engineering?
The ask we had for this PR was specifically to discourage use of this feature.
Correct, we have made a release already, and the fixes are also in the EOS binaries, so to some extent we have already addressed this.
We should perhaps take a broader tone: rebalance is not a real requirement if you size your pools properly.
Basically discourage budget setups
Don't expand in this manner:
First pool:
- 100 nodes, now 90% used
You botched buying hardware, so now you just expand by 20 nodes:
- second pool of 20 nodes
These 20 nodes will take the entire I/O hit, causing significant slowness; the sizing must be appropriate to the load the 100-node pool was handling. If the 20 nodes can handle it and it's new hardware, no problem, but if not, it is going to cause an outage etc.
We need cautionary guidance on why rebalance doesn't solve the problem of high utilization on the second pool. It may look like it does, but it won't solve the problem.
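A rough back-of-the-envelope illustration of that point (the 100 TiB per-node figure is made up, and it assumes new writes are weighted by each pool's free space, along the lines of the Writing Files section):

```shell
# Illustrative numbers only: 100 TiB usable per node.
pool1_free=$((100 * 100 / 10))   # 100 nodes, 90% used -> ~1000 TiB free
pool2_free=$((20 * 100))         # 20 empty new nodes  -> ~2000 TiB free

# The small new pool absorbs most incoming writes while holding few of the nodes:
echo "pool2 write share: $((100 * pool2_free / (pool1_free + pool2_free)))%"   # ~66%
echo "pool2 node share:  $((100 * 20 / (100 + 20)))%"                          # ~16%
```

Per node, that is roughly ten times the write load on the new pool until free space evens out, which is the slowness and outage risk described above.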
We mention the "2 years" guidance in the hardware checklist, although there isn't a specific and obvious section along the lines of "How big should I make my pool?" (AJ's blog from January recommends a minimum of 3 years of capacity.)
I think we can reinforce this in the Storage section of the hardware checklist, and maybe mention it elsewhere too, like the concepts page, to make the point that tacking on a bit of new capacity here and there doesn't go well and is not a reliable plan.
For deployments with multiple server pools, each individual pool may have its own hardware configuration.
However, significant capacity differences between pools may temporarily result in high loads on a new pool's nodes during :ref:`expansion <expand-minio-distributed>`.
For more information, see :ref:`How do I manage object distribution across a MinIO deployment? <minio-rebalance>`
Does it make sense to say this in the hardware checklist?
As the new pool fills, write operations eventually balance out across all pools in the deployment.
Until then, the new pool's nodes may experience higher loads and slower writes.

To reduce this temporary performance impact, MinIO recommends expanding a deployment well before its existing pools are near capacity and with new pools of a similar size.
For more information on write preference calculation logic, see :ref:`Writing Files <minio-writing-files>`.
Accurate? Sufficient?
Other mentions of pool sizing link to this section
Since a pool with more free space has a higher probability of being written to, the nodes of that pool may experience higher loads as free space equalizes.

If required, you can manually initiate a rebalance procedure with :mc:`mc admin rebalance`.
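For reference, a minimal sketch of what that manual procedure looks like with the mc CLI (`myminio` is a placeholder alias; this is illustrative, not proposed doc text):

```shell
# Kick off a rebalance across all server pools behind the `myminio` alias,
# then poll its progress until it completes:
mc admin rebalance start myminio
mc admin rebalance status myminio

# Interrupting it is what the version warning below is about:
mc admin rebalance stop myminio
```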
Explain what happens if pools have very different available free space. Is this text an accurate characterization?
.. admonition:: Stopping a rebalance job on previous versions of MinIO may cause data loss
   :class: warning

   A bug in MinIO prior to :minio-release:`RELEASE.2024-08-17T01-24-54Z` can overwrite objects while stopping an in-progress rebalance operation.
   Interrupting rebalance on these older versions may result in data loss.
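If we keep the warning, we could also show readers a quick way to confirm the running server version before issuing a stop. This is only a sketch; the exact field name in the `mc admin info --json` output is a guess and should be verified:

```shell
# Confirm every server runs RELEASE.2024-08-17T01-24-54Z or later
# (the "version" field name is assumed; adjust to the real JSON output):
mc admin info myminio --json | grep -o '"version": *"[^"]*"' | sort -u

# Only stop an in-progress rebalance once the versions check out:
mc admin rebalance stop myminio
```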
Is there a usual way we reference things like this? What should we say about the now fixed bug?
@kannappanr @harshavardhana I made several edits with proposed text that is less scary about rebalance. Appreciate another look. I left the warning about stopping, but only for older versions. What should we say about that?
Starting point for temporary guidance while we fix issues with stopping a rebalance operation.
Staged:
- Warning on stop for older versions:
- Discourage manual rebalance:
- Mention unbalanced pool capacity in hardware checklist: http://192.241.195.202:9000/staging/rebalance-stop-guidance/linux/operations/checklists/hardware.html#use-consistent-drive-type-and-capacity
cc @kannappanr @harshavardhana @krisis