Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update bootimage management enhancement #1698

Draft
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

yuqi-zhang
Copy link
Contributor

Add some timeline and enforcement options.

cc @dustymabe @jlebon

Add some timeline and enforcement options.
Copy link
Contributor

openshift-ci bot commented Oct 10, 2024

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Oct 10, 2024
Copy link
Contributor

openshift-ci bot commented Oct 10, 2024

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from yuqi-zhang. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment


#### Enforcement options

Some combination of the following mechanisms should be implemented to alert users, particularly non-machineset backed scaled environments. The options generally fall under proactive enforcement (require users to updated and acknowledge upon upgrading to a new version) vs. reactive enforcement (only fail when a non-compliant bootimage is being used to scale into the cluster).
Copy link
Member

@wking wking Oct 30, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"require users to updated and acknowledge upon upgrading to a new version" is still reactive, isn't it? The admin ack approach is proactive, so the admin is aware (and ideally can set things up ahead of time), before updating from 4.y to a 4.(y+1) that would require a newer boot image. Can we change "updated" to "update" and "upon" to "before" here?

4. Add a service to be shipped via RHCOS/MCO templates, which will do a check on incoming OS container image vs currently booted RHCOS version. This runs on firstboot right after the MCD pulls the new image, and will prevent the node to rebase to the updated image if the drift is too far.


RHEL major versions will no longer be cross-compatible. i.e. if you wish to have a RHEL10 machineconfigpool, you must use a RHEL10 bootimage.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This creates a skew issue, right? E.g.

  1. Cluster is happy on an OCP that likes RHEL 9.
  2. ClusterVersion update requested for a new release that likes RHEL 10.
  3. MCO sets the MachineConfigPools up for a transition from the RHEL 9 target to the RHEL 10 target.
  4. Up through this point, MCO will target RHEL 9, and if you have a RHEL 10 boot image, the MCP will fail to scale.
  5. The first node being updated in the MCP successfully goes Ready=True (with disk space, and the other things that MCPs watch to decide the node is happy) on RHEL 10.
  6. From this point on, MCO will target RHEL 10 for new nodes scaling into this MCP, and if you have a RHEL 9 boot image, the MCP will fail to scale.

Trying to time the bootimage bump to exactly match the "MCP associated with these MachineSets has decided new nodes will be RHEL 10" seems tricky. But to avoid that timing issue, you'd either need RHEL 10 targets to be more flexible about boot image matching (e.g. both RHEL 9 and RHEL 10 boot images would work with RHEL 10 MCPs), or some way to select from RHEL 9 or RHEL 10 boot images at Machine-creation time depending on what the target MCP was expecting (and even then, there would still be a race if you selected a RHEL 9 boot image but the MCP got a happy RHEL 10 node before the new RHEL 9 Machine made it's MCS Ignition request).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants