Update bootimage management enhancement #1698

yuqi-zhang · 2024-10-10T17:53:46Z

Add some timeline and enforcement options.

openshift-ci · 2024-10-10T17:53:50Z

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

openshift-ci · 2024-10-10T17:54:00Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from yuqi-zhang. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

wking · 2024-10-30T21:53:21Z

enhancements/machine-config/manage-boot-images.md

+
+#### Enforcement options
+
+Some combination of the following mechanisms should be implemented to alert users, particularly non-machineset backed scaled environments. The options generally fall under proactive enforcement (require users to updated and acknowledge upon upgrading to a new version) vs. reactive enforcement (only fail when a non-compliant bootimage is being used to scale into the cluster).


"require users to updated and acknowledge upon upgrading to a new version" is still reactive, isn't it? The admin ack approach is proactive, so the admin is aware (and ideally can set things up ahead of time), before updating from 4.y to a 4.(y+1) that would require a newer boot image. Can we change "updated" to "update" and "upon" to "before" here?
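
To make the proactive/reactive distinction concrete, here is a minimal Python sketch of the two checks. Every name in it (`Cluster`, `can_start_update`, `can_scale_in`, the ack key, the minimum-major threshold) is hypothetical and only illustrates when each check fires; none of it is an MCO or CVO API.

```python
from dataclasses import dataclass, field


@dataclass
class Cluster:
    """Hypothetical model of the state relevant to boot image enforcement."""
    boot_image_major: int                    # RHEL major of the boot image used for new machines
    acks: set = field(default_factory=set)   # acknowledgements the admin has recorded


def can_start_update(cluster: Cluster, required_ack: str) -> bool:
    """Proactive: refuse to start the 4.y -> 4.(y+1) update until the admin has acked."""
    return required_ack in cluster.acks


def can_scale_in(cluster: Cluster, min_boot_image_major: int) -> bool:
    """Reactive: nothing is checked until a machine actually tries to join."""
    return cluster.boot_image_major >= min_boot_image_major


ACK = "ack-boot-image-updated"  # hypothetical key, not a real gate name
cluster = Cluster(boot_image_major=9)
blocked_before_ack = not can_start_update(cluster, ACK)  # update is held back
cluster.acks.add(ACK)
allowed_after_ack = can_start_update(cluster, ACK)       # admin has acknowledged
scale_fails_later = not can_scale_in(cluster, 10)        # stale boot image only fails at scale time
```

In this sketch the ack is checked before the update begins, which is what makes it proactive; the scale-in check only surfaces the problem once a new machine is requested.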

wking · 2024-10-30T22:03:32Z

enhancements/machine-config/manage-boot-images.md

+4. Add a service to be shipped via RHCOS/MCO templates, which will do a check on incoming OS container image vs currently booted RHCOS version. This runs on firstboot right after the MCD pulls the new image, and will prevent the node to rebase to the updated image if the drift is too far.
+
+
+RHEL major versions will no longer be cross-compatible. i.e. if you wish to have a RHEL10 machineconfigpool, you must use a RHEL10 bootimage.


This creates a skew issue, right? E.g.

1. Cluster is happy on an OCP that likes RHEL 9.
2. ClusterVersion update requested for a new release that likes RHEL 10.
3. MCO sets the MachineConfigPools up for a transition from the RHEL 9 target to the RHEL 10 target.
   Up through this point, MCO will target RHEL 9, and if you have a RHEL 10 boot image, the MCP will fail to scale.
4. The first node being updated in the MCP successfully goes Ready=True (with disk space, and the other things that MCPs watch to decide the node is happy) on RHEL 10.
   From this point on, MCO will target RHEL 10 for new nodes scaling into this MCP, and if you have a RHEL 9 boot image, the MCP will fail to scale.

Trying to time the bootimage bump to exactly match the moment when the "MCP associated with these MachineSets has decided new nodes will be RHEL 10" seems tricky. But to avoid that timing issue, you'd either need RHEL 10 targets to be more flexible about boot image matching (e.g. both RHEL 9 and RHEL 10 boot images would work with RHEL 10 MCPs), or some way to select from RHEL 9 or RHEL 10 boot images at Machine-creation time depending on what the target MCP was expecting (and even then, there would still be a race if you selected a RHEL 9 boot image but the MCP got a happy RHEL 10 node before the new RHEL 9 Machine made its MCS Ignition request).
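
The skew can be sketched in a few lines of Python. This is a toy model under stated assumptions, not MCO behavior: `Pool`, `first_node_ready`, `strict_match`, and `flexible_match` are hypothetical names, and the "flip" of the target is modeled as a single instant when the first updated node goes Ready.

```python
from dataclasses import dataclass


@dataclass
class Pool:
    """Hypothetical MachineConfigPool model for the RHEL 9 -> RHEL 10 transition."""
    target_major: int    # RHEL major that new nodes are currently expected to land on
    pending_major: int   # RHEL major the pool is transitioning to

    def first_node_ready(self) -> None:
        """First updated node went Ready=True, so new nodes now target the new major."""
        self.target_major = self.pending_major


def strict_match(boot_major: int, pool: Pool) -> bool:
    """Rule as proposed: boot image major must equal the pool's current target major."""
    return boot_major == pool.target_major


def flexible_match(boot_major: int, pool: Pool) -> bool:
    """Alternative: the target also accepts a boot image one major older."""
    return boot_major in (pool.target_major, pool.target_major - 1)


pool = Pool(target_major=9, pending_major=10)
before = {m: strict_match(m, pool) for m in (9, 10)}          # only RHEL 9 boot images scale
pool.first_node_ready()
after = {m: strict_match(m, pool) for m in (9, 10)}           # only RHEL 10 boot images scale
after_flexible = {m: flexible_match(m, pool) for m in (9, 10)}  # both scale
```

Under the strict rule, the set of working boot images changes at the flip, so the boot image bump has to land at exactly that moment. Under the flexible rule, a RHEL 9 boot image works both before and after the flip, so the bump can happen any time after the pool has transitioned.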

Update bootimage management enhancement

6aa7029

Add some timeline and enforcement options.

openshift-ci bot added the do-not-merge/work-in-progress label Oct 10, 2024

wking reviewed Oct 30, 2024
