The health-check provided has a few blind spots when updating resources that were Ready in the past but have since moved to a Degraded state.
Here's a sample set of resource conditions that incorrectly returned Healthy. This happens because the provided logic short-circuits and returns Healthy as soon as it finds type: Ready with status: True.
Example 1 - AWS Nodegroup
conditions:
  - lastTransitionTime: '2024-10-10T19:44:11Z'
    reason: Available
    status: 'True'
    type: Ready
  - lastTransitionTime: '2024-11-15T14:17:38Z'
    message: >-
      update failed: async update failed: refuse to update the external
      resource because the following update requires replacing it: cannot
      change the value of the argument "capacity_type" from "SPOT" to
      "ON_DEMAND"
    reason: ReconcileError
    status: 'False'
    type: Synced
  - lastTransitionTime: '2024-11-15T14:17:38Z'
    message: >-
      async update failed: refuse to update the external resource because the
      following update requires replacing it: cannot change the value of the
      argument "capacity_type" from "SPOT" to "ON_DEMAND"
    reason: AsyncUpdateFailure
    status: 'False'
    type: LastAsyncOperation
Example 2 - EC2 LaunchTemplate
conditions:
  - lastTransitionTime: '2024-10-10T16:12:09Z'
    reason: Available
    status: 'True'
    type: Ready
  - lastTransitionTime: '2024-11-18T19:52:41Z'
    message: >-
      cannot patch the managed resource via server-side apply: failed to
      create typed patch object (/; ec2.aws.upbound.io/v1beta1,
      Kind=LaunchTemplate): .spec.forProvider.vpcSecurityGroupIds: element 0:
      associative list without keys has an element that's an explicit null
    reason: ReconcileError
    status: 'False'
    type: Synced
  - lastTransitionTime: '2024-10-10T16:12:07Z'
    reason: Success
    status: 'True'
    type: LastAsyncOperation
To address this, I would recommend not short-circuiting and instead processing each condition in chronological order. Below is what I've come up with so far. Also, marking resources as Degraded when they are waiting for input from other resources seems incorrect to me. As such, I have softened the status when condition.type == "Synced" and condition.status == "False": if the message contains "cannot resolve references", the resource is reported as Progressing instead. This check may stop matching if the message changes in the future, but in that case the status would simply revert to Degraded.
Example 3 - Softening Degraded State
status:
  atProvider: {}
  conditions:
    - lastTransitionTime: '2024-11-26T17:39:54Z'
      message: >-
        cannot resolve references: mg.Spec.ForProvider.FileSystemID:
        referenced field was empty (referenced resource may not yet be ready)
      reason: ReconcileError
      status: 'False'
      type: Synced
Lastly, the health-check does not provide user feedback when a resource is paused, for example when its Synced condition has status False and reason ReconcilePaused.
I believe we should return the status Suspended, as per the Argo documentation: "Suspended - the resource is suspended and waiting for some external event to resume (e.g. suspended CronJob or paused Deployment)".
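The two Synced special cases described above (unresolved references and paused reconciliation) can be sketched in isolation. This is only an illustration: classify_synced is a hypothetical helper name that does not appear in the health-check, and the sample conditions are trimmed to the fields that matter.

```lua
-- Hypothetical helper, not part of the health-check itself: maps one
-- Synced condition to an Argo CD health status string.
local function classify_synced(condition)
  if condition.type ~= "Synced" or condition.status ~= "False" then
    return nil -- not a failing Synced condition, nothing to report
  end
  -- Assumption: a condition may arrive without a message, so default to "".
  local message = condition.message or ""
  -- Plain-text search (4th argument true), so the text is not parsed as a pattern.
  if string.find(message, "cannot resolve references", 1, true) then
    return "Progressing" -- waiting for input from another resource
  elseif condition.reason == "ReconcilePaused" then
    return "Suspended" -- reconciliation is paused
  end
  return "Degraded"
end

local waiting = {
  type = "Synced", status = "False", reason = "ReconcileError",
  message = "cannot resolve references: mg.Spec.ForProvider.FileSystemID: referenced field was empty (referenced resource may not yet be ready)",
}
local paused = { type = "Synced", status = "False", reason = "ReconcilePaused" }
local failed = {
  type = "Synced", status = "False", reason = "ReconcileError",
  message = "update failed",
}

print(classify_synced(waiting))
print(classify_synced(paused))
print(classify_synced(failed))
```

The order of the checks matters: the message test comes first so that a resource waiting on a reference is never reported as Degraded, and anything that matches neither special case falls through to Degraded.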
local health_status = {}

local function contains(table, val)
  for i, v in ipairs(table) do
    if v == val then
      return true
    end
  end
  return false
end

-- Convert a timestamp such as '2024-10-10T19:44:11Z' into seconds since the epoch.
local function to_timestamp(date_str)
  return os.time({
    year = string.sub(date_str, 1, 4),
    month = string.sub(date_str, 6, 7),
    day = string.sub(date_str, 9, 10),
    hour = string.sub(date_str, 12, 13),
    min = string.sub(date_str, 15, 16),
    sec = string.sub(date_str, 18, 19),
    isdst = false
  })
end

local has_no_status = {
  "ProviderConfig",
  "ProviderConfigUsage",
  "Composition",
  "CompositionRevision",
  "DeploymentRuntimeConfig",
  "ControllerConfig",
}

-- Note: `and` binds tighter than `or` in Lua, so a nil status is treated as
-- Healthy for every kind; the kind check only applies to an empty status table.
if obj.status == nil or next(obj.status) == nil and contains(has_no_status, obj.kind) then
  health_status.status = "Healthy"
  health_status.message = "Resource is up-to-date."
  return health_status
end

if obj.status == nil or next(obj.status) == nil or obj.status.conditions == nil then
  if obj.kind == "ProviderConfig" and obj.status.users ~= nil then
    health_status.status = "Healthy"
    health_status.message = "Resource is in use."
    return health_status
  end
  return health_status
end

-- Shortcut for resources with atProvider state such as repositories.argocd.crossplane.io
if obj.status.atProvider then
  if obj.status.atProvider.connectionState then
    if obj.status.atProvider.connectionState.status == "Failed" then
      health_status.status = "Degraded"
      health_status.message = obj.status.atProvider.connectionState.message
      return health_status
    end
  end
end

-- Custom sorting function based on the 'lastTransitionTime' field
if obj.status ~= nil and obj.status.conditions then
  table.sort(obj.status.conditions, function(a, b)
    local time_a = to_timestamp(a.lastTransitionTime)
    local time_b = to_timestamp(b.lastTransitionTime)
    return time_a < time_b -- Sort in ascending order (earliest first)
  end)
end

-- Process all the conditions from oldest to newest (sorted above), so the
-- most recent transition decides the final status.
for i, condition in ipairs(obj.status.conditions) do
  if condition.type == "LastAsyncOperation" then
    if condition.status == "False" then
      health_status.status = "Degraded"
      health_status.message = condition.message
    end
  end

  if condition.type == "Synced" then
    -- Guard against a missing message: string.match raises an error on nil,
    -- which would prevent the ReconcilePaused branch from ever being reached.
    if condition.status == "False" and condition.message ~= nil and string.match(condition.message, "cannot resolve references") then
      health_status.status = "Progressing"
      health_status.message = condition.message
    elseif condition.status == "False" and condition.reason == "ReconcilePaused" then
      health_status.status = "Suspended"
      health_status.message = condition.message
    elseif condition.status == "False" then
      health_status.status = "Degraded"
      health_status.message = condition.message
    end
  end

  if contains({"Ready", "Healthy", "Offered", "Established"}, condition.type) then
    if condition.status == "True" then
      health_status.status = "Healthy"
      health_status.message = "Resource is up-to-date."
    elseif condition.status == "False" and condition.reason == "Creating" then
      health_status.status = "Progressing"
      health_status.message = condition.message
    end
  end
end

return health_status
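To see why the chronological pass changes the outcome for Examples 1 and 2, here is a condensed, standalone sketch of the idea. It is not the full health-check: evaluate is a hypothetical name, only the Healthy and Degraded outcomes are modelled, the condition tables are trimmed copies of the two examples, and the timestamps are compared as plain strings on the assumption that they are all UTC in the same YYYY-MM-DDTHH:MM:SSZ layout.

```lua
-- Hypothetical, condensed version of the chronological evaluation.
local function evaluate(conditions)
  -- Sort a copy so the caller's table is left untouched.
  local sorted = {}
  for i, c in ipairs(conditions) do
    sorted[i] = c
  end
  -- Assumption: identical UTC timestamp layout, so string order is time order.
  table.sort(sorted, function(a, b)
    return a.lastTransitionTime < b.lastTransitionTime
  end)

  local status = nil
  for _, c in ipairs(sorted) do
    if c.type == "Ready" and c.status == "True" then
      status = "Healthy"
    elseif (c.type == "Synced" or c.type == "LastAsyncOperation")
        and c.status == "False" then
      status = "Degraded" -- a later failure overrides an earlier Ready
    end
  end
  return status
end

-- Example 1 (AWS Nodegroup), messages omitted.
local example1 = {
  { type = "Ready", status = "True", lastTransitionTime = "2024-10-10T19:44:11Z" },
  { type = "Synced", status = "False", lastTransitionTime = "2024-11-15T14:17:38Z" },
  { type = "LastAsyncOperation", status = "False", lastTransitionTime = "2024-11-15T14:17:38Z" },
}

-- Example 2 (EC2 LaunchTemplate), messages omitted.
local example2 = {
  { type = "Ready", status = "True", lastTransitionTime = "2024-10-10T16:12:09Z" },
  { type = "Synced", status = "False", lastTransitionTime = "2024-11-18T19:52:41Z" },
  { type = "LastAsyncOperation", status = "True", lastTransitionTime = "2024-10-10T16:12:07Z" },
}

print(evaluate(example1)) -- Degraded
print(evaluate(example2)) -- Degraded
```

Because the later Synced failure is visited after the older Ready condition, both examples end up Degraded rather than Healthy. Note that table.sort is not a stable sort, so conditions that share a timestamp (as Synced and LastAsyncOperation do in Example 1) may be visited in either order.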
I've added my examples to a repo to help demonstrate. It is effectively a fork of the ArgoCD repository, cleaned up so that it only contains my test cases. See the README: https://github.com/patpicos/crossplane-health-checks
I've updated the logic in the original post based on a few edge cases we encountered after deploying against a broader set of resources:
More defensive checks
Exposing cases that should show the status as Suspended
More accurately representing when the resource is progressing or waiting for information from a previous resource before progressing (the original health-check in the Crossplane documentation erroneously showed Degraded right off the bat)
URL: https://docs.crossplane.io/latest/guides/crossplane-with-argo-cd/