[Web Bug] - Configuring Crossplane with Argo CD #838

Open
patpicos opened this issue Nov 26, 2024 · 3 comments
patpicos commented Nov 26, 2024

The health check provided in the documentation has a few blind spots for resources that were Ready in the past but have since moved to a Degraded state.
Here are sample sets of resource conditions that are incorrectly reported as Healthy. This happens because the provided logic short-circuits and returns Healthy as soon as it finds a condition with type: Ready and status: 'True'.

Example 1 - AWS Nodegroup

  conditions:
    - lastTransitionTime: '2024-10-10T19:44:11Z'
      reason: Available
      status: 'True'
      type: Ready
    - lastTransitionTime: '2024-11-15T14:17:38Z'
      message: >-
        update failed: async update failed: refuse to update the external
        resource because the following update requires replacing it: cannot
        change the value of the argument "capacity_type" from "SPOT" to
        "ON_DEMAND"
      reason: ReconcileError
      status: 'False'
      type: Synced
    - lastTransitionTime: '2024-11-15T14:17:38Z'
      message: >-
        async update failed: refuse to update the external resource because the
        following update requires replacing it: cannot change the value of the
        argument "capacity_type" from "SPOT" to "ON_DEMAND"
      reason: AsyncUpdateFailure
      status: 'False'
      type: LastAsyncOperation

Example 2 - EC2 LaunchTemplate

  conditions:
    - lastTransitionTime: '2024-10-10T16:12:09Z'
      reason: Available
      status: 'True'
      type: Ready
    - lastTransitionTime: '2024-11-18T19:52:41Z'
      message: >-
        cannot patch the managed resource via server-side apply: failed to
        create typed patch object (/; ec2.aws.upbound.io/v1beta1,
        Kind=LaunchTemplate): .spec.forProvider.vpcSecurityGroupIds: element 0:
        associative list without keys has an element that's an explicit null
      reason: ReconcileError
      status: 'False'
      type: Synced
    - lastTransitionTime: '2024-10-10T16:12:07Z'
      reason: Success
      status: 'True'
      type: LastAsyncOperation
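
To make the failure mode concrete, here is a simplified sketch of a short-circuiting check. It is not the script from the documentation; the function name shortcut_health and the shortened message are my own placeholders. It only shows why Example 2 comes back as Healthy when the Ready condition is seen first.

```lua
-- Hypothetical, simplified short-circuit check (not the documented script).
local function shortcut_health(conditions)
  for _, condition in ipairs(conditions) do
    if condition.type == "Ready" and condition.status == "True" then
      -- Returns immediately, so later conditions are never inspected.
      return { status = "Healthy", message = "Resource is up-to-date." }
    end
    if condition.type == "Synced" and condition.status == "False" then
      return { status = "Degraded", message = condition.message }
    end
  end
  return { status = "Progressing", message = "Waiting for conditions." }
end

-- Conditions from Example 2, in the order listed above (message shortened).
local example2 = {
  { type = "Ready", status = "True", reason = "Available",
    lastTransitionTime = "2024-10-10T16:12:09Z" },
  { type = "Synced", status = "False", reason = "ReconcileError",
    lastTransitionTime = "2024-11-18T19:52:41Z",
    message = "cannot patch the managed resource via server-side apply" },
  { type = "LastAsyncOperation", status = "True", reason = "Success",
    lastTransitionTime = "2024-10-10T16:12:07Z" },
}

local result = shortcut_health(example2)
print(result.status) -- Healthy, even though Synced has been False since November
```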

To address this, I recommend not short-circuiting and instead processing each condition in chronological order. Below is what I've come up with so far. Also, marking resources as Degraded while they are waiting for input from other resources seems incorrect. As such, I have softened the status when
condition.type == "Synced" and condition.status == "False" by checking whether the message contains "cannot resolve references", in which case the resource is reported as Progressing. This may stop matching if the message changes in the future, but the status would then simply revert to Degraded.

Example 3 - Softening Degraded State

status:
  atProvider: {}
  conditions:
    - lastTransitionTime: '2024-11-26T17:39:54Z'
      message: >-
        cannot resolve references: mg.Spec.ForProvider.FileSystemID: referenced
        field was empty (referenced resource may not yet be ready)
      reason: ReconcileError
      status: 'False'
      type: Synced
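
The softening check can be isolated into a small helper, sketched below. The helper name is_waiting_for_references is hypothetical, and the plain-text search is my choice so that the message is never interpreted as a Lua pattern. It also tolerates conditions that carry no message.

```lua
-- Hypothetical helper: true when a Synced=False condition only reports
-- unresolved references, i.e. the resource is waiting on another resource.
local function is_waiting_for_references(condition)
  if condition.type ~= "Synced" or condition.status ~= "False" then
    return false
  end
  local message = condition.message
  if type(message) ~= "string" then
    return false -- some conditions have no message at all
  end
  -- The fourth argument `true` requests a plain substring search.
  return string.find(message, "cannot resolve references", 1, true) ~= nil
end

-- Condition from Example 3.
local example3 = {
  type = "Synced", status = "False", reason = "ReconcileError",
  message = "cannot resolve references: mg.Spec.ForProvider.FileSystemID: "
    .. "referenced field was empty (referenced resource may not yet be ready)",
}
-- A paused resource, which has no message.
local paused = { type = "Synced", status = "False", reason = "ReconcilePaused" }

print(is_waiting_for_references(example3)) -- true
print(is_waiting_for_references(paused))   -- false
```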

Lastly, the health check does not give the user any feedback when a resource is paused, for example:

  Type:                  Synced
  Status:                False
  Reason:                ReconcilePaused

I believe we should return the Suspended status in this case, as per the Argo CD documentation:
Suspended - the resource is suspended and waiting for some external event to resume (e.g. suspended CronJob or paused Deployment)

URL: https://docs.crossplane.io/latest/guides/crossplane-with-argo-cd/

local health_status = {}

local function contains(list, val)
  for _, v in ipairs(list) do
    if v == val then
      return true
    end
  end
  return false
end

local function to_timestamp(date_str)
  -- Parse "YYYY-MM-DDTHH:MM:SSZ"; convert each field to a number for os.time.
  return os.time({year = tonumber(string.sub(date_str, 1, 4)),
                  month = tonumber(string.sub(date_str, 6, 7)),
                  day = tonumber(string.sub(date_str, 9, 10)),
                  hour = tonumber(string.sub(date_str, 12, 13)),
                  min = tonumber(string.sub(date_str, 15, 16)),
                  sec = tonumber(string.sub(date_str, 18, 19)),
                  isdst = false})
end

local has_no_status = {
  "ProviderConfig",
  "ProviderConfigUsage",
  "Composition",
  "CompositionRevision",
  "DeploymentRuntimeConfig",
  "ControllerConfig",
}

-- Parentheses are needed because "and" binds tighter than "or" in Lua.
if (obj.status == nil or next(obj.status) == nil) and contains(has_no_status, obj.kind) then
  health_status.status = "Healthy"
  health_status.message = "Resource is up-to-date."
  return health_status
end

if obj.status == nil or next(obj.status) == nil or obj.status.conditions == nil then
  if obj.kind == "ProviderConfig" and obj.status.users ~= nil then
    health_status.status = "Healthy"
    health_status.message = "Resource is in use."
    return health_status
  end
  return health_status
end

-- Shortcut for resources with atProvider state such as repositories.argocd.crossplane.io
if obj.status.atProvider then
  if obj.status.atProvider.connectionState then
    if obj.status.atProvider.connectionState.status == "Failed" then
      health_status.status = "Degraded"
      health_status.message = obj.status.atProvider.connectionState.message
      return health_status
    end
  end
end

-- Custom sorting function based on the 'lastTransitionTime' field
if obj.status ~= nil and obj.status.conditions then
  table.sort(obj.status.conditions, function(a, b)
    local time_a = to_timestamp(a.lastTransitionTime)
    local time_b = to_timestamp(b.lastTransitionTime)
    return time_a < time_b  -- Sort in ascending order (earliest first)
  end)
end

-- Process all the conditions from oldest to newest (sorted above).
for i, condition in ipairs(obj.status.conditions) do
  if condition.type == "LastAsyncOperation" then
    if condition.status == "False" then
      health_status.status = "Degraded"
      health_status.message = condition.message
    end
  end

  if condition.type == "Synced" then
    -- Guard against conditions without a message (e.g. ReconcilePaused).
    if condition.status == "False" and condition.message ~= nil and string.match(condition.message, "cannot resolve references") then
      health_status.status = "Progressing"
      health_status.message = condition.message
    elseif condition.status == "False" and condition.reason == "ReconcilePaused" then
      health_status.status = "Suspended"
      health_status.message = condition.message
    elseif condition.status == "False" then
      health_status.status = "Degraded"
      health_status.message = condition.message
    end
  end

  if contains({"Ready", "Healthy", "Offered", "Established"}, condition.type) then
    if condition.status == "True" then
      health_status.status = "Healthy"
      health_status.message = "Resource is up-to-date."
    elseif condition.status == "False" and condition.reason == "Creating" then
      health_status.status = "Progressing"
      health_status.message = condition.message
    end
  end
end
return health_status
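
To try the condition loop outside Argo CD, it can be wrapped in a function and fed the example conditions. The sketch below is a trimmed-down harness, not the script itself: the names evaluate and ready_types are hypothetical, messages are shortened, and the timestamps of the paused resource are made up.

```lua
-- Hypothetical harness: wraps the condition loop so it runs in plain Lua.
local function to_timestamp(date_str)
  local y, mo, d, h, mi, s =
    string.match(date_str, "^(%d+)-(%d+)-(%d+)T(%d+):(%d+):(%d+)")
  return os.time({year = tonumber(y), month = tonumber(mo), day = tonumber(d),
                  hour = tonumber(h), min = tonumber(mi), sec = tonumber(s),
                  isdst = false})
end

local ready_types = {Ready = true, Healthy = true, Offered = true, Established = true}

local function evaluate(conditions)
  local health = {}
  -- Oldest first, so the most recent transition decides the final status.
  table.sort(conditions, function(a, b)
    return to_timestamp(a.lastTransitionTime) < to_timestamp(b.lastTransitionTime)
  end)
  for _, c in ipairs(conditions) do
    if c.type == "LastAsyncOperation" and c.status == "False" then
      health.status, health.message = "Degraded", c.message
    end
    if c.type == "Synced" and c.status == "False" then
      if c.message and string.find(c.message, "cannot resolve references", 1, true) then
        health.status = "Progressing"
      elseif c.reason == "ReconcilePaused" then
        health.status = "Suspended"
      else
        health.status = "Degraded"
      end
      health.message = c.message
    end
    if ready_types[c.type] then
      if c.status == "True" then
        health.status, health.message = "Healthy", "Resource is up-to-date."
      elseif c.status == "False" and c.reason == "Creating" then
        health.status, health.message = "Progressing", c.message
      end
    end
  end
  return health
end

-- Example 1 (messages shortened).
local nodegroup = evaluate({
  {type = "Ready", status = "True", reason = "Available",
   lastTransitionTime = "2024-10-10T19:44:11Z"},
  {type = "Synced", status = "False", reason = "ReconcileError",
   lastTransitionTime = "2024-11-15T14:17:38Z", message = "update failed"},
  {type = "LastAsyncOperation", status = "False", reason = "AsyncUpdateFailure",
   lastTransitionTime = "2024-11-15T14:17:38Z", message = "async update failed"},
})

-- A paused resource; these timestamps are assumptions for illustration.
local paused = evaluate({
  {type = "Ready", status = "True", reason = "Available",
   lastTransitionTime = "2024-10-10T16:12:09Z"},
  {type = "Synced", status = "False", reason = "ReconcilePaused",
   lastTransitionTime = "2024-11-26T17:39:54Z"},
})

print(nodegroup.status) -- Degraded
print(paused.status)    -- Suspended
```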


@patpicos

@negz I would love your feedback on this. We discussed the health state and transitions of resources in a past thread.

@patpicos

I've added my examples to a repo to help demonstrate. It is effectively a fork of the Argo CD repository, cleaned up so that it only contains my test cases. See the README:
https://github.com/patpicos/crossplane-health-checks

@patpicos

patpicos commented Dec 3, 2024

I've updated the logic in the original post based on a few edge cases we encountered after deploying against a broader set of resources.

  • More defensive checks
  • Exposing cases that should show the status as Suspended
  • More accurately representing when the resource is progressing or waiting for information from a previous resource before progressing (the original health check in the Crossplane documentation erroneously showed Degraded right off the bat)
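
One further defensive check that may be worth considering, sketched here under my own assumptions rather than taken from the updated script: table.sort is not stable, so conditions that share a lastTransitionTime (as Synced and LastAsyncOperation do in Example 1) can come out in either order, which changes the message that is finally reported. The hypothetical helper sorted_conditions below breaks ties on the original position and tolerates a missing timestamp. It compares the timestamps as plain strings, which works as long as they all use the same UTC format shown in the examples.

```lua
-- Hypothetical helper: returns a new array sorted oldest-first, keeping the
-- original relative order of conditions that share a timestamp.
local function sorted_conditions(conditions)
  local decorated = {}
  for index, condition in ipairs(conditions or {}) do
    decorated[#decorated + 1] = {
      key = condition.lastTransitionTime or "", -- missing timestamps sort first
      index = index,
      condition = condition,
    }
  end
  table.sort(decorated, function(a, b)
    if a.key ~= b.key then
      return a.key < b.key
    end
    return a.index < b.index -- tie-break on original position
  end)
  local result = {}
  for i, entry in ipairs(decorated) do
    result[i] = entry.condition
  end
  return result
end

local ordered = sorted_conditions({
  { type = "Synced", lastTransitionTime = "2024-11-15T14:17:38Z" },
  { type = "LastAsyncOperation", lastTransitionTime = "2024-11-15T14:17:38Z" },
  { type = "Ready", lastTransitionTime = "2024-10-10T19:44:11Z" },
})
for _, condition in ipairs(ordered) do
  print(condition.type) -- Ready, Synced, LastAsyncOperation
end
```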
