Inconsistent state management of fields in EC2NodeClass CRD during apply after Karpenter version update #2674

Closed
milieere opened this issue Jan 20, 2025 · 2 comments

milieere commented Jan 20, 2025

Terraform Version, Provider Version and Kubernetes Version

Terraform version: v1.5.1
Kubernetes provider version: 2.21.1
Kubernetes version: 1.29

Affected Resource(s)

  • kubernetes_manifest (when managing CRDs, specifically Karpenter's EC2NodeClass)

Terraform Configuration Files

resource "kubernetes_manifest" "ec2-node-class-without-snapshot" {
  count           = var.snapshot_id == "" ? 1 : 0
  computed_fields = ["spec.blockDeviceMappings"]
  field_manager {
    force_conflicts = true
  }
  manifest = {
    "apiVersion" = "karpenter.k8s.aws/v1"
    "kind"       = "EC2NodeClass"
    "metadata" = {
      "name" = var.name
    }
    "spec" = {
      # ... other fields ...
      "kubelet" = {
        "clusterDNS" = ["XXX.XXX.X.X"]  # Critical field that must maintain this value
      }
    }
  }
}

Steps to Reproduce

  • Create EC2NodeClass CRD with kubelet.clusterDNS field set
  • Make any change to the EC2NodeClass (e.g., update AMI version)
  • Run terraform plan - shows expected changes
  • Run terraform apply
  • Apply fails with a state-inconsistency error (a minimal sketch of this loop follows the list)
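
A minimal sketch of the loop above; the change can be any edit to the EC2NodeClass spec (e.g. a new AMI version), and <name> is a placeholder:

# Plan and apply the EC2NodeClass change; the apply then fails with the inconsistency error shown below
terraform plan
terraform apply
# Inspect what actually landed on the cluster (<name> is a placeholder)
kubectl get ec2nodeclass <name> -o jsonpath='{.spec.kubelet}'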

Expected Behavior

The provider should maintain the clusterDNS field value during apply operations, especially when:

  • No other controllers are modifying the field (verified via managedFields)
  • Field management is properly configured with force_conflicts = true
  • The field has a specific, required value that must be maintained

Actual Behavior

Provider fails with:

When applying changes to
module.eks-1.module.ec2-node-class-cpu[0].kubernetes_manifest.ec2-node-class-without-snapshot[0],
provider "registry.terraform.io/hashicorp/kubernetes" produced an unexpected
new value: .object.spec.kubelet.clusterDNS: was
cty.ListVal([]cty.Value{cty.StringVal("XXX.XXX.X.X")}), but now null.

Important Factoids

  • This started occurring after the transition to Karpenter version 1.0.2
  • The issue occurs across different node groups referencing the same module
  • The field is managed correctly in some cases but fails inconsistently: in some node groups Terraform successfully assigns this field value during apply, while in other node groups it ends up missing
  • The managedFields output shows no other controller attempting to modify this field at that time, so a race condition is unlikely (the check used is sketched below)
  • The clusterDNS field is crucial for proper node-local-dns operation
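
For reference, the field-ownership check is roughly the following; <name> is a placeholder for the EC2NodeClass in question:

# Show server-side-apply field ownership for the object (<name> is a placeholder)
kubectl get ec2nodeclass <name> --show-managed-fields -o yaml
# Or print only the managedFields entries
kubectl get ec2nodeclass <name> -o jsonpath='{.metadata.managedFields}'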

References

#2185

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment
milieere added the bug label on Jan 20, 2025
milieere changed the title from "Inconsistent state management of fields in EC2NodeClass CRD during apply" to "Inconsistent state management of fields in EC2NodeClass CRD during apply after Karpenter version update" on Jan 20, 2025
milieere (Author) commented:

This behavior is likely related to: aws/karpenter-provider-aws#7235 and https://karpenter.sh/v1.0/upgrading/v1-migration/#kubelet-configuration-migration

Will proceed in accordance with the described steps.

milieere (Author) commented Jan 24, 2025

UPDATE

Actually, the fix described above did not resolve the issue. Attaching a full description of the observed behavior:

The annotations used for conversion between v1beta1 and v1 were removed from the NodePool associated with this ec2nodeclass, per https://karpenter.sh/v1.0/upgrading/v1-migration/#kubelet-configuration-migration (roughly as sketched below), but this did not mitigate the issue.
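
Roughly what that annotation cleanup looks like; the annotation name is taken from the linked migration guide and <nodepool-name> is a placeholder, so verify both against your cluster before removing anything:

# List the annotations currently set on the NodePool (<nodepool-name> is a placeholder)
kubectl get nodepool <nodepool-name> -o jsonpath='{.metadata.annotations}'
# Remove the v1beta1 kubelet conversion annotation (a trailing '-' removes an annotation);
# the annotation name follows the linked guide -- double-check it matches your NodePool
kubectl annotate nodepool <nodepool-name> compatibility.karpenter.sh/v1beta1-kubelet-conversion-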

Deleting the ec2nodeclass manually from the cluster and recreating it via Terraform fixes the issue permanently (tested in infra-dev).

│ Error: Provider produced inconsistent result after apply
│ 
│ When applying changes to
│ module.eks-1.module.ec2-node-class-gpu-arm64[0].kubernetes_manifest.ec2-node-class-without-snapshot[0],
│ provider
│ "module.eks-1.provider[\"registry.terraform.io/hashicorp/kubernetes\"]"
│ produced an unexpected new value: .object.spec.kubelet.clusterDNS: was
│ cty.ListVal([]cty.Value{cty.StringVal("169.254.5.5")}), but now null.
│ 
│ This is a bug in the provider, which should be reported in the provider's
│ own issue tracker.

OBSERVED BEHAVIOR

This error occurs again after transiently working around it with kubectl patch (kubectl patch ec2nodeclass karpenter-gpu-arm64 --type=merge -p '{"spec":{"kubelet":{"clusterDNS":["169.254.5.5"]}}}'). After the patch there are two successful applies: the first apply (no changes) passes, and the second apply (with changes to the ec2nodeclass) also passes. The third apply with modifications then fails with this error and removes the kubelet field from the ec2nodeclass definition, even though it is defined in the code:

  "kubelet" = {
    "clusterDNS" = ["169.254.5.5"]
  }

What fixed the issue

Removing the ec2nodeclasses manually (deleting them from the cluster) and recreating them via Terraform solved the issue (a sketch of this flow follows below). Unfortunately, this is not a viable option in our production environment, only in test, so we have parked this issue and will deal with it during the next Karpenter upgrade.
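
A sketch of the delete-and-recreate flow, using the resource names that appear in the error above (this is disruptive to the nodes backed by the EC2NodeClass, which is why it is not an option for us in production):

# Delete the object from the cluster; the next apply recreates it from the Terraform definition
kubectl delete ec2nodeclass karpenter-gpu-arm64
terraform apply

# Presumably equivalent: force replacement entirely through Terraform (supported in v1.5.1)
terraform apply -replace='module.eks-1.module.ec2-node-class-gpu-arm64[0].kubernetes_manifest.ec2-node-class-without-snapshot[0]'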
