Inconsistent state management of fields in EC2NodeClass CRD during apply after Karpenter version update #2674

Closed
milieere opened this issue Jan 20, 2025 · 2 comments

milieere commented Jan 20, 2025

Terraform Version, Provider Version and Kubernetes Version

Terraform version: v1.5.1
Kubernetes provider version: 2.21.1
Kubernetes version: 1.29

Affected Resource(s)

  • kubernetes_manifest (when managing CRDs, specifically Karpenter's EC2NodeClass)

Terraform Configuration Files

resource "kubernetes_manifest" "ec2-node-class-without-snapshot" {
  count           = var.snapshot_id == "" ? 1 : 0
  computed_fields = ["spec.blockDeviceMappings"]
  field_manager {
    force_conflicts = true
  }
  manifest = {
    "apiVersion" = "karpenter.k8s.aws/v1"
    "kind"       = "EC2NodeClass"
    "metadata" = {
      "name" = var.name
    }
    "spec" = {
      # ... other fields ...
      "kubelet" = {
        "clusterDNS" = ["XXX.XXX.X.X"]  # Critical field that must maintain this value
      }
    }
  }
}

Steps to Reproduce

  • Create EC2NodeClass CRD with kubelet.clusterDNS field set
  • Make any change to the EC2NodeClass (e.g., update AMI version)
  • Run terraform plan - shows expected changes
  • Run terraform apply
  • Apply fails with a state-inconsistency error (a minimal sketch of this loop follows the list)
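
A minimal sketch of the loop above; the change can be any edit to the EC2NodeClass spec (e.g. a new AMI version), and <name> is a placeholder:

# Plan and apply the EC2NodeClass change; the apply then fails with the inconsistency error shown below
terraform plan
terraform apply
# Inspect what actually landed on the cluster (<name> is a placeholder)
kubectl get ec2nodeclass <name> -o jsonpath='{.spec.kubelet}'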

Expected Behavior

The provider should maintain the clusterDNS field value during apply operations, especially when:

  • No other controllers are modifying the field (verified via managedFields)
  • Field management is properly configured with force_conflicts = true
  • The field has a specific, required value that must be maintained

Actual Behavior

Provider fails with:

When applying changes to
module.eks-1.module.ec2-node-class-cpu[0].kubernetes_manifest.ec2-node-class-without-snapshot[0],
provider "registry.terraform.io/hashicorp/kubernetes" produced an unexpected
new value: .object.spec.kubelet.clusterDNS: was
cty.ListVal([]cty.Value{cty.StringVal("XXX.XXX.X.X")}), but now null.

Important Factoids

  • This started occurring after the transition to Karpenter version 1.0.2
  • The issue occurs across different node groups referencing the same module
  • The field is managed correctly in some cases but fails inconsistently: in some node groups Terraform successfully assigns this field value during apply, while in other node groups it ends up missing
  • The managedFields output shows no other controller attempting to modify this field at that time, so a race condition is unlikely (the check used is sketched below)
  • The clusterDNS field is crucial for proper node-local-dns operation
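
For reference, the field-ownership check is roughly the following; <name> is a placeholder for the EC2NodeClass in question:

# Show server-side-apply field ownership for the object (<name> is a placeholder)
kubectl get ec2nodeclass <name> --show-managed-fields -o yaml
# Or print only the managedFields entries
kubectl get ec2nodeclass <name> -o jsonpath='{.metadata.managedFields}'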

References

#2185

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment
milieere added the bug label on Jan 20, 2025
milieere changed the title from "Inconsistent state management of fields in EC2NodeClass CRD during apply" to "Inconsistent state management of fields in EC2NodeClass CRD during apply after Karpenter version update" on Jan 20, 2025
milieere (Author) commented:

This behavior is likely related to: aws/karpenter-provider-aws#7235 and https://karpenter.sh/v1.0/upgrading/v1-migration/#kubelet-configuration-migration

Will proceed in accordance with the described steps.

milieere (Author) commented Jan 24, 2025

UPDATE

Actually, the fix described above did not resolve the issue. Attaching a full description of the observed behavior:

The annotations used for conversion between v1beta1 and v1 were removed from the NodePool associated with this ec2nodeclass, per https://karpenter.sh/v1.0/upgrading/v1-migration/#kubelet-configuration-migration (roughly as sketched below), but this did not mitigate the issue.
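
Roughly what that annotation cleanup looks like; the annotation name is taken from the linked migration guide and <nodepool-name> is a placeholder, so verify both against your cluster before removing anything:

# List the annotations currently set on the NodePool (<nodepool-name> is a placeholder)
kubectl get nodepool <nodepool-name> -o jsonpath='{.metadata.annotations}'
# Remove the v1beta1 kubelet conversion annotation (a trailing '-' removes an annotation);
# the annotation name follows the linked guide -- double-check it matches your NodePool
kubectl annotate nodepool <nodepool-name> compatibility.karpenter.sh/v1beta1-kubelet-conversion-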

Deleting the ec2nodeclass manually from the cluster and recreating it via Terraform fixes the issue permanently (tested in infra-dev).

│ Error: Provider produced inconsistent result after apply
│ 
│ When applying changes to
│ module.eks-1.module.ec2-node-class-gpu-arm64[0].kubernetes_manifest.ec2-node-class-without-snapshot[0],
│ provider
│ "module.eks-1.provider[\"registry.terraform.io/hashicorp/kubernetes\"]"
│ produced an unexpected new value: .object.spec.kubelet.clusterDNS: was
│ cty.ListVal([]cty.Value{cty.StringVal("169.254.5.5")}), but now null.
│ 
│ This is a bug in the provider, which should be reported in the provider's
│ own issue tracker.

OBSERVED BEHAVIOR

This error occurs again after transiently working around it with kubectl patch (kubectl patch ec2nodeclass karpenter-gpu-arm64 --type=merge -p '{"spec":{"kubelet":{"clusterDNS":["169.254.5.5"]}}}'). After the patch there are two successful applies: the first apply (no changes) passes, and the second apply (with changes to the ec2nodeclass) also passes. The third apply with modifications then fails with this error and removes the kubelet field from the ec2nodeclass definition, even though it is defined in the code:

  "kubelet" = {
    "clusterDNS" = ["169.254.5.5"]
  }

What fixed the issue

Removing the ec2nodeclasses manually (deleting them from the cluster) and recreating them via Terraform solved the issue (a sketch of this flow follows below). Unfortunately, this is not a viable option in our production environment, only in test, so we have parked this issue and will deal with it during the next Karpenter upgrade.
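
A sketch of the delete-and-recreate flow, using the resource names that appear in the error above (this is disruptive to the nodes backed by the EC2NodeClass, which is why it is not an option for us in production):

# Delete the object from the cluster; the next apply recreates it from the Terraform definition
kubectl delete ec2nodeclass karpenter-gpu-arm64
terraform apply

# Presumably equivalent: force replacement entirely through Terraform (supported in v1.5.1)
terraform apply -replace='module.eks-1.module.ec2-node-class-gpu-arm64[0].kubernetes_manifest.ec2-node-class-without-snapshot[0]'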
