
Karpenter is taking longer time to fallback to lower weighted nodepool when ICE errors are hit #1899

Open
bparamjeet opened this issue Jan 3, 2025 · 3 comments
Labels: kind/bug, needs-triage

Comments

@bparamjeet

Description

Observed Behavior:
Karpenter takes too long to fall back to a lower-weighted NodePool when ICE (insufficient capacity) errors occur. A sudden increase in pod replica count leaves all pods in a Pending state for an extended period.

Expected Behavior:
Karpenter should fall back to a lower-weighted NodePool immediately when ICE errors occur.

Reproduction Steps (Please include YAML):

  • Create multiple NodePools with different weights (a minimal sketch follows this list).
  • Increase the replica count of a deployment to a large number.
  • Karpenter cannot create new nodes due to ICE errors, and pods accumulate in a Pending state.
  • Karpenter keeps reporting ICE errors for an extended period before it falls back.
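For context, here is a minimal sketch of the kind of weighted NodePool pair this setup implies. The pool names, the EC2NodeClass reference, and the instance-type values are illustrative assumptions, not taken from this report; with a Karpenter v1 NodePool, the higher-weight pool is considered first and the lower-weight pool is the intended fallback:

```yaml
# Hypothetical higher-weight pool: preferred instance types, tried first.
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: primary                    # assumed name
spec:
  weight: 100                      # higher weight = higher priority
  template:
    spec:
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default              # assumed EC2NodeClass
      requirements:
        - key: node.kubernetes.io/instance-type
          operator: In
          values: ["c7i.8xlarge", "c6i.8xlarge"]
---
# Hypothetical lower-weight pool: intended fallback when the primary is ICE'd.
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: fallback                   # assumed name
spec:
  weight: 10
  template:
    spec:
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
      requirements:
        - key: node.kubernetes.io/instance-type
          operator: In
          values: ["c5.9xlarge", "c5.12xlarge"]
```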

Versions:

  • Karpenter Version: 1.0.5
  • Kubernetes Version (kubectl version): v1.31
(Three screenshots attached, captured Jan 3, 2025.)
bparamjeet added the kind/bug label on Jan 3, 2025
k8s-ci-robot added the needs-triage label on Jan 3, 2025
@k8s-ci-robot (Contributor) commented:

This issue is currently awaiting triage.

If Karpenter contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@Vacant2333 commented on Jan 7, 2025:

@bparamjeet Could this be a necessary condition for the problem? Karpenter caches InstanceTypes that failed to launch and will not retry them for a few minutes.
In your tests, how long did it take for Karpenter to fall back to the lower-weighted node pool?

@bparamjeet (Author) replied to the question "In your tests, how long would it take for karpenter to fall back into the lower-weighted node pool?":

  • Karpenter did not fall back to the standby NodePools, which have lower weights.
  • We intervened and marked the lower weights for c5.9x, which then helped Karpenter create nodes (a sketch of this change follows this list).
  • We have multiple NodePools with the same weight covering c7i.8x, c7i.12x, c6i.8x, c6i.12x, c5.9x, and c5.12x. After ICE errors for c7i.8x and c6i.8x, Karpenter does not fall back to c5.9x or c5.12x. Why does it keep preferring c7i and c6i? Why is Karpenter not provisioning c5 instances when it gets ICE errors for c6i and c7i?
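For illustration only, a minimal sketch of the kind of change described above (assuming the shorthand refers to c5.9xlarge/c5.12xlarge; the pool name and EC2NodeClass are made up): the c5 instance types are moved into their own NodePool with an explicitly lower weight than the c7i/c6i pools, so Karpenter treats it as a fallback rather than as a peer of equal weight.

```yaml
# Hypothetical standby pool holding only the c5 instance types, with a weight
# below the c7i/c6i pools so it is only used once those pools cannot launch.
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: standby-c5          # assumed name
spec:
  weight: 10                # assumed to be lower than the c7i/c6i pools
  template:
    spec:
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default       # assumed EC2NodeClass
      requirements:
        - key: node.kubernetes.io/instance-type
          operator: In
          values: ["c5.9xlarge", "c5.12xlarge"]
```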
