
Kubectl stabilization: move provider to root #3095

Draft
wants to merge 16 commits into develop from annuay/fix-orphaned-resource-states

Conversation

@annuay-google (Contributor) commented Oct 3, 2024

Issue Description

We have a blueprint that applies some manifests with the kubectl-apply module. If we fully remove a kubectl-apply block from the blueprint, we get 'Error: Provider configuration not present'.

For example, suppose the blueprint contains the following blocks:

  - id: workload_manager_install
    source: modules/management/kubectl-apply
    use: [gke_cluster]
    settings:
      kueue:
        install: true
      jobset:
        install: true

  - id: workload_manager_config
    source: modules/management/kubectl-apply
    use: [gke_cluster]
    settings:
      apply_manifests:
      - source: $(ghpc_stage("maxtext-gke-a3-files"))/config-map.yaml.tftpl
        template_vars: {name: "a3plus-benchmark-resources-configmap", num_nodes: "1"}
      - source: $(ghpc_stage("maxtext-gke-a3-files"))/kueue-credentials.yaml.tftpl
        template_vars: {num_chips: "8"}

Removing workload_manager_config, then recreating and redeploying the deployment, gives us the error:

Error: Provider configuration not present

To work with
module.workload_manager_config.module.kubectl_apply_manifests["1"].kubectl_manifest.apply_doc["2"]
(orphan) its original provider configuration at
module.workload_manager_config.provider["registry.terraform.io/gavinbunney/kubectl"]
is required, but it has been removed. This occurs when a provider
configuration is removed while objects created by that provider still exist
in the state. Re-add the provider configuration to destroy
module.workload_manager_config.module.kubectl_apply_manifests["1"].kubectl_manifest.apply_doc["2"]
(orphan), after which you can remove the provider configuration again.

Root Cause

The root cause is that the provider is defined inside the module itself, in modules/management/kubectl-apply/providers.tf. All other Terraform providers in the toolkit are defined at the root module; this one lives in a child module instead.

When the child module's block is deleted from the blueprint, the corresponding module call and its provider configuration are removed as well (the module folder is still on disk, but Terraform no longer knows about it).

Terraform therefore does not know which provider configuration to use to destroy the resources still associated with the removed module (kubectl_apply_manifests in this case). See this answer for more details: https://stackoverflow.com/a/58403262
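
To make the failure mode concrete, here is a minimal sketch of the kind of provider definition that lives inside the child module today; the actual contents of modules/management/kubectl-apply/providers.tf may differ, and the variable names below are assumptions for illustration:

# Hypothetical sketch of a provider configured inside the child module
# (modules/management/kubectl-apply/providers.tf); variable names are assumed.
provider "kubectl" {
  host                   = "https://${var.cluster_endpoint}"
  token                  = var.access_token
  cluster_ca_certificate = base64decode(var.cluster_ca_certificate)
  load_config_file       = false
}

Because this provider block is part of the module, deleting the module's blueprint entry also deletes the only provider configuration Terraform could use to destroy the resources that are still recorded in state.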

Approach

Move kubectl provider to root module

Testing

  • Use the same setup as in the issue description. Remove the workload_manager_config module, then recreate and redeploy. Verify that the error does not occur and that the infrastructure is removed as expected.

Submission Checklist

NOTE: Community submissions can take up to 2 weeks to be reviewed.

Please take the following actions before submitting this pull request.

  • Fork your PR branch from the Toolkit "develop" branch (not main)
  • Test all changes with pre-commit in a local branch
  • Confirm that "make tests" passes all tests
  • Add or modify unit tests to cover code changes
  • Ensure that unit test coverage remains above 80%
  • Update all applicable documentation
  • Follow Cluster Toolkit Contribution guidelines

@annuay-google force-pushed the annuay/fix-orphaned-resource-states branch from 19c5409 to e53ee50 on October 3, 2024 14:41
@annuay-google added the release-improvements label (added to release notes under the "Improvements" heading) on Oct 3, 2024
@annuay-google (Contributor, Author) commented:

The following changes need to be made to the toolkit core:

Add this provider to primary/providers.tf:

provider "kubectl" {
  host                   = "https://${module.gke_cluster.gke_cluster_endpoint}"
  token                  = module.gke_cluster.access_token
  cluster_ca_certificate = base64decode(module.gke_cluster.cluster_ca_certificate)
  load_config_file       = false
  apply_retry_count      = 15 # Terraform may apply resources in parallel, leading to potential dependency issues. This retry mechanism ensures that if a resource's dependencies aren't ready, Terraform will attempt to apply it again.
}

Add this provider version to primary/versions.tf:

    kubectl = {
      source  = "gavinbunney/kubectl"
      version = ">= 1.7.0"
    }
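
On the child-module side, the module would then declare only that it requires the kubectl provider and would no longer define a provider block of its own, so it inherits the default kubectl provider configured at the root. A hedged sketch of what modules/management/kubectl-apply/versions.tf could look like after the move (exact contents are an assumption):

terraform {
  required_providers {
    kubectl = {
      source  = "gavinbunney/kubectl"
      version = ">= 1.7.0"
    }
  }
  # Note: no provider "kubectl" block remains in the module. With the provider
  # defined in primary/providers.tf, Terraform passes the root's default
  # kubectl provider configuration to this child module automatically.
}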

@annuay-google marked this pull request as draft on October 3, 2024 14:59
@annuay-google force-pushed the annuay/fix-orphaned-resource-states branch 2 times, most recently from e4390e3 to 0fbd519 on October 3, 2024 15:50