
Kubectl stabilization: move provider to root #3095

Draft
wants to merge 16 commits into develop from annuay/fix-orphaned-resource-states

Conversation

@annuay-google (Contributor) commented Oct 3, 2024

Issue Description

We have a blueprint that applies some manifests with the kubectl-apply module. If we fully remove a kubectl-apply block from the blueprint, we get 'Error: Provider configuration not present'.

For example, suppose the blueprint contains the following blocks:

  - id: workload_manager_install
    source: modules/management/kubectl-apply
    use: [gke_cluster]
    settings:
      kueue:
        install: true
      jobset:
        install: true

  - id: workload_manager_config
    source: modules/management/kubectl-apply
    use: [gke_cluster]
    settings:
      apply_manifests:
      - source: $(ghpc_stage("maxtext-gke-a3-files"))/config-map.yaml.tftpl
        template_vars: {name: "a3plus-benchmark-resources-configmap", num_nodes: "1"}
      - source: $(ghpc_stage("maxtext-gke-a3-files"))/kueue-credentials.yaml.tftpl
        template_vars: {num_chips: "8"}

Removing workload_manager_config, then recreating and redeploying the deployment, gives us the error:

Error: Provider configuration not present

To work with
module.workload_manager_config.module.kubectl_apply_manifests["1"].kubectl_manifest.apply_doc["2"]
(orphan) its original provider configuration at
module.workload_manager_config.provider["registry.terraform.io/gavinbunney/kubectl"]
is required, but it has been removed. This occurs when a provider
configuration is removed while objects created by that provider still exist
in the state. Re-add the provider configuration to destroy
module.workload_manager_config.module.kubectl_apply_manifests["1"].kubectl_manifest.apply_doc["2"]
(orphan), after which you can remove the provider configuration again.

Root Cause

The root cause is that the provider is defined inside the module itself, in modules/management/kubectl-apply/providers.tf. All other Terraform providers in the toolkit are defined at the root module; this one lives in a child module instead.

When the child module's block is deleted from the blueprint, the corresponding module call and its provider configuration are removed as well (the module folder is still on disk, but Terraform no longer knows about it).

Terraform therefore does not know which provider configuration to use to destroy the resources still associated with the removed module (kubectl_apply_manifests in this case). See this answer for more details: https://stackoverflow.com/a/58403262
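
To make the failure mode concrete, here is a minimal sketch of the kind of provider definition that lives inside the child module today; the actual contents of modules/management/kubectl-apply/providers.tf may differ, and the variable names below are assumptions for illustration:

# Hypothetical sketch of a provider configured inside the child module
# (modules/management/kubectl-apply/providers.tf); variable names are assumed.
provider "kubectl" {
  host                   = "https://${var.cluster_endpoint}"
  token                  = var.access_token
  cluster_ca_certificate = base64decode(var.cluster_ca_certificate)
  load_config_file       = false
}

Because this provider block is part of the module, deleting the module's blueprint entry also deletes the only provider configuration Terraform could use to destroy the resources that are still recorded in state.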

Approach

Move kubectl provider to root module

Testing

  • Use the same setup as in the issue description. Remove the workload_manager_config module, then recreate and redeploy. Verify that the error does not occur and that the infrastructure is removed as expected.

Submission Checklist

NOTE: Community submissions can take up to 2 weeks to be reviewed.

Please take the following actions before submitting this pull request.

  • Fork your PR branch from the Toolkit "develop" branch (not main)
  • Test all changes with pre-commit in a local branch
  • Confirm that "make tests" passes all tests
  • Add or modify unit tests to cover code changes
  • Ensure that unit test coverage remains above 80%
  • Update all applicable documentation
  • Follow Cluster Toolkit Contribution guidelines

@annuay-google force-pushed the annuay/fix-orphaned-resource-states branch from 19c5409 to e53ee50 on October 3, 2024 14:41
@annuay-google added the release-improvements label (added to release notes under the "Improvements" heading) on Oct 3, 2024
@annuay-google (Contributor, Author) commented:

The following changes need to be made to the toolkit core:

Add this provider to primary/providers.tf:

provider "kubectl" {
  host                   = "https://${module.gke_cluster.gke_cluster_endpoint}"
  token                  = module.gke_cluster.access_token
  cluster_ca_certificate = base64decode(module.gke_cluster.cluster_ca_certificate)
  load_config_file       = false
  apply_retry_count      = 15 # Terraform may apply resources in parallel, leading to potential dependency issues. This retry mechanism ensures that if a resource's dependencies aren't ready, Terraform will attempt to apply it again.
}

Add this provider version to primary/versions.tf:

    kubectl = {
      source  = "gavinbunney/kubectl"
      version = ">= 1.7.0"
    }
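
On the child-module side, the module would then declare only that it requires the kubectl provider and would no longer define a provider block of its own, so it inherits the default kubectl provider configured at the root. A hedged sketch of what modules/management/kubectl-apply/versions.tf could look like after the move (exact contents are an assumption):

terraform {
  required_providers {
    kubectl = {
      source  = "gavinbunney/kubectl"
      version = ">= 1.7.0"
    }
  }
  # Note: no provider "kubectl" block remains in the module. With the provider
  # defined in primary/providers.tf, Terraform passes the root's default
  # kubectl provider configuration to this child module automatically.
}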

@annuay-google marked this pull request as draft on October 3, 2024 14:59
@annuay-google force-pushed the annuay/fix-orphaned-resource-states branch 2 times, most recently from e4390e3 to 0fbd519 on October 3, 2024 15:50