
Bug: Apply failure will cause release state to be completely empty #1367

Open
ffforest opened this issue Dec 30, 2024 · 1 comment · May be fixed by #1390
Labels
kind/bug Something isn't working

Comments

@ffforest
Contributor

ffforest commented Dec 30, 2024

What happened?

Right now func (ao *ApplyOperation) Apply is treated as one atomic operation: if anything in the process fails, the entire operation fails and returns an empty release in the operation.ApplyResponse, which then gets written to the release file in the backend.
But is that reasonable?
As a consequence, an empty state can lead to unexpected behavior:


During kusion preview and apply, kusion uses apimachinery.CreateThreeWayJSONMergePatch() to create a three-way JSON merge patch for Kubernetes resources.
There are three resource manifests in play here:

  1. original - resource in the last kusion state
  2. modified - resource in the new kusion spec
  3. current - live resource in the cluster

The CreateThreeWayJSONMergePatch() method does a few things (see the sketch after this list):

  1. Create an add and update patch between current and modified
  2. Create a delete patch between original and modified
  3. Check for conflicts and merge the two patches to create one final patch
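
To illustrate the failure mode described below, here is a minimal, self-contained sketch of how this three-way merge behaves when something outside of Kusion (e.g. a mutating webhook) adds a label to the live resource. The JSON fragments and the apiVersion precondition are illustrative assumptions, not Kusion's actual inputs:

```go
package main

import (
	"fmt"

	"k8s.io/apimachinery/pkg/util/jsonmergepatch"
	"k8s.io/apimachinery/pkg/util/mergepatch"
)

func main() {
	// original: resource recorded in the last kusion state
	original := []byte(`{"metadata":{"labels":{"aaa":"aaa"}}}`)
	// modified: resource rendered from the new kusion spec
	modified := []byte(`{"metadata":{"labels":{"aaa":"aaa"}}}`)
	// current: live resource in the cluster, after a webhook added a label
	current := []byte(`{"metadata":{"labels":{"aaa":"aaa","bbb":"bbb"}}}`)

	patch, err := jsonmergepatch.CreateThreeWayJSONMergePatch(
		original, modified, current,
		mergepatch.RequireKeyUnchanged("apiVersion"), // illustrative precondition
	)
	if err != nil {
		panic(err)
	}
	// Prints "{}": both the add/update patch and the delete patch are empty,
	// so the resource is treated as unchanged even though the live object drifted.
	fmt.Printf("patch: %s\n", patch)
}
```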

If something other than Kusion, for example a webhook, adds fields to the resource manifest, there can be unexpected preview/apply results depending on whether the resource is created/updated or unchanged:

If a resource was unchanged in the last apply and it does NOT exist in the release state (yes, this can happen precisely because the entire func (ao *ApplyOperation) Apply is treated as one atomic operation), you end up with a live resource in the cluster, a planned resource in the spec, and NO prior kusion state.

In this case, its kusion state will be refreshed with the live resource in the cluster, returned by rt.Import(), which gets its result from the rt.Read() method in the pkg/engine/runtime/kubernetes package (a hypothetical sketch of this fallback follows the list below).
If a resource was created/updated in the last apply, its state gets updated using either:

  • the planned resource (the one in the spec, if this resource does NOT exist in the state), or
  • the prior resource in the state (if it DOES exist)
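
To make the two branches above concrete, here is a hypothetical sketch of that decision. The names (Resource, Runtime, resolvePrior, priorState) are assumptions for illustration only, not Kusion's actual types or code:

```go
package engine

// Resource is a placeholder for a resource entry in the kusion state.
type Resource struct {
	ID         string
	Attributes map[string]any
}

// Runtime is a placeholder for the runtime interface; Import is assumed to read
// the live resource from the cluster (via Read) and turn it into a state entry.
type Runtime interface {
	Import(planned *Resource) (*Resource, error)
}

// resolvePrior picks the resource to use as "original" in the three-way patch.
func resolvePrior(priorState map[string]*Resource, rt Runtime, planned *Resource) (*Resource, error) {
	if prior, ok := priorState[planned.ID]; ok {
		// The resource exists in the last state: use it as-is.
		return prior, nil
	}
	// No prior entry (e.g. the previous apply failed and wrote an empty release):
	// fall back to importing the live resource from the cluster, which is where
	// webhook-added fields leak into the state.
	return rt.Import(planned)
}
```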

What did you expect to happen?

The logic around the state should be consistent: either break up func (ao *ApplyOperation) Apply so it is non-atomic and a partially successful apply can yield a partial state, or update the logic in the engine with regard to unchanged resources.

How can we reproduce it (as minimally and precisely as possible)?

A re-trace of the issue:
Step 1: A resource gets created by kusion apply (apply #1); the live resource in the cluster shows:

Attributes:
  metadata:
      labels:
          aaa: aaa

But the apply operation fails halfway through.
Right now, the state (original) would be empty:

state:
    resources: []

Step 2: The resource gets patched by a webhook; the live resource in the cluster now shows:

Attributes:
  metadata:
      labels:
          aaa: aaa
          bbb: bbb

The state does not get affected by the webhook patch.

Step 3: Run kusion apply again (apply #2). The delete patch is produced between original (state) and modified (spec), so it is empty (neither the state nor the new spec has the bbb label); the add/update patch is also empty, so this resource is considered Unchanged in apply #2.
In this case, kusion tries to read the prior resource from the last state; since it is not found, it imports it based on the live resource in the cluster. So the kusion state gets updated by rt.Import(), which calls rt.Read(), which does a GET on the live resource in the cluster and refreshes the state with the label bbb.

Step 4: Run kusion apply again (apply #3). The delete patch is produced between original (state) and modified (spec); it now contains the deletion of the label bbb, because the state (original) contains the label bbb but the modified (new spec) resource does not. The label bbb gets deleted in this case.
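
As a rough check of Step 4, feeding such manifests into CreateThreeWayJSONMergePatch (same style of sketch as above; the JSON fragments are trimmed-down assumptions, not the full manifests) yields a patch that removes the label:

```go
package main

import (
	"fmt"

	"k8s.io/apimachinery/pkg/util/jsonmergepatch"
)

func main() {
	// original: the state after apply #2, refreshed from the live resource
	original := []byte(`{"metadata":{"labels":{"aaa":"aaa","bbb":"bbb"}}}`)
	// modified: the new spec, which never contained the bbb label
	modified := []byte(`{"metadata":{"labels":{"aaa":"aaa"}}}`)
	// current: the live resource, still carrying the webhook-added label
	current := []byte(`{"metadata":{"labels":{"aaa":"aaa","bbb":"bbb"}}}`)

	patch, err := jsonmergepatch.CreateThreeWayJSONMergePatch(original, modified, current)
	if err != nil {
		panic(err)
	}
	// Prints {"metadata":{"labels":{"bbb":null}}}: apply #3 deletes the
	// webhook-added label because the refreshed state now contains it.
	fmt.Println(string(patch))
}
```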

Anything else we need to know?

Relevant code
ApplyResponse from func (ao *ApplyOperation) Apply will contain an empty release if the apply fails:
https://github.com/KusionStack/kusion/blob/main/pkg/engine/api/apply.go#L142

rt.Apply() updates state with planned resource:
https://github.com/KusionStack/kusion/blob/main/pkg/engine/runtime/kubernetes/kubernetes_runtime.go#L165

rt.Read() updates state with live resource:
https://github.com/KusionStack/kusion/blob/main/pkg/engine/runtime/kubernetes/kubernetes_runtime.go#L189

Kusion version

$ kusion version
# paste output here

OS version

# On Linux:
$ cat /etc/os-release
# paste output here
$ uname -a
# paste output here

# On Windows:
C:\> wmic os get Caption, Version, BuildNumber, OSArchitecture
# paste output here

Install tools

@ffforest ffforest added the kind/bug Something isn't working label Dec 30, 2024
@riven-blade
Contributor

riven-blade commented Jan 17, 2025

Hi, I looked at the code for this bug.
pkg/engine/operation/apply.go, line 43:

[screenshot]

I think that if an error occurs in this preparation stage of the apply, the old release state can be returned instead of an empty one.

[screenshot]

In the wait function, each key is a specific resource. We can separate the resources that were deployed successfully from those that failed, then update the state for the successfully deployed resources while the failed ones keep their previous release state.

[screenshot]

Finally, when the deployment completes, the release state is updated regardless of success or failure.
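
A hypothetical sketch of that per-resource merge (the names mergeRelease, prev, applied, and failed are assumptions for illustration, not the actual Kusion code):

```go
package engine

// mergeRelease builds the new release state per resource instead of
// all-or-nothing: successfully applied resources take their new state,
// while failed ones keep whatever the previous release recorded for them.
func mergeRelease[T any](prev, applied map[string]T, failed map[string]bool) map[string]T {
	out := make(map[string]T, len(prev)+len(applied))
	for id, r := range prev {
		out[id] = r // start from the previous release state
	}
	for id, r := range applied {
		if !failed[id] {
			out[id] = r // only successfully deployed resources get updated
		}
	}
	return out
}
```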
