Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Handle update network status failed after add and del attachments #237

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

Longchuanzheng
Copy link

@Longchuanzheng Longchuanzheng commented Jul 23, 2024

What this PR does / why we need it:
This pr delete all attachments added in this processNextWorkItem when set NetworkStatus failed in handleResult. Because DEL is Idempotence we can always return to the state before dynamic handle.

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #235

Special notes for your reviewer (optional):

@Longchuanzheng Longchuanzheng changed the title handle Interface already exists due to SetNetworkStatus failed Handle update network status failed after add and del attachments Jul 23, 2024
@Longchuanzheng
Copy link
Author

This pr will cancel(delete) the added attachment in two cases:

  1. When update network status failed in handleResult, we delete all attachment added in this reconcile.
  2. When add attachments failed in reconcile, we try to delete add failed and all subsequent attachments that need to be added.

Copy link
Collaborator

@LionelJouin LionelJouin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this feature makes sense. I don't know if it would be possible to add unit tests for it?

pkg/controller/pod.go Outdated Show resolved Hide resolved
pkg/controller/pod.go Outdated Show resolved Hide resolved
pkg/controller/pod.go Outdated Show resolved Hide resolved
@Longchuanzheng
Copy link
Author

I think this feature makes sense. I don't know if it would be possible to add unit tests for it?

ok, I will try.

@Longchuanzheng
Copy link
Author

Right now, it will return an err if we want to delet net1 in ut. If we change this, some previous ut will failed. So my ut expect FailedRemovingInterface for remove the net1.
Or do we need do more in this pr to improve ut (I think it will change more codes in ut.)?

@Longchuanzheng Longchuanzheng force-pushed the handleSetNetworkStatusFailed1 branch 2 times, most recently from 79be62f to 1c54179 Compare July 30, 2024 09:11
if shouldDeleteattachmentsAdded {
attachmentsToCancel = attachmentsToAdd
} else if addfailedIndex != -1 {
attachmentsToCancel = append(attachmentsToCancel, attachmentsToAdd[addfailedIndex])
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
attachmentsToCancel = append(attachmentsToCancel, attachmentsToAdd[addfailedIndex])
attachmentsToCancel = append(attachmentsToCancel, attachmentsToAdd[:addfailedIndex])

Why having reverted this changes? All non failing attachment added must be reverted no?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is a good point ...

I agree with @LionelJouin; shouldn't you delete all the attachments you've just added ? i.e. setting the pod back to its original state ?

otherwise, the pod's network-status and the pod's actual interfaces will not be consistent.

Eventually(<-eventRecorder.Events).Should(Equal(expectedEventPayload))
})

It("the pod network-status dosen't change", func() {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
It("the pod network-status dosen't change", func() {
It("the pod network-status doesn't change", func() {

Eventually(<-eventRecorder.Events).Should(Equal(expectedEventPayload))

expectedEventPayload = fmt.Sprintf(
"Warning FailedRemovingInterface pod [%s]: failed removing interface %s from network: %s",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can add networkConfig(multuscni.CmdDel, "net1", macAddr), to the fake multus client line 126 to test the simple case if you want (the simple case I mean: everything works except status update).

Eventually(<-eventRecorder.Events).Should(Equal(expectedEventPayload))
})

It("the pod network-status dosen't change", func() {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure if this is required as the status update has been prevented, so this might test Kubernetes.

Comment on lines +644 to +663
expectedError := errors.New("someerror")
k8sClient.PrependReactor("update", "pods", func(_ k8stesting.Action) (bool, runtime.Object, error) {
return true, nil, expectedError
})
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be moved to the JustBeforeEach

@Longchuanzheng Longchuanzheng force-pushed the handleSetNetworkStatusFailed1 branch 2 times, most recently from 13eaa7b to 2cd1a8d Compare August 30, 2024 06:31
@Longchuanzheng
Copy link
Author

@maiqueb Hi, can you review this pr, thank you in advance. I think this #235 will cause pod add/del attachnet stuck.
It is not very friendly if we stuck when try to hotplugin nic for vm with kubevirt.

Copy link
Collaborator

@maiqueb maiqueb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for this PR. I agree we should keep the pod's interfaces consistent with the reported network-status at all times. Extra thanks for complying and adding the unit test @LionelJouin requested.

I'm requesting some changes to improve the overall code quality, and I also wonder if your approach (only deleting one interface) wouldn't break the pod interface / pod network status consistency.

Comment on lines 246 to 248
defer func() {
pnc.handleResult(err, podNamespacedName, pod, results, attachmentsToAdd, netnsPath, podSandboxID, addFailedIndex)
}()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't like how this looks.

How about we define a Rollback interface and pass that to the handle result func ?

It would be something like

type interface Rollback {
    Rollback() error
}

type asd struct {
    netnsPath string
    podSandboxID string
    attachmentsToRollback []nadv1.NetworkSelectionElement
}

func (a *asd) Rollback() error {
//     ... invoke DEL for attachments on asd.attachments
}

Then, on handleResult you just invoke rollback if err != nil

pkg/controller/pod_test.go Outdated Show resolved Hide resolved
pkg/controller/pod_test.go Outdated Show resolved Hide resolved
pkg/controller/pod_test.go Outdated Show resolved Hide resolved
pkg/controller/pod_test.go Outdated Show resolved Hide resolved
if shouldDeleteattachmentsAdded {
attachmentsToCancel = attachmentsToAdd
} else if addfailedIndex != -1 {
attachmentsToCancel = append(attachmentsToCancel, attachmentsToAdd[addfailedIndex])
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is a good point ...

I agree with @LionelJouin; shouldn't you delete all the attachments you've just added ? i.e. setting the pod back to its original state ?

otherwise, the pod's network-status and the pod's actual interfaces will not be consistent.

@Longchuanzheng Longchuanzheng force-pushed the handleSetNetworkStatusFailed1 branch 2 times, most recently from 13eaa7b to 2aa70ec Compare September 27, 2024 12:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[BUG] Dynamic network plugin stuck due to updating pod network status failed
3 participants