We want to prevent automatic reboots of any node where a particular set of pods is running. To that end, we have configured a --blocking-pod-selector. The idea is that while that particular node will not be rebooted for the time being, any other node will still be allowed to reboot if it can.
The issue seems to be that the lock is taken before the check for blocking pods, but not released when a blocking pod is found. Since the lock is still held, no other node that could reboot is able to do so.
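To make the reported ordering concrete, here is a minimal Go sketch of it. This is not the project's actual code: `clusterLock`, `acquire`, and `tryRebootLockFirst` are hypothetical names, and the lock is modelled as a simple in-memory value rather than a real cluster-wide lock.

```go
package main

import "fmt"

// clusterLock is a hypothetical stand-in for the cluster-wide reboot lock.
// An empty holder means the lock is free.
type clusterLock struct{ holder string }

// acquire hands the lock to node if it is free or already held by node.
func (l *clusterLock) acquire(node string) bool {
	if l.holder == "" || l.holder == node {
		l.holder = node
		return true
	}
	return false
}

// tryRebootLockFirst follows the ordering described above:
// take the lock first, then check for blocking pods.
func tryRebootLockFirst(l *clusterLock, node string, hasBlockingPod bool) bool {
	if !l.acquire(node) {
		return false // another node holds the lock
	}
	if hasBlockingPod {
		return false // reboot is blocked, but the lock is never released
	}
	l.holder = "" // the reboot would happen here; afterwards the lock is freed
	return true
}

func main() {
	lock := &clusterLock{}
	fmt.Println(tryRebootLockFirst(lock, "node-a", true))  // node-a has a blocking pod
	fmt.Println(tryRebootLockFirst(lock, "node-b", false)) // node-b could reboot, but cannot get the lock
	fmt.Println(lock.holder)
}
```

Under these assumptions, node-a keeps the lock after being blocked, so node-b is turned away even though nothing prevents it from rebooting.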
Have I misunderstood anything? Is this a bug or the expected behavior?
Yeah, from reading the code, it seems that the lock is acquired and not released when the reboot is blocked.
There are two ways a reboot can be blocked:
--alert-filter-regexp, which queries Prometheus and is independent of the node currently being processed (reboots are blocked across the whole cluster)
--blocking-pod-selector, which looks for specific pods on the node being processed.
For the Prometheus query, it is fine that the lock is held longer and not released, since reboots are blocked cluster-wide anyway. For the pod selector, it seems better to change the behaviour so that the lock is released again.
After re-reading the code a few times, I think the appropriate thing to do in this case is to simply check for blocking conditions prior to acquiring the lock. I've opened up a PR here that does that:
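As a rough illustration of the idea only (this is not the code from the PR), the check-before-lock ordering could look like the Go sketch below. All names (`clusterLock`, `acquire`, `release`, `tryRebootCheckFirst`) are hypothetical, the lock is a plain in-memory value, and the lock is released immediately instead of after the node has come back up, which is a simplification.

```go
package main

import "fmt"

// clusterLock is a hypothetical stand-in for the cluster-wide reboot lock.
// An empty holder means the lock is free.
type clusterLock struct{ holder string }

// acquire hands the lock to node if it is free or already held by node.
func (l *clusterLock) acquire(node string) bool {
	if l.holder == "" || l.holder == node {
		l.holder = node
		return true
	}
	return false
}

// release frees the lock, but only if node is the current holder.
func (l *clusterLock) release(node string) {
	if l.holder == node {
		l.holder = ""
	}
}

// tryRebootCheckFirst evaluates the blocking conditions before touching
// the lock, so a blocked node never holds it.
func tryRebootCheckFirst(l *clusterLock, node string, blocked func() bool) bool {
	if blocked() {
		return false // blocked: the lock is left free for other nodes
	}
	if !l.acquire(node) {
		return false // another node is currently rebooting
	}
	defer l.release(node) // simplification: release right after the "reboot"
	// Draining and rebooting the node would happen here.
	return true
}

func main() {
	lock := &clusterLock{}
	hasBlockingPod := func() bool { return true }
	noBlockingPod := func() bool { return false }

	fmt.Println(tryRebootCheckFirst(lock, "node-a", hasBlockingPod)) // blocked, lock untouched
	fmt.Println(tryRebootCheckFirst(lock, "node-b", noBlockingPod))  // free to reboot
}
```

With this ordering, a blocking pod on node-a no longer prevents node-b from acquiring the lock. One trade-off is that the blocking conditions could change between the check and the lock acquisition, so a real implementation may want to re-check them once the lock is held.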