-
Notifications
You must be signed in to change notification settings - Fork 581
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Checkable: send state notifications after suppression if and only if the state differs compared to before the suppression started #9207
Conversation
Please do a first round of review so that we agree on the general change before I continue with addressing the remaining TODOs (some of which involve a discussion anyways). |
lib/icinga/checkable-check.cpp
Outdated
@@ -477,7 +477,7 @@ void Checkable::ProcessCheckResult(const CheckResult::Ptr& cr, const MessageOrig | |||
|
|||
if (send_notification && !is_flapping) { | |||
if (!IsPaused()) { | |||
if (suppress_notification) { | |||
if (suppress_notification || (GetSuppressedNotifications() & (NotificationRecovery|NotificationProblem))) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK... this looks like if suppression is over, but the suppressed haven't been fired yet, let's neutralize.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, not neutralize but rather keep the suppression ongoing until suppressed notifications are fired. This is needed because otherwise, a notification might be sent because there was a state change compared to the last state within the suppression period.
But I should probably add a comment here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do this...
lib/icinga/checkable-check.cpp
Outdated
@@ -494,12 +494,22 @@ void Checkable::ProcessCheckResult(const CheckResult::Ptr& cr, const MessageOrig | |||
int suppressed_types_before (GetSuppressedNotifications()); | |||
int suppressed_types_after (suppressed_types_before | suppressed_types); | |||
|
|||
for (int conflict : {NotificationProblem | NotificationRecovery, NotificationFlappingStart | NotificationFlappingEnd}) { | |||
/* E.g. problem and recovery notifications neutralize each other. */ | |||
int conflict = NotificationFlappingStart | NotificationFlappingEnd; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be const (for completeness).
lib/icinga/checkable-check.cpp
Outdated
if ((suppressed_types_after & conflict) == conflict) { | ||
suppressed_types_after &= ~conflict; | ||
} | ||
int stateNotifications = NotificationRecovery | NotificationProblem; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be const (for completeness).
suppressed_types_after &= ~conflict; | ||
} | ||
int stateNotifications = NotificationRecovery | NotificationProblem; | ||
if (!(suppressed_types_before & stateNotifications) && (suppressed_types & stateNotifications)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, stared long enough on this line to give up all this for now. We can continue this together. On a whiteboard.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've extended the PR description (the numbered list on top is new), maybe this helps to understand. This condition is the check if either of the two flags is set for the very first time.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And... if none are set?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What exactly do you mean? Set as in added to suppressed_notifications
in general by this function? Or in some of the variables specifically?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I mean: to me this looks like intended as XOR, but implemented as NAND – one more trigger case. Or does the latter never happen?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe it's easier to read the conditions the other way around:
suppressed_types & stateNotifications
: This means that the check result would have triggered a state notification (i.e. either a problem or a recovery notification) if it wasn't suppressed and therefore is stored for later.!(suppressed_types_before & stateNotifications)
: This part checks that neitherNotificationProblem
norNotificationRecovery
were set before. So this is the very first time a state notification is generated during the current suppression period.
With the new logic, it isn't really necessary to differentiate between the two types of state notifications. All the code would also work if only a single "NotificationState" flag was added to suppressed_notifications
. It's mainly done this way as the bitset and NotificationType
enum already exist in their current form.
For this PR, you could basically ignore states where only one of NotificationProblem
/NotificationRecovery
is set and treat them as if both were set. The code in FireSuppressedNotifications
will figure out which one is appropriate in the end, if either.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, now I get it! (!BEFORE) && AFTER
5709d9c
to
3eb127d
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now as we don't over-neutralise the bitmask, #9146 is definitely covered.
And due to the new attribute collateral damage is definitely avoided.
@@ -175,6 +175,9 @@ abstract class Checkable : CustomVarObject | |||
[state, no_user_view, no_user_modify] int suppressed_notifications { | |||
default {{{ return 0; }}} | |||
}; | |||
[state, enum, no_user_view, no_user_modify] ServiceState state_before_suppression { | |||
default {{{ return ServiceOK; }}} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This choice feels definitely right (implies #7815 a bit).
return cr && !IsStateOK(cr->GetState()) && cr->GetState() != GetStateBeforeSuppression(); | ||
} | ||
case NotificationRecovery: | ||
{ | ||
auto cr (GetLastCheckResult()); | ||
return cr && IsStateOK(cr->GetState()); | ||
return cr && IsStateOK(cr->GetState()) && cr->GetState() != GetStateBeforeSuppression(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This respects that there's not only one problem state, good.
@@ -224,12 +239,12 @@ bool Checkable::NotificationReasonApplies(NotificationType type) | |||
case NotificationProblem: | |||
{ | |||
auto cr (GetLastCheckResult()); | |||
return cr && !IsStateOK(cr->GetState()) && GetStateType() == StateTypeHard; | |||
return cr && !IsStateOK(cr->GetState()) && cr->GetState() != GetStateBeforeSuppression(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This keeps soft states...
|
||
continue; | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
... and that suppresses them. Fine.
suppressed_types_after &= ~conflict; | ||
} | ||
int stateNotifications = NotificationRecovery | NotificationProblem; | ||
if (!(suppressed_types_before & stateNotifications) && (suppressed_types & stateNotifications)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, now I get it! (!BEFORE) && AFTER
* WARNING -> OK -> CRITICAL generates both types once, but there should still be a notification. | ||
*/ | ||
// TODO: Is there a better way to get this value than this case distinction? | ||
SetStateBeforeSuppression(old_stateType == StateTypeHard ? old_state : ServiceOK); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure what's the ?: for, but it definitely doesn’t harm.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I need the last state before the current state that was a hard state. The TODO is there because I didn't find an existing variable or function that gives me this exact value, but I might have missed something.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Upon closer inspection, I think cr->GetPreviousHardState()
should work there but doesn't because of #9254 (comment)
suppressed_types_after &= ~conflict; | ||
} | ||
const int stateNotifications = NotificationRecovery | NotificationProblem; | ||
if (!(suppressed_types_before & stateNotifications) && (suppressed_types & stateNotifications)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, now problem|rec. don’t neutralise each other here but – I guess – a timer and the changed NotificationReasonApplies do this job.
3eb127d
to
e9e7234
Compare
@Al2Klimov Before I do the final cleanup and testing, can you please provide some feedback on one specific question? Do you think it's fine to have |
If you need to sync it at all (because it's not computed on the other node the same way due to $CHECKRESULT) I'd make one signal+handler per property. |
Yes, setting the attribute depends on the other notification logic in
So that there are two JSON-RPC messages sent when |
e9e7234
to
a2bd87d
Compare
Yes, but that's just my personal preference. |
e4514ff
to
3280440
Compare
48dc544
to
a74b101
Compare
https://github.com/Icinga/icinga2/runs/5377835600?check_suite_focus=true#step:7:47590
But only on Raspbian bullseye. I'll see if I'm able to reproduce this. But I don't think that has happened on previous versions of the PR on GH Actions and also |
There are data races between Reducing the timer interval allows reproducing the issue on other platforms:
(The timer started first in |
Apart from the race condition, there's also a small logic error: The following icinga2/lib/icinga/checkable-notification.cpp Lines 190 to 198 in a74b101
|
This commit changes the Checkable notification suppression logic (notifications are currently suppressed on the Checkable if it is unreachable, in a downtime, or acknowledged) to that after the suppression reason ends, a state notification is sent if and only if the first hard state after is different from the last hard state from before. If the checkable is in a soft state after the suppression ends, the notification is further suppressed until a hard state is reached. To achieve this behavior, a new attribute state_before_suppression is added to Checkable. This attribute is set to the last hard state the first time either a PROBLEM or a RECOVERY notification is suppressed. Compared to from before, neither of these two flags in the suppressed_notification will ever be cleared while the supression is still ongoing but only after the suppression ended and the current state is compared with the old state stored in state_before_suppression.
This ensures that in case of a failover in an HA zone, the other can take over properly and has the required state to send the proper notifications.
a74b101
to
90848f6
Compare
…tions Checkable: send state notifications after suppression if and only if the state differs compared to before the suppression started
This PR changes the behavior of state notifications (problem, recovery) following a suppression condition (unreachable, in downtime, acknowledged) so that notification is sent if and only if the hard state after the suppression is different from the (last) hard state from before the suppression. If the suppression ends while the checkable is in a soft state, the notification is further delayed until after a hard state was reached.
This is achieved by the following changes:
NotificationProblem
andNotificationRecovery
flags are never removed from thesuppressed_notifications
bitset, they may only be added. If they became obsolete is only checked after the suppression reason is gone when actually trying to send them. So actually both flags may be set at the same time. They can be interpreted as "at some point during the suppression, this notification would have been sent if there was no suppression".state_before_suppression
. When the suppression ends, the current state is compared with this value to decide if a notification should actually be sent.Note for reviewers: better look at the two commits separately. The one adding the tests has to expose the
FireSuppressedNotifications
function that was previously only accessible from the same compilation unit. This adds quite some bloat in the overall diff that does not introduce any change in behavior.Limitations
suppressed_notifications = NotificationProblem
).Tests
Automatic tests
The PR adds test case is added which iterates over all sequences of service states of length 4, applies them to a service while in downtime and checks if the appropriate notifications as described before are generated. This should test most of the new logic.
Manual tests
In order to test the new cluster sync message and also have some general test that everything also looks fine outside of the test suite, I performed the following test:
state_before_suppression
attribute).Snippet from my test configuration (you have to add standard apply rules for notifications):
Note that the object names were chosen carefully to that the authority for the notification objects for *-1 is on master-1 and for *-2 on master-2 when both nodes are running to avoid triggering bug #9265. Otherwise some of the notifications observed in step 8 may be duplicated on the second node in step 9.
I've written a test script for my local test cluster in docker compose. This probably won't work anywhere else out of the box but should give enough information to reproduce the tests in detail.
My test script
Before (db321b9)
Both nodes send a notification for the CRITICAL -> WARNING -> CRITICAL service (none-1/2) and miss the notification for the CRTICAL -> WARNING -> OK (rec-1/2) service.
master-1:
master-2:
Full output
After (90848f6)
Both nodes
master-1:
master-2:
Full output
Note: during my tests, sometimes notifications were only requested but not actually sent. With notice log enabled, this then looked like this:
So the new code was correctly requesting the notification but it got skipped later on. I believe this happens as the cluster was restarted between the initial problem notification was sent and the notification is triggered and some state is lost there.
fixes #9054
closes #9090
closes #9146