Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Consult Checkable#state_before_suppression only if there's a suppression #10032

Conversation

Al2Klimov
Copy link
Member

It's only set along with adding NotificationProblem or NotificationRecovery to Checkable#suppressed_notifications and never reset. Also its default, ServiceOK, is an actual state. Not being aware of these facts confuses Checkable#NotificationReasonApplies() which is a condition for cleaning Notification#suppressed_notifications in FireSuppressedNotifications(). I.e. suppressed notifications can get lost.

fixes #10025

@cla-bot cla-bot bot added the cla/signed label Mar 28, 2024
@icinga-probot icinga-probot bot added area/configuration DSL, parser, compiler, error handling bug Something isn't working ref/IP labels Mar 28, 2024
@Al2Klimov
Copy link
Member Author

Test

  1. Get checkable hard not-OK with a problem notification
  2. Await being outside your notification time period
  3. Get checkable OK again
  4. Await being inside your notification time period

Before: no recovery notification

After: a recovery notification

@Al2Klimov Al2Klimov added this to the 2.15.0 milestone Apr 5, 2024
@Al2Klimov Al2Klimov requested a review from julianbrost June 10, 2024 15:46
@Al2Klimov Al2Klimov self-assigned this Aug 20, 2024
@julianbrost
Copy link
Contributor

Can you please rebase this for up-to-date GitHub Actions checks? There are even failed runs where to logs were already purged so one can't tell if this was some kind of transient error or actually something being broken by this PR.

It's only set along with adding NotificationProblem or NotificationRecovery
to Checkable#suppressed_notifications and never reset. Also its default,
ServiceOK, is an actual state. Not being aware of these facts confuses
Checkable#NotificationReasonApplies() which is a condition for cleaning
Notification#suppressed_notifications in FireSuppressedNotifications().
I.e. suppressed notifications can get lost.
@Al2Klimov Al2Klimov force-pushed the missing-notification-recovery-outside-time-period-10025 branch from e18b15a to cafbe5e Compare August 21, 2024 08:04
@julianbrost
Copy link
Contributor

The longer I look at the code, the more I think Checkable::NotificationReasonApplies() wasn't the right place to add the cr->GetState() != GetStateBeforeSuppression() check back in #9207.

For easy reference, the relevant code places from the current master:

/**
* Returns whether sending a notification of type type right now would represent *this' current state correctly.
*
* @param type The type of notification to send (or not to send).
*
* @return Whether to send the notification.
*/
bool Checkable::NotificationReasonApplies(NotificationType type)
{
switch (type) {
case NotificationProblem:
{
auto cr (GetLastCheckResult());
return cr && !IsStateOK(cr->GetState()) && cr->GetState() != GetStateBeforeSuppression();
}
case NotificationRecovery:
{
auto cr (GetLastCheckResult());
return cr && IsStateOK(cr->GetState()) && cr->GetState() != GetStateBeforeSuppression();
}
case NotificationFlappingStart:
return IsFlapping();
case NotificationFlappingEnd:
return !IsFlapping();
default:
VERIFY(!"Checkable#NotificationReasonStillApplies(): given type not implemented");
return false;
}
}

if (suppressed_types & (NotificationProblem|NotificationRecovery)) {
CheckResult::Ptr cr = GetLastCheckResult();
NotificationType type = cr && IsStateOK(cr->GetState()) ? NotificationRecovery : NotificationProblem;
bool state_suppressed = NotificationReasonSuppressed(NotificationProblem) || NotificationReasonSuppressed(NotificationRecovery);
/* Only process (i.e. send or dismiss) suppressed state notifications if the following conditions are met:
*
* 1. State notifications are not suppressed at the moment. State notifications must only be removed from
* the suppressed notifications bitset after the reason for the suppression is gone as these bits are
* used as a marker for when to set the state_before_suppression attribute.
* 2. The checkable is in a hard state. Soft states represent a state where we are not certain yet about
* the actual state and wait with sending notifications. If we want to immediately send a notification,
* we might send a recovery notification for something that just started failing or a problem
* notification which might be for an intermittent problem that would have never received a
* notification if there was no suppression as it still was in a soft state. Both cases aren't ideal so
* better wait until we are certain.
* 3. The checkable isn't likely checked soon. For example, if a downtime ended, give the checkable a
* chance to recover afterwards before sending a notification.
* 4. No parent recovered recently. Similar to the previous condition, give the checkable a chance to
* recover after one of its dependencies recovered before sending a notification.
*
* If any of these conditions is not met, processing the suppressed notification is further delayed.
*/
if (!state_suppressed && GetStateType() == StateTypeHard && !IsLikelyToBeCheckedSoon() && !wasLastParentRecoveryRecent.Get()) {
if (NotificationReasonApplies(type)) {
Checkable::OnNotificationsRequested(this, type, cr, "", "", nullptr);
}
subtract |= NotificationRecovery|NotificationProblem;
}
}

Now look at the if (NotificationReasonApplies(type)) { (line 206 in the second code block): what does it actually check? If you look above at line 184, type is set based on IsStateOK(cr->GetState()), so that's the same thing Checkable::NotificationReasonApplies() checks1. So the only thing that remains in that function is the GetStateBeforeSuppression() check.

Now the following idea:

-if (NotificationReasonApplies(type)) {
+if (cr->GetState() != GetStateBeforeSuppression()) {

That should move the check to where it's relevant, namely where a notification might actually be sent after the suppression ended. That would actually allow us to get pretty close to you suggested in #10025 (comment), i.e. the GetStateBeforeSuppression() check could be removed from NotificationReasonApplies(). IMHO this would also better align the function with what its documentation comment currently says.

The other callers would have to be looked into as well. I believe the following one would have to be changed:

for (auto type : {NotificationProblem, NotificationRecovery, NotificationFlappingStart, NotificationFlappingEnd}) {
if ((suppressedTypes & type) && !checkable->NotificationReasonApplies(type)) {
subtract |= type;
suppressedTypes &= ~type;
}
}

I think the NotificationProblem and NotificationRecovery should never be cleared before the suppression actually ended, so simply removing them from the iteration should do the job.

What do you think of that idea?

Footnotes

  1. Assuming there's no weird race condition where the result GetLastCheckResult() of changes between the calls.

@Al2Klimov
Copy link
Member Author

I think the NotificationProblem and NotificationRecovery should never be cleared before the suppression actually ended,

Why not? If a hard critical (suppressed += NotificationProblem) recovers within the same downtime (= suppression), why not to clean up (suppressed -= NotificationProblem) right away?

Generally speaking, you asked why I stick to already existing solutions for a problem in unfavor of new ones. I don't. I stick to straightforward ones fixing that particular problem. (Like this PR.) Because the more you touch...

The other callers would have to be looked into as well.

... the more likely you'll touch a running system where you didn't know you don't want to. Especially for a bugfix release!

@julianbrost
Copy link
Contributor

I think the NotificationProblem and NotificationRecovery should never be cleared before the suppression actually ended,

Why not? If a hard critical (suppressed += NotificationProblem) recovers within the same downtime (= suppression), why not to clean up (suppressed -= NotificationProblem) right away?

Because NotificationProblem and NotificationRecovery don't necessarily cancel out each other. Take WARNING -> OK -> CRITICAL, that's one recovery and one problem, yet there was a state change to notify about, which was the point of #9207.

Generally speaking, you asked why I stick to already existing solutions for a problem in unfavor of new ones. I don't. I stick to straightforward ones fixing that particular problem. (Like this PR.)

This PR makes some condition more complex. The goal of my suggestion would be to remove complexity by moving the different parts of the checks to where they are relevant. One problem of NotificationReasonApplies() is that it's called both for the "do we still need to send a notification?" and "can we discard that notification?". These sound like they're opposites but in fact they aren't, see the example with state changes above, where the answer can be "maybe, depends on what happens in the future". Returning a boolean can only work for that situation when the function internally decides/guesses (call it how you want) how the return value will be used. I'd say that's overall more complex and complexity is bad for maintainability.

Because the more you touch...

The other callers would have to be looked into as well.

... the more likely you'll touch a running system where you didn't know you don't want to. Especially for a bugfix release!

That sounds nice in theory, but in practice, also for your change, you have to verify how this affects the different callers.

@julianbrost
Copy link
Contributor

The other callers would have to be looked into as well. I believe the following one would have to be changed:

for (auto type : {NotificationProblem, NotificationRecovery, NotificationFlappingStart, NotificationFlappingEnd}) {
if ((suppressedTypes & type) && !checkable->NotificationReasonApplies(type)) {
subtract |= type;
suppressedTypes &= ~type;
}
}

I think the NotificationProblem and NotificationRecovery should never be cleared before the suppression actually ended, so simply removing them from the iteration should do the job.

Looking at this with a bit more context around:

int suppressedTypes (notification->GetSuppressedNotifications());
if (!suppressedTypes)
return;
int subtract = 0;
auto checkable (notification->GetCheckable());
for (auto type : {NotificationProblem, NotificationRecovery, NotificationFlappingStart, NotificationFlappingEnd}) {
if ((suppressedTypes & type) && !checkable->NotificationReasonApplies(type)) {
subtract |= type;
suppressedTypes &= ~type;
}
}

That place is actually related to Notification::GetSuppressedNotifications(), not Checkable::GetSuppressedNotifications(). So I'd say the following reasoning in principle applies there as well:

Because NotificationProblem and NotificationRecovery don't necessarily cancel out each other. Take WARNING -> OK -> CRITICAL, that's one recovery and one problem, yet there was a state change to notify about, which was the point of #9207.

However, suppression on the Notification object level (and probably on the User level as well, that one has a period attribute as well) there's no tracking of the old state like with Checkable::GetStateBeforeSuppression(). So I fear any suppression on that level will still be affected by a bug similar to #9207, i.e. the two state changes in WARNING -> OK -> CRITICAL cancelling out each other even if they shouldn't.

For this PR specifically, this poses the question whether there can be situations where the condition !(cr->GetState() == GetStateBeforeSuppression() && GetSuppressedNotifications() & stateNotifications) could still Checkable::GetStateBeforeSuppression() even though this should be irrelevant for a suppression on the Notification level.

@yhabteab yhabteab removed this from the 2.15.0 milestone Oct 21, 2024
@icinga-probot icinga-probot bot deleted the missing-notification-recovery-outside-time-period-10025 branch October 24, 2024 08:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/configuration DSL, parser, compiler, error handling bug Something isn't working cla/signed ref/IP
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Missing notification after recovery outside the time period
3 participants