Service notifications despite parent host being down #873
Comments
It seems someone else is experiencing similar issues where their child hosts are still giving off service notifications even though the parent is down/unreachable. |
On a
|
Well ..... It's working now as expected. The children show up as unreachable and their services are no longer critical.
So it does work as expected for me (after all :D ) |
@tonoitp If you wish to re-create this, I believe the hosts/services on the lab network must be detected as UP/OK in a specific order. Let's say that you have the following relationship, assuming gw1 and host1 are on the lab network: Nagios -> gw1 -> host1. When performing the recovery on the lab network, you must make sure that host1 and its services are detected by Nagios as UP/OK before gw1. At least that was the scenario that triggered the false notifications in my case. I suppose you can create this scenario either by increasing check_interval of gw1, or simply manually forcing a check of host and services on host1 in order for Nagios to pick up its change in status before it detects that gw1 is up. |
Well, I re-used a test server I had, and found it's still a bit off.
Five hosts show as unreachable which is correct. But ... |
@djerveren Thanks for the tip. But if that is a requirement, shouldn't Nagios take care of running detection in the right order? It's made aware of the dependencies. |
As far as I know, after a HARD DOWN/CRITICAL, Nagios just keeps running checks based on the Regardless, if the parent (prod-gw) is in a DOWN state, Nagios shouldn't send notifications for CRITICAL services on the child host (prod-mssql-1), period. But in my case it did. So I went here to report it. |
I see this check in
But I do not see something equivalent with the I'm curious whether a few of the above issues could be fixed with host and service dependencies Regardless, I'm curious as to why a similar check was never added. It seems that some people are aware of this, as this is in the documentation at the bottom
|
In summary, I think the service notifications are being sent because there is never any check that propagates upward to check the parents of the parents and so on. It just checks whether its host is up or whether its service parents are up (amongst other things). Basically what @djerveren said
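To illustrate what an upward-propagating check could look like, here is a minimal Python sketch. It is not Nagios code: the `Host` class and the `is_reachable` and `should_notify_service` helpers are hypothetical names made up for this example, and it assumes a host counts as reachable when it has no parents or when at least one parent is UP and itself reachable.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Host:
    # Hypothetical model of a monitored host, not a Nagios structure.
    name: str
    state: str = "UP"  # last known state of the host itself: "UP" or "DOWN"
    parents: List["Host"] = field(default_factory=list)


def is_reachable(host: Host) -> bool:
    """Walk the parent chain upward using only last known states."""
    if not host.parents:
        return True  # top of the tree, e.g. directly attached to the monitor
    # Reachable if at least one parent is UP and is itself reachable.
    return any(p.state == "UP" and is_reachable(p) for p in host.parents)


def should_notify_service(host: Host) -> bool:
    """Allow service notifications only if the host is UP and reachable."""
    return host.state == "UP" and is_reachable(host)


# The scenario from this issue: the child is seen as UP before the gateway.
gw = Host("prod-gw", state="DOWN")
db = Host("prod-mssql-1", state="UP", parents=[gw])
print(should_notify_service(db))  # False: prod-gw is still known as DOWN

gw.state = "UP"
print(should_notify_service(db))  # True: the whole path is UP again
```

With a check like this, the order in which the recovered hosts are re-checked no longer matters, because the child's service notifications stay suppressed until the gateway's own state has been updated.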
|
I kind of agree, but since it doesn't perform any propagated upwards checks, it should then remember that prod-gw was still HARD DOWN (at least as far as Nagios was aware), which means that prod-mssql-1 should be considered UNREACHABLE and simply suppress the service recovery notifications based on that fact alone. I remember when working at op5 many years ago, our devs made Merlin suppress recovery notifications for hosts/services it hadn't sent out problem notifications for, which would've helped in this case. |
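The recovery suppression described above can be sketched roughly like this in Python. This is only an illustration of the idea, not Merlin's or Nagios's actual implementation; the `NotificationFilter` class, its `should_send` method, and the `problem_suppressed` flag are all hypothetical names.

```python
class NotificationFilter:
    """Send a RECOVERY only if a matching PROBLEM was actually sent."""

    def __init__(self):
        # Identifiers of hosts/services with an outstanding problem notification.
        self._notified_problems = set()

    def should_send(self, obj_id, kind, problem_suppressed=False):
        """kind is "PROBLEM" or "RECOVERY".

        problem_suppressed marks a problem notification that some other rule
        (for example an unreachable host) has already blocked.
        """
        if kind == "PROBLEM":
            if problem_suppressed:
                return False
            self._notified_problems.add(obj_id)
            return True
        # RECOVERY: stay quiet unless a problem was reported earlier.
        if obj_id in self._notified_problems:
            self._notified_problems.discard(obj_id)
            return True
        return False


f = NotificationFilter()
svc = "prod-mssql-1;some-service"  # hypothetical service identifier
print(f.should_send(svc, "PROBLEM", problem_suppressed=True))  # False
print(f.should_send(svc, "RECOVERY"))  # False: nobody was told about a problem
```

The design choice here is to track what was actually sent rather than what state changes occurred, so a recovery never reaches on-call staff for a problem they never heard about.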
From what I can gather, a possible solution appears in two parts.
|
Fixed with #995 |
Hello all.
Recently we noticed some weirdness regarding child/parent relationships and notifications where two services notified despite the host's parent being down during a VPN connection drop. The parent/child configuration of hosts is as follows:
Nagios -> prod-gw -> prod-mssql-1
Where obviously the Nagios machine is parent of prod-gw, and prod-gw is parent of prod-mssql-1.
Here it all began. No notifications at this point, since the host went into SOFT DOWN before the HARD CRITICAL of the second service:
Nagios picks up that the prod-gw is in SOFT DOWN and sets prod-mssql-1 to UNREACHABLE:
A few minutes later, the prod-gw host finally reaches HARD DOWN and a notification is correctly sent out:
So far, everything has behaved as expected, no undesired notifications.
But after a while, the VPN connection is restored, and Nagios happens to mark prod-mssql-1 as UP before anything else:
[2022-07-20 21:33:34] HOST ALERT: prod-mssql-1;UP;HARD;1;OK - 10.128.0.6 rta 9.971ms lost 0%
This results in notifications for the two services at the top being sent out, activating our on-call staff for no reason:
Finally, prod-gw is checked and considered UP again, and correctly notifies about it:
Shouldn't those service notifications be suppressed regardless of the state of the host prod-mssql-1? From a parent/child perspective, prod-mssql-1 was still UNREACHABLE (or should be considered to be from a notification-perspective) due to prod-gw still being considered DOWN, and it was simply a race condition that caused these notifications, and in turn an unnecessary activation of on-call resources.