
Icinga sends notifications for hosts a second after getting into soft state (1 out of 3 tries) #10262

Open
mihaiste opened this issue Dec 3, 2024 · 4 comments


mihaiste commented Dec 3, 2024

Hello everybody,

A bit of context on the issue my team is facing.
We have an Icinga-based environment set up on Kubernetes, consisting of 2 masters and 2 satellites.
The communication model is bottom-up, meaning the agents (the monitored VMs) connect to the satellites and the satellites connect to the masters.
The host template we are using for the monitored VMs is the following:
{ "accept_config": true, "check_command": "cluster-zone", "check_interval": "120", "max_check_attempts": "3", "retry_interval": "60", "enable_active_checks": true, "enable_flapping": true, "enable_passive_checks": false, "enable_perfdata": false, "has_agent": true, "master_should_connect": false, "object_type": "template", "vars": { "entity_of": "", "entity_type": "", "subscriptions": [ "INIT" ] }, "volatile": false }'

Our notification object is configured as:
{ "apply_to": "host", "assign_filter": "host.vars.team=%22MyTeam%22&host.zone=%22satellite%22", "imports": [ "template_mail-host-notification" ], "object_name": "mail-host-notification", "object_type": "apply", "period": "24x7", "states": [ "Down", "Up" ], "types": [ "Acknowledgement", "DowntimeEnd", "DowntimeRemoved", "DowntimeStart", "FlappingEnd", "FlappingStart", "Problem" ], "users": [ "my.user" ], "notification_interval": "3600", "times_begin": "0" }

Describe the bug

We have a CI/CD pipeline that updates or enforces the configuration in the Icinga Web 2 Director module.
When the Director configuration is deployed, a few hosts change their state to DOWN, entering a soft state first.
Although our host configuration sets max_check_attempts to 3, Icinga sometimes sends notifications for these hosts exactly 1 second after the first failed check (see the screenshots).
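As we understand the retry logic, with retry_interval set to 60 and max_check_attempts set to 3, a host that fails its first check should only reach a hard DOWN state after two further failed retries, i.e. roughly 2 x 60 s = 120 s after the first failure, and only then should the Problem notification go out. A notification 1 second after the first failed check therefore contradicts the configured behaviour.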

To Reproduce

The issue at hand is not reproducible at every Director apply.

Expected behavior

Icinga should only send a notification once the object reaches a hard state.

Screenshots

icinga
icinga2

Your Environment


  • Version used (icinga2 --version): v2.14.2

  • Operating System and version: N/A (deployed on Kubernetes)

  • Enabled features (icinga2 feature list):
    Disabled features: command compatlog debuglog elasticsearch gelf graphite influxdb influxdb2 journald livestatus opentsdb perfdata syslog mainlog
    Enabled features: api checker icingadb notification

  • Icinga Web 2 version and modules (System - About):
    Icinga Web 2 - 2.12.1
    Loaded Modules
    icingadb - 1.1.3
    cube - 1.3.3
    director - 1.11.1
    incubator - 0.22.0
    reporting - 1.0.2
    x509 - 1.3.2

  • Config validation (icinga2 daemon -C):
    [2024-12-03 09:57:51 +0000] information/cli: Icinga application loader (version: v2.14.2)
    [2024-12-03 09:57:51 +0000] information/cli: Loading configuration file(s).
    [2024-12-03 09:57:51 +0000] information/ConfigItem: Committing config item(s).
    [2024-12-03 09:57:51 +0000] information/ApiListener: My API identity: satellite-0
    [2024-12-03 09:57:52 +0000] information/ConfigItem: Instantiated 1 NotificationComponent.
    [2024-12-03 09:57:52 +0000] information/ConfigItem: Instantiated 7 Downtimes.
    [2024-12-03 09:57:52 +0000] information/ConfigItem: Instantiated 1 CheckerComponent.
    [2024-12-03 09:57:52 +0000] information/ConfigItem: Instantiated 59 Users.
    [2024-12-03 09:57:52 +0000] information/ConfigItem: Instantiated 2 TimePeriods.
    [2024-12-03 09:57:52 +0000] information/ConfigItem: Instantiated 1837 Services.
    [2024-12-03 09:57:52 +0000] information/ConfigItem: Instantiated 162 Zones.
    [2024-12-03 09:57:52 +0000] information/ConfigItem: Instantiated 5 NotificationCommands.
    [2024-12-03 09:57:52 +0000] information/ConfigItem: Instantiated 2770 Notifications.
    [2024-12-03 09:57:52 +0000] information/ConfigItem: Instantiated 1 IcingaApplication.
    [2024-12-03 09:57:52 +0000] information/ConfigItem: Instantiated 236 Hosts.
    [2024-12-03 09:57:52 +0000] information/ConfigItem: Instantiated 16 HostGroups.
    [2024-12-03 09:57:52 +0000] information/ConfigItem: Instantiated 162 Endpoints.
    [2024-12-03 09:57:52 +0000] information/ConfigItem: Instantiated 1 ApiUser.
    [2024-12-03 09:57:52 +0000] information/ConfigItem: Instantiated 1 ApiListener.
    [2024-12-03 09:57:52 +0000] information/ConfigItem: Instantiated 540 CheckCommands.
    [2024-12-03 09:57:52 +0000] information/ScriptGlobal: Dumping variables to file '/var/cache/icinga2/icinga2.vars'
    [2024-12-03 09:57:52 +0000] information/cli: Finished validating the configuration file(s).

  • If you run multiple Icinga 2 instances, the zones.conf file (or icinga2 object list --type Endpoint and icinga2 object list --type Zone) from all affected nodes.
    Here is the zones.conf from one of the satellites:
    object Endpoint "satellite-0" {
      // this is me
    }

    // the masters
    object Endpoint "master-0" {
      host = "master-0"
      port = "443"
    }

    object Endpoint "master-1" {
      host = "master-1"
      port = "443"
    }

    // the other satellites
    object Endpoint "satellite-1" {
      host = "satellite-1"
      port = "443"
    }

    object Zone "master" {
      endpoints = [ "master-1", "master-0" ]
    }

    object Zone "satellite" {
      endpoints = [ "satellite-1", "satellite-0" ]
      parent = "master"
    }

    object Zone "global-templates" {
      global = true
    }

    object Zone "director-global" {
      global = true
    }

Additional context

Not sure what other details to provide in this context, please advise.

Thanks!

oxzi (Member) commented Dec 4, 2024

Thanks for creating this issue.

Could you please post your (redacted) Notification object for the Host in question? You should be able to find it with icinga2 object list -t Notification or further filtering based on your Host and Notification name.
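For example, something along these lines should list the generated objects (a rough sketch; the apply rule name is taken from your Director export above, and <affected-host> is a placeholder to replace with the affected host):

    # list all Notification objects generated by the apply rule
    icinga2 object list --type Notification --name '*!mail-host-notification'

    # or restrict the output to the affected host
    icinga2 object list --type Notification --name '<affected-host>!mail-host-notification'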

Nevertheless, soft states should not result in a notification. Thus, could you please post the (redacted) icinga2.log around the time the state changed?
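If it is easier, grepping the daemon log should be enough, for example (again a sketch; the host name and the log path are placeholders, adjust them to wherever your setup writes the Icinga 2 log):

    # everything Icinga 2 logged about the affected host around the state change
    grep '<affected-host>' /var/log/icinga2/icinga2.log

    # sent notifications are logged by the Notification component
    grep 'information/Notification' /var/log/icinga2/icinga2.log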

As the Director is involved in a CI/CD scenario, is the Host object in question being altered or even re-added? If so, could you please post the icinga2.log regarding the object creation including state changes?

Btw, please upgrade your Icinga 2 to the latest version 2.14.3 immediately as the 2.14.2 contains a known critical vulnerability: https://icinga.com/blog/icinga2-security-pre-announcement/, https://icinga.com/blog/critical-icinga-2-security-releases-2-14-3/, https://icinga.com/blog/uncovering-a-client-certificate-verification-bypass-in-icinga/, https://github.com/Icinga/icinga2/releases/tag/v2.14.3.

mihaiste (Author) commented Dec 6, 2024

Hello,

I apologize for the delayed reply.
We have investigated the topic a bit more and will come back with the requested data as soon as possible. Exporting the logs from cold storage is a bit of a hassle.

Thanks for understanding!

> Btw, please upgrade your Icinga 2 to the latest version 2.14.3 immediately as the 2.14.2 contains a known critical vulnerability: https://icinga.com/blog/icinga2-security-pre-announcement/, https://icinga.com/blog/critical-icinga-2-security-releases-2-14-3/, https://icinga.com/blog/uncovering-a-client-certificate-verification-bypass-in-icinga/, https://github.com/Icinga/icinga2/releases/tag/v2.14.3.

We have upgraded the version, thank you very much for the tip!

mihaiste (Author) commented Dec 9, 2024

Hello,

Coming back with some additional details and the requested information, so that hopefully some light can be shed on what is going on in our environment.
We have actually narrowed things down to two more precise scenarios in which we get notified for hosts in a soft state.

Scenario 1 (this is the one described in the original post):

  1. Host X is OK and is checked with the cluster-zone command.
  2. Host X goes down and enters a soft state (1st check attempt out of 3).
  3. Exactly one second later, the notification shows up in Icinga Web, both in the host history and under History -> Notifications.
  4. The notification is sent by one of the master instances.
  5. On the next run of the host check, everything is fine again => host X is OK.

Requested screenshot of the host notification object:
icinga_host_notif

Requested logs from all components (2 masters and 2 satellites):
scenario_1-with_notif_in_webui.zip

Scenario 2:

  1. Host X is OK and is checked with the cluster-zone command.
  2. Host X goes down and enters a soft state (1st check attempt out of 3).
  3. NO notification shows up in Icinga Web, neither in the host history nor under History -> Notifications.
  4. A notification for host X being down is nevertheless sent by one of the master instances.
  5. On the next run of the host check, everything is fine again => host X is OK.

Screenshots:
icinga-host
mail-from-icinga

Requested screenshot of the host notification object:
icinga-host-notification2

Logs from all components (2 masters and 2 satellites):
scenario_2-without_notif_in_webui.zip

Let me know if anything else is needed to get to the bottom of this mystery :)

mihaiste removed their assignment Dec 9, 2024
aval13 commented Dec 9, 2024

Hello,

Just to add a bit more information about this issue.
The CI/CD pipeline only changes service templates, commands, notifications and assign rules. Some of these objects may get forcibly rewritten (due to Director limitations), others are left untouched.
Host objects are not touched by the CI/CD; nothing modifies existing hosts.

As @mihaiste said, the design is the classic top-down hierarchy: 2 masters at the top (in the master zone), 2 satellites in the middle (in the satellite zone), and all the agents below the satellites, connecting to both satellites.

Some time ago, while we were testing Icinga, we noticed we were receiving notifications about the same event (I am not sure, but I believe it was both Host and Service events) from both a master and a satellite, duplicating the emails. So we tweaked the email-sending NotificationCommand script so that the masters only send for master-related events and the satellites only notify on non-master events (somehow this tweak fails for the cases we are seeing; we will need to look into that).
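For illustration, the gating we added is roughly of this shape (a simplified sketch, not the actual script; it assumes the NotificationCommand exports the host's zone as HOST_ZONE via "$host.zone$" and that ZoneName is set in constants.conf):

    # hypothetical sketch of the zone-based gating described above
    LOCAL_ZONE="$(icinga2 variable get ZoneName)"
    if [ "$LOCAL_ZONE" = "master" ] && [ "$HOST_ZONE" != "master" ]; then
        exit 0   # masters stay quiet for non-master events
    fi
    if [ "$LOCAL_ZONE" != "master" ] && [ "$HOST_ZONE" = "master" ]; then
        exit 0   # satellites stay quiet for master-related events
    fi
    # ...normal mail sending continues below this point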

One last thing we can't figure out is that, for these events, the satellite correctly decides not to send a notification, but the master does.

The only guess I have at the moment is that, while the master is processing a (possibly) new configuration update received from the Director, the host check executed on the satellite hiccups. The satellite correctly concludes that this is a soft state and nothing needs to be done, and reports the check result to the masters; somehow, while busy with the newly received configuration, the masters decide to notify even though the check is still in a soft state.

Please let us know if any other information might be of use.

Thank you.
