Icinga sends notifications for hosts a second after getting into soft state (1 out of 3 tries) #10262
Comments
Thanks for creating this issue. Could you please post your (redacted) Notification object for the Host in question? You should be able to find it with …

Nevertheless, soft states should not result in a notification. Thus, could you please post the (redacted) icinga2.log around the time the state changed?

As the Director is involved in a CI/CD scenario, is the Host object in question being altered or even re-added? If so, could you please post the icinga2.log covering the object creation, including the state changes?

By the way, please upgrade your Icinga 2 to the latest version 2.14.3 immediately, as 2.14.2 contains a known critical vulnerability: https://icinga.com/blog/icinga2-security-pre-announcement/, https://icinga.com/blog/critical-icinga-2-security-releases-2-14-3/, https://icinga.com/blog/uncovering-a-client-certificate-verification-bypass-in-icinga/, https://github.com/Icinga/icinga2/releases/tag/v2.14.3.
Hello, I apologize for the delayed reply. Thanks for understanding!
We have upgraded the version, thank you very much for the tip!
Hello,

Coming back with some additional details and the requested information, so that maybe some light can be shed on our environment.

Scenario 1 (this is the one described in the original post):
Requested screenshot of the host notification object:
Requested logs from all components (2 masters and 2 satellites):

Scenario 2:
Requested screenshot of the host notification object:
Logs from all components (2 masters and 2 satellites):

Let me know if anything else is needed to get to the bottom of this mystery :)
Hello,

Just to add a bit more information about this issue. As @mihaiste said, the design is top-down: we have 2 masters at the top (in the master zone), 2 satellites in the middle (in the satellite zone), and all the agents under the satellites, connecting to both satellites.

Some time ago, when we were testing Icinga, we noticed we were receiving notifications about an event (not sure, but I believe it was both Host and Service) from both a master and a satellite, duplicating the emails. So we tweaked the email-sending NotificationCommand script so that the masters send only if there is a master-related event (somehow this tweak fails for the cases we are seeing, we will need to look into that), and the satellites notify only on non-master events.

One last thing we can't figure out: for these events, the satellite correctly decides not to send a notification, but the master does. The only guess I have at this moment is that while the master processes a (possibly) new configuration update received from the Director, the Host check executed on the satellite hiccups; the satellite correctly thinks "hey, this is a soft state, nothing to do" and reports the check result to the masters, and somehow, while being busy with the newly received configuration, the masters decide to notify although the check is in a soft state.

Please let us know if any other information might be of use. Thank you.
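For reference, this is roughly how such a tweak can be wired up on the Icinga side: the NotificationCommand can expose the name of the endpoint that executes the notification to the script, which then decides whether to actually send the mail. This is only a minimal sketch; the command name, script path and environment variable names are illustrative and not the actual objects from this setup:

object NotificationCommand "mail-host-notification-example" {
  // illustrative script path, not the real one from this environment
  command = [ ConfigDir + "/scripts/mail-host-notification.sh" ]

  env = {
    // Name of the endpoint executing this notification; the script can use
    // this to let only the masters (or only the satellites) actually send.
    NOTIFICATION_SENDER = NodeName
    NOTIFICATION_HOSTNAME = "$host.name$"
    NOTIFICATION_HOSTSTATE = "$host.state$"
    NOTIFICATION_HOSTOUTPUT = "$host.output$"
  }
}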
Hello everybody,
Some bit of context on the issue my team is facing.
We have an Icinga-based environment set up on Kubernetes, consisting of 2 masters and 2 satellites.
Connections in our environment are established bottom-up: the agents (monitored VMs) connect to the satellites, and the satellites connect to the masters.
The host template we are using for the monitored VMs is the following:
{ "accept_config": true, "check_command": "cluster-zone", "check_interval": "120", "max_check_attempts": "3", "retry_interval": "60", "enable_active_checks": true, "enable_flapping": true, "enable_passive_checks": false, "enable_perfdata": false, "has_agent": true, "master_should_connect": false, "object_type": "template", "vars": { "entity_of": "", "entity_type": "", "subscriptions": [ "INIT" ] }, "volatile": false }'
Our notification object is configured as:
{ "apply_to": "host", "assign_filter": "host.vars.team=%22MyTeam%22&host.zone=%22satellite%22", "imports": [ "template_mail-host-notification" ], "object_name": "mail-host-notification", "object_type": "apply", "period": "24x7", "states": [ "Down", "Up" ], "types": [ "Acknowledgement", "DowntimeEnd", "DowntimeRemoved", "DowntimeStart", "FlappingEnd", "FlappingStart", "Problem" ], "users": [ "my.user" ], "notification_interval": "3600", "times_begin": "0" }
Describe the bug
We have a CI/CD pipeline that updates or enforces the configuration in the Icinga Web Director module.
When the Director configuration is applied, a few hosts change their state to DOWN, entering a SOFT state first.
Although our host configuration sets max_check_attempts to 3, Icinga sometimes sends notifications for these hosts just 1 second after the first check runs (see the screenshots).
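For reference, with the values from the host template above (check_interval = 120s, retry_interval = 60s, max_check_attempts = 3), the escalation we would expect looks like this:

t = 0s      first failed check  -> SOFT 1/3, no notification
t ~ 60s     first retry         -> SOFT 2/3, no notification
t ~ 120s    second retry        -> HARD 3/3, Problem notification sent

That is, a Problem notification should arrive roughly (max_check_attempts - 1) * retry_interval = 2 * 60s = 120 seconds after the first failure, not 1 second after it.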
To Reproduce
The issue at hand is not reproducible at every Director apply.
Expected behavior
Icinga should send the notification only when the object enters a HARD state.
Screenshots
Your Environment
Include as many relevant details about the environment you experienced the problem in
Version used (icinga2 --version): v2.14.2
Operating System and version: N/A (deployed on Kubernetes)
Enabled features (icinga2 feature list):
Enabled features: api checker icingadb notification
Disabled features: command compatlog debuglog elasticsearch gelf graphite influxdb influxdb2 journald livestatus opentsdb perfdata syslog mainlog
Icinga Web 2 version and modules (System - About):
Icinga Web 2 - 2.12.1
Loaded Modules
icingadb - 1.1.3
cube - 1.3.3
director - 1.11.1
incubator - 0.22.0
reporting - 1.0.2
x509 - 1.3.2
Config validation (icinga2 daemon -C):
[2024-12-03 09:57:51 +0000] information/cli: Icinga application loader (version: v2.14.2)
[2024-12-03 09:57:51 +0000] information/cli: Loading configuration file(s).
[2024-12-03 09:57:51 +0000] information/ConfigItem: Committing config item(s).
[2024-12-03 09:57:51 +0000] information/ApiListener: My API identity: satellite-0
[2024-12-03 09:57:52 +0000] information/ConfigItem: Instantiated 1 NotificationComponent.
[2024-12-03 09:57:52 +0000] information/ConfigItem: Instantiated 7 Downtimes.
[2024-12-03 09:57:52 +0000] information/ConfigItem: Instantiated 1 CheckerComponent.
[2024-12-03 09:57:52 +0000] information/ConfigItem: Instantiated 59 Users.
[2024-12-03 09:57:52 +0000] information/ConfigItem: Instantiated 2 TimePeriods.
[2024-12-03 09:57:52 +0000] information/ConfigItem: Instantiated 1837 Services.
[2024-12-03 09:57:52 +0000] information/ConfigItem: Instantiated 162 Zones.
[2024-12-03 09:57:52 +0000] information/ConfigItem: Instantiated 5 NotificationCommands.
[2024-12-03 09:57:52 +0000] information/ConfigItem: Instantiated 2770 Notifications.
[2024-12-03 09:57:52 +0000] information/ConfigItem: Instantiated 1 IcingaApplication.
[2024-12-03 09:57:52 +0000] information/ConfigItem: Instantiated 236 Hosts.
[2024-12-03 09:57:52 +0000] information/ConfigItem: Instantiated 16 HostGroups.
[2024-12-03 09:57:52 +0000] information/ConfigItem: Instantiated 162 Endpoints.
[2024-12-03 09:57:52 +0000] information/ConfigItem: Instantiated 1 ApiUser.
[2024-12-03 09:57:52 +0000] information/ConfigItem: Instantiated 1 ApiListener.
[2024-12-03 09:57:52 +0000] information/ConfigItem: Instantiated 540 CheckCommands.
[2024-12-03 09:57:52 +0000] information/ScriptGlobal: Dumping variables to file '/var/cache/icinga2/icinga2.vars'
[2024-12-03 09:57:52 +0000] information/cli: Finished validating the configuration file(s).
If you run multiple Icinga 2 instances, the zones.conf file (or icinga2 object list --type Endpoint and icinga2 object list --type Zone) from all affected nodes.
Here is the zones.conf from one of the satellites:
object Endpoint "satellite-0" {
// this is me
}
// the masters
object Endpoint "master-0" {
host = "master-0"
port = "443"
}
// the masters
object Endpoint "master-1" {
host = "master-1"
port = "443"
}
// the other satellites
object Endpoint "satellite-1" {
host = "satellite-1"
port = "443"
}
object Zone "master" {
endpoints = [
"master-1",
"master-0"]
}
object Zone "satellite" {
endpoints = [
"satellite-1",
"satellite-0"]
parent = "master"
}
object Zone "global-templates" {
global = true
}
object Zone "director-global" {
global = true
}
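The zones.conf from the masters was not included; for completeness, a mirrored master-side configuration for this topology would typically look roughly like the following. This is only a sketch based on the endpoint and zone names above, not the actual file from this environment:

object Endpoint "master-0" {
  // this is me
}

object Endpoint "master-1" {
  host = "master-1"
  port = "443"
}

// the satellites connect upwards to the masters,
// so no host/port is strictly required here
object Endpoint "satellite-0" { }
object Endpoint "satellite-1" { }

object Zone "master" {
  endpoints = [ "master-0", "master-1" ]
}

object Zone "satellite" {
  endpoints = [ "satellite-0", "satellite-1" ]
  parent = "master"
}

object Zone "global-templates" {
  global = true
}

object Zone "director-global" {
  global = true
}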
Additional context
Not sure what other details to provide in this context, please advise.
Thanks!