Bugfix: configobject shutdown order #10191

w1ll-i-code · 2024-10-17T07:22:25Z

This PR fixes #10179.

Shutdown the object in the correct order.

During a restart the objects are shut down in the reversed order they should be. With this the CheckerComponent is still updating objects after the IDO, Icingadb, Notifications, etc. have already been shutdown, leading to icinga2 getting out of sync with external components.

Increase activation_priority for key config objects to ensure correct startup and shutdown order

Make sure all configobjects are started and shut down in the right order, so that each object has its possible dependencies loaded before starting execution.

cla-bot · 2024-10-17T07:22:28Z

Thank you for your pull request. Before we can look at it, you'll need to sign a Contributor License Agreement (CLA).

Please follow instructions at https://icinga.com/company/contributor-agreement to sign the CLA.

After that, please reply here with a comment and we'll verify.

Contributors that have not signed yet: @w1ll-i-code

If you've already signed a CLA, it's possible we don't have your GitHub username or you're using a different email address. Please contact us if you think this is the case.
If you signed the CLA as a corporation, your GitHub username may not have been submitted to us. Please reach out to the responsible person in your organization.

w1ll-i-code · 2024-10-17T11:41:42Z

How do I retrigger the CLA check?

bobapple · 2024-10-17T12:38:32Z

@cla-bot check

lippserd · 2024-10-22T13:38:04Z

Hi @w1ll-i-code,

Thanks for the PR. We will look at the changes as soon as possible.

Best regards,
Eric

yhabteab · 2024-10-22T15:09:40Z

Hi, thank you for your contribution!

With this the CheckerComponent is still updating objects after the IDO, Icingadb, Notifications, etc. have already been shutdown, leading to icinga2 getting out of sync with external components.

Can you confirm that you have tested this PR and that it resolves your problem described in the referenced issue? Did you notice that the checker is also subscribed to the checkable events? Meaning, this shouldn't provide you any benefit, since the checker will drop the checkables from its queue if they are stopped first.

The primary issue here is not the deactivation order of the objects, but rather how they are distributed in a cluster setup.
To illustrate, while a host object is active on one node, its associated notification object can become active on another node. Consequently, each time the host object encounters a problem state, this node must sync the information to the other node to trigger a notification. However, such cluster events can disappear mysteriously if the server is restarted or reloaded at a wrong time. There is already an issue (#10157) and a draft PR #10161 to address this.

w1ll-i-code · 2024-10-23T07:45:28Z

Yes, I can confirm that this PR did fix the issues we were experiencing, at least from the specific tests I run and which can be found attached to the original issue. I was hoping on a sooner response, but the fix is now running in production and I am sure if anything is still missing, I'll get some feedback in. With the setup I provided in the reproduce, I could observe several lost entries to the ido on every restart. A problem which did no longer occur, once I made the change to the shutdown order. Knowing that I also took the liberty to rearrange some other objects in the startup/shutdown order to behave in accordance with my findings.

Did you notice that the checker is also subscribed to the checkable events?

I have to admit, I am usually not a C++ developer and have had no prior experience with this particular part of the icinga2 code base. I did, in fact, not know that. However I can tell you that this particular fix did resolve the issue, from the test I did perform...

The primary issue here is not the deactivation order of the objects, but rather how they are distributed in a cluster setup.

I don't think that's the case, as neither I for the reproducing of the issue, nor the client first reporting this issue have a cluster setup for icinga2. You can have a look at the docker-compose I attached to the issue, to reproduce the bug, to confirm this for yourself. This can be observed on a single icinga2 instance running by its own, without any other cluster nodes or even satellites.

Al2Klimov · 2024-10-23T10:40:18Z

lib/base/configobject.cpp

+	// The higher the activation priority, the later the config object will be
+	// loaded, with the CheckerComponent at 300 being last. To make sure they
+	// are shutting down in the right order, we need to shut down the objects
+	// with the highest priority first.


This is true and already the case since 8ad1717.

Al2Klimov · 2024-10-23T10:50:21Z

lib/base/configobject.cpp

-		if (a->GetActivationPriority() > b->GetActivationPriority())
-			return true;
-		return false;
+		return a->GetActivationPriority() < b->GetActivationPriority();


But this reverts it and does the exact opposite now. This lambda "returns true if the first argument (...) is ordered before (...) the second." That's what the code did. First argument (a) has a higher (>) GetActivationPriority = first argument is ordered before the second.

Thanks, you're right. I seem to have tested the changes in tandem, as they would not be effective together. I'll update the PR accordingly.

Al2Klimov · 2024-10-23T10:53:37Z

lib/base/configobject.cpp

+				<< "Deactivate() called for config object '"
+				<< object->GetName()
+				<< "' with type '"
+				<< type->GetName()
+				<< "' and priority "
+				<< type->GetActivationPriority()
+				<< ".";


While too long lines may be hard to read, the other extreme neither contributes to readability.

Suggested change

<< "Deactivate() called for config object '"

<< object->GetName()

<< "' with type '"

<< type->GetName()

<< "' and priority "

<< type->GetActivationPriority()

<< ".";

<< "Deactivate() called for config object '" << object->GetName()

<< "' with type '" << type->GetName()

<< "' and priority " << type->GetActivationPriority() << ".";

Al2Klimov · 2024-10-23T10:55:59Z

lib/db_ido/dbconnection.ti

@@ -10,6 +10,8 @@ namespace icinga

 abstract class DbConnection : ConfigObject
 {
+	activation_priority 250;
+


activation_priority is to be used on final types as Ido*sqlConnection, not abstract classes.

yhabteab · 2024-10-23T10:58:36Z

Yes, I can confirm that this PR did fix the issues we were experiencing, at least from the specific tests I run and which can be found attached to the original issue.

If this PR indeed fix your specific issue, it is likely not due to the deactivation priority of the checker or the API listener, but maybe the new priority order of the DbConnection class and the notification component. As I already said, if the checkable is stopped first the checker will immediately drop it from its queues. Likewise, the change for the API listener or the endpoint shouldn't be necessary at all, as you don't want to accept new connections while you're on the shutdown phase.

Once the rust program has found some missing state_history and/or notification, you can verify that by looking at the service history.

Anyway, can you please explain in #10179 what exactly you are missing in that screenshot of Icinga DB Web (#10179 (comment))? As far as I can see, there is nothing wrong with that service history and I actually did try to reproduce the issue on my end as you described, but your Rust script just keeps crashing on startup:

[15823/144776] Analyzing state changes for host-icinga-lost-statehistory-00946!service-icinga-lost-statechange-05
[15824/144776] Analyzing state changes for icinga2!disk
thread 'main' panicked at src/icingadb.rs:133:75:
called `Option::unwrap()` on a `None` value

But at least it claimed to have found an error for host-icinga-lost-statehistory-00868!service-icinga-lost-statechange-04:

[217/144776] Analyzing state changes for host-icinga-lost-statehistory-00868!service-icinga-lost-statechange-04
Idk what's going on host-icinga-lost-statehistory-00868!service-icinga-lost-statechange-04: ObjectNotifications { service_id: "004D960037D7EAE178416B5CD110F13F28FD1542", hard_state_changes: 1, notification_count: 2 }

As you can see in Icinga DB Web, this is not an error, the first notification is triggered due to the state change from PENDING to WARNING and the second one is a reminder notification, since you did not set the notification.interval to zero, you will be re-notified after exactly 30m, which is the case here.

{
    "results": [
        {
            "attrs": {
                "__name": "host-icinga-lost-statehistory-00868!service-icinga-lost-statechange-04!apply-nt-log-notifications",
                "active": true,
                "command": "nc-log-notification",
                "ha_mode": 0,
                "host_name": "host-icinga-lost-statehistory-00868",
                "interval": 1800,
...

w1ll-i-code · 2024-10-23T13:18:10Z

If this PR indeed fix your specific issue, it is likely not due to the deactivation priority of the checker or the API listener, but maybe the new priority order of the DbConnection class and the notification component.

No it is not. I made those changes only after I confirmed it works with the reversed shutdown order....

Anyway, can you please explain in #10179 what exactly you are missing in that screenshot of Icinga DB Web (#10179 (comment))?

I'm sorry, I should have clarified more. From the bottom up, you see a hard state change from ok to warning and the notification that was sent. After that, you see a state change from hard ok to soft warning, meaning the service had a previous hard state changed to ok and it and the corresponding notification are not present in the history.

I actually did try to reproduce the issue on my end as you described, but your Rust script just keeps crashing on startup

had removed those services manually as I did not need them. I did not think that it would break my script.

since you did not set the notification.interval to zero, you will be re-notified after exactly 30m, which is the case here.

IYou are right, I only changed the notification interval on the test machine, but forgot to do the same locally before sending the basket. I apologize for that...

Al2Klimov

Ok, so these are the current APs:

lib/base/*logger.ti:12: activation_priority -100;
lib/icinga/icingaapplication.ti:12:     activation_priority -50;
lib/icinga/downtime.ti:23:      activation_priority -10;
lib/*/*.ti:*:      //activation_priority 0;
lib/icinga/scheduleddowntime.ti:24:     activation_priority 20;
lib/remote/apilistener.ti:15:   activation_priority 50;

lib/compat/compatlogger.ti:13:  activation_priority 100;
lib/compat/externalcommandlistener.ti:13:       activation_priority 100;
lib/db_ido_*sql/ido*sqlconnection.ti:12:        activation_priority 100;
lib/icingadb/icingadb.ti:13:    activation_priority 100;
lib/livestatus/livestatuslistener.ti:12:        activation_priority 100;
lib/perfdata/*writer.ti:12:     activation_priority 100;

lib/notification/notificationcomponent.ti:12:   activation_priority 200;
lib/checker/checkercomponent.ti:12:     activation_priority 300;

All the data outputs (and externalcommandlistener.ti which is rather accidentally there and I'd not touch now) have a harmonized AP of 100. I'd like to keep it harmonized. Btw. let's literally keep 100 to keep the diff small. If you need data outputs later than notificationcomponent.ti, just lower the latter AP and keep the 100-s.

<walloftext for="devs">
Now, if we travel back in time to better understand how we got here via git log -p -U0 and use the pager to search for "activation_priority", we get:

+- whole .ti files

and the stuff that matters

</walloftext>

lib/remote/apilistener.ti

lib/icinga/notification.ti

cla-bot · 2024-11-11T09:30:29Z

Thank you for your pull request and welcome to our community. We could not parse the GitHub identity of the following contributors: William Calliari.
This is most likely caused by a git client misconfiguration; please make sure to:

check if your git client is configured with an email to sign commits git config --list | grep email
If not, set it up using git config --global user.email [email protected]
Make sure that the git commit email is configured in your GitHub account settings, see https://github.com/settings/emails

w1ll-i-code · 2024-11-11T09:33:24Z

@Al2Klimov Thanks for the hints. Now everything should be in order.

Al2Klimov · 2024-11-11T11:36:09Z

lib/icingadb/icingadb.ti

@@ -10,7 +10,7 @@ namespace icinga

 class IcingaDB : ConfigObject
 {
-	activation_priority 100;
+	activation_priority 250;


All the data outputs (and externalcommandlistener.ti which is rather accidentally there and I'd not touch now) have a harmonized AP of 100. I'd like to keep it harmonized. Btw. let's literally keep 100 to keep the diff small. If you need data outputs later than notificationcomponent.ti, just lower the latter AP and keep the 100-s.

The problem is that NotificationComponent has a activation_priority of 200, which means, there is some time, where notifications are already/still being sent, but the event is not written into the db. I'd like to avoid that entirely.

Then lower NotificationComponent's activation_priority below 100 instead.

Al2Klimov

Much clearer the diff is! 👍
However:

Al2Klimov · 2024-11-12T11:26:05Z

lib/remote/endpoint.ti

@@ -10,6 +10,8 @@ namespace icinga

 class Endpoint : ConfigObject
 {
+	activation_priority 300;
+


What is this for? Neither Endpoint overrides #Start() or #Stop(), nor any code calls Endpoint#IsActive().

Because I did not notice till rn, that all communication with the endpoints also counts as ApiListener.... I expected them to behave like the other config objects... My brain hurts now.

Al2Klimov · 2024-11-12T11:29:31Z

lib/notification/notificationcomponent.ti

@@ -9,7 +9,7 @@ namespace icinga

 class NotificationComponent : ConfigObject
 {
-	activation_priority 200;
+	activation_priority 75;


#10191 (comment) sounds like you'd like to do the opposite, i.e to leave this as-is. Imagine, all data outputs (100) are shut down or not yet online, but notifications (75) get sent – silently.

#10191 (comment) This was your suggestion.... I honestly find this code so confusing that I just accepted it as is.

Sorry, I was fully focused on cleaning up the diff.
Anyway, now after thinking about what's left, I suspect what matters in your case is Notification -5 and this one is not needed or even harmful.

w1ll-i-code · 2024-11-13T08:07:15Z

Somewhere down the line while applying your suggested improvements, the fix broke again. I'll try to reproduce exactly what went wrong.

Make sure all configobjects are started and shut down in an order, such that all state changes of monitoring objects are correctly processed during the shutdown.

Al2Klimov · 2024-11-25T14:00:46Z

lib/icingadb/icingadb.ti

@@ -10,7 +10,7 @@ namespace icinga

 class IcingaDB : ConfigObject
 {
-	activation_priority 100;
+	activation_priority -50;


No matter whether Icinga DB purges Redis and re-inserts everything – or performs a diff – given #10151, not yet activated objects – all checkables – will effectively be deleted. This is not acceptable, so if -50 is actually needed here, we have to:

merge Call ConfigObject#Register() after #OnAllConfigLoaded() for runtime creation #10057 not to need IcingaDB: Don't sync partially initialised objects #10151

revert IcingaDB: Don't sync partially initialised objects #10151

not yet activated objects – all checkables – will effectively be deleted

Idk, how you came up to that conclusion, nor how this relates to #10057 or #10151 in anyway. Icinga DB won't do anything if the checkables aren't truly deleted.

icinga2/lib/icingadb/icingadb-objects.cpp

Lines 2781 to 2782 in 3218908

} else if (!object->IsActive() &&

object->GetExtension("ConfigObjectDeleted")) { // same as in apilistener-configsync.cpp

I already checked this PR last week and noticed that Icinga DB, IDO etc. are not activated first as before, but last in the same way they are deactivated. IMHO the possible solution I am thinking of is to not touch the activation order of these objects on startup and deactivate them last on shutdown/reload, but this will not be achievable by simply changing the activation_priority field. So I wanted to wait until tomorrow and discuss it in our next meeting.

yhabteab · 2024-11-18T09:51:25Z

lib/base/configobject.cpp

@@ -598,10 +598,10 @@ void ConfigObject::StopObjects()
 			continue;

 		for (const ConfigObject::Ptr& object : dtype->GetObjects()) {
-#ifdef I2_DEBUG


Please revert this!

yhabteab · 2024-11-18T10:49:57Z

lib/base/configobject.cpp

-#endif /* I2_DEBUG */
+				<< "Deactivate() called for config object '" << object->GetName()
+				<< "' with type '" << type->GetName()
+				<< "and priority " << type->GetActivationPriority() << "'.";


Suggested change

<< "and priority " << type->GetActivationPriority() << "'.";

<< "' and priority '" << type->GetActivationPriority() << "'.";

yhabteab · 2024-11-25T16:30:59Z

lib/icingadb/icingadb.ti

@@ -10,7 +10,7 @@ namespace icinga

 class IcingaDB : ConfigObject
 {
-	activation_priority 100;
+	activation_priority -50;


not yet activated objects – all checkables – will effectively be deleted

Idk, how you came up to that conclusion, nor how this relates to #10057 or #10151 in anyway. Icinga DB won't do anything if the checkables aren't truly deleted.

icinga2/lib/icingadb/icingadb-objects.cpp

Lines 2781 to 2782 in 3218908

} else if (!object->IsActive() &&

object->GetExtension("ConfigObjectDeleted")) { // same as in apilistener-configsync.cpp

I already checked this PR last week and noticed that Icinga DB, IDO etc. are not activated first as before, but last in the same way they are deactivated. IMHO the possible solution I am thinking of is to not touch the activation order of these objects on startup and deactivate them last on shutdown/reload, but this will not be achievable by simply changing the activation_priority field. So I wanted to wait until tomorrow and discuss it in our next meeting.

w1ll-i-code · 2024-11-27T09:01:40Z

Hi @yhabteab how did the meeting yesterday go? Did you come to any resolution regarding this issue?

yhabteab · 2024-11-27T12:43:08Z

Hi @yhabteab how did the meeting yesterday go? Did you come to any resolution regarding this issue?

Hi @w1ll-i-code, I'm afraid that changing the order of activation_priority won't fully solve the problem, or if it partially does, it will break the existing activation_priority on startup. I'm really sorry for the inconvenience, but I think the underlying problem is much deeper than just changing the activation and deactivation order of objects. It's going to take a lot more time and testing to fix it properly. I don't want to say that this PR is irrelevant, but even with this PR, remotely generated events will still get lost, so we'd love to put in more effort and provide a fix in addition to this PR. However, as we are currently fully occupied with preparations for the next Icinga 2 bugfix release, which unfortunately does not include this, we'll give you an update after the next 2-3 weeks once Icinga 2.14.4 is ready.

w1ll-i-code · 2024-11-27T13:26:17Z

@yhabteab Thanks for the quick reply. I have still some questions:

What exactly do you mean with remotely generated events will still get lost? Messages between two master nodes? Or are you talking about CheckResults coming in over the ApiListener?
What does it will break the existing activation_priority on startup. mean exactly? What isn't working anymore afterwards?

yhabteab · 2024-11-28T13:52:45Z

What exactly do you mean with remotely generated events will still get lost? Messages between two master nodes? Or are you talking about CheckResults coming in over the ApiListener?

Yes, I'm referring to check result events coming from other endpoints.

What does it will break the existing activation_priority on startup. mean exactly? What isn't working anymore afterwards?

Not exactly something is not working, but so far IcingaDB is activated after hosts, services, downtimes etc. (originally I thought it was actually the other way round, i.e. IcingaDB is activated first and then the other objects, that's my mistake), but with this PR its activation priority has been set to -50, which means it will definitely be activated before anything else and ~~probably run into problems like in #10151 more prominently~~ (edit: NVM! that's irrelevant here, but I'll have to test it out properly, which I don't have that much time for at the moment, but I'll give you an update in the coming weeks).

w1ll-i-code · 2024-11-28T14:15:20Z

Yes, I'm referring to check result events coming from other endpoints.

@yhabteab The ApiListener should already be shut down at that point (activation_priority 50), right? So it wouldn't impact it?

yhabteab · 2024-11-29T09:57:10Z

Unfortunately, stopping the ApiListener object does not stop the listener from accepting new connections and does not immediately disconnect all already connected endpoints either. All the active connections will only be terminated when the process is terminated, not after the API listener has been stopped. @julianbrost was/is working on fixing this, i.e. when the listener is stopped, to manually disconnect all connected endpoints and reject any new connection attempts, but as mentioned before, this will require excessive testing and more time to invest in and will therefore not make it into the next bugfix release.

w1ll-i-code changed the title ~~Bugfix/configobject shutdown order 10179~~ Bugfix: configobject shutdown order Oct 17, 2024

cla-bot bot added the cla/signed label Oct 17, 2024

Al2Klimov self-requested a review October 23, 2024 10:31

Al2Klimov requested changes Oct 23, 2024

View reviewed changes

w1ll-i-code force-pushed the bugfix/configobject-shutdown-order-10179 branch from 4fe3ba3 to 141dedf Compare November 5, 2024 10:19

Al2Klimov self-requested a review November 5, 2024 13:21

Al2Klimov reviewed Nov 5, 2024

View reviewed changes

lib/remote/apilistener.ti Outdated Show resolved Hide resolved

lib/icinga/notification.ti Show resolved Hide resolved

cla-bot bot removed the cla/signed label Nov 11, 2024

w1ll-i-code force-pushed the bugfix/configobject-shutdown-order-10179 branch from c664d8b to f3ee453 Compare November 11, 2024 10:55

cla-bot bot added the cla/signed label Nov 11, 2024

Al2Klimov requested changes Nov 11, 2024

View reviewed changes

w1ll-i-code force-pushed the bugfix/configobject-shutdown-order-10179 branch from f3ee453 to c25007c Compare November 12, 2024 10:44

Al2Klimov self-requested a review November 12, 2024 11:07

Al2Klimov reviewed Nov 12, 2024

View reviewed changes

Al2Klimov requested review from julianbrost and yhabteab November 12, 2024 15:31

Al2Klimov removed request for julianbrost and yhabteab November 13, 2024 13:45

Al2Klimov assigned w1ll-i-code Nov 13, 2024

yhabteab mentioned this pull request Nov 13, 2024

Icinga2 looses state history and notifications during restart #10179

Open

w1ll-i-code added 2 commits November 15, 2024 17:26

Adjust activation_priority for key config objects

71c0b96

Make sure all configobjects are started and shut down in an order, such that all state changes of monitoring objects are correctly processed during the shutdown.

Update AUTHORS

4a39831

w1ll-i-code force-pushed the bugfix/configobject-shutdown-order-10179 branch from c25007c to 4a39831 Compare November 15, 2024 16:29

w1ll-i-code requested a review from Al2Klimov November 19, 2024 11:58

w1ll-i-code force-pushed the bugfix/configobject-shutdown-order-10179 branch from ac12be8 to b97894d Compare November 22, 2024 08:24

Activate users and notifications at the same time.

39977d9

w1ll-i-code force-pushed the bugfix/configobject-shutdown-order-10179 branch from b97894d to 39977d9 Compare November 22, 2024 09:54

Al2Klimov reviewed Nov 25, 2024

View reviewed changes

yhabteab reviewed Nov 25, 2024

View reviewed changes

	} else if (!object->IsActive() &&
	object->GetExtension("ConfigObjectDeleted")) { // same as in apilistener-configsync.cpp

	<< "and priority " << type->GetActivationPriority() << "'.";
	<< "' and priority '" << type->GetActivationPriority() << "'.";

Bugfix: configobject shutdown order #10191

Are you sure you want to change the base?

Bugfix: configobject shutdown order #10191

Conversation

w1ll-i-code commented Oct 17, 2024

Shutdown the object in the correct order.

Increase activation_priority for key config objects to ensure correct startup and shutdown order

cla-bot bot commented Oct 17, 2024

w1ll-i-code commented Oct 17, 2024

bobapple commented Oct 17, 2024

lippserd commented Oct 22, 2024

yhabteab commented Oct 22, 2024

w1ll-i-code commented Oct 23, 2024

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

yhabteab commented Oct 23, 2024 • edited Loading

w1ll-i-code commented Oct 23, 2024 • edited Loading

Al2Klimov left a comment

Choose a reason for hiding this comment

+- whole .ti files

and the stuff that matters

cla-bot bot commented Nov 11, 2024

w1ll-i-code commented Nov 11, 2024

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Al2Klimov left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

w1ll-i-code commented Nov 13, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

w1ll-i-code commented Nov 27, 2024

yhabteab commented Nov 27, 2024

w1ll-i-code commented Nov 27, 2024 • edited Loading

yhabteab commented Nov 28, 2024 • edited Loading

w1ll-i-code commented Nov 28, 2024 • edited Loading

yhabteab commented Nov 29, 2024

yhabteab commented Oct 23, 2024 •

edited

Loading

w1ll-i-code commented Oct 23, 2024 •

edited

Loading

w1ll-i-code commented Nov 13, 2024 •

edited

Loading

w1ll-i-code commented Nov 27, 2024 •

edited

Loading

yhabteab commented Nov 28, 2024 •

edited

Loading

w1ll-i-code commented Nov 28, 2024 •

edited

Loading