Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bugfix: configobject shutdown order #10191

Open
wants to merge 3 commits into
base: master
Choose a base branch
from

Conversation

w1ll-i-code
Copy link

This PR fixes #10179.

Shutdown the object in the correct order.

During a restart the objects are shut down in the reversed order they should be. With this the CheckerComponent is still updating objects after the IDO, Icingadb, Notifications, etc. have already been shutdown, leading to icinga2 getting out of sync with external components.

Increase activation_priority for key config objects to ensure correct startup and shutdown order

Make sure all configobjects are started and shut down in the right order, so that each object has its possible dependencies loaded before starting execution.

Copy link

cla-bot bot commented Oct 17, 2024

Thank you for your pull request. Before we can look at it, you'll need to sign a Contributor License Agreement (CLA).

Please follow instructions at https://icinga.com/company/contributor-agreement to sign the CLA.

After that, please reply here with a comment and we'll verify.

Contributors that have not signed yet: @w1ll-i-code

  • If you've already signed a CLA, it's possible we don't have your GitHub username or you're using a different email address. Please contact us if you think this is the case.

  • If you signed the CLA as a corporation, your GitHub username may not have been submitted to us. Please reach out to the responsible person in your organization.

@w1ll-i-code w1ll-i-code changed the title Bugfix/configobject shutdown order 10179 Bugfix: configobject shutdown order Oct 17, 2024
@w1ll-i-code
Copy link
Author

How do I retrigger the CLA check?

@bobapple
Copy link
Member

@cla-bot check

@cla-bot cla-bot bot added the cla/signed label Oct 17, 2024
@lippserd
Copy link
Member

Hi @w1ll-i-code,

Thanks for the PR. We will look at the changes as soon as possible.

Best regards,
Eric

@yhabteab
Copy link
Member

Hi, thank you for your contribution!

With this the CheckerComponent is still updating objects after the IDO, Icingadb, Notifications, etc. have already been shutdown, leading to icinga2 getting out of sync with external components.

Can you confirm that you have tested this PR and that it resolves your problem described in the referenced issue? Did you notice that the checker is also subscribed to the checkable events? Meaning, this shouldn't provide you any benefit, since the checker will drop the checkables from its queue if they are stopped first.

The primary issue here is not the deactivation order of the objects, but rather how they are distributed in a cluster setup.
To illustrate, while a host object is active on one node, its associated notification object can become active on another node. Consequently, each time the host object encounters a problem state, this node must sync the information to the other node to trigger a notification. However, such cluster events can disappear mysteriously if the server is restarted or reloaded at a wrong time. There is already an issue (#10157) and a draft PR #10161 to address this.

@w1ll-i-code
Copy link
Author

Yes, I can confirm that this PR did fix the issues we were experiencing, at least from the specific tests I run and which can be found attached to the original issue. I was hoping on a sooner response, but the fix is now running in production and I am sure if anything is still missing, I'll get some feedback in. With the setup I provided in the reproduce, I could observe several lost entries to the ido on every restart. A problem which did no longer occur, once I made the change to the shutdown order. Knowing that I also took the liberty to rearrange some other objects in the startup/shutdown order to behave in accordance with my findings.

Did you notice that the checker is also subscribed to the checkable events?

I have to admit, I am usually not a C++ developer and have had no prior experience with this particular part of the icinga2 code base. I did, in fact, not know that. However I can tell you that this particular fix did resolve the issue, from the test I did perform...

The primary issue here is not the deactivation order of the objects, but rather how they are distributed in a cluster setup.

I don't think that's the case, as neither I for the reproducing of the issue, nor the client first reporting this issue have a cluster setup for icinga2. You can have a look at the docker-compose I attached to the issue, to reproduce the bug, to confirm this for yourself. This can be observed on a single icinga2 instance running by its own, without any other cluster nodes or even satellites.

@Al2Klimov Al2Klimov self-requested a review October 23, 2024 10:31
// The higher the activation priority, the later the config object will be
// loaded, with the CheckerComponent at 300 being last. To make sure they
// are shutting down in the right order, we need to shut down the objects
// with the highest priority first.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is true and already the case since 8ad1717.

if (a->GetActivationPriority() > b->GetActivationPriority())
return true;
return false;
return a->GetActivationPriority() < b->GetActivationPriority();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But this reverts it and does the exact opposite now. This lambda "returns ​true if the first argument (...) is ordered before (...) the second." That's what the code did. First argument (a) has a higher (>) GetActivationPriority = first argument is ordered before the second.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, you're right. I seem to have tested the changes in tandem, as they would not be effective together. I'll update the PR accordingly.

Comment on lines 605 to 611
<< "Deactivate() called for config object '"
<< object->GetName()
<< "' with type '"
<< type->GetName()
<< "' and priority "
<< type->GetActivationPriority()
<< ".";
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While too long lines may be hard to read, the other extreme neither contributes to readability.

Suggested change
<< "Deactivate() called for config object '"
<< object->GetName()
<< "' with type '"
<< type->GetName()
<< "' and priority "
<< type->GetActivationPriority()
<< ".";
<< "Deactivate() called for config object '" << object->GetName()
<< "' with type '" << type->GetName()
<< "' and priority " << type->GetActivationPriority() << ".";

@@ -10,6 +10,8 @@ namespace icinga

abstract class DbConnection : ConfigObject
{
activation_priority 250;

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

activation_priority is to be used on final types as Ido*sqlConnection, not abstract classes.

@yhabteab
Copy link
Member

yhabteab commented Oct 23, 2024

Yes, I can confirm that this PR did fix the issues we were experiencing, at least from the specific tests I run and which can be found attached to the original issue.

If this PR indeed fix your specific issue, it is likely not due to the deactivation priority of the checker or the API listener, but maybe the new priority order of the DbConnection class and the notification component. As I already said, if the checkable is stopped first the checker will immediately drop it from its queues. Likewise, the change for the API listener or the endpoint shouldn't be necessary at all, as you don't want to accept new connections while you're on the shutdown phase.

Once the rust program has found some missing state_history and/or notification, you can verify that by looking at the service history.

Anyway, can you please explain in #10179 what exactly you are missing in that screenshot of Icinga DB Web (#10179 (comment))? As far as I can see, there is nothing wrong with that service history and I actually did try to reproduce the issue on my end as you described, but your Rust script just keeps crashing on startup:

[15823/144776] Analyzing state changes for host-icinga-lost-statehistory-00946!service-icinga-lost-statechange-05
[15824/144776] Analyzing state changes for icinga2!disk
thread 'main' panicked at src/icingadb.rs:133:75:
called `Option::unwrap()` on a `None` value

But at least it claimed to have found an error for host-icinga-lost-statehistory-00868!service-icinga-lost-statechange-04:

[217/144776] Analyzing state changes for host-icinga-lost-statehistory-00868!service-icinga-lost-statechange-04
Idk what's going on host-icinga-lost-statehistory-00868!service-icinga-lost-statechange-04: ObjectNotifications { service_id: "004D960037D7EAE178416B5CD110F13F28FD1542", hard_state_changes: 1, notification_count: 2 }
Bildschirmfoto 2024-10-23 um 12 48 47

As you can see in Icinga DB Web, this is not an error, the first notification is triggered due to the state change from PENDING to WARNING and the second one is a reminder notification, since you did not set the notification.interval to zero, you will be re-notified after exactly 30m, which is the case here.

{
    "results": [
        {
            "attrs": {
                "__name": "host-icinga-lost-statehistory-00868!service-icinga-lost-statechange-04!apply-nt-log-notifications",
                "active": true,
                "command": "nc-log-notification",
                "ha_mode": 0,
                "host_name": "host-icinga-lost-statehistory-00868",
                "interval": 1800,
...

@w1ll-i-code
Copy link
Author

w1ll-i-code commented Oct 23, 2024

If this PR indeed fix your specific issue, it is likely not due to the deactivation priority of the checker or the API listener, but maybe the new priority order of the DbConnection class and the notification component.

No it is not. I made those changes only after I confirmed it works with the reversed shutdown order....

Anyway, can you please explain in #10179 what exactly you are missing in that screenshot of Icinga DB Web (#10179 (comment))?

I'm sorry, I should have clarified more. From the bottom up, you see a hard state change from ok to warning and the notification that was sent. After that, you see a state change from hard ok to soft warning, meaning the service had a previous hard state changed to ok and it and the corresponding notification are not present in the history.

I actually did try to reproduce the issue on my end as you described, but your Rust script just keeps crashing on startup

had removed those services manually as I did not need them. I did not think that it would break my script.

since you did not set the notification.interval to zero, you will be re-notified after exactly 30m, which is the case here.

IYou are right, I only changed the notification interval on the test machine, but forgot to do the same locally before sending the basket. I apologize for that...

@w1ll-i-code w1ll-i-code force-pushed the bugfix/configobject-shutdown-order-10179 branch from 4fe3ba3 to 141dedf Compare November 5, 2024 10:19
@Al2Klimov Al2Klimov self-requested a review November 5, 2024 13:21
Copy link
Member

@Al2Klimov Al2Klimov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, so these are the current APs:

lib/base/*logger.ti:12: activation_priority -100;
lib/icinga/icingaapplication.ti:12:     activation_priority -50;
lib/icinga/downtime.ti:23:      activation_priority -10;
lib/*/*.ti:*:      //activation_priority 0;
lib/icinga/scheduleddowntime.ti:24:     activation_priority 20;
lib/remote/apilistener.ti:15:   activation_priority 50;

lib/compat/compatlogger.ti:13:  activation_priority 100;
lib/compat/externalcommandlistener.ti:13:       activation_priority 100;
lib/db_ido_*sql/ido*sqlconnection.ti:12:        activation_priority 100;
lib/icingadb/icingadb.ti:13:    activation_priority 100;
lib/livestatus/livestatuslistener.ti:12:        activation_priority 100;
lib/perfdata/*writer.ti:12:     activation_priority 100;

lib/notification/notificationcomponent.ti:12:   activation_priority 200;
lib/checker/checkercomponent.ti:12:     activation_priority 300;

All the data outputs (and externalcommandlistener.ti which is rather accidentally there and I'd not touch now) have a harmonized AP of 100. I'd like to keep it harmonized. Btw. let's literally keep 100 to keep the diff small. If you need data outputs later than notificationcomponent.ti, just lower the latter AP and keep the 100-s.

<walloftext for="devs">
Now, if we travel back in time to better understand how we got here via git log -p -U0 and use the pager to search for "activation_priority", we get:

+- whole .ti files

and the stuff that matters

</walloftext>

lib/remote/apilistener.ti Outdated Show resolved Hide resolved
lib/icinga/notification.ti Show resolved Hide resolved
Copy link

cla-bot bot commented Nov 11, 2024

Thank you for your pull request and welcome to our community. We could not parse the GitHub identity of the following contributors: William Calliari.
This is most likely caused by a git client misconfiguration; please make sure to:

  1. check if your git client is configured with an email to sign commits git config --list | grep email
  2. If not, set it up using git config --global user.email [email protected]
  3. Make sure that the git commit email is configured in your GitHub account settings, see https://github.com/settings/emails

@cla-bot cla-bot bot removed the cla/signed label Nov 11, 2024
@w1ll-i-code
Copy link
Author

@Al2Klimov Thanks for the hints. Now everything should be in order.

@w1ll-i-code w1ll-i-code force-pushed the bugfix/configobject-shutdown-order-10179 branch from c664d8b to f3ee453 Compare November 11, 2024 10:55
@cla-bot cla-bot bot added the cla/signed label Nov 11, 2024
@@ -10,7 +10,7 @@ namespace icinga

class IcingaDB : ConfigObject
{
activation_priority 100;
activation_priority 250;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All the data outputs (and externalcommandlistener.ti which is rather accidentally there and I'd not touch now) have a harmonized AP of 100. I'd like to keep it harmonized. Btw. let's literally keep 100 to keep the diff small. If you need data outputs later than notificationcomponent.ti, just lower the latter AP and keep the 100-s.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The problem is that NotificationComponent has a activation_priority of 200, which means, there is some time, where notifications are already/still being sent, but the event is not written into the db. I'd like to avoid that entirely.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Then lower NotificationComponent's activation_priority below 100 instead.

@w1ll-i-code w1ll-i-code force-pushed the bugfix/configobject-shutdown-order-10179 branch from f3ee453 to c25007c Compare November 12, 2024 10:44
@Al2Klimov Al2Klimov self-requested a review November 12, 2024 11:07
Copy link
Member

@Al2Klimov Al2Klimov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Much clearer the diff is! 👍
However:

@@ -10,6 +10,8 @@ namespace icinga

class Endpoint : ConfigObject
{
activation_priority 300;

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is this for? Neither Endpoint overrides #Start() or #Stop(), nor any code calls Endpoint#IsActive().

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because I did not notice till rn, that all communication with the endpoints also counts as ApiListener.... I expected them to behave like the other config objects... My brain hurts now.

@@ -9,7 +9,7 @@ namespace icinga

class NotificationComponent : ConfigObject
{
activation_priority 200;
activation_priority 75;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#10191 (comment) sounds like you'd like to do the opposite, i.e to leave this as-is. Imagine, all data outputs (100) are shut down or not yet online, but notifications (75) get sent – silently.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#10191 (comment) This was your suggestion.... I honestly find this code so confusing that I just accepted it as is.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I was fully focused on cleaning up the diff.
Anyway, now after thinking about what's left, I suspect what matters in your case is Notification -5 and this one is not needed or even harmful.

@w1ll-i-code
Copy link
Author

w1ll-i-code commented Nov 13, 2024

Somewhere down the line while applying your suggested improvements, the fix broke again. I'll try to reproduce exactly what went wrong.

Make sure all configobjects are started and shut down in an order, such
that all state changes of monitoring objects are correctly processed
during the shutdown.
@w1ll-i-code w1ll-i-code force-pushed the bugfix/configobject-shutdown-order-10179 branch from c25007c to 4a39831 Compare November 15, 2024 16:29
@w1ll-i-code w1ll-i-code force-pushed the bugfix/configobject-shutdown-order-10179 branch from ac12be8 to b97894d Compare November 22, 2024 08:24
@w1ll-i-code w1ll-i-code force-pushed the bugfix/configobject-shutdown-order-10179 branch from b97894d to 39977d9 Compare November 22, 2024 09:54
@@ -10,7 +10,7 @@ namespace icinga

class IcingaDB : ConfigObject
{
activation_priority 100;
activation_priority -50;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No matter whether Icinga DB purges Redis and re-inserts everything – or performs a diff – given #10151, not yet activated objects – all checkables – will effectively be deleted. This is not acceptable, so if -50 is actually needed here, we have to:

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not yet activated objects – all checkables – will effectively be deleted

Idk, how you came up to that conclusion, nor how this relates to #10057 or #10151 in anyway. Icinga DB won't do anything if the checkables aren't truly deleted.

} else if (!object->IsActive() &&
object->GetExtension("ConfigObjectDeleted")) { // same as in apilistener-configsync.cpp

I already checked this PR last week and noticed that Icinga DB, IDO etc. are not activated first as before, but last in the same way they are deactivated. IMHO the possible solution I am thinking of is to not touch the activation order of these objects on startup and deactivate them last on shutdown/reload, but this will not be achievable by simply changing the activation_priority field. So I wanted to wait until tomorrow and discuss it in our next meeting.

@@ -598,10 +598,10 @@ void ConfigObject::StopObjects()
continue;

for (const ConfigObject::Ptr& object : dtype->GetObjects()) {
#ifdef I2_DEBUG
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please revert this!

#endif /* I2_DEBUG */
<< "Deactivate() called for config object '" << object->GetName()
<< "' with type '" << type->GetName()
<< "and priority " << type->GetActivationPriority() << "'.";
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
<< "and priority " << type->GetActivationPriority() << "'.";
<< "' and priority '" << type->GetActivationPriority() << "'.";

@@ -10,7 +10,7 @@ namespace icinga

class IcingaDB : ConfigObject
{
activation_priority 100;
activation_priority -50;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not yet activated objects – all checkables – will effectively be deleted

Idk, how you came up to that conclusion, nor how this relates to #10057 or #10151 in anyway. Icinga DB won't do anything if the checkables aren't truly deleted.

} else if (!object->IsActive() &&
object->GetExtension("ConfigObjectDeleted")) { // same as in apilistener-configsync.cpp

I already checked this PR last week and noticed that Icinga DB, IDO etc. are not activated first as before, but last in the same way they are deactivated. IMHO the possible solution I am thinking of is to not touch the activation order of these objects on startup and deactivate them last on shutdown/reload, but this will not be achievable by simply changing the activation_priority field. So I wanted to wait until tomorrow and discuss it in our next meeting.

@w1ll-i-code
Copy link
Author

Hi @yhabteab how did the meeting yesterday go? Did you come to any resolution regarding this issue?

@yhabteab
Copy link
Member

Hi @yhabteab how did the meeting yesterday go? Did you come to any resolution regarding this issue?

Hi @w1ll-i-code, I'm afraid that changing the order of activation_priority won't fully solve the problem, or if it partially does, it will break the existing activation_priority on startup. I'm really sorry for the inconvenience, but I think the underlying problem is much deeper than just changing the activation and deactivation order of objects. It's going to take a lot more time and testing to fix it properly. I don't want to say that this PR is irrelevant, but even with this PR, remotely generated events will still get lost, so we'd love to put in more effort and provide a fix in addition to this PR. However, as we are currently fully occupied with preparations for the next Icinga 2 bugfix release, which unfortunately does not include this, we'll give you an update after the next 2-3 weeks once Icinga 2.14.4 is ready.

@w1ll-i-code
Copy link
Author

w1ll-i-code commented Nov 27, 2024

@yhabteab Thanks for the quick reply. I have still some questions:

  1. What exactly do you mean with remotely generated events will still get lost? Messages between two master nodes? Or are you talking about CheckResults coming in over the ApiListener?
  2. What does it will break the existing activation_priority on startup. mean exactly? What isn't working anymore afterwards?

@yhabteab
Copy link
Member

yhabteab commented Nov 28, 2024

  1. What exactly do you mean with remotely generated events will still get lost? Messages between two master nodes? Or are you talking about CheckResults coming in over the ApiListener?

Yes, I'm referring to check result events coming from other endpoints.

  1. What does it will break the existing activation_priority on startup. mean exactly? What isn't working anymore afterwards?

Not exactly something is not working, but so far IcingaDB is activated after hosts, services, downtimes etc. (originally I thought it was actually the other way round, i.e. IcingaDB is activated first and then the other objects, that's my mistake), but with this PR its activation priority has been set to -50, which means it will definitely be activated before anything else and probably run into problems like in #10151 more prominently (edit: NVM! that's irrelevant here, but I'll have to test it out properly, which I don't have that much time for at the moment, but I'll give you an update in the coming weeks).

@w1ll-i-code
Copy link
Author

w1ll-i-code commented Nov 28, 2024

Yes, I'm referring to check result events coming from other endpoints.

@yhabteab The ApiListener should already be shut down at that point (activation_priority 50), right? So it wouldn't impact it?

@yhabteab
Copy link
Member

Unfortunately, stopping the ApiListener object does not stop the listener from accepting new connections and does not immediately disconnect all already connected endpoints either. All the active connections will only be terminated when the process is terminated, not after the API listener has been stopped. @julianbrost was/is working on fixing this, i.e. when the listener is stopped, to manually disconnect all connected endpoints and reject any new connection attempts, but as mentioned before, this will require excessive testing and more time to invest in and will therefore not make it into the next bugfix release.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Icinga2 looses state history and notifications during restart
5 participants