Bugfix: configobject shutdown order #10191
Thank you for your pull request. Before we can look at it, you'll need to sign a Contributor License Agreement (CLA). Please follow instructions at https://icinga.com/company/contributor-agreement to sign the CLA. After that, please reply here with a comment and we'll verify. Contributors that have not signed yet: @w1ll-i-code
How do I retrigger the CLA check? |
@cla-bot check |
Hi @w1ll-i-code, Thanks for the PR. We will look at the changes as soon as possible. Best regards, |
Hi, thank you for your contribution!
Can you confirm that you have tested this PR and that it resolves your problem described in the referenced issue? Did you notice that the checker is also subscribed to the checkable events? Meaning, this shouldn't provide you any benefit, since the checker will drop the checkables from its queue if they are stopped first. The primary issue here is not the deactivation order of the objects, but rather how they are distributed in a cluster setup. |
Yes, I can confirm that this PR fixed the issues we were experiencing, at least in the specific tests I ran, which are attached to the original issue. I was hoping for a sooner response, but the fix is now running in production, and I am sure that if anything is still missing, I'll get some feedback. With the setup I provided in the reproducer, I could observe several lost entries in the IDO on every restart. That problem no longer occurred once I made the change to the shutdown order. Knowing that, I also took the liberty of rearranging some other objects in the startup/shutdown order to behave in accordance with my findings.
I have to admit, I am usually not a C++ developer and have had no prior experience with this particular part of the icinga2 code base. I did, in fact, not know that. However, I can tell you that this particular fix did resolve the issue in the tests I performed...
I don't think that's the case, as neither I, when reproducing the issue, nor the client who first reported it has a cluster setup for icinga2. You can have a look at the docker-compose file I attached to the issue to reproduce the bug and confirm this for yourself. It can be observed on a single icinga2 instance running on its own, without any other cluster nodes or even satellites.
lib/base/configobject.cpp
Outdated
// The higher the activation priority, the later the config object will be
// loaded, with the CheckerComponent at 300 being last. To make sure they
// are shutting down in the right order, we need to shut down the objects
// with the highest priority first.
This is true and already the case since 8ad1717.
lib/base/configobject.cpp
Outdated
if (a->GetActivationPriority() > b->GetActivationPriority())
return true;
return false;
return a->GetActivationPriority() < b->GetActivationPriority();
But this reverts it and does the exact opposite now. This lambda "returns true if the first argument (...) is ordered before (...) the second." That's what the code did. First argument (a) has a higher (>) GetActivationPriority = first argument is ordered before the second.
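The comparator rule quoted above can be checked in isolation. The following is a minimal sketch, not the code from lib/base/configobject.cpp: `TypeInfo` and `ShutdownOrder` are hypothetical stand-ins, and only the `>` comparison mirrors what the comment describes.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for an Icinga 2 type; only what the sketch needs.
struct TypeInfo {
    std::string name;
    int activationPriority;
};

// Returns the types in shutdown order: highest activation priority first.
// The comparator must return true when `a` is ordered before `b`, so
// "stop high priorities first" translates to `>`.
std::vector<TypeInfo> ShutdownOrder(std::vector<TypeInfo> types)
{
    auto higherFirst = [](const TypeInfo& a, const TypeInfo& b) {
        return a.activationPriority > b.activationPriority;
    };

    std::stable_sort(types.begin(), types.end(), higherFirst);

    // Postcondition: the result is ordered according to the same comparator.
    assert(std::is_sorted(types.begin(), types.end(), higherFirst));
    return types;
}
```

Swapping `>` for `<` in `higherFirst` yields the startup order instead, which is why the change under review inverted the shutdown order rather than fixing it.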
Thanks, you're right. I seem to have tested the changes in tandem, as they would not be effective together. I'll update the PR accordingly.
lib/base/configobject.cpp
Outdated
<< "Deactivate() called for config object '"
<< object->GetName()
<< "' with type '"
<< type->GetName()
<< "' and priority "
<< type->GetActivationPriority()
<< ".";
While too long lines may be hard to read, the other extreme neither contributes to readability.
<< "Deactivate() called for config object '" << object->GetName()
<< "' with type '" << type->GetName()
<< "' and priority " << type->GetActivationPriority() << ".";
lib/db_ido/dbconnection.ti
Outdated
@@ -10,6 +10,8 @@ namespace icinga

abstract class DbConnection : ConfigObject
{
activation_priority 250;

activation_priority is to be used on final types such as Ido*sqlConnection, not on abstract classes.
If this PR indeed fixes your specific issue, it is likely not due to the deactivation priority of the checker or the API listener, but maybe the new priority order of the
Anyway, can you please explain in #10179 what exactly you are missing in that screenshot of Icinga DB Web (#10179 (comment))? As far as I can see, there is nothing wrong with that service history and I actually did try to reproduce the issue on my end as you described, but your Rust script just keeps crashing on startup:
[15823/144776] Analyzing state changes for host-icinga-lost-statehistory-00946!service-icinga-lost-statechange-05
[15824/144776] Analyzing state changes for icinga2!disk
thread 'main' panicked at src/icingadb.rs:133:75:
called `Option::unwrap()` on a `None` value
But at least it claimed to have found an error for
[217/144776] Analyzing state changes for host-icinga-lost-statehistory-00868!service-icinga-lost-statechange-04
Idk what's going on host-icinga-lost-statehistory-00868!service-icinga-lost-statechange-04: ObjectNotifications { service_id: "004D960037D7EAE178416B5CD110F13F28FD1542", hard_state_changes: 1, notification_count: 2 }
As you can see in Icinga DB Web, this is not an error, the first notification is triggered due to the state change from
{
"results": [
{
"attrs": {
"__name": "host-icinga-lost-statehistory-00868!service-icinga-lost-statechange-04!apply-nt-log-notifications",
"active": true,
"command": "nc-log-notification",
"ha_mode": 0,
"host_name": "host-icinga-lost-statehistory-00868",
"interval": 1800,
...
No, it is not. I made those changes only after I confirmed it works with the reversed shutdown order....
I'm sorry, I should have clarified more. From the bottom up, you see a hard state change from ok to warning and the notification that was sent. After that, you see a state change from hard ok to soft warning, meaning the service previously had a hard state change to ok, and neither that change nor the corresponding notification is present in the history.
I had removed those services manually as I did not need them. I did not think that it would break my script.
You are right, I only changed the notification interval on the test machine, but forgot to do the same locally before sending the basket. I apologize for that...
4fe3ba3 to 141dedf
Ok, so these are the current APs:
lib/base/*logger.ti:12: activation_priority -100;
lib/icinga/icingaapplication.ti:12: activation_priority -50;
lib/icinga/downtime.ti:23: activation_priority -10;
lib/*/*.ti:*: //activation_priority 0;
lib/icinga/scheduleddowntime.ti:24: activation_priority 20;
lib/remote/apilistener.ti:15: activation_priority 50;
lib/compat/compatlogger.ti:13: activation_priority 100;
lib/compat/externalcommandlistener.ti:13: activation_priority 100;
lib/db_ido_*sql/ido*sqlconnection.ti:12: activation_priority 100;
lib/icingadb/icingadb.ti:13: activation_priority 100;
lib/livestatus/livestatuslistener.ti:12: activation_priority 100;
lib/perfdata/*writer.ti:12: activation_priority 100;
lib/notification/notificationcomponent.ti:12: activation_priority 200;
lib/checker/checkercomponent.ti:12: activation_priority 300;
All the data outputs (and externalcommandlistener.ti which is rather accidentally there and I'd not touch now) have a harmonized AP of 100. I'd like to keep it harmonized. Btw. let's literally keep 100 to keep the diff small. If you need data outputs later than notificationcomponent.ti, just lower the latter AP and keep the 100-s.
<walloftext for="devs">
Now, if we travel back in time to better understand how we got here via git log -p -U0
and use the pager to search for "activation_priority", we get:
+- whole .ti files
and the stuff that matters
</walloftext>
Thank you for your pull request and welcome to our community. We could not parse the GitHub identity of the following contributors: William Calliari.
@Al2Klimov Thanks for the hints. Now everything should be in order. |
c664d8b to f3ee453
lib/icingadb/icingadb.ti
Outdated
@@ -10,7 +10,7 @@ namespace icinga

class IcingaDB : ConfigObject
{
activation_priority 100;
activation_priority 250;
All the data outputs (and externalcommandlistener.ti which is rather accidentally there and I'd not touch now) have a harmonized AP of 100. I'd like to keep it harmonized. Btw. let's literally keep 100 to keep the diff small. If you need data outputs later than notificationcomponent.ti, just lower the latter AP and keep the 100-s.
The problem is that NotificationComponent has an activation_priority of 200, which means there is a window in which notifications are already or still being sent, but the event is not written to the database. I'd like to avoid that entirely.
Then lower NotificationComponent's activation_priority below 100 instead.
f3ee453 to c25007c
Much clearer the diff is! 👍
However:
lib/remote/endpoint.ti
Outdated
@@ -10,6 +10,8 @@ namespace icinga

class Endpoint : ConfigObject
{
activation_priority 300;

What is this for? Endpoint neither overrides #Start() or #Stop(), nor does any code call Endpoint#IsActive().
Because I did not notice till rn, that all communication with the endpoints also counts as ApiListener.... I expected them to behave like the other config objects... My brain hurts now.
@@ -9,7 +9,7 @@ namespace icinga

class NotificationComponent : ConfigObject
{
activation_priority 200;
activation_priority 75;
#10191 (comment) sounds like you'd like to do the opposite, i.e to leave this as-is. Imagine, all data outputs (100) are shut down or not yet online, but notifications (75) get sent – silently.
#10191 (comment) This was your suggestion.... I honestly find this code so confusing that I just accepted it as is.
Sorry, I was fully focused on cleaning up the diff.
Anyway, now after thinking about what's left, I suspect what matters in your case is Notification -5 and this one is not needed or even harmful.
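The scenario from this thread (notifications at 75, data outputs at 100) can be reasoned about with a small check. This is a sketch under stated assumptions: types start strictly in ascending activation priority and stop in descending order, as described earlier in the review. `PriorityMap` and `RunsWithout` are hypothetical names, not Icinga 2 API.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical lookup table: type name -> activation_priority.
using PriorityMap = std::map<std::string, int>;

// Returns true if, during startup or shutdown, there is a moment in which
// `running` is active while `other` is not. With ascending startup and
// descending shutdown, the lower priority is active for the longer interval,
// so such a moment exists exactly when `running` has the lower priority.
bool RunsWithout(const PriorityMap& prio, const std::string& running, const std::string& other)
{
    assert(prio.count(running) == 1 && prio.count(other) == 1);
    return prio.at(running) < prio.at(other);
}
```

Under these assumptions, `RunsWithout` is true for NotificationComponent at 75 against IcingaDB at 100, and false with NotificationComponent left at 200, which matches the concern that notifications could be sent silently.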
Somewhere down the line while applying your suggested improvements, the fix broke again. I'll try to reproduce exactly what went wrong. |
Make sure all configobjects are started and shut down in an order, such that all state changes of monitoring objects are correctly processed during the shutdown.
c25007c to 4a39831
ac12be8 to b97894d
b97894d to 39977d9
@@ -10,7 +10,7 @@ namespace icinga

class IcingaDB : ConfigObject
{
activation_priority 100;
activation_priority -50;
No matter whether Icinga DB purges Redis and re-inserts everything – or performs a diff – given #10151, not yet activated objects – all checkables – will effectively be deleted. This is not acceptable, so if -50 is actually needed here, we have to:
not yet activated objects – all checkables – will effectively be deleted
Idk how you came to that conclusion, nor how this relates to #10057 or #10151 in any way. Icinga DB won't do anything if the checkables aren't truly deleted.
icinga2/lib/icingadb/icingadb-objects.cpp
Lines 2781 to 2782 in 3218908
} else if (!object->IsActive() &&
object->GetExtension("ConfigObjectDeleted")) { // same as in apilistener-configsync.cpp
I already checked this PR last week and noticed that Icinga DB, IDO etc. are not activated first as before, but last, in the same way they are deactivated. IMHO, a possible solution is to not touch the activation order of these objects on startup and to deactivate them last on shutdown/reload, but this will not be achievable by simply changing the activation_priority field. So I wanted to wait until tomorrow and discuss it in our next meeting.
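The idea of leaving the startup order alone while stopping data outputs last needs two independent orderings. The sketch below illustrates that with a second, purely hypothetical `deactivationPriority` field; Icinga 2 types only have a single activation_priority, and the type names and numbers are illustrative.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Hypothetical: a type with separate start and stop ordering keys.
struct TypeOrder {
    std::string name;
    int activationPriority;   // lower values start earlier
    int deactivationPriority; // hypothetical field: lower values stop earlier
};

// Sorts by the given key (ascending, stable) and returns the type names.
template <typename Key>
std::vector<std::string> NamesSortedBy(std::vector<TypeOrder> types, Key key)
{
    std::stable_sort(types.begin(), types.end(),
        [&key](const TypeOrder& a, const TypeOrder& b) { return key(a) < key(b); });

    std::vector<std::string> names;
    for (const TypeOrder& type : types)
        names.push_back(type.name);

    assert(names.size() == types.size());
    return names;
}

std::vector<std::string> StartOrder(const std::vector<TypeOrder>& types)
{
    return NamesSortedBy(types, [](const TypeOrder& t) { return t.activationPriority; });
}

std::vector<std::string> StopOrder(const std::vector<TypeOrder>& types)
{
    return NamesSortedBy(types, [](const TypeOrder& t) { return t.deactivationPriority; });
}
```

With a single priority the stop order is always the exact reverse of the start order; the second key is what allows a data output to start in the middle and still stop last.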
@@ -598,10 +598,10 @@ void ConfigObject::StopObjects()
continue;

for (const ConfigObject::Ptr& object : dtype->GetObjects()) {
#ifdef I2_DEBUG
Please revert this!
#endif /* I2_DEBUG */
<< "Deactivate() called for config object '" << object->GetName()
<< "' with type '" << type->GetName()
<< "and priority " << type->GetActivationPriority() << "'.";
<< "' and priority '" << type->GetActivationPriority() << "'.";
Hi @yhabteab how did the meeting yesterday go? Did you come to any resolution regarding this issue? |
Hi @w1ll-i-code, I'm afraid that changing the order of |
@yhabteab Thanks for the quick reply. I still have some questions:
Yes, I'm referring to check result events coming from other endpoints.
Not exactly something is not working, but so far IcingaDB is activated after hosts, services, downtimes etc. (originally I thought it was actually the other way round, i.e. IcingaDB is activated first and then the other objects, that's my mistake), but with this PR its activation priority has been set to |
@yhabteab The ApiListener should already be shut down at that point (activation_priority 50), right? So it wouldn't impact it? |
Unfortunately, stopping the ApiListener object does not stop the listener from accepting new connections and does not immediately disconnect all already connected endpoints either. All the active connections will only be terminated when the process is terminated, not after the API listener has been stopped. @julianbrost was/is working on fixing this, i.e. when the listener is stopped, to manually disconnect all connected endpoints and reject any new connection attempts, but as mentioned before, this will require excessive testing and more time to invest in and will therefore not make it into the next bugfix release. |
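The behavior described as still missing (a stopped listener that disconnects its endpoints and rejects new connections) can be sketched independently of Icinga 2. `SketchListener` and its methods are hypothetical and are not the ApiListener interface; real connection handling is reduced to a set of endpoint names.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <string>

// Hypothetical listener model; not the Icinga 2 ApiListener API.
class SketchListener {
public:
    // Returns true if the connection was accepted.
    bool Accept(const std::string& endpoint)
    {
        if (m_Stopped)
            return false; // reject new connections once stopped

        m_Connected.insert(endpoint);
        return true;
    }

    // Stops accepting and disconnects everything that is still connected.
    void Stop()
    {
        m_Stopped = true;
        m_Connected.clear();
        assert(m_Connected.empty());
    }

    std::size_t ConnectedCount() const
    {
        return m_Connected.size();
    }

private:
    bool m_Stopped = false;
    std::set<std::string> m_Connected;
};
```

In this model no check result from another endpoint can arrive after `Stop()`, which is the property the shutdown order would need in order to rely on the listener's activation priority.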
This PR fixes #10179.
Shut down the objects in the correct order.
During a restart, the objects are shut down in the reverse of the order they should be. As a result, the CheckerComponent is still updating objects after the IDO, Icinga DB, notifications, etc. have already been shut down, leading to icinga2 getting out of sync with external components.
Increase activation_priority for key config objects to ensure correct startup and shutdown order
Make sure all configobjects are started and shut down in the right order, so that each object has its possible dependencies loaded before starting execution.