"Aborted initial state sync" / "Aborted config sync" because of "Lock wait timeout exceeded" #820
Oh wait, the prior errors were reported as …
As the other Icinga DB instance is HA passive (as the logs show), it should not be responsible for primary key constraint violations in those tables. Can this be due to a prior, aborted sync, or is something racing itself? The latter would be supported by the multiple "Deadlock found when trying to get lock; try restarting transaction" errors. By the way, all 35 occurrences of "Error 1213 (40001): Deadlock found when trying to get lock; try restarting transaction" happened for INSERTs into …
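For what it's worth, the "try restarting transaction" part of these errors is the server's hint that the failed statement can simply be retried in a fresh transaction. A minimal, hypothetical Go sketch of such a retry around a deadlock (1213) or lock wait timeout (1205); this is not Icinga DB's actual retry logic, and the helper name and attempt handling are made up:

```go
// Hypothetical sketch only: retry a transactional unit of work when MySQL/MariaDB
// reports a deadlock (1213) or a lock wait timeout (1205), as suggested by the
// "try restarting transaction" part of the error message.
package sketch

import (
	"context"
	"database/sql"
	"errors"

	"github.com/go-sql-driver/mysql"
)

// runInTxWithRetry runs fn in a fresh transaction and retries it up to
// `attempts` times while the failure is a deadlock or lock wait timeout.
func runInTxWithRetry(ctx context.Context, db *sql.DB, attempts int, fn func(*sql.Tx) error) error {
	var err error
	for i := 0; i < attempts; i++ {
		var tx *sql.Tx
		if tx, err = db.BeginTx(ctx, nil); err != nil {
			return err
		}
		if err = fn(tx); err == nil {
			if err = tx.Commit(); err == nil {
				return nil
			}
		} else {
			_ = tx.Rollback()
		}
		var myErr *mysql.MySQLError
		if !errors.As(err, &myErr) || (myErr.Number != 1213 && myErr.Number != 1205) {
			return err // anything else is not retryable, give up immediately
		}
	}
	return err
}
```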
After stopping both Icinga DB nodes, truncating the … I just went over the whole synchronization code and am now more confused than before: how could the primary key constraint be violated by one single node? The delta is calculated based on what is currently within the SQL database and within Redis. If something is part of both, it shouldn't be in … The logs from the crashed node posted above contain:
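To make that argument concrete, here is a heavily simplified, hedged Go sketch of the delta idea (the `Delta` and `Compute` names are mine; this is not the actual icinga/icingadb sync code): entities are partitioned by comparing the IDs currently in Redis against the IDs currently in the SQL database.

```go
// Simplified illustration of the described delta computation, not the real
// Icinga DB code: IDs are partitioned by where they currently exist.
package sketch

// Delta groups entity IDs by the action the sync would take for them.
type Delta struct {
	Insert []string // only in Redis            -> INSERT
	Update []string // in Redis and in SQL      -> UPDATE (if changed)
	Delete []string // only in the SQL database -> DELETE
}

// Compute derives the delta from the two ID sets.
func Compute(redisIDs, sqlIDs map[string]struct{}) Delta {
	var d Delta
	for id := range redisIDs {
		if _, inSQL := sqlIDs[id]; inSQL {
			d.Update = append(d.Update, id)
		} else {
			d.Insert = append(d.Insert, id)
		}
	}
	for id := range sqlIDs {
		if _, inRedis := redisIDs[id]; !inRedis {
			d.Delete = append(d.Delete, id)
		}
	}
	return d
}
```

With a delta of this shape, an ID present on both sides can only end up in the update set, never in the insert set, which is why a single node working from an up-to-date view of both sides should not be able to violate the primary key constraint.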
At the moment (which may not be comparable), the system has the following state:
Under the assumption that this number hasn't changed, there might have been a race between a flushed Redis being populated by Icinga 2 while it is being read by Icinga DB. Unless I have missed something, there is no code preventing this case. And, to bring evidence, there were multiple occurrences of …
Oh wait, that would be the wrong way round. If something is missing in Redis, but available in the SQL database, it would be set as … I tried reproducing the behavior with some …
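Continuing the hypothetical Compute sketch from above: a freshly flushed Redis that Icinga 2 is still repopulating looks like a subset of what is already in SQL, and with that delta logic the difference ends up as deletes, not inserts, which is why this race cannot explain a primary key violation.

```go
// Using the hypothetical Compute sketch from above: a partially repopulated
// Redis yields deletes, not duplicate-key inserts.
package sketch

func examplePartialRedis() Delta {
	sqlIDs := map[string]struct{}{"a": {}, "b": {}, "c": {}}
	redisIDs := map[string]struct{}{"a": {}} // repopulation still in progress
	return Compute(redisIDs, sqlIDs)
	// -> Insert: [], Update: ["a"], Delete: ["b", "c"]
}
```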
Another observation: during experiments with the other node stopped, no SQL deadlock errors occurred. When I wanted to verify this, I re-ran the same scenario with both nodes active, and no deadlocks occurred either.
ref/IP/55850 (Quite unsure if this applies here.)
Actually, the Icinga DB binaries are built from #800, i.e. they are not based on the v1.2.0 tag.
IIRC, Icinga DB was actually behaving so unstably on both nodes, and was crashing after only …
Unfortunately not. The reported error in this issue happened after you left. I am quite certain, as I remember my frustration seeing it crash after your change which (temporarily) fixed the issue. However, maybe this issue is due to a prior failure, I simply don't know :/
Are you sure about that? The crash logs you provided are all from Friday afternoon, which is before I replaced MariaDB with MySQL.

{
    "PRIORITY": "2",
    "_SOURCE_REALTIME_TIMESTAMP": "2024-09-27T15:29:45Z",
    "MESSAGE": "Error 1205 (HY000): Lock wait timeout exceeded; try restarting transaction\ncan't perform \"INSERT INTO
    ...
The timestamp is in UTC, being 17:29:45 in our current timezone. Based on our very important Greetings chat room, you had already left at that point.
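Just to double-check the offset (assuming the Europe/Berlin timezone, i.e. CEST in late September), converting the journal timestamp confirms 17:29:45 local time:

```go
// Small sanity check of the UTC -> local conversion; Europe/Berlin is an assumption.
package main

import (
	"fmt"
	"time"
)

func main() {
	utc := time.Date(2024, time.September, 27, 15, 29, 45, 0, time.UTC)
	berlin, err := time.LoadLocation("Europe/Berlin")
	if err != nil {
		panic(err)
	}
	fmt.Println(utc.In(berlin).Format("15:04:05 MST")) // prints 17:29:45 CEST
}
```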
Describe the bug
In a bigger Icinga setup, as built by @yhabteab last week, Icinga DB may abort during the initial state and config sync due to an exceeded lock wait timeout. The setup consists of two Icinga DB instances running in HA mode.
The Icinga DB version used was, afaik, the latest Icinga DB release v1.2.0 with #800 cherry-picked on top. Please note that this error is not related to HA. Furthermore, no two instances were active at the same time.
Detailed logs for this run, acquired by some version of:
Please note that the first node crashed, but the other one then took over and worked without any issues. This happened three times last Friday, but, of course, today I was not able to reproduce it.
To Reproduce
Expected behavior
Icinga DB should not crash.
Your Environment
Include as many relevant details about the environment in which you experienced the problem.
Additional context
N/A