"Aborted initial state sync" / "Aborted config sync" because of "Lock wait timeout exceeded" #820

Open
oxzi opened this issue Sep 30, 2024 · 8 comments

Comments

@oxzi
Member

oxzi commented Sep 30, 2024

Describe the bug

In a larger Icinga setup, built by @yhabteab last week, Icinga DB may abort during the initial state and config sync due to an exceeded lock wait timeout. The setup consists of two Icinga DB instances running in HA mode.

config-sync: Aborted initial state sync after 5m37.981933179s
config-sync: Aborted config sync after 5m37.982500933s
Error 1205 (HY000): Lock wait timeout exceeded; try restarting transaction
can't perform "INSERT INTO \"service_state\" (\"soft_state\", \"last_update\", \"id\", \"last_comment_id\", \"check_commandline>
github.com/icinga/icinga-go-library/database.CantPerformQuery
        /Users/yhabteab/Workspace/go/icinga-go-library/database/utils.go:15
github.com/icinga/icinga-go-library/database.(*DB).NamedBulkExec.func1.(*DB).NamedBulkExec.func1.1.2.1
        /Users/yhabteab/Workspace/go/icinga-go-library/database/db.go:505
github.com/icinga/icinga-go-library/retry.WithBackoff
        /Users/yhabteab/Workspace/go/icinga-go-library/retry/retry.go:64
github.com/icinga/icinga-go-library/database.(*DB).NamedBulkExec.func1.(*DB).NamedBulkExec.func1.1.2
        /Users/yhabteab/Workspace/go/icinga-go-library/database/db.go:500
golang.org/x/sync/errgroup.(*Group).Go.func1
        /Users/yhabteab/Workspace/go/pkg/mod/golang.org/x/[email protected]/errgroup/errgroup.go:78
runtime.goexit
        /opt/homebrew/opt/go/libexec/src/runtime/asm_amd64.s:1700
retry deadline exceeded
github.com/icinga/icinga-go-library/retry.WithBackoff
        /Users/yhabteab/Workspace/go/icinga-go-library/retry/retry.go:99
github.com/icinga/icinga-go-library/database.(*DB).NamedBulkExec.func1.(*DB).NamedBulkExec.func1.1.2
        /Users/yhabteab/Workspace/go/icinga-go-library/database/db.go:500
golang.org/x/sync/errgroup.(*Group).Go.func1
        /Users/yhabteab/Workspace/go/pkg/mod/golang.org/x/[email protected]/errgroup/errgroup.go:78
runtime.goexit
        /opt/homebrew/opt/go/libexec/src/runtime/asm_amd64.s:1700

The used Icinga DB version was, afaik, the latest Icinga DB release v1.2.0 with #800 cherry-picked on top. Please note that this error is not related to HA. Furthermore, no two instances were active at the same time.

Detailed logs for this run were acquired by some version of:

$ journalctl -u icingadb -o json | jq '._SOURCE_REALTIME_TIMESTAMP |= (tonumber | . / 1000000 | todate) | {"PRIORITY", "_SOURCE_REALTIME_TIMESTAMP", "MESSAGE", "ICINGADB_REASON", "ICINGADB_ERROR", "ICINGADB_RECOVERED_ERROR", "ICINGADB_ATTEMPTS", "ICINGADB_AFTER"} | del(..|nulls)'

Please note that the first node crashed, but the other then took over and worked without any issues. This happened three times last Friday, but, of course, today I was not able to reproduce it.

To Reproduce

  1. Create a huge Icinga environment, as documented (internally) by @yhabteab.
  2. ???
  3. Profit!

Expected behavior

No crash.

Your Environment


Additional context

N/A

@oxzi oxzi added the crash label Sep 30, 2024
@oxzi
Member Author

oxzi commented Sep 30, 2024

Oh wait, some of the prior errors logged as database: Can't execute query. Retrying are actually PRIMARY KEY conflicts. For example:

"ICINGADB_ERROR": "can't perform "INSERT INTO \"service_state\" (\"soft_state\", \"last_update\", \"id\", \"last_comment_id\", \"check_commandline\", \"last_state_change\", \"severity\", \"check_attempt\", \"is_handled\", \"is_problem\", \"next_update\", \"execution_time\", \"is_acknowledged\", \"hard_state\", \"output\", \"state_type\", \"host_id\", \"scheduling_source\", \"latency\", \"previous_hard_state\", \"check_timeout\", \"acknowledgement_comment_id\", \"next_check\", \"is_flapping\", \"environment_id\", \"in_downtime\", \"is_reachable\", \"long_output\", \"properties_checksum\", \"performance_data\", \"normalized_performance_data\", \"previous_soft_state\", \"service_id\", \"check_source\") VALUES (:soft_state, :last_update, :id, :last_comment_id, :check_commandline, :last_state_change, :severity, :check_attempt, :is_handled, :is_problem, :next_update, :execution_time, :is_acknowledged, :hard_state, :output, :state_type, :host_id, :scheduling_source, :latency, :previous_hard_state, :check_timeout, :acknowledgement_comment_id, :next_check, :is_flapping, :environment_id, :in_downtime, :is_reachable, :long_output, :properties_checksum, :performance_data, :normalized_performance_data, :previous_soft_state, :service_id, :check_source)": Error 1062 (23000): Duplicate entry '\x8Er\xF3\xD4\xAD \xCB\xC1\x94\xDAr9\x0C\xE6\xEA:\xEA\xACw2' for key 'service_state.PRIMARY'"

"ICINGADB_ERROR": "can't perform "INSERT INTO \"downtime\" (\"host_id\", \"scheduled_end_time\", \"start_time\", \"end_time\", \"scheduled_start_time\", \"zone_id\", \"parent_id\", \"author\", \"comment\", \"scheduled_duration\", \"object_type\", \"service_id\", \"flexible_duration\", \"duration\", \"name\", \"id\", \"entry_time\", \"is_in_effect\", \"scheduled_by\", \"properties_checksum\", \"triggered_by_id\", \"is_flexible\", \"environment_id\", \"name_checksum\") VALUES (:host_id, :scheduled_end_time, :start_time, :end_time, :scheduled_start_time, :zone_id, :parent_id, :author, :comment, :scheduled_duration, :object_type, :service_id, :flexible_duration, :duration, :name, :id, :entry_time, :is_in_effect, :scheduled_by, :properties_checksum, :triggered_by_id, :is_flexible, :environment_id, :name_checksum)": Error 1062 (23000): Duplicate entry '\xEAq'&\xA2v\xD4!\xD5\xCE=\xC5\xB7z\x0F0\xD7\x14\x8D\xD2' for key 'downtime.PRIMARY'"

"ICINGADB_ERROR": "can't perform "INSERT INTO \"downtime\" (\"host_id\", \"scheduled_end_time\", \"start_time\", \"end_time\", \"scheduled_start_time\", \"zone_id\", \"parent_id\", \"author\", \"comment\", \"scheduled_duration\", \"object_type\", \"service_id\", \"flexible_duration\", \"duration\", \"name\", \"id\", \"entry_time\", \"is_in_effect\", \"scheduled_by\", \"properties_checksum\", \"triggered_by_id\", \"is_flexible\", \"environment_id\", \"name_checksum\") VALUES (:host_id, :scheduled_end_time, :start_time, :end_time, :scheduled_start_time, :zone_id, :parent_id, :author, :comment, :scheduled_duration, :object_type, :service_id, :flexible_duration, :duration, :name, :id, :entry_time, :is_in_effect, :scheduled_by, :properties_checksum, :triggered_by_id, :is_flexible, :environment_id, :name_checksum)": Error 1062 (23000): Duplicate entry '\xF6\xF6U'\xEC\x7FW\xCC\xF4\x0F5\x9BgGJ\xCB[oIz' for key 'downtime.PRIMARY'"

"ICINGADB_ERROR": "can't perform "INSERT INTO \"host_customvar\" (\"host_id\", \"customvar_id\", \"environment_id\", \"id\") VALUES (:host_id, :customvar_id, :environment_id, :id)": Error 1062 (23000): Duplicate entry '\xF0\xBAyn\x8F\x9F\xE52:\xD4\xBE-\x86\xD8O\xE2\x03\x05\xD1\xC0' for key 'host_customvar.PRIMARY'"

"ICINGADB_ERROR": "can't perform "INSERT INTO \"downtime\" (\"host_id\", \"scheduled_end_time\", \"start_time\", \"end_time\", \"scheduled_start_time\", \"zone_id\", \"parent_id\", \"author\", \"comment\", \"scheduled_duration\", \"object_type\", \"service_id\", \"flexible_duration\", \"duration\", \"name\", \"id\", \"entry_time\", \"is_in_effect\", \"scheduled_by\", \"properties_checksum\", \"triggered_by_id\", \"is_flexible\", \"environment_id\", \"name_checksum\") VALUES (:host_id, :scheduled_end_time, :start_time, :end_time, :scheduled_start_time, :zone_id, :parent_id, :author, :comment, :scheduled_duration, :object_type, :service_id, :flexible_duration, :duration, :name, :id, :entry_time, :is_in_effect, :scheduled_by, :properties_checksum, :triggered_by_id, :is_flexible, :environment_id, :name_checksum)": Error 1062 (23000): Duplicate entry '\xF1\x9B\x15zc'C\xEA\xDEF\xCBA\xAB\x82y!(\xACI\xE3' for key 'downtime.PRIMARY'"

As the other Icinga DB instance is HA passive - as the logs show - it should not be responsible for primary key constraint violations in those tables. Can this be due to a prior, aborted sync, or is something racing itself? The latter would be supported by the multiple "Deadlock found when trying to get lock; try restarting transaction" errors.

By the way, all 35 occurrences of "Error 1213 (40001): Deadlock found when trying to get lock; try restarting transaction" happened for INSERTs into service_state.
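
Should this reoccur, the last detected deadlock and the configured lock wait timeout can be inspected on the database server itself; these are plain InnoDB diagnostics, nothing Icinga DB specific:

mysql> SHOW ENGINE INNODB STATUS\G
mysql> SHOW VARIABLES LIKE 'innodb_lock_wait_timeout';

The "LATEST DETECTED DEADLOCK" section of the status output lists the two transactions and the statements involved, which should confirm whether it really is competing INSERTs into service_state.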

@oxzi
Member Author

oxzi commented Oct 1, 2024

After stopping both Icinga DB nodes, truncating the service_state table and restarting the nodes, the active node complained multiple times about an "Error 1213 (40001): Deadlock found when trying to get lock; try restarting transaction". However, these errors were recoverable and Icinga DB survived.

I just went over the whole synchronization code and am now more confused than before: how could the primary key constraint be violated by one single node? The delta is calculated based on what is currently within the SQL database and within Redis. If something is part of both, it shouldn't be in delta.Create.
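
To illustrate where my confusion comes from, here is a heavily simplified, hypothetical sketch of that delta idea in Go; the types and names are made up and this is not the actual icingadb implementation:

package main

import (
	"bytes"
	"fmt"
)

// entity is a stand-in for a synced object, identified by its ID and a
// checksum over its properties.
type entity struct {
	ID       string
	Checksum []byte
}

// delta groups the entities to INSERT, UPDATE and DELETE.
type delta struct {
	Create, Update, Delete map[string]entity
}

// computeDelta compares the entities read from Redis against those already in
// the SQL database. Anything present on both sides must never end up in Create.
func computeDelta(redis, sql map[string]entity) delta {
	d := delta{
		Create: map[string]entity{},
		Update: map[string]entity{},
		Delete: map[string]entity{},
	}
	for id, e := range redis {
		if dbEnt, ok := sql[id]; !ok {
			d.Create[id] = e // only in Redis -> INSERT
		} else if !bytes.Equal(dbEnt.Checksum, e.Checksum) {
			d.Update[id] = e // in both, but changed -> UPDATE
		}
	}
	for id, e := range sql {
		if _, ok := redis[id]; !ok {
			d.Delete[id] = e // only in the SQL database -> DELETE
		}
	}
	return d
}

func main() {
	redis := map[string]entity{"a": {ID: "a"}, "b": {ID: "b", Checksum: []byte{1}}}
	sql := map[string]entity{"b": {ID: "b", Checksum: []byte{2}}, "c": {ID: "c"}}
	d := computeDelta(redis, sql)
	fmt.Printf("create=%d update=%d delete=%d\n", len(d.Create), len(d.Update), len(d.Delete)) // create=1 update=1 delete=1
}

With such a logic, a duplicate key error on INSERT should only be possible if an entity shows up in Redis but was not returned by the preceding SELECT against the SQL database, e.g. because the rows appeared concurrently.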

The logs from the crashed node posted above contain:

config-sync: Inserting 245858 items of type service state
config-sync: Updating 39 items of type service state

At the moment (which may not be comparable), the system has the following state:

127.0.0.1:6380> HLEN icinga:service:state
(integer) 307538

mysql> select count(*) from service_state\G
*************************** 1. row ***************************
count(*): 307538
1 row in set (0.03 sec)

Under the assumption that this number hasn't changed, there might have been a race between a flushed Redis being populated by Icinga 2 while being read by Icinga DB. Unless I have missed something, there is no code preventing this case. And, as supporting evidence, there were multiple occurrences of FLUSHALL in the ~/.rediscli_history file.
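
(For the record, counting those occurrences boils down to something like the following; the exact invocation may have looked different:)

$ grep -c FLUSHALL ~/.rediscli_history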

@oxzi
Member Author

oxzi commented Oct 1, 2024

there might have been a race between a flushed Redis being populated by Icinga 2 while being read by Icinga DB

Oh wait, that would be the wrong way round. If something is missing in Redis, but available in the SQL database, it would be set as delta.Delete and not .Create.

I tried reproducing the behavior with some FLUSHALLs, but this did not work out. However, I was able to crash Icinga DB in another way when flushing during the sync, but this is fine, I guess.

redis: HMGET icinga:downtime: field "4bd67fe5e10611a5c312ba1b05d8cfe6d7317662" missing
redis: HMGET icinga:downtime: field "5aacf4988f2686df77cc5c5660855d2deafd1185" missing
redis: HMGET icinga:downtime: field "6e97e19d87853b00477c575a2ed25c0c6204ac2e" missing
redis: HMGET icinga:downtime: field "8d77e6661693adba73c4802c98a0bff07ad07126" missing
redis: HMGET icinga:downtime: field "d3537844bfa9e7b4027c106d5ade0eb9c02720d8" missing
redis: HMGET icinga:downtime: field "d7ae5700b04633be4ccd6d52daa397fbdc1d1c14" missing
redis: HMGET icinga:downtime: field "0ce73ba7968ec4350c7a7d0ca4b00bc811336356" missing
redis: HMGET icinga:downtime: field "2c0bbf8054855095870030ce9a92bfaefea7a1fc" missing
redis: HMGET icinga:downtime: field "3d028f5662ddf7c3a9152c6bef4461a57eb993cb" missing
redis: HMGET icinga:downtime: field "4924e173f3ed5ea0001ef275818e85c3e827cccd" missing
redis: HMGET icinga:downtime: field "6444a66dbc37c526d03c2bc2e921cf6bf0ca6c92" missing
redis: HMGET icinga:downtime: field "bc4652319400f81d232884583cd6f96173146d14" missing
redis: HMGET icinga:downtime: field "c658c0430fa9e59a8e1b7b452c2782c299605e2b" missing
redis: Finished fetching from icinga:downtime with 0 items in 367.842935ms
database: Finished executing "SELECT \"properties_checksum\", \"id\" FROM \"host_state\" WHERE \"environment_id\" = :environment_id" with 15944>
database: Finished executing "UPDATE \"host_state\" SET \"is_overdue\" = :is_overdue, \"id\" = :id WHERE id = :id" with 6207 rows in 5.41943569>
database: Finished executing "SELECT \"id\" FROM \"host_customvar\" WHERE \"environment_id\" = :environment_id" with 125000 rows in 3.172386254s
redis: Finished fetching from icinga:host:customvar with 249905 items in 3.172477344s
config-sync: Aborted config sync after 8.016455329s
config-sync: Aborted initial state sync after 8.016473497s
can't copy from closed channel
github.com/icinga/icinga-go-library/com.CopyFirst[...]
        /Users/yhabteab/Workspace/go/icinga-go-library/com/com.go:64
github.com/icinga/icinga-go-library/database.(*DB).UpsertStreamed
        /Users/yhabteab/Workspace/go/icinga-go-library/database/db.go:708
github.com/icinga/icingadb/pkg/icingadb.Sync.ApplyDelta.func2
        /Users/yhabteab/Workspace/go/icingadb/pkg/icingadb/sync.go:158
golang.org/x/sync/errgroup.(*Group).Go.func1
        /Users/yhabteab/Workspace/go/pkg/mod/golang.org/x/[email protected]/errgroup/errgroup.go:78
runtime.goexit
        /opt/homebrew/opt/go/libexec/src/runtime/asm_amd64.s:1700
can't copy first entity
github.com/icinga/icinga-go-library/database.(*DB).UpsertStreamed
        /Users/yhabteab/Workspace/go/icinga-go-library/database/db.go:710
github.com/icinga/icingadb/pkg/icingadb.Sync.ApplyDelta.func2
        /Users/yhabteab/Workspace/go/icingadb/pkg/icingadb/sync.go:158
golang.org/x/sync/errgroup.(*Group).Go.func1
        /Users/yhabteab/Workspace/go/pkg/mod/golang.org/x/[email protected]/errgroup/errgroup.go:78
runtime.goexit
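
For context, the "can't copy from closed channel" error presumably originates from a helper that peeks at the first element of a channel before streaming the rest, e.g. so that UpsertStreamed can derive its statement from that first entity. A hypothetical sketch of such a CopyFirst-style helper (not the actual icinga-go-library code) shows how an aborted sync, closing the producing channel before anything was sent, yields exactly this error:

package main

import (
	"errors"
	"fmt"
)

// copyFirst waits for the first element of in. If in is closed before any
// element arrives (e.g. because the producing sync was cancelled), there is
// nothing to copy and an error is returned. Otherwise the first element is
// returned and also re-emitted on a new channel, followed by everything else
// still arriving on in.
func copyFirst[T any](in <-chan T) (T, <-chan T, error) {
	first, ok := <-in
	if !ok {
		var zero T
		return zero, nil, errors.New("can't copy from closed channel")
	}

	out := make(chan T)
	go func() {
		defer close(out)
		out <- first // re-emit the first element ...
		for v := range in {
			out <- v // ... followed by the remaining ones
		}
	}()
	return first, out, nil
}

func main() {
	closed := make(chan int)
	close(closed) // simulate a producer that was cancelled before sending anything

	if _, _, err := copyFirst(closed); err != nil {
		fmt.Println(err) // prints: can't copy from closed channel
	}
}

In the trace above, this error is then wrapped into "can't copy first entity", matching the two stacked errors in the log.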

Another observation: during experiments with the other node stopped, no SQL deadlock errors occurred. When I wanted to verify this, I re-ran the same scenario with both nodes active, and again no deadlocks occurred.

@oxzi oxzi added the ref/IP label Oct 11, 2024
@oxzi
Member Author

oxzi commented Oct 11, 2024

ref/IP/55850

(Quite unsure if this applies here..)

@yhabteab
Member

The used Icinga DB version was, afaik, the latest Icinga DB release v1.2.0 with #800 cherry-picked on top.

Actually, the Icinga DB binaries are built from #800, i.e. they are not based on the v1.2.0 tag.

Please note that the first node crashed, but the other then took over and worked without any issues.

IIRC, Icinga DB was actually behaving quite unstably on both nodes and was crashing after only 5m on both of them, not just on the other side. Also, didn't I mention before I left that it had to do with the MariaDB version I was using? If I'm not mistaken, MariaDB version 10.5.x was in use before I deleted MariaDB completely and replaced it with MySQL. However, all the crashes disappeared after replacing it with MySQL, and Icinga DB has been running quite normally since then.

@oxzi
Member Author

oxzi commented Oct 18, 2024

However, all the crashes disappeared after replacing it with MySQL, and Icinga DB has been running quite normally since then.

Unfortunately not. The reported error in this issue happened after you left. I am quite certain, as I remember my frustration at seeing it crash after your change, which (temporarily) fixed the issue. However, maybe this issue is due to a prior failure; I simply don't know :/

@yhabteab
Member

Unfortunately not. The reported error in this issue happened after you left.

Are you sure about that? The crash logs you provided are all from Friday afternoon, which is before I replaced MariaDB with MySQL.

{
  "PRIORITY": "2",
  "_SOURCE_REALTIME_TIMESTAMP": "2024-09-27T15:29:45Z",
  "MESSAGE": "Error 1205 (HY000): Lock wait timeout exceeded; try restarting transaction\ncan't perform \"INSERT INTO
...

@oxzi
Member Author

oxzi commented Oct 21, 2024

The timestamp is in UTC, which is 17:29:45 in our current timezone. Based on our very important Greetings chat room, you had already left at this point.
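
(Assuming Europe/Berlin as "our current timezone", the conversion can be double-checked with GNU date:)

$ TZ=Europe/Berlin date -d '2024-09-27T15:29:45Z' '+%F %T %Z'
2024-09-27 17:29:45 CEST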
