"Aborted initial state sync" / "Aborted config sync" because of "Lock wait timeout exceeded" #820

Open
oxzi opened this issue Sep 30, 2024 · 8 comments

Comments

@oxzi
Member

oxzi commented Sep 30, 2024

Describe the bug

In a larger Icinga setup, built by @yhabteab last week, Icinga DB may abort during the initial state and config sync due to an exceeded lock wait timeout. The setup consists of two Icinga DB instances running in HA mode.

config-sync: Aborted initial state sync after 5m37.981933179s
config-sync: Aborted config sync after 5m37.982500933s
Error 1205 (HY000): Lock wait timeout exceeded; try restarting transaction
can't perform "INSERT INTO \"service_state\" (\"soft_state\", \"last_update\", \"id\", \"last_comment_id\", \"check_commandline>
github.com/icinga/icinga-go-library/database.CantPerformQuery
        /Users/yhabteab/Workspace/go/icinga-go-library/database/utils.go:15
github.com/icinga/icinga-go-library/database.(*DB).NamedBulkExec.func1.(*DB).NamedBulkExec.func1.1.2.1
        /Users/yhabteab/Workspace/go/icinga-go-library/database/db.go:505
github.com/icinga/icinga-go-library/retry.WithBackoff
        /Users/yhabteab/Workspace/go/icinga-go-library/retry/retry.go:64
github.com/icinga/icinga-go-library/database.(*DB).NamedBulkExec.func1.(*DB).NamedBulkExec.func1.1.2
        /Users/yhabteab/Workspace/go/icinga-go-library/database/db.go:500
golang.org/x/sync/errgroup.(*Group).Go.func1
        /Users/yhabteab/Workspace/go/pkg/mod/golang.org/x/[email protected]/errgroup/errgroup.go:78
runtime.goexit
        /opt/homebrew/opt/go/libexec/src/runtime/asm_amd64.s:1700
retry deadline exceeded
github.com/icinga/icinga-go-library/retry.WithBackoff
        /Users/yhabteab/Workspace/go/icinga-go-library/retry/retry.go:99
github.com/icinga/icinga-go-library/database.(*DB).NamedBulkExec.func1.(*DB).NamedBulkExec.func1.1.2
        /Users/yhabteab/Workspace/go/icinga-go-library/database/db.go:500
golang.org/x/sync/errgroup.(*Group).Go.func1
        /Users/yhabteab/Workspace/go/pkg/mod/golang.org/x/[email protected]/errgroup/errgroup.go:78
runtime.goexit
        /opt/homebrew/opt/go/libexec/src/runtime/asm_amd64.s:1700

The used Icinga DB version was, afaik, the latest Icinga DB release v1.2.0 with #800 cherry-picked on top. Please note that this error is not related to HA. Furthermore, no two instances were active at the same time.

Detailed logs for this run were acquired by some version of:

$ journalctl -u icingadb -o json | jq '._SOURCE_REALTIME_TIMESTAMP |= (tonumber | . / 1000000 | todate) | {"PRIORITY", "_SOURCE_REALTIME_TIMESTAMP", "MESSAGE", "ICINGADB_REASON", "ICINGADB_ERROR", "ICINGADB_RECOVERED_ERROR", "ICINGADB_ATTEMPTS", "ICINGADB_AFTER"} | del(..|nulls)'

Please note that the first node crashed, but the other then took over and worked without any issues. This happened three times last Friday, but, of course, today I was not able to reproduce it.

To Reproduce

  1. Create a huge Icinga environment, as documented (internally) by @yhabteab.
  2. ???
  3. Profit!

Expected behavior

No crash.

Your Environment


Additional context

N/A

@oxzi oxzi added the crash label Sep 30, 2024
@oxzi
Member Author

oxzi commented Sep 30, 2024

Oh wait, some of the prior errors logged as database: Can't execute query. Retrying are actually PRIMARY KEY conflicts. For example:

"ICINGADB_ERROR": "can't perform "INSERT INTO \"service_state\" (\"soft_state\", \"last_update\", \"id\", \"last_comment_id\", \"check_commandline\", \"last_state_change\", \"severity\", \"check_attempt\", \"is_handled\", \"is_problem\", \"next_update\", \"execution_time\", \"is_acknowledged\", \"hard_state\", \"output\", \"state_type\", \"host_id\", \"scheduling_source\", \"latency\", \"previous_hard_state\", \"check_timeout\", \"acknowledgement_comment_id\", \"next_check\", \"is_flapping\", \"environment_id\", \"in_downtime\", \"is_reachable\", \"long_output\", \"properties_checksum\", \"performance_data\", \"normalized_performance_data\", \"previous_soft_state\", \"service_id\", \"check_source\") VALUES (:soft_state, :last_update, :id, :last_comment_id, :check_commandline, :last_state_change, :severity, :check_attempt, :is_handled, :is_problem, :next_update, :execution_time, :is_acknowledged, :hard_state, :output, :state_type, :host_id, :scheduling_source, :latency, :previous_hard_state, :check_timeout, :acknowledgement_comment_id, :next_check, :is_flapping, :environment_id, :in_downtime, :is_reachable, :long_output, :properties_checksum, :performance_data, :normalized_performance_data, :previous_soft_state, :service_id, :check_source)": Error 1062 (23000): Duplicate entry '\x8Er\xF3\xD4\xAD \xCB\xC1\x94\xDAr9\x0C\xE6\xEA:\xEA\xACw2' for key 'service_state.PRIMARY'"

"ICINGADB_ERROR": "can't perform "INSERT INTO \"downtime\" (\"host_id\", \"scheduled_end_time\", \"start_time\", \"end_time\", \"scheduled_start_time\", \"zone_id\", \"parent_id\", \"author\", \"comment\", \"scheduled_duration\", \"object_type\", \"service_id\", \"flexible_duration\", \"duration\", \"name\", \"id\", \"entry_time\", \"is_in_effect\", \"scheduled_by\", \"properties_checksum\", \"triggered_by_id\", \"is_flexible\", \"environment_id\", \"name_checksum\") VALUES (:host_id, :scheduled_end_time, :start_time, :end_time, :scheduled_start_time, :zone_id, :parent_id, :author, :comment, :scheduled_duration, :object_type, :service_id, :flexible_duration, :duration, :name, :id, :entry_time, :is_in_effect, :scheduled_by, :properties_checksum, :triggered_by_id, :is_flexible, :environment_id, :name_checksum)": Error 1062 (23000): Duplicate entry '\xEAq'&\xA2v\xD4!\xD5\xCE=\xC5\xB7z\x0F0\xD7\x14\x8D\xD2' for key 'downtime.PRIMARY'"

"ICINGADB_ERROR": "can't perform "INSERT INTO \"downtime\" (\"host_id\", \"scheduled_end_time\", \"start_time\", \"end_time\", \"scheduled_start_time\", \"zone_id\", \"parent_id\", \"author\", \"comment\", \"scheduled_duration\", \"object_type\", \"service_id\", \"flexible_duration\", \"duration\", \"name\", \"id\", \"entry_time\", \"is_in_effect\", \"scheduled_by\", \"properties_checksum\", \"triggered_by_id\", \"is_flexible\", \"environment_id\", \"name_checksum\") VALUES (:host_id, :scheduled_end_time, :start_time, :end_time, :scheduled_start_time, :zone_id, :parent_id, :author, :comment, :scheduled_duration, :object_type, :service_id, :flexible_duration, :duration, :name, :id, :entry_time, :is_in_effect, :scheduled_by, :properties_checksum, :triggered_by_id, :is_flexible, :environment_id, :name_checksum)": Error 1062 (23000): Duplicate entry '\xF6\xF6U'\xEC\x7FW\xCC\xF4\x0F5\x9BgGJ\xCB[oIz' for key 'downtime.PRIMARY'"

"ICINGADB_ERROR": "can't perform "INSERT INTO \"host_customvar\" (\"host_id\", \"customvar_id\", \"environment_id\", \"id\") VALUES (:host_id, :customvar_id, :environment_id, :id)": Error 1062 (23000): Duplicate entry '\xF0\xBAyn\x8F\x9F\xE52:\xD4\xBE-\x86\xD8O\xE2\x03\x05\xD1\xC0' for key 'host_customvar.PRIMARY'"

"ICINGADB_ERROR": "can't perform "INSERT INTO \"downtime\" (\"host_id\", \"scheduled_end_time\", \"start_time\", \"end_time\", \"scheduled_start_time\", \"zone_id\", \"parent_id\", \"author\", \"comment\", \"scheduled_duration\", \"object_type\", \"service_id\", \"flexible_duration\", \"duration\", \"name\", \"id\", \"entry_time\", \"is_in_effect\", \"scheduled_by\", \"properties_checksum\", \"triggered_by_id\", \"is_flexible\", \"environment_id\", \"name_checksum\") VALUES (:host_id, :scheduled_end_time, :start_time, :end_time, :scheduled_start_time, :zone_id, :parent_id, :author, :comment, :scheduled_duration, :object_type, :service_id, :flexible_duration, :duration, :name, :id, :entry_time, :is_in_effect, :scheduled_by, :properties_checksum, :triggered_by_id, :is_flexible, :environment_id, :name_checksum)": Error 1062 (23000): Duplicate entry '\xF1\x9B\x15zc'C\xEA\xDEF\xCBA\xAB\x82y!(\xACI\xE3' for key 'downtime.PRIMARY'"

As the other Icinga DB instance is HA passive - as the logs show - it should not be responsible for primary key constraint violations in those tables. Can this be due to a prior, aborted sync, or is something racing itself? The latter would be supported by the multiple "Deadlock found when trying to get lock; try restarting transaction" errors.

By the way, all 35 occurrences of "Error 1213 (40001): Deadlock found when trying to get lock; try restarting transaction" happened for INSERTs into service_state.
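
Should this reoccur, the last detected deadlock and the configured lock wait timeout can be inspected on the database server itself; these are plain InnoDB diagnostics, nothing Icinga DB specific:

mysql> SHOW ENGINE INNODB STATUS\G
mysql> SHOW VARIABLES LIKE 'innodb_lock_wait_timeout';

The "LATEST DETECTED DEADLOCK" section of the status output lists the two transactions and the statements involved, which should confirm whether it really is competing INSERTs into service_state.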

@oxzi
Member Author

oxzi commented Oct 1, 2024

After stopping both Icinga DB nodes, truncating the service_state table and restarting the nodes, the active node complained multiple times about an "Error 1213 (40001): Deadlock found when trying to get lock; try restarting transaction". However, these errors were recoverable and Icinga DB survived.

I just went over the whole synchronization code and am now more confused than before: how could the primary key constraint be violated by one single node? The delta is calculated based on what is currently within the SQL database and within Redis. If something is part of both, it shouldn't be in delta.Create.
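
To illustrate where my confusion comes from, here is a heavily simplified, hypothetical sketch of that delta idea in Go; the types and names are made up and this is not the actual icingadb implementation:

package main

import (
	"bytes"
	"fmt"
)

// entity is a stand-in for a synced object, identified by its ID and a
// checksum over its properties.
type entity struct {
	ID       string
	Checksum []byte
}

// delta groups the entities to INSERT, UPDATE and DELETE.
type delta struct {
	Create, Update, Delete map[string]entity
}

// computeDelta compares the entities read from Redis against those already in
// the SQL database. Anything present on both sides must never end up in Create.
func computeDelta(redis, sql map[string]entity) delta {
	d := delta{
		Create: map[string]entity{},
		Update: map[string]entity{},
		Delete: map[string]entity{},
	}
	for id, e := range redis {
		if dbEnt, ok := sql[id]; !ok {
			d.Create[id] = e // only in Redis -> INSERT
		} else if !bytes.Equal(dbEnt.Checksum, e.Checksum) {
			d.Update[id] = e // in both, but changed -> UPDATE
		}
	}
	for id, e := range sql {
		if _, ok := redis[id]; !ok {
			d.Delete[id] = e // only in the SQL database -> DELETE
		}
	}
	return d
}

func main() {
	redis := map[string]entity{"a": {ID: "a"}, "b": {ID: "b", Checksum: []byte{1}}}
	sql := map[string]entity{"b": {ID: "b", Checksum: []byte{2}}, "c": {ID: "c"}}
	d := computeDelta(redis, sql)
	fmt.Printf("create=%d update=%d delete=%d\n", len(d.Create), len(d.Update), len(d.Delete)) // create=1 update=1 delete=1
}

With such a logic, a duplicate key error on INSERT should only be possible if an entity shows up in Redis but was not returned by the preceding SELECT against the SQL database, e.g. because the rows appeared concurrently.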

The logs from the crashed node posted above contain:

config-sync: Inserting 245858 items of type service state
config-sync: Updating 39 items of type service state

At the moment (which may not be comparable), the system has the following state:

127.0.0.1:6380> HLEN icinga:service:state
(integer) 307538

mysql> select count(*) from service_state\G
*************************** 1. row ***************************
count(*): 307538
1 row in set (0.03 sec)

Under the assumption that this number hasn't changed, there might have been a race between a flushed Redis being populated by Icinga 2 while being read by Icinga DB. Unless I have missed something, there is no code preventing this case. And, as supporting evidence, there were multiple occurrences of FLUSHALL in the ~/.rediscli_history file.
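
(For the record, counting those occurrences boils down to something like the following; the exact invocation may have looked different:)

$ grep -c FLUSHALL ~/.rediscli_history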

@oxzi
Member Author

oxzi commented Oct 1, 2024

there might have been a race between a flushed Redis being populated by Icinga 2 while being read by Icinga DB

Oh wait, that would be the wrong way round. If something is missing in Redis, but available in the SQL database, it would be set as delta.Delete and not .Create.

I tried reproducing the behavior with some FLUSHALLs, but this did not work out. However, I was able to crash Icinga DB in another way when flushing during the sync, but this is fine, I guess.

redis: HMGET icinga:downtime: field "4bd67fe5e10611a5c312ba1b05d8cfe6d7317662" missing
redis: HMGET icinga:downtime: field "5aacf4988f2686df77cc5c5660855d2deafd1185" missing
redis: HMGET icinga:downtime: field "6e97e19d87853b00477c575a2ed25c0c6204ac2e" missing
redis: HMGET icinga:downtime: field "8d77e6661693adba73c4802c98a0bff07ad07126" missing
redis: HMGET icinga:downtime: field "d3537844bfa9e7b4027c106d5ade0eb9c02720d8" missing
redis: HMGET icinga:downtime: field "d7ae5700b04633be4ccd6d52daa397fbdc1d1c14" missing
redis: HMGET icinga:downtime: field "0ce73ba7968ec4350c7a7d0ca4b00bc811336356" missing
redis: HMGET icinga:downtime: field "2c0bbf8054855095870030ce9a92bfaefea7a1fc" missing
redis: HMGET icinga:downtime: field "3d028f5662ddf7c3a9152c6bef4461a57eb993cb" missing
redis: HMGET icinga:downtime: field "4924e173f3ed5ea0001ef275818e85c3e827cccd" missing
redis: HMGET icinga:downtime: field "6444a66dbc37c526d03c2bc2e921cf6bf0ca6c92" missing
redis: HMGET icinga:downtime: field "bc4652319400f81d232884583cd6f96173146d14" missing
redis: HMGET icinga:downtime: field "c658c0430fa9e59a8e1b7b452c2782c299605e2b" missing
redis: Finished fetching from icinga:downtime with 0 items in 367.842935ms
database: Finished executing "SELECT \"properties_checksum\", \"id\" FROM \"host_state\" WHERE \"environment_id\" = :environment_id" with 15944>
database: Finished executing "UPDATE \"host_state\" SET \"is_overdue\" = :is_overdue, \"id\" = :id WHERE id = :id" with 6207 rows in 5.41943569>
database: Finished executing "SELECT \"id\" FROM \"host_customvar\" WHERE \"environment_id\" = :environment_id" with 125000 rows in 3.172386254s
redis: Finished fetching from icinga:host:customvar with 249905 items in 3.172477344s
config-sync: Aborted config sync after 8.016455329s
config-sync: Aborted initial state sync after 8.016473497s
can't copy from closed channel
github.com/icinga/icinga-go-library/com.CopyFirst[...]
        /Users/yhabteab/Workspace/go/icinga-go-library/com/com.go:64
github.com/icinga/icinga-go-library/database.(*DB).UpsertStreamed
        /Users/yhabteab/Workspace/go/icinga-go-library/database/db.go:708
github.com/icinga/icingadb/pkg/icingadb.Sync.ApplyDelta.func2
        /Users/yhabteab/Workspace/go/icingadb/pkg/icingadb/sync.go:158
golang.org/x/sync/errgroup.(*Group).Go.func1
        /Users/yhabteab/Workspace/go/pkg/mod/golang.org/x/[email protected]/errgroup/errgroup.go:78
runtime.goexit
        /opt/homebrew/opt/go/libexec/src/runtime/asm_amd64.s:1700
can't copy first entity
github.com/icinga/icinga-go-library/database.(*DB).UpsertStreamed
        /Users/yhabteab/Workspace/go/icinga-go-library/database/db.go:710
github.com/icinga/icingadb/pkg/icingadb.Sync.ApplyDelta.func2
        /Users/yhabteab/Workspace/go/icingadb/pkg/icingadb/sync.go:158
golang.org/x/sync/errgroup.(*Group).Go.func1
        /Users/yhabteab/Workspace/go/pkg/mod/golang.org/x/[email protected]/errgroup/errgroup.go:78
runtime.goexit
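
For context, the "can't copy from closed channel" error presumably originates from a helper that peeks at the first element of a channel before streaming the rest, e.g. so that UpsertStreamed can derive its statement from that first entity. A hypothetical sketch of such a CopyFirst-style helper (not the actual icinga-go-library code) shows how an aborted sync, closing the producing channel before anything was sent, yields exactly this error:

package main

import (
	"errors"
	"fmt"
)

// copyFirst waits for the first element of in. If in is closed before any
// element arrives (e.g. because the producing sync was cancelled), there is
// nothing to copy and an error is returned. Otherwise the first element is
// returned and also re-emitted on a new channel, followed by everything else
// still arriving on in.
func copyFirst[T any](in <-chan T) (T, <-chan T, error) {
	first, ok := <-in
	if !ok {
		var zero T
		return zero, nil, errors.New("can't copy from closed channel")
	}

	out := make(chan T)
	go func() {
		defer close(out)
		out <- first // re-emit the first element ...
		for v := range in {
			out <- v // ... followed by the remaining ones
		}
	}()
	return first, out, nil
}

func main() {
	closed := make(chan int)
	close(closed) // simulate a producer that was cancelled before sending anything

	if _, _, err := copyFirst(closed); err != nil {
		fmt.Println(err) // prints: can't copy from closed channel
	}
}

In the trace above, this error is then wrapped into "can't copy first entity", matching the two stacked errors in the log.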

Another observation: during experiments with the other node stopped, no SQL deadlock errors occurred. When I wanted to verify this, I re-ran the same scenario with both nodes active, and again no deadlocks occurred.

@oxzi oxzi added the ref/IP label Oct 11, 2024
@oxzi
Member Author

oxzi commented Oct 11, 2024

ref/IP/55850

(Quite unsure if this applies here..)

@yhabteab
Member

The used Icinga DB version was, afaik, the latest Icinga DB release v1.2.0 with #800 cherry-picked on top.

Actually, the Icinga DB binaries are built from #800, i.e. they are not based on the v1.2.0 tag.

Please note that the first node crashed, but the other then took over and worked without any issues.

IIRC, Icinga DB was actually behaving quite unstably on both nodes and was crashing after only 5m on both of them, not just on the other side. Also, didn't I mention before I left that it had to do with the MariaDB version I was using? If I'm not mistaken, MariaDB version 10.5.x was in use before I deleted MariaDB completely and replaced it with MySQL. However, all the crashes disappeared after replacing it with MySQL, and Icinga DB has been running quite normally since then.

@oxzi
Member Author

oxzi commented Oct 18, 2024

However, all the crashes disappeared after replacing it with MySQL, and Icinga DB has been running quite normally since then.

Unfortunately not. The reported error in this issue happened after you left. I am quite certain, as I remember my frustration at seeing it crash after your change, which (temporarily) fixed the issue. However, maybe this issue is due to a prior failure; I simply don't know :/

@yhabteab
Member

Unfortunately not. The reported error in this issue happened after you left.

Are you sure about that? The crash logs you provided are all from Friday afternoon, which is before I replaced MariaDB with MySQL.

{
  "PRIORITY": "2",
  "_SOURCE_REALTIME_TIMESTAMP": "2024-09-27T15:29:45Z",
  "MESSAGE": "Error 1205 (HY000): Lock wait timeout exceeded; try restarting transaction\ncan't perform \"INSERT INTO
...

@oxzi
Member Author

oxzi commented Oct 21, 2024

The timestamp is in UTC, which is 17:29:45 in our current timezone. Based on our very important Greetings chat room, you had already left at this point.
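
(Assuming Europe/Berlin as "our current timezone", the conversion can be double-checked with GNU date:)

$ TZ=Europe/Berlin date -d '2024-09-27T15:29:45Z' '+%F %T %Z'
2024-09-27 17:29:45 CEST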
