Use correct cancelAfter broadcast when resources are exhausted #1767

danieldoglas · 2024-06-04T16:51:58Z

Details

We're currently facing replication timing issues. As a result, the number of replication threads grows too large and exhausts resources.

Considering that all replication messages can arrive in parallel and not necessarily in order, followers could get stuck in the following case:

receives commit 1, creates thread 1
receives commit 3, creates thread 2
receives commit 2, tries to create thread 3, but resources are exhausted and thread creation fails

Since no thread was created for commit 2, it is never applied. Thread 2, which cannot proceed until commit 2 completes, gets stuck in an infinite loop. The server then never returns to the SEARCHING state, and can only rejoin the pool after a restart.
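The scenario above can be sketched as a small model. This is not Bedrock code: `threadsToCancel` is a hypothetical helper, and it assumes that a cancel with a given `cancelAfter` drops every replication thread working on a commit greater than that value, as the code comment in this PR describes.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical model, not part of SQLiteNode. `threadedCommits` holds the
// commit IDs that already have a replication thread. `failedCommit` is the
// commit for which thread creation failed (assumed to be greater than 0).
// Returns the commit IDs whose threads would be dropped.
std::vector<uint64_t> threadsToCancel(const std::vector<uint64_t>& threadedCommits,
                                      uint64_t failedCommit) {
    // Assumption taken from the PR: cancel everything after the last commit
    // that precedes the one we could not create a thread for.
    const uint64_t cancelAfter = failedCommit - 1;

    std::vector<uint64_t> cancelled;
    for (uint64_t commit : threadedCommits) {
        if (commit > cancelAfter) {
            cancelled.push_back(commit);
        }
    }
    return cancelled;
}
```

With threads for commits 1 and 3 and a failure on commit 2, only the thread for commit 3 is dropped, so nothing is left waiting on a commit that will never be applied.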

Fixed Issues

Fixes GH_LINK

Tests

Internal Testing Reminder: when changing bedrock, please compile auth against your new changes

tylerkaraszewski · 2024-06-04T17:36:33Z

sqlitecluster/SQLiteNode.cpp

@@ -1932,7 +1939,7 @@ void SQLiteNode::_changeState(SQLiteNodeState newState) {
        // If we were following, and now we're not, we give up an any replications.
        if (_state == SQLiteNodeState::FOLLOWING) {
            _replicationThreadsShouldExit = true;
-            uint64_t cancelAfter = _leaderCommitNotifier.getValue();
+            uint64_t cancelAfter = commitIDToCancelAfter > 0 ? commitIDToCancelAfter : _leaderCommitNotifier.getValue();


I would just use the boolean value of commitIDToCancelAfter instead of comparing to 0.
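A minimal sketch of that suggestion, assuming the logic is pulled into a free function for illustration (`pickCancelAfter` and `leaderCommitValue` are hypothetical names; in the PR the expression is inline and the fallback comes from `_leaderCommitNotifier.getValue()`):

```cpp
#include <cstdint>

// Hypothetical helper mirroring the changed line in _changeState.
uint64_t pickCancelAfter(uint64_t commitIDToCancelAfter, uint64_t leaderCommitValue) {
    // For an unsigned integer, any nonzero value converts to true, so the
    // explicit `> 0` comparison is equivalent to testing the value itself.
    return commitIDToCancelAfter ? commitIDToCancelAfter : leaderCommitValue;
}
```

Both forms behave identically for `uint64_t`; the boolean form is only shorter.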

tylerkaraszewski · 2024-06-04T17:37:14Z

sqlitecluster/SQLiteNode.cpp

+                    // and waiting for the transaction that failed will be stuck in an infinite loop. To prevent that
+                    // we're changing the state to SEARCHING and sending the cancelAfter property to drop all threads
+                    // that depend on the transaction that failed to be threaded.
+                    uint64_t cancelAfter = message.calcU64("NewCount") - 1;


Might want to check NewCount against 0 first, but this is an edge case that I don't think we can actually trigger.
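A minimal sketch of such a guard, under stated assumptions: `cancelAfterFor` is a hypothetical helper, `newCount` stands in for `message.calcU64("NewCount")`, and clamping to 0 is just one possible choice for the edge case, not something the PR specifies.

```cpp
#include <cstdint>

// Hypothetical helper; not part of the merged change.
uint64_t cancelAfterFor(uint64_t newCount) {
    // Unsigned subtraction wraps around: 0 - 1 yields UINT64_MAX, and no
    // commit ID is greater than that, so nothing would ever be cancelled.
    // Clamping to 0 here is an assumption made for illustration.
    return newCount > 0 ? newCount - 1 : 0;
}
```

For any `NewCount` of 1 or more this returns the same value as the line under review.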

danieldoglas added 3 commits June 4, 2024 18:43

calculating the cancel after based on the NewCount we failed to create a thread for

92f7272

adding comment

6daf253

adjust comment

4dac256

danieldoglas requested a review from tylerkaraszewski June 4, 2024 16:51

danieldoglas self-assigned this Jun 4, 2024

danieldoglas added 4 commits June 4, 2024 18:53

removing format

d2053e5

change state to searching

a8ef77d

drying the code

a6f3e53

fixing comment

fbb1f11

tylerkaraszewski requested changes Jun 4, 2024

addressing comments

d4fbfc9

danieldoglas requested a review from tylerkaraszewski June 4, 2024 17:42

tylerkaraszewski approved these changes Jun 4, 2024

danieldoglas requested a review from johnmlee101 June 4, 2024 18:01

johnmlee101 approved these changes Jun 4, 2024

tylerkaraszewski merged commit 69a1ba7 into main Jun 4, 2024
1 check passed

tylerkaraszewski deleted the dsilva_useCorrectCancelAfterWhenResourcesAreExhausted branch June 4, 2024 18:26

flodnv mentioned this pull request Oct 31, 2024

Decrement thread count on exception. #1929

Merged
