Use correct cancelAfter broadcast when resources are exhausted #1767
Conversation
sqlitecluster/SQLiteNode.cpp
Outdated
@@ -1932,7 +1939,7 @@ void SQLiteNode::_changeState(SQLiteNodeState newState) {
    // If we were following, and now we're not, we give up on any replications.
    if (_state == SQLiteNodeState::FOLLOWING) {
        _replicationThreadsShouldExit = true;
-       uint64_t cancelAfter = _leaderCommitNotifier.getValue();
+       uint64_t cancelAfter = commitIDToCancelAfter > 0 ? commitIDToCancelAfter : _leaderCommitNotifier.getValue();
I would just use the boolean value of commitIDToCancelAfter instead of comparing to 0.
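A minimal sketch of the suggested simplification, assuming commitIDToCancelAfter is a uint64_t so its implicit bool conversion is equivalent to the > 0 comparison:

```cpp
// Sketch of the suggestion above (not the final PR code): rely on the implicit
// bool conversion of the unsigned commitIDToCancelAfter rather than comparing to 0.
uint64_t cancelAfter = commitIDToCancelAfter ? commitIDToCancelAfter : _leaderCommitNotifier.getValue();
```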
sqlitecluster/SQLiteNode.cpp
Outdated
// and waiting for the transaction that failed will be stuck in an infinite loop. To prevent that
// we're changing the state to SEARCHING and sending the cancelAfter property to drop all threads
// that depend on the transaction that failed to be threaded.
uint64_t cancelAfter = message.calcU64("NewCount") - 1;
Might want to check NewCount against 0 first, but this is an edge case that I don't think we can actually trigger.
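A hedged sketch of the guard this comment suggests, assuming the concern is unsigned wrap-around when NewCount is 0 (calcU64 returns a uint64_t in the diff above):

```cpp
// Hypothetical guard, not code from this PR: if NewCount were 0, subtracting 1
// from the unsigned value would wrap to UINT64_MAX, so clamp it to 0 instead.
uint64_t newCount = message.calcU64("NewCount");
uint64_t cancelAfter = newCount ? newCount - 1 : 0;
```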
Details
Currently, we're facing issues with replication timing. As a result, the number of replication threads grows too large, causing resource exhaustion.
Since replication messages can arrive in parallel and not necessarily in order, followers could get stuck in the following case:
Since no threads were created for commit 2, it is never applied. Thread 2, which depends on commit 2 being completed, gets stuck in an infinite loop. The server then never goes back to a SEARCHING state, and can only rejoin the pool after a restart.
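The sketch below is illustrative only, not SQLiteNode code: a hypothetical notifier shows how a thread waiting for a commit that no other thread will ever apply blocks forever, and how broadcasting a cancelAfter value that excludes that commit releases it, which is the behavior this PR restores.

```cpp
#include <condition_variable>
#include <cstdint>
#include <iostream>
#include <mutex>
#include <thread>

// Hypothetical stand-in for a sequential commit notifier.
struct CommitNotifier {
    std::mutex m;
    std::condition_variable cv;
    uint64_t applied = 1;      // Highest commit applied so far.
    uint64_t cancelAfter = 0;  // 0 means "no cancellation requested".

    // Returns true if `commit` was applied, false if the wait was cancelled.
    bool waitFor(uint64_t commit) {
        std::unique_lock<std::mutex> lock(m);
        cv.wait(lock, [&] { return applied >= commit || (cancelAfter && commit > cancelAfter); });
        return applied >= commit;
    }

    // Broadcast a cancellation point: waiters for commits above `value` give up.
    void cancelAfterCommit(uint64_t value) {
        {
            std::lock_guard<std::mutex> lock(m);
            cancelAfter = value;
        }
        cv.notify_all();
    }
};

int main() {
    CommitNotifier notifier;
    // A replication thread depends on commit 2, but no thread was ever created
    // to apply commit 2, so without a cancellation this wait never returns.
    std::thread waiter([&] {
        bool applied = notifier.waitFor(2);
        std::cout << (applied ? "commit 2 applied\n" : "wait cancelled\n");
    });
    // The fix: broadcast a cancelAfter that excludes the missing commit (here 1,
    // analogous to NewCount - 1) so every thread waiting past it is released.
    notifier.cancelAfterCommit(1);
    waiter.join();
}
```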
Fixed Issues
Fixes GH_LINK
Tests
Internal Testing Reminder: when changing Bedrock, please compile Auth against your new changes.