How to prevent "Scheduled sending of heartbeat was delayed" and occasional network partitions #4432
Comments
The original problem was #4419.
FYI: We improved the ClusterTester program to also send pings via Akka to the seed node. Let's check how it looks after the weekend.
Results from our weekend test: no errors, warnings, or missing heartbeats when running our ClusterTester. That means the error is not in Akka.Net; it is in our Akka usage code.
So I think this actually might be related to the Akka.Cluster actors running on the shared
Resolved via #4511
Akka.Net: 1.4.5
Windows Server 2012 R2
Reproduce:
Clone, compile, and run ClusterTester on two machines: one is a seed node, the other a non-seed node.
The test program doesn't do anything and produces no load; it just creates a minimal cluster.
Let it run for several days.
Log from SeedNode:
Log from non seed node:
Observation:
Occasionally we log missing heartbeats from the non-seed node. The pauses typically range from 2 to 3 seconds; the largest are more than 30 seconds. The longer the heartbeat pause, the more often we get a network partition.
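For context, how long a heartbeat pause the cluster tolerates before marking a node unreachable is governed by the `akka.cluster.failure-detector` section of the HOCON configuration. The fragment below is only a sketch of the relevant keys; the values are illustrative assumptions, not the settings used in this report and not a recommended fix.

```hocon
akka.cluster.failure-detector {
  # How often heartbeats are sent to monitored nodes.
  heartbeat-interval = 1s
  # Phi threshold; higher values tolerate more jitter but detect real failures later.
  threshold = 12.0
  # Pause length accepted without being treated as a failure (illustrative value).
  acceptable-heartbeat-pause = 10s
}
```

Loosening these values would at best mask pauses like the 30-second ones observed here; it does not explain why the heartbeat sender was delayed in the first place.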
How can this happen, and what can we do about it?
The real problem is that in our production environment, where there is more load on the system, these heartbeat pauses occur more often and cause network partitions multiple times a day.
As a workaround we implemented:
but there are still network partitions.
To rule out network issues or hiccups, we run a ping in parallel with the ClusterTester. To date we have collected more than 1.5 million pings, all completing in 1 ms or less, with no hiccups or errors:
What else can we do to investigate this further?