You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
InboundBuffer<TEvent>, in the WaitToReadAsync method, sends a CancellationToken to a ChannelReader from System.Threading.Channels.
Per dotnet/runtime#761 (comment) this causes a memory leak which is intentional- the engineers for that project do not keep the channel's objects in a doubly-linked list, so, for performance reasons, they do not deal with cancellation token events until the next write where they then remove the cancelled objects from the list. However, even after a write on the channel, this leaves the generated System.OperationCanceledException objects pinned that will never get garbage collected. These build up over time, forming a memory leak.
Code to Reproduce
For brevity, the issue can be reproduced using the following single line of code, followed by a while (true) { } and then waiting:
var noopChannel = new NoopBufferedChannel(new NoopBufferedChannel.NoopChannelOptions());
We've setup a detailed repository here which has more detail and fully working real-life usage code, showing how we stumbled across the issue when upgrading our system to ElasticSearch 8.x and using the latest Serilog NuGet package (v4.2.0) to send logs to Elastic:
That repo also includes additional screenshots showing the memory leak captured by Visual Studio's Diagnostic Tools. Please check out the repo's README for more details.
Our Use Case
This memory leak is slow. In our testing and with default buffer options, it grows at ~600kb per hour for a single instance of a BufferedChannel from Elastic.Channels. However, we have hundreds of instances of microservices that run for weeks as part of a web application service. For security and auditing reasons, each instance is required to write relevant logs to our Elastic cluster. We have seen scenarios where tens of gigabytes of memory is taken up entirely by just collective instances of System.OperationCanceledException which all originate from InboundBuffer<TEvent>.
Suggested Solution
The System.Threading.Channels engineers have a lot of other use cases to consider and do not have any plan (currently) to switch over to something like a doubly-linked list for their channel implementation. We ended up modifying the Elastic.Channels assembly to not pass a cancellation token to them, but instead handle it internally. The solution isn't exactly elegant, but it retains functionality while also eliminating the memory leak. It is also a low performance-overhead cost for this specific usage of ChannelReader, where the timeout event is expected to occur only once every several seconds (the default for OutboundBufferMaxLifetime is 5 seconds). We are currently using this in production:
in /src/Elastic.Channels/Buffers/InboundBuffer.cs, and specifically in the WaitToReadAsync method, in the main try loop, we do the following:
try
{
_breaker.CancelAfter(Wait);
var readTask = reader.WaitToReadAsync().AsTask();
var delayTask = Task.Delay(Timeout.Infinite, _breaker.Token);
var completedTask = await Task.WhenAny(readTask, delayTask).ConfigureAwait(false);
if (completedTask == delayTask)
{
throw new OperationCanceledException(_breaker.Token);
}
_breaker.CancelAfter(-1);
return await readTask.ConfigureAwait(false);
}
This allows Elastic.Channels to continue using System.Threading.Channels.
We are hoping to get this addressed in the main Elastic.Channels package so that we do not have to maintain a separate copy of the code and the resulting supply-chain-management impacts.
The text was updated successfully, but these errors were encountered:
Summary
InboundBuffer<TEvent>
, in theWaitToReadAsync
method, sends aCancellationToken
to aChannelReader
fromSystem.Threading.Channels
.Per dotnet/runtime#761 (comment) this causes a memory leak which is intentional- the engineers for that project do not keep the channel's objects in a doubly-linked list, so, for performance reasons, they do not deal with cancellation token events until the next write where they then remove the cancelled objects from the list. However, even after a write on the channel, this leaves the generated
System.OperationCanceledException
objects pinned that will never get garbage collected. These build up over time, forming a memory leak.Code to Reproduce
For brevity, the issue can be reproduced using the following single line of code, followed by a
while (true) { }
and then waiting:var noopChannel = new NoopBufferedChannel(new NoopBufferedChannel.NoopChannelOptions());
We've setup a detailed repository here which has more detail and fully working real-life usage code, showing how we stumbled across the issue when upgrading our system to ElasticSearch 8.x and using the latest Serilog NuGet package (v4.2.0) to send logs to Elastic:
https://github.com/b-meridian/eschannels
That repo also includes additional screenshots showing the memory leak captured by Visual Studio's Diagnostic Tools. Please check out the repo's README for more details.
Our Use Case
This memory leak is slow. In our testing and with default buffer options, it grows at ~600kb per hour for a single instance of a
BufferedChannel
fromElastic.Channels
. However, we have hundreds of instances of microservices that run for weeks as part of a web application service. For security and auditing reasons, each instance is required to write relevant logs to our Elastic cluster. We have seen scenarios where tens of gigabytes of memory is taken up entirely by just collective instances ofSystem.OperationCanceledException
which all originate fromInboundBuffer<TEvent>
.Suggested Solution
The
System.Threading.Channels
engineers have a lot of other use cases to consider and do not have any plan (currently) to switch over to something like a doubly-linked list for their channel implementation. We ended up modifying theElastic.Channels
assembly to not pass a cancellation token to them, but instead handle it internally. The solution isn't exactly elegant, but it retains functionality while also eliminating the memory leak. It is also a low performance-overhead cost for this specific usage ofChannelReader
, where the timeout event is expected to occur only once every several seconds (the default forOutboundBufferMaxLifetime
is 5 seconds). We are currently using this in production:in
/src/Elastic.Channels/Buffers/InboundBuffer.cs
, and specifically in theWaitToReadAsync
method, in the maintry
loop, we do the following:This allows
Elastic.Channels
to continue usingSystem.Threading.Channels
.We are hoping to get this addressed in the main
Elastic.Channels
package so that we do not have to maintain a separate copy of the code and the resulting supply-chain-management impacts.The text was updated successfully, but these errors were encountered: