New built in health check for the high water detection going stale #3455

Open
jeremydmiller opened this issue Oct 2, 2024 · 1 comment

Comments

@jeremydmiller
Member

The logic would be: report unhealthy if the high water mark is falling behind the event sequence by more than a certain threshold.

@jeremydmiller jeremydmiller added this to the Marten 8.0 milestone Oct 23, 2024
@iamsamcoder

I encountered the high water mark going stale this morning. I have a web client running Marten, and a separate console app (single instance) running the async daemon.

The high water mark went stale during an Azure maintenance operation. The web client reported Npgsql.PostgresException with the message "Unexpected processing exception. See inner exception. 08P01: server login has been failing, try again later (server_login_retry)". There were 500 of these exceptions around the time the high water mark went stale.

The web client recovered and continued Marten operations. However, the app running the daemon was no longer processing async projections: the high water mark and all projection shards were stalled on the same seqId. The stall continued until I restarted the async daemon application.
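
To make "stalled on the same seqId" concrete, here is a minimal, language-neutral sketch (written in Python, not Marten code) of how a monitor could tell a genuine stall from an idle system. It assumes the caller already samples the high water mark and the highest event sequence number periodically; `StallDetector`, `observe`, and `stale_after_seconds` are hypothetical names, not part of Marten.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StallDetector:
    """Flags a high water mark that stops advancing while events keep arriving."""

    stale_after_seconds: float  # hypothetical tolerance, chosen by the operator
    _last_mark: Optional[int] = None
    _last_change_at: Optional[float] = None

    def observe(self, high_water_mark: int, max_event_seq: int, now: float) -> bool:
        """Record one sample and return True if the mark is considered stale."""
        if self._last_mark is None or high_water_mark != self._last_mark:
            # First sample, or the mark moved: restart the clock.
            self._last_mark = high_water_mark
            self._last_change_at = now
            return False
        if max_event_seq <= high_water_mark:
            # No unprocessed events, so an unchanged mark is expected (idle system).
            self._last_change_at = now
            return False
        # Events are waiting and the mark has not moved since _last_change_at.
        return (now - self._last_change_at) >= self.stale_after_seconds
```

With a 60 second tolerance, a mark stuck at 100 while the event sequence has reached 150 would be reported as stale once 60 seconds have passed without movement, whereas a mark equal to the latest sequence number is never reported.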

Reviewing the logs for the async daemon app, the only exceptions are related to logging in to the server, presumably during the Azure maintenance task.

"[10:59:23 ERR] Failed while trying to detect high water statistics for database Marten"
"Npgsql.PostgresException (0x80004005): 08P01: server login has been failing, try again later (server_login_retry)"
"   at Npgsql.Internal.NpgsqlConnector.ReadMessageLong(Boolean async, DataRowLoadingMode dataRowLoadingMode, Boolean readingNotifications, Boolean isReadingPrependedMessage)"
"   at System.Runtime.CompilerServices.PoolingAsyncValueTaskMethodBuilder`1.StateMachineBox`1.System.Threading.Tasks.Sources.IValueTaskSource<TResult>.GetResult(Int16 token)"
"   at Npgsql.Internal.NpgsqlConnector.Authenticate(String username, NpgsqlTimeout timeout, Boolean async, CancellationToken cancellationToken)"
"   at Npgsql.Internal.NpgsqlConnector.<Open>g__OpenCore|213_1(NpgsqlConnector conn, SslMode sslMode, NpgsqlTimeout timeout, Boolean async, CancellationToken cancellationToken, Boolean isFirstAttempt)"
"   at Npgsql.Internal.NpgsqlConnector.Open(NpgsqlTimeout timeout, Boolean async, CancellationToken cancellationToken)"
"   at Npgsql.PoolingDataSource.OpenNewConnector(NpgsqlConnection conn, NpgsqlTimeout timeout, Boolean async, CancellationToken cancellationToken)"
"   at Npgsql.PoolingDataSource.<Get>g__RentAsync|34_0(NpgsqlConnection conn, NpgsqlTimeout timeout, Boolean async, CancellationToken cancellationToken)"
"   at Npgsql.NpgsqlConnection.<Open>g__OpenAsync|42_0(Boolean async, CancellationToken cancellationToken)"
"   at Marten.Storage.MartenDatabase.SingleQuery`1.ExecuteAsync(CancellationToken cancellation)"
"   at Marten.Storage.MartenDatabase.SingleQuery`1.ExecuteAsync(CancellationToken cancellation)"
"   at Polly.ResiliencePipeline.<>c__9`2.<<ExecuteAsync>b__9_0>d.MoveNext()"
"--- End of stack trace from previous location ---"
"   at Polly.Outcome`1.GetResultOrRethrow()"
"   at Polly.ResiliencePipeline.ExecuteAsync[TResult,TState](Func`3 callback, TState state, CancellationToken cancellationToken)"
"   at Marten.Events.Daemon.HighWater.HighWaterDetector.loadCurrentStatistics(CancellationToken token)"
"   at Marten.Events.Daemon.HighWater.HighWaterDetector.Detect(CancellationToken token)"
"   at Marten.Events.Daemon.HighWater.HighWaterAgent.detectChanges()"
"  Exception data:"
"    Severity: FATAL"
"    SqlState: 08P01"
"    MessageText: server login has been failing, try again later (server_login_retry)"

Just after the server login error, there is an EndOfStreamException. After that, the high water mark remained stale until I restarted the app running the async daemon.

"[11:00:36 ERR] Health check eventProgressionHealth threw an unhandled exception after 1.9434ms"
"Npgsql.NpgsqlException (0x80004005): Exception while reading from stream"
" ---> System.IO.EndOfStreamException: Attempted to read past the end of the stream."
"   at Npgsql.Internal.NpgsqlReadBuffer.<Ensure>g__EnsureLong|55_0(NpgsqlReadBuffer buffer, Int32 count, Boolean async, Boolean readingNotifications)"
"   at Npgsql.Internal.NpgsqlReadBuffer.<Ensure>g__EnsureLong|55_0(NpgsqlReadBuffer buffer, Int32 count, Boolean async, Boolean readingNotifications)"
"   at System.Runtime.CompilerServices.PoolingAsyncValueTaskMethodBuilder`1.StateMachineBox`1.System.Threading.Tasks.Sources.IValueTaskSource.GetResult(Int16 token)"
"   at Npgsql.Internal.NpgsqlConnector.ReadMessageLong(Boolean async, DataRowLoadingMode dataRowLoadingMode, Boolean readingNotifications, Boolean isReadingPrependedMessage)"
"   at System.Runtime.CompilerServices.PoolingAsyncValueTaskMethodBuilder`1.StateMachineBox`1.System.Threading.Tasks.Sources.IValueTaskSource<TResult>.GetResult(Int16 token)"
"   at Npgsql.Internal.NpgsqlConnector.ReadMessageLong(Boolean async, DataRowLoadingMode dataRowLoadingMode, Boolean readingNotifications, Boolean isReadingPrependedMessage)"
"   at System.Runtime.CompilerServices.PoolingAsyncValueTaskMethodBuilder`1.StateMachineBox`1.System.Threading.Tasks.Sources.IValueTaskSource<TResult>.GetResult(Int16 token)"
"   at Npgsql.NpgsqlDataReader.NextResult(Boolean async, Boolean isConsuming, CancellationToken cancellationToken)"
"   at Npgsql.NpgsqlDataReader.NextResult(Boolean async, Boolean isConsuming, CancellationToken cancellationToken)"
"   at Npgsql.NpgsqlCommand.ExecuteReader(Boolean async, CommandBehavior behavior, CancellationToken cancellationToken)"
"   at Npgsql.NpgsqlCommand.ExecuteReader(Boolean async, CommandBehavior behavior, CancellationToken cancellationToken)"
"   at Npgsql.NpgsqlCommand.ExecuteDbDataReaderAsync(CommandBehavior behavior, CancellationToken cancellationToken)"
"   at Weasel.Postgresql.CommandBuilderExtensions.ExecuteReaderAsync(NpgsqlConnection connection, CommandBuilder commandBuilder, NpgsqlTransaction tx, CancellationToken ct)"
"   at Marten.Storage.MartenDatabase.AllProjectionProgress(CancellationToken token)"
"   at Marten.Storage.MartenDatabase.AllProjectionProgress(CancellationToken token)"
"   at Marten.AdvancedOperations.AllProjectionProgress(String tenantId, CancellationToken token)"
"   at Microsoft.Extensions.Diagnostics.HealthChecks.DefaultHealthCheckService.RunCheckAsync(HealthCheckRegistration registration, CancellationToken cancellationToken)"

My plan is to update my Marten health check service to also include the max event stream seqId and return unhealthy if the high water mark or a projection trails it by more than some threshold X. I think this is what you are planning for the built-in health check as well.
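
A rough sketch of that threshold check, written in Python as language-neutral logic rather than as Marten code. It assumes the three inputs (the max event seqId, the high water mark, and per-shard progress such as the values behind the `AllProjectionProgress` call seen in the trace above) have already been loaded; `check_event_progress` and `max_lag` are hypothetical names.

```python
from typing import Dict, List, Tuple


def check_event_progress(
    max_event_seq: int,
    high_water_mark: int,
    shard_progress: Dict[str, int],
    max_lag: int,
) -> Tuple[bool, List[str]]:
    """Return (healthy, reasons).

    Unhealthy if the high water mark or any shard trails
    max_event_seq by more than max_lag events.
    """
    reasons: List[str] = []

    # Compare the high water mark itself against the newest event.
    lag = max_event_seq - high_water_mark
    if lag > max_lag:
        reasons.append(f"high water mark is {lag} events behind")

    # Compare each projection shard against the newest event.
    for name, seq in sorted(shard_progress.items()):
        shard_lag = max_event_seq - seq
        if shard_lag > max_lag:
            reasons.append(f"{name} is {shard_lag} events behind")

    return (not reasons, reasons)
```

Comparing against the max event seqId rather than only against the high water mark matters for the failure described here: when the high water mark and every shard stall on the same seqId, the shards look fully caught up relative to the mark, and only the newest event number reveals the gap.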
