
Falling behind in health check. JetStream is not current with the meta leader. [v2.10.21] #5976

Open
a-alinichenko opened this issue Oct 8, 2024 · 10 comments
Labels: defect (Suspected defect such as a bug or regression)

Comments

@a-alinichenko

Observed behavior

Each NATS cluster restart triggers this problem.
When we restart the NATS cluster, we see the following logs and the affected pod does not become ready:

[7] 2024/10/03 10:36:57.943668 [WRN] RAFT [9R4RvVMj - _meta_] Falling behind in health check, commit 59078 != applied 58988
[7] 2024/10/03 10:36:57.943701 [WRN] Healthcheck failed: "JetStream is not current with the meta leader"
[7] 2024/10/03 10:37:07.943402 [WRN] RAFT [9R4RvVMj - _meta_] Falling behind in health check, commit 59108 != applied 58988
[7] 2024/10/03 10:37:07.943423 [WRN] Healthcheck failed: "JetStream is not current with the meta leader"
[7] 2024/10/03 10:37:17.943138 [WRN] RAFT [9R4RvVMj - _meta_] Falling behind in health check, commit 59108 != applied 58988
[7] 2024/10/03 10:37:17.943163 [WRN] Healthcheck failed: "JetStream is not current with the meta leader"
[7] 2024/10/03 10:37:27.943352 [WRN] RAFT [9R4RvVMj - _meta_] Falling behind in health check, commit 59188 != applied 58988
[7] 2024/10/03 10:37:27.943400 [WRN] Healthcheck failed: "JetStream is not current with the meta leader"
[7] 2024/10/03 10:37:37.942694 [WRN] RAFT [9R4RvVMj - _meta_] Falling behind in health check, commit 59298 != applied 58988
[7] 2024/10/03 10:37:37.942806 [WRN] Healthcheck failed: "JetStream is not current with the meta leader"
[7] 2024/10/03 10:37:47.943473 [WRN] RAFT [9R4RvVMj - _meta_] Falling behind in health check, commit 59308 != applied 58988
[7] 2024/10/03 10:37:47.943498 [WRN] Healthcheck failed: "JetStream is not current with the meta leader"
[7] 2024/10/03 10:37:57.943104 [WRN] RAFT [9R4RvVMj - _meta_] Falling behind in health check, commit 59388 != applied 58988
[7] 2024/10/03 10:37:57.943135 [WRN] Healthcheck failed: "JetStream is not current with the meta leader"

While this is happening, the consumer does not read messages and they accumulate on disk.
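For context, the check that keeps failing is the one backing the Kubernetes readiness probe against the server's monitoring endpoint. With the official Helm chart it looks roughly like the sketch below; the path, port, and period are assumptions about the chart defaults, not values taken from this deployment. The /healthz endpoint reports an error while the applied index lags the commit index, so the pod stays out of the Service for as long as the warnings above repeat.

readinessProbe:        # sketch of the assumed chart default
  httpGet:
    path: /healthz     # full health check; fails with "JetStream is not current with the meta leader"
    port: monitor      # the chart's named monitoring port (8222), assumed
  periodSeconds: 10    # matches the ~10 s spacing of the warnings in the log above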

Expected behavior

Restarting the pod does not lead to operational problems.

Server and client version

NATS server version: 2.10.21
Go client: github.com/nats-io/nats.go v1.34.1

Host environment

Filter subject: 1

Stream info example:

Information for Stream ch_queue6 created 2024-09-18 18:54:04

              Subjects: requests6, antibot6
              Replicas: 1
               Storage: File
     Placement Cluster: nats-sessions-api
        Placement Tags: pod:nats-sessions-api-6

Options:

             Retention: WorkQueue
       Acknowledgments: true
        Discard Policy: Old
      Duplicate Window: 10m0s
     Allows Msg Delete: true
          Allows Purge: true
        Allows Rollups: false

Limits:

      Maximum Messages: 100,000,000,000
   Maximum Per Subject: unlimited
         Maximum Bytes: unlimited
           Maximum Age: 7d0h0m0s
  Maximum Message Size: unlimited
     Maximum Consumers: unlimited

Cluster Information:

                  Name: nats-sessions-api
                Leader: nats-sessions-api-6

State:

              Messages: 961,995
                 Bytes: 1.1 GiB
        First Sequence: 3,884,371,968 @ 2024-10-03 13:11:18 UTC
         Last Sequence: 3,885,335,265 @ 2024-10-03 13:21:41 UTC
      Deleted Messages: 1,303
      Active Consumers: 2
    Number of Subjects: 2

Installed via the official Helm chart in Kubernetes.
7 pods in the cluster.
7 streams (1 per pod), each placed on a different pod via placement tags.

Allocated resources for each pod:
CPU - 4 cores
Mem - 15 GiB

Current load:
6,000-7,000 messages per second.
The Prometheus query to count this:
[Screenshot: Prometheus query used to count the message rate, 2024-10-08]
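For reference, the deployment described above maps roughly onto the official chart's values as sketched below; the config.cluster.replicas and container.merge keys are assumptions about the 1.x chart layout, not the actual values file used here.

config:
  cluster:
    enabled: true
    replicas: 7        # 7 pods in the cluster
  jetstream:
    enabled: true
container:
  merge:
    resources:         # 4 cores / 15 GiB per pod, as listed above
      requests:
        cpu: "4"
        memory: 15Gi
      limits:
        cpu: "4"
        memory: 15Gi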

Steps to reproduce

Just restart a pod under high load.

@a-alinichenko a-alinichenko added the defect Suspected defect such as a bug or regression label Oct 8, 2024
@wallyqs wallyqs changed the title Falling behind in health check. JetStream is not current with the meta leader. Falling behind in health check. JetStream is not current with the meta leader. [v2.10.21] Oct 8, 2024
@derekcollison (Member)

How many JS API operations are you doing per second? You can get a sense of that from nats traffic under the system account.

@a-alinichenko (Author)

@derekcollison
I think this metric shows it:

[Screenshot: metric graph, 2024-10-09]

@derekcollison (Member)

OK, thanks. That rate would not cause those repeated messages about failing to catch up to the meta leader. Something else is going on that is not obvious to us.

@a-alinichenko (Author)

@derekcollison
Should I provide any other details?
NATS is installed in Kubernetes on GCP GKE.
The applications that use it run in the same cluster, in the same network/region.
We have many applications there and see problems only with NATS.

We also run the same NATS cluster configuration with the same applications in another region under a lower load,
and there we don't see such problems with NATS.

@derekcollison (Member)

Let's see if @wallyqs could spend some time on a Zoom call with you.

@wallyqs (Member) commented Oct 9, 2024

Hi @a-alinichenko, ping me at [email protected] and we can try to take a look at the setup/scenario.

@wallyqs (Member) commented Oct 14, 2024

@a-alinichenko I wonder if this could be related to the readiness probe as well, so instead of the current default in the Helm charts you could change it to this:

readinessProbe:
  httpGet:
    path: /          # probe the plain root of the monitoring endpoint instead of the default health check path
    port: monitor    # assumed: the chart's named monitoring port (8222); httpGet requires a port
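One way to apply this through the official Helm chart, assuming the 1.x chart layout where extra container fields can be merged in via container.merge (an assumption, not verified against this deployment), is a values override roughly like this:

container:
  merge:
    readinessProbe:      # replaces the chart's default readiness probe
      httpGet:
        path: /          # plain root instead of the default health check path
        port: monitor    # the chart's named monitoring port (assumed)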

@a-alinichenko (Author)

@wallyqs Thank you for your answer!
I can change it on our side and test it.
But is it fine to have warnings like the ones I described in the logs?
And will simply changing the healthcheck parameters not cause any other problems?

@wallyqs (Member) commented Oct 15, 2024

@a-alinichenko in v2.11 we have changed the readiness probe so that it is not as sensitive, which avoids the errors you posted; for v2.10, the change I shared works better to avoid the Kubernetes Service detaching the pod when there is a lot of activity.

@a-alinichenko (Author)

@wallyqs, thanks for the clarification!
