
Falling behind in health check. JetStream is not current with the meta leader. [v2.10.21] #5976

Open
a-alinichenko opened this issue Oct 8, 2024 · 10 comments
Labels: defect (Suspected defect such as a bug or regression)

Comments

@a-alinichenko

Observed behavior

Each NATS cluster restart triggers this problem.
When we restart the NATS cluster, we see the following logs and the affected pod does not become ready:

[7] 2024/10/03 10:36:57.943668 [WRN] RAFT [9R4RvVMj - _meta_] Falling behind in health check, commit 59078 != applied 58988
[7] 2024/10/03 10:36:57.943701 [WRN] Healthcheck failed: "JetStream is not current with the meta leader"
[7] 2024/10/03 10:37:07.943402 [WRN] RAFT [9R4RvVMj - _meta_] Falling behind in health check, commit 59108 != applied 58988
[7] 2024/10/03 10:37:07.943423 [WRN] Healthcheck failed: "JetStream is not current with the meta leader"
[7] 2024/10/03 10:37:17.943138 [WRN] RAFT [9R4RvVMj - _meta_] Falling behind in health check, commit 59108 != applied 58988
[7] 2024/10/03 10:37:17.943163 [WRN] Healthcheck failed: "JetStream is not current with the meta leader"
[7] 2024/10/03 10:37:27.943352 [WRN] RAFT [9R4RvVMj - _meta_] Falling behind in health check, commit 59188 != applied 58988
[7] 2024/10/03 10:37:27.943400 [WRN] Healthcheck failed: "JetStream is not current with the meta leader"
[7] 2024/10/03 10:37:37.942694 [WRN] RAFT [9R4RvVMj - _meta_] Falling behind in health check, commit 59298 != applied 58988
[7] 2024/10/03 10:37:37.942806 [WRN] Healthcheck failed: "JetStream is not current with the meta leader"
[7] 2024/10/03 10:37:47.943473 [WRN] RAFT [9R4RvVMj - _meta_] Falling behind in health check, commit 59308 != applied 58988
[7] 2024/10/03 10:37:47.943498 [WRN] Healthcheck failed: "JetStream is not current with the meta leader"
[7] 2024/10/03 10:37:57.943104 [WRN] RAFT [9R4RvVMj - _meta_] Falling behind in health check, commit 59388 != applied 58988
[7] 2024/10/03 10:37:57.943135 [WRN] Healthcheck failed: "JetStream is not current with the meta leader"

While this is happening, the consumer does not read messages and they accumulate on disk.
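For context, the check that keeps failing is the one backing the Kubernetes readiness probe against the server's monitoring endpoint. With the official Helm chart it looks roughly like the sketch below; the path, port, and period are assumptions about the chart defaults, not values taken from this deployment. The /healthz endpoint reports an error while the applied index lags the commit index, so the pod stays out of the Service for as long as the warnings above repeat.

readinessProbe:        # sketch of the assumed chart default
  httpGet:
    path: /healthz     # full health check; fails with "JetStream is not current with the meta leader"
    port: monitor      # the chart's named monitoring port (8222), assumed
  periodSeconds: 10    # matches the ~10 s spacing of the warnings in the log above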

Expected behavior

Restarting the pod does not lead to operational problems.

Server and client version

NATS server version: 2.10.21
Go client: github.com/nats-io/nats.go v1.34.1

Host environment

Filter subject: 1

Stream info example:

Information for Stream ch_queue6 created 2024-09-18 18:54:04

              Subjects: requests6, antibot6
              Replicas: 1
               Storage: File
     Placement Cluster: nats-sessions-api
        Placement Tags: pod:nats-sessions-api-6

Options:

             Retention: WorkQueue
       Acknowledgments: true
        Discard Policy: Old
      Duplicate Window: 10m0s
     Allows Msg Delete: true
          Allows Purge: true
        Allows Rollups: false

Limits:

      Maximum Messages: 100,000,000,000
   Maximum Per Subject: unlimited
         Maximum Bytes: unlimited
           Maximum Age: 7d0h0m0s
  Maximum Message Size: unlimited
     Maximum Consumers: unlimited

Cluster Information:

                  Name: nats-sessions-api
                Leader: nats-sessions-api-6

State:

              Messages: 961,995
                 Bytes: 1.1 GiB
        First Sequence: 3,884,371,968 @ 2024-10-03 13:11:18 UTC
         Last Sequence: 3,885,335,265 @ 2024-10-03 13:21:41 UTC
      Deleted Messages: 1,303
      Active Consumers: 2
    Number of Subjects: 2

Installed via the official Helm chart in Kubernetes.
7 pods in the cluster.
7 streams (1 per pod), each placed on a different pod via placement tags.

Allocated resources for each pod:
CPU - 4 cores
Mem - 15 GiB

Current load:
6,000-7,000 messages per second.
The Prometheus query to count this:
[Screenshot: Prometheus query used to count the message rate, 2024-10-08]
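For reference, the deployment described above maps roughly onto the official chart's values as sketched below; the config.cluster.replicas and container.merge keys are assumptions about the 1.x chart layout, not the actual values file used here.

config:
  cluster:
    enabled: true
    replicas: 7        # 7 pods in the cluster
  jetstream:
    enabled: true
container:
  merge:
    resources:         # 4 cores / 15 GiB per pod, as listed above
      requests:
        cpu: "4"
        memory: 15Gi
      limits:
        cpu: "4"
        memory: 15Gi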

Steps to reproduce

Just restart a pod under high load.

@a-alinichenko a-alinichenko added the defect Suspected defect such as a bug or regression label Oct 8, 2024
@wallyqs wallyqs changed the title Falling behind in health check. JetStream is not current with the meta leader. Falling behind in health check. JetStream is not current with the meta leader. [v2.10.21] Oct 8, 2024
@derekcollison (Member)

How many JS API operations are you doing per second? You can get a sense of that from nats traffic under the system account.

@a-alinichenko (Author)

@derekcollison
I think this metric shows it:

[Screenshot: metric graph, 2024-10-09]

@derekcollison (Member)

OK, thanks. That rate would not cause those repeated messages about failing to catch up to the meta leader. Something else is going on that is not obvious to us.

@a-alinichenko (Author)

@derekcollison
Should I provide any other details?
NATS is installed in Kubernetes on GCP GKE.
The applications that use it run in the same cluster, in the same network/region.
We have many applications there and see problems only with NATS.

We also run the same NATS cluster configuration with the same applications in another region under a lower load,
and there we don't see such problems with NATS.

@derekcollison (Member)

Let's see if @wallyqs could spend some time on a Zoom call with you.

@wallyqs (Member) commented Oct 9, 2024

Hi @a-alinichenko, ping me at [email protected] and we can try to take a look at the setup/scenario.

@wallyqs (Member) commented Oct 14, 2024

@a-alinichenko I wonder if this could be related to the readiness probe as well, so instead of the current default in the Helm charts you could change it to this:

readinessProbe:
  httpGet:
    path: /          # probe the plain root of the monitoring endpoint instead of the default health check path
    port: monitor    # assumed: the chart's named monitoring port (8222); httpGet requires a port
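One way to apply this through the official Helm chart, assuming the 1.x chart layout where extra container fields can be merged in via container.merge (an assumption, not verified against this deployment), is a values override roughly like this:

container:
  merge:
    readinessProbe:      # replaces the chart's default readiness probe
      httpGet:
        path: /          # plain root instead of the default health check path
        port: monitor    # the chart's named monitoring port (assumed)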

@a-alinichenko (Author)

@wallyqs Thank you for your answer!
I can change it on our side and test it.
But is it fine to have warnings like the ones I described in the logs?
And will simply changing the healthcheck parameters not cause any other problems?

@wallyqs (Member) commented Oct 15, 2024

@a-alinichenko in v2.11 we have changed the readiness probe so that it is not as sensitive, which avoids the errors you posted; for v2.10, the change I shared works better to avoid the Kubernetes Service detaching the pod when there is a lot of activity.

@a-alinichenko (Author)

@wallyqs, thanks for the clarification!
