EtcdHighNumberOfFailedGRPCRequests #248

Closed
nabbdl opened this issue Feb 14, 2019 · 13 comments

@nabbdl

nabbdl commented Feb 14, 2019

I enabled the "cluster monitoring operator" on several OKD 3.11 clusters. Everything is working fine except for etcd monitoring. I followed the documentation to enable etcd monitoring, and it seems to work: most of the checks are green, except for "EtcdHighNumberOfFailedGRPCRequests", which is always triggered even though the etcd cluster is working correctly. Am I missing something, or is there a known issue with enabling etcd cluster monitoring?

@nabbdl
Author

nabbdl commented Feb 14, 2019

To be more specific, the query behind this alert always returns 100 as its value. I suspect the query is wrong.
The current query is:

100 * sum by(job, instance, grpc_service, grpc_method) (rate(grpc_server_handled_total{grpc_code!="OK",job=~".*etcd.*"}[5m])) / sum by(job, instance, grpc_service, grpc_method) (rate(grpc_server_handled_total{job=~".*etcd.*"}[5m])) > 5

In my opinion it should be (with the "100 *" at the beginning removed):

sum by(job, instance, grpc_service, grpc_method) (rate(grpc_server_handled_total{grpc_code!="OK",job=~".*etcd.*"}[5m])) / sum by(job, instance, grpc_service, grpc_method) (rate(grpc_server_handled_total{job=~".*etcd.*"}[5m])) > 5

@benhwebster

I also see this on OpenShift 3.11 after enabling etcd monitoring.

@metalmatze
Contributor

The current query seems correct to me. It alerts on more than 5% of requests failing.
On my personal cluster I can see that there's a watch method pending for 4min.

{grpc_method="Watch",grpc_service="etcdserverpb.Watch",instance="10.135.73.45",job="etcd"}

Is it the same for you? Maybe we should ignore the watches here.
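
To make the arithmetic behind the alert expression concrete, here is a minimal Python sketch of what the query computes per (grpc_service, grpc_method) group. The sample rates are hypothetical, and the assumption that every completed Watch call is recorded with a non-OK code is only an illustration of how a single method could sit at 100. Because the failed/total ratio can never exceed 1, the `100 *` factor is what turns `> 5` into a 5% threshold.

```python
from collections import defaultdict


def failed_percentage(rates):
    """Mimic 100 * sum(rate(non-OK)) / sum(rate(all)) per (service, method).

    `rates` is an iterable of (grpc_service, grpc_method, grpc_code, rate)
    tuples. Simplification: PromQL would drop groups that have no non-OK
    series at all, whereas this sketch reports them as 0.
    """
    failed = defaultdict(float)
    total = defaultdict(float)
    for service, method, code, rate in rates:
        key = (service, method)
        total[key] += rate
        if code != "OK":
            failed[key] += rate
    return {key: 100 * failed[key] / total[key] for key in total if total[key] > 0}


# Hypothetical per-second rates, not taken from a real cluster.
sample = [
    ("etcdserverpb.KV", "Range", "OK", 50.0),
    ("etcdserverpb.KV", "Range", "Unavailable", 1.0),
    # Assumed case: the only Watch series carries a non-OK code.
    ("etcdserverpb.Watch", "Watch", "Unavailable", 0.5),
]

percentages = failed_percentage(sample)
firing = {key: value for key, value in percentages.items() if value > 5}
print(percentages)
print(firing)
```

In this sketch only the Watch group crosses the threshold, which is the pattern to look for when the alert reports exactly 100 on an otherwise healthy cluster.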

@nabbdl
Author

nabbdl commented Feb 15, 2019

I tested your query and the result is empty. The strange thing for me is that I get the same alert on all the OpenShift clusters I have installed, and the query for EtcdHighNumberOfFailedGRPCRequests always returns a value of 100.

@zot24

zot24 commented Mar 19, 2019

Hi @nabbdl, just wondering if you're still seeing that error and whether you solved the mystery? I'm facing the same error here and I'm not sure why it's happening.

@nabbdl
Author

nabbdl commented Mar 19, 2019

Hi @zot24. Unfortunately, I'm still seeing the same error.

@zot24

zot24 commented Mar 19, 2019

@nabbdl jtlyk, I have been doing some research, and after reading a lot of comments I think I'll just ignore those alerts for now (see poseidon/typhoon#175). There are a bunch of issues regarding this error message, but what's going on is pretty much summarized in etcd-io/etcd#10289, and there is still no fix for it.

In more detail, I think this is the offending line: https://github.com/gyuho/etcd/blob/0cf9382024da6132cb5f0778c3fb43e4a6c88afd/etcdserver/api/v3rpc/util.go#L111

@zot24

zot24 commented Mar 19, 2019

If you're using jsonnet, you could add the following to suppress that rule for now:

{
  prometheusAlerts+:: {
    groups: std.map(
      function(group)
        if group.name == 'etcd' then
          group {
            rules: std.filter(
              function(rule)
                rule.alert != 'etcdHighNumberOfFailedGRPCRequests',
              group.rules
            ),
          }
        else
          group,
      super.groups
    ),
  },
}
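
For readers who don't use jsonnet, the same filtering logic can be sketched in Python. This is only an illustration of what the snippet above does, run against a hypothetical list of rule groups shaped like Prometheus rule files (`name`, `rules`, `alert`); the function name `drop_alert` and the sample data are made up for the example.

```python
def drop_alert(groups, group_name, alert_name):
    """Return a copy of `groups` with one alert removed from one group.

    Mirrors the jsonnet above: map over the groups, and inside the matching
    group keep every rule whose `alert` field differs from `alert_name`.
    Unlike the jsonnet, rules without an `alert` key (recording rules) are
    tolerated here because of `.get`.
    """
    result = []
    for group in groups:
        if group.get("name") == group_name:
            kept = [
                rule for rule in group.get("rules", [])
                if rule.get("alert") != alert_name
            ]
            group = {**group, "rules": kept}  # copy, do not mutate the input
        result.append(group)
    return result


# Hypothetical input; only the alert name comes from the snippet above.
sample_groups = [
    {"name": "etcd", "rules": [
        {"alert": "etcdHighNumberOfFailedGRPCRequests", "expr": "..."},
        {"alert": "SomeOtherEtcdAlert", "expr": "..."},
    ]},
    {"name": "other", "rules": [
        {"alert": "etcdHighNumberOfFailedGRPCRequests", "expr": "..."},
    ]},
]

filtered = drop_alert(sample_groups, "etcd", "etcdHighNumberOfFailedGRPCRequests")
print([rule["alert"] for rule in filtered[0]["rules"]])
```

Note that the comparison is case-sensitive in both versions, so the string has to match the alert name exactly as it appears in the generated rules.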

@zot24

zot24 commented Apr 30, 2019

@nabbdl jtlyk this just got merged: #340

@nabbdl
Author

nabbdl commented May 1, 2019

@zot24 thank you for the update. I'm currently using the « cluster monitoring operator » provided with OpenShift, so I can't use jsonnet to disable the rule. The only thing I can do for now is to completely disable « etcd monitoring ». Or maybe the cluster-monitoring operator will update itself and pick up the change?

@benhwebster

You could do what I did in my clusters and create a silence for those alerts in Alertmanager. It does look like the fix is currently being backported, though: #383
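
As a rough sketch of what such a silence could look like, the Python below builds a silence payload matching on the alert name. It assumes an Alertmanager version that exposes the v2 HTTP API, where a payload of this shape is POSTed to `/api/v2/silences`; older releases may only offer the v1 API, and the web UI works as well. The function name `build_silence`, the creator name, the comment text and the duration are all placeholders.

```python
import json
from datetime import datetime, timedelta, timezone


def build_silence(alertname, hours, created_by, comment, now=None):
    """Build a silence payload that matches a single alert by name.

    `now` can be passed in to make the result reproducible; by default the
    silence starts at the current UTC time and lasts `hours` hours.
    """
    now = now or datetime.now(timezone.utc)
    return {
        "matchers": [
            {"name": "alertname", "value": alertname, "isRegex": False},
        ],
        "startsAt": now.isoformat(),
        "endsAt": (now + timedelta(hours=hours)).isoformat(),
        "createdBy": created_by,
        "comment": comment,
    }


# Placeholder values; adjust the duration and metadata to your own policy.
silence = build_silence(
    "EtcdHighNumberOfFailedGRPCRequests",
    hours=24 * 7,
    created_by="cluster-admin",
    comment="Known false positive until the fix is rolled out",
)
print(json.dumps(silence, indent=2))
```

Silences expire, so this needs to be renewed until the fixed rule reaches the cluster.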

@paulfantom
Contributor

Fix backported in #383

/close

@openshift-ci-robot
Contributor

@paulfantom: Closing this issue.

