Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Documentation: Fix etcdHighNumberOfFailedGRPCRequests rules return NaN value #10629

Closed
wants to merge 1 commit into from

Conversation

qingwave
Copy link

@qingwave qingwave commented Apr 11, 2019

what changed?
Rules etcdHighNumberOfFailedGRPCRequests in etcd3_alert.rules.yml

why this change was made?
The original rules will return NaN value of etcdHighNumberOfFailedGRPCRequests, but it should be a zero value. as image
etcd_rule_org
The denominator sometimes is zero,and return a NaN value of rules. Replace division with comparison, as image
etcd_rule_fix

@spzala
Copy link
Member

spzala commented Apr 11, 2019

@qingwave thanks for the PR. Please make sure you have commit title in this format <packagename>: <description> e.g. Doc: fix etcdHighNumberOfFailedGRPCRequests rules

@qingwave qingwave changed the title fix etcdHighNumberOfFailedGRPCRequests rules Documentation:Fix etcdHighNumberOfFailedGRPCRequests rules return NaN value Apr 12, 2019
@qingwave qingwave changed the title Documentation:Fix etcdHighNumberOfFailedGRPCRequests rules return NaN value Documentation: Fix etcdHighNumberOfFailedGRPCRequests rules return NaN value Apr 12, 2019
@qingwave qingwave closed this Apr 12, 2019
@qingwave qingwave reopened this Apr 12, 2019
@qingwave qingwave closed this Apr 12, 2019
@qingwave qingwave reopened this Apr 12, 2019
@codecov-io
Copy link

codecov-io commented Apr 12, 2019

Codecov Report

Merging #10629 into master will decrease coverage by 0.32%.
The diff coverage is n/a.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #10629      +/-   ##
==========================================
- Coverage   71.69%   71.36%   -0.33%     
==========================================
  Files         393      393              
  Lines       36627    36628       +1     
==========================================
- Hits        26258    26140     -118     
- Misses       8532     8648     +116     
- Partials     1837     1840       +3
Impacted Files Coverage Δ
pkg/adt/interval_tree.go 78.07% <0%> (-12.92%) ⬇️
auth/simple_token.go 77.23% <0%> (-9.76%) ⬇️
etcdctl/ctlv3/command/role_command.go 62.9% <0%> (-8.07%) ⬇️
etcdserver/api/v3rpc/lease.go 69.31% <0%> (-7.96%) ⬇️
pkg/netutil/netutil.go 63.11% <0%> (-6.56%) ⬇️
pkg/logutil/zap_grpc.go 47.61% <0%> (-4.77%) ⬇️
proxy/grpcproxy/watcher.go 85.71% <0%> (-4.09%) ⬇️
etcdserver/v2_server.go 80.76% <0%> (-3.85%) ⬇️
pkg/testutil/recorder.go 77.77% <0%> (-3.71%) ⬇️
clientv3/leasing/txn.go 88.09% <0%> (-3.18%) ⬇️
... and 19 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update c7c6894...af6aeb7. Read the comment docs.

Copy link
Member

@spzala spzala left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@qingwave you need to update the commit title. The build failure gives this error - e932113 fix etcdHighNumberOfFailedGRPCRequests rules... Expected commit title format '<package>{", "<package>}: <description>' Got: e932113 fix etcdHighNumberOfFailedGRPCRequests rules Other than that per the output of the fix you have provided the changes looks good to me. Thanks!

@qingwave
Copy link
Author

@spzala okay, the commit title modified

Copy link
Member

@spzala spzala left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@qingwave this should be changed as well https://github.com/etcd-io/etcd/blob/master/Documentation/op-guide/etcd3_alert.rules#L41 Also I might be missing something but the example diagram you have provided for NaN seems not have the exact expression as what's in the current rules? Thanks!

@spzala
Copy link
Member

spzala commented Apr 12, 2019

@spzala okay, the commit title modified

Thanks, that has fixed the build.

@spzala
Copy link
Member

spzala commented Apr 12, 2019

/cc @xiang90

@qingwave
Copy link
Author

qingwave commented Apr 13, 2019

@qingwave this should be changed as well https://github.com/etcd-io/etcd/blob/master/Documentation/op-guide/etcd3_alert.rules#L41 Also I might be missing something but the example diagram you have provided for NaN seems not have the exact expression as what's in the current rules? Thanks!

@spzala As your said, the fixed rules cannot return a exact expression, only 0 or 1. In prometheus, there is not a gracefully method to avoid divide by zero. We need to check for zero in divider and replace NaN with zero, as follows,

100 * sum(rate(grpc_server_handled_total{job=~".*etcd.*", grpc_code!="OK"}[5m])) BY (job, instance, grpc_service, grpc_method)
      /
sum(rate(grpc_server_handled_total{job=~".*etcd.*"}[5m])) BY (job, instance, grpc_service, grpc_method) > 0
      or 
sum(rate(grpc_server_handled_total{job=~".*etcd.*", grpc_code!="OK"}[5m])) BY (job, instance, grpc_service, grpc_method) > bool 0

image

Is it a good way?

@spzala
Copy link
Member

spzala commented Apr 17, 2019

@qingwave thanks and in that case it sounds good to me. Let's have @hexfusion or @xiang90 take a look too. Also, as I commented earlier the etcd3_alert.rules should be updated accordingly as well. Thanks!

@brancz
Copy link
Contributor

brancz commented Jun 11, 2019

I'm not sure this is the right fix. As far as I can tell the reason this is happening is because we're dividing an empty vector by a scalar, for which not a number is indeed correct. What exactly needs to be fixed here? Are alerts firing that shouldn't be firing?

@servo1x
Copy link

servo1x commented Jul 28, 2019

We're seeing this alert firing constantly, should this rule be removed entirely?

openshift/cluster-monitoring-operator#248 (comment)
#10289

@brancz
Copy link
Contributor

brancz commented Aug 5, 2019

@servo1x for what it's worth it's disabled in openshift (v4.1.0+) until we resolve this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

5 participants