`vtorc`: require topo for `Healthy: true` in `/debug/health` #17129
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@            Coverage Diff             @@
##             main   #17129      +/-   ##
==========================================
+ Coverage   67.43%   67.45%   +0.02%
==========================================
  Files        1571     1569       -2
  Lines      252249   252233      -16
==========================================
+ Hits       170098   170147      +49
+ Misses      82151    82086      -65
Added backport labels because I think it's unsafe for VTOrc to report that it is healthy when it cannot see the topo. This could lead to outages if an invalid vtorc config is deployed for some time.
Signed-off-by: Tim Vaillancourt <[email protected]>
a8d6422 to 81e8f81
Description
This PR addresses #17121 by requiring that we're able to reach the topology at least once (even if it's just empty) before returning `Healthy: true` in `/debug/health`. Today a VTOrc is considered healthy purely if it can write to its own database.

This is to prevent a `vtorc` deployment with a broken topo config from being seen as healthy, which in Kube can cause a bad config deploy to roll out to all nodes when it could fail on the first node it breaks.

Also, some error logging was consolidated so we don't log the topo failure multiple times for the same call (see below).
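The gating rule described above can be sketched in Go. This is a minimal illustration, not the code in this PR: `healthGate`, its fields, and the choice to answer with HTTP 500 while unhealthy are all assumptions made for the example. Only the rule itself (database writable and topo reached at least once) and the `Healthy` field name come from the description.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync/atomic"
)

// healthGate is a hypothetical type for illustration only;
// it is not VTOrc's actual implementation.
type healthGate struct {
	dbWritable atomic.Bool // the backend database accepted a write
	topoLoaded atomic.Bool // the topo answered at least once (an empty result counts)
}

// Healthy reports true only when both conditions hold.
func (g *healthGate) Healthy() bool {
	return g.dbWritable.Load() && g.topoLoaded.Load()
}

// ServeHTTP writes a JSON body with a "Healthy" field.
// Responding with 500 while unhealthy is an assumption of this sketch.
func (g *healthGate) ServeHTTP(w http.ResponseWriter, r *http.Request) {
	healthy := g.Healthy()
	w.Header().Set("Content-Type", "application/json")
	if !healthy {
		w.WriteHeader(http.StatusInternalServerError)
	}
	_ = json.NewEncoder(w).Encode(map[string]bool{"Healthy": healthy})
}

func main() {
	g := &healthGate{}
	g.dbWritable.Store(true) // database is fine, topo not reached yet

	rec := httptest.NewRecorder()
	g.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, "/debug/health", nil))
	fmt.Print(rec.Code, " ", rec.Body.String())

	g.topoLoaded.Store(true) // first successful topo read
	rec = httptest.NewRecorder()
	g.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, "/debug/health", nil))
	fmt.Print(rec.Code, " ", rec.Body.String())
}
```

Keeping the two conditions as separate flags means a process that can write to its database but was started with a broken topo config never flips to healthy, which is the failure mode this PR targets.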
Example with this change:
And in another shell:
(previous to this PR, `"Healthy": true` was returned)

Backport reason
I think this should be backported to protect users from the scenario described above. This issue could lead a user to believe a VTOrc deployment is healthy when it is actually unable to load anything from the topo.
Related Issue(s)
Resolves #17121
Checklist
Deployment Notes