Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

OCPBUGS-38078: Validate HAProxy health check interval time value #618

Open
wants to merge 6 commits into
base: master
Choose a base branch
from

Conversation

grzpiotrowski
Copy link
Contributor

@grzpiotrowski grzpiotrowski commented Aug 20, 2024

The maximum valid time representing value in HAProxy is 2147483647ms (max positive value for a 32-bit signed integer).

Setting time value to one exceeding the maximum handled by HAProxy for router.openshift.io/haproxy.health.check.interval breaks the router-default pods.

Prior to this PR clipHAProxyTimeoutValue was used to validate the timeout values set in annotations like haproxy.router.openshift.io/timeout or haproxy.router.openshift.io/timeout-tunnel.

By reusing the clipHAProxyTimeoutValue function, this PR adds the same validation for the value set in router.openshift.io/haproxy.health.check.interval annotation to ensure it is within the range that HAProxy can parse.

@openshift-ci-robot openshift-ci-robot added jira/severity-moderate Referenced Jira bug's severity is moderate for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. labels Aug 20, 2024
@openshift-ci-robot
Copy link
Contributor

@grzpiotrowski: This pull request references Jira Issue OCPBUGS-38078, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.18.0) matches configured target version for branch (4.18.0)
  • bug is in the state ASSIGNED, which is one of the valid states (NEW, ASSIGNED, POST)

No GitHub users were found matching the public email listed for the QA contact in Jira ([email protected]), skipping review request.

The bug has been updated to refer to the pull request using the external bug tracker.

In response to this:

The maximum valid time representing value in HAProxy is 2147483647ms (max positive value for a 32-bit signed integer).

Setting time value to one exceeding the maximum handled by HAProxy for router.openshift.io/haproxy.health.check.interval breaks he router-default pods.

Prior to this PR clipHAProxyTimeoutValue was used to validate the timeout values set in annotations like haproxy.router.openshift.io/timeout or haproxy.router.openshift.io/timeout-tunnel.

This PR adds the same validation for the value set in router.openshift.io/haproxy.health.check.interval annotation to ensure it is within the range that HAProxy can parse.
clipHAProxyTimeoutValue function is renamed to clipHAProxyTimeValue to reflect the more general purpose of this function, which is to validate time representing values for HAProxy related annotations.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci-robot
Copy link
Contributor

@grzpiotrowski: This pull request references Jira Issue OCPBUGS-38078, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.18.0) matches configured target version for branch (4.18.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

No GitHub users were found matching the public email listed for the QA contact in Jira ([email protected]), skipping review request.

In response to this:

The maximum valid time representing value in HAProxy is 2147483647ms (max positive value for a 32-bit signed integer).

Setting time value to one exceeding the maximum handled by HAProxy for router.openshift.io/haproxy.health.check.interval breaks the router-default pods.

Prior to this PR clipHAProxyTimeoutValue was used to validate the timeout values set in annotations like haproxy.router.openshift.io/timeout or haproxy.router.openshift.io/timeout-tunnel.

This PR adds the same validation for the value set in router.openshift.io/haproxy.health.check.interval annotation to ensure it is within the range that HAProxy can parse.
clipHAProxyTimeoutValue function is renamed to clipHAProxyTimeValue to reflect the more general purpose of this function, which is to validate time representing values for HAProxy related annotations.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@lihongan
Copy link
Contributor

cc @rhamini3 @melvinjoseph86

@candita
Copy link
Contributor

candita commented Aug 20, 2024

@grzpiotrowski unless someone has advised you otherwise, I will request that you do not rename the function or constant. It will make this easier to review and backport, by simplifying the number of changes.

@grzpiotrowski grzpiotrowski force-pushed the OCPBUGS-38078-haproxy-healthcheck-interval branch from 0e507e4 to 0ecfb1b Compare August 21, 2024 10:11
@grzpiotrowski
Copy link
Contributor Author

@grzpiotrowski unless someone has advised you otherwise, I will request that you do not rename the function or constant. It will make this easier to review and backport, by simplifying the number of changes.

Yes, I didn't consider the ease of backporting in this case.
As suggested, I reverted the clipHAProxyTimeValue function name back to clipHAProxyTimeoutValue and all other generalized variable names back to original. Only the minimal changes required to fix the issue are left now.

@grzpiotrowski
Copy link
Contributor Author

I was just wondering, should we keep the log messages unchanged or adjust them to say time or time value instead of timeout, as from now the health check interval is also clipped, not only timeouts.
Not sure how this would affect backports.

Example of a log for clipping the health check interval now:

template_helper.go:356] "msg"="route annotation timeout exceeds maximum allowable by HAProxy, clipping to 2147483647ms" "input"="50000d" "logger"="template"

@rhamini3
Copy link

  1. Annotated all available routes with health check interval above the maximum input
oc annotate route -n openshift-ingress-canary canary --overwrite router.openshift.io/haproxy.health.check.interval=50000d
  1. Looked up pods in openshift-ingress namespace, deleted and recreated them
oc get po -n openshift-ingress                                   
NAME                              READY   STATUS    RESTARTS   AGE
router-default-8648b58758-6q8nk   1/1     Running   0          3m11s
router-default-8648b58758-ckp28   1/1     Running   0          2m7s
  1. Looked at pod logs to make sure health check interval successfully clips the input value
oc logs -n openshift-ingress router-default-8648b58758-6q8nk
I0820 19:54:48.602667       1 template.go:560] "msg"="starting router" "logger"="router" "version"="majorFromGit: \nminorFromGit: \ncommitFromGit: 6c1ef257\nversionFromGit: v0.0.0-unknown\ngitTreeState: dirty\nbuildDate: 2024-08-20T17:29:50Z\n"
I0820 19:54:48.605066       1 metrics.go:156] "msg"="router health and metrics port listening on HTTP and HTTPS" "address"="0.0.0.0:1936" "logger"="metrics"
I0820 19:54:48.607465       1 router.go:217] "msg"="creating a new template router" "logger"="template" "writeDir"="/var/lib/haproxy"
I0820 19:54:48.607535       1 router.go:302] "msg"="router will coalesce reloads within an interval of each other" "interval"="5s" "logger"="template"
I0820 19:54:48.607837       1 router.go:372] "msg"="watching for changes" "logger"="template" "path"="/etc/pki/tls/private"
I0820 19:54:48.607898       1 router.go:283] "msg"="router is including routes in all namespaces" "logger"="router"
I0820 19:54:48.621526       1 reflector.go:359] Caches populated for *v1.Route from github.com/openshift/router/pkg/router/controller/factory/factory.go:124
I0820 19:54:48.623961       1 reflector.go:359] Caches populated for *v1.EndpointSlice from github.com/openshift/router/pkg/router/controller/factory/factory.go:124
I0820 19:54:48.624807       1 reflector.go:359] Caches populated for *v1.Service from github.com/openshift/router/pkg/router/template/service_lookup.go:33
I0820 19:54:48.712700       1 template_helper.go:356] "msg"="route annotation time exceeds maximum allowable by HAProxy, clipping to 2147483647ms" "input"="50000d" "logger"="template"
I0820 19:54:48.712753       1 template_helper.go:356] "msg"="route annotation time exceeds maximum allowable by HAProxy, clipping to 2147483647ms" "input"="50000d" "logger"="template"
I0820 19:54:48.712801       1 template_helper.go:356] "msg"="route annotation time exceeds maximum allowable by HAProxy, clipping to 2147483647ms" "input"="50000d" "logger"="template"
I0820 19:54:48.713165       1 template_helper.go:356] "msg"="route annotation time exceeds maximum allowable by HAProxy, clipping to 2147483647ms" "input"="50000d" "logger"="template"
I0820 19:54:48.713239       1 template_helper.go:356] "msg"="route annotation time exceeds maximum allowable by HAProxy, clipping to 2147483647ms" "input"="50000d" "logger"="template"
I0820 19:54:48.713411       1 template_helper.go:356] "msg"="route annotation time exceeds maximum allowable by HAProxy, clipping to 2147483647ms" "input"="50000d" "logger"="template"
I0820 19:54:48.713480       1 template_helper.go:356] "msg"="route annotation time exceeds maximum allowable by HAProxy, clipping to 2147483647ms" "input"="50000d" "logger"="template"
I0820 19:54:48.713587       1 template_helper.go:356] "msg"="route annotation time exceeds maximum allowable by HAProxy, clipping to 2147483647ms" "input"="50000d" "logger"="template"
I0820 19:54:48.713631       1 template_helper.go:356] "msg"="route annotation time exceeds maximum allowable by HAProxy, clipping to 2147483647ms" "input"="50000d" "logger"="template"
I0820 19:54:48.713670       1 template_helper.go:356] "msg"="route annotation time exceeds maximum allowable by HAProxy, clipping to 2147483647ms" "input"="50000d" "logger"="template"
I0820 19:54:48.713864       1 template_helper.go:356] "msg"="route annotation time exceeds maximum allowable by HAProxy, clipping to 2147483647ms" "input"="50000d" "logger"="template"
I0820 19:54:48.713938       1 template_helper.go:356] "msg"="route annotation time exceeds maximum allowable by HAProxy, clipping to 2147483647ms" "input"="50000d" "logger"="template"
I0820 19:54:48.714134       1 template_helper.go:356] "msg"="route annotation time exceeds maximum allowable by HAProxy, clipping to 2147483647ms" "input"="50000d" "logger"="template"
I0820 19:54:48.714216       1 template_helper.go:356] "msg"="route annotation time exceeds maximum allowable by HAProxy, clipping to 2147483647ms" "input"="50000d" "logger"="template"
I0820 19:54:48.714379       1 template_helper.go:356] "msg"="route annotation time exceeds maximum allowable by HAProxy, clipping to 2147483647ms" "input"="50000d" "logger"="template"
I0820 19:54:48.714443       1 template_helper.go:356] "msg"="route annotation time exceeds maximum allowable by HAProxy, clipping to 2147483647ms" "input"="50000d" "logger"="template"
I0820 19:54:48.714703       1 template_helper.go:356] "msg"="route annotation time exceeds maximum allowable by HAProxy, clipping to 2147483647ms" "input"="50000d" "logger"="template"
I0820 19:54:48.714837       1 template_helper.go:356] "msg"="route annotation time exceeds maximum allowable by HAProxy, clipping to 2147483647ms" "input"="50000d" "logger"="template"
E0820 19:54:48.715494       1 haproxy.go:418] can't scrape HAProxy: dial unix /var/lib/haproxy/run/haproxy.sock: connect: no such file or directory
I0820 19:54:48.746673       1 router.go:669] "msg"="router reloaded" "logger"="template" "output"=" - Checking http://localhost:80 using PROXY protocol ...\n - Health check ok : 0 retry attempt(s).\n"

Mark bug as verified
/label qe-approved

@openshift-ci openshift-ci bot added the qe-approved Signifies that QE has signed off on this PR label Aug 21, 2024
@grzpiotrowski
Copy link
Contributor Author

/retest

@candita
Copy link
Contributor

candita commented Aug 22, 2024

/assign

@candita
Copy link
Contributor

candita commented Sep 3, 2024

Unrelated failure in e2e-agnostic.

fail [github.com/openshift/origin/test/extended/cpu_partitioning/crio.go:166]: error getting crio container data from node ci-op-2srp6824-4c798-nf4kn-master-0
Unexpected error:
<*errors.errorString | 0xc001a08ee0>:
err execing command jq: error (at :1): Cannot index array with string "info"
jq: error (at :1): Cannot iterate over null (null)
{
s: "err execing command jq: error (at :1): Cannot index array with string "info"\njq: error (at :1): Cannot iterate over null (null)",}

/test e2e-agnostic

@candita
Copy link
Contributor

candita commented Sep 3, 2024

As far as gathering a list of other places we may need to clip values, I found some that are not annotations:

timeout connect {{ firstMatch $timeSpecPattern (env "ROUTER_DEFAULT_CONNECT_TIMEOUT") "5s" }}
timeout client {{ firstMatch $timeSpecPattern (env "ROUTER_DEFAULT_CLIENT_TIMEOUT") "30s" }}
timeout client-fin {{ firstMatch $timeSpecPattern (env "ROUTER_CLIENT_FIN_TIMEOUT") "1s" }}
timeout server {{ firstMatch $timeSpecPattern (env "ROUTER_DEFAULT_SERVER_TIMEOUT") "30s" }}
timeout server-fin {{ firstMatch $timeSpecPattern (env "ROUTER_DEFAULT_SERVER_FIN_TIMEOUT") "1s" }}
timeout http-request {{ firstMatch $timeSpecPattern (env "ROUTER_SLOWLORIS_TIMEOUT") "10s" }}
timeout http-keep-alive {{ firstMatch $timeSpecPattern (env "ROUTER_SLOWLORIS_HTTP_KEEPALIVE") "300s" }}
# Long timeout for WebSocket connections.
timeout tunnel {{ firstMatch $timeSpecPattern (env "ROUTER_DEFAULT_TUNNEL_TIMEOUT") "1h" }}

tcp-request inspect-delay {{ firstMatch $timeSpecPattern (env "ROUTER_INSPECT_DELAY") "5s" }}

tcp-request inspect-delay {{ firstMatch $timeSpecPattern (env "ROUTER_INSPECT_DELAY") "5s" }}

However, in https://github.com/openshift/cluster-ingress-operator/blob/8252ac492c04d161fbcf60ef82af2989c99f4a9d/pkg/operator/controller/ingress/deployment.go#L610-L634 we do take care of clipping them.

@candita
Copy link
Contributor

candita commented Sep 3, 2024

@grzpiotrowski this looks good, but we will need some kind of test with this to make sure there are no regressions over time, and validate that it works as is. Maybe something like adding a few test cases to the tests in TestConfigTemplate in https://github.com/openshift/router/blob/master/pkg/router/router_test.go.

Miciah and others added 6 commits October 10, 2024 11:42
* pkg/router/router_test.go (mustCreate): Rename...
(mustCreateRoute): ... to this.
* pkg/router/router_test.go (harness): Add a client field.
(TestMain): Initialize the harness client field with the Kubernetes client set.
(TestConfigTemplate): Add test cases where the endpointslice associated with the
route specifies appProtocol with the recognized value "h2c" and unrecognized
values "unknown-value" and "kubernetes.io/h2c".  Verify that the recognized
value causes "proto h2" to be added to the backend server line and that the
unrecognized values do not.  Modify the test logic to apply an optional
mustCreateEndpointSlice from the test expectations.
(mustCreateRoute): Add targetServiceName field.
((mustCreateRoute).Apply): Initialize the route's spec.to.name field to the
targetServiceName value from the mustCreateRoute parameters if it is specified.
This new parameter isn't strictly required for the new test cases, but it makes
them more explicit and easier to read.
(mustCreateEndpointSlice): New type, used in the new test cases.
((mustCreateEndpointSlice).Apply): Create an endpointslice using the Kubernetes
API client set from the harness and the specified endpointslice name, service
name, and port appProtocol value from the mustCreateEndpointSlice parameters.
(mustCreateWithConfig): Add mustCreateEndpointSlice so that the new test cases
can use it.
(matchConfig): Add logic for parsing the "server" attribute, which the new test
cases use.
This commit fixes OCPBUGS-42972.

https://issues.redhat.com/browse/OCPBUGS-42972

* images/router/haproxy/conf/haproxy-config.template:
* pkg/router/template/configmanager/haproxy/backend.go
((Backend).UpdateServerInfo):
* pkg/router/template/configmanager/haproxy/manager.go
((haproxyConfigManager).ReplaceRouteEndpoints): Recognize both "h2c" and
"kubernetes.io/h2c" as valid values for appProtocol.
* pkg/router/router_test.go (TestConfigTemplate): Update the expectations for
the "route with appProtocol: kubernetes.io/h2c" test case.
Use clipHAProxyTimeoutValue on router.openshift.io/haproxy.health.check.interval
annotation to ensure it is within the range that HAProxy can parse.

This commit fixes OCPBUGS-38078.

https://issues.redhat.com/browse/OCPBUGS-38078

* images/router/haproxy/conf/haproxy-config.template:
Use clipHAProxyTimeoutValue on `router.openshift.io/haproxy.health.check.interval` annotation.
* pkg/router/router_test.go:
(TestConfigTemplate): Add test cases for route health check interval
annotation. Verify that the correct value is added to the backend server
line and that the values exceeding the maximum haproxy time value get clipped
to the max limit. Verify that invalid annotation values result in the default
health check interval value applied.
(MustCreateRoute): Add targetServiceWeight field
((MustCreateRoute).Apply): Initialize the route's spec.to.weight to the
targetServiceWeight value from the mustCreateRoute parameters if it is specified.
This parameter allows to set a non-zero weight for the service, making
the endpoint Active as a result.
(MustCreateEndpointSlice): Add addresses field
((MustCreateEndpointSlice).Apply): Initialize the endpoint's Addresses field
to the addresses array from the mustCreateEndpointSlice parameters if specified.
This enables having at least two endpoints for the route in the test case
and satisfy the conditions needed to configure the check inter in the backend
server line.
@grzpiotrowski grzpiotrowski force-pushed the OCPBUGS-38078-haproxy-healthcheck-interval branch from 0ecfb1b to e9081e6 Compare October 16, 2024 14:24
Copy link
Contributor

openshift-ci bot commented Oct 16, 2024

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from candita. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@grzpiotrowski
Copy link
Contributor Author

I finally added the health check interval test cases in the TestConfigTemplate.
I piggybacked on the work of @Miciah by using the new mustCreateEndpointSlice type from PR #627. For this reason I opted to rebase my branch on top of the branch from PR #627. I reckon this means that this PR must go after #627.

Copy link
Contributor

openshift-ci bot commented Oct 16, 2024

@grzpiotrowski: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/e2e-metal-ipi-ovn-ipv6 e9081e6 link false /test e2e-metal-ipi-ovn-ipv6

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
jira/severity-moderate Referenced Jira bug's severity is moderate for the branch this PR is targeting. jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. qe-approved Signifies that QE has signed off on this PR
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants