OCPBUGS-38078: Validate HAProxy health check interval time value #618

grzpiotrowski · 2024-08-20T10:39:37Z

The maximum valid time representing value in HAProxy is 2147483647ms (max positive value for a 32-bit signed integer).

Setting time value to one exceeding the maximum handled by HAProxy for router.openshift.io/haproxy.health.check.interval breaks the router-default pods.

Prior to this PR clipHAProxyTimeoutValue was used to validate the timeout values set in annotations like haproxy.router.openshift.io/timeout or haproxy.router.openshift.io/timeout-tunnel.

By reusing the clipHAProxyTimeoutValue function, this PR adds the same validation for the value set in router.openshift.io/haproxy.health.check.interval annotation to ensure it is within the range that HAProxy can parse.

openshift-ci-robot · 2024-08-20T10:39:44Z

@grzpiotrowski: This pull request references Jira Issue OCPBUGS-38078, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug

bug is open, matching expected state (open)
bug target version (4.18.0) matches configured target version for branch (4.18.0)
bug is in the state ASSIGNED, which is one of the valid states (NEW, ASSIGNED, POST)

No GitHub users were found matching the public email listed for the QA contact in Jira ([email protected]), skipping review request.

The bug has been updated to refer to the pull request using the external bug tracker.

In response to this:

The maximum valid time representing value in HAProxy is 2147483647ms (max positive value for a 32-bit signed integer).

Setting time value to one exceeding the maximum handled by HAProxy for router.openshift.io/haproxy.health.check.interval breaks he router-default pods.

Prior to this PR clipHAProxyTimeoutValue was used to validate the timeout values set in annotations like haproxy.router.openshift.io/timeout or haproxy.router.openshift.io/timeout-tunnel.

This PR adds the same validation for the value set in router.openshift.io/haproxy.health.check.interval annotation to ensure it is within the range that HAProxy can parse.
clipHAProxyTimeoutValue function is renamed to clipHAProxyTimeValue to reflect the more general purpose of this function, which is to validate time representing values for HAProxy related annotations.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

openshift-ci-robot · 2024-08-20T10:41:07Z

@grzpiotrowski: This pull request references Jira Issue OCPBUGS-38078, which is valid.

3 validation(s) were run on this bug

bug is open, matching expected state (open)
bug target version (4.18.0) matches configured target version for branch (4.18.0)
bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

No GitHub users were found matching the public email listed for the QA contact in Jira ([email protected]), skipping review request.

In response to this:

The maximum valid time representing value in HAProxy is 2147483647ms (max positive value for a 32-bit signed integer).

Setting time value to one exceeding the maximum handled by HAProxy for router.openshift.io/haproxy.health.check.interval breaks the router-default pods.

Prior to this PR clipHAProxyTimeoutValue was used to validate the timeout values set in annotations like haproxy.router.openshift.io/timeout or haproxy.router.openshift.io/timeout-tunnel.

This PR adds the same validation for the value set in router.openshift.io/haproxy.health.check.interval annotation to ensure it is within the range that HAProxy can parse.
clipHAProxyTimeoutValue function is renamed to clipHAProxyTimeValue to reflect the more general purpose of this function, which is to validate time representing values for HAProxy related annotations.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

lihongan · 2024-08-20T10:51:15Z

cc @rhamini3 @melvinjoseph86

candita · 2024-08-20T23:54:54Z

@grzpiotrowski unless someone has advised you otherwise, I will request that you do not rename the function or constant. It will make this easier to review and backport, by simplifying the number of changes.

grzpiotrowski · 2024-08-21T10:21:31Z

@grzpiotrowski unless someone has advised you otherwise, I will request that you do not rename the function or constant. It will make this easier to review and backport, by simplifying the number of changes.

Yes, I didn't consider the ease of backporting in this case.
As suggested, I reverted the clipHAProxyTimeValue function name back to clipHAProxyTimeoutValue and all other generalized variable names back to original. Only the minimal changes required to fix the issue are left now.

grzpiotrowski · 2024-08-21T10:36:54Z

I was just wondering, should we keep the log messages unchanged or adjust them to say time or time value instead of timeout, as from now the health check interval is also clipped, not only timeouts.
Not sure how this would affect backports.

Example of a log for clipping the health check interval now:

template_helper.go:356] "msg"="route annotation timeout exceeds maximum allowable by HAProxy, clipping to 2147483647ms" "input"="50000d" "logger"="template"

rhamini3 · 2024-08-21T14:11:57Z

Annotated all available routes with health check interval above the maximum input

oc annotate route -n openshift-ingress-canary canary --overwrite router.openshift.io/haproxy.health.check.interval=50000d

Looked up pods in openshift-ingress namespace, deleted and recreated them

oc get po -n openshift-ingress                                   
NAME                              READY   STATUS    RESTARTS   AGE
router-default-8648b58758-6q8nk   1/1     Running   0          3m11s
router-default-8648b58758-ckp28   1/1     Running   0          2m7s

Looked at pod logs to make sure health check interval successfully clips the input value

oc logs -n openshift-ingress router-default-8648b58758-6q8nk
I0820 19:54:48.602667       1 template.go:560] "msg"="starting router" "logger"="router" "version"="majorFromGit: \nminorFromGit: \ncommitFromGit: 6c1ef257\nversionFromGit: v0.0.0-unknown\ngitTreeState: dirty\nbuildDate: 2024-08-20T17:29:50Z\n"
I0820 19:54:48.605066       1 metrics.go:156] "msg"="router health and metrics port listening on HTTP and HTTPS" "address"="0.0.0.0:1936" "logger"="metrics"
I0820 19:54:48.607465       1 router.go:217] "msg"="creating a new template router" "logger"="template" "writeDir"="/var/lib/haproxy"
I0820 19:54:48.607535       1 router.go:302] "msg"="router will coalesce reloads within an interval of each other" "interval"="5s" "logger"="template"
I0820 19:54:48.607837       1 router.go:372] "msg"="watching for changes" "logger"="template" "path"="/etc/pki/tls/private"
I0820 19:54:48.607898       1 router.go:283] "msg"="router is including routes in all namespaces" "logger"="router"
I0820 19:54:48.621526       1 reflector.go:359] Caches populated for *v1.Route from github.com/openshift/router/pkg/router/controller/factory/factory.go:124
I0820 19:54:48.623961       1 reflector.go:359] Caches populated for *v1.EndpointSlice from github.com/openshift/router/pkg/router/controller/factory/factory.go:124
I0820 19:54:48.624807       1 reflector.go:359] Caches populated for *v1.Service from github.com/openshift/router/pkg/router/template/service_lookup.go:33
I0820 19:54:48.712700       1 template_helper.go:356] "msg"="route annotation time exceeds maximum allowable by HAProxy, clipping to 2147483647ms" "input"="50000d" "logger"="template"
I0820 19:54:48.712753       1 template_helper.go:356] "msg"="route annotation time exceeds maximum allowable by HAProxy, clipping to 2147483647ms" "input"="50000d" "logger"="template"
I0820 19:54:48.712801       1 template_helper.go:356] "msg"="route annotation time exceeds maximum allowable by HAProxy, clipping to 2147483647ms" "input"="50000d" "logger"="template"
I0820 19:54:48.713165       1 template_helper.go:356] "msg"="route annotation time exceeds maximum allowable by HAProxy, clipping to 2147483647ms" "input"="50000d" "logger"="template"
I0820 19:54:48.713239       1 template_helper.go:356] "msg"="route annotation time exceeds maximum allowable by HAProxy, clipping to 2147483647ms" "input"="50000d" "logger"="template"
I0820 19:54:48.713411       1 template_helper.go:356] "msg"="route annotation time exceeds maximum allowable by HAProxy, clipping to 2147483647ms" "input"="50000d" "logger"="template"
I0820 19:54:48.713480       1 template_helper.go:356] "msg"="route annotation time exceeds maximum allowable by HAProxy, clipping to 2147483647ms" "input"="50000d" "logger"="template"
I0820 19:54:48.713587       1 template_helper.go:356] "msg"="route annotation time exceeds maximum allowable by HAProxy, clipping to 2147483647ms" "input"="50000d" "logger"="template"
I0820 19:54:48.713631       1 template_helper.go:356] "msg"="route annotation time exceeds maximum allowable by HAProxy, clipping to 2147483647ms" "input"="50000d" "logger"="template"
I0820 19:54:48.713670       1 template_helper.go:356] "msg"="route annotation time exceeds maximum allowable by HAProxy, clipping to 2147483647ms" "input"="50000d" "logger"="template"
I0820 19:54:48.713864       1 template_helper.go:356] "msg"="route annotation time exceeds maximum allowable by HAProxy, clipping to 2147483647ms" "input"="50000d" "logger"="template"
I0820 19:54:48.713938       1 template_helper.go:356] "msg"="route annotation time exceeds maximum allowable by HAProxy, clipping to 2147483647ms" "input"="50000d" "logger"="template"
I0820 19:54:48.714134       1 template_helper.go:356] "msg"="route annotation time exceeds maximum allowable by HAProxy, clipping to 2147483647ms" "input"="50000d" "logger"="template"
I0820 19:54:48.714216       1 template_helper.go:356] "msg"="route annotation time exceeds maximum allowable by HAProxy, clipping to 2147483647ms" "input"="50000d" "logger"="template"
I0820 19:54:48.714379       1 template_helper.go:356] "msg"="route annotation time exceeds maximum allowable by HAProxy, clipping to 2147483647ms" "input"="50000d" "logger"="template"
I0820 19:54:48.714443       1 template_helper.go:356] "msg"="route annotation time exceeds maximum allowable by HAProxy, clipping to 2147483647ms" "input"="50000d" "logger"="template"
I0820 19:54:48.714703       1 template_helper.go:356] "msg"="route annotation time exceeds maximum allowable by HAProxy, clipping to 2147483647ms" "input"="50000d" "logger"="template"
I0820 19:54:48.714837       1 template_helper.go:356] "msg"="route annotation time exceeds maximum allowable by HAProxy, clipping to 2147483647ms" "input"="50000d" "logger"="template"
E0820 19:54:48.715494       1 haproxy.go:418] can't scrape HAProxy: dial unix /var/lib/haproxy/run/haproxy.sock: connect: no such file or directory
I0820 19:54:48.746673       1 router.go:669] "msg"="router reloaded" "logger"="template" "output"=" - Checking http://localhost:80 using PROXY protocol ...\n - Health check ok : 0 retry attempt(s).\n"

Mark bug as verified
/label qe-approved

grzpiotrowski · 2024-08-21T15:14:40Z

/retest

candita · 2024-08-22T20:54:58Z

/assign

candita · 2024-09-03T19:37:14Z

Unrelated failure in e2e-agnostic.

fail [github.com/openshift/origin/test/extended/cpu_partitioning/crio.go:166]: error getting crio container data from node ci-op-2srp6824-4c798-nf4kn-master-0
Unexpected error:
<*errors.errorString | 0xc001a08ee0>:
err execing command jq: error (at :1): Cannot index array with string "info"
jq: error (at :1): Cannot iterate over null (null)
{
s: "err execing command jq: error (at :1): Cannot index array with string "info"\njq: error (at :1): Cannot iterate over null (null)",}

/test e2e-agnostic

candita · 2024-09-03T19:55:53Z

As far as gathering a list of other places we may need to clip values, I found some that are not annotations:

router/images/router/haproxy/conf/haproxy-config.template

Lines 158 to 167 in 0ecfb1b

    
           timeout connect {{ firstMatch $timeSpecPattern (env "ROUTER_DEFAULT_CONNECT_TIMEOUT") "5s" }} 
        
           timeout client {{ firstMatch $timeSpecPattern (env "ROUTER_DEFAULT_CLIENT_TIMEOUT") "30s" }} 
        
           timeout client-fin {{ firstMatch $timeSpecPattern (env "ROUTER_CLIENT_FIN_TIMEOUT") "1s" }} 
        
           timeout server {{ firstMatch $timeSpecPattern (env "ROUTER_DEFAULT_SERVER_TIMEOUT") "30s" }} 
        
           timeout server-fin {{ firstMatch $timeSpecPattern (env "ROUTER_DEFAULT_SERVER_FIN_TIMEOUT") "1s" }} 
        
           timeout http-request {{ firstMatch $timeSpecPattern (env "ROUTER_SLOWLORIS_TIMEOUT") "10s" }} 
        
           timeout http-keep-alive {{ firstMatch $timeSpecPattern (env "ROUTER_SLOWLORIS_HTTP_KEEPALIVE") "300s" }} 
        
           # Long timeout for WebSocket connections. 
        
           timeout tunnel {{ firstMatch $timeSpecPattern (env "ROUTER_DEFAULT_TUNNEL_TIMEOUT") "1h" }}

router/images/router/haproxy/conf/haproxy-config.template

Line 222 in 0ecfb1b

    
           tcp-request inspect-delay {{ firstMatch $timeSpecPattern (env "ROUTER_INSPECT_DELAY") "5s" }}

router/images/router/haproxy/conf/haproxy-config.template

Line 293 in 0ecfb1b

    
           tcp-request inspect-delay {{ firstMatch $timeSpecPattern (env "ROUTER_INSPECT_DELAY") "5s" }}

However, in https://github.com/openshift/cluster-ingress-operator/blob/8252ac492c04d161fbcf60ef82af2989c99f4a9d/pkg/operator/controller/ingress/deployment.go#L610-L634 we do take care of clipping them.

candita · 2024-09-03T20:20:03Z

@grzpiotrowski this looks good, but we will need some kind of test with this to make sure there are no regressions over time, and validate that it works as is. Maybe something like adding a few test cases to the tests in TestConfigTemplate in https://github.com/openshift/router/blob/master/pkg/router/router_test.go.

* pkg/router/router_test.go (mustCreate): Rename... (mustCreateRoute): ... to this.

* pkg/router/router_test.go (harness): Add a client field. (TestMain): Initialize the harness client field with the Kubernetes client set. (TestConfigTemplate): Add test cases where the endpointslice associated with the route specifies appProtocol with the recognized value "h2c" and unrecognized values "unknown-value" and "kubernetes.io/h2c". Verify that the recognized value causes "proto h2" to be added to the backend server line and that the unrecognized values do not. Modify the test logic to apply an optional mustCreateEndpointSlice from the test expectations. (mustCreateRoute): Add targetServiceName field. ((mustCreateRoute).Apply): Initialize the route's spec.to.name field to the targetServiceName value from the mustCreateRoute parameters if it is specified. This new parameter isn't strictly required for the new test cases, but it makes them more explicit and easier to read. (mustCreateEndpointSlice): New type, used in the new test cases. ((mustCreateEndpointSlice).Apply): Create an endpointslice using the Kubernetes API client set from the harness and the specified endpointslice name, service name, and port appProtocol value from the mustCreateEndpointSlice parameters. (mustCreateWithConfig): Add mustCreateEndpointSlice so that the new test cases can use it. (matchConfig): Add logic for parsing the "server" attribute, which the new test cases use.

This commit fixes OCPBUGS-42972. https://issues.redhat.com/browse/OCPBUGS-42972 * images/router/haproxy/conf/haproxy-config.template: * pkg/router/template/configmanager/haproxy/backend.go ((Backend).UpdateServerInfo): * pkg/router/template/configmanager/haproxy/manager.go ((haproxyConfigManager).ReplaceRouteEndpoints): Recognize both "h2c" and "kubernetes.io/h2c" as valid values for appProtocol. * pkg/router/router_test.go (TestConfigTemplate): Update the expectations for the "route with appProtocol: kubernetes.io/h2c" test case.

…nsistent with ART for 4.18 Reconciling with https://github.com/openshift/ocp-build-data/tree/827ab4ccce9cbbcf82c9dbaf6398b61d6cff8d7a/images/openshift-enterprise-haproxy-router.yml

Use clipHAProxyTimeoutValue on router.openshift.io/haproxy.health.check.interval annotation to ensure it is within the range that HAProxy can parse. This commit fixes OCPBUGS-38078. https://issues.redhat.com/browse/OCPBUGS-38078 * images/router/haproxy/conf/haproxy-config.template: Use clipHAProxyTimeoutValue on `router.openshift.io/haproxy.health.check.interval` annotation.

* pkg/router/router_test.go: (TestConfigTemplate): Add test cases for route health check interval annotation. Verify that the correct value is added to the backend server line and that the values exceeding the maximum haproxy time value get clipped to the max limit. Verify that invalid annotation values result in the default health check interval value applied. (MustCreateRoute): Add targetServiceWeight field ((MustCreateRoute).Apply): Initialize the route's spec.to.weight to the targetServiceWeight value from the mustCreateRoute parameters if it is specified. This parameter allows to set a non-zero weight for the service, making the endpoint Active as a result. (MustCreateEndpointSlice): Add addresses field ((MustCreateEndpointSlice).Apply): Initialize the endpoint's Addresses field to the addresses array from the mustCreateEndpointSlice parameters if specified. This enables having at least two endpoints for the route in the test case and satisfy the conditions needed to configure the check inter in the backend server line.

openshift-ci · 2024-10-16T14:24:29Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from candita. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

grzpiotrowski · 2024-10-16T14:29:36Z

I finally added the health check interval test cases in the TestConfigTemplate.
I piggybacked on the work of @Miciah by using the new mustCreateEndpointSlice type from PR #627. For this reason I opted to rebase my branch on top of the branch from PR #627. I reckon this means that this PR must go after #627.

openshift-ci · 2024-10-16T18:31:06Z

@grzpiotrowski: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name	Commit	Details	Required	Rerun command
ci/prow/e2e-metal-ipi-ovn-ipv6	`e9081e6`	link	false	`/test e2e-metal-ipi-ovn-ipv6`

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

openshift-ci bot requested review from frobware and knobunc August 20, 2024 10:39

grzpiotrowski force-pushed the OCPBUGS-38078-haproxy-healthcheck-interval branch from 0e507e4 to 0ecfb1b Compare August 21, 2024 10:11

openshift-ci bot added the qe-approved Signifies that QE has signed off on this PR label Aug 21, 2024

openshift-ci bot assigned candita Aug 22, 2024

Miciah and others added 6 commits October 10, 2024 11:42

Rename mustCreate to mustCreateRoute

28d1cb5

* pkg/router/router_test.go (mustCreate): Rename... (mustCreateRoute): ... to this.

Updating openshift-enterprise-haproxy-router-container image to be co…

b9e294f

…nsistent with ART for 4.18 Reconciling with https://github.com/openshift/ocp-build-data/tree/827ab4ccce9cbbcf82c9dbaf6398b61d6cff8d7a/images/openshift-enterprise-haproxy-router.yml

grzpiotrowski force-pushed the OCPBUGS-38078-haproxy-healthcheck-interval branch from 0ecfb1b to e9081e6 Compare October 16, 2024 14:24

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

OCPBUGS-38078: Validate HAProxy health check interval time value #618

OCPBUGS-38078: Validate HAProxy health check interval time value #618

grzpiotrowski commented Aug 20, 2024 •

edited

Loading

openshift-ci-robot commented Aug 20, 2024

openshift-ci-robot commented Aug 20, 2024

lihongan commented Aug 20, 2024

candita commented Aug 20, 2024

grzpiotrowski commented Aug 21, 2024

grzpiotrowski commented Aug 21, 2024

rhamini3 commented Aug 21, 2024

grzpiotrowski commented Aug 21, 2024

candita commented Aug 22, 2024

candita commented Sep 3, 2024

candita commented Sep 3, 2024

candita commented Sep 3, 2024

openshift-ci bot commented Oct 16, 2024

grzpiotrowski commented Oct 16, 2024

openshift-ci bot commented Oct 16, 2024

OCPBUGS-38078: Validate HAProxy health check interval time value #618

Are you sure you want to change the base?

OCPBUGS-38078: Validate HAProxy health check interval time value #618

Conversation

grzpiotrowski commented Aug 20, 2024 • edited Loading

openshift-ci-robot commented Aug 20, 2024

openshift-ci-robot commented Aug 20, 2024

lihongan commented Aug 20, 2024

candita commented Aug 20, 2024

grzpiotrowski commented Aug 21, 2024

grzpiotrowski commented Aug 21, 2024

rhamini3 commented Aug 21, 2024

grzpiotrowski commented Aug 21, 2024

candita commented Aug 22, 2024

candita commented Sep 3, 2024

candita commented Sep 3, 2024

candita commented Sep 3, 2024

openshift-ci bot commented Oct 16, 2024

grzpiotrowski commented Oct 16, 2024

openshift-ci bot commented Oct 16, 2024

grzpiotrowski commented Aug 20, 2024 •

edited

Loading