From 1877d812747788192816e8b6f82233142645cc5d Mon Sep 17 00:00:00 2001
From: Mat Kowalski
Date: Thu, 17 Oct 2024 10:13:13 +0200
Subject: [PATCH] OCPBUGS-43428: Soften haproxy timeout for kubeapi probe

This PR changes the timeouts used by haproxy when deciding whether the
master backend (i.e. the k8s api server) is dead or alive. The previous
probe was relatively strict, allowing for a very fast failover but at
the same time being very prone to temporary flakiness.

The new configuration aligns haproxy with the readiness probe used by
k8s when detecting whether a pod is dead or alive. Aligning those
configurations removes the mismatch where k8s believes the api server
is ready but haproxy sees it as dead.

A consequence of this change is a potential increase in downtime when
the api server is forcefully removed. In the worst-case scenario we may
see unavailability for 15 seconds. This should not happen often in real
setups, but for the sake of completeness it should be noted.

Fixes: OCPBUGS-43428
---
 templates/master/00-master/on-prem/files/haproxy-haproxy.yaml | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/templates/master/00-master/on-prem/files/haproxy-haproxy.yaml b/templates/master/00-master/on-prem/files/haproxy-haproxy.yaml
index b402ecb9b5..a7d0d4ae85 100644
--- a/templates/master/00-master/on-prem/files/haproxy-haproxy.yaml
+++ b/templates/master/00-master/on-prem/files/haproxy-haproxy.yaml
@@ -36,8 +36,9 @@ contents:
       stats refresh 30s
       stats auth Username:Password
     backend masters
+      timeout check 10s
       option httpchk GET /readyz HTTP/1.0
       balance roundrobin
 {{`{{- range .LBConfig.Backends }}
-      server {{ .Host }} {{ .Address }}:{{ .Port }} weight 1 verify none check check-ssl inter 1s fall 2 rise 3
+      server {{ .Host }} {{ .Address }}:{{ .Port }} weight 1 verify none check check-ssl inter 5s fall 3 rise 1
 {{- end }}`}}
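
Illustrative sketch (not part of the patch): after template expansion, the
rendered backend section would look roughly like the snippet below. The host
name, address and port are made-up example values, not taken from this change;
only the check parameters reflect the new configuration.

    backend masters
      timeout check 10s
      option httpchk GET /readyz HTTP/1.0
      balance roundrobin
      # inter 5s fall 3: the server is marked down only after 3 consecutive
      # failed /readyz checks spaced 5s apart, i.e. up to ~15s in the worst
      # case, matching the downtime noted in the commit message.
      # rise 1: a single successful check brings the server back into rotation.
      # timeout check 10s: roughly, an individual check may take up to 10s
      # before it counts as failed (assumption based on haproxy's documented
      # check timeout behaviour).
      server master-0 192.0.2.10:6443 weight 1 verify none check check-ssl inter 5s fall 3 rise 1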