Status | SLI | SLO |
---|---|---|
WIP | Time to start 30*#nodes pods, measured from test scenario start until observing last Pod as ready | Benchmark: when all images present on all Nodes, 99th percentile <= X minutes |
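To make the measurement concrete, here is a minimal sketch of how the SLI and SLO check could be computed; the function names, the nearest-rank percentile method, and the sample values are illustrative assumptions, not part of this proposal.

```python
# Illustrative sketch (names and values are assumptions): the SLI for one run
# is the time from test scenario start until the last of the 30 * #nodes pods
# is observed ready; the SLO is then checked against the 99th percentile of
# that SLI across many runs.
import math

def startup_sli(scenario_start, ready_times):
    """Seconds from scenario start until the last Pod became ready."""
    return max(ready_times) - scenario_start

def percentile(samples, p):
    """Nearest-rank p-th percentile of a list of SLI samples."""
    ordered = sorted(samples)
    rank = math.ceil(p / 100 * len(ordered))
    return ordered[rank - 1]

# One run: scenario starts at t=0s, pods become ready over time.
run_sli = startup_sli(0.0, [12.0, 35.5, 41.2])  # -> 41.2 seconds
# Across runs, compare the 99th percentile against the X-minute target.
p99 = percentile([41.2, 38.0, 55.1, 47.3], 99)
```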
- As a user, I want a guarantee that my workload of X pods can be started within a given time
- As a user, I want to understand how quickly I can react to a dramatic change in workload profile when my workload exhibits very bursty behavior (e.g. an online shop during a Black Friday sale)
- As a user, I want a guarantee of how quickly I can recreate the whole setup in case of a serious disaster that brings the whole cluster down.
- Start with a healthy (all nodes ready, all cluster addons already running) cluster with N (>0) running pause pods per node.
- Create a number of `Namespaces` and a number of `Deployments` in each of them.
- All `Namespaces` should be isomorphic, possibly excluding the last one, which should run all pods that didn't fit in the previous ones.
- A single `Namespace` should run 5000 `Pods` in the following configuration:
  - one big `Deployment` running ~1/3 of all `Pods` from this `Namespace`
  - medium `Deployments`, each with 120 `Pods`, in total running ~1/3 of all `Pods` from this `Namespace`
  - small `Deployments`, each with 10 `Pods`, in total running ~1/3 of all `Pods` from this `Namespace`
- Each `Deployment` should be covered by a single `Service`.
- Each `Pod` in any `Deployment` contains two pause containers, one `Secret` other than the default `ServiceAccount` one, and one `ConfigMap`. Additionally, it has resource requests set and doesn't use any advanced scheduling features or init containers.
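The per-namespace layout above can be sketched as a small calculation; the function name and the exact rounding of the "~1/3" splits are assumptions for illustration, not prescribed by this document.

```python
# Hypothetical sketch of the pod layout per namespace: ~5000 Pods split
# roughly into thirds across one big Deployment, medium Deployments of
# 120 Pods each, and small Deployments of 10 Pods each.

def namespace_layout(total_pods=5000, medium_size=120, small_size=10):
    third = total_pods // 3
    big = [third]                                     # one big Deployment, ~1/3 of Pods
    mediums = [medium_size] * (third // medium_size)  # medium Deployments, ~1/3 in total
    remainder = total_pods - sum(big) - sum(mediums)
    smalls = [small_size] * (remainder // small_size) # small Deployments run the rest
    return big, mediums, smalls

big, mediums, smalls = namespace_layout()
# sum(big) + sum(mediums) + sum(smalls) is close to (and never above) 5000
```

With integer rounding the total lands slightly under 5000; per the scenario, the last `Namespace` absorbs any pods that didn't fit in the previous ones.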