# System throughput SLI/SLO details

## Definition

| Status | SLI | SLO |
| --- | --- | --- |
| WIP | Time to start 30 * #nodes Pods, measured from the start of the test scenario until the last Pod is observed as ready | Benchmark: with all images already present on all Nodes, 99th percentile <= X minutes |
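For example, on a 100-node cluster this SLI measures the time to start 3000 Pods (30 * 100), from the start of the test scenario until the last of them is observed as ready.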

## User stories

- As a user, I want a guarantee that my workload of X Pods can be started within a given time.
- As a user, I want to understand how quickly I can react to a dramatic change in workload profile when my workload exhibits very bursty behavior (e.g. a shop during a Black Friday sale).
- As a user, I want a guarantee of how quickly I can recreate the whole setup in case of a serious disaster that brings the whole cluster down.

## Test scenario

- Start with a healthy cluster (all Nodes ready, all cluster addons already running) with N (>0) running pause Pods per Node.
- Create a number of Namespaces, and a number of Deployments in each of them.
- All Namespaces should be isomorphic, possibly excluding the last one, which should run all Pods that didn't fit into the previous ones.
- A single Namespace should run 5000 Pods in the following configuration:
  - one big Deployment running ~1/3 of all Pods from this Namespace,
  - medium Deployments, each with 120 Pods, in total running ~1/3 of all Pods from this Namespace,
  - small Deployments, each with 10 Pods, in total running ~1/3 of all Pods from this Namespace.
- Each Deployment should be covered by a single Service.
- Each Pod in any Deployment contains two pause containers, one Secret (other than the default ServiceAccount token), and one ConfigMap. Additionally, it has resource requests set and doesn't use any advanced scheduling features or init containers. A sketch of one such Deployment and its Service follows this list.
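With these numbers, a single Namespace works out to roughly one big Deployment of ~1667 Pods, ~14 medium Deployments of 120 Pods each, and ~167 small Deployments of 10 Pods each. Below is a minimal sketch of what one small Deployment and its covering Service could look like under the constraints above; all names, the namespace, the pause image tag, the request sizes, and the Service port are illustrative assumptions rather than part of the benchmark definition.

```yaml
# A hypothetical manifest for one "small" Deployment from the scenario and its
# covering Service. Names, namespace, image tag, and request sizes are
# illustrative assumptions.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: small-deployment-1          # hypothetical name
  namespace: test-namespace-1       # hypothetical namespace
spec:
  replicas: 10                      # "small" Deployments run 10 Pods each
  selector:
    matchLabels:
      group: small-deployment-1
  template:
    metadata:
      labels:
        group: small-deployment-1
    spec:
      containers:
      # Two pause containers per Pod, each with resource requests set.
      - name: pause-1
        image: registry.k8s.io/pause:3.9   # assumed pause image
        resources:
          requests:
            cpu: 10m
            memory: 10Mi
        volumeMounts:
        - name: secret-vol
          mountPath: /var/secret
        - name: configmap-vol
          mountPath: /var/configmap
      - name: pause-2
        image: registry.k8s.io/pause:3.9
        resources:
          requests:
            cpu: 10m
            memory: 10Mi
      volumes:
      # One Secret (other than the default ServiceAccount token) and one
      # ConfigMap per Pod, as described above.
      - name: secret-vol
        secret:
          secretName: small-secret-1       # hypothetical Secret name
      - name: configmap-vol
        configMap:
          name: small-configmap-1          # hypothetical ConfigMap name
---
# Each Deployment is covered by exactly one Service. The port is illustrative;
# pause containers don't actually serve traffic.
apiVersion: v1
kind: Service
metadata:
  name: small-service-1
  namespace: test-namespace-1
spec:
  selector:
    group: small-deployment-1
  ports:
  - port: 80
    targetPort: 80
```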