Pod level metrics are not being surfaced by the shim #180

Open
kate-goldenring opened this issue Aug 21, 2024 · 2 comments
Labels: bug (Something isn't working)

@kate-goldenring (Collaborator)

This is a summary of a thread from the #spinkube channel in the CNCF Slack. Thank you @asteurer for discovering this issue!

The issue

Initial discovery: when running a CPU intensive Spin app with the shim, the CPU usage reported for the Pod stays static as load/requests increase (the output of kubectl top pods does not change). This makes it impossible to use the Horizontal Pod Autoscaler with SpinKube. The behavior is consistent across the tested K8s distributions below -- the only distribution that does not exhibit it is k3d:

distro   containerd       works
k3d      v1.7.7-k3s1.27   yes
AKS      1.7.15-1         no
k3s      1.6.28           no
Kind     1.7.15           no

Repro steps

  1. Apply a CPU intensive spin app deployment and the HPA:
apiVersion: apps/v1
kind: Deployment
metadata:
 name: spin-test
spec:
 replicas: 1
 selector:
   matchLabels:
     app: spin-test
 template:
   metadata:
     labels:
       app: spin-test
   spec:
     runtimeClassName: wasmtime-spin-v2
     containers:
     - name: spin-test
       image: ghcr.io/spinkube/spin-operator/cpu-load-gen:20240311-163328-g1121986
       command: ["/"]
       ports:
       - containerPort: 80
       resources:
         requests:
           cpu: 100m
           memory: 400Mi
         limits:
           cpu: 500m
           memory: 600M
---
apiVersion: v1
kind: Service
metadata:
 name: spin-test
spec:
 ports:
   - protocol: TCP
     port: 80
     targetPort: 80
 selector:
   app: spin-test
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
 name: spinapp-autoscaler
spec:
 scaleTargetRef:
   apiVersion: apps/v1
   kind: Deployment
   name: spin-test
 minReplicas: 1
 maxReplicas: 10
 metrics:
 - type: Resource
   resource:
     name: cpu
     target:
       type: Utilization
       averageUtilization: 50
---
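With the manifest above saved to a file, it can be applied with the usual command (the filename below is just illustrative):

kubectl apply -f spin-test.yaml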
  2. Call the stats API for the node the Pod is running on:
kubectl get --raw "/api/v1/nodes/$NODE_NAME/proxy/stats/summary?only_cpu_and_memory=true" | grep "spin-test" -C 40
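
If $NODE_NAME is not already set, one way to look it up from the running Pod (a sketch, assuming the app=spin-test label from the manifest above):

NODE_NAME=$(kubectl get pod -l app=spin-test -o jsonpath='{.items[0].spec.nodeName}')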

Output may look similar to the following:

  {
   "podRef": {
    "name": "spin-test-66d9dd45f5-csf5j",
    "namespace": "default",
    "uid": "d82c9414-c690-4b0c-925f-4ead983edce4"
   },
   "startTime": "2024-08-21T17:22:33Z",
   "containers": [
    {
     "name": "spin-test",
     "startTime": "2024-08-21T17:22:33Z",
     "cpu": {
      "time": "2024-08-21T17:28:32Z",
      "usageNanoCores": 4728,
      "usageCoreNanoSeconds": 1926321
     },
     "memory": {
      "time": "2024-08-21T17:28:32Z",
      "workingSetBytes": 25096192
     }
    }
   ],
   "cpu": {
    "time": "2024-08-21T17:28:32Z",
    "usageNanoCores": 0,
    "usageCoreNanoSeconds": 0
   },
   "memory": {
    "time": "2024-08-21T17:28:32Z",
    "availableBytes": 599998464,
    "usageBytes": 0,
    "workingSetBytes": 0,
    "rssBytes": 0,
    "pageFaults": 0,
    "majorPageFaults": 0
   }
  },

Notice how the Pod CPU and memory usage values are 0 while the container has properly propagated values.
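
For a quicker look at just the Pod-level block, the same summary can be filtered with jq (a sketch; assumes jq is installed and uses the Pod name prefix from above):

kubectl get --raw "/api/v1/nodes/$NODE_NAME/proxy/stats/summary?only_cpu_and_memory=true" \
  | jq '.pods[] | select(.podRef.name | startswith("spin-test")) | {cpu, memory}'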

  3. Load the app to see if the HPA increases replicas
    If Pod metrics were properly reported, the app replicas would increase.
# After port forwarding to port 3000
bombardier localhost:3000 -n 10 -t 30s
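
The port forwarding referenced above can be set up with something like the following (a sketch; the service listens on port 80 per the manifest, so local port 3000 is mapped to it):

kubectl port-forward svc/spin-test 3000:80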

Calling the stats API during the load test shows that, while the container usageNanoCores jumped from 4728 to 497486, the Pod-level metrics did not change, nor did the app replica count or the output of kubectl top pods for that Pod.
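
While the load test runs, the HPA and Pod metrics can be watched with standard kubectl commands, for example:

# With the bug present, the reported utilization does not rise under load
kubectl get hpa spinapp-autoscaler --watch
kubectl top pods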

Other investigation

Pod metrics are surfaced for normal containers not executed with the shim (without runtime class wasmtime-spin-v2 specified):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: spin-test
spec:
  replicas: 1
  selector:
    matchLabels:
      app: spin-test
  template:
    metadata:
      labels:
        app: spin-test
    spec:
      containers:
      - name: spin-test
        image: ghcr.io/kate-goldenring/spin-in-container:fib
        ports:
        - containerPort: 3000
---
apiVersion: v1
kind: Service
metadata:
  name: spin-test
spec:
  ports:
    - protocol: TCP
      port: 3000
      targetPort: 3000
  selector:
    app: spin-test
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: hpa-autoscaler
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: spin-test
  minReplicas: 1
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 50

However, if that same container is executed with the shim, Pod metrics are no longer surfaced.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: spin-test-runwasi
spec:
  replicas: 1
  selector:
    matchLabels:
      app: spin-test-runwasi
  template:
    metadata:
      labels:
        app: spin-test-runwasi
    spec:
      runtimeClassName: wasmtime-spin-v2
      containers:
      - name: spin-test-runwasi
        image: ghcr.io/kate-goldenring/spin-in-container:fib
        ports:
        - containerPort: 3000
---
apiVersion: v1
kind: Service
metadata:
  name: spin-test-runwasi
spec:
  ports:
    - protocol: TCP
      port: 3000
      targetPort: 3000
  selector:
    app: spin-test-runwasi
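
With both deployments running, the difference is visible side by side (a sketch using standard commands and the labels from the manifests above):

kubectl top pods -l app=spin-test            # container runtime: usage tracks load
kubectl top pods -l app=spin-test-runwasi    # shim: usage stays static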

Possible solutions and areas to investigate

Some areas to investigate, suggested by @jsturtevant and @radu-matei:

  • cgroup version being used (a quick check is sketched after this list)
  • how runwasi handles / mocks Pod creation here
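
For the cgroup bullet, the cgroup version on a node can be checked with the following (cgroup2fs indicates cgroup v2, tmpfs indicates v1):

stat -fc %T /sys/fs/cgroup/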
kate-goldenring self-assigned this Aug 21, 2024

@kate-goldenring (Collaborator, Author)

I wonder if this may be the issue: containerd/cri#922. Specifically, we may need to add the io.kubernetes.container.name=="POD" label to the pause container. This may also explain why this works on k3d, if k3d is using Docker as the container runtime instead of containerd (not sure whether that is the case, though).
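
One way to check which labels the pause (sandbox) container actually carries on an affected node is to inspect it with crictl (a sketch; assumes crictl is configured against the node's containerd socket):

# Find the sandbox for the spin-test Pod and dump its labels
POD_ID=$(sudo crictl pods --name spin-test -q | head -n 1)
sudo crictl inspectp "$POD_ID" | grep -A 10 '"labels"'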

@kate-goldenring (Collaborator, Author)

@jprendes and I spent some time setting up GDB debugging with the shim for this. While we did not come to any new conclusions, I wanted to share our repro steps:

Debugging with GDB

  1. Install K3s. This uses kwasm to configure containerd to load the shim from /opt/kwasm/bin/containerd-shim-spin-v2, so be sure to place your debug binary there

    wget https://gist.githubusercontent.com/kate-goldenring/a90bbe696d2cd48b44c093e1154047c0/raw/93f6ee1281123858290cb2a6ac61141e4671d38c/spin-kube-k3s.sh
    chmod +x ./spin-kube-k3s.sh
    ./spin-kube-k3s.sh
  2. Download the Native Debug VSCode extension

  3. Create a script that allows executing gdb as the sudo user:

    #!/bin/bash
    # Wrapper that runs gdb with sudo so the debugger can attach to the shim process
    sudo gdb "$@"
  4. Create gdb launch.json

{
    // Use IntelliSense to learn about possible attributes.
    // Hover to view descriptions of existing attributes.
    // For more information, visit: https://go.microsoft.com/fwlink/?linkid=830387
    "version": "0.2.0",
    "configurations": [
        {
            "type": "gdb",
            "request": "attach",
            "name": "Attach to PID",
            "target": "{PID}",
            "cwd": "${workspaceRoot}",
            "valuesFormatting": "parseText",
            "gdbpath": "/home/kagold/projects/containerd-shim-spin/_scratch/resources-debug/sudo-gdb.sh"
        }
    ]
}
  5. Build a debug version of the shim and move it to the expected shim location for the K8s distro
  6. Apply the spin app deployment
  7. Get the spin process PID and update launch.json to target it (one way to find the PID is sketched after this list)
  8. Run the debugger in VSCode
  9. Pause the debugger and add the desired breakpoint
  10. (repeat)
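
For step 7, one way to find the shim PID to target (a sketch; assumes the shim binary name used above):

pgrep -f containerd-shim-spin-v2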
