SGX and/or FPGA mutating webhooks for pod creation removes securityContext.appArmorProfile #1943

Open
JJGadgets opened this issue Dec 27, 2024 · 7 comments
Labels
bug Something isn't working

Comments

@JJGadgets

Describe the bug

When the SGX and FPGA mutating webhooks for pod creation are installed in a cluster, pods whose controller template specifies securityContext.appArmorProfile have that field removed when the pod is created. The field is still present on the controller's template spec (e.g. the DaemonSet).

This causes pods/containers that require the Unconfined appArmorProfile, such as the Cilium agent's apply-sysctl-overwrites initContainer, to fail because the cluster default AppArmor profile is applied instead.
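For readers unfamiliar with the field, here is a minimal sketch of where it sits in a pod spec. This is an illustrative fragment, not the actual Cilium manifest: the pod name, images, and main container name are placeholders, and only the initContainer name and the Unconfined type come from this report. The field is shown at pod level; Kubernetes also accepts it in a container's securityContext, and which level the Cilium chart uses is not stated here.

```yaml
# Illustrative fragment only; everything except the initContainer name
# and the Unconfined profile type is a hypothetical placeholder.
apiVersion: v1
kind: Pod
metadata:
  name: example-agent
  namespace: kube-system
spec:
  securityContext:
    # The field reported as missing on the created pod.
    appArmorProfile:
      type: Unconfined
  initContainers:
    - name: apply-sysctl-overwrites
      image: example.invalid/agent:latest
  containers:
    - name: agent
      image: example.invalid/agent:latest
```

If the field is stripped during admission, the created pod falls back to the runtime's default AppArmor profile, which matches the failure described above.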

To Reproduce
Steps to reproduce the behavior:

  1. Install Cilium without setting securityContext.privileged=true anywhere in the Helm values (the Helm values from the Talos docs on installing Cilium can be followed; full chart defaults probably work too). Check with kubectl get ds -n kube-system cilium -o yaml that the DaemonSet specifies appArmorProfile (it does with default values).
  2. Install the Intel Device Plugins Operator Helm chart with full chart default values.
  3. Restart the Cilium agent pod.
  4. Notice that the apply-sysctl-overwrites initContainer fails with an nsenter "permission denied" error. Running kubectl get pod -o yaml on the newly created pod shows no appArmorProfile or any other trace of AppArmor in the YAML.

Expected behavior

The mutating webhook should not modify the appArmorProfile. If it must do so for SGX/FPGA to work (I'm not sure about this), the Helm values should allow disabling those specific webhooks, as not all users use the SGX and FPGA parts of the operator.

Screenshots

With mutating webhook:
image

Without mutating webhook, Cilium agent DaemonSet pod for same node:
image

System (please complete the following information):

  • OS version: Talos 1.9.1
  • Kernel version: Linux 6.12.6-talos
  • Device plugins version: v0.29.0
  • Hardware info: Only Intel i915 iGPU in use
  • Cilium version: 1.16.5


@tkatila
Contributor

tkatila commented Dec 29, 2024

Thanks @JJGadgets for the report. We'll have to look into why it happens.

There is a values entry you could use to limit the operator's scope to only a certain plugin:
--set 'controllerExtraArgs=- --devices=gpu'
(I think you should be able to set multiple devices, but my Helm skills were not enough.)

In this case, sgx and fpga would be ignored by the operator. The webhook Kubernetes objects would still be installed, so it's not an ideal solution.
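The same setting can presumably be kept in a values file instead of a --set flag. The sketch below assumes, based on the --set form above, that controllerExtraArgs is a plain string containing YAML list items that the chart inserts into the operator's argument list. The file name is hypothetical.

```yaml
# values.yaml (sketch): limit the operator to the GPU plugin only.
# Assumption: controllerExtraArgs is a multi-line string, as the
# --set 'controllerExtraArgs=- --devices=gpu' form suggests.
controllerExtraArgs: |
  - --devices=gpu
```

Such a file would be passed to Helm with -f values.yaml in place of the --set argument.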

@JJGadgets
Author

@tkatila thanks for the workaround suggestion, I'll give it a shot.

I deploy Helm charts via a FluxCD HelmRelease, which is a custom resource, so I can set multiple values in a YAML list; no worries there.
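A rough sketch of what such a HelmRelease might look like follows. It is not the author's actual manifest: the release name, namespace, interval, and HelmRepository name are hypothetical, the chart name is assumed, and repeating --devices to enable several device types is an untested assumption.

```yaml
# Sketch of a FluxCD HelmRelease; metadata and sourceRef names are hypothetical.
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: intel-device-plugins-operator
  namespace: kube-system
spec:
  interval: 30m
  chart:
    spec:
      # Assumed chart name.
      chart: intel-device-plugins-operator
      sourceRef:
        kind: HelmRepository
        name: intel
  values:
    # Assumption: one --devices flag per device type; sgx and fpga are left out.
    controllerExtraArgs: |
      - --devices=gpu
      - --devices=qat
```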

@JJGadgets
Author

Alright, the workaround worked. Perhaps the chart should have a per-device-type enabled boolean that both adds the --devices arg and only installs the webhook when that device type is enabled?

E.g.

devices:
  gpu: true
  qat: true
  sgx: false
  fpga: false

@tkatila
Contributor

tkatila commented Dec 30, 2024

> Alright, the workaround worked

Great to hear.

> perhaps the chart should have a per device type enabled boolean, which both adds the --devices arg and only adds the webhook when that device type is enabled?

Sure. Adding such an option would make sense, as would fixing the webhook so it doesn't break the securityContext.

tkatila added the bug label on Dec 30, 2024
@tkatila
Contributor

tkatila commented Dec 30, 2024

The bug didn't reproduce in my vanilla Kubernetes 1.31 environment. I also tried reinstalling the cluster with Cilium, but the appArmorProfile still remained after a pod restart.

Which Kubernetes version is included in Talos 1.9.1? Or which version are you using?

@JJGadgets
Author

Talos 1.9.1 ships with Kubernetes 1.32.0.

@mythi
Contributor

mythi commented Jan 2, 2025

> The mutating webhook should not modify the appArmorProfile. If it must for SGX/FPGA to work (I'm not sure about this)

Neither one does this. With your workaround in place, can you help check which one triggers the behavior you are seeing (by adding sgx or fpga to --devices), and also provide the api-server and operator logs from when that happens?
