Calico CNI installation fails with cridocker v0.3.12 #345

Open
anfechtung opened this issue Apr 2, 2024 · 23 comments

Comments

@anfechtung

Expected Behavior

Prior to v0.3.12, we were able to install the Calico CNI provider using the Tigera operator on a bare-metal, kubeadm-managed Kubernetes cluster.

Actual Behavior

When updating our process to use cri-dockerd v0.3.12, we see bind errors during the Calico deployment.

Initially the tigera-operator fails to deploy.

  Normal   Pulled     15m                 kubelet            Successfully pulled image "<<redacted>>/tigera/operator:v1.32.3" in 21.153634784s (21.153662597s including waiting)
  Warning  Failed     13m (x12 over 15m)  kubelet            Error: Error response from daemon: invalid mount config for type "bind": bind source path does not exist: /var/lib/calico

After manually creating the directory /var/lib/calico on the controller node, the tigera-operator pod deploys, but the Calico CNI pods fail with:

  Normal   Pulled     2m20s                 kubelet            Successfully pulled image "quay.io/calico/cni:v3.27.0" in 7.987855195s (7.988025596s including waiting)
  Warning  Failed     35s (x10 over 2m20s)  kubelet            Error: Error response from daemon: invalid mount config for type "bind": bind source path does not exist: /opt/cni/bin
  Normal   Pulled     35s (x9 over 2m20s)   kubelet            Container image "quay.io/calico/cni:v3.27.0" already present on machine

Steps to Reproduce the Problem

  1. Install and configure cri-dockerd
  2. Deploy a Kubernetes cluster (v1.25.5) on Docker
  3. Deploy the tigera-operator (v1.23.3)

Specifications

  • Version: kubernetes v1.25.5
  • Platform: ubuntu
  • Subsystem: cri-dockerd v0.3.12

@neersighted
Collaborator

This is because #311 switched cri-dockerd from using the deprecated 'Binds' API to the new 'Mounts' API, which does not create missing directories by default: bf1a9b9

To preserve backward-compatible behavior, we need to set CreateMountpoint to true (as it is false, the zero value, by default) in GenerateMountBindings.
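To make the difference concrete, here is a minimal sketch of the two request shapes as they would appear in the HostConfig of a container-create request to the Engine API. The field names (Binds, Mounts, BindOptions, CreateMountpoint) come from the Engine API; the helper function names are hypothetical and exist only for this illustration, and this is not cri-dockerd's actual code.

```python
import json


def host_config_with_binds(source, target):
    """Legacy form: a "source:target" string in HostConfig.Binds.

    With this form the daemon creates a missing source directory implicitly.
    """
    return {"HostConfig": {"Binds": [f"{source}:{target}"]}}


def host_config_with_mounts(source, target, create_mountpoint=False):
    """Newer form: a structured entry in HostConfig.Mounts.

    CreateMountpoint defaults to False, so a missing source directory is an
    error unless the caller opts in explicitly.
    """
    mount = {
        "Type": "bind",
        "Source": source,
        "Target": target,
        "BindOptions": {"CreateMountpoint": create_mountpoint},
    }
    return {"HostConfig": {"Mounts": [mount]}}


if __name__ == "__main__":
    path = "/var/lib/calico"
    print(json.dumps(host_config_with_binds(path, path), indent=2))
    print(json.dumps(host_config_with_mounts(path, path, create_mountpoint=True), indent=2))
```

The second call shows the opt-in that would be needed to keep the old behavior when switching to Mounts.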

cc @nwneisen @AkihiroSuda

@AkihiroSuda
Contributor

Isn't CreateMountpoint here working?

https://github.com/Mirantis/cri-dockerd/blob/v0.3.12/libdocker/helpers.go#L224

@neersighted
Collaborator

Shoot, I missed that we're setting that in the diff. @anfechtung could you please let us know what Engine version you are using?

@anfechtung
Author

I am assuming by Engine you mean the docker runtime:

root@vm-compute1:~# docker --version
Docker version 24.0.2, build cb74dfc
root@vm-compute1:~#

@neersighted
Collaborator

docker --version reports only the version of the CLI. To get the daemon version, please provide the output of docker version, which interrogates both the client and the server; please also provide docker info.
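As a rough sketch, both versions can be collected in one call by passing a Go template to docker version --format. The function name and the injectable run parameter are assumptions made for this example (the parameter only exists so the function can be exercised without a running daemon).

```python
import subprocess


def docker_versions(run=subprocess.run):
    """Return the client and server versions reported by `docker version`.

    Unlike `docker --version`, this contacts the daemon, so it fails if the
    daemon is unreachable.
    """
    result = run(
        ["docker", "version", "--format", "{{.Client.Version}} {{.Server.Version}}"],
        capture_output=True,
        text=True,
        check=True,
    )
    client, server = result.stdout.split()
    return {"client": client, "server": server}


# Example (requires a running Docker daemon):
#   print(docker_versions())
```

The client and server can differ, for instance when the CLI package is upgraded separately from the engine, which is why the server value is the one that matters here.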

@anfechtung
Author

Client: Docker Engine - Community
 Version:           24.0.2
 API version:       1.43
 Go version:        go1.20.4
 Git commit:        cb74dfc
 Built:             Thu May 25 21:52:13 2023
 OS/Arch:           linux/amd64
 Context:           default

Server: Docker Engine - Community
 Engine:
  Version:          24.0.2
  API version:      1.43 (minimum version 1.12)
  Go version:       go1.20.4
  Git commit:       659604f
  Built:            Thu May 25 21:52:13 2023
  OS/Arch:          linux/amd64
  Experimental:     true
 containerd:
  Version:          1.6.21
  GitCommit:        3dce8eb055cbb6872793272b4f20ed16117344f8
 runc:
  Version:          1.1.7
  GitCommit:        v1.1.7-0-g860f061
 docker-init:
  Version:          0.19.0
  GitCommit:        de40ad0
root@vm-compute1:~#

root@vm-compute1:~# docker info
Client: Docker Engine - Community
 Version:    24.0.2
 Context:    default
 Debug Mode: false
 Plugins:
  buildx: Docker Buildx (Docker Inc.)
    Version:  v0.10.5
    Path:     /usr/libexec/docker/cli-plugins/docker-buildx
  compose: Docker Compose (Docker Inc.)
    Version:  v2.18.1
    Path:     /usr/libexec/docker/cli-plugins/docker-compose

Server:
 Containers: 156
  Running: 101
  Paused: 0
  Stopped: 55
 Images: 79
 Server Version: 24.0.2
 Storage Driver: overlay2
  Backing Filesystem: extfs
  Supports d_type: true
  Using metacopy: false
  Native Overlay Diff: true
  userxattr: false
 Logging Driver: json-file
 Cgroup Driver: cgroupfs
 Cgroup Version: 1
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: inactive
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: 3dce8eb055cbb6872793272b4f20ed16117344f8
 runc version: v1.1.7-0-g860f061
 init version: de40ad0
 Security Options:
  apparmor
  seccomp
   Profile: builtin
 Kernel Version: 5.4.0-170-generic
 Operating System: Ubuntu 18.04.6 LTS
 OSType: linux
 Architecture: x86_64
 CPUs: 4
 Total Memory: 12.74GiB
 Name: vm-compute1
 ID: 29543cf3-2f2a-45f6-a42a-cd31c9385775
 Docker Root Dir: /var/lib/docker
 Debug Mode: false

@AkihiroSuda
Contributor

Might we be downgrading the API version to <= v1.41 somewhere?
https://github.com/moby/moby/blob/v24.0.2/api/server/router/container/container_routes.go#L526
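The hypothesis can be expressed as a small model. It assumes, as the comment suggests, that the daemon drops CreateMountpoint for requests made with API v1.41 or older; the threshold is taken from the comment, the function names are hypothetical, and this is an illustration of the idea rather than the daemon's code.

```python
def parse_api_version(version):
    """Turn "1.43" or "v1.43" into a comparable tuple such as (1, 43)."""
    return tuple(int(part) for part in version.lstrip("v").split("."))


def effective_create_mountpoint(api_version, requested):
    """Model of the hypothesis: old API versions ignore CreateMountpoint.

    Assumption: requests at API v1.41 or below have the option silently reset
    to False, so the bind source must already exist on the host.
    """
    if parse_api_version(api_version) <= (1, 41):
        return False
    return requested


if __name__ == "__main__":
    for version in ("1.41", "1.43"):
        print(version, effective_create_mountpoint(version, requested=True))
```

Under this model, a client that negotiates a lower API version than the daemon's maximum would see exactly the reported error even though it set the option.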

@anfechtung
Author

Do you have any potential workarounds? Or a planned fix? I am trying to determine if it makes sense to go down the rabbit hole of pre-creating all of the needed directories.

@neersighted
Collaborator

Someone has to figure out exactly what's going over the wire and whether the issue is on the client or server side. I don't think there are any workarounds outside of pre-creating the directories on the host.
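For anyone taking the pre-creation route, a minimal sketch follows. The directory list contains only the two paths from the errors reported so far; other workloads may need more. The function name, the root parameter (used to try it against a scratch directory), and the 0o755 mode are assumptions for this example.

```python
import os

# Host paths that appear in the bind errors reported in this thread.
HOST_DIRS = ["/var/lib/calico", "/opt/cni/bin"]


def ensure_host_dirs(paths, root="/"):
    """Create each directory (with parents) under root; return the ones created."""
    created = []
    for path in paths:
        full = os.path.join(root, path.lstrip("/"))
        if not os.path.isdir(full):
            # Assumed mode; adjust if a workload needs different permissions.
            os.makedirs(full, mode=0o755, exist_ok=True)
            created.append(full)
    return created


# Example (run as root on every node before deploying):
#   ensure_host_dirs(HOST_DIRS)
```

This has to run on every node that may schedule the affected pods, which is why it is only a stopgap.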

@neersighted
Collaborator

#346 ought to solve this; would you mind testing a build off of master?

That being said, I think we should keep this issue open until we have a regression test.

@anfechtung
Author

Is there a deb package built from master, or would I need to build from master? Currently we are using the deb package to install.

@neersighted
Collaborator

You would need to build from master; there are instructions, and it is as simple as running go build and moving the binary into the bin directory. Obviously that's not ideal and you'd want a release for production, but hopefully it validates the fix for you (and you'd get packages from the next patch release).

@anfechtung
Author

I compiled from master, and dropped the new binary on my cluster. I am still getting the same error. I tried setting the log level for the cri-docker service to debug, but it didn't produce anything useful.

@anfechtung
Author

After reading through the docker documentation, and the go docker libraries (Mount and Volume), I think this is simply the expected behavior when using docker mounts.

@neersighted
Collaborator

It looks like some more digging will have to be done to determine where the fault lies; however, this is not the intended behavior. Kubernetes requires implicit directory creation as it was based on the Engine Binds API, which had this default behavior. We specifically added a new option to the Mounts API to enable implicit directory creation in v23, so if it doesn't work, there is a bug either in the daemon, or in cri-dockerd.

@adthonb

adthonb commented Apr 19, 2024

I have the same problem with a Promtail pod. It tries to bind-mount the path /run/promtail but can't. Normally, the directory is created when the container is initialized. Nodes using cri-dockerd 0.3.11 work normally.

Pod Event

Events:
  Warning  Failed  8m17s (x12 over 10m)  kubelet  Error: Error response from daemon: invalid mount config for type "bind": bind source path does not exist: /run/promtail

cri-dockerd version

$ cri-dockerd --version
cri-dockerd 0.3.12 (c2e3805)

Docker Information

$ docker version
Client: Docker Engine - Community
 Version:           25.0.2
 API version:       1.44
 Go version:        go1.21.6
 Git commit:        29cf629
 Built:             Thu Feb  1 00:22:57 2024
 OS/Arch:           linux/amd64
 Context:           default

Server: Docker Engine - Community
 Engine:
  Version:          25.0.2
  API version:      1.44 (minimum version 1.24)
  Go version:       go1.21.6
  Git commit:       fce6e0c
  Built:            Thu Feb  1 00:22:57 2024
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.6.28
  GitCommit:        ae07eda36dd25f8a1b98dfbf587313b99c0190bb
 runc:
  Version:          1.1.12
  GitCommit:        v1.1.12-0-g51d5e94
 docker-init:
  Version:          0.19.0
  GitCommit:        de40ad0

$ docker info
Client: Docker Engine - Community
 Version:    25.0.2
 Context:    default
 Debug Mode: false
 Plugins:
  buildx: Docker Buildx (Docker Inc.)
    Version:  v0.12.1
    Path:     /usr/libexec/docker/cli-plugins/docker-buildx
  compose: Docker Compose (Docker Inc.)
    Version:  v2.24.5
    Path:     /usr/libexec/docker/cli-plugins/docker-compose
Server:
 Containers: 27
  Running: 25
  Paused: 0
  Stopped: 2
 Images: 24
 Server Version: 25.0.2
 Storage Driver: overlay2
  Backing Filesystem: extfs
  Supports d_type: true
  Using metacopy: false
  Native Overlay Diff: true
  userxattr: false
 Logging Driver: json-file
 Cgroup Driver: systemd
 Cgroup Version: 2
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local splunk syslog
 Swarm: inactive
 Runtimes: io.containerd.runc.v2 runc
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: ae07eda36dd25f8a1b98dfbf587313b99c0190bb
 runc version: v1.1.12-0-g51d5e94
 init version: de40ad0
 Security Options:
  apparmor
  seccomp
   Profile: builtin
  cgroupns
 Kernel Version: 5.15.0-102-generic
 Operating System: Ubuntu 22.04.3 LTS
 OSType: linux
 Architecture: x86_64
 CPUs: 16
 Total Memory: 7.61GiB
 Name: c3-pn-k8s-cp-01
 ID: d9c0761d-e30b-482c-98b5-24129d5e370a
 Docker Root Dir: /var/lib/docker
 Debug Mode: false
 Username: cthongrak
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false

@neersighted
Collaborator

@corhere and @nwneisen are preparing a new 0.3 release that should revert the problematic change, though we still need to solve this for 0.4 in order to move forward.

@AkihiroSuda
Contributor

AkihiroSuda commented Apr 21, 2024

Is there any minimal reproducer that does not depend on Calico?

I can't reproduce the issue with the following YAML:

---
apiVersion: v1
kind: Pod
metadata:
  name: bind
spec:
  volumes:
    - name: mnt
      hostPath:
        path: /tmp/non-existent
  containers:
    - name: busybox
      image: busybox
      args: ["sleep", "infinity"]
      volumeMounts:
        - name: mnt
          mountPath: /mnt

(cri-dockerd v0.3.12, Docker v26.0.1, Kubernetes v1.30.0)
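A reproducer like this is only meaningful if the hostPath source really is absent on the node before the pod is created. The sketch below checks that precondition. The manifest is transcribed as a plain dict to avoid a YAML dependency, and the function name and the injectable exists parameter are assumptions for this example.

```python
import os

# The Pod from the reproducer, written as a dict instead of YAML.
POD = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "bind"},
    "spec": {
        "volumes": [{"name": "mnt", "hostPath": {"path": "/tmp/non-existent"}}],
        "containers": [
            {
                "name": "busybox",
                "image": "busybox",
                "args": ["sleep", "infinity"],
                "volumeMounts": [{"name": "mnt", "mountPath": "/mnt"}],
            }
        ],
    },
}


def missing_host_paths(pod, exists=os.path.exists):
    """Return hostPath sources in the Pod spec that do not exist on this host."""
    missing = []
    for volume in pod.get("spec", {}).get("volumes", []):
        host_path = volume.get("hostPath")
        if host_path is not None and not exists(host_path["path"]):
            missing.append(host_path["path"])
    return missing


if __name__ == "__main__":
    # Run on the node itself; an empty list means the test would not exercise
    # implicit directory creation at all.
    print(missing_host_paths(POD))
```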

@anfechtung
Author

Not sure what may have changed, but this same error does not occur in v0.3.13.

@corhere
Collaborator

corhere commented Apr 26, 2024

@anfechtung v0.3.13 has the problematic change #311 reverted.

@AkihiroSuda
Contributor

Still can't repro the issue with calico. I wonder if the issue might have been already fixed in a recent version of Docker?

kubectl create -f https://raw.githubusercontent.com/projectcalico/calico/v3.27.3/manifests/tigera-operator.yaml

Used minikube v1.33 (Kubernetes v1.30.0, Docker v26.0.1, cri-dockerd v0.3.12, according to strings /usr/bin/cri-dockerd)
Followed the "Operator" steps in https://docs.tigera.io/calico/3.27/getting-started/kubernetes/minikube

@AkihiroSuda
Contributor

AkihiroSuda commented Apr 30, 2024

Bad Docker versions: <= v24.0.9, <= v25.0.3
Good Docker versions: >= v25.0.4, >= v26.0.0

Seems fixed in moby/moby@v25.0.3...v25.0.4

@nwneisen
Collaborator

nwneisen commented May 2, 2024

I was able to reproduce both the error and the fix. I followed the Calico quickstart steps using minikube. This was all done using c2e3805, v0.3.12.

Failure

Using minikube v1.31.1, calico fails due to the missing mount

nneisen:~/code/cri-dockerd (master): minikube version
minikube version: v1.31.1
commit: fd3f3801765d093a485d255043149f92ec0a695f
nneisen:~/code/cri-dockerd (master): kubectl get pods -A
tigera-operator   tigera-operator-786dc9d695-p86vw   0/1     CreateContainerError   0            24s
nneisen:~/code/cri-dockerd (master): kubectl describe pod tigera-operator-786dc9d695-p86vw -n tigera-operator
Events:
  Type     Reason     Age                From               Message
  ----     ------     ----               ----               -------
  Normal   Scheduled  66s                default-scheduler  Successfully assigned tigera-operator/tigera-operator-786dc9d695-p86vw to minikube
  Normal   Pulling    66s                kubelet            Pulling image "quay.io/tigera/operator:v1.32.7"
  Normal   Pulled     60s                kubelet            Successfully pulled image "quay.io/tigera/operator:v1.32.7" in 5.814069007s (5.814076587s including waiting)
  Warning  Failed     10s (x6 over 60s)  kubelet            Error: Error response from daemon: invalid mount config for type "bind": bind source path does not exist: /var/lib/calico
  Normal   Pulled     10s (x5 over 60s)  kubelet            Container image "quay.io/tigera/operator:v1.32.7" already present on machine

Working

After upgrading my minikube version to v1.33.0, calico is successful

nneisen:~/code/cri-dockerd (master): minikube version
minikube version: v1.33.0
commit: 86fc9d54fca63f295d8737c8eacdbb7987e89c67
nneisen:~/code/cri-dockerd (master): kubectl get pods -A
tigera-operator   tigera-operator-6678f5cb9d-h7c9f   1/1     Running   0          10s
nneisen:~/code/cri-dockerd (master): kubectl describe pod tigera-operator-6678f5cb9d-h7c9f -n tigera-operator
Events:
  Type    Reason     Age    From               Message
  ----    ------     ----   ----               -------
  Normal  Scheduled  3m4s   default-scheduler  Successfully assigned tigera-operator/tigera-operator-6678f5cb9d-h7c9f to minikube
  Normal  Pulling    3m3s   kubelet            Pulling image "quay.io/tigera/operator:v1.32.7"
  Normal  Pulled     2m59s  kubelet            Successfully pulled image "quay.io/tigera/operator:v1.32.7" in 4.388s (4.388s including waiting). Image size: 69724923 bytes.
  Normal  Created    2m59s  kubelet            Created container tigera-operator
  Normal  Started    2m59s  kubelet            Started container tigera-operator

Solution

We should document that:

  • the master branch and 0.4.x releases require Docker >= v25.0.4 (within the 25.0.x series) or >= v26.0.0
  • the release/0.3.x branch and its releases are for Docker <= v24.0.9 and <= v25.0.3
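The proposed rule can be sketched as a version check. The threshold is based only on the versions reported good or bad in this thread, the function names are hypothetical, and the parser assumes plain X.Y.Z version strings (no pre-release suffixes).

```python
def parse_version(version):
    """Turn "v25.0.4" or "25.0.4" into a comparable tuple such as (25, 0, 4)."""
    return tuple(int(part) for part in version.lstrip("v").split("."))


# First Docker release reported in this thread to work with cri-dockerd v0.3.12.
FIRST_GOOD_DOCKER = (25, 0, 4)


def docker_ok_for_mounts_api(docker_version):
    """True if this Docker version is in the range reported as good."""
    return parse_version(docker_version) >= FIRST_GOOD_DOCKER


def suggested_cri_dockerd_line(docker_version):
    """Map a Docker version to the cri-dockerd line proposed above."""
    if docker_ok_for_mounts_api(docker_version):
        return "master / 0.4.x"
    return "release/0.3.x"


if __name__ == "__main__":
    for version in ("24.0.2", "25.0.2", "25.0.4", "26.0.1"):
        print(version, suggested_cri_dockerd_line(version))
```

The four sample versions are the ones mentioned by participants in this thread: the first two hit the bind error, and the last two did not.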

cc: @corhere @neersighted @AkihiroSuda
