
buildx kubernetes driver sometimes returns ERROR: error dialing backend: remote error: tls: internal error #2668

Open

dcherniv opened this issue Aug 31, 2024 · 3 comments

dcherniv commented Aug 31, 2024

Contributing guidelines

I've found a bug and checked that ...

  • ... the documentation does not mention anything about my problem
  • ... there are no open or closed issues that are related to my problem

Description

On EKS 1.29, specifically with ARM64 nodes, there appears to be a race condition where a node's CSR has not been signed yet but the node is already reported as Ready in the cluster.
The issue goes away after a number of seconds, once the CSR is approved and the certificate issued.
During that window, however, all calls to pods scheduled on the node fail with the error in the title.
The buildx kubernetes driver specifically returns this:

ERROR: error dialing backend: remote error: tls: internal error
ERROR: context deadline exceeded

NAME/NODE                             DRIVER/ENDPOINT                                                                                                               STATUS    BUILDKIT   PLATFORMS
buildx-php-release-8.2-2b0efca        kubernetes                                                                                                                                         
 \_ buildx-php-release-8.2-2b0efca0    \_ kubernetes:///buildx-php-release-8.2-2b0efca?deployment=buildkit-8e44b51e-ee72-4a03-873b-8fa232cf24d9-33fdb&kubeconfig=   running   v0.15.2    linux/amd64*
 \_ buildx-php-release-8.2-2b0efca1    \_ kubernetes:///buildx-php-release-8.2-2b0efca?deployment=buildkit-b4c6a857-d35e-4171-b4d3-0e38e1749975-05a58&kubeconfig=   error                linux/arm64*
Failed to get status for buildx-php-release-8.2-2b0efca (buildx-php-release-8.2-2b0efca1): listing workers: failed to list workers: DeadlineExceeded: context deadline exceeded
default*                              docker                                                                                                                                             
 \_ default                            \_ default                                                                                                                   running   v0.15.2    linux/amd64, linux/amd64/v2, linux/amd64/v3, linux/386

I was browsing the buildx source code and found the call that lists the workers: https://github.com/docker/buildx/blob/master/vendor/github.com/moby/buildkit/client/workers.go#L31
But it's not clear what the retry logic is there. It seems that when we get the TLS internal error above, the call simply fails and buildx quits.
Is it possible to handle this specific error somehow? It's transient, and the call should succeed if buildx does some kind of exponential backoff.
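In the meantime a client-side workaround seems possible: wrap the builder status check in a retry loop with exponential backoff instead of a fixed sleep. A minimal shell sketch (attempt counts and delays are illustrative, not tested values):

    # Workaround sketch: retry `docker buildx inspect --bootstrap` with
    # exponential backoff until the builder's workers respond.
    builder="buildx-${DOCKERFILE_DIR_SANITIZED}-${VERSION}"
    delay=5
    for attempt in 1 2 3 4 5; do
      docker buildx inspect --bootstrap "${builder}" && break
      echo "status check failed (attempt ${attempt}), retrying in ${delay}s"
      sleep "${delay}"
      delay=$((delay * 2))   # 5s, 10s, 20s, 40s
    done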

buildx is started as follows:

          docker buildx create --bootstrap --name=buildx-${DOCKERFILE_DIR_SANITIZED}-${VERSION} --driver=kubernetes --platform=linux/amd64 \
                 --buildkitd-flags '--debug --trace' \
                 --driver-opt='"annotations=karpenter.sh/do-not-disrupt=true,karpenter.sh/do-not-evict=true","image=162166941288.dkr.ecr.us-east-1.amazonaws.com/moby/buildkit:v0.15.2","timeout=600s","requests.memory=28Gi","nodeselector=runners=dedicated,kubernetes.io/arch=amd64","tolerations=key=runners,value=dedicated"'
          sleep 10
          docker buildx ls
          docker buildx create --append --bootstrap --name=buildx-${DOCKERFILE_DIR_SANITIZED}-${VERSION} --driver=kubernetes --platform=linux/arm64 \
                 --buildkitd-flags '--debug --trace' \
                 --driver-opt='"annotations=karpenter.sh/do-not-disrupt=true,karpenter.sh/do-not-evict=true","image=162166941288.dkr.ecr.us-east-1.amazonaws.com/moby/buildkit:v0.15.2","timeout=600s","requests.memory=28Gi","nodeselector=runners=dedicated,kubernetes.io/arch=arm64","tolerations=key=runners,value=dedicated;key=arch,value=arm64"'
          sleep 10
          docker buildx ls

buildx and docker versions:
buildkit remote agent booted on the nodes: moby/buildkit:v0.15.2
docker version: docker:27.2-dind with its built-in buildkit, no modifications.
The whole setup runs on self-hosted GitHub Actions runners using version 0.9.3 of oci://ghcr.io/actions/actions-runner-controller-charts.

It seems to only happen under heavy load on the cluster. We have a repo where we build about 20-30 Docker images in parallel (it's our base-images repo). Each image requests two buildx kubernetes workers, one for amd64 and one for arm64, so a lot of nodes get spun up at the same time.

Expected behaviour

buildx should not die when it encounters a transient error.

Actual behaviour

Failure log follows:

#1 [internal] booting buildkit
W0831 21:09:50.507151     230 warnings.go:70] metadata.name: this is used in Pod names and hostnames, which can result in surprising behavior; a DNS label is recommended: [must not contain dots]
#1 waiting for 1 pods to be ready, timeout: 10 minutes
#1 waiting for 1 pods to be ready, timeout: 10 minutes 66.8s done
#1 DONE 66.8s
buildx-php-release-8.2-2b0efca
NAME/NODE                             DRIVER/ENDPOINT                                                                                                               STATUS    BUILDKIT   PLATFORMS
buildx-php-release-8.2-2b0efca        kubernetes                                                                                                                                         
 \_ buildx-php-release-8.2-2b0efca0    \_ kubernetes:///buildx-php-release-8.2-2b0efca?deployment=buildkit-8e44b51e-ee72-4a03-873b-8fa232cf24d9-33fdb&kubeconfig=   running   v0.15.2    linux/amd64*
default*                              docker                                                                                                                                             
 \_ default                            \_ default                                                                                                                   running   v0.15.2    linux/amd64, linux/amd64/v2, linux/amd64/v3, linux/386
#1 [internal] booting buildkit
W0831 21:11:07.543695     328 warnings.go:70] metadata.name: this is used in Pod names and hostnames, which can result in surprising behavior; a DNS label is recommended: [must not contain dots]
#1 waiting for 1 pods to be ready, timeout: 10 minutes
#1 waiting for 1 pods to be ready, timeout: 10 minutes 71.9s done
#1 DONE 71.9s
buildx-php-release-8.2-2b0efca
ERROR: error dialing backend: remote error: tls: internal error
ERROR: context deadline exceeded

Sometimes it is able to proceed past this error (I'm guessing thanks to the sleep 10 statement), but not always.
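A more robust variant of that sleep would be to poll docker buildx ls until the new node stops reporting an error status. A rough sketch (the grep pattern is an assumption based on the tabular output above):

    # Instead of a fixed `sleep 10`, poll for up to ~2 minutes until
    # `docker buildx ls` no longer shows an "error" status for any node.
    # The pattern assumes the column layout shown in the listings above.
    for i in $(seq 1 24); do
      docker buildx ls | grep -q ' error ' || break
      sleep 5
    done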

Buildx version

github.com/docker/buildx v0.16.2 99dea6d

Docker info

/ # docker info
Client:
 Version:    27.2.0
 Context:    default
 Debug Mode: false
 Plugins:
  buildx: Docker Buildx (Docker Inc.)
    Version:  v0.16.2
    Path:     /usr/local/libexec/docker/cli-plugins/docker-buildx
  compose: Docker Compose (Docker Inc.)
    Version:  v2.29.2
    Path:     /usr/local/libexec/docker/cli-plugins/docker-compose

Builders list

NAME/NODE                             DRIVER/ENDPOINT                                                                                                               STATUS    BUILDKIT   PLATFORMS
buildx-php-release-8.2-2b0efca        kubernetes                                                                                                                                         
 \_ buildx-php-release-8.2-2b0efca0    \_ kubernetes:///buildx-php-release-8.2-2b0efca?deployment=buildkit-8e44b51e-ee72-4a03-873b-8fa232cf24d9-33fdb&kubeconfig=   running   v0.15.2    linux/amd64*
 \_ buildx-php-release-8.2-2b0efca1    \_ kubernetes:///buildx-php-release-8.2-2b0efca?deployment=buildkit-b4c6a857-d35e-4171-b4d3-0e38e1749975-05a58&kubeconfig=   error                linux/arm64*
Failed to get status for buildx-php-release-8.2-2b0efca (buildx-php-release-8.2-2b0efca1): listing workers: failed to list workers: DeadlineExceeded: context deadline exceeded
default*                              docker                                                                                                                                             
 \_ default                            \_ default                                                                                                                   running   v0.15.2    linux/amd64, linux/amd64/v2, linux/amd64/v3, linux/386

Configuration

FROM public.ecr.aws/docker/library/php:8.2-apache

ENV DEBIAN_FRONTEND=noninteractive

RUN apt-get -y update &&\
    apt-get -y install gnupg git unzip


Build logs

No response

Additional info

No response
dcherniv (Author) commented

Found the reason why this error pops up: awslabs/amazon-eks-ami#1944
TL;DR: in some cases EKS nodes report Ready status while their CSR is not yet signed and approved. Pods get scheduled and start running, and when buildx attempts to get their status it hits the error above.
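For anyone else debugging this: the race can be observed directly, since a node can be Ready while its kubelet-serving CSR is still pending, and the API server cannot dial the kubelet (hence "error dialing backend") until that certificate is issued. Something along these lines lists the still-pending CSRs (a sketch; the signer name is the standard kubelet-serving one, which I assume EKS uses):

    # List kubelet serving CSRs that are not yet approved; while one exists
    # for a node, calls to pods on that node can fail with
    # "error dialing backend: remote error: tls: internal error".
    kubectl get csr --field-selector=spec.signerName=kubernetes.io/kubelet-serving -o wide | grep -v Approved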

tonistiigi (Member) commented

cc @AkihiroSuda

AkihiroSuda added the kind/bug (Something isn't working) label on Sep 4, 2024

dcherniv commented Sep 9, 2024

Sounds like this discussion has already been had upstream, and the consensus was that clients should retry: kubernetes/kubernetes#73047
