TLS Handshake error between Nats pod and kube-system pods #921

Open
Arkessler opened this issue Jul 26, 2024 · 1 comment
Labels
defect Suspected defect such as a bug or regression

Comments

@Arkessler

What version were you using?

Helm chart version: 1.2.1
nats server image: nats:2.10.17-alpine

What environment was the server running in?

Azure AKS k8s version 1.28.9

The nats chart is included as a sub-chart of our main release, so the following config gets merged with the nats chart's default values.yaml.

Nats section of the values.yaml

nats:
  # ========================== My Org's custom nats config =========================
  # Whether or not to install the nats subchart to the "nats" namespace
  enabled: false

  # FQDN for bridge agents to connect to nats
  # Example: ORG-bridge.company.com
  bridgeAddress: 
  
  # Name of the secret containing the TLS certificate. Defines a helm anchor so
  # that this value can be passed to the nats chart below.
  tlsCertSecretName: &tlsCertSecretName bridge-ingress-cert

  # =========================== Nats Helm Chart Config ============================
  # Expose service as a public IP
  service:
    merge:
      spec:
        type: LoadBalancer

  # Config for nats server
  config:
    # Flexible config block that gets merged into nats.conf
    merge:
      # Resolver.conf config for memory resolver. Set with:
      # --set nats.config.merge.operator=$NATS_OPERATOR_JWT \
      # --set nats.config.merge.system_account=$NATS_SYS_ACCOUNT_ID \
      # --set nats.config.merge.resolver_preload.$NATS_SYS_ACCOUNT_ID=$NATS_SYS_ACCOUNT_JWT
      operator: OPERATOR_JWT
      system_account: SYSTEM_ACCOUNT_ID
      resolver: MEM
      resolver_preload:
        # SYSTEM_ACCOUNT_ID: SYSTEM_ACCOUNT_JWT

    nats:
      tls:
        enabled: true
        secretName: *tlsCertSecretName
        merge:
          timeout: 50

    monitor:
      tls:
        enabled: true

    # Run resolver: https://docs.nats.io/running-a-nats-service/configuration/securing_nats/auth_intro/jwt/resolver
    resolver:
      enabled: true
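
For reference, the merged nats.conf that the chart renders from this block can be read back out of the running pod. A minimal sketch, with the pod name, container name, and config path taken from the logs further down:

# Inspect the rendered config inside the running pod
kubectl exec ORG-nats-0 -c nats -- cat /etc/nats-config/nats.conf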

The *tlsCertSecretName is a YAML alias to the name of a manually deployed cert-manager Certificate:

apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: ORG-bridge-ingress-cert
spec:
  secretName: {{ .Values.nats.tlsCertSecretName }}
  duration: 2160h # 90 days
  renewBefore: 240h # 10 days
  usages:
  - digital signature
  - key encipherment
  # - server auth
  # - client auth # included because routes mutually verify each other
  issuerRef:
    name: {{ .Values.certManager.issuerName }}
    # name: staging-issuer
    kind: ClusterIssuer
  commonName: "{{ .Values.nats.bridgeAddress }}"
  dnsNames:
  - "{{ .Values.nats.bridgeAddress }}"

Is this defect reproducible?

Reliably occurs with the above config on our AKS cluster. Haven't tested on other K8s varieties.

Given the capability you are leveraging, describe your expectation?

I am attempting to run a simple single node nats instance without clustering (at this stage). The goal is to deploy it with a pre-configured operator, account, and user, and TLS required for connections coming in from outside the K8s cluster.
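
As a rough sketch of the kind of external connection this is meant to support, something like the following with the nats CLI (the address is the example placeholder from the values above, and the creds file path is just an assumption):

# Round-trip check over TLS from outside the cluster
nats --server tls://ORG-bridge.company.com:4222 --creds ./bridge-user.creds rtt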

Further down the line I will need to configure clustering, but TLS and mem-resolver based auth are the requirements for this stage of work.

I'd like to get the pod running healthily without throwing TLS errors.

I was on the fence about whether to label this as a defect; I'm not sure if this is just a misunderstanding on my part of how TLS is supposed to be configured for the nats Helm chart. I'm happy to change the category if it looks like the error is on my end, but I could really use some guidance on how to get this sorted!

Given the expectation, what is the defect you are observing?

With the above config, I get the following in the nats pod logs. It looks as though the nats pod is attempting to dial pods in the kube-system namespace and failing (assuming the -> in the log messages indicates direction).

k logs ORG-nats-0 -f
Defaulted container "nats" out of: nats, reloader
[6] 2024/07/26 18:52:25.460906 [INF] Starting nats-server
[6] 2024/07/26 18:52:25.460957 [INF]   Version:  2.10.17
[6] 2024/07/26 18:52:25.460959 [INF]   Git:      [b91de03]
[6] 2024/07/26 18:52:25.460961 [INF]   Name:     ORG-nats-0
[6] 2024/07/26 18:52:25.460963 [INF]   ID:       NCN2JEE4LEXOZNC5OSYDVX6EW23EPPLLMNWPXOMINE3A564MUC77764W
[6] 2024/07/26 18:52:25.460968 [INF] Using configuration file: /etc/nats-config/nats.conf
[6] 2024/07/26 18:52:25.460970 [INF] Trusted Operators
[6] 2024/07/26 18:52:25.460971 [INF]   System  : ""
[6] 2024/07/26 18:52:25.460973 [INF]   Operator: "ORG"
[6] 2024/07/26 18:52:25.460975 [INF]   Issued  : 2024-07-15 16:39:52 +0000 UTC
[6] 2024/07/26 18:52:25.460989 [INF]   Expires : Never
[6] 2024/07/26 18:52:25.462384 [INF] Starting http monitor on 0.0.0.0:8222
[6] 2024/07/26 18:52:25.462518 [INF] Listening for client connections on 0.0.0.0:4222
[6] 2024/07/26 18:52:25.462528 [INF] TLS required for client connections
[6] 2024/07/26 18:52:25.462678 [INF] Server is ready
[6] 2024/07/26 18:52:49.307129 [ERR] 10.224.0.16:64083 - cid:5 - TLS handshake error: read tcp 10.244.11.6:4222->10.224.0.16:64083: read: connection reset by peer
[6] 2024/07/26 18:52:49.601709 [ERR] 10.244.11.1:22990 - cid:6 - TLS handshake error: read tcp 10.244.11.6:4222->10.244.11.1:22990: read: connection reset by peer
[6] 2024/07/26 18:52:50.421799 [ERR] 10.224.0.12:15840 - cid:7 - TLS handshake error: read tcp 10.244.11.6:4222->10.224.0.12:15840: read: connection reset by peer
[6] 2024/07/26 18:52:50.656785 [ERR] 10.224.0.4:59517 - cid:8 - TLS handshake error: read tcp 10.244.11.6:4222->10.224.0.4:59517: read: connection reset by peer
[6] 2024/07/26 18:52:50.669588 [ERR] 10.224.0.7:35999 - cid:9 - TLS handshake error: read tcp 10.244.11.6:4222->10.224.0.7:35999: read: connection reset by peer
[6] 2024/07/26 18:52:50.674150 [ERR] 10.224.0.14:46192 - cid:10 - TLS handshake error: read tcp 10.244.11.6:4222->10.224.0.14:46192: read: connection reset by peer
[6] 2024/07/26 18:52:50.738035 [ERR] 10.224.0.17:39583 - cid:11 - TLS handshake error: read tcp 10.244.11.6:4222->10.224.0.17:39583: read: connection reset by peer
[6] 2024/07/26 18:52:51.211579 [ERR] 10.224.0.15:32887 - cid:12 - TLS handshake error: read tcp 10.244.11.6:4222->10.224.0.15:32887: read: connection reset by peer
[6] 2024/07/26 18:52:51.408095 [ERR] 10.224.0.13:45266 - cid:13 - TLS handshake error: read tcp 10.244.11.6:4222->10.224.0.13:45266: read: connection reset by peer
[6] 2024/07/26 18:52:55.309357 [ERR] 10.224.0.16:6606 - cid:14 - TLS handshake error: read tcp 10.244.11.6:4222->10.224.0.16:6606: read: connection reset by peer
.....

10.244.14.232 is the pod IP of the nats instance:

ORG-nats-0                      2/2     Running   0          15m     10.244.14.232   aks-default-36350781-vmss000068   <none>           <none>

All of the IPs it appears to be dialing look like kube-system pods, some of which share the same internal cluster IP:

cloud-node-manager-nsbpx              1/1     Running   0          26d     10.224.0.12     aks-amd64-25378925-vmss000004     <none>           <none>
cloud-node-manager-qbhbs              1/1     Running   0          130m    10.224.0.10     aks-default-36350781-vmss0000a8   <none>           <none>
coredns-6745896b65-fv8df              1/1     Running   0          7d22h   10.244.14.170   aks-default-36350781-vmss000068   <none>           <none>
coredns-6745896b65-mtnmx              1/1     Running   0          17d     10.244.22.17    aks-amd64-25378925-vmss000000     <none>           <none>
coredns-6745896b65-tn749              1/1     Running   0          17d     10.244.0.13     aks-amd64-25378925-vmss000010     <none>           <none>
....
kube-proxy-2scnf                      1/1     Running   0          23d     10.224.0.14     aks-amd64-25378925-vmss000010     <none>           <none>
kube-proxy-b9m4d                      1/1     Running   0          26d     10.224.0.7      aks-amd64-25378925-vmss000000     <none>           <none>
kube-proxy-fmncs                      1/1     Running   0          26d     10.224.0.12     aks-amd64-25378925-vmss000004     <none>           <none>
kube-proxy-gk9dj                      1/1     Running   0          10d     10.224.0.16     aks-amd64-25378925-vmss000011     <none>           <none>
kube-proxy-h9vcs                      1/1     Running   0          27h     10.224.0.17     aks-default-36350781-vmss0000a4   <none>           <none>
....
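
For what it's worth, the 10.224.0.x sources can be cross-checked against the node list, since hostNetwork pods like kube-proxy and cloud-node-manager report the node's IP rather than a pod IP:

# The INTERNAL-IP column should line up with the 10.224.0.x addresses in the errors
kubectl get nodes -o wide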

This is where my understanding starts to break down; I'm not sure why nats would be dialing pods in the kube-system namespace. I would really appreciate any guidance you can provide! It almost feels like setting nats.tls.enabled broke something about nats talking to other in-cluster pods.

I looked at examples like https://gist.github.com/wallyqs/d9c9131a5bd5e247b2e4a6d4aac898af, but they seemed to be aimed at nats clustering. As far as I understand, I shouldn't need a self-signed routes config for a single-pod deploy?

Arkessler added the defect label on Jul 26, 2024
@Arkessler
Author

For anyone else who encounters this issue in Azure: the problem was due to Azure Load Balancers auto-generating health check probes against every port that has a rule. When the Azure load balancer probes the client port 4222 or the cluster port 6222, the probe is not a properly formatted TLS request (and doesn't present a valid cert, if I understand correctly), so the server logs a handshake error.

To disable the health checks on the client and clustering ports, add the following under the nats chart's service block in your Helm values (based on https://cloud-provider-azure.sigs.k8s.io/topics/loadbalancer/#loadbalancer-annotations):

    merge:
      metadata:
        annotations:
          service.beta.kubernetes.io/port_6222_no_probe_rule: "true"
          service.beta.kubernetes.io/port_4222_no_probe_rule: "true"
      spec:
        type: LoadBalancer
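
For context against the values layout in the issue description above, the snippet merges into the existing service block like this (only the two annotations are new):

nats:
  service:
    merge:
      metadata:
        annotations:
          service.beta.kubernetes.io/port_6222_no_probe_rule: "true"
          service.beta.kubernetes.io/port_4222_no_probe_rule: "true"
      spec:
        type: LoadBalancer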
