TLS Handshake error between Nats pod and kube-system pods #921

Open
Arkessler opened this issue Jul 26, 2024 · 1 comment
Labels
defect Suspected defect such as a bug or regression

Comments

@Arkessler

What version were you using?

Helm chart version: 1.2.1
nats server image: nats:2.10.17-alpine

What environment was the server running in?

Azure AKS k8s version 1.28.9

The nats chart is included as a sub-chart of our main release, so the following config gets merged with the nats chart's default values.yaml.

Nats section of the values.yaml

nats:
  # ========================== My Org's custom nats config =========================
  # Whether or not to install the nats subchart to the "nats" namespace
  enabled: false

  # FQDN for bridge agents to connect to nats
  # Example: ORG-bridge.company.com
  bridgeAddress: 
  
  # Name of the secret containing the TLS certificate. Defines a helm anchor so
  # that this value can be passed to the nats chart below.
  tlsCertSecretName: &tlsCertSecretName bridge-ingress-cert

  # =========================== Nats Helm Chart Config ============================
  # Expose service as a public IP
  service:
    merge:
      spec:
        type: LoadBalancer

  # Config for nats server
  config:
    # Flexible config block that gets merged into nats.conf
    merge:
      # Resolver.conf config for memory resolver. Set with:
      # --set nats.config.merge.operator=$NATS_OPERATOR_JWT \
      # --set nats.config.merge.system_account=$NATS_SYS_ACCOUNT_ID \
      # --set nats.config.merge.resolver_preload.$NATS_SYS_ACCOUNT_ID=$NATS_SYS_ACCOUNT_JWT
      operator: OPERATOR_JWT
      system_account: SYSTEM_ACCOUNT_ID
      resolver: MEM
      resolver_preload:
        # SYSTEM_ACCOUNT_ID: SYSTEM_ACCOUNT_JWT

    nats:
      tls:
        enabled: true
        secretName: *tlsCertSecretName
        merge:
          timeout: 50

    monitor:
      tls:
        enabled: true

    # Run resolver: https://docs.nats.io/running-a-nats-service/configuration/securing_nats/auth_intro/jwt/resolver
    resolver:
      enabled: true
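
For reference, the merged nats.conf that the chart renders from this block can be read back out of the running pod. A minimal sketch, with the pod name, container name, and config path taken from the logs further down:

# Inspect the rendered config inside the running pod
kubectl exec ORG-nats-0 -c nats -- cat /etc/nats-config/nats.conf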

The *tlsCertSecretName is a YAML alias to the name of a manually deployed cert-manager Certificate:

apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: ORG-bridge-ingress-cert
spec:
  secretName: {{ .Values.nats.tlsCertSecretName }}
  duration: 2160h # 90 days
  renewBefore: 240h # 10 days
  usages:
  - digital signature
  - key encipherment
  # - server auth
  # - client auth # included because routes mutually verify each other
  issuerRef:
    name: {{ .Values.certManager.issuerName }}
    # name: staging-issuer
    kind: ClusterIssuer
  commonName: "{{ .Values.nats.bridgeAddress }}"
  dnsNames:
  - "{{ .Values.nats.bridgeAddress }}"

Is this defect reproducible?

Reliably occurs with the above config on our AKS cluster. Haven't tested on other K8s varieties.

Given the capability you are leveraging, describe your expectation?

I am attempting to run a simple single node nats instance without clustering (at this stage). The goal is to deploy it with a pre-configured operator, account, and user, and TLS required for connections coming in from outside the K8s cluster.
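
As a rough sketch of the kind of external connection this is meant to support, something like the following with the nats CLI (the address is the example placeholder from the values above, and the creds file path is just an assumption):

# Round-trip check over TLS from outside the cluster
nats --server tls://ORG-bridge.company.com:4222 --creds ./bridge-user.creds rtt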

Further down the line I will need to configure clustering, but TLS and mem-resolver based auth are the requirements for this stage of work.

I'd like to get the pod running healthily without throwing TLS errors.

I was on the fence about whether to label this as a defect; I'm not sure if this is just a misunderstanding on my part of how TLS is supposed to be configured for the nats Helm chart. I'm happy to change the category if it looks like the error is on my end, but I could really use some guidance on how to get this sorted!

Given the expectation, what is the defect you are observing?

With the above config, I get the following in the nats pod logs. It looks as though the nats pod is attempting to dial pods in the kube-system namespace and failing (assuming the -> in the log messages indicates direction).

k logs ORG-nats-0 -f
Defaulted container "nats" out of: nats, reloader
[6] 2024/07/26 18:52:25.460906 [INF] Starting nats-server
[6] 2024/07/26 18:52:25.460957 [INF]   Version:  2.10.17
[6] 2024/07/26 18:52:25.460959 [INF]   Git:      [b91de03]
[6] 2024/07/26 18:52:25.460961 [INF]   Name:     ORG-nats-0
[6] 2024/07/26 18:52:25.460963 [INF]   ID:       NCN2JEE4LEXOZNC5OSYDVX6EW23EPPLLMNWPXOMINE3A564MUC77764W
[6] 2024/07/26 18:52:25.460968 [INF] Using configuration file: /etc/nats-config/nats.conf
[6] 2024/07/26 18:52:25.460970 [INF] Trusted Operators
[6] 2024/07/26 18:52:25.460971 [INF]   System  : ""
[6] 2024/07/26 18:52:25.460973 [INF]   Operator: "ORG"
[6] 2024/07/26 18:52:25.460975 [INF]   Issued  : 2024-07-15 16:39:52 +0000 UTC
[6] 2024/07/26 18:52:25.460989 [INF]   Expires : Never
[6] 2024/07/26 18:52:25.462384 [INF] Starting http monitor on 0.0.0.0:8222
[6] 2024/07/26 18:52:25.462518 [INF] Listening for client connections on 0.0.0.0:4222
[6] 2024/07/26 18:52:25.462528 [INF] TLS required for client connections
[6] 2024/07/26 18:52:25.462678 [INF] Server is ready
[6] 2024/07/26 18:52:49.307129 [ERR] 10.224.0.16:64083 - cid:5 - TLS handshake error: read tcp 10.244.11.6:4222->10.224.0.16:64083: read: connection reset by peer
[6] 2024/07/26 18:52:49.601709 [ERR] 10.244.11.1:22990 - cid:6 - TLS handshake error: read tcp 10.244.11.6:4222->10.244.11.1:22990: read: connection reset by peer
[6] 2024/07/26 18:52:50.421799 [ERR] 10.224.0.12:15840 - cid:7 - TLS handshake error: read tcp 10.244.11.6:4222->10.224.0.12:15840: read: connection reset by peer
[6] 2024/07/26 18:52:50.656785 [ERR] 10.224.0.4:59517 - cid:8 - TLS handshake error: read tcp 10.244.11.6:4222->10.224.0.4:59517: read: connection reset by peer
[6] 2024/07/26 18:52:50.669588 [ERR] 10.224.0.7:35999 - cid:9 - TLS handshake error: read tcp 10.244.11.6:4222->10.224.0.7:35999: read: connection reset by peer
[6] 2024/07/26 18:52:50.674150 [ERR] 10.224.0.14:46192 - cid:10 - TLS handshake error: read tcp 10.244.11.6:4222->10.224.0.14:46192: read: connection reset by peer
[6] 2024/07/26 18:52:50.738035 [ERR] 10.224.0.17:39583 - cid:11 - TLS handshake error: read tcp 10.244.11.6:4222->10.224.0.17:39583: read: connection reset by peer
[6] 2024/07/26 18:52:51.211579 [ERR] 10.224.0.15:32887 - cid:12 - TLS handshake error: read tcp 10.244.11.6:4222->10.224.0.15:32887: read: connection reset by peer
[6] 2024/07/26 18:52:51.408095 [ERR] 10.224.0.13:45266 - cid:13 - TLS handshake error: read tcp 10.244.11.6:4222->10.224.0.13:45266: read: connection reset by peer
[6] 2024/07/26 18:52:55.309357 [ERR] 10.224.0.16:6606 - cid:14 - TLS handshake error: read tcp 10.244.11.6:4222->10.224.0.16:6606: read: connection reset by peer
.....

10.244.14.232 is the pod IP of the nats instance:

ORG-nats-0                      2/2     Running   0          15m     10.244.14.232   aks-default-36350781-vmss000068   <none>           <none>

All of the IPs it appears to be dialing look like kube-system pods, some of which share the same internal cluster IP:

cloud-node-manager-nsbpx              1/1     Running   0          26d     10.224.0.12     aks-amd64-25378925-vmss000004     <none>           <none>
cloud-node-manager-qbhbs              1/1     Running   0          130m    10.224.0.10     aks-default-36350781-vmss0000a8   <none>           <none>
coredns-6745896b65-fv8df              1/1     Running   0          7d22h   10.244.14.170   aks-default-36350781-vmss000068   <none>           <none>
coredns-6745896b65-mtnmx              1/1     Running   0          17d     10.244.22.17    aks-amd64-25378925-vmss000000     <none>           <none>
coredns-6745896b65-tn749              1/1     Running   0          17d     10.244.0.13     aks-amd64-25378925-vmss000010     <none>           <none>
....
kube-proxy-2scnf                      1/1     Running   0          23d     10.224.0.14     aks-amd64-25378925-vmss000010     <none>           <none>
kube-proxy-b9m4d                      1/1     Running   0          26d     10.224.0.7      aks-amd64-25378925-vmss000000     <none>           <none>
kube-proxy-fmncs                      1/1     Running   0          26d     10.224.0.12     aks-amd64-25378925-vmss000004     <none>           <none>
kube-proxy-gk9dj                      1/1     Running   0          10d     10.224.0.16     aks-amd64-25378925-vmss000011     <none>           <none>
kube-proxy-h9vcs                      1/1     Running   0          27h     10.224.0.17     aks-default-36350781-vmss0000a4   <none>           <none>
....
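
For what it's worth, the 10.224.0.x sources can be cross-checked against the node list, since hostNetwork pods like kube-proxy and cloud-node-manager report the node's IP rather than a pod IP:

# The INTERNAL-IP column should line up with the 10.224.0.x addresses in the errors
kubectl get nodes -o wide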

This is where my understanding starts to break down; I'm not sure why nats would be dialing pods in the kube-system namespace. I would really appreciate any guidance you can provide! It almost feels like setting nats.tls.enabled broke something about nats talking to other in-cluster pods.

I looked at examples like https://gist.github.com/wallyqs/d9c9131a5bd5e247b2e4a6d4aac898af, but they seemed to be aimed at nats clustering. As far as I understand, I shouldn't need a self-signed routes config for a single-pod deploy?

Arkessler added the defect label on Jul 26, 2024
@Arkessler
Author

For anyone else who encounters this issue in Azure: the problem was due to Azure Load Balancers auto-generating health check probes against every port that has a rule. When the Azure load balancer probes the client port 4222 or the cluster port 6222, the probe is not a properly formatted TLS request (and doesn't present a valid cert, if I understand correctly), so the server logs a handshake error.

To disable the health checks on the client and clustering ports, add the following under the nats chart's service block in your Helm values (based on https://cloud-provider-azure.sigs.k8s.io/topics/loadbalancer/#loadbalancer-annotations):

    merge:
      metadata:
        annotations:
          service.beta.kubernetes.io/port_6222_no_probe_rule: "true"
          service.beta.kubernetes.io/port_4222_no_probe_rule: "true"
      spec:
        type: LoadBalancer
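
For context against the values layout in the issue description above, the snippet merges into the existing service block like this (only the two annotations are new):

nats:
  service:
    merge:
      metadata:
        annotations:
          service.beta.kubernetes.io/port_6222_no_probe_rule: "true"
          service.beta.kubernetes.io/port_4222_no_probe_rule: "true"
      spec:
        type: LoadBalancer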
