What happened?
Running kubectl rollout restart sts redpanda -n redpanda after deploying with ephemeral storage results in an unhealthy cluster: the destroyed broker remains in the cluster and the replacement broker is assigned a new node ID, increasing the broker count by one.
What did you expect to happen?
Expected the cluster to restart and result in a healthy cluster.
How can we reproduce it (as minimally and precisely as possible)? Please include values file.
Create a kind cluster.
Create a Redpanda config with ephemeral storage (values sketch below).
Deploy Redpanda 24.1.8 via chart version 5.8.12.
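The exact commands and values file were not captured above, so here is a minimal sketch of those three steps. It assumes the official chart repository and that disabling the persistent volume is the only override that matters; the file name values.yaml is illustrative.
values.yaml:
storage:
  persistentVolume:
    enabled: false
kind create cluster
helm repo add redpanda https://charts.redpanda.com
helm install redpanda redpanda/redpanda -n redpanda --create-namespace --version 5.8.12 -f values.yaml
A health check such as the following (assuming the chart's main container is named redpanda) confirms the brokers are up:
kubectl exec redpanda-0 -n redpanda -c redpanda -- rpk cluster health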
Once the cluster is healthy, do a rolling restart:
kubectl rollout restart sts redpanda -n redpanda
Continuously run the following command until redpanda-2 is available:
kubectl logs pod/redpanda-2 -n redpanda -f
Eventually you will see the following printed constantly:
WARN 2024-07-19 18:46:48,917 [shard 0:raft] raft - [group_id:0, {redpanda/controller/0}] consensus.cc:3922 - received full heartbeat request addressed to node with different revision: {id: 2, revision: 0}, current node: {id: 3, revision: 0}, source: {id: 1, revision: 0}
I've run through this multiple times. Sometimes the rolling restart doesn't continue past redpanda-2; other times it continues as expected. Most times the cluster ends up in the following state, where redpanda-2 is assigned a new node ID and redpanda-1 never returns to the cluster.
Anything else we need to know?
We have this doc for this configuration, but it does not mention any issue with restarting. It seems that running in this state is never a good idea, since any time a broker leaves the cluster, the cluster becomes unhealthy; the brokers should be decommissioned when using ephemeral storage. We also have this doc explaining how to perform a rolling restart, but it does not mention any issues when using ephemeral storage.
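For context, a sketch of what manually decommissioning a broker before its pod goes away could look like, assuming the chart's main container is named redpanda; the node ID 1 is illustrative and would come from the brokers list:
kubectl exec redpanda-0 -n redpanda -c redpanda -- rpk redpanda admin brokers list
kubectl exec redpanda-0 -n redpanda -c redpanda -- rpk redpanda admin brokers decommission 1
kubectl exec redpanda-0 -n redpanda -c redpanda -- rpk cluster health
A plain kubectl rollout restart performs no such step before deleting the pod, which is why the old broker is left behind.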
It would be great if we could disable any changes when users run kubectl rollout restart sts redpanda -n redpanda while they also have storage.persistentVolume.enabled: false.
Which are the affected charts?
Redpanda
Chart Version(s)
This happens with all chart versions I've tested, from 5.7.24 to 5.8.12.
Cloud provider
none
JIRA Link: K8S-299
The tl;dr is that we (I) don't believe there are use cases for ephemeral storage outside of simple testing / verification of chart / Redpanda behaviors. If anyone has other use cases, please chime in!
Until then, we'll update the docs and add some red tape to both NOTES.txt and the values.yaml file indicating that the errors seen here are expected behavior.
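As a sketch of what that red tape in NOTES.txt could look like (the wording is an assumption; only the storage.persistentVolume.enabled value comes from the chart):
{{- if not .Values.storage.persistentVolume.enabled }}
WARNING: storage.persistentVolume.enabled is false, so broker data is ephemeral.
Deleting a pod (including via kubectl rollout restart) brings the broker back
with a new node ID and leaves the old broker in the cluster, so the cluster
ends up unhealthy. This configuration is only intended for throwaway testing.
{{- end }}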