redpanda: make lifecycle hooks debuggable
Prior to this commit, debugging issues with our lifecycle hooks was next to impossible. This is primarily because Kubernetes provides little to no output about hooks except when they fail. Our hooks are wrapped with `; true` to ensure they never fail, which makes the problem worse.

This commit adds a more complex wrapper around the PostStart and PreStop hooks that sends all hook output to the stdout of the redpanda process, so it appears in `kubectl logs` with a timestamp and a prefix indicating which hook produced it.

Additionally, this commit removes a seemingly benign but bugged step of the PostStart hook that claimed to create the bootstrap user. This logic is handled by either the bootstrap environment variable or the config-watcher container.

Example output from `kubectl logs -f` on a terminating node:

```
INFO 2024-10-10 18:23:02,637 [shard 0:main] cluster - members_table.cc:258 - marking node 2 in maintenance state
INFO 2024-10-10 18:23:02,637 [shard 0:main] cluster - drain_manager.cc:54 - Node draining is starting
INFO 2024-10-10 18:23:02,637 [shard 0:main] cluster - drain_manager.cc:150 - Node draining has started
INFO 2024-10-10 18:23:02,637 [shard 0:main] cluster - drain_manager.cc:183 - Node draining has completed on shard 0
lifecycle-hook Thu Oct 10 18:23:02 UTC 2024 pre-stop: + touch /tmp/preStopHookStarted
lifecycle-hook Thu Oct 10 18:23:02 UTC 2024 pre-stop: + source /var/lifecycle/common.sh
lifecycle-hook Thu Oct 10 18:23:02 UTC 2024 pre-stop: ++ CURL_URL=https://redpanda-2.redpanda.default.svc.cluster.local:9644
lifecycle-hook Thu Oct 10 18:23:02 UTC 2024 pre-stop: ++ CURL_NODE_ID_CMD='curl --silent --fail --cacert /etc/tls/certs/default/ca.crt https://redpanda-2.redpanda.default.svc.cluster.local:9644/v1/node_config'
lifecycle-hook Thu Oct 10 18:23:02 UTC 2024 pre-stop: ++ CURL_MAINTENANCE_DELETE_CMD_PREFIX='curl -X DELETE --silent -o /dev/null -w "%{http_code}"'
lifecycle-hook Thu Oct 10 18:23:02 UTC 2024 pre-stop: ++ CURL_MAINTENANCE_PUT_CMD_PREFIX='curl -X PUT --silent -o /dev/null -w "%{http_code}"'
lifecycle-hook Thu Oct 10 18:23:02 UTC 2024 pre-stop: ++ CURL_MAINTENANCE_GET_CMD='curl -X GET --silent --cacert /etc/tls/certs/default/ca.crt https://redpanda-2.redpanda.default.svc.cluster.local:9644/v1/maintenance'
lifecycle-hook Thu Oct 10 18:23:02 UTC 2024 pre-stop: + set -x
lifecycle-hook Thu Oct 10 18:23:02 UTC 2024 pre-stop: + preStopHook
lifecycle-hook Thu Oct 10 18:23:02 UTC 2024 pre-stop: ++ curl --silent --fail --cacert /etc/tls/certs/default/ca.crt https://redpanda-2.redpanda.default.svc.cluster.local:9644/v1/node_config
lifecycle-hook Thu Oct 10 18:23:02 UTC 2024 pre-stop: ++ grep -o '\"node_id\":[^,}]*'
lifecycle-hook Thu Oct 10 18:23:02 UTC 2024 pre-stop: ++ grep -o '[^: ]*$'
lifecycle-hook Thu Oct 10 18:23:02 UTC 2024 pre-stop: + NODE_ID=2
lifecycle-hook Thu Oct 10 18:23:02 UTC 2024 pre-stop: + echo 'Setting maintenance mode on node 2'
lifecycle-hook Thu Oct 10 18:23:02 UTC 2024 pre-stop: Setting maintenance mode on node 2
lifecycle-hook Thu Oct 10 18:23:02 UTC 2024 pre-stop: + CURL_MAINTENANCE_PUT_CMD='curl -X PUT --silent -o /dev/null -w "%{http_code}" --cacert /etc/tls/certs/default/ca.crt https://redpanda-2.redpanda.default.svc.cluster.local:9644/v1/brokers/2/maintenance'
lifecycle-hook Thu Oct 10 18:23:02 UTC 2024 pre-stop: + '[' '' = '"200"' ']'
lifecycle-hook Thu Oct 10 18:23:02 UTC 2024 pre-stop: ++ curl -X PUT --silent -o /dev/null -w '"%{http_code}"' --cacert /etc/tls/certs/default/ca.crt https://redpanda-2.redpanda.default.svc.cluster.local:9644/v1/brokers/2/maintenance
lifecycle-hook Thu Oct 10 18:23:02 UTC 2024 pre-stop: + status='"200"'
lifecycle-hook Thu Oct 10 18:23:02 UTC 2024 pre-stop: + sleep 0.5
lifecycle-hook Thu Oct 10 18:23:02 UTC 2024 pre-stop: + '[' '"200"' = '"200"' ']'
lifecycle-hook Thu Oct 10 18:23:02 UTC 2024 pre-stop: + '[' '' = true ']'
lifecycle-hook Thu Oct 10 18:23:02 UTC 2024 pre-stop: + '[' '' = false ']'
lifecycle-hook Thu Oct 10 18:23:02 UTC 2024 pre-stop: ++ curl -X GET --silent --cacert /etc/tls/certs/default/ca.crt https://redpanda-2.redpanda.default.svc.cluster.local:9644/v1/maintenance
lifecycle-hook Thu Oct 10 18:23:02 UTC 2024 pre-stop: + res='{"draining": true, "finished": true, "errors": false, "partitions": 2, "eligible": 0}'
lifecycle-hook Thu Oct 10 18:23:02 UTC 2024 pre-stop: ++ echo '{"draining":' true, '"finished":' true, '"errors":' false, '"partitions":' 2, '"eligible":' '0}'
lifecycle-hook Thu Oct 10 18:23:02 UTC 2024 pre-stop: ++ grep -o '\"finished\":[^,}]*'
lifecycle-hook Thu Oct 10 18:23:02 UTC 2024 pre-stop: ++ grep -o '[^: ]*$'
lifecycle-hook Thu Oct 10 18:23:02 UTC 2024 pre-stop: + finished=true
lifecycle-hook Thu Oct 10 18:23:02 UTC 2024 pre-stop: ++ echo '{"draining":' true, '"finished":' true, '"errors":' false, '"partitions":' 2, '"eligible":' '0}'
lifecycle-hook Thu Oct 10 18:23:02 UTC 2024 pre-stop: ++ grep -o '\"draining\":[^,}]*'
lifecycle-hook Thu Oct 10 18:23:02 UTC 2024 pre-stop: ++ grep -o '[^: ]*$'
lifecycle-hook Thu Oct 10 18:23:02 UTC 2024 pre-stop: + draining=true
lifecycle-hook Thu Oct 10 18:23:02 UTC 2024 pre-stop: + sleep 0.5
lifecycle-hook Thu Oct 10 18:23:02 UTC 2024 pre-stop: + '[' true = true ']'
lifecycle-hook Thu Oct 10 18:23:02 UTC 2024 pre-stop: + touch /tmp/preStopHookFinished
lifecycle-hook Thu Oct 10 18:23:02 UTC 2024 pre-stop: + true
INFO 2024-10-10 18:23:03,400 [shard 0:main] main - application.cc:466 - Stopping...
```
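
For readers who want to see the general shape of such a wrapper, here is a minimal sketch in bash. It is not the wrapper added by this commit: the function names `prefix_lines` and `run_hook` are hypothetical, and the idea of writing to `/proc/1/fd/1` assumes that redpanda runs as PID 1 in the container, which the commit message does not state. The line prefix follows the format seen in the example output (`lifecycle-hook <date> <hook>: `).

```shell
#!/usr/bin/env bash
# Hypothetical sketch only; not the script shipped by this commit.

# prefix_lines HOOK_NAME
# Reads stdin line by line and prepends "lifecycle-hook <date> <hook>: ".
prefix_lines() {
  local hook_name="$1" line
  while IFS= read -r line || [ -n "$line" ]; do
    printf 'lifecycle-hook %s %s: %s\n' "$(date)" "$hook_name" "$line"
  done
}

# run_hook HOOK_NAME TARGET COMMAND [ARGS...]
# Runs COMMAND, merges its stderr into stdout, prefixes every line, and
# appends the result to TARGET. Always returns 0, mirroring the existing
# "; true" behaviour so a failing hook never fails the container.
run_hook() {
  local hook_name="$1" target="$2"
  shift 2
  "$@" 2>&1 | prefix_lines "$hook_name" >> "$target"
  return 0
}

# Possible use inside a container (assumptions: PID 1 is the redpanda
# process, and the hook script lives at the hypothetical path below):
#   run_hook pre-stop /proc/1/fd/1 bash -x /var/lifecycle/preStop.sh
```

Running the hook under `bash -x`, as in the commented usage line, would be one way to get `+`-prefixed trace lines like those in the example output. Passing the target as a parameter keeps the sketch testable against an ordinary file.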