Point in time recovery on PG13+: Recovery target is unreachable due to log related to hot_standby mode off #1042

Open
desaintmartin opened this issue Oct 23, 2024 · 0 comments


desaintmartin commented Oct 23, 2024

Similar to zalando/postgres-operator#2168, point-in-time recovery with a target in the future fails on PG13+ due to the implementation of #580.

With the correct settings:

  • logging_collector: 'on'
  • log_destination: stderr,csvlog

The log is correctly read by maybe_pg_upgrade.py, BUT sometimes a line containing Hot standby mode is disabled., produced by a random connection attempt (from our pg exporter... and from spilo/patroni itself? see log below), is interleaved at exactly that moment and makes the check fail.
Hot standby mode being disabled is intentional: Patroni disables hot_standby during PITR to avoid potential errors: patroni/patroni#1891

The exact reason why it works on small databases but fails on our bigger databases with the same settings is not known yet; I suspect our biggest databases are simply slower, which makes it far more likely that such a line lands between the interesting log entries.

Relying on the last 5 lines of the log to decide whether recovery succeeded seems very fragile.

During recovery, we had to live-edit the spilo code to check the last 15 lines instead of 5 to make the second attempt succeed.

I think a short-term solution would be a more robust check than grepping the last 5 lines, and a long-term solution would be allowing a restore without a mandatory time target.
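To illustrate the failure mode (a hypothetical reduction, not spilo's actual code): when the check only greps a fixed 5-line tail, a handful of interleaved client-backend FATALs from reconnecting clients is enough to push the startup process's message out of the window.

```python
# Hypothetical reduction of the "grep the last 5 lines" check.
TARGET = "recovery ended before configured recovery target was reached"

startup_fatal = f"FATAL: {TARGET}"
noise = "FATAL: the database system is in recovery mode"  # e.g. exporter reconnects

# The interesting startup line, followed by 6 interleaved noise lines:
log = ["restored log file ...", startup_fatal] + [noise] * 6

def last_n_contains(lines, needle, n=5):
    """Return True if any of the last n log lines contains needle."""
    return any(needle in line for line in lines[-n:])

print(last_n_contains(log, TARGET, n=5))   # False: noise pushed it out
print(last_n_contains(log, TARGET, n=15))  # True: the 15-line workaround
```

This is exactly why bumping the window from 5 to 15 lines "fixed" our second attempt: it only widens the race, it does not remove it.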

Extract from the CSV logs:

2024-10-22 16:51:06.215 UTC,,,539,,6717d728.21b,1889,,2024-10-22 16:47:36 UTC,,0,LOG,00000,"restored log file ""0000000100002E880000003C"" from archive",,,,,,,,,"","startup",,0
2024-10-22 16:51:06.310 UTC,"postgres","postgres",87962,"[local]",6717d7fa.1579a,1,"",2024-10-22 16:51:06 UTC,,0,FATAL,57P03,"the database system is not accepting connections","Hot standby mode is disabled.",,,,,,,,"","client backend",,0
2024-10-22 16:51:06.479 UTC,,,539,,6717d728.21b,1890,,2024-10-22 16:47:36 UTC,,0,LOG,00000,"redo done at 2E88/3C000060 system usage: CPU: user: 91.75 s, system: 36.75 s, elapsed: 207.81 s",,,,,,,,,"","startup",,0
2024-10-22 16:51:06.479 UTC,,,539,,6717d728.21b,1891,,2024-10-22 16:47:36 UTC,,0,LOG,00000,"last completed transaction was at log time 2024-10-22 15:07:01.344774+00",,,,,,,,,"","startup",,0
2024-10-22 16:51:06.479 UTC,,,539,,6717d728.21b,1892,,2024-10-22 16:47:36 UTC,,0,FATAL,XX000,"recovery ended before configured recovery target was reached",,,,,,,,,"","startup",,0
2024-10-22 16:51:07.317 UTC,"postgres","postgres",87967,"[local]",6717d7fb.1579f,1,"",2024-10-22 16:51:07 UTC,,0,FATAL,57P03,"the database system is not accepting connections","Hot standby mode is disabled.",,,,,,,,"","client backend",,0
2024-10-22 16:51:07.932 UTC,,,536,,6717d727.218,5,,2024-10-22 16:47:35 UTC,,0,LOG,00000,"startup process (PID 539) exited with exit code 1",,,,,,,,,"","postmaster",,0
2024-10-22 16:51:07.932 UTC,,,536,,6717d727.218,6,,2024-10-22 16:47:35 UTC,,0,LOG,00000,"terminating any other active server processes",,,,,,,,,"","postmaster",,0
2024-10-22 16:51:08.027 UTC,"postgres_exporter","wiremind",87968,"127.0.0.1:41510",6717d7fc.157a0,1,"",2024-10-22 16:51:08 UTC,,0,FATAL,57P03,"the database system is in recovery mode",,,,,,,,,"","client backend",,0
2024-10-22 16:51:08.027 UTC,"postgres_exporter","wiremind",87969,"127.0.0.1:41520",6717d7fc.157a1,1,"",2024-10-22 16:51:08 UTC,,0,FATAL,57P03,"the database system is in recovery mode",,,,,,,,,"","client backend",,0
2024-10-22 16:51:08.323 UTC,"postgres","postgres",87971,"[local]",6717d7fc.157a3,1,"",2024-10-22 16:51:08 UTC,,0,FATAL,57P03,"the database system is in recovery mode",,,,,,,,,"","client backend",,0
2024-10-22 16:51:09.331 UTC,"postgres","postgres",87973,"[local]",6717d7fd.157a5,1,"",2024-10-22 16:51:09 UTC,,0,FATAL,57P03,"the database system is in recovery mode",,,,,,,,,"","client backend",,0
2024-10-22 16:51:09.654 UTC,,,536,,6717d727.218,7,,2024-10-22 16:47:35 UTC,,0,LOG,00000,"shutting down due to startup process failure",,,,,,,,,"","postmaster",,0
2024-10-22 16:51:14.214 UTC,,,536,,6717d727.218,8,,2024-10-22 16:47:35 UTC,,0,LOG,00000,"database system is shut down",,,,,,,,,"","postmaster",,0
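One way the check could become robust against this interleaving would be to filter csvlog rows by their backend_type column before looking for the recovery-target message, instead of tailing a fixed number of lines. A minimal sketch (my suggestion, not spilo's actual code; column positions assume the PG14 csvlog layout):

```python
import csv
import io

# Two rows copied from the extract above: one client-backend connection
# attempt, one row from the startup process.
LOG = '''2024-10-22 16:51:06.310 UTC,"postgres","postgres",87962,"[local]",6717d7fa.1579a,1,"",2024-10-22 16:51:06 UTC,,0,FATAL,57P03,"the database system is not accepting connections","Hot standby mode is disabled.",,,,,,,,"","client backend",,0
2024-10-22 16:51:06.479 UTC,,,539,,6717d728.21b,1892,,2024-10-22 16:47:36 UTC,,0,FATAL,XX000,"recovery ended before configured recovery target was reached",,,,,,,,,"","startup",,0'''

def startup_messages(csvlog_text):
    """Yield (severity, message) for rows written by the startup process.

    Assumes the PG14 csvlog layout: backend_type is the third field from
    the end (followed by leader_pid and query_id); error_severity and
    message are fields 11 and 13 (0-based).
    """
    for row in csv.reader(io.StringIO(csvlog_text)):
        if len(row) >= 14 and row[-3] == "startup":
            yield row[11], row[13]

TARGET = "recovery ended before configured recovery target was reached"
unreachable = any(sev == "FATAL" and TARGET in msg
                  for sev, msg in startup_messages(LOG))
print(unreachable)  # True: only the startup row is considered
```

With this kind of filter, the exporter and local connection attempts are ignored entirely, so no amount of client noise can mask or mimic the startup process's verdict.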

For reference, here is our postgres-operator postgresql resource:

apiVersion: acid.zalan.do/v1
kind: postgresql
metadata:
  name: my-postgres
  namespace: default
  labels:
    sidecar.istio.io/inject: 'true'
spec:
  additionalVolumes:
    - mountPath: /etc/config.yaml
      name: metrics
      subPath: config.yaml
      targetContainers:
        - metrics-exporter
      volumeSource:
        configMap:
          defaultMode: 420
          name: patroni-postgresql-metrics
  allowedSourceRanges: []
  connectionPooler:
    maxDBConnections: 800
    mode: transaction
    numberOfInstances: 6
    resources:
      limits:
        cpu: '2'
        memory: 200Mi
      requests:
        cpu: '0.5'
        memory: 100Mi
    schema: pooler
    user: pooler
  enableConnectionPooler: true
  enableMasterLoadBalancer: false
  enableMasterPoolerLoadBalancer: false
  enableReplicaConnectionPooler: false
  enableReplicaLoadBalancer: false
  enableReplicaPoolerLoadBalancer: false
  enableShmVolume: true
  nodeAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - preference:
          matchExpressions:
            - key: node-role.kubernetes.io/bare-metal
              operator: In
              values:
                - 'true'
        weight: 90
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: node-role.kubernetes.io/production
              operator: In
              values:
                - 'true'
  numberOfInstances: 1
  patroni:
    initdb:
      data-checksums: 'true'
      encoding: UTF8
      locale: en_US.UTF-8
    loop_wait: 10
    maximum_lag_on_failover: 1048576
    pg_hba:
      - local all all trust
      - local replication standby trust
      - host postgres postgres 0.0.0.0/0 md5
      - host postgres standby 0.0.0.0/0 md5
      - host wiremind all 0.0.0.0/0 md5
      - host replication standby 0.0.0.0/0 md5
    retry_timeout: 300
    slots:
      patroni:
        type: physical
    synchronous_mode: false
    synchronous_mode_strict: false
    ttl: 650
  podPriorityClassName: app-critical-prod
  postgresql:
    parameters:
      checkpoint_completion_target: '0.9'
      checkpoint_timeout: '3600'
      effective_cache_size: 187500MB
      effective_io_concurrency: '300'
      hot_standby: 'off'
      idle_in_transaction_session_timeout: 600s
      jit: 'off'
      log_checkpoints: 'on'
      log_connections: 'off'
      log_destination: stderr,csvlog
      log_disconnections: 'off'
      log_line_prefix: '%m [%p] '
      log_min_duration_statement: '-1'
      log_statement: ddl
      log_temp_files: '-1'
      logging_collector: 'on'
      maintenance_work_mem: 2048MB
      max_connections: '1500'
      max_parallel_maintenance_workers: '4'
      max_parallel_workers: '43'
      max_parallel_workers_per_gather: '21'
      max_replication_slots: '5'
      max_wal_senders: '7'
      max_wal_size: 30GB
      max_worker_processes: '43'
      min_wal_size: 2GB
      random_page_cost: '1.1'
      shared_buffers: 62500MB
      statement_timeout: 1800s
      tcp_keepalives_count: '2'
      tcp_keepalives_idle: '60'
      tcp_keepalives_interval: '5'
      timezone: GMT
      wal_buffers: 1875kB
      wal_compression: 'on'
      wal_level: replica
      work_mem: 15MB
    version: '14'
  preparedDatabases:
    wiremind:
      defaultUsers: true
      extensions:
        btree_gist: public
        pg_stat_statements: public
      schemas:
        public:
          defaultRoles: false
          defaultUsers: false
        raw_extract_data:
          defaultRoles: true
          defaultUsers: true
  resources:
    limits:
      cpu: '43'
      memory: 250000Mi
    requests:
      cpu: '43'
      memory: 250000Mi
  teamId: my-postgres
  tolerations:
    - effect: NoSchedule
      key: production
      operator: Equal
      value: 'true'
  volume:
    size: 4500Gi
    storageClass: csi-topolvm-standard-retain