Remote DB container issues #968

Open
ggefgreh opened this issue Jul 24, 2024 · 5 comments
ggefgreh commented Jul 24, 2024

Hi,
I think this is more of a question (a skill issue on my part) than a problem with this tool.

Config

It is possible that I misunderstood the documentation, but I think it is possible to configure Barman like this:

  • WAL streaming only (I am not sure whether the archiver is needed, since it has problems with that as well)
  • backups via rsync/SSH

Servers

  1. barman server: the backup host, installed locally on Ubuntu via apt (192.168.76.14)
  2. database: a TimescaleDB container on another server (192.168.76.10). The container's SSH port (22) is mapped to port 5422 on the host, and the DB is on the default 5432

Files

Global barman:

[barman]
barman_user = barman

configuration_files_directory = /etc/barman.d
barman_home = /var/lib/barman

log_file = /var/log/barman/barman.log
log_level = DEBUG
compression = gzip
immediate_checkpoint = true
basebackup_retry_times = 3
basebackup_retry_sleep = 30
check_timeout = 60
last_backup_maximum_age = 1 DAYS

Server config:

[server1]
ssh_command = ssh [email protected] -p 5422
conninfo = host=192.168.76.10 port=5432 user=dbadmin dbname=my-db
streaming_conninfo = host=192.168.76.10 port=5432 user=streaming_barman

backup_method = rsync
reuse_backup = link
backup_options = concurrent_backup

streaming_archiver = on
slot_name = barman
create_slot = auto

parallel_jobs = 2
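
A quick way to sanity-check both connection strings is to exercise them by hand from the Barman host. This is a sketch, assuming the hosts and users from the config above and that psql is installed on the Barman host:

```shell
# Test the ordinary connection (conninfo) used for check queries
psql "host=192.168.76.10 port=5432 user=dbadmin dbname=my-db" \
  -c "SELECT version();"

# Test the replication connection (streaming_conninfo) used for WAL streaming;
# replication=1 opens a walsender session, so IDENTIFY_SYSTEM is accepted
psql "host=192.168.76.10 port=5432 user=streaming_barman replication=1" \
  -c "IDENTIFY_SYSTEM;"
```

If either command hangs or errors, the corresponding barman feature will fail in the same way.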

Problem

The main problem is that I cannot perform a backup, but the underlying issue seems to be a misconfiguration or something else related to WAL. Also, when I execute the check command, it appears to crash the database, which goes into recovery mode (and then recovers on its own). The same happens when trying to perform a backup, but I don't see anything wrong in the Postgres container logs, only this line when trying a backup or check:
LOG: server process (PID 1858) exited with exit code 255

Here are the results of check:
$ barman check server1
Server server1:
WAL archive: FAILED (please make sure WAL shipping is setup)
PostgreSQL: OK
superuser or standard user with backup privileges: OK
PostgreSQL streaming: OK
wal_level: OK
replication slot: OK
directories: OK
retention policy settings: OK
backup maximum age: FAILED (interval provided: 1 day, latest backup age: No available backups)
backup minimum size: OK (0 B)
wal maximum age: OK (no last_wal_maximum_age provided)
wal size: OK (0 B)
compression settings: OK
failed backups: FAILED (there are 7 failed backups)
minimum redundancy requirements: OK (have 0 backups, expected at least 0)
ssh: OK (PostgreSQL server)
systemid coherence: OK
pg_receivexlog: OK
pg_receivexlog compatible: OK
receive-wal running: OK
archiver errors: OK

The backup attempts report either that the backup cannot be performed or that the database is in recovery.

P.S.

I have already spent too much time on this. Before this I tried pgBackRest, but I gave up on it. I would be grateful even for a pointer toward the next steps.
I am fairly new to Postgres and have never performed any backup with it.

  • Yes, I checked that I can SSH passwordlessly between them: I can
  • I can manually start WAL streaming
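For reference, manually starting WAL streaming can be done roughly like this (a sketch, assuming the server name server1 from the config):

```shell
# Run the WAL streamer for server1 in the foreground (pg_receivewal under the hood)
barman receive-wal server1

# Or let Barman's maintenance task spawn and supervise it in the background
barman cron
```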
@martinmarques (Contributor)

As you are trying to set up streaming of WALs, you should set archiver = off so Barman doesn't expect WALs to come in via archive_command.

Regarding the Postgres server crashing, you'll have to find more information inside the container, probably by increasing the log level or checking for core dumps. I would also check whether you can connect directly to the database and run queries on it. barman check just connects via psycopg2 (a normal libpq connection) and runs check queries; I can't think of a reason why it would crash.
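
With that adjustment, the relevant part of the [server1] file would look something like this (only archiver = off is new; the other lines are from the original post):

```ini
[server1]
streaming_archiver = on
archiver = off
slot_name = barman
create_slot = auto
```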

ggefgreh (Author) commented Jul 25, 2024

Okay, I recreated the slot in the meantime and all of a sudden the WAL archive row is gone from the check, but I added archiver = off as you advised, just to be sure. From time to time after backup attempts, when I get "database system is in recovery", it comes back as FAILED.
Mainly, I am now getting this from check:
$ barman check server1
Server server1:
PostgreSQL: OK
superuser or standard user with backup privileges: OK
PostgreSQL streaming: OK
wal_level: OK
replication slot: OK
directories: OK
retention policy settings: OK
backup maximum age: FAILED (interval provided: 1 day, latest backup age: No available backups)
backup minimum size: OK (0 B)
wal maximum age: OK (no last_wal_maximum_age provided)
wal size: OK (0 B)
compression settings: OK
failed backups: FAILED (there are 1 failed backups)
minimum redundancy requirements: FAILED (have 0 backups, expected at least 3)
ssh: OK (PostgreSQL server)
systemid coherence: OK
pg_receivexlog: OK
pg_receivexlog compatible: OK
receive-wal running: OK
archiver errors: OK

So it seems good from that perspective.

To address what you told me to check:

  • Yes, I can connect directly to the database and run queries without any problems.
  • I am on it, checking logs and learning about core dumps (so I will leave it for another reply).

Postgres Docker logs

As for interesting findings: WALs are not being saved properly on the backup server. There is a .partial file but the wals directory is empty. However, I found something in the logs:

LOG:  00000: server process (PID 2471) exited with exit code 255
LOG:  00000: terminating any other active server processes
LOCATION:  HandleChildCrash, postmaster.c:3504
LOG:  00000: all server processes terminated; reinitializing
LOCATION:  PostmasterStateMachine, postmaster.c:4018
LOG:  00000: database system was interrupted; last known up at <DATETIME>
LOCATION: StartupXLOG, xlog.c:4962
LOG:  00000: database system was not properly shut down; automatic recovery in progress
LOCATION:  InitWalRecovery, xlogrecovery.c:896
LOG:  00000: redo starts at 3/63329DC0
LOCATION:  PerformWalRecovery, xlogrecovery.c:1659
LOG:  00000: unexpected pageaddr 3/5E38C000 in log segment 000000010000000300000063, offset 3719168
LOCATION:  ReadRecord, xlogrecovery.c:3082
LOG:  00000: redo done at 3/6338BF78 system usage: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.01 s
LOCATION:  PerformWalRecovery, xlogrecovery.c:1796
LOG:  00000: checkpoint starting: end-of-recovery immediate wait
LOCATION:  LogCheckpointStart, xlog.c:6093

When trying to perform a backup, the result is:
ERROR: Impossible to start the backup. Check the log for more details, or run 'barman check qursorrtls'

I think it is due to there being no WALs in the wals dir, because I see this in the Barman logs:
barman.wal_archiver INFO: No xlog segments found from streaming for server1

@martinmarques (Contributor)

To get WALs into the wals directory, run the following SQL on the Postgres node as a superuser: SELECT pg_switch_wal();. Then, on the Barman server, run barman archive-wal or wait for the next barman cron run. That should produce a WAL switch, and the WAL will get archived into the wals directory.
That .partial file is from the streaming of WALs.
I don't think the crash has anything to do with there being no WALs available in the wals directory.
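
That sequence, as a sketch with the names from this thread (assuming dbadmin has superuser rights and barman_home is /var/lib/barman as configured above):

```shell
# On the Postgres node, as a superuser: force a WAL segment switch
psql -U dbadmin -d my-db -c "SELECT pg_switch_wal();"

# On the Barman host: archive the streamed segment now instead of waiting for cron
barman archive-wal server1

# The finished segment should then appear under barman_home:
ls /var/lib/barman/server1/wals/
```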

martinmarques self-assigned this Jul 26, 2024
@ggefgreh (Author)

Okay, I did what you said, and it got a backup started, but it was interrupted by a connection termination that could not be re-established. I can't provide logs and output because I already cleaned them up; it was a while ago and I hoped it wasn't a one-time event, but it was. I was looking into setting a longer keep-alive time, but I am not sure whether that belongs to Postgres or to the system SSH settings.

For now, I have enabled logging of connections and disconnections, and this is what happens when performing a backup:

LOCATION:  set_authn_id, auth.c:371
[11297] LOG:  00000: replication connection authorized: user=streaming_barman
[11297] LOCATION:  PerformAuthentication, postinit.c:296
[11297] LOG:  00000: received replication command: IDENTIFY_SYSTEM
[11297] LOCATION:  exec_replication_command, walsender.c:1771
[11297] STATEMENT:  IDENTIFY_SYSTEM
[11297] LOG:  00000: received replication command: IDENTIFY_SYSTEM
[11297] LOCATION:  exec_replication_command, walsender.c:1771
[11297] STATEMENT:  IDENTIFY_SYSTEM
[11297] LOG:  00000: received replication command: IDENTIFY_SYSTEM
[11297] LOCATION:  exec_replication_command, walsender.c:1771
[11297] STATEMENT:  IDENTIFY_SYSTEM
[11297] LOG:  00000: received replication command: IDENTIFY_SYSTEM
[11297] LOCATION:  exec_replication_command, walsender.c:1771
[11297] STATEMENT:  IDENTIFY_SYSTEM
[1] LOG:  00000: server process (PID 11309) exited with exit code 255
[1] LOCATION:  LogChildExit, postmaster.c:3737
[1] LOG:  00000: terminating any other active server processes
[1] LOCATION:  HandleChildCrash, postmaster.c:3504
[11297] LOG:  00000: disconnection: session time: 0:00:00.170 user=streaming_barman database= host=192.168.76.14 port=34568
[11297] LOCATION:  log_disconnections, postgres.c:5029
[11311] LOG:  00000: connection received: host=192.168.76.14 port=34584
[11311] LOCATION:  BackendInitialize, postmaster.c:4400
[11311] FATAL:  57P03: the database system is in recovery mode
[11311] LOCATION:  ProcessStartupPacket, postmaster.c:2410
[1] LOG:  00000: all server processes terminated; reinitializing
[1] LOCATION:  PostmasterStateMachine, postmaster.c:4018
[11312] LOG:  00000: database system was interrupted; last known up at 2024-07-26 17:59:57 UTC
[11312] LOCATION:  StartupXLOG, xlog.c:4962
[11315] LOG:  00000: connection received: host=192.168.76.14 port=34590
[11315] LOCATION:  BackendInitialize, postmaster.c:4400
[11315] FATAL:  57P03: the database system is in recovery mode
LOCATION:  ProcessStartupPacket, postmaster.c:2410
LOG:  00000: database system was not properly shut down; automatic recovery in progress
[11312] LOCATION:  InitWalRecovery, xlogrecovery.c:896
[11312] LOG:  00000: redo starts at 3/720000D8
[11312] LOCATION:  PerformWalRecovery, xlogrecovery.c:1659
[11312] LOG:  00000: invalid record length at 3/7200A300: wanted 24, got 0
[11312] LOCATION:  ReadRecord, xlogrecovery.c:3082
[11312] LOG:  00000: redo done at 3/7200A2C8 system usage: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s
[11312] LOCATION:  PerformWalRecovery, xlogrecovery.c:1796
[11313] LOG:  00000: checkpoint starting: end-of-recovery immediate wait
[11313] LOCATION:  LogCheckpointStart, xlog.c:6093
[11313] LOG:  00000: checkpoint complete: wrote 47 buffers (0.3%); 0 WAL file(s) added, 0 removed, 0 recycled; write=0.001 s, sync=0.001 s, total=0.002 s; sync files=0, longest=0.000 s, average=0.000 s; distance=40 kB, estimate=40 kB
[11313] LOCATION:  LogCheckpointEnd, xlog.c:6174
[1] LOG:  00000: database system is ready to accept connections

As far as I can read it (and I don't know what it should look like), it seems like a streaming problem (the timing matches the backup command execution). I also didn't observe any abnormality in CPU/disk usage while performing a backup, or anything else unusual; only all the DB-related processes dying during the crash.

Is it normal that it has so many IDENTIFY_SYSTEM executions?

And of course, it is not related to an empty wals directory, because now it is not empty. I will try to explore some more logs and maybe start again from the beginning and check the configuration for the 99th time.

@martinmarques (Contributor)

Is it normal that it has so many IDENTIFY_SYSTEM executions?

Those are streaming connections; pg_receivewal asks the server to identify itself before it starts streaming:

https://www.postgresql.org/docs/current/protocol-replication.html

I can't find anything useful in those logs. You'll probably have to increase the log level in Postgres (log_min_messages).

My suggestion is to bring this to the pgsql-general mailing list or the Postgres Slack/Discord channels.
