Remote DB container issues #968

Open
ggefgreh opened this issue Jul 24, 2024 · 5 comments
ggefgreh commented Jul 24, 2024

Hi,
I think this is more of a question (a skill issue on my part) than a problem with this tool.

Config

It is possible that I misunderstood the documentation, but I think it is possible to configure Barman like this:

  • WAL streaming only (I am not sure whether the archiver is needed, since it has problems with that as well)
  • backups via rsync/SSH

Servers

  1. barman server: the backup host, installed locally on Ubuntu via apt (192.168.76.14)
  2. database: a TimescaleDB container on another server (192.168.76.10). The container's SSH port (22) is mapped to port 5422 on the host, and the DB is on the default 5432

Files

Global barman:

[barman]
barman_user = barman

configuration_files_directory = /etc/barman.d
barman_home = /var/lib/barman

log_file = /var/log/barman/barman.log
log_level = DEBUG
compression = gzip
immediate_checkpoint = true
basebackup_retry_times = 3
basebackup_retry_sleep = 30
check_timeout = 60
last_backup_maximum_age = 1 DAYS

Server config:

[server1]
ssh_command = ssh [email protected] -p 5422
conninfo = host=192.168.76.10 port=5432 user=dbadmin dbname=my-db
streaming_conninfo = host=192.168.76.10 port=5432 user=streaming_barman

backup_method = rsync
reuse_backup = link
backup_options = concurrent_backup

streaming_archiver = on
slot_name = barman
create_slot = auto

parallel_jobs = 2
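
A quick way to sanity-check both connection strings is to exercise them by hand from the Barman host. This is a sketch, assuming the hosts and users from the config above and that psql is installed on the Barman host:

```shell
# Test the ordinary connection (conninfo) used for check queries
psql "host=192.168.76.10 port=5432 user=dbadmin dbname=my-db" \
  -c "SELECT version();"

# Test the replication connection (streaming_conninfo) used for WAL streaming;
# replication=1 opens a walsender session, so IDENTIFY_SYSTEM is accepted
psql "host=192.168.76.10 port=5432 user=streaming_barman replication=1" \
  -c "IDENTIFY_SYSTEM;"
```

If either command hangs or errors, the corresponding barman feature will fail in the same way.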

Problem

The main problem is that I cannot perform a backup, but the underlying issue seems to be a misconfiguration or something else related to WAL. Also, when I execute the check command, it appears to crash the database, which goes into recovery mode (and then recovers on its own). The same happens when trying to perform a backup, but I don't see anything wrong in the Postgres container logs, only this line when trying a backup or check:
LOG: server process (PID 1858) exited with exit code 255

Here are the results of check:
$ barman check server1
Server server1:
WAL archive: FAILED (please make sure WAL shipping is setup)
PostgreSQL: OK
superuser or standard user with backup privileges: OK
PostgreSQL streaming: OK
wal_level: OK
replication slot: OK
directories: OK
retention policy settings: OK
backup maximum age: FAILED (interval provided: 1 day, latest backup age: No available backups)
backup minimum size: OK (0 B)
wal maximum age: OK (no last_wal_maximum_age provided)
wal size: OK (0 B)
compression settings: OK
failed backups: FAILED (there are 7 failed backups)
minimum redundancy requirements: OK (have 0 backups, expected at least 0)
ssh: OK (PostgreSQL server)
systemid coherence: OK
pg_receivexlog: OK
pg_receivexlog compatible: OK
receive-wal running: OK
archiver errors: OK

The backup attempts report either that the backup cannot be performed or that the database is in recovery.

P.S.

I have already spent too much time on this. Before this I tried pgBackRest, but I gave up on it. I would be grateful even for a pointer toward the next steps.
I am fairly new to Postgres and have never performed any backup with it.

  • Yes, I checked that I can SSH passwordlessly between them: I can
  • I can manually start WAL streaming
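For reference, manually starting WAL streaming can be done roughly like this (a sketch, assuming the server name server1 from the config):

```shell
# Run the WAL streamer for server1 in the foreground (pg_receivewal under the hood)
barman receive-wal server1

# Or let Barman's maintenance task spawn and supervise it in the background
barman cron
```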
@martinmarques (Contributor)

As you are trying to set up streaming of WALs, you should set archiver = off so Barman doesn't expect WALs to come in via archive_command.

Regarding the Postgres server crashing, you'll have to find more information inside the container, probably by increasing the log level or checking for core dumps. I would also check whether you can connect directly to the database and run queries on it. barman check just connects via psycopg2 (a normal libpq connection) and runs check queries; I can't think of a reason why it would crash.
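
With that adjustment, the relevant part of the [server1] file would look something like this (only archiver = off is new; the other lines are from the original post):

```ini
[server1]
streaming_archiver = on
archiver = off
slot_name = barman
create_slot = auto
```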

ggefgreh (Author) commented Jul 25, 2024

Okay, I recreated the slot in the meantime and all of a sudden the WAL archive row is gone from the check, but I added archiver = off as you advised, just to be sure. From time to time after backup attempts, when I get "database system is in recovery", it comes back as FAILED.
Mainly, I am now getting this from check:
$ barman check server1
Server server1:
PostgreSQL: OK
superuser or standard user with backup privileges: OK
PostgreSQL streaming: OK
wal_level: OK
replication slot: OK
directories: OK
retention policy settings: OK
backup maximum age: FAILED (interval provided: 1 day, latest backup age: No available backups)
backup minimum size: OK (0 B)
wal maximum age: OK (no last_wal_maximum_age provided)
wal size: OK (0 B)
compression settings: OK
failed backups: FAILED (there are 1 failed backups)
minimum redundancy requirements: FAILED (have 0 backups, expected at least 3)
ssh: OK (PostgreSQL server)
systemid coherence: OK
pg_receivexlog: OK
pg_receivexlog compatible: OK
receive-wal running: OK
archiver errors: OK

So it seems good from that perspective.

To address what you told me to check:

  • Yes, I can connect directly to the database and run queries without any problems.
  • I am on it, checking logs and learning about core dumps (so I will leave it for another reply).

Postgres Docker logs

As for interesting findings: WALs are not being saved properly on the backup server. There is a .partial file but the wals directory is empty. However, I found something in the logs:

LOG:  00000: server process (PID 2471) exited with exit code 255
LOG:  00000: terminating any other active server processes
LOCATION:  HandleChildCrash, postmaster.c:3504
LOG:  00000: all server processes terminated; reinitializing
LOCATION:  PostmasterStateMachine, postmaster.c:4018
LOG:  00000: database system was interrupted; last known up at <DATETIME>
LOCATION: StartupXLOG, xlog.c:4962
LOG:  00000: database system was not properly shut down; automatic recovery in progress
LOCATION:  InitWalRecovery, xlogrecovery.c:896
LOG:  00000: redo starts at 3/63329DC0
LOCATION:  PerformWalRecovery, xlogrecovery.c:1659
LOG:  00000: unexpected pageaddr 3/5E38C000 in log segment 000000010000000300000063, offset 3719168
LOCATION:  ReadRecord, xlogrecovery.c:3082
LOG:  00000: redo done at 3/6338BF78 system usage: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.01 s
LOCATION:  PerformWalRecovery, xlogrecovery.c:1796
LOG:  00000: checkpoint starting: end-of-recovery immediate wait
LOCATION:  LogCheckpointStart, xlog.c:6093

When trying to perform a backup, the result is:
ERROR: Impossible to start the backup. Check the log for more details, or run 'barman check qursorrtls'

I think it is due to there being no WALs in the wals dir, because I see this in the Barman logs:
barman.wal_archiver INFO: No xlog segments found from streaming for server1

@martinmarques (Contributor)

To get WALs into the wals directory, run the following SQL on the Postgres node as a superuser: SELECT pg_switch_wal();. Then, on the Barman server, run barman archive-wal or wait for the next barman cron run. That should produce a WAL switch, and the WAL will get archived into the wals directory.
That .partial file is from the streaming of WALs.
I don't think the crash has anything to do with there being no WALs available in the wals directory.
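
That sequence, as a sketch with the names from this thread (assuming dbadmin has superuser rights and barman_home is /var/lib/barman as configured above):

```shell
# On the Postgres node, as a superuser: force a WAL segment switch
psql -U dbadmin -d my-db -c "SELECT pg_switch_wal();"

# On the Barman host: archive the streamed segment now instead of waiting for cron
barman archive-wal server1

# The finished segment should then appear under barman_home:
ls /var/lib/barman/server1/wals/
```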

martinmarques self-assigned this Jul 26, 2024
@ggefgreh (Author)

Okay, I did what you said, and it got a backup started, but it was interrupted by a connection termination that could not be re-established. I can't provide logs and output because I already cleaned them up; it was a while ago and I hoped it wasn't a one-time event, but it was. I was looking into setting a longer keep-alive time, but I am not sure whether that belongs to Postgres or to the system SSH settings.

For now, I have enabled logging of connections and disconnections, and this is what happens when performing a backup:

LOCATION:  set_authn_id, auth.c:371
[11297] LOG:  00000: replication connection authorized: user=streaming_barman
[11297] LOCATION:  PerformAuthentication, postinit.c:296
[11297] LOG:  00000: received replication command: IDENTIFY_SYSTEM
[11297] LOCATION:  exec_replication_command, walsender.c:1771
[11297] STATEMENT:  IDENTIFY_SYSTEM
[11297] LOG:  00000: received replication command: IDENTIFY_SYSTEM
[11297] LOCATION:  exec_replication_command, walsender.c:1771
[11297] STATEMENT:  IDENTIFY_SYSTEM
[11297] LOG:  00000: received replication command: IDENTIFY_SYSTEM
[11297] LOCATION:  exec_replication_command, walsender.c:1771
[11297] STATEMENT:  IDENTIFY_SYSTEM
[11297] LOG:  00000: received replication command: IDENTIFY_SYSTEM
[11297] LOCATION:  exec_replication_command, walsender.c:1771
[11297] STATEMENT:  IDENTIFY_SYSTEM
[1] LOG:  00000: server process (PID 11309) exited with exit code 255
[1] LOCATION:  LogChildExit, postmaster.c:3737
[1] LOG:  00000: terminating any other active server processes
[1] LOCATION:  HandleChildCrash, postmaster.c:3504
[11297] LOG:  00000: disconnection: session time: 0:00:00.170 user=streaming_barman database= host=192.168.76.14 port=34568
[11297] LOCATION:  log_disconnections, postgres.c:5029
[11311] LOG:  00000: connection received: host=192.168.76.14 port=34584
[11311] LOCATION:  BackendInitialize, postmaster.c:4400
[11311] FATAL:  57P03: the database system is in recovery mode
[11311] LOCATION:  ProcessStartupPacket, postmaster.c:2410
[1] LOG:  00000: all server processes terminated; reinitializing
[1] LOCATION:  PostmasterStateMachine, postmaster.c:4018
[11312] LOG:  00000: database system was interrupted; last known up at 2024-07-26 17:59:57 UTC
[11312] LOCATION:  StartupXLOG, xlog.c:4962
[11315] LOG:  00000: connection received: host=192.168.76.14 port=34590
[11315] LOCATION:  BackendInitialize, postmaster.c:4400
[11315] FATAL:  57P03: the database system is in recovery mode
LOCATION:  ProcessStartupPacket, postmaster.c:2410
LOG:  00000: database system was not properly shut down; automatic recovery in progress
[11312] LOCATION:  InitWalRecovery, xlogrecovery.c:896
[11312] LOG:  00000: redo starts at 3/720000D8
[11312] LOCATION:  PerformWalRecovery, xlogrecovery.c:1659
[11312] LOG:  00000: invalid record length at 3/7200A300: wanted 24, got 0
[11312] LOCATION:  ReadRecord, xlogrecovery.c:3082
[11312] LOG:  00000: redo done at 3/7200A2C8 system usage: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s
[11312] LOCATION:  PerformWalRecovery, xlogrecovery.c:1796
[11313] LOG:  00000: checkpoint starting: end-of-recovery immediate wait
[11313] LOCATION:  LogCheckpointStart, xlog.c:6093
[11313] LOG:  00000: checkpoint complete: wrote 47 buffers (0.3%); 0 WAL file(s) added, 0 removed, 0 recycled; write=0.001 s, sync=0.001 s, total=0.002 s; sync files=0, longest=0.000 s, average=0.000 s; distance=40 kB, estimate=40 kB
[11313] LOCATION:  LogCheckpointEnd, xlog.c:6174
[1] LOG:  00000: database system is ready to accept connections

As far as I can read it (and I don't know what it should look like), it seems like a streaming problem (the timing matches the backup command execution). I also didn't observe any abnormality in CPU/disk usage while performing a backup, or anything else unusual; only all the DB-related processes dying during the crash.

Is it normal that it has so many IDENTIFY_SYSTEM executions?

And of course, it is not related to an empty wals directory, because now it is not empty. I will try to explore some more logs and maybe start again from the beginning and check the configuration for the 99th time.

@martinmarques (Contributor)

Is it normal that it has so many IDENTIFY_SYSTEM executions?

Those are streaming connections; pg_receivewal asks the server to identify itself before it starts streaming:

https://www.postgresql.org/docs/current/protocol-replication.html

I can't find anything useful in those logs. You'll probably have to increase the log level in Postgres (log_min_messages).

My suggestion is to bring this to the pgsql-general mailing list or the Postgres Slack/Discord channels.
