Leases and locks
- The repository format invariants are always preserved: the archive is never corrupted, even when clients are interrupted.
- Avoid, if possible, the user ever needing to decide whether to break a lock: this is toilsome, probably hard for a human to decide reliably, and tends to cause problems for scripts or automated operations, where there is no human to punt to.
- Readers should be "physically" read-only: they don't need permission to write, and so they cannot write a lock file.
- Support deletion of old versions and any objects they reference. (This might be manually or automatically triggered, to free up space or to remove things older than a retention period.)
- Support storing archives on a remote machine over SFTP, or on an object store such as S3. So we cannot rely on Unix flocks and the like, and remote archives will not always be accessed from a single machine.
Assume that clients may be interrupted, disconnected, or crash at any time. (It's OK if they try to clean up, but we can't assume cleanup will always succeed.)
We'll assume that the files written up to the moment a client is interrupted will all be readable in the future. This is not quite true: if, for example, the archive is written to a local filesystem and the machine abruptly loses power or crashes while a backup is underway, then, since filesystems generally don't guarantee a perfect ordering across files, we may end up with the archive referencing blocks that aren't present, or that are corrupt.
However, in that circumstance it's very hard to make any guarantees at all about what will be seen after recovery, so we mostly handle this by making read operations tolerant of missing blocks. (There is an ancillary problem here: future backups really should notice that blocks are missing or corrupt and avoid referencing them, but that is out of scope for this document.) Hopefully, abrupt reboots that lose in-flight file IO are rare.
We also assume that readers see files immediately after they are written, i.e. the archive filesystem is strongly globally consistent. This was not always true for S3 and GCS, but it is now. Here too, it seems very hard to do anything without this assumption, short of relying on some external consistency mechanism.
All machines involved have reasonably accurate clocks, within say a minute of true time.
During a backup, Conserve first writes data blocks and then writes index hunks that reference those blocks. Index hunks may also reference blocks already in the archive, either because the file is unchanged from the previous backup or because the file's hash matches something else that was stored. (There's no guarantee that the match will be against the immediately previous backup.)
This ordering ensures that if the client is interrupted, every referenced block will have been stored, although some blocks may have been stored that are not referenced by any index hunk.
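A minimal sketch of this ordering, with `BlockDir` and `IndexHunk` as simplified stand-ins for Conserve's real types:

```rust
use std::collections::HashMap;

/// Simplified stand-ins for Conserve's real blockdir and index types.
struct BlockDir {
    blocks: HashMap<String, Vec<u8>>, // block hash -> block content
}

struct IndexHunk {
    referenced_blocks: Vec<String>, // hashes of the blocks holding file content
}

impl BlockDir {
    /// Store the block if it's not already present; either way, return its hash.
    fn store_or_deduplicate(&mut self, content: &[u8], hash: String) -> String {
        self.blocks
            .entry(hash.clone())
            .or_insert_with(|| content.to_vec());
        hash
    }
}

/// Blocks are written (or found) first; only then is the index hunk that
/// references them written. An interruption can therefore leave unreferenced
/// blocks behind, but never an index hunk referencing a missing block.
fn backup_file(block_dir: &mut BlockDir, content: &[u8], hash: String) -> IndexHunk {
    let stored = block_dir.store_or_deduplicate(content, hash);
    IndexHunk {
        referenced_blocks: vec![stored],
    }
}
```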
During garbage collection (which runs as part of deletion), Conserve lists all present blocks, walks all indexes to find every referenced block, and then deletes the blocks that are not referenced by any index.
The problem is that if garbage collection races with a backup being written, it may see the new blocks as unreferenced and delete them, even though they are about to be referenced by new index hunks.
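A sketch of the sweep, and of where the race bites, assuming we already have the set of present block hashes and the set referenced from all index hunks:

```rust
use std::collections::HashSet;

/// Present-but-unreferenced blocks are candidates for deletion. If a backup
/// is running concurrently, a block it has just written (but not yet
/// referenced from an index hunk) shows up in this set and would be wrongly
/// deleted: exactly the race described above.
fn unreferenced_blocks(
    present: &HashSet<String>,
    referenced: &HashSet<String>,
) -> HashSet<String> {
    present.difference(referenced).cloned().collect()
}
```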
It seems the main thing we need is mutual exclusion between backups and garbage collection. It would be OK for any number of backups to write to the same archive simultaneously, but we should not search for unreferenced blocks at the same time that blocks and index hunks are being written.
(Or maybe there is some other solution that would let them run simultaneously?)
The current solution, in 0.6.15, is that deletion refuses to start if an incomplete backup is underway, and backups refuse to start if there is a GC_LOCK file in the archive.
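Sketched, with `Archive` and its helpers as hypothetical stand-ins for the real transport calls:

```rust
/// Hypothetical handle on the archive; these helpers stand in for whatever
/// the real transport operations are.
struct Archive;

impl Archive {
    fn last_backup_is_incomplete(&self) -> bool {
        unimplemented!()
    }
    fn file_exists(&self, name: &str) -> bool {
        let _ = name;
        unimplemented!()
    }
}

/// The 0.6.15 checks: each side refuses to start if it sees evidence that
/// the other might be running.
fn gc_may_start(archive: &Archive) -> bool {
    !archive.last_backup_is_incomplete()
}

fn backup_may_start(archive: &Archive) -> bool {
    !archive.file_exists("GC_LOCK")
}
```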
This is OK but not ideal. If the gc is interrupted, which is plausible because it does a lot of work and takes a long time, the lock can be left behind, and the user then has to decide whether to manually break it (which is against one of the goals above).
Also, this requires a human in the loop, and so conflicts with the goal of running reliably and entirely automatically.
Similarly, the last backup being incomplete is a poor proxy for whether a backup process is still running: the backup might have been interrupted weeks ago. And if the user wants to clean up the interrupted backup, Conserve refuses: they need to complete a new backup first.
We might think of never deleting blocks added in the most recent backup, but new index hunks can reference arbitrarily old blocks, so this doesn't help.
Similarly, we could emphasize deleting particular backups rather than deleting unreferenced blocks, and then delete only the blocks uniquely referenced by the to-be-deleted backup. But it still might happen that a currently-underway backup chooses to add a new reference to an apparently-obsolete block.
Earlier versions of Conserve (0.5?) avoided this problem by keeping a per-band block directory, which does make it easy to delete all of a band's blocks, but at the price of a lot of duplicated block data. So that's not a very good solution.
We could also try keeping a file lock in a directory on the client machine, representing that a logical lock is held on some remote archive.
A different approach to mutual exclusion between backups and GC is to write a lease file in the archive, say a json file LEASE in the root.
The lease file asserts that a process needs mutual exclusion and that it is still alive.
To prevent stale leases we include in the lease the time it was last refreshed.
If the lease is less than say five minutes old, it is still valid, and any contending writers must wait. (Possibly five minutes is too long and one minute would be enough.)
While the process is still working, it should periodically check and refresh the lease, well ahead of its expiration date.
If a process discovers the lease was stolen, it can no longer continue and should loudly exit. (This should never normally happen, but seems conceivable if a user manually deletes the lease file, if there is a bug, etc.) The backup process could exit before writing a new index hunk; the gc process could exit before deleting any blocks.
Checking and renewal of the lease could be done from a background thread, which should mean it's always renewed promptly.
Checking and reading back the lease also protects against filesystems or transports that don't reliably detect conflicting attempts to create a single file.
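A sketch of such a renewal thread, assuming hypothetical `read_lease` and `write_lease` transport helpers and a client-chosen nonce for detecting theft:

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::thread;
use std::time::Duration;

const LEASE_LIFETIME: Duration = Duration::from_secs(5 * 60);
const RENEW_INTERVAL: Duration = Duration::from_secs(60);

struct Lease {
    nonce: String,
}

// Stubbed transport operations; the real code would read and write the
// LEASE file through the archive transport.
fn read_lease() -> Option<Lease> {
    unimplemented!()
}
fn write_lease(_nonce: &str, _lifetime: Duration) {
    unimplemented!()
}

/// Renew the lease periodically, well before it expires. Reading it back
/// before rewriting both detects theft and guards against transports that
/// don't reliably detect conflicting creates.
fn spawn_renewer(my_nonce: String, stolen: Arc<AtomicBool>) -> thread::JoinHandle<()> {
    thread::spawn(move || loop {
        thread::sleep(RENEW_INTERVAL);
        match read_lease() {
            Some(lease) if lease.nonce == my_nonce => {
                // Still ours: push the expiry time further into the future.
                write_lease(&my_nonce, LEASE_LIFETIME);
            }
            _ => {
                // Missing or held by someone else: tell the main thread to
                // abort loudly without writing any more files.
                stolen.store(true, Ordering::Relaxed);
                return;
            }
        }
    })
}
```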
A single lease per archive would block two backup processes from running simultaneously, although in general they could safely coexist. (It wouldn't be very useful to do this at the moment, but with multiple backup series sharing a blockdir it might be.)
We assume here that the transport filesystem can atomically create a file with "create if not exists" semantics, which S3 added in 2024. We also assume consistent read-after-write, that filesystems report mtimes reasonably correctly, and that clients have reasonably accurate clocks, within a few seconds of each other.
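For S3 specifically, a hedged sketch of atomic creation, assuming the aws-sdk-s3 crate's `if_none_match` support on PutObject:

```rust
use aws_sdk_s3::primitives::ByteStream;
use aws_sdk_s3::Client;

/// Try to create the LEASE object only if it doesn't already exist. For
/// simplicity this treats any error (including the 412 Precondition Failed
/// that means "someone else got there first") as failure to acquire.
async fn try_create_lease(client: &Client, bucket: &str, body: Vec<u8>) -> bool {
    client
        .put_object()
        .bucket(bucket)
        .key("LEASE")
        .body(ByteStream::from(body))
        .if_none_match("*") // the PUT fails if any object already exists at this key
        .send()
        .await
        .is_ok()
}
```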
At any time the lease can be in one of these states, from the point of view of the filesystem holding the lease file:
- Absent: There is no file; any client can write it.
- Corrupt-Recent: There is a file, but it's empty or can't be deserialized. This might occur if a write was interrupted or the filesystem is corrupted. We assume that if the file entry exists it still has an mtime, which can be taken as the last renewal time even when we can't read any other metadata. If the mtime is within the default lease lifetime, we assume the lease is still held.
- Corrupt-Stale: The lease file exists but can't be deserialized, and its mtime is older than the lease lifetime.
- Held: The lease file specifies a lease expiry time in the future. Any other client must wait for a while and can then try again to acquire the lease.
- Expired: The expiry time is in the past; the lease can be taken by any client.
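A sketch of classifying what a client observes into these states; the inputs (existence, mtime, and any expiry parsed from the contents) stand in for whatever the transport can actually report:

```rust
use std::time::{Duration, SystemTime};

const LEASE_LIFETIME: Duration = Duration::from_secs(5 * 60);

enum LeaseFileState {
    Absent,
    CorruptRecent,
    CorruptStale,
    Held,
    Expired,
}

/// `parsed_expiry` is None when the file exists but can't be deserialized.
fn classify(
    exists: bool,
    mtime: Option<SystemTime>,
    parsed_expiry: Option<SystemTime>,
) -> LeaseFileState {
    let now = SystemTime::now();
    if !exists {
        return LeaseFileState::Absent;
    }
    match parsed_expiry {
        Some(expiry) if expiry > now => LeaseFileState::Held,
        Some(_) => LeaseFileState::Expired,
        // Unreadable contents: fall back to the mtime as the last renewal time.
        None => match mtime {
            Some(t) if now.duration_since(t).unwrap_or_default() < LEASE_LIFETIME => {
                LeaseFileState::CorruptRecent
            }
            _ => LeaseFileState::CorruptStale,
        },
    }
}
```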
Clients have their own state machine for the lease, which interacts with the ground truth of the lease file:
- Null: The client does not hold or want the lease. This is the initial state and also the terminal state after the lease is released.
- Probe: The client reads the file to see if it can acquire the lease. If the lease file is Absent it moves to Acquiring. If the lease is Expired or Corrupt-Stale, it moves to Cleaning. If the lease is Held or Corrupt-Recent, it moves to Waiting.
- Waiting: The client believes someone else holds the lease, and is waiting for it to expire. The client waits for a short interval (perhaps 10s) then Probes again.
- Acquiring: The client believes the lease is not held, and will try to write the lease file. If it succeeds, it will set the expiry time and move to the Held state. If it fails, perhaps because of a race with another client, it will move to the Waiting state.
- Cleaning: The lease file was written by another client, but it's expired. This client will delete the file and then move to the Acquiring state.
- Held: The client holds the lease and is working. At intervals, it will revalidate its lease, by first reading the lease file: if it was stolen, it moves to the Stolen state. Otherwise, it rewrites the lease file, setting a later expiry time. Eventually, it will release the lease.
- Stolen: The client believed that it held a valid lease, but when it checked the lease file it found it was held by someone else, or missing. This is an unexpected error, and the client should abort without writing any other files.
- Releasing: The client believes it holds the lease and is in the process of releasing it. It will delete the lease file and then move to the Null state.
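As a sketch, the Probe step maps the observed lease-file state (using the `LeaseFileState` classification from the earlier sketch) onto the client's next state:

```rust
enum ClientState {
    Null,
    Probe,
    Waiting,
    Acquiring,
    Cleaning,
    Held,
    Stolen,
    Releasing,
}

/// One Probe: decide the next client state from what the filesystem shows.
fn probe(observed: LeaseFileState) -> ClientState {
    match observed {
        LeaseFileState::Absent => ClientState::Acquiring,
        LeaseFileState::Expired | LeaseFileState::CorruptStale => ClientState::Cleaning,
        LeaseFileState::Held | LeaseFileState::CorruptRecent => ClientState::Waiting,
    }
}
```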
The most important content in the file is the expiry time, a Unix timestamp.
It also holds other information identifying the client, both so that clients can be sure they're not stealing each other's leases, and to help diagnose problems of contention:
- Client-chosen nonce
- Client process id
- Client Conserve version string
- Client hostname
- Client username
These are serialized as json. Absent or unexpected fields should not be an error.
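A sketch of the file contents with serde; the field names are illustrative, not a committed format. `Option` fields deserialize to `None` when absent, and serde ignores unknown fields by default, matching the rule that absent or unexpected fields are not an error:

```rust
use serde::{Deserialize, Serialize};

/// Illustrative lease file contents; field names are not a committed format.
#[derive(Serialize, Deserialize)]
struct LeaseContent {
    /// Unix timestamp after which the lease may be taken by another client.
    expiry_unix: u64,
    /// Client-chosen random value, compared on read-back to detect theft.
    nonce: Option<String>,
    pid: Option<u32>,
    conserve_version: Option<String>,
    hostname: Option<String>,
    username: Option<String>,
}
```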
In addition to If-None-Match, might we be able to use other S3 features to reduce the risk of inadvertently stealing leases?