Ideas
The features below are grouped by theme, in only rough priority order.
Development proceeds a bit sporadically depending on developer time.
In general Conserve's tip tree should always be usable, and we will make a new release in every month in which there has been significant development.
Perhaps run a named command (like `diff`) on files that differ.
It'd be really nice to account for which changes might have been due to later changes in the source:
- Files are different and the file in the source is newer.
- File is missing from the source and potentially deleted since the backup was created.
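As a rough sketch of how those two cases might be told apart, assuming we know when the backup was made and can read the source file's modification time. The `DiffCause` enum and `classify` function are hypothetical names for illustration, not existing Conserve code:

```rust
use std::time::SystemTime;

/// Hypothetical explanation for why a file in the backup differs from the source.
#[derive(Debug, PartialEq)]
enum DiffCause {
    /// The source file was modified after the backup: likely a later change.
    SourceChangedLater,
    /// The source file is gone: potentially deleted since the backup.
    SourceMissing,
    /// The file differs although the source is not newer than the backup.
    Unexplained,
}

/// `source_mtime` is `None` when the file no longer exists in the source.
fn classify(source_mtime: Option<SystemTime>, backup_time: SystemTime) -> DiffCause {
    match source_mtime {
        None => DiffCause::SourceMissing,
        Some(mtime) if mtime > backup_time => DiffCause::SourceChangedLater,
        Some(_) => DiffCause::Unexplained,
    }
}
```

Only the `Unexplained` case would then need to be reported as a possible problem with the backup itself.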
These changes should be possible without making a new major archive format, although they will create new bands that can't be read by old versions.
- Try using Rust Prost to write out protobufs.
- If done using derivation the build process should still be reasonably clean.
- Allows storing small files inline as bytes.
- Consider also dictionary-compressing hashes within each index hunk rather than relying on byte compression to do this.
- Alternatively, try CBOR, but it's probably less efficient than binary proto. This will allow storing tiny files inline in the index, and should make the uncompressed form smaller: in particular block references can be binary rather than hex, so will be half the size.
- Measure the size and time impact of this change.
- This probably requires a translation between the in-memory index entry and the serialized object.
- Consider compressing the index with zstd rather than Snappy.
  Last time I tried this the results were actually not so impressive, but we could try again.
  https://github.com/sourcefrog/conserve/issues/154
- Store tiny files in the index rather than in blocks. For tiny files (for example 10 bytes) it's actually smaller to store the content than a reference to the block, and it avoids separate IO to the block store.
  This depends on having an index that can efficiently include binary content.
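A minimal sketch of the inline-or-reference decision, assuming a 64-byte binary hash and a hypothetical `Content` enum. Neither is taken from the actual index format, and `hash_fn` stands in for whatever hash the archive uses:

```rust
/// Length in bytes of a binary block hash. 64 is an assumption for
/// illustration, not a statement about the real format.
const HASH_LEN: usize = 64;

/// Hypothetical content field of an index entry.
#[derive(Debug, PartialEq)]
enum Content {
    /// File bytes stored directly in the index.
    Inline(Vec<u8>),
    /// Reference to a block in the block store, as a binary hash.
    Block([u8; HASH_LEN]),
}

/// Store the bytes inline when they are no larger than a block reference
/// would be; otherwise store a reference.
fn choose_content(data: &[u8], hash_fn: impl Fn(&[u8]) -> [u8; HASH_LEN]) -> Content {
    if data.len() <= HASH_LEN {
        Content::Inline(data.to_vec())
    } else {
        Content::Block(hash_fn(data))
    }
}

/// Size of a hash when hex-encoded: two characters per byte, which is why
/// a binary reference is half the size of a hex one.
fn hex_len(binary_len: usize) -> usize {
    binary_len * 2
}
```

The threshold could reasonably be larger than the hash length, since an inline file also saves a read from the block store; that would need measuring.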
- Store permissions
- Store and restore ownership.
- In many cases the ownership will be all the same. Perhaps it can be omitted from the index.
- Perhaps there should be an option to request storing ownership: for many cases it doesn't matter so much.
- Being unable to set the group or owner should, by default, produce only a warning.
https://github.com/sourcefrog/conserve/issues/106
In some cases backups spend a lot of time in `stat` on the blockdir. This could be helped by keeping an in-memory cache of which blocks and block prefix subdirs are present.
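One possible shape for such a cache, as a sketch only. `PresenceCache`, the three-character prefix length, and the `stat` callback are all assumptions made for illustration:

```rust
use std::collections::HashSet;

/// Number of leading hex characters used as the subdirectory name.
/// Three is an assumption for illustration.
const PREFIX_LEN: usize = 3;

/// Hypothetical in-memory record of which blocks and prefix
/// subdirectories are known to exist in the blockdir.
#[derive(Default)]
struct PresenceCache {
    blocks: HashSet<String>,
    prefixes: HashSet<String>,
}

impl PresenceCache {
    /// Returns true if the block is present, calling `stat` (a stand-in
    /// for the real filesystem check) only when the block is not already
    /// known. Absence is not cached, because the block may be written later.
    fn contains(&mut self, hash: &str, stat: impl FnOnce(&str) -> bool) -> bool {
        if self.blocks.contains(hash) {
            return true;
        }
        let present = stat(hash);
        if present {
            self.note_present(hash);
        }
        present
    }

    /// Record a block that was just written or found on disk.
    fn note_present(&mut self, hash: &str) {
        self.blocks.insert(hash.to_owned());
        self.prefixes.insert(hash.chars().take(PREFIX_LEN).collect());
    }

    /// True if the prefix subdirectory is already known to exist.
    fn prefix_known(&self, hash: &str) -> bool {
        let prefix: String = hash.chars().take(PREFIX_LEN).collect();
        self.prefixes.contains(&prefix)
    }
}
```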
During restore we will read data from combined blocks multiple times, and at the moment they're read and decompressed each time. This is pretty inefficient.
One way to help is to keep a cache, maybe an adaptive replacement cache, of data blocks in memory and read out from that.
Also, we could look ahead through the index hunk and see all the blocks that are needed for the files that are to be restored. We don't need to be "surprised" to see something is needed again when we can see arbitrarily far forward which ones will be needed. Perhaps this can turn into cache hints.
In both of these there needs to be some reasonable cap, either fixed size or perhaps a fraction of system memory.
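The look-ahead idea could be sketched as a cache that counts, up front, how many times each block will be needed, and drops a block on its last use. This is a simplification rather than an adaptive replacement cache; `LookaheadCache` and its `load` callback (standing in for read and decompress) are hypothetical:

```rust
use std::collections::HashMap;

/// Cache of decompressed blocks that knows, from a look-ahead pass over the
/// index, how many more times each block will be read.
struct LookaheadCache {
    remaining_uses: HashMap<String, usize>,
    blocks: HashMap<String, Vec<u8>>,
    cached_bytes: usize,
    max_bytes: usize,
}

impl LookaheadCache {
    /// `upcoming` lists the block hashes restore will need, with repeats.
    fn new<'a>(upcoming: impl IntoIterator<Item = &'a str>, max_bytes: usize) -> Self {
        let mut remaining_uses: HashMap<String, usize> = HashMap::new();
        for hash in upcoming {
            *remaining_uses.entry(hash.to_owned()).or_insert(0) += 1;
        }
        LookaheadCache { remaining_uses, blocks: HashMap::new(), cached_bytes: 0, max_bytes }
    }

    /// Return the block content, calling `load` only if it is not cached.
    fn get(&mut self, hash: &str, load: impl FnOnce() -> Vec<u8>) -> Vec<u8> {
        // Count this use and find out how many uses remain afterwards.
        let left = match self.remaining_uses.get_mut(hash) {
            Some(n) => {
                *n = n.saturating_sub(1);
                *n
            }
            None => 0,
        };
        if left == 0 {
            // Last use: hand back the data and free its cache slot.
            if let Some(data) = self.blocks.remove(hash) {
                self.cached_bytes -= data.len();
                return data;
            }
            return load();
        }
        if let Some(data) = self.blocks.get(hash) {
            return data.clone();
        }
        let data = load();
        // Retain the block only if it fits under the fixed cap.
        if self.cached_bytes + data.len() <= self.max_bytes {
            self.cached_bytes += data.len();
            self.blocks.insert(hash.to_owned(), data.clone());
        }
        data
    }
}
```

When the cap is reached this sketch simply declines to cache new blocks; a real version might instead evict the block whose next use is furthest away.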
Incremental backups still write a full copy of the index, listing all the entries in the current tree. This in practice seems to work pretty reasonably, with an index only about 1/1000th the size of the tree. (For each file there's about 100 bytes in the name and block references.)
(I used to think this would be very important, but experience seems to show it's not so much.)
We could add a concept of higher-tier versions, that record only files stored since a basis index.
- An index concept of a whiteout.
- A tree reader that reads several indexes in parallel and merges them. (Something much like this will be needed to read incomplete trees.)
- A tree writer that notices only the differences versus the parent tree, and records them, including whiteouts.
It seems like we'd need some heuristic for when to make a delta rather than full index. One possibility is to look at the length of the previous delta index: if it's getting too long (perhaps 1/4 of the full index?) then just store a full index.
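To make the merge and the heuristic concrete, here is a small sketch in which index entries are reduced to bare names. `DeltaEntry`, `merge_indexes`, and `should_write_full` are hypothetical names, and the 1/4 threshold is just the guess from the paragraph above:

```rust
/// Hypothetical delta-index entry: either a stored file or a whiteout
/// marking a name that was deleted since the basis.
#[derive(Debug, Clone, PartialEq)]
enum DeltaEntry {
    Present(String),
    Whiteout(String),
}

impl DeltaEntry {
    fn name(&self) -> &str {
        match self {
            DeltaEntry::Present(n) | DeltaEntry::Whiteout(n) => n,
        }
    }
}

/// Merge a sorted basis index with a sorted delta, letting the delta win
/// and dropping whited-out names.
fn merge_indexes(basis: &[String], delta: &[DeltaEntry]) -> Vec<String> {
    let mut out = Vec::new();
    let (mut i, mut j) = (0, 0);
    while i < basis.len() || j < delta.len() {
        // Take from the basis only while its name sorts strictly first.
        if j == delta.len() || (i < basis.len() && basis[i].as_str() < delta[j].name()) {
            out.push(basis[i].clone());
            i += 1;
            continue;
        }
        // The delta entry replaces any basis entry with the same name.
        if i < basis.len() && basis[i].as_str() == delta[j].name() {
            i += 1;
        }
        if let DeltaEntry::Present(name) = &delta[j] {
            out.push(name.clone());
        }
        j += 1;
    }
    out
}

/// Store a full index once the previous delta grows past a quarter of the
/// full index.
fn should_write_full(previous_delta_len: usize, full_len: usize) -> bool {
    previous_delta_len * 4 > full_len
}
```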
Validation checks some invariants of the format, to catch either bugs or issues originating in the environment, like disk corruption.
Perhaps more of the work here is in creating tests that make variously broken archives and validate them - positive cases for validation.
What bugs are actually plausible? What failures could be caused by interruption or machine crash or other likely underlying failures?
How much is this similar to just doing a restore and throwing away the results?
- For the archive
  - [done] No unexpected directories or files
  - [done] All band directories are in the canonical format
- For every band
  - The index block numbers are contiguous and correctly formatted
  - No unexpected files or directories
- For every entry in the index:
  - Filenames are in order (and without duplicates)
  - Filenames don't contain `/` or `.` or `..`
  - The referenced blocks exist
  - (Deep only) The blocks can be extracted and they reconstitute the expected hash
- For the blockdir:
  - No unexpected top-level files or directories
  - Every prefix subdirectory is a hex prefix of the right length
  - Every file inside a prefix subdirectory matches the prefix
  - There are no unexpected files or directories inside prefix subdirectories
  - No zero-byte files
  - No temporary files
- For every block in the blockdir:
  - [done] The hash of the block is what the name says.
  - All blocks are referenced by at least one index
Should report on (and gc could clean up) any old leftover tmp files.
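A few of the cheaper checks above could look something like this. It is a sketch only: it reads "filenames" as single path components, assumes plain byte-wise ordering, and treats an empty name as invalid, none of which is stated in the list:

```rust
/// Check that a filename component is acceptable: non-empty (an added
/// assumption), containing no `/`, and not `.` or `..`.
fn valid_name_component(name: &str) -> bool {
    !name.is_empty() && !name.contains('/') && name != "." && name != ".."
}

/// Check that names are strictly increasing, which rules out both
/// misordering and duplicates.
fn names_strictly_ordered(names: &[&str]) -> bool {
    names.windows(2).all(|pair| pair[0] < pair[1])
}

/// Check that a prefix subdirectory name is hex and that a block file
/// inside it has a name starting with that prefix.
fn block_matches_prefix(prefix_dir: &str, block_name: &str) -> bool {
    !prefix_dir.is_empty()
        && prefix_dir.chars().all(|c| c.is_ascii_hexdigit())
        && block_name.starts_with(prefix_dir)
}
```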
- Auto ignore `CACHEDIR.TAG` directories using the Rust library to match them.
- Try https://github.com/TyOverby/flame flamegraph profiling. (May not be useful if the compression/hashing/etc is very tightly interleaved? But we can still try.)
- Test handling of various broken archives - perhaps needs some scripts or infrastructure to construct them
- decompression failure
- missing block
- bad block
- missing index file
- File is removed during reading of index
- Add more unit tests for restore.
- Interesting Unicode names? (What's interesting?)
- Filenames that cause trouble across Windows/Unix.
- Test performance of block storage by looking at counts: semi-white-box test of side effects
- Filesystem wrapper to allow injecting faults
- Detection of corrupt block:
- Wrong hash
- Decompression fails
- Helper to compare trees and show diff
- Helper for blackbox tests: show all output if something fails in the test. (Is it enough to just print output unconditionally?)
- Rename `testsupport` to a separable `treebuilder`?
- Detect there's an interrupted band
- Look at what index blocks are already present
- Find the last stored name from the last stored index block
- Maybe check all the data blocks from the last index block are actually stored, to know that the interruption was safe?
- Resume from that filename
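The last two steps might be sketched as below, with names as plain strings in sorted order. `names_to_resume` and `all_blocks_present` are hypothetical helpers, and `is_present` stands in for a real blockdir lookup:

```rust
/// Given the names recorded in the last index block of an interrupted band
/// and the sorted names in the source tree, return the names still to store.
fn names_to_resume<'a>(last_index_block: &[&str], source_names: &[&'a str]) -> Vec<&'a str> {
    match last_index_block.last() {
        // Nothing was stored: start from the beginning.
        None => source_names.to_vec(),
        // Skip everything up to and including the last stored name.
        Some(last) => source_names.iter().copied().filter(|n| n > last).collect(),
    }
}

/// Check that every block referenced by the last index block was actually
/// stored, so that resuming after it is safe.
fn all_blocks_present(block_refs: &[&str], is_present: impl Fn(&str) -> bool) -> bool {
    block_refs.iter().all(|hash| is_present(hash))
}
```

If `all_blocks_present` fails, one option would be to discard the last index block and resume from the one before it.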
- Reading, hashing, and compressing non-small files within a group can be parallelized.
- Reading non-small files into memory can be parallelized. Hashing and compressing them still needs to be serial, but that should be much cheaper.
- Parallelize finding referenced blocks across all the hunks of a band.
Both reading and writing do a lot of CPU-intensive hashing and de/compression, and are fairly easy to parallelize.
Parallelizing within a single file is probably possible, but doing random IO within the file will be complicated, especially for non-local filesystems. Similarly entries must be written into the index in order: they could arrive a bit out of order but we do need to finish one chunk at a time.
However it should be easy to parallelize across multiple files, and index chunks give an obvious granularity for doing this:
- Read a thousand filenames.
- Compress and store all of them, generating index entries in the right order. (Or, sort the index entries if necessary.)
- Write out the index chunk and move to the next.
It seems like it'll fit naturally on Rayon, which is great.
I do want to also combine small blocks together, which means the index entry isn't available immediately after the file is written in, only when the chunk is complete. This could potentially be on a per-thread basis.
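The per-chunk steps above could be sketched as follows. To stay dependency-free this uses `std::thread::scope` rather than Rayon, and `store_file` is a stand-in for hashing, compressing, and storing one file; both choices are assumptions for illustration:

```rust
use std::thread;

/// Process one index chunk: store each file on some worker thread, then
/// return one entry per file in the original filename order.
fn process_chunk<F>(names: &[String], workers: usize, store_file: F) -> Vec<(String, u64)>
where
    F: Fn(&str) -> u64 + Sync,
{
    let workers = workers.max(1);
    let per_worker = ((names.len() + workers - 1) / workers).max(1);
    let store_file = &store_file;
    thread::scope(|scope| {
        // Each worker takes a contiguous slice of names, so joining the
        // workers in spawn order yields entries already in index order.
        let handles: Vec<_> = names
            .chunks(per_worker)
            .map(|slice| {
                scope.spawn(move || {
                    slice
                        .iter()
                        .map(|name| (name.clone(), store_file(name.as_str())))
                        .collect::<Vec<_>>()
                })
            })
            .collect();
        handles
            .into_iter()
            .flat_map(|handle| handle.join().expect("worker panicked"))
            .collect()
    })
}
```

With Rayon the same shape would probably be a parallel iterator over the chunk followed by an ordered collect, which avoids the manual slicing.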
At the moment, every time new information is passed to the progress bar, we check whether it was repainted recently and, if not, redraw it.
If there is a burst of updates and then a pause, this can leave the terminal showing an update that was not the last one, and that doesn't reflect where the process is up to.
We could instead have a thread that repaints every 500ms, and calls from the work threads only update the state.
This would also avoid contention of worker threads on the pb mutex: painting to the terminal is potentially somewhat slow.
Or, alternatively, try to make sure that there just are frequent updates, and so frequent opportunities to repaint. In particular we could emit ticks while processing blocks within a large file.
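The dedicated repaint thread could be sketched like this, with workers touching only atomics so they never wait on the terminal. `ProgressState`, `spawn_painter`, and the `paint` callback are hypothetical names:

```rust
use std::sync::atomic::{AtomicBool, AtomicU64, Ordering};
use std::sync::Arc;
use std::thread;
use std::time::Duration;

/// State shared between workers and the painter. Workers only bump counters.
#[derive(Default)]
struct ProgressState {
    bytes_done: AtomicU64,
    finished: AtomicBool,
}

/// Spawn a thread that calls `paint` with the latest value once per
/// interval, and once more after `finished` is set so that the final
/// state is always the one left on screen.
fn spawn_painter<F>(
    state: Arc<ProgressState>,
    interval: Duration,
    mut paint: F,
) -> thread::JoinHandle<()>
where
    F: FnMut(u64) + Send + 'static,
{
    thread::spawn(move || {
        while !state.finished.load(Ordering::Acquire) {
            paint(state.bytes_done.load(Ordering::Relaxed));
            thread::sleep(interval);
        }
        paint(state.bytes_done.load(Ordering::Relaxed));
    })
}
```

A worker would call `state.bytes_done.fetch_add(n, Ordering::Relaxed)` as it makes progress, and the main thread would set `finished` with `Ordering::Release` and then join the painter.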
- Salt the hashes to avoid DoS collision attacks, and to enable encryption. (Store the salt in the base tier? Requires version bump.)
- Asymmetric encryption? Perhaps better to rely on the underlying storage?
- Signing?
Perhaps do SFTP first. https://github.com/sourcefrog/conserve/issues/83 However at least on Linux, SFTP can be done pretty well by sshfs.
https://github.com/sourcefrog/conserve/issues/122
- `conserve replicate` to copy bands from an archive without changing the content?
  - Like an ordering-aware `gsutil rsync` or `rsync`
- Test on GCS FUSE
- For remote or slow storage, keep a local cache of which blocks are present?
- Store inode numbers and attempt to restore hard links
- Store file types other than file/dir/symlink
- How can we avoid every user needing to manually configure what to exclude?
- Exclude files from future backups but don't mark them as deleted. In other words, a pattern which will be assumed to be unchanged. (Is it useful?)
`conserve size` on an archive should probably give the size of the archive by default, not the size of the last stored tree.
`conserve versions --sizes` won't say much useful about the version sizes any more, because most of the disk usage isn't in the band directory. Maybe we need a `conserve archive describe` or `conserve archive measure`.
We could say the total size of all blocks referenced by that version.
Perhaps it'd be good to say how many blocks are used by that version and not any newer version.
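That last measure is a set difference. As a sketch, with each version reduced to the set of block hashes it references (`blocks_only_in` is a hypothetical helper):

```rust
use std::collections::HashSet;

/// Blocks referenced by `version` and by none of the `newer` versions.
fn blocks_only_in<'a>(
    version: &'a HashSet<String>,
    newer: &[HashSet<String>],
) -> HashSet<&'a str> {
    version
        .iter()
        .filter(|hash| !newer.iter().any(|v| v.contains(*hash)))
        .map(String::as_str)
        .collect()
}
```

Summing the stored sizes of the returned blocks would give an estimate of the space that only the older version is holding on to.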