Why does wyng actually do a btrfs subvolume snapshot of /var/lib/qubes? #215
Comments
The _btrfs_subvol_snapshot function (used under _get_btrfs_generation, which is used under metadata_lock) in the wyng script does create a BTRFS subvolume snapshot. Considering the overall design of the script, which primarily uses reflinks,
given these points, it appears that the BTRFS subvolume snapshot creation might be an artifact of earlier development, a feature that wasn't fully integrated into the current backup strategy, or a way to accommodate btrfs send ops (which users could do on their own as needed). It's not clear from the code why.
A subvol snapshot is created to isolate the local data set from an otherwise busy filesystem, and to provide a target for checking transaction ids. This is a critical step in getting correct delta info on Btrfs (it's not required on XFS because of XFS's relative simplicity); without the transaction ids from the subvol snapshots, data corruption could occur in the archive volumes because Wyng would be sending the wrong data blocks.

One puzzle here is that, unlike reflinks, subvol snapshots are not supposed to be metadata-intensive at all... not until data is written to one of the related subvols, causing them to diverge. It may be that creating the subvol is triggering some kind of cache flush or other housekeeping related to the reflink ops that were done just prior. I know that there is a new default mode in Btrfs that attempts to pre-process subvol metadata in such a way as to accelerate the mounting of filesystems that bear a large number of subvolumes.

Recommendations:
Number 3 hints that some automatic mitigation may be possible in Wyng by processing the volume list in small batches. It may also be possible to reduce the scope of each subvol snapshot so it only includes the volume(s) in a specific batch. However, I think the largest footprints by far are probably the extra Qubes snapshots and fragmentation. Template variation / fragmentation being a large factor seems doubtful to me, unless the template root volumes are large; most are in the 10-20GB range, which I don't regard as large.
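The isolation-plus-transaction-id step described above can be sketched with stock btrfs-progs commands (the function name and paths are illustrative, not Wyng's actual code; this requires a real Btrfs filesystem and root):

```shell
# Sketch: a read-only subvol snapshot freezes the data set at a single
# filesystem generation, which later delta detection can reference.
capture_gen() {
    pool=$1; snap=$2
    btrfs subvolume snapshot -r "$pool" "$snap"
    # "find-new" with an impossibly high generation prints the snapshot's
    # transid marker instead of a file listing; extract the number.
    btrfs subvolume find-new "$snap" 9999999 | awk '{print $NF}'
}
# Illustrative usage (commented; needs root on a Btrfs pool):
# gen=$(capture_gen /var/lib/qubes /var/lib/qubes/wyng_snapshot_tmp)
# btrfs subvolume find-new /var/lib/qubes "$gen"  # files changed since snapshot
```

Without a stable generation from a frozen snapshot, comparing the live pool against itself could attribute in-flight writes to the wrong side of the delta, which matches the corruption risk described above.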
Note to self: Creating many smaller subvol snapshots may not be an option, as there is no technique for isolating individual vols (img files) from a large directory of vols to create subvol snapshots just for those isolated vols. Creating hard links to the target vols using the existing Wyng snapshot naming could probably work: a subvol snapshot would then be created as normal, except the system overall bears one less reflink snapshot for each target vol. This may result in a significant reduction in the metadata-usage high-water mark during delta acquisition; it should be at least as effective as acquiring deltas in small batches, and needs no special handling of paths in the remainder of the delta-processing code. (The downside is that the subvol snapshot must be retained through the entire send/monitor procedure, and deleted only when it's complete.)
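The hard-link staging idea in the note above can be sketched as follows (directory and function names are hypothetical, not Wyng's actual implementation; hard links cost no data copy and no reflink metadata):

```shell
# stage_vols: hard-link every *.img in a pool directory into a staging
# subdirectory, so one subvol snapshot of the pool covers them all at a
# single generation, with no per-volume reflink snapshot needed.
stage_vols() {
    pool=$1
    stage="$pool/wyng_stage"    # hypothetical staging dir name
    mkdir -p "$stage"
    for img in "$pool"/*.img; do
        [ -e "$img" ] || continue
        ln -f "$img" "$stage/$(basename "$img")"   # hard link, not a copy
    done
}
# Then one snapshot isolates every staged volume (commented; needs Btrfs):
# stage_vols /var/lib/qubes
# btrfs subvolume snapshot -r /var/lib/qubes /var/lib/qubes/wyng_snapshot_tmp
```

As the note says, the snapshot would then have to live until the whole send/monitor procedure finishes, since the hard links inside it are the only frozen references.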
wyng taking disk-image snapshots, and qubes still doing snapshots and rotation of disk images, should take care of this without needing btrfs subvolume snapshots, no?
No problem with free space.
Ha. I did this once and forgot about it. But again, as reminded here, this is incompatible with deduplication; it isn't used on the installation I'm typing from, though, so it cannot be the source of the issue outside of clones + specialized templates/qubes from reflinked disk images / bad config.
Will do and report back.
That was already switched from RAID1 (the Q4.2 default) to SINGLE. Only DATA and SYSTEM are DUP, which is considered good on SSD drives. The question is how to enforce proper defaults in Q4.3, if BTRFS performance is comparable to or better than lvm2. As of now, that is unfortunately not the case for my use case (nvme drive on an nv41 laptop with 12 cores, qusal deployed so cloned + specialized reflinked disk images, with QubesOS still rotating disk images where wyng backs up state prior to the live state; so I still don't understand why the btrfs subvolume snapshot is required, if wyng takes a snapshot of the disk image and compares the old snapshot with the new one to send the backup delta anyway).
I have some qubes that are pretty heavy (250GB+), but the slowdowns in wyng are observed during the btrfs subvolume snapshot and on basic templates, not even the specialized derivatives. More testing is needed, with observation points. Once again, it's hard to troubleshoot without more verbosity added to the pre-backup phase, where wyng only says that it's preparing for backup, without detailing the time consumed in each phase. In my opinion, it would greatly help to give output on what is happening there, or some insight from iowait analysis, to see what is happening under the hood and understand why the slowdowns occur. @tasket I will present my different PoCs (rpi5 + softraid wyng storage; veeble: to be reused/expanded to offer disk states as a firmware service, for which I poked you before about a joint grant application at QubesOS/qubes-issues#858 (comment)) at the QubesOS mini-summit. If you can free yourself for the presentation time window, it would be awesome to have you there (ideally at least for the Q&A on Sept 21 2024; details can be seen under https://cfp.3mdeb.com/qubes-os-summit-2024/talk/J8TQEU/).
Add vol handling debug output, issue #215
@tlaurion I added volume/subvol debug-mode output to the update in the 08wip branch. I'll take a look at the calendar, but I should be free for at least Sept 21.
@tasket Here is a gpg-encrypted + compressed tar.gz (so that GitHub accepts it) of the log (txt), encrypted for your Proton Mail recipient linked to your public key in this repo. My public key can be downloaded through:
Some analysis, and the script used to produce it. Script for wyng output (
Improved code (the prior version pointed to the wrong lines and gave somewhat wrong time diffs by picking up lines that had no timestamps), if at all useful to you @tasket. I think that using
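The fix described above (only diffing between lines that actually carry timestamps) can be sketched like this (function name and the HH:MM:SS log format are assumptions, not the actual script):

```shell
# time_deltas: for each timestamped line in a log, print the number of
# seconds until the NEXT timestamped line, skipping untimestamped lines
# so they can no longer produce bogus diffs.
time_deltas() {
    awk '
    /^[0-9][0-9]:[0-9][0-9]:[0-9][0-9]/ {
        split($1, t, ":")
        secs = t[1] * 3600 + t[2] * 60 + t[3]
        if (prev != "")
            printf "%ds\t%s\n", secs - prev, prevline
        prev = secs
        prevline = $0
    }' "$1"
}
# Usage: time_deltas wyng_run.log   # log filename hypothetical
```

This gives a rough per-phase time breakdown from existing output, pending real phase-by-phase verbosity in wyng itself.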
@tasket by doubling dom0's max mem under /etc/default/grub (
That was done on a fresh reboot, with htop sorting by total CPU time while the wyng op was ongoing. Analysis: swap is still touched; most time consumed (among still-running processes) is krunner-related, while during the wyng run it was btrfs-transaction and fs/crypt-related activity, as can be seen. Running in dedup send mode, as I don't see why I wouldn't. 12th-gen CPU, 64GB RAM. I can spare resources, but I'm not sure what is happening here. Again, thoughts welcome.
This changed everything.
Seems like ssd_spread and discard were the issue, as well as defrag (which is the enemy of reflink from what I gathered, and the opposite of dedup), causing over-amplification. Will close for now.
@tlaurion do you have any performance stats from this change? I've been using
P.S. Keen to hear your presentation at this month's summit!
Just no more iowait. 2 minutes for a backup send; the btrfs subvolume snapshot doesn't lock; I'm only limited by network speed and have little slowdown when using reflinked clones. Defrag, AFAIK, voids the reflink gains by expanding qubes disk images to consume more space. I never ran defrag, and I'm not sure why @tasket suggests it. I now get why the subvolume snapshot is important, following his first answer, but the defrag advice is still a mystery to me. Also, no gain from doing a balance unless there are multiple disks in the btrfs volume. Cancelling the QubesOS revert capability also helped performance. Pretty hard to get proper numbers here, to be honest @kennethrrosen!
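For reference, the "cancelling QubesOS revert capability" mentioned above is done per volume with qvm-volume; the VM name here is hypothetical (this only runs in dom0 on a Qubes system):

```shell
# Stop Qubes from keeping rotated revert snapshots for a given qube's
# volumes, leaving wyng as the only snapshot producer:
qvm-volume config myvm:private revisions_to_keep 0
qvm-volume config myvm:root revisions_to_keep 0
```

Fewer retained revisions means fewer reflink copies per image for Btrfs metadata to track during the subvol snapshot.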
@tlaurion @kennethrrosen Defrag has two major effects: placing data blocks adjacent to each other (enhancing sequential reads, at least for HDDs), and reducing the complexity of the metadata that tracks the fragments (possibly by orders of magnitude). What I would consider a minor effect is that on a CoW fs more space will be used for data; I've seen usage increase by 3-4GB with 500GB of allocated data, for instance, YMMV.

I can accept that certain conditions might invert the costs/gains relative to Btrfs behaviors. A deduped fs might end up with many fewer logical extents to manage if the volume data is very repetitive but carries many dispersed, smaller, CoW-updated permutations. Keeping an eye on the relative data-to-metadata allocations on your fs (using

BTW, dedup is actually compatible with defragging, provided you give it parameters that prevent it from rewriting chunks smaller than defrag's threshold (usually 256KB). It's also possible to try smaller thresholds for both. Note that auto-defrag and auto-dedup can significantly increase overall drive activity and have a write-amplification effect; this is why I advise weekly or monthly use of defrag instead.

As to what actually made the difference for you, it might be dedup (which would be very much a corner case IMHO) or perhaps switching from ssd_spread to ssd allocation. It seems unlikely that Btrfs was defaulting to autodefrag or a non-v2 space-cache setting, but if that were the case, they would have a large impact on performance.
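A hedged sketch of the threshold pairing described above (flags should be double-checked against the btrfs-progs and duperemove man pages; the path and 256K figure are taken from the comment, the exact invocations are my assumption):

```shell
# Periodic (weekly/monthly) defrag with a 256K target extent size:
btrfs filesystem defragment -r -t 256K /var/lib/qubes

# Dedupe with a matching minimum block size, so dedup does not rewrite
# extents smaller than defrag's threshold and undo its work:
duperemove -dr -b 256k /var/lib/qubes
```

The point is that the two tools stop fighting each other once their size thresholds agree, instead of auto-defrag and auto-dedup continuously amplifying writes.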
/usr/sbin/btrfs subvolume snapshot -r /var/lib/qubes/ /var/lib/qubes/wyng_snapshot_tmp/
is slowing the system down to turtle speed during that operation, even freezing windows until it's done, even though I deactivated all qubes revert snapshots, letting wyng be the only thing taking snapshots (reflink snapshots; the btrfs subvolume snapshot's reason for existence is unknown to me, and end users are required to create it with a script for reasons also unknown on my side). It seems our usage of extensive reflink clones/derivatives diverges, which hides the impact from you while it hurts my use case pretty badly.
Most system resources are stuck in iowait waiting for disk ops to complete while this subvolume is created, while disk images are in use (and most of the time reflinked to other disk images, because of clones + CoW/specialized clones per qusal usage). This sometimes spikes to 100% on all CPUs (500% in the screenshot, but with 12 exposed cores it sometimes goes up to 1200% iowait as seen under htop — shown in grey in the CPU summary if "Setup->Display Options->Detailed CPU time" and "Setup->Meters->Disk IO" are selected for the second column), with no clue how to tweak/diagnose things further since it's hidden in kernel IO AFAIK:
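One way to quantify the iowait stalls described above, beyond htop's grey bars, is the kernel's pressure-stall information (PSI). The helper below is a sketch (function name mine) that pulls the 10-second "some" average from /proc/pressure/io, i.e. the percentage of time at least one task was stalled on IO:

```shell
# io_stall: print the "some avg10" value from a PSI io stats file
# (defaults to the live kernel file /proc/pressure/io).
io_stall() {
    awk '/^some/ { sub(/^.*avg10=/, ""); sub(/ .*$/, ""); print }' \
        "${1:-/proc/pressure/io}"
}
# Usage: watch -n2 io_stall   # sample it while the subvol snapshot runs
```

Sampling this before and during the `btrfs subvolume snapshot` run would show whether the stall is filesystem-wide or confined to a burst, which is more actionable than an aggregate iowait percentage.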
As said in other tickets trying to optimize things, my use case:
My question here arises because wyng's btrfs usage is to do reflink cp calls for all templates/qubes to do its ops. Why wyng does a subvolume snapshot is still unknown to me as of today, and this seems problematic altogether for reflink, where IOWAIT spikes put the system in a really weird state I have no clue how to troubleshoot.
As per prior recommendations and findings:
UUID=a8b2c51e-f325-4647-8fc2-bc7e93f49645 / btrfs subvol=root,x-systemd.device-timeout=0,ssd_spread,space_cache=v2,discard 0 0
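Given the outcome reported later in the thread (ssd_spread and continuous discard being the culprits), a hypothetical alternative for the same mount might look like this; this is a sketch, not a tested recommendation:

```shell
# /etc/fstab (same UUID): plain ssd allocation instead of ssd_spread,
# no continuous discard; trim periodically via systemd instead.
UUID=a8b2c51e-f325-4647-8fc2-bc7e93f49645 / btrfs subvol=root,x-systemd.device-timeout=0,ssd,space_cache=v2 0 0

# systemctl enable --now fstrim.timer   # weekly trim replaces discard
```

Continuous discard can serialize behind heavy CoW metadata updates, which is consistent with the iowait spikes seen during the subvolume snapshot.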
Insights @tasket?