[BUG] Persistent Hard Crashing with XFS Temp Drive in MDADM RAID0 #7712
Replies: 24 comments
-
Try EXT4 instead of XFS... I personally switched, as XFS was performing oddly for me on mdraid. Edit: and of course you tried that and I missed it, sorry =)
-
No problem! I'm indeed now using ext4, and it's proven quite stable. Just a shame to miss out on the ~7% performance improvement of XFS. Others (JM/Quindor) seem to be using XFS MDRAID0 without issue, so I'm not sure what's going on with our systems. Cheers!
-
I know LVM raid-0 uses MD under the hood, yet I would give it a try, as LVM might configure it differently, and XFS reading the RAID config via LVM might also lead to a different configuration. Assume 2 drives; start by wiping them (wipefs -a /dev/nvme0n1, and the same for the second drive), then build the stripe with LVM. A more extreme option: try the latest kernel.
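For readers who want to try this route, here is a minimal sketch of the kind of LVM-based raid 0 setup being suggested, assuming two drives. The device, VG, and LV names are illustrative, and these are not the exact commands from the original comment:

```bash
# Illustrative device names; this destroys any data on the drives.
sudo wipefs -a /dev/nvme0n1 /dev/nvme1n1

# Put both drives under LVM and create a striped (raid0) logical volume.
sudo pvcreate /dev/nvme0n1 /dev/nvme1n1
sudo vgcreate vg_plot /dev/nvme0n1 /dev/nvme1n1
sudo lvcreate --type raid0 --stripes 2 -l 100%FREE -n lv_plot vg_plot

# Format with XFS and mount without the 'discard' option (no continuous trim).
sudo mkfs.xfs /dev/vg_plot/lv_plot
sudo mkdir -p /mnt/plot_temp
sudo mount -o noatime /dev/vg_plot/lv_plot /mnt/plot_temp
```

Note that the mount step deliberately omits discard, which is also relevant to the trim discussion later in this thread.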
-
@xorinox Thank you for the detailed instructions! I will give this a try and let you know how it goes :D
-
@andrewseid did you have success with LVM raid 0?
-
Same issue here with XFS after just a few minutes of plotting (full freeze, needed to hard-reset, but no entries in the logs/journal). A switch to ext4 seemed promising at first, but after ~23 hours of non-stop plotting with the madmax plotter, the same issue appeared. Same with the base chia plotter, albeit at a later stage.

What provided final salvation was disabling continuous trim on my raid 0 temp drives! I'm currently running the LVM-based raid with XFS setup proposed by @xorinox, and after over 36 hours without a single crash/freeze, I'm somewhat confident to say that continuous trim might be the culprit rather than the filesystem (his steps mount the array without continuous trim enabled, and I did not explicitly enable continuous trim in my mount options either).

To everyone who experiences the freezes/crashes with a raid 0 setup for the temp drives: did you try to run your plotting without continuous trim enabled on the array yet (e.g. no discard mount option)?

I'll try a vanilla mdadm raid 0 with XFS and without the LVM layer again soon, just to cross-check and further isolate the issue. My best bet is still on disabling continuous trim though. There are a bunch of issues around this documented in the kernel mailing lists, and even in the source itself, for different drives and configs, so it wouldn't surprise me if this turned out to be the culprit of the iffiness we see with a raid 0 and hammering IO to those poor drives.

Also, just for completeness: plotting on single drives without raid 0 was stable at all times, regardless of whether I was using XFS or ext4 as the fs, and regardless of whether continuous trim was enabled on them or not. The single-drive setup was just a fair bit slower than the raid 0 for my particular drives.

System config:
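For anyone unsure what "continuous trim" refers to here: it is the discard mount option, as opposed to periodic trim via fstrim. A hedged sketch of checking for it and avoiding it, with a hypothetical mount point and device:

```bash
# Check whether the temp array is currently mounted with continuous trim ('discard').
findmnt -o TARGET,SOURCE,OPTIONS /mnt/plot_temp

# Example /etc/fstab line without continuous trim (device path and mount point hypothetical):
# /dev/vg_plot/lv_plot  /mnt/plot_temp  xfs  defaults,noatime  0 0

# If the drives should still be trimmed occasionally, periodic trim is a gentler alternative:
sudo systemctl enable --now fstrim.timer
```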
-
@pwntr, I tried XFS on mdadm with discard unset and the system still crashed, FYI.
-
I can confirm that @xorinox's solution worked fine; whatever Ubuntu's LVM does under the hood, it allowed me to have an 8-disk raid 0 with XFS.
-
I have exactly the same problem. Crashes usually happen a couple of minutes into a plot.
-
Update: I tried the commands provided by @xorinox to see if I could create a stable XFS filesystem this way, but unfortunately, when using the mkfs.xfs command after creating the volumes with LVM, it never completes the format; the process just hangs there.

Update 2:
-
@ramin-afshar is it hanging there discarding blocks? Run this in a screen/background until it's finished. It can take some time (up to hours).
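If the hang really is the initial discard pass, mkfs can be told to skip it. A small sketch, assuming the LVM volume from earlier (the device path is hypothetical, and skipping the discard leaves the blocks untrimmed):

```bash
# -K tells mkfs.xfs not to discard (TRIM) blocks at format time,
# avoiding the long or hanging discard pass seen on some drives.
sudo mkfs.xfs -K /dev/vg_plot/lv_plot
```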
-
How does one break apart a RAID created with vgcreate and lvcreate as given above? Nothing shows up in cat /proc/mdstat, despite having made a RAID0 array using xorinox's post above as a guide.
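For reference, a volume built with pvcreate/vgcreate/lvcreate is managed by LVM rather than mdadm, which is why /proc/mdstat stays empty. A hedged teardown sketch, with hypothetical VG/LV and device names (this destroys the data on the array):

```bash
sudo umount /mnt/plot_temp
sudo lvremove /dev/vg_plot/lv_plot        # remove the logical volume
sudo vgremove vg_plot                     # remove the volume group
sudo pvremove /dev/nvme0n1 /dev/nvme1n1   # remove the LVM labels
sudo wipefs -a /dev/nvme0n1 /dev/nvme1n1  # clear any remaining signatures
```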
-
I see this sample output when I execute the mkfs command in the terminal, but it never finishes. When I used the options I mentioned in my previous message, I didn't have this problem.
-
@xorinox Thank you again for the detailed instructions on trying LVM. I did eventually get around to trying RAID0 + XFS + LVM as you suggested, but unfortunately, the crashing issue persisted. I've also tried several different configurations (discard on/off, Clear Linux, MadMax), and those crash as well. It seems likely the issue is fundamental, as @pwntr noted:
My conclusion is that we must currently choose between:
The only thing I didn't try was using a kernel newer than the default kernels in Ubuntu Server 21.04 and Clear Linux Server 34700. Would still love to hear from anyone with additional and/or different conclusions! Cheers!
-
I'm having the same issue. Today I converted the NVMe raid 1 that I was using for plotting to raid 0. Almost immediately I started getting crashes. Probably going to switch to single drives and just run two plotters against them. I wonder: is everyone who has this problem running NVMe drives in raid 0? I feel like that might be the issue.
-
As for my brief example:
-
It would be interesting to know whether, for my brief example, you also experience these crashes when using EXT4 instead of XFS?
-
Machine 1 has been running MD Raid 0 with 2x NVMe and XFS; it experiences random crashes.
Also converted Machine 1 to EXT4; no crashes since.
-
For me, the solution from @xorinox fixed the kernel panics in Ubuntu Server 20.04: MDADM raid0 --> LVM raid0. Still XFS and still mounted with the discard option, so it was MDADM. Many thanks.
-
Was anyone able to capture the kernel panic details? This is likely something that should be forwarded on to either or both of the MDADM and XFS maintainers, as one of the two is triggering these panics under the heavy IO load issued by madmax.
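One low-effort way to improve the odds of catching the panic is to make the journal persistent and read the previous boot's kernel log after the crash. A minimal sketch, assuming systemd (a full lockup may still lose the final messages, in which case netconsole or kdump would be the next step):

```bash
# Make the systemd journal persistent so kernel messages survive a hard reset.
sudo mkdir -p /var/log/journal
sudo systemctl restart systemd-journald

# After the next crash and reboot, inspect the previous boot's kernel messages:
journalctl -k -b -1 --no-pager | tail -n 200
```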
-
@malventano As I pointed out in my comment, the only variable I changed was MDADM. Using LVM + XFS is rock solid, no issues. So it must be a problem with MDADM.
-
Understood, but it could still be an issue on the XFS end that is not playing nicely with MDADM. As it is, XFS doesn't like MDADM's default chunk size; I've tried custom chunk sizes that XFS agrees with better, but the issue remains.
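For completeness, a hedged sketch of lining the two up by hand. The device names and the 512 KiB chunk are only examples, and mkfs.xfs will normally detect the md geometry on its own:

```bash
# Create the array with an explicit chunk size (512 KiB here).
sudo mdadm --create /dev/md0 --level=0 --raid-devices=2 --chunk=512 \
    /dev/nvme0n1 /dev/nvme1n1

# Match XFS to it: su = chunk size, sw = number of data disks.
sudo mkfs.xfs -d su=512k,sw=2 /dev/md0
```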
-
Hmm:
...well, that's a new one. Edit 1: tried all sorts of clearing of superblocks and wipefs, but ultimately it would only work after a reboot.
-
Describe the bug
XFS is known to be the fastest filesystem for plotting, and testing bears this out. However, it also seems to reliably result in hard crashes on Ubuntu, usually within the first day of plotting. I have experienced this issue about 15 times, on a mix of Ubuntu GUI 20.04, Ubuntu GUI 21.04, and Ubuntu Server 21.04, and on three different systems: two AMD builds (3960X and 3990X) and one Intel build (i7-11700K). All systems have been using between two and four Samsung 980 Pro NVMe drives in MDADM RAID0.
The issue seems to go away when I format the temp drive RAID0 array with ext4.
To Reproduce
Format an MDADM RAID0 array of NVMe temp drives with XFS, start plotting, and observe the eventual hard crash.
Expected behavior
Plotting completes without crashing the system.
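For illustration, a minimal sketch of the kind of setup described above; the device names and mount point are placeholders:

```bash
# Two NVMe drives striped with mdadm, formatted XFS, used as the plot temp dir.
sudo mdadm --create /dev/md0 --level=0 --raid-devices=2 /dev/nvme0n1 /dev/nvme1n1
sudo mkfs.xfs /dev/md0
sudo mkdir -p /mnt/plot_temp
sudo mount /dev/md0 /mnt/plot_temp
# Plot continuously with the temp directory on /mnt/plot_temp until the crash occurs.
```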
Screenshots
On Ubuntu GUI, the desktop just completely freezes wherever it is.
On Ubuntu Server, I got this:
Desktop:
Additional context
Random theory that you can feel free to ignore: since this is an extremely high performance setup, maybe it's hitting some kind of performance threshold or race condition during plotting? Or maybe it's something else entirely XD Thank you!