Unsuitable SSD/NVMe hardware for ZFS - WD BLACK SN770 and others #14793

admnd · 2023-04-25T13:20:51Z

This originally started as a bug report, but after investigation and comments it is definitely more a hardware issue related to ZFS than a ZFS bug, so I am opening a general discussion here. Feel free to post constructive observations/ideas/workarounds/suggestions.

TL;DR: Some NVMe sticks just crash with ZFS, probably because they are unable to sustain I/O bursts. It is not clear why this happens: the controller might just crash, or a combination of firmware/BIOS/hardware might make it unstable when used in a ZFS pool.

Hardware

OS: Gentoo Linux x86/64 with kernel 6.2.12 and ZFS 2.1.11.
Hardware:
- CPU: AMD Ryzen 7950X
- Motherboard: Asus TUF Gaming X670E-Plus WiFi (upgraded to the latest available BIOS => 1410 as of 05/25/2023)
- 3x NVMe WD Black SN770 2TB with latest firmware as of 05/25/2023 (731100WD) configured with 4K sectors
- PSU: MSI 850W

Issue observed

My system zpool is composed of a single RAID-Z1 VDEV made of 3x WD Black SN770 2TB, themselves configured with 4K logical sectors (I have not tested with 512b sectors to see if the issue still happens... yet). The VDEV uses LZ4 compression and is not encrypted, and neither are the underlying modules (they do not support that). The default 128K record size is used. No L2ARC cache is used. The system has plenty of free RAM, so there is no memory pressure.

Under "normal" daily usage I did not experience anything: the zpool is regularly scrubbed and there is nothing to report. No checksum errors, no frozen tasks, no crashes, nothing; the pool completes all scrubs wonderfully well. The machine also experiences no freezes or kernel crashes/"oopses", and no stuck tasks (I reported an issue with auditd here a couple of weeks ago, but auditd is now inactive, see bug #14697). Even "emerging" big stuff like dev-qt/qtwebengine with 32 CMake jobs in parallel, or re-emerging the whole system from scratch with 32 parallel tasks and heavy packages being rebuilt at the same time, succeeds. No crashes.

However, if I use zfs send to make a backup of the system datasets to a local TrueNAS box over a 10GbE link, this is another story: most of the time one of the NVMe modules randomly crashes. The issue also happens at different points in the data transfer: sometimes after 12 GB, sometimes after 78 GB, sometimes after 93 GB and so on. If I am lucky, the operation completes successfully (less than a quarter of the time). Itchy and annoying. I have also managed to reproduce it by rsync-ing a dataset to an empty new one in the same pool, although this happens more rarely. The TrueNAS box and the network are not the cause: they run smoothly, and I can reproduce the issue locally by sending the ZFS stream to /dev/null (zfs send .... | cat > /dev/null).

When the crash happens, the following trace appears in the kernel logs:

[430771.216723] nvme nvme2: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0x10
[430771.216727] nvme nvme2: Does your device have a faulty power saving mode enabled?
[430771.216729] nvme nvme2: Try "nvme_core.default_ps_max_latency_us=0 pcie_aspm=off" and report a bug
[430771.266732] nvme 0000:13:00.0: enabling device (0000 -> 0002)
[430771.266814] nvme nvme2: Disabling device after reset failure: -19
[430771.283392] I/O error, dev nvme2n1, sector 1812765936 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 2
[430771.283397] zio pool=rpool vdev=/dev/nvme2n1p1 error=5 type=1 offset=928127770624 size=16384 flags=180880
[430771.283397] zio pool=rpool vdev=/dev/nvme2n1p1 error=5 type=1 offset=1394183585792 size=24576 flags=180880
[430771.283397] zio pool=rpool vdev=/dev/nvme2n1p1 error=5 type=2 offset=1575062740992 size=4096 flags=180880
[430771.283399] nvme2n1: detected capacity change from 3907029168 to 0
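
To see at a glance which controller dropped and how many zio errors followed, a trace like this one can be summarised with a few lines of Python. This is only a minimal sketch: the regular expressions are modelled on the exact message wording quoted in this post, which is an assumption, since other kernel or ZFS versions may word these messages differently. The function name summarize_nvme_trace is made up for illustration.

```python
import re
from collections import Counter

# Patterns modelled on the log lines quoted in this post (assumption:
# other kernel/ZFS versions may word these messages differently).
RESET_RE = re.compile(r"nvme (nvme\d+): controller is down; will reset")
ZIO_RE = re.compile(r"zio pool=(\S+) vdev=(\S+) error=(\d+)")


def summarize_nvme_trace(log_text):
    """Return (resets, zio_errors): reset count per controller, zio error count per vdev."""
    resets = Counter()
    zio_errors = Counter()
    for line in log_text.splitlines():
        match = RESET_RE.search(line)
        if match:
            resets[match.group(1)] += 1
        # finditer, because two log entries can end up fused on one line
        for zio in ZIO_RE.finditer(line):
            zio_errors[zio.group(2)] += 1
    return resets, zio_errors


SAMPLE = """\
[430771.216723] nvme nvme2: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0x10
[430771.283397] zio pool=rpool vdev=/dev/nvme2n1p1 error=5 type=1 offset=928127770624 size=16384 flags=180880
[430771.283397] zio pool=rpool vdev=/dev/nvme2n1p1 error=5 type=2 offset=1575062740992 size=4096 flags=180880
"""

resets, zio_errors = summarize_nvme_trace(SAMPLE)
print(dict(resets))
print(dict(zio_errors))
```

Fed with the full output of dmesg, the same function shows whether it is always the same controller that resets or a different one each time.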

At this point, if I am lucky enough, I can manage to bring it back to life using a sledgehammer:

echo 1 > /sys/bus/pci/devices/0000\:12\:00.0/remove
echo 1 > /sys/bus/pci/rescan

If the faulted device reappears, the zpool becomes ONLINE again and completes its resilvering (a couple of KB or MB). In the worst case, another NVMe also drops off the pool, which becomes suspended, so I have to power-cycle the machine or push its reset button. Of course, running nvme list at this point either freezes completely or lists the two remaining NVMe modules, depending on what is alive.
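
The remove/rescan trick needs the PCI address of the controller. On Linux, each NVMe controller exposes it in sysfs under /sys/class/nvme/<controller>/address, so it can be looked up rather than guessed. The sketch below only reads that attribute and builds the path of the matching "remove" file; it writes nothing. The function names and the sysfs_root parameter are hypothetical and exist only so the code can be tried against a fake directory tree. It is probably wise to note the addresses while all controllers are still alive, as the sysfs entry may be gone once a controller has been disabled.

```python
import os


def pci_address_of(controller, sysfs_root="/sys"):
    """Return the PCI address (e.g. '0000:13:00.0') of an NVMe controller such as 'nvme2'.

    Reads <sysfs_root>/class/nvme/<controller>/address. The sysfs_root
    parameter is only there so the function can be tried on a fake tree.
    """
    path = os.path.join(sysfs_root, "class", "nvme", controller, "address")
    with open(path) as f:
        return f.read().strip()


def remove_file_for(pci_address, sysfs_root="/sys"):
    """Build the path of the 'remove' attribute for a PCI device (nothing is written)."""
    return os.path.join(sysfs_root, "bus", "pci", "devices", pci_address, "remove")
```

For example, remove_file_for(pci_address_of("nvme2")) gives the file to echo 1 into, after which the rescan is done as shown above.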

My best guess so far is that the controller of the Western Digital SN770 modules is not beefy enough to handle a burst of I/O requests (knowing they have no DRAM cache), so it is brought to its knees and becomes so unresponsive that it is unable to complete a reset request on its own (no AER reported in the logs, BTW). As it is not always the same module that crashes, they do not seem to all be defective, or I am extremely unlucky. Pool scrubbing might be a bit lighter on the controller, so scrubs/resilvers work without any issue (the maximum observed speed is around 4.5~5 GB/s when scrubbing the pool, according to zpool status).

What has been tried so far

Several things! Without any improvements unfortunately:

As suggested in the error, put nvme_core.default_ps_max_latency_us=0 pcie_aspm=off on the kernel command-line;
Move the NVMe around in different slots (temperatures seem reasonable and they all have heatsinks)
Playing around with some ZFS kernel module parameters: lowering the values of zfs_vdev_sync_read_min_active, zfs_vdev_sync_read_max_active and their async counterparts (I used the same values as the defaults for zfs_vdev_scrub_min_active and zfs_vdev_scrub_max_active);
Throttling with throttle: zfs send ... | throttle -M 300 | ...
Tinkering with the blkio cgroup
Running a short S.M.A.R.T. test: nothing special to say, all three NVMe modules pass it.
Put the whole machine hardware settings on their BIOS defaults (No PBO, no RAM overclocking)
Memtesting the RAM (3 passes, no errors)
rsync-ing the system dataset on a virtual disk over iSCSI (no crash! yeah! impractical however)
zfs send from a FreeBSD live media : FreeBSD allocates a 200MB host buffer for each module but unfortunately no more success and a zfs send also hangs :/
PCIe 3.0 & 2.0 enforced on all M.2 slots => still crashes
PCIe power management set at "off" in BIOS/UEFI.
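
When testing the power-saving workaround from the list above, it is easy to edit the bootloader configuration and then boot an entry that does not carry the change. A minimal sketch to double-check that the running kernel really received both parameters could look like this. The function names are made up for illustration, and the parameter strings are simply the ones suggested by the kernel message.

```python
def missing_cmdline_params(
    cmdline,
    wanted=("nvme_core.default_ps_max_latency_us=0", "pcie_aspm=off"),
):
    """Return the wanted parameters that do not appear in a kernel command line string."""
    present = set(cmdline.split())
    return [param for param in wanted if param not in present]


def read_cmdline(path="/proc/cmdline"):
    """Read the command line the running kernel was booted with."""
    with open(path) as f:
        return f.read()
```

Calling missing_cmdline_params(read_cmdline()) returns an empty list when both parameters are active. On kernels that expose it, the effective latency value can also be read back from /sys/module/nvme_core/parameters/default_ps_max_latency_us.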

Some thoughts / ideas of tests to try

Use 512b sectors (pool has to be destroyed)
Swap the WD Black SN 850 modules of my secondary machine with those and see if this solves the issue on this machine (while being functional on the other machine)
Burn a candle

Is there a "ZFS native" way to throttle I/O operations in the case of doing a zfs send?

Has anybody here experienced something like this? If so, what are the other brands/models subject to a similar issue?

admnd · 2023-04-25T21:07:52Z

Author

Found something interesting in a proposed patch, in a discussion titled "[PATCH] nvme-pci: fix host memory buffer allocation size" dated May 10th, 2022. The discussion starts here => https://www.spinics.net/lists/kernel/msg4339024.html

At some point (https://www.spinics.net/lists/kernel/msg4352567.html), it is mentioned that:

WD SN770 NVMe are problematic (the author experiences the very same freezes as me but does not mention ZFS, so I guess that he uses a single standalone drive with something other than ZFS)
Switching the I/O scheduler to "mq-deadline" improved the situation without solving it completely.

In a subsequent message ( https://www.spinics.net/lists/kernel/msg4372632.html ) it is also mentioned that the situation has improved drastically with the patch.
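
For anyone wanting to repeat the scheduler experiment: the active I/O scheduler of a block device is shown in /sys/block/<device>/queue/scheduler, with the active one in square brackets. Here is a small sketch that reads and parses that file. The helper names and the sysfs_root parameter are hypothetical, and the code only reads, it does not switch anything.

```python
import re


def parse_scheduler(text):
    """Split the content of a queue/scheduler file into (active, available)."""
    available = text.replace("[", "").replace("]", "").split()
    match = re.search(r"\[([^\]]+)\]", text)
    active = match.group(1) if match else None
    return active, available


def read_scheduler(device, sysfs_root="/sys"):
    """Read and parse the scheduler file of a block device such as 'nvme0n1'."""
    with open(f"{sysfs_root}/block/{device}/queue/scheduler") as f:
        return parse_scheduler(f.read())
```

Switching is done by writing the scheduler name into the same file as root (for example echo mq-deadline > /sys/block/nvme0n1/queue/scheduler); that setting does not survive a reboot on its own.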

Another point of the discussion is the Host Memory Buffer being just 32 MiB. According to my logs, I have the same allocation:

[    3.264207] nvme nvme2: pci function 0000:08:00.0
[    3.264207] nvme nvme1: pci function 0000:0e:00.0
[    3.264207] nvme nvme0: pci function 0000:04:00.0
[    3.302554] nvme nvme2: allocated 32 MiB host memory buffer.
[    3.303343] nvme nvme0: allocated 32 MiB host memory buffer.
[    3.303721] nvme nvme1: allocated 32 MiB host memory buffer.
[    3.306596] nvme nvme2: 32/0/0 default/read/poll queues
[    3.307029] nvme nvme0: 32/0/0 default/read/poll queues
[    3.307622] nvme nvme1: 32/0/0 default/read/poll queues

For the record, here are excerpts of some messages:

Taken from https://www.spinics.net/lists/kernel/msg4352567.html :

On my current setup (WD SN770 on ThinkPad X1 Carbon Gen9) frequently the NVME
controller stops responding. Switching from no scheduler to mq-deadline reduced
this but did not eliminate it.
Since switching to HMB of 1 * 200MiB and no scheduler this did not happen anymore.
(But I'll need some more time to gain real confidence in this)

Initially I assumed that the PAGE_SIZE * MAX_ORDER_NR_PAGES was indeed
meant as a minimum for DMA allocation.
As that is not the case, removing the min() completely instead of the max() I
proposed would obviously be the correct thing to do.

Taken from https://www.spinics.net/lists/kernel/msg4372632.html :

So this patch dramatically improves the stability of my disk.
Without it and queue/scheduler=none the controller stops responding after a few
minutes. mq-deadline reduced it to every few hours.
With the patch it happens roughly once a week.

Current parameters for the nvme kernel modules on my system are on their defaults:

parm:           use_threaded_interrupts:int => 0
parm:           use_cmb_sqes:use controller's memory buffer for I/O SQes (bool) => Y
parm:           max_host_mem_size_mb:Maximum Host Memory Buffer (HMB) size per controller (in MiB) (uint) => 128
parm:           sgl_threshold:Use SGLs when average request segment size is larger or equal to this size. Use 0 to disable SGLs. (uint) => 32768
parm:           io_queue_depth:set io queue depth, should >= 2 and < 4096 => 1024
parm:           write_queues:Number of queues to use for writes. If not set, reads and writes will share a queue set. => 0
parm:           poll_queues:Number of queues to use for polled IO. => 0
parm:           noacpi:disable acpi bios quirks (bool) => N

Going through the code of drivers/nvme/host/pci.c (checked with a 6.2.12 Linux kernel) suggests that the famous patch has not been applied, because the "min_t" is still there:

static int nvme_alloc_host_mem(struct nvme_dev *dev, u64 min, u64 preferred)
{
        u64 min_chunk = min_t(u64, preferred, PAGE_SIZE * MAX_ORDER_NR_PAGES);
        u64 hmminds = max_t(u32, dev->ctrl.hmminds * 4096, PAGE_SIZE * 2);
        u64 chunk_size;

        /* start big and work our way down */
        for (chunk_size = min_chunk; chunk_size >= hmminds; chunk_size /= 2) {
                if (!__nvme_alloc_host_mem(dev, preferred, chunk_size)) {
                        if (!min || dev->host_mem_size >= min)
                                return 0;
                        nvme_free_host_mem(dev);
                }
        }

        return -ENOMEM;
}

The patch in question is mentioned at the very beginning of the discussion and is this one:

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 3aacf1c0d5a5..0546523cc20b 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -2090,7 +2090,7 @@ static int __nvme_alloc_host_mem(struct nvme_dev *dev, u64 preferred,
 
 static int nvme_alloc_host_mem(struct nvme_dev *dev, u64 min, u64 preferred)
 {
-	u64 min_chunk = min_t(u64, preferred, PAGE_SIZE * MAX_ORDER_NR_PAGES);
+	u64 min_chunk = max_t(u64, preferred, PAGE_SIZE * MAX_ORDER_NR_PAGES);
 	u64 hmminds = max_t(u32, dev->ctrl.hmminds * 4096, PAGE_SIZE * 2);
 	u64 chunk_size;

Another related thread is here => https://lore.kernel.org/linux-nvme/[email protected]/
Quoting:

I am wondering about the calculation of the NVMe Host Memory Buffer sizes.
It seems to me that the current algorithm to calculate this size does not lead
to an optimal result.

Hardware information:
mn : WD_BLACK SN770 1TB
fr : 731030WD
hmpre : 51200 (limited by max_host_mem_size_mb to 32768 -> 128MiB)
hmmin : 823
hmminds : 0
hmmaxd : 8

To me this looks like the disk wants 200MiB allocated that can be described in
eight descriptors.
However the kernel log has the following entry:

[ 8.981685] nvme nvme0: allocated 32 MiB host memory buffer.

Tracing through drivers/nvme/host/pci.c the following happens:

The loop in nvme_alloc_host_mem() is only entered once.
min: 3371008
preferred: 134217728
min_chunk: 4194304
chunk_size: 4194304

Now in __nvme_alloc_host_mem() the loop is called the eight times for hmmaxd,
each time allocating 4194304 bytes (4 MiB).
The end result is that a total of 32MiB of Host Memory Buffer are allocated
which is the bare minimum instead of the 200 MiB that are preferred and
available.

It seems that the logic to calculate min_chunk in nvme_alloc_host_mem() starts
with a too small value.

All of this is on a normal x86 laptop with plenty of system memory.
It's reproducible with current git (46cf2c613f4b10eb12f749207b0fd2c1bfae3088)
and 5.17.4.
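
The arithmetic in the quoted message can be replayed with a small model. This is a simplified sketch, not the driver code: it assumes every DMA allocation succeeds (so the halving loop never kicks in), a 4096-byte page and MAX_ORDER_NR_PAGES = 1024 (chosen because it gives the 4194304-byte chunk quoted above), and it follows the descriptor-count logic as described in the message. The function name hmb_allocation and its parameters are made up for illustration.

```python
MIB = 1024 * 1024
PAGE_SIZE = 4096            # assumption: x86-64 page size
MAX_ORDER_NR_PAGES = 1024   # assumption: yields the 4 MiB chunk quoted above
HMB_UNIT = 4096             # hmpre/hmmin are expressed in 4 KiB units


def hmb_allocation(hmpre, hmmaxd, max_host_mem_size_mb=128, patched=False):
    """Model the HMB size and descriptor count, assuming every allocation succeeds.

    Returns (total_bytes, descriptors).
    """
    preferred = min(hmpre * HMB_UNIT, max_host_mem_size_mb * MIB)
    limit = PAGE_SIZE * MAX_ORDER_NR_PAGES
    # The stock driver uses min_t(); the proposed patch swaps it for max_t().
    chunk = max(preferred, limit) if patched else min(preferred, limit)

    max_entries = -(-preferred // chunk)  # ceiling division
    if hmmaxd and hmmaxd < max_entries:
        max_entries = hmmaxd

    total = 0
    descriptors = 0
    while total < preferred and descriptors < max_entries:
        total += min(chunk, preferred - total)
        descriptors += 1
    return total, descriptors


# Stock driver, default max_host_mem_size_mb=128
print(hmb_allocation(51200, 8))
# Patched driver with nvme.max_host_mem_size_mb=512
print(hmb_allocation(51200, 8, max_host_mem_size_mb=512, patched=True))
```

Under these assumptions the first call yields 8 descriptors of 4 MiB, i.e. the 32 MiB seen in the kernel log, and the second yields a single 200 MiB descriptor, which matches the "1 * 200MiB" mentioned in the mailing-list excerpt.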

admnd · 2023-04-25T23:23:25Z

Author

I tried the above patch but, in my case, it worsens the issue :( The crash happens much earlier than before.
Fiddling around with the parameters of nvme.ko, I managed to get a higher allocation of 200 MB with nvme.max_host_mem_size_mb=512 + the above patch applied.

admnd · 2023-04-25T23:54:25Z

Author

Basically, at this point, I am out of options with those sticks. They are a replacement for a trio of ADATA Gammix S70 Blade which were also problematic because their namespaces had a bad EUI64 value: all were set to eui64=0000000000000000, which made the system totally confused about who was who.

So my only option at this point is to get another model :/ Perhaps I will keep them for a much-less intensive use.

Reality is: not all NVMe hardware plays nicely with ZFS. It seems that investing in higher-end hardware is not optional, especially with ZFS. I won't ever consider switching them back to 512b sectors: I don't think this will solve the issue and, even if it does, there is a significant performance penalty.

Hopefully my hours of investigation will save someone from wasting money on junk hardware. It is a bit disappointing that this junk is coming from a well-known brand.

PS: Feel free to elaborate further. I will post if I get something new on this.

IvanVolosyuk · 2023-04-26T03:45:03Z

I would try replacing the PSU with another one, probably a 1000W one. Mysterious problems often end up being solved by replacing a faulty PSU.

2 replies

admnd Apr 26, 2023
Author

Thank you for this suggestion. It is still plausible and I am keeping it in mind. However, it is very unlikely to be the cause here, for three reasons: 1. the PSU is not even at half load, 2. I would have seen other symptoms while the machine is under very high load or while a scrub is running, 3. someone experienced a similar issue with other hardware (and managed to fix it).

Having to replace the PSU means throwing a significant amount of cash at a test that might not succeed. Better to save the money for beefier modules. But if I manage to get one, one way or another, it is worth a try. I might try swapping the PSU for a trusted one I have in my secondary machine (not 1000W however), as I cannot reproduce the issue on it (3x SN850, working flawlessly with ZFS since day 1).

Manawyrm Jun 22, 2023

NVMe storage uses 3.3V supply voltage, which gets created locally on the mainboard from the 12V (sometimes 5V) supply rails on basically all mainboards. The 3.3V rail on the ATX connector is unused on most boards.
If that doesn't work properly, the mainboard is at fault.

Flaaxxx · 2023-04-26T06:58:12Z

This might be a longshot, but where have you connected your NVMe? Did you use the onboard slots or a riser card with bifurcation? And if you used the onboard slots which ones did you use?

From the manual you can see that one of the slots shares bandwidth with the SATA ports; if there is anything in there, it could cause a problem. Furthermore, X670 daisy-chains two chipsets to give more connectivity. A guess of mine is that this issue could be caused by limited bandwidth between the chipsets and the CPU, which might make the controller look like it is dropping.

My suggestion to troubleshoot this is to get a bifurcating riser card, put it in the x16 slot and have all the NVMe drives connected directly to the CPU. This would avoid going through the chipsets.

Unfortunately ASUS has no block diagram of the board showing which PCIe lanes go where and at which speed. But I would check whether limiting the speed of the drives could also be causing this issue. PCIe link speed switching caused me a lot of headaches with my RX 5700 XT GPU: it caused some weird issues of it disconnecting, crashing the drivers, etc. So pretty similar to what you are experiencing.

Those 2 would be my guesses for this issue.

1 reply

admnd Apr 26, 2023
Author

Very savvy, thank you. I have no riser here to try your first suggestion (as 7950X has a built in GPU I can pull out the dGPU) this week. But what I can do is to rebuild a pool with 2x NVMe in mirror rather than 3 in RAID-Z1 and see what would happen.

Indeed, the description is a bit hidden in the technical details:
https://www.asus.com/ca-en/motherboards-components/motherboards/tuf-gaming/tuf-gaming-x670e-plus-wifi/techspec/

The paragraph "Storage" says:

AMD Ryzen™ 7000 Series Desktop Processors
M.2_1 slot (Key M), type 2242/2260/2280 (supports PCIe 5.0 x4 mode)
M.2_3 slot (Key M), type 2242/2260/2280 (supports PCIe 4.0 x4 mode)
AMD X670 Chipset
M.2_2 slot (Key M), type 2242/2260/2280/22110 (supports PCIe 3.0 x4 & SATA modes)**
M.2_4 slot (Key M), type 2242/2260/2280 (supports PCIe 4.0 x4 mode)

The actual configuration is one NVMe module in M.2_1, one in M.2_2 and the third in M.2_3. Two of them connected directly to the CPU, the third going via the chipset. I also tried M.2_1, M.2_3, M.2_4 but with similar results. BIOS being on auto settings, they run at their native speed (PCIe 4.0). I will try to lower to PCIe 3.0 or even 2.0 and see what happens.
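
Before and after forcing a lower PCIe generation in the BIOS, the negotiated link can be checked from Linux: each PCI device has current_link_speed, max_link_speed, current_link_width and max_link_width attributes in sysfs. The sketch below converts the reported transfer rate into a PCIe generation. The function names and the sysfs_root parameter are hypothetical, and the exact string format of the speed attribute (with or without a trailing "PCIe") varies with the kernel version, which is why only the leading number is parsed.

```python
# Per-lane transfer rate in GT/s mapped to the PCIe generation.
GEN_BY_RATE = {2.5: 1, 5.0: 2, 8.0: 3, 16.0: 4, 32.0: 5}


def pcie_generation(link_speed):
    """Map a sysfs link speed string such as '16.0 GT/s PCIe' to a generation, or None."""
    try:
        rate = float(link_speed.split()[0])
    except (IndexError, ValueError):
        return None
    return GEN_BY_RATE.get(rate)


def link_status(pci_address, sysfs_root="/sys"):
    """Return (current_gen, max_gen, current_width, max_width) for one PCI device."""
    base = f"{sysfs_root}/bus/pci/devices/{pci_address}"

    def read(name):
        with open(f"{base}/{name}") as f:
            return f.read().strip()

    return (
        pcie_generation(read("current_link_speed")),
        pcie_generation(read("max_link_speed")),
        read("current_link_width"),
        read("max_link_width"),
    )
```

A module that reports a lower current generation or width than its maximum is either being limited by the slot or BIOS setting, or has renegotiated its link, which is worth knowing when comparing slots.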

I have the impression of being just above a certain threshold, not that far away.

Lyndeno · 2023-04-26T13:55:33Z

It's interesting you're having issues with the SN770.

I was having issues with mine (2TB as well) in my laptop. ZFS, Btrfs on LVM/LUKS, even ext4: my drive would reset just like yours. Whether during boot, when sitting there doing nothing, or while doing something. Seemingly random.

I took it to my computer store to get it replaced. Through their testing the drive passed all tests, so they did not replace it. I believe they were testing with Windows.

I am going to RMA it with WD, hopefully my replacement performs better.

I have the exact same drive in my desktop (X570, 5950X), using a single ZFS vdev as root. I have not experienced these issues. I would try putting the desktop drive in my laptop (XPS 9560) to see if it has issues, but that would be quite an inconvenience to me. So I am just going to RMA it. The previous drive in my laptop did not have these issues.

This stuff occurred with both 512b and 4kb sectors I believe.

4 replies

admnd Apr 27, 2023
Author

It seems some other people around encounter problems with this model (see the links in the next comment). This model has no DRAM cache and seems very prone to crashing, apparently even when idle. That is definitely not expected (a performance loss, however, was).

My guess is that these target the general market, where I/O is not that heavy, only one module is used and the machine is not up 24 hours a day. Thus, WD engineers have (maybe) not put them under high stress because it is not the use case they are supposed to fit ;) WD is a reputable brand and products are tested. But no company is too big to fail, and sometimes mistakes or more-or-less-stupid management decisions are made for various reasons: not having thought about a detail, cutting costs with sub-standard components, etc. This is pure speculation at this point; I cannot tell what the real cause is, as I do not work at WD or have contacts there.

I am curious to see whether your replacement improves your situation or is just as unstable as the replaced SN770.

"If you want high performance NVMe, use a model with DRAM". I learnt life the hard way on this one.

Lyndeno Apr 27, 2023

We'll see, I still have to send it in. But the SN770 in my desktop has been performing well, no errors to report.

I would have got a Firecuda (I do like Seagate, and in the case of nvme firmware upgrades, Seagate is way more Linux friendly) but for the capacity, it was almost double the price.

I am not doing heavy i/o normally, but I do game, compile and stuff on this computer and the WD has been performing fine. Which is why I suspect (I hope) it's a faulty drive in my laptop.

mabra Jun 22, 2023

Saw your message late.
I started with two FireCudas.
The first died after 4 weeks, the other one is causing pool crashes and gives messages like this:
Device: /dev/nvme0, number of Error Log entries increased from 756 to 760
According to the specs, they have ram as cache.
The WD never caused a problem for me.
Can say this, because my storage crashed again yesterday.

Lyndeno Aug 19, 2024

An update to my situation.

The RMA SN770 replacement was exhibiting the same issues on my laptop.

I have been running two SN770 in my desktop in a ZFS mirror for around a year and a half now. Recently, one of them is resetting/disconnecting, degrading the pool.

I have not checked to see if it is the same drive each time.

It seems to happen as a result of something. Sometimes, simply logging in to Gnome causes it to happen. Not sure why, as this pool does not hold any root files.

I also noticed it sometimes occurs when my phone starts backing up to immich, I have the postgres database stored on that mirror. I have not yet tried any troubleshooting, kernel params, settings, etc. Only change I have made is turn on the fan on my Hyper M.2 card. Still occurs occasionally.

My Firecuda 520 root on XFS has been rock solid for four years.

admnd · 2023-04-26T18:37:36Z

Author

Other pointers (FreeBSD):

At this point, I have opened a case with WD; perhaps something can be done at their level. As I should have some free time tomorrow, I will try to exchange modules between my two machines.

3 replies

Lyndeno Apr 26, 2023

These are similar to other posts I have seen (different drives) where the power supply was the issue.

I am hoping that is not the case for my laptop. I guess I could replace the battery? But I just got a new battery last year.

In the meantime, I will continue with my RMA with WD. I hope it is simply a bad drive.

Lyndeno May 24, 2023

I have received a replacement SN770, within two days that drive started exhibiting the same problems as the last one.

I have ordered a cheap 1TB Timetec SSD for my laptop. It has been a few days so far and no issues. I will put the SN770 into my desktop to go with the other one. There seems to be some incompatibility between the drive model and my laptop.

Lyndeno Aug 19, 2024

See my other comment #14793 (reply in thread), the WD drives are exhibiting these issues on my Asus desktop now.

admnd · 2023-04-28T00:27:02Z

Author

SN770 Swapped out for 3x WD SN 850 configured in 4K. Day & night! My 7950X is literally breathing again! Over 100K IOPS while emerging GCC 13, zpool scrubs are going easily to 5-6 GB/s.

Earlier this afternoon, I tried to swap one module at a time. Guess what? One SN 770 quit the pool seconds after the resilvering started, the second reset in the middle. I had thousands of checksum errors reported. Fortunately I have daily snapshots stored on a TrueNAS box, so not an issue. This junk is not even able to sustain a pool resilvering.

So, gentlemen, moral of the story: don't use DRAM-less NVMe stuff with ZFS.
The troubles they bring are not worth it, not counting that they are a real bottleneck.

Will give news on what happens with my now famous SN 770 when I have some :) Perhaps they will do better in my secondary machine or in the junk-box.

Thank you, again, for jumping in and taking some of your time to put suggestions here. This is greatly appreciated.

2 replies

mabra Jun 22, 2023

This does not explain why each scrub/resilver works fine for me with this model.
In contrast to my FireCuda, it does not even log errors.
For me, all the crashes followed a "return from hibernate", though not directly.

Lyndeno Aug 19, 2024

Scrubs also work just fine for me after rebooting after having one of the drives reset. Full speed

mabra · 2023-05-23T18:24:10Z

Stumbled over this while searching for the consequences of my pool crash.
Just a side note: I am not as deep into Linux and modern hardware as in earlier times.
I have been using a ZFS mirror of two NVMe drives, a "Seagate FireCuda 520 SSD ZP2000" (2 TB) and a "WD_BLACK SN770 2TB" (2 TB), in the original place on a Supermicro H12SSL-C motherboard with an AMD EPYC 7252 (8 core) for about a year.
Originally, I started with two of the FireCudas, but one gave up very early. I have had this experience with Seagate over and over in my lifetime, and to get an immediate replacement (because it is only a mirror), I bought the WD and was able to recover.
The first failed FireCuda was completely dead; it looks like a hardware-only failure.
The crash, which led to the loss of my complete pool, happened immediately after a return from hibernate (it is a workstation) ...,
which fails very often (using Debian 11) with kernel 6.1 (installed 14 days before!!).
I see no evidence that this is a ZFS problem; more likely the kernel ...
At this crash of 2023-05-19, the WD was the first one that was checked, but the second (immediately following) line was the FireCuda. The order MAY say nothing, even though the ZED mails arrived in the same order.
Just as a note.

1 reply

mabra Jun 28, 2023

Found the debate about ZFS + hibernate late, yesterday. On the one side there is even talk of something like "hibernate should not be used with ZFS", and on the other side people are working on patches.
Now I can see that my observations, for my crash scenarios, were quite right: it happened always and only after a return from hibernate. No crashes or errors otherwise with the mentioned disks (WD/Seagate).

gregorst3 · 2023-06-08T18:07:46Z

Hello @admnd, I'm experiencing the same problems on my server infrastructure. I recently added this WD NVMe (SN850X) just for some low-spec VMs that I preferred not to run on my main NVMe storage, composed of several PM9A3.
As soon as I installed that NVMe, I got woken up during the night by a crash on my servers (random time, every x days).
I found out that this can be related to a firmware problem on these NVMe drives. I had to temporarily boot a Windows machine to update the firmware of the SN850X (because they provide the tool only for Windows), and after that it seems like the problem is gone.

3 replies

admnd Jun 8, 2023
Author

No issues here with a pool composed of sn850x modules (and an older one with sn850 modules) but yes it is recommended to apply the latest updates from the manufacturer and, personally, this is the very first thing I do when I unbox a NVMe.

The issue appears with the SN770 and probably some other DRAM-less NVMe drives. Perhaps WD will release a fix in the future that corrects the issue; until then, avoid that model.

posixpoet Mar 25, 2024

Firmware upgrades with Linux:
https://community.frame.work/t/western-digital-drive-update-guide-without-windows-wd-dashboard/20616

Thaodan Nov 29, 2024

I have also a WD SN850X. I never experienced issues with 4K LBA. Firmware is 620311WD.

Maybe this is something that is fixed in some WD SSDs but not in others.

x0rzavi · 2023-06-29T15:35:39Z

I don't know if it's related somehow, but here's my 2 cents.

I had an SN570 500GB (DRAM-less) NVMe, which was actually quite new (less than 1 year old). I never had any issues initially with ZFS and Gentoo on it, and had been using ZFS for the last 5 months. Until recently, when I started noticing random kernel crashes and ZFS status reporting permanent errors while scrubbing. My RAM was perfectly fine, judging from the fact that memtest86+ passed twice consecutively.

To my surprise, upon rebooting to Windows, WD Dashboard reported that "NVM subsystem reliability has degraded" with 99% lifetime remaining. Even SMART tests started failing. And unfortunately, the drive had to be replaced.

dm17 · 2023-07-04T17:55:18Z

Would be cool for a "ZFS NVMe Recommendations List" to come out of this discussion.

I imagine SLC and MLC NVMes would be above the rest. What are the other criteria of which ZFS users should be aware when identifying the best SSD hardware?

3 replies

justinclift Sep 1, 2023

As a potential starting point for this, these are the NVMe drive models we're using in our production servers (no issues at all for 12+ months):

SAMSUNG MZVL21T0HCLR-00B00 - 1TB model
KXG60ZNV1T02 TOSHIBA - 1TB model
SAMSUNG MZQLB1T9HAJR-00007 - 2TB model
SAMSUNG MZVLB1T0HBLR-00000 - 1TB model

They're all configured on our servers as ZFS mirrors, using two of each model per server. So, one server will have (say) 2x SAMSUNG MZVL21T0HCLR-00B00 1TB. Another server might have (say) 2x SAMSUNG MZQLB1T9HAJR-00007 2TB, etc.

justinclift Feb 23, 2024

~~For consumer level NVMe drives, the 2x (ZFS mirrored) 1TB Crucial CT1000P5SSD8 drives in my workstation have been working without issue since July 2021.~~

~~Would buy them again, but they don't seem to be available for sale any more. 😵‍💫~~

Since writing the above I've moved to using SAS drives (any generation really, but SAS3+ preferred) and no longer use consumer drives in my systems.

Ironically, it's actually cheaper to buy an Ebay SAS controller + a bunch of 2nd hand SAS SSDs (mostly with ~95% of their endurance left) than buy brand new SATA drives. And the SAS ones often have ~40x the endurance of consumer SATA drives. (!)

justinclift Jun 28, 2024

On the Proxmox forums, the Kingston DC1000B NVMe drives seem to be commonly recommended:

https://www.kingston.com/en/ssd/dc1000b-data-center-boot-ssd

Unfortunately they're tiny (480GB max), and the write speed of even those "large" 480GB ones is around SATA speeds. Their rated endurance is only 475TBW (.5 DWPD/5 years) so not great for write heavy use cases either.

rodrigoaguilera · 2023-08-24T12:44:01Z

I think I'm suffering from this on an 8TB Corsair MP600 PRO NH used as additional storage for a Proxmox 8. rsync especially seems to trigger it.

The sledgehammer solution:

echo 1 > /sys/bus/pci/devices/0000\:12\:00.0/remove
echo 1 > /sys/bus/pci/rescan

Brings back the device for me but the zfs pool doesn't come back. I think it is because proxmox creates the pool with a /dev/nvme0nX and the X changes with every "resurrection".

I'm going to try ext4 next on that device and see how it goes.

I wanted to post here in case there are more people with the same device and similar problems.

2 replies

rodrigoaguilera Aug 28, 2023

Been stressing the drive with ext4 for a few days with fio, rsync and various file copying operations and no problem so far, 4 days uptime. With ZFS the controller died after 15-20 minutes of IO.

In the post above I forgot to mention that I was on the latest firmware, 51.3.

I won't be testing more on that drive with ZFS so I can't provide more info.

kftsehk Oct 18, 2023

Have you tried forcing fsync in the ext4 test? rsync --fsync or similar.

For the /dev/nvme0nX change, use /dev/disk/by-id/<find-your-disk-partition-id>. This ID won't change when the drive is unplugged or resurrected.
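
To illustrate the by-id suggestion: the entries under /dev/disk/by-id are symlinks to the kernel device nodes, so the stable name for a given device can be found by resolving them. A minimal sketch, assuming GNU `readlink -f`; the function name `by_id_for` and its optional second argument are made up for illustration.

```shell
# by_id_for DEVICE [BY_ID_DIR]
# Print every symlink in BY_ID_DIR (default /dev/disk/by-id) that
# resolves to DEVICE. Hypothetical helper for illustration only.
by_id_for() {
    target=$(readlink -f "$1")
    [ -n "$target" ] || return 1
    dir=${2:-/dev/disk/by-id}
    for link in "$dir"/*; do
        if [ "$(readlink -f "$link")" = "$target" ]; then
            printf '%s\n' "$link"
        fi
    done
}

# Example (device name is an assumption):
#   by_id_for /dev/nvme0n1
# An existing pool can usually be switched to these names with:
#   zpool export <pool> && zpool import -d /dev/disk/by-id <pool>
```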

agrenott · 2023-10-14T19:53:49Z

agrenott
Oct 14, 2023

Just FYI, I had the exact same issue with a brand new WD BLACK SN770, and swapping my PSU solved the issue (while my previous one seemed perfectly fine)...

5 replies

agrenott Dec 6, 2023

Sad news, it's in fact not (only?) the power supply.
Just faced the issue on the exact same physical config after updating to the latest Proxmox version (so not sure whether this is kernel and/or ZFS version related).
Kernel: Linux proxmox 6.5.11-6-pve #1 SMP PREEMPT_DYNAMIC PMX 6.5.11-6 (2023-11-29T08:32Z) x86_64 GNU/Linux
ZFS:

zfs-2.2.0-pve4
zfs-kmod-2.2.0-pve4

Skaronator Dec 6, 2023

Just offtopic, but make sure to update to 2.2.1 due to the data corruption bug "in" 2.2.0

agrenott Dec 6, 2023

Thanks! According to the release notes it has been backported into zfs-kmod-2.2.0-pve4.

justinclift Dec 6, 2023

Pretty sure there was some kind of serious bug found in 2.2.1 as well, so a 2.2.2 release should be out in short order.

fmagin Dec 7, 2023

Yes 2.2.1 had another similar looking issue, but it only showed up if you were using 4k sectors with LUKS #15533

kftsehk · 2023-10-18T21:47:00Z

kftsehk
Oct 18, 2023

[430771.216723] nvme nvme2: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0x10
[430771.216727] nvme nvme2: Does your device have a faulty power saving mode enabled?

The last time I saw this, it was a firmware or hardware issue. An RMA sometimes solves it, if they send you a unit with a newer firmware version or with a known internal defect fixed.

I would suggest not buying the same brand and model from the same batch for all vdevs in a pool; that might put you at risk of faulting all disks at once if there is ever a hardware, firmware, or manufacturing issue.

0 replies

mainTAP · 2024-01-27T20:15:02Z

mainTAP
Jan 27, 2024

I'm having similar issues with two WD SN570 in a ZFS mirror. This started to happen after upgrading Proxmox from 8.0.3 to 8.1.4.

[Sat Jan 20 00:57:44 2024] nvme nvme0: I/O 778 (I/O Cmd) QID 1 timeout, aborting                 
[Sat Jan 20 00:57:44 2024] nvme nvme0: I/O 938 (I/O Cmd) QID 5 timeout, aborting            
[Sat Jan 20 00:57:44 2024] nvme nvme0: I/O 794 (I/O Cmd) QID 7 timeout, aborting            
[Sat Jan 20 00:57:47 2024] nvme nvme0: I/O 795 (I/O Cmd) QID 7 timeout, aborting            
[Sat Jan 20 00:57:47 2024] nvme nvme0: I/O 830 (I/O Cmd) QID 8 timeout, aborting            
[Sat Jan 20 00:58:14 2024] nvme nvme0: I/O 778 QID 1 timeout, reset controller              
[Sat Jan 20 00:59:25 2024] nvme nvme0: Device not ready; aborting reset, CSTS=0x1
[Sat Jan 20 00:59:25 2024] nvme nvme0: Abort status: 0x371                                                                      
[Sat Jan 20 00:59:25 2024] nvme nvme0: Abort status: 0x371                                                                      
[Sat Jan 20 00:59:25 2024] nvme nvme0: Abort status: 0x371                                                                      
[Sat Jan 20 00:59:25 2024] nvme nvme0: Abort status: 0x371                                                                      
[Sat Jan 20 00:59:25 2024] nvme nvme0: Abort status: 0x371

And after replacing that drive with a new SN570, the other one crashed too:

[Sat Jan 27 17:01:22 2024] nvme nvme0: I/O 415 (I/O Cmd) QID 7 timeout, aborting
[Sat Jan 27 17:01:22 2024] nvme nvme0: I/O 780 (I/O Cmd) QID 1 timeout, aborting
[Sat Jan 27 17:01:22 2024] nvme nvme0: I/O 416 (I/O Cmd) QID 7 timeout, aborting
[Sat Jan 27 17:01:22 2024] nvme nvme0: I/O 417 (I/O Cmd) QID 7 timeout, aborting
[Sat Jan 27 17:01:22 2024] nvme nvme0: I/O 418 (I/O Cmd) QID 7 timeout, aborting
[Sat Jan 27 17:01:53 2024] nvme nvme0: I/O 780 QID 1 timeout, reset controller
[Sat Jan 27 17:03:04 2024] nvme nvme0: Device not ready; aborting reset, CSTS=0x1
[Sat Jan 27 17:03:04 2024] nvme nvme0: Abort status: 0x371
[Sat Jan 27 17:03:04 2024] nvme nvme0: Abort status: 0x371
[Sat Jan 27 17:03:04 2024] nvme nvme0: Abort status: 0x371
[Sat Jan 27 17:03:04 2024] nvme nvme0: Abort status: 0x371
[Sat Jan 27 17:03:04 2024] nvme nvme0: Abort status: 0x371
[Sat Jan 27 17:03:14 2024] nvme nvme0: Device not ready; aborting reset, CSTS=0x1
[Sat Jan 27 17:03:14 2024] nvme nvme0: Disabling device after reset failure: -19
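
The failure signatures quoted in this thread (I/O timeouts, "controller is down", "Disabling device after reset failure") can be counted in saved kernel logs to compare how often a drive drops out before and after a change. A minimal sketch; the function name and the pattern list are assumptions drawn only from the log lines posted here, not an exhaustive list of NVMe driver messages.

```shell
# count_nvme_failures: read kernel log text on stdin and print the number
# of lines matching the failure messages quoted in this thread.
# Hypothetical helper for illustration only.
count_nvme_failures() {
    grep -cE 'nvme[0-9]*: (controller is down|I/O [0-9]+ .*timeout|Disabling device after reset failure)' || true
}

# Example:
#   dmesg | count_nvme_failures
```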

3 replies

justinclift Jan 28, 2024

@mainTAP Would you be ok to check them both and see what sector size they're using?

mainTAP Jan 28, 2024

Both are 512

justinclift Jan 28, 2024

Damn. So much for the theory that it could purely be a 512b vs 4k sector size thing then.
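
For anyone wanting to answer the same sector-size question on Linux, the block layer exposes the logical block size in sysfs. A minimal sketch; the function name and the optional sysfs root argument are assumptions for illustration.

```shell
# logical_block_size DEVICE [SYSFS_ROOT]
# Print the logical sector size (in bytes) that Linux reports for a block
# device, e.g. "512" or "4096". DEVICE is the kernel name such as nvme0n1.
# Hypothetical helper for illustration only.
logical_block_size() {
    cat "${2:-/sys}/block/$1/queue/logical_block_size"
}

# Example (device name is an assumption):
#   logical_block_size nvme0n1
```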

jpsalm · 2024-02-22T19:06:51Z

jpsalm
Feb 22, 2024

I'm also having the same issue with btrfs across two different WD Black SN770 2TB devices with the controller shutting off in certain load situations.

[ 1063.964588] nvme nvme2: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0x10
[ 1063.964594] nvme nvme2: Does your device have a faulty power saving mode enabled?
[ 1063.964595] nvme nvme2: Try "nvme_core.default_ps_max_latency_us=0 pcie_aspm=off" and report a bug
[ 1064.031255] nvme 0000:23:00.0: enabling device (0000 -> 0002)
[ 1064.031304] nvme nvme2: Disabling device after reset failure: -19
[ 1064.047938] I/O error, dev nvme2n1, sector 1085480216 op 0x0:(READ) flags 0x0 phys_seg 5 prio class 2
[ 1064.047942] I/O error, dev nvme2n1, sector 23377696 op 0x0:(READ) flags 0x1000 phys_seg 4 prio class 2
[ 1064.047944] I/O error, dev nvme2n1, sector 1085484280 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 2
[ 1064.047949] I/O error, dev nvme2n1, sector 437010120 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 2
[ 1064.047949] I/O error, dev nvme2n1, sector 437009712 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 2
[ 1064.047950] BTRFS error (device dm-0): bdev /dev/mapper/system errs: wr 0, rd 1, flush 0, corrupt 0, gen 0
[ 1064.047951] I/O error, dev nvme2n1, sector 1085483848 op 0x0:(READ) flags 0x0 phys_seg 2 prio class 2
[ 1064.047951] BTRFS error (device dm-0): bdev /dev/mapper/system errs: wr 0, rd 2, flush 0, corrupt 0, gen 0
[ 1064.047954] BTRFS error (device dm-0): bdev /dev/mapper/system errs: wr 0, rd 3, flush 0, corrupt 0, gen 0
[ 1064.047956] BTRFS error (device dm-0): bdev /dev/mapper/system errs: wr 0, rd 4, flush 0, corrupt 0, gen 0
[ 1064.047956] BTRFS error (device dm-0): bdev /dev/mapper/system errs: wr 0, rd 7, flush 0, corrupt 0, gen 0
[ 1064.047956] BTRFS error (device dm-0): bdev /dev/mapper/system errs: wr 0, rd 6, flush 0, corrupt 0, gen 0
[ 1064.047957] BTRFS error (device dm-0): bdev /dev/mapper/system errs: wr 1, rd 7, flush 0, corrupt 0, gen 0
[ 1064.047958] BTRFS error (device dm-0): bdev /dev/mapper/system errs: wr 1, rd 11, flush 0, corrupt 0, gen 0
[ 1064.047958] BTRFS error (device dm-0): bdev /dev/mapper/system errs: wr 1, rd 8, flush 0, corrupt 0, gen 0
[ 1064.047996] BTRFS error (device dm-0): failed to run delayed ref for logical 223696154624 num_bytes 4096 type 178 action 1 ref_mod 1: -5
[ 1064.048006] BTRFS error (device dm-0: state A): Transaction aborted (error -5)
[ 1064.048008] BTRFS: error (device dm-0: state A) in btrfs_run_delayed_refs:2249: errno=-5 IO failure
[ 1064.048011] BTRFS info (device dm-0: state EA): forced readonly
[ 1064.048211] BTRFS error (device dm-0: state EA): failed to run delayed ref for logical 223696158720 num_bytes 4096 type 178 action 1 ref_mod 1: -5
[ 1064.048219] BTRFS: error (device dm-0: state EA) in btrfs_run_delayed_refs:2249: errno=-5 IO failure
[ 1064.049547] Core dump to |/usr/lib/systemd/systemd-coredump pipe failed
[ 1064.049558] Core dump to |/usr/lib/systemd/systemd-coredump pipe failed
[ 1064.049576] Core dump to |/usr/lib/systemd/systemd-coredump pipe failed
[ 1064.049597] Core dump to |/usr/lib/systemd/systemd-coredump pipe failed
[ 1064.050126] Core dump to |/usr/lib/systemd/systemd-coredump pipe failed
[ 1064.050144] Core dump to |/usr/lib/systemd/systemd-coredump pipe failed

I'm using the latest firmware (731120WD) and have tried all the combinations of the following kernel options: acpi_enforce_resources=lax nvme_core.default_ps_max_latency_us=0 pcie_aspm=off.
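
When testing kernel options like these, it is worth confirming that they actually reached the running kernel, since a bootloader entry that was edited but never regenerated is easy to miss. A minimal sketch that checks /proc/cmdline; the function name and the optional file argument are assumptions for illustration.

```shell
# has_kernel_param PARAM [CMDLINE_FILE]
# Succeed if PARAM (e.g. pcie_aspm=off) appears as a whole word in the
# kernel command line. Defaults to the running kernel's /proc/cmdline.
# Hypothetical helper for illustration only.
has_kernel_param() {
    tr ' ' '\n' < "${2:-/proc/cmdline}" | grep -qxF -- "$1"
}

# Example:
#   for p in nvme_core.default_ps_max_latency_us=0 pcie_aspm=off; do
#       if has_kernel_param "$p"; then echo "$p active"; else echo "$p missing"; fi
#   done
```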

12 replies

jpsalm Mar 26, 2024

Update: I moved my root back to WD because the 980 Pro I had moved it to performed awfully with LUKS + btrfs + transparent compression (seq. write speeds of around 480 MB/s). By sticking with 512 byte sectors I've had no further issues. Sequential write speeds aren't quite as fast as before (2000MB/s vs 3600 MB/s) but it's been stable.

justinclift Mar 26, 2024

@jpsalm That WD is still running with firmware 731120WD yeah? If so, then that specific firmware + 512 byte sectors might be a useful "seems to work" base point for anyone else having issues.

... unless it turns out WD changed the underlying electronics in the drives without changing the model number as well. Other vendors are known to do that occasionally (Kingston comes to mind). Not sure if WD does that kind of thing too.

jpsalm Mar 26, 2024

That's correct, I have two SN770 2TB drives both on 731120WD with 512 byte sectors. They're running btrfs on top of luks with no issues for the last two weeks now (and previously every big compile had a high probability of triggering a controller reset).

justinclift Mar 26, 2024

@jpsalm Oh, one other relevant question comes to mind. Do those drives have heat sinks physically on them? Just in case heat is some kind of a factor in this... 😄

Thaodan Nov 26, 2024

Did you find a fix? Strangely, I had the same issue with SN 740 2242 TB, thought it's just an issue with his particular ssd.
I wasn't able to get logs as the FS remounts read-only as the controller doesn't respond anymore but here's a picture:

I also use 4k sectors. I don't know why that should be an issue. Doesn't make sense to me.

DiarrheaMcgee · 2024-02-23T18:43:28Z

DiarrheaMcgee
Feb 23, 2024

does this affect other western digital ssds like the wd_black sn850x

6 replies

DiarrheaMcgee Feb 23, 2024

i just got a western digital ssd so i guess i'll just wait before trying out zfs

markjdb Feb 23, 2024

For what it's worth, I'd been hitting the firmware crashes pretty much daily with an SN770 in a newly built workstation running FreeBSD; the system has been completely stable for over a month after I swapped it out for an SN850x. Just one data point.

justinclift Feb 24, 2024

@DiarrheaMcgee There are also reports of people having problems with btrfs (above) as well, so that might be another thing to exclude until the problem has been figured out. 😄

Besenreiter Feb 24, 2024

This is getting a bit "annoying". I wonder if WD reads this thread? I, and obviously a lot of other people, have lost a lot of time with this crap. I am thinking of returning these SSDs under warranty. But then, how to argue when they test them under Windows?

justinclift Feb 24, 2024

@Besenreiter You could try something along the lines of "These are throwing errors when used under Linux, so I need to return them and get something that works".

I'd kind of expect that legally speaking their products are required to work regardless of OS. So if their testing can't find problems that are clearly known about (as above), that's their problem.

TheDom42 · 2024-03-18T14:31:32Z

TheDom42
Mar 18, 2024

Just to possibly add to the list of potentially problematic devices: Verbatim Vi3000.
3D-TLC with SLC cache. Controller is Maxiotek MAP1202A. I have the 2TB version.

Got them relatively cheap and did not worry too much about them being DRAM-less as they were supposed to be the base for some VMs and light dockers. Using an Asus Pro WS W680M-ACE SE with both slots populated with these drives.

Issues appeared right away in the resilvering for the mirror: one drive completely dropped out with (as far as I remember) the same error message as OP. The drives have a small green LED that indicates access (not sure if both read and write, but I suspect so). After the dropout but before the reboot, this LED stayed lit (not blinking).
The dropout was even (somehow) logged in the drive, because the UEFI SMART test reported the drive as defective. I secure-erased the drive to send it in for RMA and just for fun ran the SMART tests again - this time, no error, in both the short and the extended test. I thought this might have been a fluke and reinstalled the drive. I wrote 0's to the entire drive to verify correct operation (which passed without problems). I then re-added the device as a mirror.
During this second resilver, the complete vdev crashed after around 800GB written. After a reboot, the pool reported data errors due to CKSUM errors on the first drive (which had not reported problems before). Tried to resilver again to salvage as much as possible and this time, it was a lot slower than before but completed - clearing out all errors in the process (currently running a scrub to verify if all is actually working).
Until I stumbled upon this thread, I believed that either both SSDs were defective or the M.2 controllers of my board (which is still fairly new but who knows). I came here because I noticed that the resilvers and scrubs on these drives are a lot more "bursty" than I'm used to. Therefore, I was also suspecting that the controller (or NAND) cannot keep up with the resilver/scrub load. Either due to temperature or just plain being a bad (cheap) controller. When checking the datasheet afterward, I noticed that the drive's operating temperature only goes up to 70°C while the SN770 is rated for 85°C.

To document: I'm on Unraid 6.12.8 with ZFS: Loaded module v2.1.14-1, ZFS pool version 5000, ZFS filesystem version 5
ASPM L1 is enabled but the device is not sleeping during the scrub (highest possible power level).

0 replies

justinclift · 2024-03-19T22:29:20Z

justinclift
Mar 19, 2024

Stumbled over this list of SSDs with Power Loss Protection:

https://www.techpowerup.com/ssd-specs/filter/?plp=1

Looks pretty comprehensive. How did I not see this before? 😄

0 replies

toastal · 2024-03-25T17:44:54Z

toastal
Mar 25, 2024

I’ve had similar issues after an RMA with back-to-back issues on SSDs running Linux 6.7 & 6.8 with bcachefs on a 4096 sector size WD SN740 NVMe (2242 size) with firmware 73110101 for a Lenovo laptop with AMD Ryzen 7 CPU. Drive completely shits the bed under heavy IO like compiling the kernel & Lenovo support is acting like Linux is the problem instead of the vendors they partner with. No kernel parameters helped.

[  151.428363] usb 1-5: reset full-speed USB device number 4 using xhci_hcd
[  357.031040] usb 1-5: reset full-speed USB device number 4 using xhci_hcd
[  628.451047] nvme nvme0: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0x10
[  628.451059] nvme nvme0: Does your device have a faulty power saving mode enabled?
[  628.451061] nvme nvme0: Try "nvme_core.default_ps_max_latency_us=0 pcie_aspm=off" and report a bug
[  628.478667] nvme 0000:02:00.0: enabling device (0000 -> 0002)
[  628.478822] nvme nvme0: Disabling device after reset failure: -19
[  628.489071] bcachefs (nvme0n1p3): btree write error: I/O
[  628.489072] bcachefs (nvme0n1p3): btree write error: I/O
[  628.489072] bcachefs (nvme0n1p3): btree write error: I/O
[  628.489082] bcachefs (nvme0n1p3): btree write error: I/O
[  628.489082] bcachefs (nvme0n1p3): btree write error: I/O
[  628.489083] bcachefs (nvme0n1p3): btree write error: I/O
[  628.489085] bcachefs (nvme0n1p3): btree write error: I/O
[  628.489086] bcachefs (nvme0n1p3): btree write error: I/O
[  628.489088] bcachefs (nvme0n1p3): btree write error: I/O
[  628.489090] bcachefs (nvme0n1p3): btree write error: I/O
[  628.489090] bcachefs (nvme0n1p3): error writing journal entry 84258: I/O
[  628.489141] bcachefs (nvme0n1p3): unable to write journal to sufficient devices
[  628.489151] bcachefs (nvme0n1p3): fatal error - emergency read only
[  628.489155] ------------[ cut here ]------------
[  628.489157] btree trans held srcu lock (delaying memory reclaim) for 22 seconds
[  628.489166] ------------[ cut here ]------------
[  628.489168] btree trans held srcu lock (delaying memory reclaim) for 21 seconds
[  628.489169] ------------[ cut here ]------------
[  628.489171] btree trans held srcu lock (delaying memory reclaim) for 12 seconds
[  628.489210] bcachefs (nvme0n1p3): fatal error writing btree node: btree_write_all_failed
[  628.489216] bcachefs (nvme0n1p3): fatal error writing btree node: btree_write_all_failed
[  628.489218] bcachefs (nvme0n1p3): fatal error writing btree node: btree_write_all_failed
[  628.489220] bcachefs (nvme0n1p3): fatal error writing btree node: btree_write_all_failed
[  628.489222] bcachefs (nvme0n1p3): fatal error writing btree node: btree_write_all_failed
[  628.489224] bcachefs (nvme0n1p3): fatal error writing btree node: btree_write_all_failed
[  628.489226] bcachefs (nvme0n1p3): fatal error writing btree node: btree_write_all_failed
[  628.489228] bcachefs (nvme0n1p3): fatal error writing btree node: btree_write_all_failed
[  628.489232] bcachefs (nvme0n1p3): fatal error writing btree node: btree_write_all_failed

6 replies

DiarrheaMcgee Mar 25, 2024

i just installed void linux with zfs (2.1 compatibility) and i installed around 800 gigabytes of steam games and moved 500 gigabytes of files and it hasn't crashed
im using a WD_BLACK SN850X with zfs-kmod-2.2.3-1

toastal Mar 29, 2024

I installed my old Micron drive from a different laptop with ZFS in that machine & everything was alright. WD says they offer no warranty on OEM laptop hard drive parts.

justinclift Mar 29, 2024

> WD says they offer no warranty on OEM laptop hard drive parts.

How old is the laptop and ssd? If the laptop maker is out of business then in many countries WD would be liable for it, even if they want to pretend otherwise. 😦

But, Lenovo eh? Sounds like two crappy companies then. Can you return the laptop?

toastal Mar 29, 2024

It was from a year-old, previous-generation Lenovo that got toasted in a power surge. lspci shows Micron Technology Inc 2450 NVMe SSD [HendrixV] (DRAM-less) (rev 01) (prog-if 02 [NVM Express]), & smartctl Micron MTFDKCD1T0TFK @ firmware 7003V5LN which should still be PCIe4 x4 (but it does not support 4096 sector sizes).

toastal Mar 29, 2024

As much as I hate how the Lenovo reaction is going, I wouldn’t fully return a laptop just because they partnered with a shitty HDD vendor… as it is like the only replaceable part. I’m hoping to convince them to give me a new brand (spec sheet never said part number) or at least return this drive for the lowest downgrade option for a cash return to buy a new drive. Also, if you want something in a non-gaming form factor in this country with >16GB of RAM & an OLED display, this is literally the only model on the market.

posixpoet · 2024-03-25T22:13:58Z

posixpoet
Mar 25, 2024

Update 2024-04-07:
Replaced the complete setup. Currently running a Zotec C-type with an N100, a Crucial P3 and spec'd 16GB memory, with a TinyPSU. Running smooth. I'll update when I check the SN700 in this system. Thanks for listening.

Update 2024-04-03:
Disregard my post below. Replaced the 770 with a Crucial P3. Still had crashes. Replaced the PSU, which improved overall stability. Just had another crash in the morning (power save?). Investigating with netconsole et al.

Just to confirm... (modern HW noob here)
WD_BLACK_SN770 with 731120WD firmware won't even install Proxmox 8.1.4, even with
nvme format --lbaf=0 /dev/nvmeXYZ
It's a weakish Asus N100 Prime board, and setting the M.2 speed in the BIOS to anything from auto/gen1/2/3 doesn't make it installable.
BUT! When I hook up an old SATA drive, I'm able to install Proxmox onto the NVME (kernel 6.5.11-8-pve).
Despite
acpi_enforce_resources=lax nvme_core.default_ps_max_latency_us=0 pcie_aspm=off
the system crash can be replicated by creating a ZFS storage.

0 replies

J4nsen · 2024-03-28T22:24:29Z

J4nsen
Mar 28, 2024

I can also report ZFS troubles with 4x WD_BLACK SN770 2TB, Firmware 731100WD:

[Thu Mar 28 22:42:54 2024] nvme nvme1: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0x10
[Thu Mar 28 22:42:54 2024] nvme nvme1: Does your device have a faulty power saving mode enabled?
[Thu Mar 28 22:42:54 2024] nvme nvme1: Try "nvme_core.default_ps_max_latency_us=0 pcie_aspm=off" and report a bug
[Thu Mar 28 22:42:54 2024] nvme 0000:08:00.0: enabling device (0000 -> 0002)
[Thu Mar 28 22:42:54 2024] nvme nvme1: 6/0/0 default/read/poll queues
[Thu Mar 28 22:42:56 2024] nvme1n1: I/O Cmd(0x1) @ LBA 176970179, 1 blocks, I/O Error (sct 0x0 / sc 0x4) MORE 
[Thu Mar 28 22:42:56 2024] I/O error, dev nvme1n1, sector 1415761432 op 0x1:(WRITE) flags 0x0 phys_seg 1 prio class 2

Mainboard: Asrock Rack EP2C602
CPU: 2x Intel Xeon E5-2670
Ram: 128GB DDR3 ECC
SSD: 4x WD Black SN770
Host: Ubuntu 22.04
VM-Guest: TrueNAS/TrueNAS Scale
ZFS: RAIDZ1

I have 4 of them in a riser to use the bifurcation feature of my motherboard (PCIE Slot 7, directly connected to the 2nd CPU, so no chipset in the data-path).

I saw controller crashes in a TrueNAS-13.0-U6.1 virtual machine to which I forwarded the four NVMe drives. I feel that installing an Ubuntu VM on an iSCSI volume often triggers the crash.

What I unsuccessfully tried:

- Set PCIe speed to Gen2 or Gen1
- Switch to TrueNAS Scale
  - set pcie_aspm=off
  - set nvme_core.default_ps_max_latency_us=0
  - set both (also on the host)
- Get a better riser card: Asus Hyper M.2 X16 Card V2
  - I think this somewhat helped? Better cooling, better power supply?

Observations:

- Detaching and reattaching the borked NVMe lets me put it back into a working state

@admnd I'm super grateful for this discussion. Thanks for the initial write-up and debugging :) Luckily I can still return my NVMEs to Amazon.

13 replies

J4nsen Apr 13, 2024

Hey Justin,
I placed the NVMEs in an Asus Hyper M.2 X16 Card V2, so cooling should be fine. smartctl reports these temperatures:

Temperature Sensor 1:               57 Celsius
Temperature Sensor 2:               37 Celsius

I've also checked my 3.3V supply, because it was mentioned multiple times in this bug report, which itself links back to this discussion.
My 3.3V seems to be fine. The mainboard reports it at 3.27V. It's a redundant PSU from FSP (FSP Twins Pro 500W ATX).

I'm now back at TrueNAS Scale. I hope that the Linux kernel can handle the NVMEs better.

J4nsen May 27, 2024

43 days uptime of my truenas VM and no problems so far

justinclift May 27, 2024

Cool, that sounds like it's working ok. Are the drives on that mostly lightly loaded, or do they go through periods of serious activity?

Am kind of wondering if the problem only shows up when the drives are under sustained load or something. 😄

J4nsen May 30, 2024

It is (and was) a virtual machine with a 10GBit-nic, so the maximal load is already very limited. If I find the time and courage, I will run some benchmarks directly on the vm and report back. :)

justinclift May 30, 2024

Awesome. If we can figure out a likely causing factor, that'd be super helpful. 😄

no-usernames-left · 2024-05-30T13:35:09Z

no-usernames-left
May 30, 2024

I see the NVMe errors, but Linux seems to have a long-running issue causing txg_sync timeouts; are you sure there's no connection? (To be fair, if the SSD is dropping off the bus, all sorts of timeouts would obviously follow.)

#9130

0 replies

foolab · 2024-07-22T19:05:08Z

foolab
Jul 22, 2024

Can I assume SN850x is working fine?

Does anyone have a recommendation for a 2230 M.2 (Framework 16 does not have 2x2280 😢 )

5 replies

admnd Jul 23, 2024
Author

Yes, I have been using 3x 2TB WD SN850x (raid-z1) in my main workstation for more than 1 year with no ZFS crashes.

foolab Jul 23, 2024

Thanks. Great to hear.

Do you happen to have experience with 2230?

Paul-0123 Jul 24, 2024

I do have a 2230, but not for long, only as a second drive, and on a slow connection in an older laptop; I only get ~1300 MB/sec with a Micron_2400. Should I run any tests?

foolab Jul 25, 2024

Thank you @Paul-0123 for sharing.
It seems drives work fine mostly but they are a bit iffy when used with zfs. Is the 2400 part of a zfs array?

Paul-0123 Jul 26, 2024

@foolab, no, it is not part of a zfs array, only a single drive pool.
I partitioned it, could create a mirror or a striped pool with a partition on the other nvme, write some data just for testing.

foolab · 2024-08-02T16:04:44Z

foolab
Aug 2, 2024

It seems there could be a way to use SN770 if formatted to 512 bytes. I have not tried it myself
https://www.reddit.com/r/zfs/comments/1ei46zo/comment/lg43ip1/?utm_source=share&utm_medium=mweb3x&utm_name=mweb3xcss&utm_term=1&utm_content=share_button

3 replies

mabra Aug 2, 2024

> It seems there could be a way to use SN770 if formatted to 512 bytes. I have not tried it myself https://www.reddit.com/r/zfs/comments/1ei46zo/comment/lg43ip1/?utm_source=share&utm_medium=mweb3x&utm_name=mweb3xcss&utm_term=1&utm_content=share_button

Just a note (because I feel there is too much speculation in this topic): my NVMe drives are configured like this:
Disk /dev/nvme1n1: 1.82 TiB, 2000398934016 bytes, 3907029168 sectors
Disk model: WD_BLACK SN770 2TB
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: gpt
Even so, together with a 'FireCuda 520' (NVMe, 512 bytes too), my ZFS storage (mirror) crashed as well (note: swap is on a standalone SSD). Both disks, from two different manufacturers, got read errors at the same time (statistically very unlikely, except for other hardware reasons), and the storage became corrupt.
It took me a little too long, but then I powered the box off hard.
After boot, the pool-pool (on the same "defect disk" ...) came up without errors, whereas other pools (on partitions on the same NVMe drives) were faulty.

justinclift Aug 2, 2024

@mabra Which version of ZFS was this with?

foolab Aug 2, 2024

This is really a mess. But this really points to either zfs itself or the hw. Guess the only safe bet is 850x

mabra · 2024-08-03T02:50:24Z

mabra
Aug 3, 2024

Hi! I am on Debian 11 (Bullseye) and installed ZFS 2.1.11 from backports (which is stuck at this version and will not be upgraded... this may happen if "top-animals" say something like "ZFS is not necessary", and actions follow thoughts...), so I stay on this version and kernel 6.1.0-0.deb11.21-amd64. BTW, I don't remember which kernel I was using when it happened (2023-05-19), about 1 hour after returning from hibernation (this is why I added that I am using a separate swap SSD).

I have been using ZFS since 2012(!) and have never seen anything bad, especially like this. It was NOT a high-load problem, that is certain. BTW, at the beginning of this thread there was a note that the kernel NVMe driver doesn't provide the amount of buffer space the NVMe drive expects, but I lost track of it and have not read the specs. I continue working on the same hardware (Supermicro H12SSL) and have never had a problem with high load.

The reason I am using two different NVMe drives in the mirror is that I started with two FireCuda 510, but one died in the first weeks and I was afraid it was a systematic error and the next would follow soon. Both (FireCuda and WD) are nearly the same with regard to their specs. I even plan to remove the remaining one, because I am getting this error on each boot: >smartd[5015]: Device: /dev/nvme0, number of Error Log entries increased from 1049 to 1052< This has never been the case for the WD drive (WD_BLACK SN770 2TB).

Regards, Manfred

0 replies

mabra · 2024-08-03T03:07:25Z

mabra
Aug 3, 2024

I don't see any really hard evidence for a hardware error, at least in my case. It has run for more than 15 months since the crash without any problem. When I see what happens with the kernel (every version another crash), I see other possibilities. And, as I noted in my last answer, why should anyone worry about "a product which taints the kernel"... Regards, Manfred

2 replies

justinclift Aug 3, 2024

Interesting. I'm wondering if what you hit there was one (or more) of the ZFS bugs that were fixed with the ZFS 2.2.x point releases: https://github.com/openzfs/zfs/releases

Apparently there were some long standing bugs in there.

It's possible the Debian backports have the fixes backported into their packages though. No idea personally if that's the case.

mabra Aug 25, 2024

> Interesting. I'm wondering if what you hit there was one (or more) of the ZFS bugs that were fixed with the ZFS 2.2.x point releases: https://github.com/openzfs/zfs/releases
>
> Apparently there were some long standing bugs in there.
>
> It's possible the Debian backports have the fixes backported into their packages though. No idea personally if that's the case.

I have to stay on Debian 11 for some time, and no newer updates will be made available in the backports/contrib repository (it has 2.1.11 ...).

marcus905 · 2024-10-16T19:19:23Z

marcus905
Oct 16, 2024

There has been another set of issues with the HMB buffer size causing crashes on Windows 24H2 with SN770 and SN580 SSDs (and all others based on the same controller)

https://community.wd.com/t/windows-24h2-wd-blue-screens/297867

Might this issue be related to something similar?

16 replies

agrenott Oct 21, 2024

Found this which looks promising: https://github.com/not-a-feature/wd_fw_update
Didn't try to apply the update yet though :)

kam821 Oct 25, 2024

SN770 500GB is definitely affected by the instability issue on ZFS and has not received an update to 731130WD. Because the Windows 24H2 problem was caused by excessive HMB allocation and the 500GB version only allocates 32MB, these two issues are most likely unrelated.

marcus905 Oct 25, 2024

It's not that clear cut. The HMB issue stems from the fact that Linux and Windows had very different allocation schemes for HMB.

Linux tries (tried?) to allocate the maximum number of 4MB chunks available up to 8 (so at most 4MB x 8 = 32MB), while Windows always allocated a single chunk of max size. The size was capped to 64MB in 23H2, and capped to a higher value or uncapped from 24H2 on. The patch addresses this by changing from a 200MB max size and 8 max chunks to a 64MB max size in a single chunk only, which changes the behavior on both OSes.

While, and I concur, this might not be a solution, it's still worth a try to check if the newer firmware on 2TB devices provides any benefit on the issue.
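
The arithmetic in the description above can be written out as follows. This is only an illustration of the numbers in this post (chunk size, chunk count, requested size); it is not taken from the Linux or Windows NVMe drivers, and the function name is hypothetical.

```shell
# hmb_alloc_mb REQUESTED_MB CHUNK_MB MAX_CHUNKS
# Host Memory Buffer granted when the host hands out at most MAX_CHUNKS
# chunks of CHUNK_MB each towards the size the drive asks for.
# Hypothetical model of the post's description, not driver code.
hmb_alloc_mb() {
    chunks=$(( $1 / $2 ))
    if [ "$chunks" -gt "$3" ]; then
        chunks=$3
    fi
    echo $(( chunks * $2 ))
}

hmb_alloc_mb 200 4 8    # 4MB chunks, at most 8 of them: prints 32
hmb_alloc_mb 200 64 1   # one chunk capped at 64MB: prints 64
```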

Xalaxis Nov 10, 2024

I can further suggest this is not the cause of the problem - I updated my SN770 to the latest HMB-fix firmware using Windows and still regularly encountered the issue. In the end I've switched to a Seagate FireCuda 530 which has not exhibited the same symptoms.

justinclift Nov 10, 2024

Ahhh well. Thanks for trying it out and letting people know @Xalaxis. 😄

Xalaxis · 2024-11-19T19:55:15Z

Xalaxis
Nov 19, 2024

An update on this issue: could this strictly be a firmware bug to do with 4096 byte sector sizes and nothing else? I've recently given my problematic SN770 to someone else, who suddenly started having similar-looking dropout issues in Windows. After reformatting to 512 byte mode the issues went away.

Someone else here reports issues with just the 4096 bytes mode: https://community.wd.com/t/sn770-nvme-controller-reset-when-formatted-with-4096-byte-sectors/282532

4 replies

admnd Nov 20, 2024
Author

Thank you for the hint @Xalaxis. That would probably explain why no one (except power users who switched to native 4k "sectors") encounters the issue: they rely on the default "stable" 512b configuration (Windows might apply some kind of quirk not yet applied on Linux/FreeBSD, but I assume that is not the case here).

A thing that could be tested: a zpool using a single NVMe module in both 512b and 4k configurations. If drop-offs happen only in 4k mode, that would simply mean WD has some serious undocumented hardware issue here. As the performance is already crippled by a small memory buffer in the computer RAM and no firmware update seems to fix the problem, those modules are nothing but pure cheap garbage. Has WD even tested this scenario? A "no" would be quite surprising, but who knows, eh?
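Before comparing the two configurations, it helps to confirm which sector size a module is actually presenting to the kernel. A minimal Python sketch, assuming a Linux system where sysfs exposes `queue/logical_block_size` for the namespace; the device name `nvme0n1` and the helper names are hypothetical:

```python
from pathlib import Path


def logical_block_size(device: str, sysfs_root: str = "/sys/block") -> int:
    """Return the logical sector size in bytes that the kernel reports for a block device."""
    path = Path(sysfs_root) / device / "queue" / "logical_block_size"
    return int(path.read_text().strip())


def sector_mode(device: str, sysfs_root: str = "/sys/block") -> str:
    """Map the reported size to the labels used in this thread ("512b" or "4k")."""
    size = logical_block_size(device, sysfs_root)
    return {512: "512b", 4096: "4k"}.get(size, f"{size}b")


if __name__ == "__main__":
    # "nvme0n1" is only an example name; substitute the namespace under test.
    print(sector_mode("nvme0n1"))
```

Recording this value alongside each test run makes it easier to tell afterwards which reports came from which mode.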

Another way is to use anything else but ZFS (with a significant I/O load) like BTRFS, XFS or EXT4. I am pretty confident the same crash would show up.

In all cases, this is not a software (i.e. ZFS or Linux kernel) issue.

Asking WD for an RMA is absolutely useless, as the issue is definitely a "by hardware design" one.

marcus905 Nov 20, 2024

Sadly those SSDs (SN770M) fill a very specific niche (TLC + 2230 + 2TB + PCIe4), so it's somewhat bad to have this issue.

justinclift Nov 20, 2024

could this strictly be a firmware bug to do with 4096 byte sector sizes and nothing else?

It's doubtful, as there are reports of the problem happening in this GitHub issue even with 512 byte sectors. 😦

Another way is to use anything else but ZFS (with a significant I/O load) like BTRFS ...

There are also reports here (in this GitHub issue) of the crashes happening for people using Btrfs. 😦

mariusmuja Nov 28, 2024

This matches my experience: I initially formatted two SN770 with 4K sectors for better performance and I was getting the "controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0x10" error almost immediately after starting a scrub on the pool.

After reformatting the drives with 512-byte sectors, I've been using them without issues.
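For anyone comparing the two modes, counting how often the reset message quoted above appears in a saved kernel log gives a simple before/after measure. A minimal Python sketch; it matches only the message text quoted in this comment, and the function name and the idea of feeding it saved `dmesg` output are assumptions for illustration:

```python
import re
from typing import Iterable, List, Tuple

# Matches the message text quoted above; any prefix on the log line is ignored.
RESET_RE = re.compile(
    r"controller is down; will reset: "
    r"CSTS=(0x[0-9a-fA-F]+), PCI_STATUS=(0x[0-9a-fA-F]+)"
)


def find_controller_resets(lines: Iterable[str]) -> List[Tuple[int, int]]:
    """Return (CSTS, PCI_STATUS) as integers for every reset message found."""
    hits = []
    for line in lines:
        match = RESET_RE.search(line)
        if match:
            hits.append((int(match.group(1), 16), int(match.group(2), 16)))
    return hits


if __name__ == "__main__":
    import sys
    # Example: dmesg | python3 this_script.py
    resets = find_controller_resets(sys.stdin)
    print(f"{len(resets)} controller reset message(s) found")
```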

Replies: 42 comments · 146 replies
