
Western Digital’s SMR disks won’t work for ZFS, but they’re okay for most NASes - rbanffy
https://arstechnica.com/gadgets/2020/06/western-digitals-smr-disks-arent-great-but-theyre-not-garbage/
======
zenexer
Current title is "Western Digital’s SMR disks won’t work for ZFS, but they’re
okay for most NASes." That isn't the title of the article (it's the subtitle)
and doesn't match the conclusion:

> Conclusions

> We want to be very clear: we agree with Seagate's Greg Belloni, who stated
> on the company's behalf that they "do not recommend SMR for NAS
> applications." At absolute best, SMR disks underperform significantly in
> comparison to CMR disks; at their worst, they can fall flat on their face so
> badly that they may be mistakenly detected as failed hardware.

That's definitely not saying it's "okay for most NASes." Could we get a
rename? It looks like Ars Technica themselves might be going back and forth;
the current URL contains the slug, "western-digitals-smr-disks-arent-great-
but-theyre-not-garbage," indicating that may have been the old title.

~~~
oefrha
Ars Technica publishes every article with two titles; there’s an A/B test in
the first few minutes after an article is published, after which the one with
better click-through rate becomes the permanent title. Obviously only one of
the two titles can be the slug, so often the slug reflects the title that has
lost. That does not indicate going back and forth.

Source: it’s a well-known fact that editors have disclosed many times over.
Also, as a subscriber I often see a different title in my RSS feed.

~~~
fragmede
Here's a more detailed discussion from Ars on their A/B testing of headlines
that I found:

[https://arstechnica.com/information-
technology/2016/10/tfw-a...](https://arstechnica.com/information-
technology/2016/10/tfw-an-obituary-you-wrote-five-years-ago-goes-
viral/?comments=1&post=32066759)

------
Jonnax
I built a 2 drive 4TB RAID1 with a DS220j using these SMR drives the other
day.

To be honest performance is as I'd expect out of a hard drive. But I've mostly
been using SSDs on my PCs.

I've put my instrument VSTs on it, and I'm using the Windows NFS client (you
mount it with a CLI command after enabling the client under "Windows
Features").
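
Roughly like this, once the NFS client is enabled (the IP, export path, and
drive letter here are made up):

    mount -o anon \\192.168.1.50\volume1\vsts V: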

Standard Windows file sharing (SMB) seems to top out at 25MB/s, and I've no
idea why.

I'm not too worried about it being SMR, at least for my use case.

But it's really messed up that WD hid the technology used.

If anything, being upfront about it would drive sales of their more expensive
products.

Also, it highlights that in a lot of tech product fields we don't have good
reviews, because there's no money in it.

For laptops there's Notebookcheck doing technical reviews.

For displays, RTINGS and TFT Central.

But for other, less-covered products it's a YouTube video from someone saying
"yeah, it's a hard drive" and reading out the marketing copy.

Or a drive will get reviewed, and then the manufacturer will start selling
something under the same brand with a different model number to confuse
customers.

But nobody re-reviews it, partly because they weren't told about the change.

~~~
ubercow13
If you just built your array, you wouldn't encounter this issue yet. The
problem would come (assuming you're using ZFS) when your NAS is full of data
and one of your disks fails, and you try to replace it with another SMR drive.
The drive would be rejected before it could be added to the array.
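
In other words, the failure mode shows up at the replace/resilver step,
roughly like this (pool and device names below are placeholders):

    zpool replace tank /dev/disk/by-id/old-failed-disk /dev/disk/by-id/new-smr-disk
    zpool status tank   # the resilver that follows is where the SMR drive chokes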

~~~
dannyw
As benchmarked in the article, SMR drives rebuild Linux RAID fine because it's
just a sequential data overwrite.

The problem occurs with sequential writes in small blocks and with
non-sequential (random) writes, where performance is abysmal: roughly 93%
slower than CMR, as benchmarked by ServeTheHome and Ars.

~~~
zozbot234
The problem is not just the slow performance, but the fact that these drives
can be dropped from the array when rebuilding - RAID implementations including
the one in ZFS treat abysmal performance as damage and route around it. The
performance problem could be addressed simply by raising the number of drives
in the array (RAID5 to RAID6 or more).

~~~
ezoe
Isn't that a problem with ZFS? I wouldn't mind SMR HDDs if they were labeled
as such and came with a warning about their weird performance traits.

~~~
ubercow13
This whole issue was originally that Western Digital was trying to hide which
HDDs were SMR. ZFS can't be changed to have special behaviour for SMR drives
if the vendors try to hide the fact they are SMR, even from software. It
wouldn't be able to identify them.

~~~
ezoe
Well, that's bad.

------
acqq
> the WD Red's firmware was up to the challenge of handling a conventional
> RAID rebuild, which amounts to an enormous, very _large block sequential
> write test_. (...) the Red is _only 16.7 percent slower_ than its non-SMR
> Ironwolf competition

> we can model an ideal ZFS resilvering workload with a _massive sequential
> write_ — and we did exactly that, using 32KiB blocks of incompressible
> pseudorandom data (...) With this test workload, we achieved a throughput of
> 209.3MiB/sec on the Ironwolf, but only 13.2MiB/sec on the Red — a 15.9:1
> slowdown

So if I understood correctly, the "very large block sequential write test" is
only ~16.7 percent slower, but reducing the written block size to 32KiB, as
ZFS apparently does, makes it 16 times slower?

Couldn't that be fixed in ZFS? A 32 KiB block is too small for SSDs too, if I
understand correctly. (Edit: I refer to:
[https://tytso.livejournal.com/2009/02/20/](https://tytso.livejournal.com/2009/02/20/)
"However, with SSD’s (...) you need to align partitions on at least 128k
boundaries for maximum efficiency." \-- it seems that 128K was an important
number?).

If the writes are sequential, couldn't some kind of caching (i.e., batching
the sequential writes into larger blocks) be used to avoid the slowdown? How
big were the blocks written when the 16% slowdown happened?

And what is the minimum block size needed to avoid a many-times slowdown?
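
For what it's worth, I imagine the article's workload could be approximated
with something like fio (the target file and size below are made up; this is
only my guess at the methodology, not Ars's exact command):

    fio --name=seqwrite32k --filename=/pool/testfile --size=10G \
        --rw=write --bs=32k --ioengine=psync --refill_buffers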

~~~
dannyw
You can change the record size with ZFS easily; it's a supported feature:

    
    
        sudo zfs set recordsize=[size] data/media/series
    

This can be done at any time, but will only apply to new files.
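
To check what a dataset is currently using (same dataset name as in the
example above):

    zfs get recordsize data/media/series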

ZFS is a copy-on-write system, which means that if you snapshot (or
duplicate) a file and then modify one byte, a new copy of the affected record
is written at that point (a minimum of one record, so 128KiB by default). Each
record is then split into writes of roughly ~32KiB per disk because of
striping, parity, etc.

If you made the record size something large, say 1MB, then changing 1 byte of
a duplicated or snapshotted 1GB file would consume an additional 1MB of
storage. Very, very inefficient. However, it's still better than a non-CoW
system, where duplicates and snapshots would occupy an additional 1GB
irrespective of modifications.

So to summarize, ZFS is intentionally designed around, and benefits from,
small block sizes; rather than changing that, you should simply buy a non-SMR
drive, especially because they are available at comparable prices.

Also keep in mind that an advanced format HDD sector is 4096 bytes. I haven't
heard of any drives with more than ~4096 bytes per sector. So as long as your
block size is an integer multiple of 4096 bytes, you are getting all the
benefits of your HDD medium.

I believe SSD native sectors are similar, so again, there's no reason to go
for huge sectors. SMR's abysmal performance with small writes is a _defect_,
something you should RMA the moment you receive an SMR drive (WD is honoring
replacements; just mention the magic words 'ZFS rebuild' to support).

~~~
RNCTX
> Also keep in mind that an advanced format HDD sector is 4096 bytes. I
> haven't heard of any drives with more than ~4096 bytes per sector.

There are some NVMe drives which use 8192. Not sure which, because we haven't
learned a damn thing in ~15 years.

I distinctly remember back in the 2000s Adaptec giving Theo at OpenBSD the
run-around about driver code, because they had lawyers in their ear trying to
convince them that they should be patent and licensing trolls rather than
hardware vendors (worked great for Lucent). Finally, in retaliation, he
shipped a version of OpenBSD without Adaptec RAID support.

[https://openzfs.org/wiki/Performance_tuning#Alignment_Shift_...](https://openzfs.org/wiki/Performance_tuning#Alignment_Shift_.28ashift.29)

Why is this relevant? Because some of these NVMe drives report wrong values
for their sector size, so we're still in that same boat of being flat-out lied
to by the people we buy this hardware from.

Some NVMe SSDs, per testing, report 4096 when they're actually 8192. So you
don't know what value to use until you set up a pool/dataset, test it for
performance, and then, if the performance seems off versus, say, AnandTech
benchmarks, destroy the pool/dataset and try the other value (this becomes
ashift=12 or ashift=13 currently).
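
So in practice you end up trying both, something like this (pool name and
device path are placeholders):

    # assume 8KiB physical sectors -> ashift=13; use ashift=12 for 4KiB
    sudo zpool create -o ashift=13 tank /dev/nvme0n1
    # confirm what the vdev actually got
    zdb -C tank | grep ashift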

You can find discussion on the ZFS subreddit about this pretty commonly, and
also on the PostGIS and OpenStreetMap user groups and github issue forums
since those people tend to be on the bleeding edge of the disk performance
market.

------
willis936
Oh my god. No wonder they lied about what technology was in those drives. If
customers had known, they would never have bought them, regardless of price.
They should have written it off.

------
mindslight
Nobody is saying that you can't use an SMR drive in a RAID with the
appropriate write strategy. It's just that you'd expect those tweaks to be
enabled when the drive actually reports itself as an SMR drive. Which these
drives don't do, because WD tried to defraud their customers by doing a gross
cost+feature reduction on a product line that had been shipping for many
years.

~~~
StillBored
Until WD publishes a worst-case write timeout for these drives, and gets it
merged into Linux/etc., I would claim they aren't fit for any purpose, because
you're just crossing your fingers and hoping your write pattern is OK.

They are still hiding the fundamental parameters needed to make these drives
work properly (the number/size of SMR regions, how much CMR is available, and
a way to read how full the cache+CMR area is at any given time).
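
For what it's worth, the kernel-side command timeout is already per-device and
tunable; the problem is that nobody outside WD knows what value is actually
safe for these drives (device name below is a placeholder):

    # default SCSI/SATA command timeout is 30 seconds
    cat /sys/block/sdb/device/timeout
    # raise it as a stopgap, e.g. to 3 minutes
    echo 180 | sudo tee /sys/block/sdb/device/timeout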

------
rbanffy
Does anyone know the size of the non-SMR data area on these disks? I assume
that's the size of a sequential write you can do before performance grinds to
a halt while all of it is rewritten into the SMR area. Also, random writes
would perform much worse if they require multiple SMR blocks to be read and
rewritten each time the non-SMR area fills up.

Perhaps a good analog for an SMR drive would be some of the early hierarchical
storage servers that served data from RAM, then from spinning metal, and then
from either tape or optical disks, each being slower than the previous. You
could get fast writes until the RAM was full, then slower, disk-like writes
while the HDs were getting filled, then excruciatingly slow throughput when
you needed to hit the MO drives with little RAM or disk storage to help you.

~~~
insulanus
It's not like the disk is divided into SMR and non-SMR zones. Adjacent tracks
that already contain data must always be re-written. So, these drives work
well if managed in one of two ways:

  * Append-only
  * Only use every other track

If used like that, they can be considered just another hard drive, with an
unusually large gap between tracks.

But if you try to pack on as much data as the drive will hold, and then modify
parts in the middle, the drive is going to have a bad time.

~~~
dnr
The drive firmware and layout are more sophisticated than you're imagining:
All SMR drives have some non-SMR (CMR) zones for buffering and metadata. For
the archival drives I was looking at a while ago (host-managed), it was
usually 1% or 0.5% of the total. I would imagine these stealth-drive-managed
ones have even more, maybe a few percent, to allow more buffering and hide the
cost of shuffling data.

------
013a
Maybe this is mentioned somewhere in the article, I haven't read the whole
thing, but it's important to elaborate: SMR is only used on WD Red drives
between 2TB and 6TB [1]. Everything smaller and larger than that uses CMR, and
SMR isn't used on any of their other 3.5" drives beyond the budget "blue"
line. This is now called out on their Amazon (US) listings; their 2TB-6TB Red
drives are in a totally different product listing [2] which says "SMR" in the
title (versus their CMR drives [3]).

I don't like that they hid it. If they want to keep SMR around for their 2-6TB
Red variants, I think they should reclassify those drives with another "color"
and not certify them for server applications (WD Pink?)

[1] [https://blog.westerndigital.com/wd-red-nas-
drives/](https://blog.westerndigital.com/wd-red-nas-drives/)

[2] [https://www.amazon.com/Red-4TB-Internal-Hard-
Drive/dp/B07MYL...](https://www.amazon.com/Red-4TB-Internal-Hard-
Drive/dp/B07MYL7KVK)

[3] [https://www.amazon.com/Red-12TB-Internal-Hard-
Drive/dp/B07RQ...](https://www.amazon.com/Red-12TB-Internal-Hard-
Drive/dp/B07RQ99XJH)

~~~
colejohnson66
What I don’t understand is: SMR is for increased density, no? So, why don’t
the 8 TB Reds use it?

~~~
alias_neo
To take a guess, I'd say they managed to remove a platter or two to lower
costs on the smaller drives (I wonder if an EFAX is physically lighter?) but
couldn't get the same benefit or didn't need the cost reduction in the more
expensive 8TB+ drives. It's also possible that in testing, the 256MB cache
just wasn't sufficient to make DM-SMR "usable" at those larger capacities, or
that the firmware/controller just wasn't up to it.

------
derekp7
Question about SMR drives. From my understanding, essentially at the low level
they have to modify data in much larger block sizes -- so if you are modifying
a small amount of data, a block has to be read in, modified, then written out.
Unless the block is known to contain no data after the point where the write
occurs (such as sequential writes).

What is the size of these blocks? And if the problems occur during a RAID
rebuild, aren't the rebuilds writing data sequentially?

~~~
karmakaze
This all happens at the hardware level, so as far as the drive is concerned
every block has data--and every write has to write two tracks, as I understand
it.

------
e40
I had two failures on a RAID 6 in the same week. All 3TB WD Reds. I'm slowly
replacing all of the drives with Seagate 4TB NAS drives. I definitely do not
trust these drives.

I'd had failures of some of the drives before, too, but when I pulled them
and ran extensive tests on them, they always came back OK.

~~~
StillBored
I'm betting it's possible that, if your drives are sufficiently fragmented
and you start a heavy write workload that shotguns sectors all over the disk
from multiple queued (TCQ) commands, the default write command timeouts on
most RAID systems aren't long enough.

So the drive may not have actually failed; the RAID just determined that the
command was taking too long, fired a reset at the drive, and when it didn't
immediately respond, marked it bad.
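
This is the same class of problem people use SCT ERC (TLER) for on RAID member
drives: you can at least check, and cap, how long the drive itself spends on
error recovery (device name is a placeholder, and not all consumer drives
support it):

    # show the drive's SCT Error Recovery Control timers, if supported
    sudo smartctl -l scterc /dev/sdb
    # cap read/write error recovery at 7.0 seconds (values are in 0.1s units)
    sudo smartctl -l scterc,70,70 /dev/sdb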

------
blackflame7000
The whole point of ZFS is to tolerate bad drives. You could run ZFS on Green
drives if you want (but don't count on high performance).

------
dehrmann
Except for the uberblocks (which will definitely have a problem with SMR),
shouldn't SMR be fine for ZFS because it's copy-on-write?

------
fredsanford
This article makes me wonder how much WD paid...

Since they hid the change to SMR, what else has been hidden?

~~~
donmcronald
Paid for what, the article? The author has been around the ZFS community for
a long time and helps tons of people (including me) for free. He's not
shilling for WD, if that's what you're suggesting.

------
Stranger43
Now you could turn the argument around and ask if ZFS is at all appropriate
for consumer/small office NAS systems.

There is an argument to be made that ZFS might be getting a bit too much hype
as the future of mass-market storage, given how dependent it seems to be on
the hardware being "perfect".

~~~
growse
Having suffered from silent data corruption affecting some old photos that I
only noticed years after it had happened (backups going back that far had
already been purged), I'd argue that ZFS is appropriate for everyone who cares
about their data.

Given that ZFS will happily run on pretty much anything, I think the bar for
hardware dependency is slightly lower than "perfect" and somewhere around "not
completely crap".

