WD refused to answer our questions about its self-wiping SanDisk SSDs (theverge.com)
257 points by kjhughes 11 months ago | 198 comments

> And that’s just lawsuit number one: Ars Technica reports there are two more. Three SanDisk SSD lawsuits in total, each seeking class-action status for the same failures.

Good. I don't really care about whether WD answers questions, I care about them losing a lawsuit and having punitive damages per affected customer that are far more than the retail value of each product, due to loss of data.

I am baffled by why WD isn't voluntarily issuing product recalls, replacements, and/or firmware updates as necessary in this situation -- the hit they're taking to their reputation is far worse in the long run. But fortunately we have the courts.

I agree. I've naively trusted the SanDisk brand, but just today I comparison-shopped and bought an SSD and didn't even look at WD or SanDisk. Their name is now tainted in my book.

Interestingly Framework (the laptop company) ships exclusively WD SSDs, although I suppose you can bring your own SSD if you really want.

Though I wonder which other brands are trustable? Just yesterday I read a comment on another thread mentioning how Samsung’s firmware is subpar (which, having owned several other Samsung products, doesn’t surprise me).

The solution is to give up on the entire idea of brand loyalty. These are commodity products. None of the products or companies are 100% reliable. To some extent, you get what you pay for, but paying more won't guarantee a trouble-free experience. When some brands command a premium price in the market, across their entire product line, that's really just a sign that they've managed to get customers to think their products are special. But everyone has shipped some lemons, and everyone has screwed over somebody with bad customer support.

No, I don't think so. Off-brands are hyper-unreliable. There is a quality ladder for flash chips, ranging from 100% working dies down to ones which e.g. have half their capacity disabled or need ECC plus a scrambler to function.

On-brands at least /try/ to be reliable. Off-brands don't.

WD/Sandisk is a little weird, in that they went from on-brand to off-brand.

I also haven't seen many lemons from Crucial. They control the process end-to-end, and don't really go for bleeding-edge performance or rock-bottom price. For me, they've always been good and reliable for both memory and flash.

Kingston, Lexar, and SanDisk were premium brands which really tried to be reliable, until WD bought SanDisk.

Samsung cares about reputation too, but pushes performance a bit more, so you sometimes get lemons. But on the whole, they at least try.

> Off-brands are hyper-unreliable. There is a quality ladder for flash chips, ranging from 100% working dies down to ones which e.g. have half their capacity disabled or need ECC plus a scrambler to function.

Nah, that's not at all accurate. There are tons of small brands that have both dirt-cheap drives built using whatever components are cheap that week, and more expensive products using high-grade current-generation flash and controllers. Kingston and Lexar brands operate that way, too, with a mix of cheap crap and more expensive products with quality components (often overpriced). And some of the small brands don't really do bottom-of-the-barrel drives at all (eg. Corsair, Sabrent). And certainly Crucial/Micron have had their fair share of serious SSD issues. Brand just isn't the reliable indicator of quality that you and many other consumers wish it was.

No. Some of the best brands are the ones you've never heard of, like InnoGrit and Phison, because they're the ones making the chips everyone else uses. Most companies are just building out reference designs from one of this handful of companies, and the quality is completely identical.

And the ones who aren't, like Samsung, are notorious for even worse reliability.

The situation is completely reversed from 10 years ago; these are commodity products now, and the people who use the reference design do better than the ones who go their own way, like Samsung. Samsung's reliability has been worse than a random small brand like InnoGrit or Sabrent in the modern era, 100%.

I wouldn’t really consider WD an off-brand. Off-brands to me are the cheapie no-name brands from China.

I would definitely consider WD an off-brand at this point.

It's not about the size of the company. Many small, boutique brands are excellent. By revenue, Walmart is the largest company in the world, but no one praises them for their commitment to quality. Indeed, the stores I trust most are small family-run shops where the proprietor has been running them for decades.

I also no longer consider Chinese brands to necessarily be off-brands. China has some excellent brands too these days.

Colloquially, those are the six-letter brands.


Crucial is my go-to brand for reliability at reasonable prices.


* Kingston has an excellent reputation for reliability, as does Lexar, but you often pay a premium.

* Samsung and Intel are considered good, but they do occasionally make specific subpar models.

Crucial is my go-to, though.

I also buy direct to avoid knock-offs. Buying premium brands from eBay or Amazon especially makes no sense, since knock-offs are so !@#$% common.

Kingston were being anti-consumer around 10 years ago. I don’t know if they have improved their practices since then.


Lexar was bought by the Chinese company Longsys.

Kingston has a reputation for bait and switch as in the V300 scandal (https://www.anandtech.com/show/7763/an-update-to-kingston-ss...).

Did that result in a decline in Lexar quality? It's a serious question.

- Many Chinese firms bought established brand names in order to milk a cash cow and sell junk at premium prices.

- Many others did so with very good products, and seem to be doing a nice job sustaining those brands and are investing in this long-term.

I don't know what happened here.

ThinkPads don't seem any worse under Lenovo than late IBM, and GE Appliances, if anything, went up in quality.

I bought an SD card for someone a couple of years ago and he had no issues, but I can't remember if it was after the acquisition or before.

SK Hynix is an OEM supplier for a bunch of brands, and seems to have good ratings across several different e-tailers.

As for anecdata, I chew through SSDs due to high disk I/O loads from running continuous integration test suites for PhotoStructure on a bunch of different test rigs. I've had several WD and Samsung SSDs fail, but have yet to kill one of my 10+ SK Hynix sticks.

The best solution, of course, is to have automated backups in place, so drive failure is only a minor inconvenience, not a tragedy.

Where do you buy SK Hynix from?

> But fortunately we have the courts.

How long until new drives are sold with a shrink-wrapped forced-arbitration clause? (In the US.)

It is nice that the Democrats are trying to address this: https://www.banking.senate.gov/newsroom/majority/brown-intro...

Probably unenforceable.

My first “private” hard drive, that was all my own, was a WD external. It was a birthday gift from my parents. Ever since then I had purchased WD drives and recommended them. I used them exclusively in every computer I built and used them exclusively whenever I needed external storage.

They were hanging on a thread as far as I was concerned when they shipped NAS drives with SMR. I was burned by that. I’ve moved to Seagate now. WD has lost all the trust I had in them. I’m soon buying over 100TB worth of drives and not one of them will be WD.

> I am baffled by why WD isn't voluntarily issuing product recalls, replacements, and/or firmware updates as necessary in this situation

The last (class?) lawsuit seemed to settle for a pittance. I'd think, objectively, it's cheaper to resolve the lawsuit than issue the recall.

This is not long-term thinking, but keep in mind this is the same (product) management that seems to have created these hardware problems in the first place.

And WD can always change names and remove negative brand equity.

Every disk comes with a statement that the manufacturer does not warrant damage caused by the loss of data. This suit talks tough but is going nowhere.

This journalism is incredible. I love when reporters absolutely go in on these companies and expose them for their sloth.

> Schultz repeatedly refused to answer any of our questions. Her statement doesn’t even contain an acknowledgment of the issue and has no specific timeline for any real answers.

This kind of criticism was everywhere in the article and I absolutely love it. Hold their feet to the fire for making shitty products. WD deserves it.

Anybody else remember when they had small NAS devices with critical flaws that were actively being exploited, and WD didn't patch them because the devices were too old?

Don’t buy anything but physical hdd from WD.

> Don’t buy anything but physical hdd from WD.

Don't buy anything from WD.

I RMA'd a failed SSD to them under their "advanced replacement program".

They put a hold for the value of a new SSD on your card and ship it to you before you ship them the dead drive.

When they received the dead drive, they claimed that they would be unable to service it under warranty because the sticker had been removed.

I quoted the Magnuson-Moss Warranty Act (enforced by the FTC) to them and outlined how it is illegal to void a warranty based solely on the removal of a sticker.

They stopped answering my emails/tickets. Charge went through on my CC for replacement drive + $25 advanced RMA fee + shipping.

I sent the Apple Card CS team a transcript of my interactions with WD, and they filed a chargeback for everything.

Similar story with an $800 Asus monitor. The power supply blew inside warranty, I shipped it back in the original packaging, and they flaked on the warranty because the panel was cracked. Almost certainly on their side. They strong-armed the matter, but my proof pictures were a bit fuzzy so I didn't feel like I had a small claims case or anything. They won. Ugh.

Similar story with a $300 Fitbit. The battery died inside warranty (shorted capacitor) and they flaked. I bought off Amazon, but I guess you have to watch closely which store because only some of them come with a "manufacturer's warranty." Really? They can mention a manufacturer's warranty in advertising and then just drop it in fine print somewhere? Ugh.

Warranties are trash in the US.

Happens a lot with Amazon. They've been making it harder to find 1st-party shipped-and-sold for some time now. I received a nice set of knives for our wedding that instantly rusted; I then discovered all the things on my list were sold by sketchy third parties, and I proceeded to free-return many, many pounds of kitchenware, cast iron, etc., because ain't nobody got time for counterfeit nonsense. This was a handful of years ago and it's only gotten worse.

There’s a lot more shady things going on with mismatched skus and reviews. What you see is often not what you get. Amazon is gutter level now.

Amex buyer protection is great for these situations.

The solution to this that I’ve seen many businesses taking now is to refuse Amex.

I've seen that since being a kid. In my experience, businesses that refuse Amex are either great enough to be comfortable rejecting customers and mostly good faith, or mediocre to bad and not worth patronizing.

The companies figured out that if they all suck, you’ll have no choice but to accept the suck. This is what happens when we have no real consumer protections and why stuff like USB-C iPhones and replaceable batteries and banning some bad ingredients in food come from the EU.

Thank you for pointing out the specific Magnuson-Moss Warranty Act vs sticker and such. Appreciate it.

At this point SSDs are so cheap, I'm inclined to just buy a 7.5TB Enterprise SSD from Kioxia for $800 and call it a day.

I understand that's pretty expensive for a hard drive and probably not in most people's budget, but having an enterprise drive seems like it would avoid most quality issues.

Also, 7.5TB of SSD space is a good amount to last me a while.

I use enterprise SSDs now for the same reason - serverpartdeals has been reliable so far and has old stock 8TB under $500.

I am considering something like that, and noticed some enterprise / data center SSD makers do not publish firmware updates. Checking as I type this: Kioxia seems to only have consumer SSD firmware/tools/binaries. Micron has a list of firmware downloads for their drives. Samsung seems to have only consumer SSD firmware/tools, except for SM/PM863. Solidigm has a comprehensive status and download page for their drive firmware, including all(?) previously-Intel drives. Western Digital seems to only have consumer drive firmware available without a support ticket.

I am discounting whether email support could provide firmware, as I believe firmware updates should be as easy to access as possible (and have read forum posts describing the issues encountered when trying and usually failing to request via support channels). - If I missed something, please post it, would be good to know.

You’re still going to want backups though, and at that point, drive failures shouldn’t be catastrophic because you can just restore onto a fresh drive.

Yeah, but can you use it in a regular desktop (not to mention laptop) since it doesn't use M.2?

Uh yea, probably should be sent to the state attorney for that bullshit.

>Don't buy anything from WD.

I don't doubt your bad experience. But do you really think that there do not exist other customers of other vendors with similar, or even worse, stories? The tacit assumption is that if even one such story exists, then one should avoid the vendor, but I question if that's ever really possible past a certain scale. (Which is an indictment of scale itself! But that's another discussion)

It reminds me of an HN post during the Covid lockdowns telling people not to go to the hospital because they got MRSA at a hospital and were still dealing with it years later. It's not that they were lying, it's that they were giving bad advice based on their experience. It's a risk/reward calculation. If one bad experience is enough to delegitimize every service or product, then there would be NO services or products available to us.

I agree with what you’re saying but this talk of WD customer support going silent after a customer points them to a law they seem to be breaking -- and then proceeding to actually violate said law -- is not something I would write off as simply a bad consumer experience.

Their CMR 5400 RPM WD Reds are the best drives nowadays: quiet and fast.

If you don't need any kind of support maybe.

What are other good options?


If I'm correctly reading the comment you're replying to, it was against the law for WD to void the warranty. So, the warranty was not void.

It appears that their legal counsel would disagree. Treating call center workers poorly is not an acceptable response IMO.

> I quoted the Magnuson-Moss Warranty Act (enforced by the FTC) to them and outlined how it is illegal to void a warranty based solely on the removal of a sticker.

> They stopped answering my emails/tickets.

It reads to me that the communication happened in an asynchronous channel. This doesn’t appear to have anything to do with a call center employee having a bad call experience.

To engage with your prior comment: sending such a claim in an email seems like a plausible way to get a company to at least think more about their practices. To your point, it would likely take a successful lawsuit to change their lawyers’ minds on this.

Regardless of the incompetence of the company representative, the company is obliged to obey the law.

It appears that their legal counsel would disagree with their interpretation of that act. Treating call center workers poorly is not an acceptable response IMO.

My parents still have one. I had to tell them how to unplug the Internet from their router so they could power it up and get their photos off.

Who sells a product that at purchase time brags about it lasting a long time, but then when it needs support it is too old?

Intel does (or at least did) do this too, by not providing cpu microcode updates for some cpu models. "Only a small percentage of people use those..."

Thus our "Not a single damn thing from Intel" policy for the last few years. FUCK that crowd. :(

Never, ever again.

Does AMD do better on that?

So far it seems to be "yes". At least, none of the AMD gear we're using has had this happen.

Hopefully AMD doesn't go down that same dark path.

Pretty interested about this since I want to upgrade my hardware soon, do you have any sources (for both brands)?

I’d love to see a consumer / environmental protection law that says you need to provide functionality-preserving security updates to a product for as long as it works, or pay the current owner retail + 10% interest to buy it back.

I have a sphero bb-8, and two 2017 or newer cars that no longer function as purchased due to software updates. I’d like my $125,000 in cash please.

Even with their physical HDDs you have to be careful.

I haven’t kept up with their SSDs but last I remember they are still competitive/good there.

My x299 board refused to boot from a 2TB WD SSD drive with a single memory chip (forgot which color it was)... No issues with Crucial, Samsung nor Intel.

It may be due to some bad luck and selection bias, but I'd recommend against physical HDDs from WD, too. As far as I can remember, the only physical drives I owned that died were WD drives. Granted, they were WD Green, an especially cursed series, but I've not bought another WD product since.

I haven’t had any such problems across various WD drives. They have marketing shenanigans (what with the Red/Red Plus thing) but the drives themselves seem to be solid in my experience (haven’t tried any of their SSDs though).

I feel like back in the day, it was actually Seagate who had the bad reputation, but when recently looking at the Backblaze stats, it seemed that all three remaining brands (WD, Seagate and Toshiba) are solid.

It depends how far back your "back in the day" is... for me, the meme bad hard drive manufacturer was IBM with their Deathstar^W Deskstar drives.


I wrote about my experience with broken SSDs before. I had one from SanDisk, and when it broke, I contacted support. Without much difficulty, I received a replacement, which works fine. I also had the same problem with a Samsung SSD, but I wasn't able to get any meaningful response from them at all.

So the lesson is not about the broken drives, as all of them could be broken. It's about who is actually willing to fix the inconvenience for you.

SanDisk/WD will do it, and you'll have a working drive. Samsung, on the other hand, will do nothing in this case.

For years, I have been convinced that the only reliable storage solution is an array of drives from different suppliers.

My primary desktop runs ZFS with 4x10TB HDDs for bulk storage. These are in two mirrored pairs, so I have 20TB usable space. These consist of three different brands. The two same-branded drives are in different mirrors. When it is time to expand, I intend to swap new drives into the existing mirror, so that I will have mirrors with drives of different brands and ages.
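
For anyone weighing the same choice, here is a rough back-of-the-envelope sketch (plain Python, numbers assume 4x10TB drives; not ZFS-specific) of what the two common four-disk layouts trade off:

```python
# Usable capacity and failure tolerance for 4x10TB drives,
# comparing two mirrored pairs against single-parity raidz1/RAID5.
DISKS, SIZE_TB = 4, 10

def mirror_pairs():
    usable = (DISKS // 2) * SIZE_TB  # each pair stores one copy of its data
    worst_case_losses = 1            # losing both disks of one pair loses data
    best_case_losses = DISKS // 2    # one disk per pair can fail safely
    return usable, worst_case_losses, best_case_losses

def raidz1():
    usable = (DISKS - 1) * SIZE_TB   # one disk's worth of capacity is parity
    return usable, 1, 1              # any second failure loses data

print(mirror_pairs())  # (20, 1, 2)
print(raidz1())        # (30, 1, 1)
```

Same worst case, but the mirrors buy a better best case at the cost of 10TB.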

> My primary desktop runs ZFS with 4x10TB HDDs for bulk storage. These are in two mirrored pairs, so I have 20TB usable space.

You can have 30TB of usable space with SHR or RAID 5

I sure could, with

1. write performance equivalent to the slowest drive in that single vdev, rather than being striped across two vdevs

2. reads that cannot be distributed across the devices

3. riskier resilvers in the face of a failed drive

4. no ability to safely swap in new drives to existing vdevs as I described later in my post

5. less resilience to device failures

There are always tradeoffs. I hope this illuminates some that I have considered.

Go with LVM parity volumes.

1. For sequential it's up to 3x with full stripe writes, up to 3x in random.

2. Reads in a mirror can't be 'distributed' for sequential access anyway - repositioning would eat all the benefits; it's always faster to keep reading than to seek and read.

3. For 4 drives it's negligible

4. You can even swap in smaller drives, if your total dataset would fit, no hard limit

5. Exactly the same

And you are free to use the free space as JBOD, mirror, mirrored mirror or stripe.

That would be awfully hard to do as I am not using Linux (:

1. Yes, two mirrored vdevs are not the highest performance configuration for four devices, but they beat a single raidz vdev, which was the suggestion from the prior poster.

2. Random reads are more important than sequential for desktop responsiveness, and mirrors can serve different reads from each disk in the mirror, effectively doubling IOPS (minus overheads).

5. I am not sure I understand here. It seems like you're suggesting 3 data drives and a parity drive. This would allow at most one drive loss before data loss. Two mirrors allow at least one drive loss and up to two drive losses. If there were two parity drives, this would allow up to two drive losses without data losses, but it seems this would have to reduce your claim in point (1) to being 2x performance, since two drives would have to be used for parity writes.


2. Double IOPS sounds good on paper, but at 100% load it is still too slow: ~100 IOPS at best.

5. Wait. A mirrored mirror (RAID11) on 4 drives would net you only 10TB usable.

'Mirrored pair' sounds like an ordinary RAID10, which can tolerate one drive loss, because after that it's no longer redundant for half of the data. 'It can tolerate another loss on the still-redundant data' is... like pulling out: works most of the time, but some become parents on the first try.

Hence RAID5 has the same redundancy tolerance (one drive loss before losing redundancy); anything else is gambling.

The better question: do you have a proper backup to protect against logical corruption? Even RAID11 wouldn't help against rm -rf in the wrong path.

It is objectively measurable and subjectively it feels better than tests on a single drive for day to day activity. Objective and subjective improvement is good enough for me.

Two mirrors == two vdevs, each a mirror of two disks:

    $ zpool status
      pool: zroot    
     state: ONLINE
      scan: scrub repaired 0B in 02:01:39 with 0 errors on Wed Aug 16 14:08:39 2023

            NAME            STATE     READ WRITE CKSUM
            zroot           ONLINE       0     0     0
              mirror-0      ONLINE       0     0     0
                ada2p1.eli  ONLINE       0     0     0
                ada3p1.eli  ONLINE       0     0     0
              mirror-1      ONLINE       0     0     0
                ada0p1.eli  ONLINE       0     0     0
                ada1p1.eli  ONLINE       0     0     0
In a best case failure scenario, I can lose one disk from each mirror. In a worst case failure scenario, I can lose one disk.

With one parity disk out of four drives (for 30TB usable), in a best case failure scenario, I can lose one disk. In a worst case failure scenario, I can lose one.

These two descriptions of resiliency simply are not the same. I would obviously not run the pool in a degraded state for any longer than it takes to resilver the degraded mirror.

Assuming disk failures are independent and one disk has already failed:

- 2 mirror vdevs: 1/3 chance that a second disk failure causes data loss, 2/3 chance that it affects the other mirror

- 4 disks w/ 1 parity: 100% chance that a second disk failure causes data loss.

Again, this simply is not the same characterization of resilience. Please do not mistake this for me claiming that running this pool with three disks is anything other than a situation to remediate immediately.
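
A quick Monte Carlo sanity check of the 1/3 figure (the disk numbering and layout here are just illustrative):

```python
import random

# Two mirror vdevs of two disks each: disks {0, 1} and {2, 3}.
MIRRORS = [(0, 1), (2, 3)]

def loses_data(first, second):
    # Data is lost only when both failed disks belong to the same mirror.
    return any(first in m and second in m for m in MIRRORS)

def simulate(trials=100_000, seed=0):
    rng = random.Random(seed)
    losses = 0
    for _ in range(trials):
        first = rng.randrange(4)  # first disk to fail, uniformly at random
        second = rng.choice([d for d in range(4) if d != first])
        losses += loses_data(first, second)
    return losses / trials

# Should land near 1/3; for raidz1 the equivalent probability is exactly 1.
print(simulate())
```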

I take regular snapshots, and scrub regularly to ensure data integrity. Critical data is stored in at least three places, one of which is not my home.

> Please do not mistake this for me claiming that running this pool with three disks is anything other than a situation to remediate immediately.

About a decade and a half ago I set up a 48-drive Sun 4540 this way. The OS needed 2 dedicated drives on the first controller, so I assigned two neighboring drives on that controller near the intake fans as hot spares. The remaining drives were organized into mirror pairs with each drive on a separate controller. The final root was a concatenation of the pairs.

This setup proved to be very robust. Although I left the company when financial troubles pushed the founders into a messy divorce, one of the founders asked me to come back a few years later after they restructured. The 4540 had been moved to a new colo but otherwise ignored because none of the admins knew Solaris or bothered to check the monitoring systems.

When I returned, I discovered that in the time I was away 4 drives had failed, but ZFS/SMF happily took two of the failed drives offline and replaced them with the spares. The two other failed drives left their respective pools degraded but functional. Replacing the failed drives was easy once they arrived.

Although neglected for years, none of the applications or databases that relied on that server ever suffered any downtime from the drive failures.

> These two descriptions of resiliency simply are not the same. I would obviously not run the pool in a degraded state for any longer than it takes to resilver the degraded mirror.

Then you are just wasting 10TB.

I'm firmly in Murphy's camp regarding storage troubles (hell, I can't find a specific pic of my late dog which I know I had across 4 (four!) copies of my photos, ffs), and I treat RAID just as a means to increase availability.

In your case, if you really wouldn't run the degraded array until it is resilvered, then your availability is excessive in all cases except one:

how long would it take you to recover/restore your data after a catastrophic failure?

And why do you have only the critical data 3-2-1ed?

For the third time, my plan for growing the pool is based on being able to swap in new drives to trade them out with one existing drive in current mirror vdevs. As described previously, this allows me to grow two drives at a time and reduce risk of simultaneous failure in a vdev. As I mentioned before, this sort of swapping is not possible without degrading a raidz vdev.

As you have brought up, this configuration optimizes resilvering speed, aka minimizing time spent in a degraded state. Resilvering a mirror is faster than resilvering a raidz, as it requires only sequential read from the good drive and sequential write to the new drive. Resilvering a raidz thrashes all drives in the vdev and requires re-calculating parity along the way. Is it more likely that one of three drives in the degraded raidz1 fails during resilvering or that the one drive in the degraded mirror fails? That is a tough question to answer. What if all the drives are the same age? Well then the likelihood is greater that a resilver will cause a cascading failure. Hence my thrice-repeated desire to use the ability to swap new drives into old vdevs to mix ages, and my mixed-brand vdevs, as described in the root of this thread.

In short, resilvering is the highest-risk phase in a pool's lifecycle. I am paying an extra drive's worth of storage to optimize around this risk, and also to realize performance benefits we have discussed upthread.
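
To put rough numbers on that argument (an illustrative worst case assuming full 10TB drives; in practice ZFS resilvers only allocated blocks):

```python
TB = 10  # capacity per drive, in TB

def resilver_io(layout):
    """Approximate (TB read, TB written) to rebuild one failed drive."""
    if layout == "mirror":
        # Read the single surviving disk sequentially, write the new disk.
        return (1 * TB, 1 * TB)
    if layout == "raidz1":
        # Read every stripe from all three surviving disks to recompute
        # the missing data, then write it to the new disk.
        return (3 * TB, 1 * TB)
    raise ValueError(layout)

print(resilver_io("mirror"))  # (10, 10)
print(resilver_io("raidz1"))  # (30, 10)
```

Three times the read traffic, spread across every remaining drive, is why the raidz rebuild is the riskier window.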

Again, I will state that objectively measurable improvements combined with subjective perception of better desktop performance is sufficient for me to claim that the performance improvement matters.

My non-critical data is largely recoverable. Things like VM images and installers. These can be re-created and deployed from code that is in the critical portion, or downloaded from the internet. Indeed, my DR plan does not account for broad internet outages concurrent with a drive failure. A lot of the rest of the non-critical data is media that could be re-ripped from physical media (DVDs, Blu-rays). Being recoverable, it is not worth it to me to pay for cloud storage for the rest. Even "wasting" a drive's worth of storage, it is still a lot cheaper per gigabyte-year to ... you know, not pay for that.

So, why do you care so much about my specific storage setup? Does it pass muster yet, or am I still someone who is wrong on the internet to you?

> Is it more likely that one of three drives in the degraded raidz1 fails during resilvering or that the one drive in the degraded mirror fails? That is a tough question to answer

Not that tough. If the load of 4 drives working brings them to malfunction, then resilvering by pairs only prolongs the agony a bit, and technically you are now 2 (or even 3 or 4) times more susceptible to a failure during rebuild (for each stripe, something could fail).

> Hence my thrice-repeated desire to use the ability to swap new drives into old vdevs to mix ages, and my mixed-brand vdevs, as described in the root of this thread.

LVM allows that without limits imposed by ZFS. Not applicable for your setup, sure.

> , and also to realize performance benefits we have discussed upthread

Depends on the workload. I had very good response from RAID0/10 SATA setups a decade ago (even from WD Greens), nowadays I find it inadequate for the current bloat.

> A lot of the rest of the non-critical data is media that could be re-ripped from physical media (DVDs, Blu-rays).


> So, why do you care so much about my specific storage setup?

Mostly idle curiosity. Especially with my first take (using LVM) is just inapplicable for you.

> Does it pass muster yet, or am I still someone who is wrong on the internet to you?


But besides the idle curiosity, I'm interested in other people's setups and in giving free, no-strings-attached advice[0].

To sum up your setup from my POV:

You are running a RAID10 setup to have the benefits of slightly faster disk access and ease of resilvering. If there were a need to do something with this setup (and not limited to ZFS), I would consider the following:

- use a part of the disks as RAID5 (3+1) for the WORM-like data (ripped DVDs, installers, etc.): greater sequential and random throughput, more usable space

- run RAID11 for the super-critical data (so you can really lose 3 drives before you lose redundancy)

- there is also an option to run one mirror plus another mirror, or two simple volumes, to have protection against logical failure, but you have ZFS snapshots which cover this use case, for the most part

Of course this is just food for thought at best, and again, with ZFS (and [1]) there is not that much leeway, much less the need to do anything.

[0] heh

[1] https://news.ycombinator.com/item?id=37190493

Thanks for the quiz and free advice.

I'll ask you a few questions in return: does it seem like I have not put a fair bit of thought into my storage needs? Do I seem ignorant of disk topologies and their tradeoffs?

In fact, I have run LVM as a storage manager in the past. There are many reasons I prefer ZFS. It is more rigid in some ways, and the benefits outweigh the bit of vdev inflexibility that I have to put up with. Perhaps understanding these preferences would be worthwhile before trying to tell me how to improve.

Frankly, you come across as trying to convince me that I am wrong and that you know better. There were several points that I brought up repeatedly which you have dismissed out of hand. As just one example, you insist on dismissing the benefit of improved random IOPS, despite admitting that there is a measurable improvement when running a pair of mirrors; I stated several times that it is objectively better (which you never debated), and I experience a subjective improvement in responsiveness (which you repeatedly dismiss as trivial, negligible, or otherwise unimportant). Regardless of any difference of opinion, flatly dismissing what I have to say is not a great approach to educating or advising. Forcing me to repeat myself without actually addressing the point is rude at best.

So, given that you clearly do not listen to me, I will treat your advice as worth exactly what I paid for it. And I hope you will forgive me if I do not consider repeating the same ideas and dismissing what I have to say as "idle curiosity". I do not say this out of malice. I have no reason to believe your intent is anything other than you say, but your intent is irrelevant; the effect of your communication is as I have described above.

Hopefully someone else reading this thread will learn something.

The only reliable storage solutions are S3, ABS, GCS, etc. Nothing I’m doing at home is going to come close.

As an example, one thing I use my storage pool for is streaming movies via Plex (from Blu-rays I own). If these were stored in S3, in addition to paying for that storage monthly, I would have to pay >$2/movie to stream it to my own house, thanks to Amazon's $0.09/GB egress fees. This is on par with what it would cost to rent a movie from any of the major online streaming services that do rentals.

There are some things that are worthwhile to store and have available on the network, but not to pay a third party for.
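For what it's worth, the egress math is easy to sanity-check. A back-of-envelope sketch, assuming a ~25GB file per movie (a typical Blu-ray remux; your sizes will vary) and S3's standard $0.09/GB internet egress tier:

```shell
# Cost to stream one ~25GB movie out of S3 to your house.
# 25 GB/movie is an assumption; $0.09/GB is the standard egress rate cited above.
awk 'BEGIN { gb = 25; rate = 0.09; printf "egress per stream: $%.2f\n", gb * rate }'
```

That lands right at the ">$2/movie" figure, before you even count the monthly storage cost.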

Fair enough but then on the flip side, how bad is it really if you take a data loss?

I’m storing baby pictures and the like in S3 Glacier.

It is annoying and time consuming to recover.

There is a whole sort of middle area between ephemeral and needing 11 9s of resilience that you seem to be discounting. Perhaps you live in a world where those two extremes dominate. Maybe you can accept that others do not, and not ask strangers to justify their choices to you?

Here is a master class in how not to handle public relations and how to erode your brand reputation.

They don’t have much of one left as it is.

It’s what happens with limited competition and no real repercussions for the poor behavior that nets them billions.

I know WD used to be the darling, mostly of gamers it seemed. I never REALLY trusted them because the clients we had with the WD Black drives seemed to have more failures than my beloved Ultrastar or even Deskstar (I seemed to miss the Deathstar failures, I believe it was largely an issue of cooling). I sure was sad when I heard WD bought HGST.

I had one of the Deathstar drives that I started getting I/O errors on, so I went to replace it. When I was taking it out of the system I realized it was quite hot to the touch. This was a mini tower system and I had two of the 5.25" bays open, so I closed them off with cardboard, left the original drive in the system, and booted it back up, and it ran without errors for a couple-three years after that.

WD cleaned up their act after the Deathstar era and Seagate repeatedly shat the bed on HDD reliability (which continues to this day; Backblaze had a bunch of Seagate failures again). There are literally only 3-4 players left: HGST was bought by WD, and then there’s Toshiba, so the choices aren’t very deep. I’ve had good results with Toshiba but they’re often more expensive, and volume is a lot lower than the other players.

Samsung SSDs are another darling of the gaming community that has produced a parade of defective products over the last decade. Literally since the 830 series, I think there have only been one or two models without a massive firmware defect or faulty flash.

Deathstar was well before WD owned them I believe. I think that was post IBM, Hitachi or HGST.

Isn't it funny that Backblaze finds Seagate fairly high in failures, but keeps going with them? But they are optimizing for something different than I was. I was managing fairly small clusters, in the 10-150 node range. I considered it fairly expensive to have a drive failure; even if we used "remote hands" to replace the drive, we'd easily spend a couple hours on each failure between replacing, rebuilding, verifying, and procuring and testing replacement drives. We ended up using mostly pairs of Deskstars between 2000 and 2010, when I was dealing with most of my mechanical drives.

We have been using Samsung Pro and Evo 860 drives at my current job, probably have 40-50 of them, and have had good luck with them. But, I'm starting to wonder if we should rely on them based on the discussions here. We only really use them in dev and staging, production is largely Intel.

I only recall one Samsung failure I've had, I had one as a ZFS log/cache drive for ~3 years and it started reporting errors.

Yes, Backblaze optimizes for planned failure rates and the consequences for their operations, and they will take cheap drives that fail if they're cheap enough and you can sell them 10 million. That's why they didn't overly care about the Seagate failures from the Thailand flooding, etc. We need hard drives, we need to not pay $1k a drive (whatever), let's go have people buy the externals we can get. And Seagate has always shipped by the truckload, and now they're cheap because people avoid them, lol (legitimately).

I also agree with your point that you're optimizing for something different than Backblaze. Same for a lot of homelab users. High failure rates on Seagate? If they really do fail all the time, that might actually make them worse in consumer service, where your whole disk (possibly array; hope you had backups) can go down. Some of these failure rates have been absurd, like a 130% annualized failure rate (everything dying within 8 months) at the worst of it. And it's just consistently always Seagate that is fucking up, and while I'm sure This Time Is Different, I have so far managed to evade hard drive failure completely since I kicked Seagate to the curb, including my Seagate 3TB from the worst model (though I'd bet it's dead by now, lol) and an ancient Seagate laptop HDD from my first laptop (Conroe).
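To put that failure-rate figure in perspective, here's a crude conversion (it assumes failures are spread evenly over the year, which they aren't in practice — real failures cluster):

```shell
# A 130% annualized failure rate means 1.3 expected failures per drive-year;
# dividing 12 months by the AFR gives a rough mean time to failure.
awk 'BEGIN { afr = 1.30; printf "rough mean lifetime: %.1f months\n", 12 / afr }'
```

So "everything dying within 8 months" is about the right ballpark for the worst models.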

WD has mostly been fine except for the stealth SMR thing. I've never had a problem with any of my WD Reds. I knew what was on the market at the time (8TB Reds and HGST white-label heliums), and I bought a bunch (8x8TB for $135/drive; it was cheap) and shucked them, and I got those exact models and it was fine. I should spin up my old USB 3.0 RAID array and see if it still spins, 4x WD Red 2TB as well; never been a problem (and that was 5400rpm). Toshiba has also been fantastic in terms of quality, and the X300 and N300 are basically the same in all respects other than name (the bit error stuff is bullshit). Toshiba has much smaller volume than either of the other two, but I love their products when they're available and price-competitive; they've done great for me.

Personally, for HDDs, I just bought some HGST helium 14TB SAS3 drives from serverpartdeals for $135 a pop. The fanout of a SAS controller is nice in my environment anyway. That's going to be my cold storage to migrate off the 8x8. I don't see a reason to run 6+2; I do see a reason to run 3+1, just increase capacity, and keep alternate backups of the important stuff.

You can also just buy 7.92TB enterprise U.2 2.5" TLC NVMe drives for around $350, but then you need a tri-mode adapter (if SAS) to run them. But that's preferable to me rather than running consumer M.2 NVMe at $170-200 for 4TB. You can also run them off PCIe adapter cards or M.2 slots (with a riser cable), of course, but they do run hot and I like the fanout factor; not like I'm ever going to hit anything super hard.

I saw someone else mentioning 7.92TB enterprise drives for ~$800, so $350 is pretty compelling, even crazy. But I'm just putting together a home server right now and it's all built out with 2.5" SATA drives (9x Samsung 860 EVOs I got cheap when my previous company went out of business; going to put in 7x Toshiba spinning drives once I get the data on them backed up). Got a Dell R730 with 256GB RAM and 24 cores for $350 landed off eBay and couldn't resist. Set up a VM last night for my son to do a backup of his laptop before he upgraded the OS; wanted to foster his desire to do backups for sure. :-) "Ok, send me your SSH public key by magic-wormhole and install restic..."

> I seemed to miss the Deathstar failures, I believe it was largely an issue of cooling

The biggest trouble with WD drives (at least pre-2015) was the quality of the power. If you had a good PSU, you wouldn't have had a problem with WD. If you had some shit like Thermaltake...

I had a case where a TT PSU consistently broke a RAID10 because there were too many USB devices. Replaced it with a proper PSU of the same rating and everything went fine.

>It’s what happens with limited competition

What do you mean? The SSD space is pretty competitive. If you go on pcpartpicker.com and search for SSDs you see dozens of brands. Any company can slap together an SSD by soldering a controller and commodity flash together.

>Any company can slap together an SSD by soldering a controller and commodity flash together.

That's kind of why I trust and prefer Samsung (yes, despite the firmware woes) and Micron (aka Crucial) for SSDs. Kioxia too, if we're talking enterprise.

They're all in the business of making the actual NAND and maybe also the controller.

To Samsung's credit, they tend to fix firmware problems eventually.

As such, by avoiding recently-released drives, I've had nothing but positive experiences with my fifteen or so Samsung drives, including several models that originally shipped with serious firmware bugs.

While essentially obsolete at this point, I've also had excellent long-term experiences with (pre-WD) HGST SAS SSDs that use third-party (Intel) flash.

Kingston still seems to make their own flash and controllers too. They’ve been around for a long time (since the 80s, per Wikipedia)


Let's not forget the Kingston V300 scandal [1]:

> The first generation V300 (which was sampled to media) used Toshiba's 19nm Toggle-Mode 2.0 NAND but some time ago Kingston silently switched to Micron's 20nm asynchronous NAND. The difference between the two is that the Toggle-Mode 2.0 interface in the Toshiba NAND is good for up to 200MB/s, whereas the asynchronous interface is usually good for only ~50MB/s. The reason I say usually is that Kingston wasn't willing to go into details about the speed of the asynchronous NAND they use and the ONFI spec doesn't list maximum bandwidth for the single data rate (i.e. asynchronous) NAND. However, even though we lack the specifics of the asynchronous NAND, it's certain that we are dealing with slower NAND here and Kingston admitted that the Micron NAND isn't capable of the same performance as the older Toshiba NAND.

[1]: https://www.anandtech.com/show/7763/an-update-to-kingston-ss...

It depends on the specific product. For instance the KC3000 uses Phison E18 + Micron 176-Layer 3D TLC combo[1] that's used by a bunch of other brands[2].

[1] https://www.techpowerup.com/review/kingston-kc3000/

[2] https://www.techpowerup.com/ssd-specs/?f&controller=Phison+E...

Same, I can vouch for them as well. I still have one that's nearly a decade old and still being used fine.

Has Samsung replaced the SSDs with little life left because of their broken firmware?

So you trust the companies that have fewer competitors and outsized market power?

They trust the companies that have ownership of key parts of the stack, rather than picking vendors for each piece and slapping them together. Such companies have a lot more internal expertise.

Yes, it'd be nice to have more vendors like this, but you don't necessarily build a more robust industry just by having more vendors that buy NAND and a controller, make a few firmware tweaks for differentiation, and slap together a design based on an app note.

The presence of the second-tier manufacturers forces an interoperable world at the component level.

If Samsung made flash that only worked with Samsung controllers and Samsung-exclusive PCB designs, they'd have a far smaller addressable market.

I'm surprised we aren't seeing more flexibility out of the third-party market, though. I could see someone saying "I'm willing to spend $100 for the most bestest possible 256GB SSD for an industrial use case", and they'd sell you the parts that are normally sold as a 2TB MLC drive, but with a controller permalocked to using it all in pseudo-SLC cache mode, for example.

> The presence of the second-tier manufacturers forces an interoperable world at the component level.

This is ignoring that there was a market for flash before we had NVME SSD controllers, and that there still is one.

NVME is a big chunk of where flash goes now, but I don't think vertical integration is going to make flash disappear in the secondary market.

In any case, yes, it's good to have a lot of manufacturers around for the industry. But it may not be as safe a bet to buy from those other guys. There are enough gotchas in the design and support, and there's also the whole issue of supply chains and component quality.

The HDD space has fewer vendors (especially in the consumer space).

It certainly seems that way. Earlier this year I was looking for internal 2.5" HDDs with more than 1TB of capacity, and I could only find Seagate drives.

I try to always buy from different vendors, but in this case I had no choice but to buy 2 drives of the same model.

I just looked because I thought that can’t be right, and it seems like all three manufacturers (WD, Seagate and Toshiba) sell such drives


Huh, okay I'm facepalming now. I have no excuse.

I even used PCPartPicker for this PC build but for some reason it didn't occur to me to search for possible drives there to see what other stores could have them besides my usual stores (also I avoid Amazon for anything related to computers).

Thank you for removing my tunnel vision.

Be warned - all SATA drives there are SMR.

For a desktop drive, it shouldn’t matter since you’re probably not going to RAID it, right?

(I guess it might be slower for big transfers, but since you’re probably going to keep frequently-accessed files on an SSD, that shouldn’t be a huge problem either)

I do have them in a ZFS mirror, actually. And I noticed the performance is abysmal, both read and write. So the warning is a good one.

For my use case it's just Yet Another Different Place (tm) to keep another copy of some files, so I can tolerate low performance if it means it's cheaper.

Since I'm still learning and don't have enough confidence in my abilities when it comes to storage reliability, I know that it's just a matter of time until I mess up and lose a few drives due to a silly mistake on my part.

Hmm, I’ve read that the combination of SMR drives and ZFS is a surefire way to lose data though.

Apparently, if you need to rebuild the array due to losing a drive, it will take so long (and stress the remaining drives so much) that the chances of losing another drive in the process are non-negligible, thereby losing the data on the entire array.
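To get a rough sense of the scale, here's a back-of-envelope rebuild-time comparison. The sustained rates are assumptions (roughly 150 MB/s for a CMR drive; SMR drives can fall to 40 MB/s or much lower once their staging cache fills), and real rebuilds are slower still because the array is also serving normal I/O:

```shell
# Idealized time to rewrite a full 8 TB drive at the assumed sustained rates.
awk 'BEGIN {
  mb = 8 * 1000 * 1000;                                     # 8 TB in MB
  printf "CMR rebuild: ~%.0f hours\n", mb / 150 / 3600;     # ~150 MB/s sustained
  printf "SMR rebuild: ~%.0f hours\n", mb / 40 / 3600;      # ~40 MB/s when saturated
}'
```

Several days of sustained full-throttle reads on the surviving drives is exactly the stress window being described.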

Therefore, it might be wise to consider CMR drives for your use case.

As a general tip, all three manufacturers have specific NAS-class drives, which cost a bit more but are more reliable and usually CMR:

- WD has their Red Plus/Pro line (the non-Plus/Pro Reds are SMR, so avoid those)

- Seagate has their Ironwolf (Pro) line

- Toshiba has their N300 line

edit: ah, you just have them in a mirror. In that case, the magnitude of the risk may be less as you only need to copy data once to rebuild. I’m not sure though as I’m not an expert on the topic.

> edit: ah, you just have them in a mirror. In that case, the magnitude of the risk may be less as you only need to copy data once to rebuild. I’m not sure though as I’m not an expert on the topic.

Sadly, the problem still exists even for a simple mirror.

Though it can be mitigated by configuring a slow rebuild rate, so the new drive would have the time to perform the maintenance.
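On Linux OpenZFS, one knob for that is the resilver time-slice tunable. A sketch (the value here is illustrative only, not a recommendation — throttling the resilver also lengthens the window in which you have reduced redundancy):

```shell
# OpenZFS on Linux: zfs_resilver_min_time_ms is the minimum time (in ms)
# spent resilvering per txg (default 3000). Lowering it throttles the
# resilver, giving an SMR drive idle time to flush its shingle cache.
echo 500 > /sys/module/zfs/parameters/zfs_resilver_min_time_ms
```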

I appreciate the advice. Thanks!

Re-checked: going by the cache size, the Toshiba MQ03ABB200 (2TB) is a CMR drive.

Rule of thumb: drives with 16MB of cache (or less, but you probably don't want an 8MB-cache drive in 2023, lol) are CMR; drives with >128MB of cache are 99% SMR.

And look at my sibling comment

Related question since we’re on the topic of cache: does it matter how much cache your drive has? I’ve seen drives with various cache sizes for sale, but I’ve never really looked into the difference.

Seagate Exos X20 hard drives have a cache of 256 MB and they're using Conventional Magnetic Recording (CMR).

You are technically correct, best kind of correct. But you missed this:

> It certainly seems that way. Earlier this year I was looking for internal 2.5" HDDs with more than 1TB of capacity, and I could only find Seagate drives.

I didn't miss it; I was under the impression that the discussion had steered to all kinds of hard disks. I don't think you can install the previously mentioned WD Red and Seagate IronWolf into a laptop :-)

By the way, I'm shocked that someone would put a SMR HDD into a laptop. Most laptops already had slow disks when CMR was used, but with SMR, I don't even want to think of the results.

> I was under the impression that the discussion steered to all kinds of hard-disks.

Fair assumption.

> [...] laptop [...]

If you said this because of any message from me, then I probably should have given more context.

This was for a desktop PC upgrade, in a Corsair 5000D case. The OS (FreeBSD), user files, and almost everything else are on M.2 (two drives, in a ZFS mirror). I just wanted a mechanical drive where I could send ZFS snapshots, local git mirrors, and some other rarely-read, rarely-overwritten files (stuff like videos/epubs/etc. that I occasionally serve over HTTP). The purpose is just having as much storage as cheaply as possible, and more conveniently than an external USB HDD (which I also have and use for backups, but I don't want to keep it connected to the PC).

The initial plan was to use 3.5" HDDs because the 5000D case technically supports them, but while actually building it and seeing the PSU cables, my noob lazy self thought about how much of a PITA it would be to actually use 3.5" HDDs (or how ugly it would look if I put them somewhere other than intended).

So the plan changed to using the 2.5" mounts, and here I am now.

I'm currently working on building a NAS (separate from this Corsair 5000D desktop build), so soon-ish I should be able to just use that NAS for these files, and use the free 2.5" mounts for SSDs to increase the system ZFS pool.

Hopefully this additional context clarifies things.

SMR is ungodly slow as a NTFS Windows boot/system HDD, compared to a regular HDD.

Why would you use an HDD as a boot drive in this day and age anyway? SSDs are cheap (around €10 for the cheapest 128GB drives where I live [1]).

[1] https://tweakers.net/solid-state-drives/vergelijken/#filter:...

>Why would you use an HDD as a boot drive in this day and age anyway?

I didn't purchase it but thought it would have similar boot times to a regular HDD, which is only a few seconds more of a delay than a SATA SSD.

Nope, this was one of the 8TB WD SMRs, and it takes about 10 times as long to boot as a regular HDD. Still, that's only a few minutes total, and it's easy to just boot a few minutes before I want to use that particular PC. Once Windows is initially loaded into memory (while the backlog of data is slowly written back to the SMR), it mostly works like normal after that.

Plus, when you've got terabytes, there's plenty of room for a number of 64GB partitions to install various bootable OSes in, without compromising much of the free space you have left over for bulk storage in the large partition(s).

This situation is particularly volatile because some very effective journalists have lost their work stored on Western Digital's broken drives.

My favorite story, of which I heard a few very similar sounding ones over the years, is the case of dead SanDisk SATA SSDs _sometimes_ timing out on all commands after booting up. So if you have one of those drives in your system and you have one of 3 common motherboard vendors, you'll sometimes just get a black screen. This seems to occur consistently when the computer was on in the last 30 minutes, leading most to assume bad motherboard or PSU. I replaced an entire aging system because I couldn't nail down the issue - and then it followed me into the new one because I falsely assumed a misbehaving drive wouldn't cause my BIOS to wait forever. This was the most frustrating year I've had with computers where I'd run my machine 24/7, just so I wouldn't have to deal with repeatedly trying to turn it on with 30 minute breaks in between.

So, what does HN consider to be good brands to buy these days, when it comes to SSDs? Sorry for bluntly asking.

All the major manufacturers have had significant issues at one point or another particularly with consumer drives, and it seems to go through cycles as well so there is no satisfying perfect answer here imo. I also haven't seen as much hard data on SSDs and they've been evolving so fast, so everything feels a lot murkier. At least with spinning rust we have Backblaze's drive stats going back many years now.

- If it's a single drive situation and reliability is very important, something aimed at "enterprise" tends to be a good bet. More firmware testing, more over provisioning, and so on. And if nothing else there tends to be less BS around getting warranty coverage honored. Also costs significantly more. Micron, Intel D3s etc.

- If it's multi-drive, don't focus on finding the one best brand, use ones from different major brands with redundancy. The goal isn't to try to ensure no drive failures, but rather lower the risk of correlated drive failures so that when they happen the system has time to restore redundancy and there's no issue.

- If it's single drive and some down time is acceptable for cost then just roll the dice, though even there I'd stay away from QLC and the ultra mega bargain basement stuff.

In all cases buy from somewhere reputable that actually sells drives directly, like B&H or CDW or the like. If you buy from Amazon beware of counterfeits and also "new" stuff that is heavily used. Always check SMART stats right away and make sure drive lifetime is where you expect (not hundreds/thousands of hours on something "new"). And of course, always have good backups and/or some way to rapidly bring stuff back up.
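Checking those SMART stats takes seconds. As a sketch of the idea, here's the check run against a hard-coded sample `smartctl -A` attribute line (on a real drive you'd run `smartctl -A /dev/sdX`; field positions can vary by vendor, so treat this as illustrative):

```shell
# Flag a "new" drive whose Power_On_Hours attribute says otherwise.
# The line below is sample smartctl -A output, hard-coded for illustration;
# field 10 is the raw value (hours of operation).
line='  9 Power_On_Hours  0x0032  099  099  000  Old_age  Always  -  14327'
hours=$(echo "$line" | awk '{ print $10 }')
if [ "$hours" -gt 100 ]; then
  echo "suspect: $hours power-on hours on a 'new' drive"
fi
```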

Why stay away from QLC? Does it not have its place in high-read low-write consumer workloads, such as software or media libraries?

>Why stay away from QLC?

Same basic idea as SMR hard drives: the savings are too minor to be worth the asterisks. If you're running at megascale and can afford to customize your workloads, software, OS, and firmware, around it and at scale the ROI crunches out as worth it that's fine of course. But at the individual or even SMB level? Nah. Like, taking Samsung as a fairly large, decent brand example, their 4TB TLC EVO drive is ~$220 right now, whereas their 4TB QLC QVO is... $200. Heck for whatever reason their 1TB at B&H costs (a very little) more, $65 for the QVO, while the EVO is $60. Is $5-20 over terabytes and years worth the slightest hassle or weirdness? QLC is IMO essentially a scam at the consumer level, dramatically impacting reliability/flexibility/performance consistency for really minimal cost savings.

>Does it not have its place in high-read low-write consumer workloads, such as software or media libraries?

Frankly, no. I mean, putting aside whether SSDs themselves make sense in such an application at all vs. spinning rust (the same money will buy you a nice 14TB hard drive), to the extent they do, I still think the above holds. QLC is just physically, fundamentally sketchier tech. And how sure are you (and how much trouble do you want to go through) about what your workload will be like long term? If you put the money and time into getting together a decent little SSD NAS with a bunch of terabytes, and then a year or two down the road want to have some VMs on it too, or have a fileshare that gets hit harder than you expected, or there's some bug that causes write amplification, and then your QLC drives end up hosed, was that really worth a few cups of coffee?

Maybe so! SMR really is worth it to a few places as well. Or ZFS dedup for a software example (with hardware implications). But all of this for me falls under the heading of "if you don't have a specific reason for it, don't".

We still hear comments about specific brands and models, but the lessons on these are always learned after months or years of the devices being sold. We can still ask the question, sure. But a more practical lesson is to go for redundancy, because this will keep happening to us:

- Not purchasing all items from one manufacturing lot

- Not pairing more items from one lot than you can afford to lose

- Staggering device lifetimes when possible

- Running long tests when putting devices in service

- Keeping an eye on drive statistics once in production, and acting on them

- Running schemes (RAID, backup, whatever) where it's actually practical to replace one device while in operation

- Testing those schemes so they will actually work

- Effective backups

- Being ready to eat the cost of the occasional drive

- etc.

I've been quite happy with my SK Hynix NVME drives.

But I think to a certain extent it's a bit of a crapshoot as to whether you're hitting a manufacturer while they have an undiscovered issue.

I bought a whole series of Samsung drives in both NVME and SATA and they've all been perfect but this just all happened to be before their more recent issues. Before that I bought a bunch of Intels that all died in annoying ways (I believe they turned out to have a pretty bad write amplification bug tied to their power saving behavior).

In production we try to use only Intel, and have had good luck with them over the decades. I say "try" because Dell makes it hard to tell who makes the different drives, but sometimes you can track it down by the specs or part numbers.

In our buying cycle a couple of years ago we ended up with some "Toshiba THNSN81Q92CSE 1.92TB SATA eSSD" drives in two systems, and ran into some severe performance issues, with system IO wait time going through the roof. Ended up spending significant time with Dell support, with them identifying a couple drives in each box to replace, and also doing full "RAID chain" replacements (controller, cabling, backplane). All of the replacement drives we've gotten have been Intel 1.92TB, so maybe Dell is seeing problems with Toshiba? Since then the systems have been solid, so maybe it was a part of the RAID chain.

Aside: Decades ago, right when Intel was coming out with their SSDs, I had a consulting client that was a rural ISP. Their NFS server was on 8x 10K WD hard drives in RAID-10. They continually struggled with performance, and we kept tuning, but they would regularly just outrun the IOPS, mostly with e-mail access. A friend of mine happened to have a 600GB Intel SSD production sample, so I asked if I could borrow it for a few weeks. I sent this laptop SSD that would fit in the palm of my hand out to the client, had them hook it up to the NFS server, and then did a "pvmove" from the RAID array over to the laptop drive. Steadily, over the next few hours, the performance issues disappeared. It was pretty fantastic to see an 8-drive RAID array bested by a laptop-form-factor drive.

This can hardly have been a surprise. Intel's very first mass market SSD from 15 years ago had enough IOPS to embarrass a whole room full of HDDs, and unless you know exactly what you are doing, a Linux NFS server will not benefit from any width of RAID-0 due to concentrating all filesystem metadata load on a single disk or pair of disks in the array.

Don't have one to recommend for NVMe.

For SATA though... we have several Crucial MX500s 1TB that are approaching 2PB written, and none have failed. It's a shame this didn't transfer to their NVMe SSDs- they are all pretty harsh underperformers (though none broke).

Are you referring to the Crucial P5 Plus NVMe disks?

Had a few. P2 for density, P5 (not plus) for speed. Especially the P2 would sometimes degrade performance below what I'd expect from a SATA SSD.

Samsung Pro-series NVMe seems to be liked quite a bit.

Damn, it's incredible how much HN hates AMP, and how much better and faster this AMP presentation is compared to Ars' usual hostile layout.

Recently had a bad firmware version causing orders of magnitude increased NAND wear. Apparently took Puget Systems bugging Samsung about it to get a fix issued.

Nobody’s perfect.

The best case scenario is that the product ships with perfectly bug free software. The second best case is that when a major bug is discovered it gets fixed quickly which Samsung did. IMO that’s a bigger measure of a company’s quality, how they react when something inevitably goes wrong.

Puget Systems' handling of that issue earned them a $10K order from me after the HN story appeared. Super impressed with the customer focus exhibited by them.

They have problems in Linux with autotrim, to the extent that Ubuntu puts a trimming script into cron if it detects a Samsung SSD at install time.

What's the best NVMe drive for linux these days? (Seamless firmware updates, autotrim, rock-solid reliability)

By the way, I've discovered that Seagate has published the source code for openSeaChest [1]:

> openSeaChest is a collection of comprehensive, easy-to-use command line diagnostic tools and programming libraries for storage devices that help you quickly determine the health and status of your storage product. The collection includes several tests that show device information, properties and settings. It includes several tests which may modify the storage product such as power management features or firmware download. It includes various commands to examine the physical media on your storage device. Close to 200 commands and sub-commands are available in the various openSeaChest utilities. These are described in more detail below.

I also recommend reading the Arch Linux wiki [2].

I'd also have a look at the Linux kernel quirks [3][4].

[1]: https://github.com/Seagate/openSeaChest

[2]: https://wiki.archlinux.org/title/Solid_state_drive/NVMe#Firm...

[3]: https://www.kernel.org/doc/html/latest/nvme/feature-and-quir...

[4]: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/lin...

Might not be the answer you want, but I found it easier to not trust any brand and instead opt for a mirror-RAID approach. Yes, it's more expensive (and still has its own risks) but (IMO) it's better than trusting any manufacturer.

With only a handful of brands out there, you're bound to get burnt by all of them sooner or later.

Samsung NVMe and Sabrent Rocket in my opinion. Price goes down every year too, which is a plus.

RAID5 for me. I had 3 disks die over the years, but no data loss.

I have one of these but haven't used it heavily yet, so I'm looking forward to that.

In other news, all of the WDRed CMR drives in my ZFS array started throwing read errors simultaneously so I've ripped them all out.

I'm about done with WD.

>In other news, all of the WDRed CMR drives in my ZFS array started throwing read errors simultaneously so I've ripped them all out.

I don't get it. Is the implication that their showing errors at the same time should be a freak occurrence, and that such an occurrence means WD drives are low quality? Drive failures are more correlated than you think. If you bought and installed the drives at the same time, such a failure case is totally expected. The drives probably came from the same batch and experienced a similar amount of wear, so it's only logical that they experienced the same failure mode.

Ah, the RAID array love story: one of many similar disks fails, you replace it, the array starts rebuilding, which increases reads from the other drives, and before the rebuild is complete, more disks from the same batch start failing.

I haven't dealt with storage hardware much for some 6+ years, since I switched to cloudy solutions. But at the beginning of my shared-storage-building journey, I considered arrays built from different disks somewhat "lame" or "unprofessional". I changed this viewpoint 180° after seeing multiple similar drives fail within a very short (hours, days) timespan for the second time.

I now have more trust in a home-built ZFS setup using different disks than in a dedicated storage appliance using enterprise-grade disks with nearly consecutive serial numbers.

A RAID array should always be scrubbed every few months to prevent this exact situation.

Scrubbing helps fix random errors in the drive data, but it won't prevent correlated disk failure. If a drive fails during the scrub, then you still have to replace the disk, and it's still very likely that another disk can fail in that time. If you just want to eliminate lemons, then you can do a burn-in test before adding it to your array, but IMO even bad disks usually don't fail that quickly.

Scrubbed as in erased? Where do you temporarily place 1PB+ of data while that’s happening?

Scrubbed, as in performing a periodic read of the entire array to 1) check the data for consistency and 2) exercise the drives.
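In ZFS terms that periodic read is `zpool scrub`. A minimal sketch of the follow-up check, assuming a hypothetical pool named `tank`; the `scan:` line is inlined sample output here so the parsing runs without real hardware (on an actual box you'd pipe `zpool status tank` instead):

```shell
# Sample "scan:" line from `zpool status tank`; on a real system, use:
#   status_line=$(zpool status tank | grep 'scan:')
status_line='  scan: scrub repaired 0B in 05:12:33 with 0 errors on Sun Oct  1 05:12:34 2023'

# Extract the error count reported by the last scrub.
errors=$(printf '%s\n' "$status_line" | sed -n 's/.*with \([0-9][0-9]*\) errors.*/\1/p')

if [ "$errors" = "0" ]; then
  echo "scrub clean"
else
  echo "scrub found $errors errors"
fi
```

Pair this with a monthly cron entry running `zpool scrub tank` and you get both the consistency check and the regular full-drive exercise the parent describes.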

I guess so! I was speaking with someone on Mastodon today with some WD Reds and they failed at almost exactly the same number of power on hours as mine (~43k hours).

Given mine were in a ZFS mirror, they would have had almost identical wear, so I suppose we could say that they're predictable (though mine are the CMR drives, which you don't get in the 4T models these days, and SMR drives are bad for ZFS).

If I was to replace them with the same drives I'd have a rough idea when to replace them before they start failing but I'll be buying larger capacities these days so those numbers are probably out the window.

These are spinning disks, not SSDs. Spinning disks are not known to fail like this, unless the manufacturer made a horrible mistake.

Spinning disks absolutely do fail this way. Drives from the same batch can easily work 5+ years fine and then fail within a few hours of each other. Even worse, in a RAID the additional load from starting a new replication after a hot swap can push a borderline drive over the edge, and this is precisely the time when your data is at its most vulnerable.

I've worked at places where they always maintained a sizable inventory of new disks so that every time a new machine was provisioned, it would receive a RAID comprising at least 3 disks bought from different batches at least a couple of months apart, and as much as possible from different suppliers.

On the contrary. The same principle applies to them. Spinning disks might not have NAND cells that wear out, but they have other parts with manufacturing defects that experience wear: bearings, actuators, heads, etc. If you had 100 identical cars that rolled off the assembly line in sequence, you would expect their failure dates to be clustered rather than evenly distributed.
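The correlation can be illustrated with a toy simulation (the lifetime numbers below are made up for illustration, not real drive statistics): drives that share a batch-wide lifetime offset die much closer together than drives that don't.

```python
import random

# Toy model: same-batch drives share most of their lifetime variance,
# mixed-batch drives do not. We measure how often a second drive dies
# within a 72-hour rebuild window of the first failure.
def simulate(shared_sigma, independent_sigma, trials=10_000, window_h=72):
    rng = random.Random(42)  # fixed seed for reproducibility
    hits = 0
    for _ in range(trials):
        batch = rng.gauss(40_000, shared_sigma)      # batch-wide lifetime offset (hours)
        a = batch + rng.gauss(0, independent_sigma)  # drive A lifetime
        b = batch + rng.gauss(0, independent_sigma)  # drive B lifetime
        if abs(a - b) < window_h:
            hits += 1
    return hits / trials

same_batch  = simulate(shared_sigma=5_000, independent_sigma=100)
mixed_batch = simulate(shared_sigma=0,     independent_sigma=5_000)
print(same_batch > mixed_batch)  # correlated drives fail much closer together
```

Under these assumptions the same-batch pair lands inside the rebuild window orders of magnitude more often, which is exactly why mixing batches (or vendors) helps.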

On a tangent, wouldn't it make it safer if you didn't start a NAS with a bunch of identical drives, but you got them from different manufacturers and models?

They'll still fail, but hopefully they'd be less likely to do it at the same time.

Yes, that’s generally good advice.

this is news to me and taken away my excitement of having my first nas. it's very good to know though. i wonder about good-guy computer stores, mixing batches for naive customers like me, ordering 2 identical drives thinking it was best. it's crazy, it really defeats the purpose of raid 1 if both drives are likely to fail at the same time. i guess i should buy another different brand drive, and keep one of the current pair in a drawer as the future replacement.

To take some further wind out, with current disk sizes there’s a decent chance of a second disk failing while you’re still replacing the first. Reading all the data off a disk is a stressful operation, you see.

But that means you need RAIDZ2 or erasure coding to be reasonably safe, which takes you well outside what most of these turnkey systems can handle.
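The rebuild risk can be roughed out from the unrecoverable-read-error (URE) spec on a drive's datasheet. A back-of-envelope sketch (the 1e-14 and 1e-15 per-bit rates are typical datasheet figures, not measurements of any specific model):

```python
import math

# Probability of hitting at least one unrecoverable read error (URE)
# while reading an entire disk, e.g. during a RAID rebuild.
def p_read_failure(disk_bytes, ure_per_bit):
    bits = disk_bytes * 8
    # 1 - (1 - p)^bits, computed stably for tiny p
    return -math.expm1(bits * math.log1p(-ure_per_bit))

TB = 10**12
consumer   = p_read_failure(12 * TB, 1e-14)  # consumer-class URE spec
enterprise = p_read_failure(12 * TB, 1e-15)  # enterprise-class URE spec
print(f"consumer:   {consumer:.0%}")   # roughly a coin flip at 12 TB
print(f"enterprise: {enterprise:.0%}")
```

At datasheet rates, reading a full 12 TB consumer disk during a rebuild has a meaningful chance of tripping a URE, which is the argument for RAIDZ2 or erasure coding over single-parity schemes.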

maybe i'll start a data-recovery savings account instead.. for whenever it happens

Backblaze works well enough.

SSDs in RAIDs famously all failed at once. Spinning disks basically don't, outside of (typically) firmware faults. Out of the 100,000 spinning disks I've managed, I've never had them fail at the same time.

Mechanical things don't fail as evenly as digital things.

As someone that has set up a lot of NAS devices, we're talking thousands of disks over the years, I would not put my money on your statement.

There are a lot of simultaneous failure modes. HP had one in SMART firmware that killed disks at some fixed number of hours.

There are also temperature excursion events that can lead to arrays failing within 10s of hours of each other, not enough time to do a full rebuild of a single disk.

Seagate had really crappy 3/4 tb disks that loved to fail rapidly and nearly all at once.

So your examples of failures include a firmware example, which I said "outside of", and an example where the drives were abused?

And another poster is talking about SSDs?


Intel also had a famous bug / 'feature' where the SSD refuses to respond after a fixed number of writes.

If all your drives were Intel and part of the same array, they'd all fail at exactly the same time.

Reliable storage is hard.

Eh? I've been admining RAID arrays since before SSDs were a thing. Building your array from disks from different batches to reduce the likelihood of simultaneous failures has always been the advice.

I make no assertion as to WD quality, but failures like you describe have been the bane of sysadmins quite some time.


It happens. On my first job a batch of drives started failing daily, in sequential order by serial number. There was a bearing defect - the vendor ended up replacing about 400 drives, which meant many weekends of migration and restores.

Sounds like the exception that proves the rule.

Just because you were in a plane crash, doesn't mean it's common or frequent.

(With an apology to that guy who was hit with a nuclear weapon twice)

It happens. That scale is rare, but I can think of 5-6 cases I'm aware of where vendors proactively replaced many disks.

ZFS is very sensitive and takes a long time to rebuild. I only choose RAID-10 for this reason, because I'd rather have most of my data than none of it.

More than 2 HDDs starting to fail at the same time would be really bad luck. Maybe the issue is elsewhere, like the HBA or the motherboard?

For the story, I have a NAS with an AMD B550 motherboard and some Asmedia SATA controllers with a bunch of WD HDDs, using ZFS. At some point, they also started to throw weird read errors when under heavy load, but would work just fine the rest of the time. Turned out it was caused by me moving the SATA card from one of the PCIe slots wired directly to the CPU, to a PCIe slot wired to the B550 chipset.

Maybe they were bought together? I had a few drives I bought together (Seagate, though, I think), and they started failing within a month or so of each other.

I did replace them with WD, though, I think. Oops.

Bad batch, not bad luck.

> In other news, all of the WDRed CMR drives in my ZFS array started throwing read errors simultaneously so I've ripped them all out.

Is this multiple drives purchased around the time failing at the same time? Or is it a bad cable or controller?

They were all bought together and had the same load on, so quite reliable/predictable of them to all fail so close together.

For the record, they have 43k power-on hours, though nothing much has been written to them for the last ~3k hours, as I had a feeling something was off before they started reporting errors and migrated the data away.

I've had multiple cases of IO errors with ZFS, and just two of them were due to the disks. The rest were connections needing reseating or cables needing replacement.

If all of them went at the same time I'd try swapping to a different controller before I suspected the disks.

Same. I would try scrubbing with a different cable and if I was still getting errors I would switch cards and try another scrub.

Could be defective RAM, an unstable power supply, or another problem besides the disks themselves.

I had one of these SanDisk SSDs and I lost terabytes of data within days of using it. Because of Western Digital and SanDisk, I had an irreparable loss of art, notes, and photos spanning years of my life.

(Yes, I know, keep three backups, etc.)

Did you send it to a data recovery company? It's usually trivial for them to recover data off drives that fail like this.

But also expensive right?

Sounds like worth it for an “irreparable loss”.

I honestly didn't know this. I wiped the drive again (just to be safe) and was able to get a full refund

Wow, I didn't realize WD owns SanDisk. I was always pretty satisfied with SanDisk tbh, and I also have this Extreme SSD, but this is making me reconsider somewhat…

Is there a resource to identify the affected versions?

Anecdotal: I've been into PCs for 30 years. Each failed hard disk I had over those 3 decades announced itself in advance through increased noise, lowered performance, and SMART warnings. In all cases I was either able to retrieve all the data before the thing died, or at least managed to recover most of it.

With SSDs, the failure was always sudden, with zero warning, and resulted in 100% data loss.

My most important files live on traditional spinning hard disks and will continue to do so (and backups, of course).

Knock on wood.
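The early warnings the parent describes show up in SMART data before the drive gives out. A hedged sketch, assuming smartmontools' `smartctl -A /dev/sda` output; a sample attribute row is inlined here (the value 12 is made up) so it runs without a real drive:

```shell
# Sample row from `smartctl -A /dev/sda`; on a real system, use:
#   sample=$(smartctl -A /dev/sda | grep Reallocated_Sector_Ct)
sample='  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       12'

# The raw value is the last column of the attribute row.
raw=$(printf '%s\n' "$sample" | awk '/Reallocated_Sector_Ct/ {print $NF}')

if [ "$raw" -gt 0 ]; then
  echo "warning: $raw reallocated sectors - plan the replacement now"
fi
```

A growing reallocated-sector count is one of the classic pre-failure signals for spinning disks; SSDs, as noted above, often die without any such ramp.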

Does anyone have a link to affected drives? I've got two which have worked fine for the last three years, but both are under 1TB. Does this only affect drives with more than 1TB?

Having one of the possibly affected devices, I typed in my serial number and it claims not to be affected (it has updated firmware). There is no date on that web page about when the firmware was issued. Is this a case of the issue actually being fixed with the firmware update, so they can just blame those with data loss who didn't keep up to date, or is it a problem even after?

Thanks for that! I've got V1 and it seems to be a V2 issue... Phew!

I bought Sandisk SSDs & SD cards. All, literally ALL, of my Sandisk SSDs & SD cards either failed (no longer read) or have significantly degraded performance.

It's freaking sad to see such idiocy from a company I trusted so much((( I have several of their very old HDDs (a couple of 500GB and one 1TB) that have been in daily use for 15+ years, but it looks like I won't buy anything new from them.

I see I have one of these devices as a local backup device. Any recommendations for a more reliable alternative?

Somehow I don't think Backblaze's statistics are going to help much :-)

There are only 4 names I need to remember for the remainder of my career to be satisfied with my SSD needs:

David Goeckler (CEO)
Michael C. Ray (GC)
Robin Schulz
Samsung

I don't think it is as easy as a firmware fix - seems the problem is on the SSD controller level.

It's possible that they keep selling them because the issue was silently fixed with a hardware or firmware revision. And of course they don't reply because they know anything they say will be used against them in a lawsuit. These days with the short attention spans of people it's more intelligent to do nothing but wait until things blow over.

it seems more likely that they keep selling them because they like money, and never fixed it, and don't reply because they know that nobody can do anything about them not replying (their liability is a good point, too)

since we have no evidence the issue was fixed, and since it seems to still be happening, we can safely assume it was not until such evidence arises

Good on them. No good ever came of engaging with the Verge.

Seriously though, these questions from the Verge sound like a five year old demanding to know why they aren't allowed to eat ice cream in bed.

Zero attempt at an actual investigation, just weirdly demanding to know from their PR people why computer stuff breaks sometimes. This is absolutely trash journalism.
