SanDisk Extreme Pro failures result from design flaw, says researcher (tomshardware.com)
281 points by dangle1 on Nov 12, 2023 | 158 comments



I used to work in manufacturing test for an SSD supplier. This would normally be covered by an “ongoing reliability test” in quality. But I also witnessed that quality can be a highly politicized arm of manufacturing companies, and finding issues with products is not always well received, while approving products is always well received.

In many consumer products, tests like that are often not implemented or curtailed compared to OEM products. When you buy from a company like Dell or Apple, you get the benefit of having a large organization providing accountability. In other words, when a company like Dell represents their interests in receiving quality products to uphold their reputation, they also have a shared interest with the end consumer — but carry a lot more weight since they represent large contracts with the supplier. Suppliers tend to put more effort into testing their OEM products so as not to damage their business relationships.

Anyway, this kind of thing happens all the time in consumer storage. Likely nobody was doing reliability testing on these drives in the first place since that costs money and can only expose problems they didn’t really want to know about.


In a perfect world this would be true, especially at the large business level where the integrator will get their ass sued by the customer or at least be forced to make good on the situation.

In the retail and small/medium business market the reality is that Dell, HP, and the like are under so much pressure to cut margins that they'll go with whoever is cheapest, and customers almost never escalate things to tort.

Dell PC power supplies are made for them by someone else, proprietary in size and connector, and gosh, wouldn't you know it - they have a pretty high failure rate. They last just long enough to make it out of the warranty period, and then they make for a really nice revenue stream for Dell via replacement PSUs or pushing the customer to buy a new system entirely.

Even failure within the warranty period is acceptable in the consumer market because integrators have it down to a science exhausting people on the customer support side. Long phone queue times, incompetent support agents who have to transfer you to different agents and likely drop the call entirely, silly policies like requiring a reformat/OS reinstall for everything, and so on.


>Even failure within the warranty period is acceptable in the consumer market because integrators have it down to a science exhausting people on the customer support side. Long phone queue times, incompetent support agents who have to transfer you to different agents and likely drop the call entirely, silly policies like requiring a reformat/OS reinstall for everything, and so on.

This is one reason why I believe Apple computers last much longer than Windows computers. With Apple, they only sell a few models in high volume. So if there's an issue, everyone will know about it and Apple will often have to do a mass recall or provide free repairs. And since Apple prices are higher, you'd assume that they use better-grade parts on average.


> So if there's an issue, everyone will know about it and Apple will often have to do a mass recall or provide free repairs.

I wouldn't say Apple is any better than anyone else - aging iPhone batteries and butterfly keyboards both had a class action lawsuit settlement, it wasn't out of good PR that these got addressed. I suppose you are right that everyone will know about them, though, given that those were from memory.


>I wouldn't say Apple is any better than anyone else - aging iPhone batteries and butterfly keyboards both had a class action lawsuit settlement, it wasn't out of good PR that these got addressed. I suppose you are right that everyone will know about them, though, given that those were from memory.

That's the point. If 5% of PSUs failed inside a Dell computer just outside of warranty, no one would care except those affected. If the same thing happened on a Mac, you'd get a media storm and a class-action lawsuit and Apple will eventually settle by giving out repairs - even if the failure happened outside of warranty.

I did get a free battery replacement for my iPhone 6S.


I've only had Apple devices fail in my hands, all with video memory corruption. Never had a hardware failure when the company used Lenovo (IBM era).


Buying from an OEM certainly doesn’t come with any guarantees. It’s a price/quality contract in almost all cases though. The OEM defines an acceptable defect rate in their contract (even if the allowed DPM is high). This effectively establishes a requirement at the supplier to ensure they will meet it.

For consumer products, you can assume that this added requirement doesn’t exist.

Edit: as another example, it’s well known among hardware suppliers that being a supplier to Apple can be a double edged sword for this reason. They have very high quality expectations and they squeeze extremely hard on price. But for that, they bring high volumes. If your company doesn’t have their stuff together, they can easily get raked over the coals in Apple contracts.


> ...Dell, HP, and the like are under so much pressure to cut margins that they'll go with whoever is cheapest, and customers almost never escalate things to tort.

Can confirm. Have an office supplied HP business desktop. One day noticed that my system is slower than normal. After 5 minutes with smartctl, I found out that the SSD was constantly throttling down SATA link (SATA downshift), was not reading or writing more than ~250 MBps, and had some wonky latency issues.

Got a new SSD, moved the drive with dd, and all my problems are solved. Previous drive was by Samsung, but it was a "value" drive which even Google knew nothing about. It was probably built with bottom of the barrel parts, and something went bad earlier than expected.
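
For anyone wanting to check for the same symptom, here is a minimal sketch. `smartctl -i` prints a `SATA Version is:` line listing both the maximum and the currently negotiated link speed, and a mismatch suggests a downshift. The helper name `link_downshifted` and the sample strings are made up for illustration, and the exact line format may differ between smartmontools versions, so treat this as a starting point rather than a robust parser.

```python
import re

# Matches e.g. "SATA Version is:  SATA 3.2, 6.0 Gb/s (current: 3.0 Gb/s)"
_SATA_LINE = re.compile(
    r"SATA Version is:.*?(\d+(?:\.\d+)?) Gb/s \(current: (\d+(?:\.\d+)?) Gb/s\)"
)

def link_downshifted(smartctl_output: str):
    """Return (max_gbps, current_gbps, downshifted), or None if no line is found."""
    m = _SATA_LINE.search(smartctl_output)
    if m is None:
        return None
    max_gbps, cur_gbps = float(m.group(1)), float(m.group(2))
    # A link negotiated below the drive's maximum hints at a downshift.
    return max_gbps, cur_gbps, cur_gbps < max_gbps

# Hypothetical sample line, not taken from the drive discussed above.
sample = "SATA Version is:  SATA 3.2, 6.0 Gb/s (current: 3.0 Gb/s)"
print(link_downshifted(sample))
```

On a live system you would pass in the text printed by `smartctl -i` for the device in question; a downshifted link is only a hint, and the SMART error log is still worth reading.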


> Even failure within the warranty period is acceptable in the consumer market because integrators have it down to a science exhausting people on the customer support side.

This has been true for at least decades. It's why I completely ignore all warranties when I'm deciding what to purchase -- they tend to be essentially worthless, once you factor in the cost of trying to make a warranty claim.


It's funny because retail-boxed Intel CPUs used to overclock better, at least in the Celeron 300A days.


Except that a non-overclockable CPU isn't a lower quality one. In fact, they may be sold cheaper to the OEM because they are less likely to be overclockable.


> non-overclockable CPU isn't a lower quality one

Generally, it did mean this, if we are to believe that Intel largely made the same CPU and "binned" their processors into different speeds based on what they were stable at, then locked their multipliers to speeds they'd be reliable at (lower quality = lower multiplier). But one could still set the bus speed to whatever they liked, and the retail boxed chips handled this better.

There would also be a market demand factor to it. If they got a large order for 266MHz chips, they'd lock them at the multiplier for that, even if they could handle 300 or 333 MHz.

(Part of the rumour for some Celeron chips was that they were the same die but a fraction of the cache, so "Pentium" chips produced with a cache defect could have that section locked out and labelled a Celeron)

Nowadays, CPUs can often throttle themselves, so this binning isn't as necessary to mitigate batch-to-batch variation.
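
The arithmetic behind the bus-speed trick is simple: core clock is bus clock times the locked multiplier. A minimal sketch using the Celeron 300A mentioned upthread, which shipped with a 4.5x multiplier on a nominal 66 MHz bus and was commonly run on a 100 MHz bus (the function name is just for illustration):

```python
def core_clock_mhz(bus_mhz: float, multiplier: float) -> float:
    """Core clock = front-side bus clock x locked multiplier."""
    return bus_mhz * multiplier

# The nominal "66 MHz" bus is really 66.67 MHz (200/3).
stock = core_clock_mhz(200 / 3, 4.5)
# Same locked multiplier, faster bus.
overclocked = core_clock_mhz(100, 4.5)
print(round(stock), round(overclocked))
```

The multiplier lock limits what the chip runs at on a stock bus, but it does nothing once the bus itself is raised.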


> Part of the rumour for some Celeron chips

Not a rumor. Starting with Coppermine, Celerons were binned P3s. Same die size and all.

Interestingly, AMD did not typically do the same for the Duron (with one or two exceptions). My understanding at the time was their dies had extra cache to handle defects without full binning.


> On the one hand, the resistors used in these SSDs are too big for the circuit board, causing weak connections

I am an electronics / PCB hobbyist and I can't for the life of me figure out how they came to such a weird conclusion. What does this even mean?

Larger components will have more surface area at the joint and should be stronger than a smaller component

> On the other hand, the soldering material used to attach these resistors is prone to forming bubbles and breaking easily, according to Häfele.

Never heard of solder doing this - it seems more likely to me that the solder wasn't reflowed properly in manufacturing.

What's more is that the component pictured is a capacitor.

The only conclusion I can draw here is that the guy has no clue what he's talking about


Hard to tell from appearance only but my initial impression is that's an inductor, not a capacitor. The circuit looks like a switching power regulator. The capacitors would be beige with silver ends, this one looks like an over molded inductor, similar to [1], and is used as the main power inductor in a buck regulator.

If this is an inductor, my gut reaction is it has an insufficient current rating for the application and it is overheating. Inductors have a bunch of loss mechanisms that contribute to heating. Depending on the type of metal used to build the core, it can 'hard saturate' and effectively walk itself off a cliff once the current draw gets too high. At some point, it gets hot enough to desolder itself from the circuit board. It's possible they did not see this in validation because the power draw of SSDs depend heavily on the work load and process variations in the chips; erase current can have a fairly wide variation.
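
To put rough numbers on the saturation concern, here is a sketch of the textbook peak-current estimate for an ideal buck converter in continuous conduction. Every value below (5 V in, 1.2 V out, 1 µH, 2 MHz, 2.5 A saturation rating) is a made-up assumption for illustration, not a measurement from this SSD, and the function names are hypothetical.

```python
def buck_peak_inductor_current(v_in, v_out, inductance_h, f_sw_hz, i_load_a):
    """Ideal buck in continuous conduction: peak inductor current in amps."""
    duty = v_out / v_in
    # Peak-to-peak ripple: dI = (Vin - Vout) * D / (L * f_sw)
    ripple = (v_in - v_out) * duty / (inductance_h * f_sw_hz)
    return i_load_a + ripple / 2

def exceeds_saturation(i_peak_a, i_sat_a):
    """True if the peak current is above the inductor's saturation rating."""
    return i_peak_a > i_sat_a

# Hypothetical operating points: a light load and a heavy erase/write burst.
for load in (2.0, 2.5):
    peak = buck_peak_inductor_current(5.0, 1.2, 1e-6, 2e6, load)
    print(load, round(peak, 3), exceeds_saturation(peak, i_sat_a=2.5))
```

The point of the sketch is only that a part with adequate margin at typical load can cross its saturation rating when the workload pushes current up, which is the kind of thing that workload-dependent validation can miss.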

fwiw, voiding of solder joints is a problem. The solder is applied as a paste - fine particles of metal solder suspended in solder flux. During reflow the flux evaporates and leaves the metal behind, but if the process isn't tuned right bubbles of gas can be trapped in the joint. This can lead to reliability problems. It can also increase the effective thermal resistance to the circuit board, which for tiny components like this can often be the primary path for heat removal during normal operation.

[1] https://www.digikey.com/en/products/detail/pulse-electronics...


The article says:

> the problem lies in hardware, not firmware, which could explain the lack of corrective firmware updates for those models and SanDisk's continued silence about the source of the issues.

But I'd guess a firmware update that slowed down the erase process could let it cool down, at the cost of a performance hit.

Are they not using charge pumps, and are these some of the first SSDs upgraded to on-board inductor-based boost converters?

These messes could be solved if system power supplies had a 20V rail instead of requiring tiny devices to make it. Maybe an integrated manufacturer (hi apple) will spec out proprietary SSDs like this one day.

Charge pumps are cheap and small, but not as efficient (ie: HEAT!):

> By using the boost converter with the optimized inductor, the energy during write-operation of the proposed 1.8-V 3D-SSD is decreased by 68% compared with the conventional 3.3-V 3D-SSD with the charge pump.

https://dl.acm.org/doi/10.1145/1594233.1594253

2023 paper:

> One of the main causes is the on-die charge pump circuit, which has a low conversion efficiency and induces high heat generation.

> Using the in package boost converter, we show that the power consumption can be reduced by up to 39% while the temperature rise can be reduced by 50%.

https://ieeexplore.ieee.org/document/10145971


> These messes could be solved if system power supplies had a 20V rail instead of requiring tiny devices to make it. Maybe an integrated manufacturer (hi apple) will spec out proprietary SSDs like this one day.

Then you'll get people (like me) who will deride Apple for requiring a proprietary component where COTS components are available, calling it an anti-consumer move.


The last thing the world needs is another proprietary connector.

With you on that.


me too, but we're talking about a few extra minutes of battery life here. That's catnip to cat people.


Oh, if it were also smaller and lighter, we'd be in heaven. If it weren't for the proprietary devil lurking in the corner, showing us a fake heaven while having us in chains, sucking the life of our dreams.


No vacumoven?


I am an electronics / PCB hobbyist and I can definitely see how their explanation can be true. I can't say it is, but I can see how it could be.

If you design a PCB for a given resistor size but then decide to use larger resistors without redesigning the pads, you may have reflow problems and weak joints. This is simply because components are positioned by surface tension during the reflow process (they are pulled into place as the solder melts). If the pads are sized for smaller components, there will be too little solder for the larger surface and weight of the component, and it will be pulling at the wrong angle to draw the part into place, potentially causing a higher rate of failure.

> What's more is that the component pictured is a capacitor.

And that means what? From the picture I can tell that there is very little solder between component and the pad. Potentially too little to hold the component well in place.

> The only conclusion I can draw here is that the guy has no clue what he's talking about

Maybe he does, maybe he doesn't. Have you considered the possibility that you are not an expert either?


As someone who designs circuit boards professionally, the explanation is clearly lacking. There might be a thermal issue or there might not be. There is nothing conclusive in the pictures either way. What I do see is the following:

1. Underfill (the brownish-tan smooth material surrounding the components towards the bottom of the picture) around the IC, which is typically done to make parts more mechanically robust.

2. No evidence of overheating on any of the thermal interface material that is left stuck to most of the components and no evidence of overheating on the PCB or the components themselves.

3. Completely insufficient evidence to declare a soldering issue. The way to prove this one way or another is x-ray inspection to look for voids in the solder or a mechanical cross-section of the suspect solder joints.

While this certainly could actually be the problem, I see insufficient evidence to conclude one way or another. Manufacturers don’t put underfill under a part unless it’s required through testing or experience with similar package types in prior designs since it adds cost, additional process steps and makes it a PITA or impossible to rework any bad components in the area.

As to the pad size/shape, there are three general classes of design defined by the IPC (standards body that deals with PCBs and PCB assemblies). Depending on how space constrained your design is, there are different recommended pad designs for passive components like these. They might be using one of the tighter spacing guidelines, but if their process is well controlled, it can be perfectly fine for the design life of the product.

If you want to see small pad layouts done well, look at an iPhone logic board.

If you want to know more about pad design for SMT parts, search for IPC-7352


My totally unsubstantiated guess from the description alone was 'I wonder if they switched to a larger package component and forgot to update the pads.' That could be described as the 'component being too large for the device' and while it might just fit, it may be borderline mechanically and electrically stable. That could also explain the added underfill. Is that possible?


It’s certainly possible someone did a BOM substitution and didn’t do due diligence on it, but I doubt it. PCB assembly houses tend to notice components that are suddenly too big for their pads because they’ll have fallout in AOI or later testing.

The underfill was likely added before full production as the result of reliability tests that showed some mechanical susceptibility of that IC.


Does seem a bit strange, but the original article[1] in German, translated using Google Translate, reads as follows:

> “It's definitely a hardware problem. It is a design and construction weakness. The entire soldering process of the SSD is a problem,” says Häfele. A hard drive has components that need to be soldered to the circuit board. “The soldering material used, i.e. the solder, creates bubbles and therefore breaks more easily.”

> “In addition, the components used are far too large for the layout intended on the board,” says Häfele, explaining the technical problems: “As a result, the components are a little higher than the board and the contact with the intended pads is weaker. All it takes is a little something for solder joints to suddenly break.”

It sounds like what they're saying is that the solder pads are too small for some of the components. Not sure about what they're saying about the solder though.

[1]: https://futurezone.at/produkte/sandisk-ssd-ausfaelle-western...


> Not sure about what they're saying about the solder though.

There's more than one solder alloy in use. There's more than one class of solder alloy in use. Some are easier to use, some are harder to use. Some are high-performance, low-tolerance, some are low-performance, high-tolerance. Some are expensive, some are cheap.

The most troublesome family is SnBi. These are relatively new. They have a big "greenwashing" problem in that they solder at lower temperatures, which is "environmentally friendly" (and cheaper to run). Also the base metal is dirt cheap. (Wonder why manufacturers are interested?) It's also very, very brittle. It also happens to be a low-temperature alloy... so it's much easier to get hot enough to desolder during operation. Lots of trouble all around and in general a very high field failure rate. Not recommended... oh wait but it's cheap and greenwashable. Sigh.


Are there places that use SnBi for production devices? I know Bismuth alloys are used to desolder stuff (and they work amazingly well for that), but the general rule is that you should clean it up before soldering something new. (And keep it for later use, because it isn't exactly cheap.)

It's a heavy metal and reading https://en.wikipedia.org/wiki/Bismuth#Toxicology_and_ecotoxi... it looks like we don't know a lot about it yet, but to me it looks extremely unlikely to be better for the environment than SnAgCu.

Also Bismuth appears to be rare: https://en.wikipedia.org/wiki/Abundances_of_the_elements_(da... Rarer than palladium. All the even rarer elements are generally known to be rare and/or precious, or radioactive elements that normal people would never come across.


> it looks like we don't know a lot about it yet, but to me it looks extremely unlikely to be better for the environment than SnAgCu.

Bismuth is quite safe. Pepto-Bismol (bismuth subsalicylate) is over 50% bismuth by mass, and it's sold over the counter.


Lenovo had a lot of press releases about switching over. I don't know to what extent they actually did it.

I agree it's probably not any better for the environment, but you know how the PR cycle goes.


I won't ever forget the widespread BGA failures caused by the RoHS-forced switch to lead-free solder. No doubt massive amounts of additional ewaste were created, but at least it's "environmentally friendly" ewaste?

Military/aerospace are still exempt and continue to use leaded solder.


If you are talking about Nvidia's flip chip problems, those were actually caused by the glue holding the chip onto the substrate, not the solder. The glue expanded at a different rate from the solder balls and caused them to crack.

This was especially the case on consoles. People kept reballing and doing other useless repairs that solved the problem by accident by melting the solder balls between the substrate and the silicon chip. Some even managed to remelt the solder balls simply by replacing capacitors, which then made everyone think the capacitors were the problem and everyone swallowed it because replacement capacitors were cheap.


I don't want to pick a fight, but here's my rando opinion on that:

Almost all electronic devices end up as e-waste after a few years. If a couple % fail prematurely, that doesn't create a massive amount of additional ewaste, but rather a _very_ slight increase in e-waste. And it's relatively benign e-waste. You could shred the board and sprinkle it over your field and it wouldn't be a huge problem (* don't take my word on this; there's flux residue and somewhat toxic stuff used in other components, the plastics will probably leak BPA and other stuff, etc.)


> It sounds like what they're saying is that the solder pads are too small for some of the components

The converse is also possible. Instead of being a design flaw with the pads too small for the component, it could be that a larger component was substituted during manufacturing. Even terrible freeware EDA packages have design rules that will flag improper solder pad layouts, so it seems like what might have happened is the physical part does not resemble its model.


> Even terrible freeware EDA packages have design rules that will flag improper solder pad layouts

No, they don't. EDA software doesn't really know what size the terminations are. It knows how big the pad itself is, and is very good at keeping those out of trouble, but it doesn't know what size the solderable area is. You might tell it, or give it a 3D model, but make a mistake there and you're right back here. As well, there are so many different kinds of terminations (pop quiz: what kind are these?) that even if it does know what size they are, it doesn't necessarily know what size or shape the pad should be.

Also the CM will totally edit this stuff and not tell you. Which they're not supposed to do, and are probably better at if you're a huge customer, but they still do it. EDA sure doesn't know about that.


If the correct amount of pad is not exposed at the edge of the part, the solder will have nowhere to form a fillet which is critical to its physical attachment. Solder is not glue, and even with more pad contact beneath this is a physically weaker connection which often results in tombstones like pictured in TFA.

If you read the integration documents for these packages, you'll see that they distinctly specify the requirements for these margins. Probably the length is the more important axis and may be what he was referring to when saying "large". I've seen this be a problem particularly during the "chip shortage" where jellybean parts like these capacitors have the weakest specs in a design, meaning unilateral substitutions can happen at many points in the design/mfg pipeline.

Indeed brittle solder is a real phenomenon which is often easily visible in hand soldered joints that we call "cold" joints. Formation of bubbles can happen for a number of reasons, but IME it's the result of low quality solder or flux/cleaning. The organic compounds gasify in the heat and form an internal structure similar to bread.

ETA: an interesting paper exploring the cause and minimization of voiding in the reflow process. Particularly, the decrease in thermal conductivity in voided solder can critically contribute to its failure in high-heat operational environments.

https://www.circuitinsight.com/pdf/controlling_voiding_mecha...
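
As a back-of-the-envelope illustration of why voids matter thermally, here is a crude one-dimensional sketch: treat the joint as a slab with resistance R = t / (k * A) and assume voids simply remove a fraction of the conducting area. That area-reduction model, the function name, and all numbers (50 µm thick joint, 1 mm² pad, k = 50 W/m·K) are my own assumptions for illustration, not values from the paper or from this SSD.

```python
def joint_thermal_resistance(thickness_m, conductivity_w_mk, area_m2, void_fraction=0.0):
    """1-D slab estimate in K/W; voids are modeled as lost conducting area."""
    if not 0.0 <= void_fraction < 1.0:
        raise ValueError("void_fraction must be in [0, 1)")
    effective_area = area_m2 * (1.0 - void_fraction)
    return thickness_m / (conductivity_w_mk * effective_area)

# Assumed, illustrative joint: 50 um thick, 1 mm^2 pad, k = 50 W/(m*K).
solid = joint_thermal_resistance(50e-6, 50.0, 1e-6)
voided = joint_thermal_resistance(50e-6, 50.0, 1e-6, void_fraction=0.3)
print(round(solid, 3), round(voided, 3), round(voided / solid, 3))
```

Real voided joints behave worse than this in some ways (voids also concentrate mechanical stress), so the sketch only shows the direction of the effect.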


> Larger components will have more surface area at the joint and should be stronger than a smaller component

Larger components are also, well, larger, and have much bigger forces on them. For ceramic capacitors you need to avoid shearing and torquing as the body of the capacitor is very brittle and a small crack means a dead part, possibly dead short. Big ceramics are dangerous to use as they have a high failure rate. I personally won't use anything larger than a 1210. Some of my colleagues think I'm nuts and should stop at 0805, but I think the flexible terminations available these days make 1210 viable. At least in medium volumes, I don't ship SSDs!
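
For readers not used to the size codes: imperial chip codes such as 0805, 1206 and 1210 encode nominal length and width in hundredths of an inch. A small sketch of the decoding (the function name is made up, and datasheet dimensions differ slightly from these nominal conversions, so check the actual part drawing):

```python
def imperial_chip_size_mm(code: str):
    """Decode a 4-digit imperial chip code like '1210' into (length_mm, width_mm)."""
    if len(code) != 4 or not code.isdigit():
        raise ValueError("expected a 4-digit imperial size code")
    length_in = int(code[:2]) / 100  # first two digits: length in 0.01 inch
    width_in = int(code[2:]) / 100   # last two digits: width in 0.01 inch
    return round(length_in * 25.4, 2), round(width_in * 25.4, 2)

for code in ("0805", "1206", "1210"):
    print(code, imperial_chip_size_mm(code))
```

A 1210 body covers roughly three times the board area of an 0805, which is why board flex and thermal expansion put so much more stress on it.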

> I can't for the life of me figure out how they came to such a weird conclusion

What I see when I look at this is they have a part with a 5-sided termination (typical MLCC capacitor with metallized cap) but they have a footprint that only gets fillets on 1 of those 5 sides (typical would be 3). This is common for resistors... but resistors (a) have only 3-sided terminations anyway and (b) are made of robust alumina bodies, not fragile ceramics. So someone either got dumb with the footprint library or more likely overly aggressive to pack things in, not appreciating what MLCCs really need to be happy. I don't think it's part size changes, because the fillets along the length dimension that are visible look about right in size.


My gut feel was also cracked MLCC ceramics from thermal expansion or shock.

I've seen some 1206s shear right off a pcb from merely mechanical shock to the PCB, not the cap directly.

When I use them I try to orient them parallel with any PCB bending forces, but they are still fragile.


This is something that is in my area of expertise, and your suspicions are correct.

Solder can "bubble" but this is a line process issue that is easily picked up even in old AOI systems (automatic optical inspections) from 10-15 years ago.

To be frank, this article reads to me like a piece put together by somebody who has no idea what they're on about, to generate publicity for their company. Nothing to see here.


The most charitable way I can read their statement is that the resistors are too large for the pad, and along with poor solder material it forms a weak joint which breaks over time.

I have a hard time accepting that because there is not a lot of heat on that line nor is there a lot of physical stress, like constant vibration on SSDs.


These SSDs are tiny. The controllers can easily get up to 80C during sustained writes, so there could be mechanical stress from thermal cycling. (Source: we also make small USB-interfaced high-speed storage devices and do a range of reliability testing for stuff like this)


On the SSD chip sure. This looks like a resistor on the data line. The resistor would certainly not get to reflow temp.


It reads to me more like the journalist writing the article summarized a technical report badly.


It looks to me like some glued on covering has been removed here, which in turn could have pulled the components off (could still be weak solder joints) rather than it being a manufacturing problem - the components don't look too big for the pads to me

Most modern manufacturing lines have manual and automatic (vision system) inspections that would detect badly soldered or tombstoned components like the ones shown here.


> What does this even mean?

It means you should click through to look at the pictures in the original article.


But there was something in the article about epoxy - so potentially the components are glued down with a conductive epoxy instead of being actually soldered. Why would you do this? Don't know. But it would explain why the solder is losing the plot.


"Too big" could mean the pads on the circuit board were made for a smaller component, and now with the larger one, there's less overlap and direct contact from the pads on the board and the contacts on the component.


I stopped buying WD anything early 2010's, but then they acquired everyone else like Seagate, meaning even decent Hitachi disks would be now tainted to become typical WD garbage. I still won't buy anything WD, but alternatives are hardly attractive with the market limited to like 3-4 players.

Good old monopolies in effect, your options are bad or worse.


If Backblaze's yearly disk stats and my personal experience in our datacenter are anything to go by, WD has generally been the more reliable disk brand for the last decade or so.

I remember an era where Seagate Constellation (enterprise disks) were so bad, I was replacing them a dozen per week.

Also, from my experience SanDisk didn't get tainted by WD acquisition. Their Extreme Pro SDs still as reliable as before, and their portable SSDs hit the speeds and reliability they advertise.

Every manufacturer makes a design error almost once a decade. Seagate did it, Maxtor did it, WD did it before (their drives were very finicky), however all big producers are in good shape now, from my experience. I can equally trust a Seagate IronWolf Pro or its WD equivalent, or a Samsung SSD and its SanDisk equivalent.

Problems happen, PCBs got revised, things got recalled. Everything is new, but nothing has changed.


> Their Extreme Pro SDs still as reliable as before

Try this: https://news.ycombinator.com/item?id=38244389


These are SSDs. I'm talking about SD cards; I just downloaded the photos from my camera off one while writing this comment.


Oops sorry. Completely missed that. o_O


No problems, things happen. To err is human.

Have a nice day.

:)


It's funny you say that. I always thought WD were the more reliable brand, and Seagate were trash.

I wonder if it's just a case of each of us having one HDD of a particular brand fail on us violently, and then finding others who were in the same boat.


Pronounce this in German: "Sea gate oder sea gate nicht" ("Sie geht oder Sie geht nicht"). Meaning "she works or she does not work", it's a German wordplay on early failure rates for Seagate drives.

Coined at a time when, if you didn't have Seagate drives in a RAID, you were more likely to lose your data than not ;)

And yeah I started buying WD at that point. Backblaze stats weren't a thing back then tho.


> I wonder if it's just a case of each of us having one HDD of a particular brand fail on us violently, and then finding others who were in the same boat.

That is absolutely the case and anyone with enough experience could confirm it. Both WD and Seagate have made some real trash drives, and both made at least one or two models that were trash at scale. If you timed it just right you could jump from one to another and experience massive failures with both! You also probably have a drive from each that's been running for 20 years somehow.


Almost makes me pine for good 'ol Miniscribe ;)


I take it you mean "like Seagate [acquired everyone else]" because Seagate, Western Digital, and Micron are all competitors.


And don't forget Hynix. They somewhat recently got into the B2C business, and while they command a premium, the SSDs both OEM and Retail I use from them have been very solid.

There's also Samsung.


Don't forget the last (or first) player, Kioxia. Their drives are more often found preinstalled in devices than sold as DIY parts.





I hadn't heard about the Seagate acquisition, that sucks. So what are my options now if I want a reliable external hard drive for example?


Just to be clear, WD has not acquired Seagate. They're still two different, competing, companies.

The above post probably typo-d "Seagate" while meaning "SanDisk".


I wondered if he was confusing the drama that happened with Seagate buying up Maxtor. A lot of people were upset when that happened because they trusted Seagate a lot more than Maxtor or Western Digital and suddenly the same shitty Maxtor drives many went out of their way to avoid were being sold under the Seagate name leaving people stuck with either buying WD or buying Seagate and probably getting Maxtor anyway. Seagate's quality and reputation took a huge hit.



> WD has not acquired Seagate

Hasn't it?

https://www.westerndigital.com/brand/sandisk


Reading comprehension. SanDisk is not Seagate


For external drives, I would seriously consider using SSDs. Unless you use them exclusively as cold backups and handle them carefully and seldom, I would be far too worried about accidental drops. I have killed some external HDDs this way, never killed an SSD, even though I am far rougher with them. For extra reliability, buy two disks from different manufacturers (e.g. Sandisk/WD and Samsung) at different times and mirror the contents. Less chance of both disks going bad at the same time.
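
If you do mirror onto two disks, it is worth checking now and then that the copies still match. A minimal sketch that compares SHA-256 digests file by file; the function names and the example mount points are placeholders of mine, and this only detects mismatches, it does not tell you which copy is the good one.

```python
import hashlib
from pathlib import Path

def file_digest(path: Path, chunk_size: int = 1 << 20) -> str:
    """SHA-256 of a file, read in chunks so large files don't fill memory."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        while chunk := f.read(chunk_size):
            h.update(chunk)
    return h.hexdigest()

def mirror_mismatches(primary: Path, mirror: Path) -> list[str]:
    """Relative paths under primary that are missing or different on mirror."""
    bad = []
    for src in sorted(primary.rglob("*")):
        if not src.is_file():
            continue
        rel = src.relative_to(primary)
        dst = mirror / rel
        if not dst.is_file() or file_digest(src) != file_digest(dst):
            bad.append(rel.as_posix())
    return bad

# Example call with placeholder mount points:
# print(mirror_mismatches(Path("/mnt/disk_a"), Path("/mnt/disk_b")))
```

Running it in one direction only finds files missing from the mirror; swap the arguments to catch files that exist only on the second disk.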

Talking about 3.5" HDDs, sourced from external drives: WD is still ok in my book. Both the Backblaze report [1] (newest, quarterly version, check the drive hours, WDC has less than HGST so far) and my own experience show they are ok. I used to buy HGST based on Backblaze's reports, but now I am using WD external drives in my NAS. My oldest and most used disk (one of the parity drives) has more than 3 years power on hours with nearly 900 start/stop cycles. It shows no signs of failure so far.

I get these HDDs from external drives (called "shucking"), 10TB WD My Book or WD Elements Desktop. It is a bit random what you get, but between 7 HDDs (+1 currently in testing) over about 3 years, I only had one non-Helium drive that runs hotter than the other all Helium drives. No failures yet, no bit errors as well, performance is at least good enough for media storage, currently reading at about 180MB/s sequentially.

I saw one problem: USB errors with WD's USB-SATA bridge and I even had to remove the newest disk to run the test, it would drop from the bus via USB. Might be because it is a refurbished disk or something fishy with the USB 3.0 ports on my server, so I won't blame WD for it.

[1] https://www.backblaze.com/blog/backblaze-drive-stats-for-q2-...


>For external drives, I would seriously consider using SSDs.

I wouldn't. I use my external drives as offline backups, so they don't get plugged in that often. SSDs lose their data if they aren't powered up regularly. And of course, they're much more expensive per TB than spinning rust.


What's wrong with the WD ones? I have a bunch of them and never had any problems


The funny thing is that since these have been in the news for months, there were almost immediate fire sales on all the main deal sites to sell them off. Everyone who bought them now has a waiting time bomb of a disk to use. Thanks, Western Digital, for your contribution to society.


Costco was selling them (still!): https://www.costco.com/CatalogSearch?dept=All&keyword=ssd

Is Costco completely unaware of these massive issues?


Costco is actually a decent org, and if anyone knew they were selling this time-bomb garbage, they would stop it, as they will warranty stuff for YEARS, just to be a somewhat decent company in a time of pirates.


I own one of these disks and quit using it when the news came out, expecting I should hang onto it to get money back for a recall. Didn't even occur to me I could just have brought it back to Costco all this time because of their extremely generous return policy.


Not the same series. "Extreme Go" is not the same product as "Extreme Pro". I have two of these from Costco and they have worked fine for several years.


Maybe Costco caught up with this. I can't find it on their web site (at least in the US.)

All I see is the "Extreme Go" which I presume is a different product.


Blissful ignorance imho.


They did warn you - they put the words "Extreme Pro" in the name.

I guess the "Extreme Pro" solder reflowing skillz are required ;-)


Sounds like Western Digital's strategy is to play dead and wait for it to blow over. And it will most likely work.


They saw Apple get away with it and tried to do the same.


I've had a Fujitsu (if I remember correctly) drive many many years ago that had a hardware bug that would cause an IC on it to spontaneously flash fire and die.

It was a known flaw. They got away with it too.


no matter how bad the idea, there’s always someone waiting to turn Apple’s bad idea into a poorly implemented, even worse idea


There will probably be a class action lawsuit where everyone that bought one gets a $20 coupon towards a new WD product, and the lawyers make millions.


"resistors too big" ... <accompanied by picture of a capacitor>


Tom’s Hardware’s fault. The original source only says “components”.


I told myself I'd never again buy a WD drive when I realised the WD Red NAS drives I bought were completely unsuitable for NAS because they secretly replaced the product line with SMR drives.

And now you are telling me that the Sandisk SSD I bought as a replacement also has a fatal design flaw? And apparently Sandisk is a WD subsidiary?

I'm feeling slightly less bad about spending a fortune on getting a bigger built-in SSD in my Macbook. Please don't tell me they are flawed as well.


TFA is only about external drives.


Yeah, I know, I replaced my NAS with external SSDs.


Well, they do have the kill-your-MacBook-when-they-fail problem. Ref: Rossmann on YouTube.


I'm unmoved and unsurprised. Retail parts are unreliable, cheap crap by the nature of the market created to perpetuate the fantasies of something for nothing.

Coincidentally, I recently selected Max Endurance with a 15 year warranty for a noncritical application and a non-retail channel Industrial XI for something else.

I'm also unsurprised there are no SLC or traditional EEPROM SD cards advertising these facts because of the race-to-the-bottom commodification of garbage by the price point obsession of users who don't know any better. In an ideal world™, all network and computing devices would use ECC memory but no we can't have nice things and would rather have silent corruption and bitsquatting to save a few cents.

PS: C. 2001, I intentionally tried to induce errors for failure analysis purposes in industrial Maxim flash EEPROM ICs rated for 10k cell writes by using an environmental cycling chamber with heat, cold, and humidity. The damn parts wouldn't fail even at 2.5 orders of magnitude beyond that, and I started to question whether writes were actually happening. If I had more time, I would've burned it down to the ground until there were many errors to characterize it. At the end of the day, it had to be left at using turbo codes to ensure redundancy of data by cell and across chips.


Maxim parts were and remain bulletproof, with prices to match.

I think eeprom longevity is intentionally understated due to practicalities of testing and possibly wide variations in lifetime beyond the spec.

And then you have Chinese domestic SPI NOR flash that kills itself after 3-4 erase cycles...


SSDs, when they fail, usually fail catastrophically. Use automated backup software to regularly copy data to an HDD for anything you don't want to lose. And don't use SSDs for archiving or long-term backup purposes.

Also I stay away from Sandisk. They have always occupied the cheap space of drives and they have always been known to cut corners for profit.

Western Digital seems to be heading in that direction as well.

I have had a good experience with Samsung since the beginning of SSD storage.


> Western Digital seems to be heading in that direction as well.

WD SSDs which are SanDisks in a trenchcoat? Or WD HDDs which are their original business? (Or maybe both?)


It seems like both now, but their HDDs were good before they switched to SMR.

Now I use Seagate for my HDD drives.



Reminds me of my old Corsair Voyager. "Rugged" USB stick housed inside a fully rubber enclosure, which constantly causes the USB plug to snap off. Forgot how many times I had to RMA that thing.


The firmware “fix” sounds suspiciously similar to their handling of a similar issue on the WD Blue SA510 SSD’s, of which I’m on my third, after the previous two failed in less than 12 months. Didn’t they start using some new 3D NAND chips? I wonder if there’s a flaw in those chips. They would be in use on many different products so may explain the similar failures?


I always found it somewhat amusing that SanDisk is very similar to the French Sans Disque. Like the Chevrolet No Vá situation for Spanish speakers.


That's entirely the point as flash or SSD are alternatives to spinning platters of rust. It's storage sans disk.

The company was originally SunDisk but switched to avoid being confused with Sun Microsystems.


Yeah, but when it fails (and dude, it does fail!), you are also Sans Disque.


I'm astonished that after WD bought the SanDisk brand they kept it alive. You couldn't pay ME to use anything under that name, it's so negative. Maybe now with this critical failure they'll just slowly start branding things with any of the myriad other brand names they've bought, "HGST" for instance, and slowly kill the brand.


What's wrong with SanDisk? Out of the loop here -- I had a SanDisk SSD around 5 years ago and it was absolutely great; it's still going today (it's seen quite a bit of use, too.)


Yeah, kinda no clue what the controversy is cuz I've never had any SanDisk drive fail. Only WD :)


I've very rarely had an SSD fail in general, to be honest -- though I do generally stick to reliable brands[0], not "Xykdidlwo" or "Dyewkdlo" off Amazon.

Right now I've got 3 SSDs in my server (2 mirrored so 1TB for apps, and a 500GB boot drive), and I'm interested to see which one goes first.

[0]: Crucial, Samsung, Kingston, SanDisk (until I hear any information which discourages me) etc.



Yes, at least in terms of their memory cards for cameras etc. I’ve really only heard them as being quite well regarded, as far as I can remember…


I don't have any experience with their SSDs, but I have a few SanDisk USB drives that have lasted far longer than any other brand in that hellish environment of being an OS system drive. It is not really that bad, but with the frequency that USB flash dies when used as a boot drive you would think I am abusing them. The no-names I understand: junk from who knows where. But the worst offender was Kingston. They are probably fine on Windows as a rarely used backup unit, but as an OpenBSD system drive, hot garbage. I went through 6 in six months; I would expect better from a name brand. As a comparison, I am still on the original SanDisk units, 5 years and counting.


Of the brands I’ve run across for SD cards, Sandisk has been top 3ish for quality. I’ve never had major issues at least for SD Cards?

Samsung has been catching up though.


What brand would you trust the most, for SSDs and for SD cards?


There's only four flash manufacturers: Samsung, Micron, SK Hynix and SanDisk/Kioxia. All of them have had problems over the years. All of them will change the internals of products without changing SKUs or anything visible to the consumer.

Your best bet is:

- Buy a variety of manufacturers and SKUs

- Create backups regularly and test your restores
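
A minimal sketch of what "test your restores" could look like in practice, assuming the backup has already been restored into a scratch directory. The function names (`file_digest`, `compare_trees`) and the directory layout are hypothetical, not part of any particular backup tool; the idea is simply to hash every original file and its restored counterpart and report anything missing or different.

```python
import hashlib
from pathlib import Path


def file_digest(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def compare_trees(original: Path, restored: Path) -> dict:
    """Compare each file under `original` with the same relative path under `restored`."""
    report = {"missing": [], "mismatched": [], "ok": 0}
    for src in sorted(p for p in original.rglob("*") if p.is_file()):
        rel = src.relative_to(original)
        dst = restored / rel
        if not dst.is_file():
            report["missing"].append(str(rel))      # file never made it into the restore
        elif file_digest(src) != file_digest(dst):
            report["mismatched"].append(str(rel))   # file exists but contents differ
        else:
            report["ok"] += 1
    return report
```

Running `compare_trees(Path("/data"), Path("/tmp/restore-test"))` (both paths are placeholders) after a trial restore gives a quick answer to whether the backup is actually usable. It does not detect files that exist only in the restored tree.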


Also, always run perf tests (especially using large writes, preferably up to the capacity of the drive!) on any drive where it matters that you got what you paid for.
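
A rough sketch of the write-to-capacity check, assuming you point it at a file on the drive under test. All names here (`make_block`, `write_test_file`, `verify_test_file`, the `seed` string) are made up for illustration. Each block is generated deterministically from its index, so a fake-capacity drive that silently wraps around or drops data shows up as mismatched blocks on read-back.

```python
import hashlib
import os
from pathlib import Path


def make_block(seed: str, index: int, size: int) -> bytes:
    """Deterministic pseudo-random block derived from a seed and the block index."""
    return hashlib.shake_256(f"{seed}:{index}".encode()).digest(size)


def write_test_file(path: Path, total_bytes: int,
                    block_size: int = 1 << 20, seed: str = "drivecheck") -> int:
    """Write whole blocks up to `total_bytes`; return the number of blocks written."""
    blocks = total_bytes // block_size
    with path.open("wb") as f:
        for i in range(blocks):
            f.write(make_block(seed, i, block_size))
        f.flush()
        os.fsync(f.fileno())  # ask the OS to push the data to the device
    return blocks


def verify_test_file(path: Path, blocks: int,
                     block_size: int = 1 << 20, seed: str = "drivecheck") -> list:
    """Re-read the file and return the indices of blocks that do not match."""
    bad = []
    with path.open("rb") as f:
        for i in range(blocks):
            if f.read(block_size) != make_block(seed, i, block_size):
                bad.append(i)
    return bad
```

One caveat: right after writing, reads may be served from the operating system's cache rather than the drive, so for a meaningful result the drive should be unplugged and replugged (or the test sized well beyond RAM) before calling `verify_test_file`.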

The number of counterfeit, badly designed to the point of defective, or DOA SD Cards and SSD drives I've seen over the last few years is crazy.

I literally won't even buy USB sticks anymore. The last time I tried, all 5 different makes/models I tried were so dysfunctional they were useless. Literally unfit for purpose. Major brands too!


Did you buy in person, or in an online marketplace (ex. Amazon)? I only buy thumb drives at physical stores to try and avoid outright counterfeits.


Both.

A lot of (all?) recent USB sticks have terrible thermal design, and will throttle seemingly arbitrarily to very low speeds under sustained load. Like 2.5MB/s type speeds. They seem like they were made to theoretically exist for the market niche, but no one expected them to actually be used by anyone who paid any attention at all.

Same for ones bought in big box stores as Amazon or the like. Name brand or random brand.

A lot of less expensive 2.5+ Gig Ethernet dongles do the same.

Good performance for 5-10 seconds, then abysmal.
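
For anyone wanting to see this pattern on their own hardware, here is a small sketch that writes fixed-size chunks to a file on the device and records the speed of each one. The function names and the 25% threshold in `looks_throttled` are arbitrary choices for illustration, not an established benchmark.

```python
import os
import time
from pathlib import Path


def sustained_write_rates(path: Path, total_bytes: int, chunk_size: int = 8 << 20) -> list:
    """Write at least `total_bytes` to `path`; return the MB/s measured for each chunk."""
    payload = os.urandom(chunk_size)  # incompressible data, generated once
    rates = []
    written = 0
    with path.open("wb") as f:
        while written < total_bytes:
            start = time.perf_counter()
            f.write(payload)
            f.flush()
            os.fsync(f.fileno())  # include the time to reach the device
            elapsed = time.perf_counter() - start
            rates.append(chunk_size / 1e6 / max(elapsed, 1e-9))
            written += chunk_size
    return rates


def looks_throttled(rates, factor: float = 0.25) -> bool:
    """True if the last quarter of the run averages below `factor` x the first quarter."""
    q = max(1, len(rates) // 4)
    head = sum(rates[:q]) / q
    tail = sum(rates[-q:]) / q
    return tail < factor * head
```

The run needs to be long enough (several GB on most sticks) to get past any fast cache region and let the device heat up; a short run will only show the initial burst speed.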

I switched to SD cards, and at least the good brands of those had decent and predictable performance (50-75MB/s sustained for the same price point). They were also a lot cheaper in general for the capacity.



WD bought HGST? HGST are supposed to be far and away the most reliable source of drives iirc.


I wonder if these drives were manufactured during the parts shortage?

Kind of makes you wonder what other devices are ticking time bombs.


Most of them, near as I can tell. Cars manufactured during that time have been having issues like crazy too.


One of the more interesting things to me is that while every storage medium has failures (which is why RAID and backups are a thing :-) there are more failure modes with flash storage that present as abrupt storage failure.


Extreme pro pun title phrasing!

Those extreme pros working for Sandisk - you can't really trust their designs, there's always some little bit that's off about them.


We have one of these as part of a critical video workflow. Anything we can do to mitigate it? Or do we just hope it’s not impacted / replace it soon?


If it's a critical workflow on which your business rests, then you immediately replace it with a better model/brand as that's a business tax write-off. Plus you have the usual on-site and off-site backups which you should already have for your business.

You do have a back-up set up that you also test, right? Right? </Anakin-Padme meme>


If it's a video workflow it's likely more of a working drive, backups don't always keep up with the changes on the drive fast enough.

Unless it's part of a RAID array or something, but by that point you'd shell the money out for a better drive


The fact you have one SSD in a critical workflow is an immediate red flag. You should have some kind of redundant solution with backups even if you didn't suspect particular SSDs are prone to failure.


99% of small businesses just flat out ‘nope’ out when it comes to proper backups or redundancies though.


RAID and a backup strategy? There should not be a single point of failure. Just getting 2 new SSDs with a RAID 1 would be a massive improvement.

And, of course, a separate backup for them because RAID is not a backup.


Replace it with a different SSD sounds like the only option.


I think one can enclose M.2 SSDs in USB adapters; then you can just use well-proven tech like the Samsung 970 Pro, which has been chugging along on our build server for years now.


Many of these adapters have their own quality problems which vary with the version of the controller. That version number is rarely available prior to purchase.


If you have a critical application, you can afford a vendor that uses TB4 with a good reputation.

Here are some options:

https://www.owc.com/solutions/thunderbay-flex-8

https://www.startech.com/en-us/hdd/m2e4btb3


If it's critical, you should not use a cheap SSD. It is better to use a SSD for professional use, for servers.

I have seen and heard too many consumer market a-brand SSDs break.


The Extreme Pro lineup isn't even considered a "cheap SSD", it's their highest end offering before you dip into their G-DRIVE line of rugged SSDs.


Replace it immediately, not soon.


It would probably help to describe your workflow so we can offer specific suggestions.


Looks like this particular problem is easy to fix though.


By whom? Your granny who just lost all the pictures of their grandchildren?


No but by me or anyone else who can hold a soldering iron :)

It's much much easier than a BGA cracking issue, or something internal in the flash which is basically unfixable. This is just some components tombstoning. It shouldn't cost a lot to get it fixed (of course Sandisk should take care of that)


The article unfortunately was written by someone with no clue so we don't know why tombstoned components (shown in the picture) were not caught in inspection/test. They imply the failures happened in the field, but that's not where tombstoning happens. Presumably what happened was that the supercap (looking like [1]) tombstoned in reflow. Then circuit test failed to test that it was installed so the unit was shipped. Subsequently in the field the unit suffered a sudden power loss with pending writes. Normally the supercap provides power for long enough to flush pending writes to NAND. But since it was open circuit, the power fail flush never finished, resulting in corrupted storage. Fixing the open circuit solder joint as you suggest does not remedy the problem for the user because their data is still gone.

[1] https://www.digikey.com/en/products/detail/seiko-instruments...


One capacitor on a tank array would definitely reduce its total capacitance, but they are nearly always in parallel and would not cause a failure of the whole tank, and the device would be inoperative if the output of the array was shorted.

I'm skeptical that losing one capacitor in the array would cause the failure mode you're describing. Especially if the age of the devices is considered, the array would have been designed with margin to withstand capacitance loss as the device ages.
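
To put rough numbers on the parallel-array argument, here is a back-of-the-envelope sketch. It assumes ideal capacitors, a constant-power load, and a usable voltage window from `v_start` down to `v_min`; every value in the usage note is invented for illustration and says nothing about the actual SanDisk design.

```python
def parallel_capacitance(caps_farads):
    """Capacitors in parallel simply add."""
    return sum(caps_farads)


def holdup_time_s(c_total, v_start, v_min, load_watts):
    """Seconds of hold-up from the usable energy 0.5 * C * (v_start^2 - v_min^2)."""
    usable_energy = 0.5 * c_total * (v_start ** 2 - v_min ** 2)
    return usable_energy / load_watts
```

With four hypothetical 100 µF capacitors discharging from 5 V to 3 V into a 1 W load, `holdup_time_s(parallel_capacitance([100e-6] * 4), 5.0, 3.0, 1.0)` comes to about 3.2 ms; with one capacitor open it drops to about 2.4 ms, i.e. 75% of the original. Whether that is still enough depends on how much margin the designers left, which is the open question in this subthread.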


"I'm skeptical that losing one capacitor in the array would cause the failure mode you're describing."

Depends on what the capacitor is being used for in the circuit. In many cases, having a cap fail open results in a higher current draw which kills the unit if left in operation for too long. This is the case on some of the off-road lighting I manufacture. If one cap is present and fails open at ground, the circuit overloads. If the cap is connected to ground but not the rest of the circuit, the circuit doesn't operate.

Regardless, one component being off can cause a whole chain of maladies.


Perhaps tombstoning causes it to short the whole array? I could see that happening if it's positioned just wrong.


> but that's not where tombstoning happens

yeah I know, unless the board gets so hot it unsolders itself, which is very very doubtful (and definitely a fault of its own).

I thought it was more of a stability problem though. Nothing a good backup shouldn't cover, and the device should be fine after soldering the component.


By anyone who can operate a stereo microscope and a surface mount solder station.

A Fisher-Price “My First 40 Watt Weller Soldering Pencil” won’t cut it for this type of repair as you’re not just flicking diodes off a board to “unlock” something.


It does for me.. I've soldered 0805 (and 1206 which was most of them fortunately) components with a screwdriver-tipped aldi iron as I didn't have anything else available. It was not a great experience but being very careful with the corner of it it worked.

But this is a super capacitor so it'll be a lot bigger than that.

But a hot air rework station or a really fine temperature-controlled tip is way better of course, which is what I usually use.


If a fix requires soldering, then to >95% of people it doesn't exist. I would be surprised if even most computer repair shops were up to it.


Yes but this is more the problem with the mentality around today's disposable electronics than a real human problem. A lot of these skills have been lost.

In the 80s it was totally normal to get an electrical schematic with a TV for instance, and there were repair shops all over (or people doing it from home for a small fee as a side business).

These days it's not as impossible as people think. In fact very often when a TV fails it's a through-hole capacitor that is trivial to replace for a couple bucks. I have repaired several at work and for friends and they still work fine (I always replace it with good quality high-temperature rated ones, manufacturers often use too low a temperature rating so the equipment will fail far too soon and the customers buy a new one).


Guess who gets blamed if your soldered SSD fails.


Yeah, this stuff is harder than it looks. If you need too much time with the soldering iron, the temperature can conduct through the wire and fry other components, those sensitive ICs that are the flash chips in particular.


Are you sure the BGA is soldered correctly? Regarding the soldering, almost every 2nd component looks pretty bad.


They "assured" me that mine won't fail. They checked the serial numbers, and they're not affected (3 disks).

Now I'm in the dark again


I’ll bet one of the purchasing agents found a good deal on resistors and thought they were equivalent and swapped them out.


Somehow they forgot after 25 years of expertise what to do... plausible.


Is there a class action lawsuit yet?


Literally the first sentence from the article:

>A new report from a data recovery company now points the finger at design and manufacturing flaws as the underlying issue with the recent flood of SanDisk Extreme Pro failures that eventually spurred a class action lawsuit


Thank you. I skimmed too fast. My apologies. I am under the weather.


If that's really the issue, it's trivial to fix and you can pick these up for nothing in the secondary markets.


For you and, indeed, for me too. But, sadly, not for many people.


3 copies. Always. Spread them out on different companies and technologies.


And physical locations



