Seagate Barracuda Ramp Weakness (2016) (dq-int.co.uk)
113 points by doener 14 days ago | 28 comments

I'm not surprised it's about the infamous ST3000DM001, one of the few models of hard drives to have its own Wikipedia page and HN discussion: https://news.ycombinator.com/item?id=27419072

The broken ramps clearly use significantly less material and are thus weaker. If you look carefully you can also see what looks like stress whitening (a slightly lighter colour) forming on the bottom half of the cracked one, as well as on both halves of the one below it.

Maybe this could be the reason why these drives have had so many failures?


I can't find the original source, but if you Google around a bit, you'll find this quote from a German data recovery company called "Datenrettung".

"We must assume that this is an error in the design of the Seagate Grenada hard drive installed in the Time Capsule (ST3000DM001 / ST2000DM001 2014-2018). The parking ramp of this hard drive consists of two different materials. Sooner or later, the parking ramp will break on this hard drive model, installed in a rather poorly ventilated Time Capsule."

I'd feel pretty aggrieved if two disks in my four-disk raid failed simultaneously. Is there a standard practice for allowable failures in disk arrays that would have allowed for recovery after this?

This was using RAID 5, which in terms of reliability isn't great. It can recover from 1 disk failure, but as the recovery process is so intensive, it's likely another disk may fail during recovery - especially if they are all the same age. Other RAID setups can recover from two disk failures - but that's a trade-off with performance and useable space.
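
To make the "can recover from 1 disk failure" part concrete, here is a minimal sketch of single-parity recovery using XOR. It models one stripe only, with made-up block contents and a hypothetical helper name (`xor_blocks`); real RAID 5 rotates parity across the disks and works on much larger blocks.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte blocks together, column by column."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# One stripe: three data blocks on three disks, parity on a fourth.
# The block contents are invented for illustration.
data = [b"disk0 blk", b"disk1 blk", b"disk2 blk"]
parity = xor_blocks(data)

# Disk 1 dies: rebuild its block from every surviving block plus parity.
survivors = [data[0], data[2], parity]
rebuilt = xor_blocks(survivors)
assert rebuilt == data[1]
```

Note that the rebuild has to read every surviving disk in full, which is where the extra load during recovery comes from.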

Mixing drives from different manufacturers may help, but really you shouldn't rely on RAID alone. For commercial use you could duplicate the data on multiple NAS systems, but at home you probably aren't going to do that. Simplest is to understand RAID is not a backup, and to store the data somewhere else as well :-)

Back during the dot-com era we lost a production array to the infamous "DeathStar" drives.

The system reported a failure, so we scheduled the drive to be replaced and brought up the hot spare and started the parity resync process. A little while later there was another drive failure and we told the data center folks to tell the tech to hurry up. While the tech was headed to our cage, there was a third drive failure and the array was toast. We were able to restore from backup, but the data was a day old.

Lessons were: Mix drives from different production batches (we couldn't mix manufacturers because of the leasing contract). Have a backup that you can restore from. Parity resync operations while the array is in use will put more stress on the drives than production use alone will, and will kill any (remaining) weak drives.

When I was selling SANs, the conventional wisdom was RAID10 for production/performance stuff, and RAID6 for archival/"colder" storage. Plus an external or offsite backup, as you said.

I build quite a few RAID1 arrays for workstations and home servers. Normal, rebuild, and even degraded performance are all great. In theory RAID5 is more space-efficient, but for me mirrored 14TB drives are affordable and compact.

At home an offsite backup is going to be best - the initial backup though, especially on a cable internet connection is going to take weeks.
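
The "weeks" estimate is easy to sanity-check. A rough sketch, where the data size and upstream speed are assumptions picked for illustration and the link is assumed to be saturated the whole time with no protocol overhead:

```python
def upload_days(data_bytes, uplink_bits_per_s):
    """Days needed to push data_bytes through a link of the given speed."""
    seconds = data_bytes * 8 / uplink_bits_per_s
    return seconds / 86_400  # seconds per day

# Hypothetical numbers: 4 TB of data over a 10 Mbit/s upstream link.
print(round(upload_days(4e12, 10e6), 1))
```

With those assumed numbers it works out to roughly 37 days, so "weeks" is, if anything, optimistic for larger arrays.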

> [...] the initial backup though, especially on a cable internet connection is going to take weeks.

Initially, I drove my HDDs offsite... and I keep another offsite backup in my car.

In the car is rather clever, I hadn't thought of doing that.

I've been maintaining an offsite at a friend's house that I rotate out whenever I visit. The car would be a good extra. Encryption required, but I'm already doing that.

> In the car is rather clever,

I'd give them credit but I forgot where I got that idea from ;) but I don't even encrypt mine because it has a higher chance of recovery that way.

Are you not worried about the temperature variations?

I thought RAID 5 was a no-go nowadays due to the high probability of rebuild causing a second failure?
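
One common form of this argument is about unrecoverable read errors (UREs) hit while reading the surviving disks during a rebuild. A back-of-envelope sketch, assuming the often-quoted consumer spec of one URE per 1e14 bits read and assuming errors are independent; both are assumptions, not measurements, and real drives frequently do better than the spec sheet:

```python
import math

def p_ure_during_rebuild(bytes_to_read, ure_per_bit=1e-14):
    """Probability of at least one URE while reading bytes_to_read.

    ure_per_bit is an assumed spec-sheet rate, treated as independent per bit.
    """
    bits = bytes_to_read * 8
    # 1 - (1 - p)^bits, computed in a numerically stable way.
    return -math.expm1(bits * math.log1p(-ure_per_bit))

# Hypothetical 4-disk RAID 5 of 3 TB drives: the rebuild reads 3 surviving disks.
print(p_ure_during_rebuild(3 * 3e12))
```

Under those assumptions the result is a little over 50%. Whether a single URE actually aborts the rebuild depends on the RAID implementation, so treat this as an illustration of the reasoning rather than a prediction.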

depends on your needs for redundancy

Adding more to this: disks used in a RAID tend to be the same model and even the same batch, and are thus susceptible to the same vulnerabilities. Home users typically have neither the need for the capacity/speed nor the budget for enough disks in the RAID to tolerate two disks failing at the same time. As disks get larger and larger, it's really unlikely that every disk stays defect-free over the expected life span of the RAID. If one disk in the array shows signs of failure, the others are probably going to have the exact same issue soon. Given the size of the disks, it can take days to rebuild the array after replacing a single failed disk. Making it worse, the rebuild itself puts more stress on the remaining disks, which may already be about to fail, and can really lead to a cascading failure of multiple disks, in which case total loss of data is unavoidable.

So even though my NAS supports RAID5/6, I still go with single-disk volumes, so that in the event of disk failures I only have to replace the failed disks and lose only the data on them in the worst case. In fact, I don't lose much by doing this, because my data does not have to be in the same volume, and the disks are already fast enough that my home network is the bottleneck.

Recently my small home server had a disk failure. One of the BTRFS RAID0 drives started having a lot of device errors, and shortly after the second drive started having errors too - but BTRFS was able to just take the healthy bits from both disks and I lost nothing.

However, since I now had to replace 2 disks, I decided an upgrade was in order: the disks had been there for 10 years, and I had wanted more than 1TB for a while. Turns out it's two times cheaper to build a 4TB array from three 2TB disks than from two 4TB disks.

Well, I would say you had bad sectors on your disks. If there aren't too many of them, most modern file systems can handle it, but there is still a possibility of losing data. Nonetheless, it's a signal to start migrating data.

What I meant by disk failures was real disk failures, where there is no way to get data off the disk directly unless you open it and read the platters. In that case all your data would be totaled, as RAID0 offers no redundancy and thus tolerates no disk failure at all. Your data doesn't seem like much, and it would not be that hard to migrate a 4TB array. Things would be completely different if you had to rebuild or, even worse, migrate a much bigger array, e.g. a 4x10TB one. Even if everything goes well, it will take a few days just to copy the data somewhere else. If there are cascading failures, then there is really not much you can do. I can promise you that once you have gone through that nightmare, you will never ever want to do it again. Taking the cost and effort into account, whenever I need to build a personal NAS, I always go with 2 or 3 of the biggest-capacity NAS/Enterprise HDDs available and leave at least 1 bay vacant so that I can easily add more space without fiddling with the existing drives. Even if the NAS eventually becomes full, I can still simply replace the smallest drive with a bigger one and copy the data back.

Mix drives and manufacturers. It is much more likely to see clustered failures if you buy all your drives at once, from the same distributor, and they're the exact same model. Conventional wisdom says you shouldn't do this for either hardware or software RAID, and I wouldn't do it with hardware RAID, but why would you use that anyway.

RAID 5 gives up one hard drive's worth of space and can survive the loss of one drive. You can go with RAID 6, which will tolerate two drives dying, but it comes at the cost of giving up two drives' worth of space.
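
The space trade-off can be written down in a few lines. A minimal sketch, assuming equal-sized drives and ignoring filesystem overhead; the function name `usable_tb` and the level strings are hypothetical, and a plain mirror is included only for comparison:

```python
def usable_tb(level, drives, size_tb):
    """Usable capacity in TB for an array of equal-sized drives."""
    if level == "raid1":   # plain mirror: every drive holds the same data
        if drives < 2:
            raise ValueError("raid1 needs at least 2 drives")
        return size_tb
    if level == "raid5":   # one drive's worth of parity
        if drives < 3:
            raise ValueError("raid5 needs at least 3 drives")
        return (drives - 1) * size_tb
    if level == "raid6":   # two drives' worth of parity
        if drives < 4:
            raise ValueError("raid6 needs at least 4 drives")
        return (drives - 2) * size_tb
    raise ValueError(f"unknown level: {level}")

print(usable_tb("raid5", 3, 2))  # three 2TB disks
print(usable_tb("raid1", 2, 4))  # two mirrored 4TB disks
print(usable_tb("raid6", 4, 3))  # four 3TB disks
```

The first two calls both come out to 4TB usable, which matches the "4TB array from three 2TB disks or two 4TB disks" comparison made earlier in the thread.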

As you add more drives, the odds of failure go up. It’s N drives all playing Russian roulette independently. With enough drives even raid 6 is not enough.
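
A quick way to see the Russian roulette effect is a binomial calculation. This is a sketch under strong assumptions: each drive fails independently with the same probability `p` over some window, and nothing gets replaced during that window. The 5% figure below is a made-up illustrative value, not a measured failure rate.

```python
from math import comb

def p_at_least(k, n, p):
    """Probability that at least k of n independent drives fail,
    each with failure probability p over the same window."""
    return 1.0 - sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k))

# RAID 6 loses data once 3 or more drives are gone.
# p = 0.05 is an assumed, illustrative per-drive failure probability.
for n in (6, 12, 24):
    print(n, p_at_least(3, n, 0.05))
```

The probability of exceeding the two-drive tolerance grows with the drive count. In practice, same-batch drives fail in a correlated way, which makes the real picture worse than this independent model suggests.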

People use RAID 10 because single drive failures are cheap to replace (you copy one disk to another), you can survive some two-drive failures, and the logical arrangement is simpler: it’s just stripes where each stripe member has a duplicate.
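
"Some two-drive failures" can be counted directly. A small sketch that enumerates every pair of failed drives, assuming a layout where drives 2k and 2k+1 form a mirror pair (the numbering and the function name are hypothetical); the array is lost only when both failures land in the same pair.

```python
from itertools import combinations

def raid10_two_failure_survival(n_drives):
    """Fraction of two-drive failure combinations a RAID 10 array survives.

    Assumed layout: drives 2k and 2k+1 mirror each other.
    """
    if n_drives < 4 or n_drives % 2:
        raise ValueError("need an even number of drives, at least 4")
    pairs = list(combinations(range(n_drives), 2))
    # The array survives when the two failed drives are in different mirror pairs.
    survived = sum(1 for a, b in pairs if a // 2 != b // 2)
    return survived / len(pairs)

print(raid10_two_failure_survival(4))
print(raid10_two_failure_survival(8))
```

For 4 drives that is 4 of the 6 possible pairs, and the survivable fraction rises as more mirror pairs are added. This treats every pair of failures as equally likely, which ignores the extra load on the surviving mirror partner during a rebuild.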

I wonder if it increases the probability of drives dying.

When you connect multiple similar devices together mechanically, their vibrations can sum catastrophically.

Ah, it’s mechanical resonance I’m thinking of.

Incredible little detail. I'm pretty sure I have a bunch of disks with similar issues that I couldn't diagnose because I'd never consider this a critical failure point.

Not many of us buy disks at enough scale in order to see patterns emerge from their failures.

I personally have like maybe 5 disks, two I purchased and three scavenged from systems they outlived. Not really gonna learn which ones are good and bad with that kind of sample am I?

It’s notorious that bad batches of hard drives came out. Circa 2000 the market was flooded with Maxtor drives, it was hard to find anything else at retail stores.

I bought 8 of them and had 5 fail in 2 years. Fortunately the failures were never synchronized enough to kill a RAID.

I've had 2/8 Seagate IronWolf drives fail on me slightly outside of the first year I had them. I'm using them in a Synology NAS. I've never had such problems with the WD Reds in my last Synology.

An article from 2016 makes the FP, but no updates about the huge WD hack that happened like two weeks ago??


What more updates do you want? AFAIK the last development was that WD offered some sort of trade-in program or data recovery service to the affected users.
