The SSD Endurance Experiment: Two petabytes (techreport.com)
167 points by geoffgasior on Dec 4, 2014 | 27 comments

So, we know these SSDs will survive an ungodly amount of read-writes. Is there a way to test how they will survive as archival media?

Presumably the major cause of data degradation in archival time-scales is bit-rot from either component failure or media corruption. What are the sources of bit-rot and component failure? Can they be accelerated to provide a rough benchmark for component failure over longer time-scales?

Heat is typically used to simulate aging; other than that, just time.

Wow. And the 840 is still going error-free. That's pretty impressive.

The longevity of the 840 Pro vs the silent failure of the 840 makes a really good case for the Pro.

>The Samsung 840 Series started reporting reallocated sectors after just 100TB, likely because its TLC NAND is more sensitive to voltage-window shrinkage than the MLC flash in the other SSDs. The 840 Series went on to log thousands of reallocated sectors before veering into a ditch on the last stretch before the petabyte threshold. There was no warning before it died, and the SMART attributes said ample spare flash lay in reserve. The SMART stats also showed two batches of uncorrectable errors, one of which hit after only 300TB of writes. Even though the 840 Series technically made it past 900TB, its reliability was compromised long before that.

Note that they seem to be running a sample size of one. To add an anecdote, my own 840 Pro, after relatively little use, died completely and without warning one random day.

Yeah, the random controller failures are really what kills SSDs. This test has been all about proving that wear leveling works and write endurance isn't the thing to be worrying about.

Where did you see that the 840 is still going error-free? It says that the 840 maxed out its reallocated sectors at around 900TB and veered into a ditch right before the petabyte threshold.

It'd be interesting to run these same tests on enterprise grade drives as well.

Edit: You meant that the 840 Pro is still going, I see.

840 has TLC, 840 Pro MLC

TLC cells should sustain 1-1.5K Program-Erase cycles

MLC 3-5K P/E cycles

eMLC 10-30K P/E cycles

SLC >100K

A 256GB eMLC SSD rated for 10K P/E cycles should be able to sustain 2.56PB written (256GB × 10,000), which is pretty much what the 256GB 840 Pro with MLC was able to sustain in the test.

Also, enterprise SSDs usually come with huge overprovisioning: a drive with 1TB of raw flash usually comes with 800GB of usable space.

I had a few 720GB Fusion-io ioDrives with 1.1PB written and 0 reallocated sectors, and these were rated for 10PBW.
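The arithmetic in the comment above can be sketched in a few lines. This is only a back-of-the-envelope estimate: it assumes perfect wear leveling and a write amplification of 1, neither of which holds exactly on a real drive, and the function names are made up for illustration.

```python
# Rough endurance and overprovisioning arithmetic.
# Assumptions (not from the article): perfect wear leveling and
# write amplification of exactly 1. Function names are hypothetical.

def estimated_endurance_pb(capacity_gb: float, pe_cycles: int) -> float:
    """Total writes, in decimal petabytes, before cells hit their rated P/E cycles."""
    return capacity_gb * pe_cycles / 1_000_000  # GB -> PB


def overprovisioning_pct(raw_gb: float, usable_gb: float) -> float:
    """Spare flash as a percentage of the usable capacity."""
    return (raw_gb - usable_gb) / usable_gb * 100


# 256GB eMLC at 10K P/E cycles, as in the comment above.
print(estimated_endurance_pb(256, 10_000), "PB")
# 1TB of raw flash exposed as 800GB usable.
print(overprovisioning_pct(1000, 800), "% spare")
```

Plugging in the lower MLC figure of 3K cycles gives well under 1PB for the same capacity, so rated cycle counts look like conservative floors rather than predictions.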

I think so, yeah. That's one of the reasons I paid a bit extra to put an 840 Pro in my main development box.

Yeah: the 840 died but the 840 Pro is still going.

Yeah, I meant the 840 Pro. Sorry.

For some reason my SSDs have never lasted very long. I've been using consumer-grade SSDs since 2009, and among those which failed are a SuperTalent Ultradrive 128GB, an Intel X25-M and a Crucial M400. Now I use a Samsung 840 Evo, which is actually a replacement since the first one died after just a couple of weeks.

Granted, I am a power user with a lot of small writes because of software-development-related activities, but it still strikes me that everyone else is under the impression that SSDs last forever. Certainly not my experience. The story is similar for a couple of my buddies.

My experience with SSD reliability, and that of my friends/colleagues/contacts, has been similar to what we've seen with spinning-metal drives, though with a smaller sample set thus far. I've used a number of drives at home and work and had two fail: one just died, and the other started reporting write errors (one Sandisk and one OCZ; I forget which exact models and which failed which way).

Between us we've got a fair few Crucial drives running pretty much 24/7 (in my case at home: the system drive in a Windows desktop that never turns off, a pair in RAID1 for the system + core VMs volumes in a server), and a selection from other manufacturers. IIRC the ones in the machine at the office are by Samsung.

Other people I know have had similar experience. Most of the failures we've experienced were early on, which either means it was down to luck or quality has improved over time. I wouldn't say I find SSD to be any less reliable than traditional drives, though when they do fail it is more often that they "just die without warning" than other failure modes.

Do you have a decent power supply? Does your building get a lot of surges? My SSDs have been rock solid.

If we're doing anecdotes, of the dozen plus SSDs I've bought for myself or helped other people select/install, I've never had a failure. This includes my ~2008 SuperTalent drive, which still works like it was brand new (which is to say slowly; SSDs at that point largely ended up slower than HDDs in at least some metrics, but I was still excited).

Either your PC or your power is killing those SSDs. I have had dozens of the things since 2010 and have had zero failures.

Like many others, I use an SSD as my system drive and HDDs for data drives. Accessing files the first time might take a bit longer, but with 32GB RAM, after the initial accesses, I don't see many hits on the drive.

I never got an SSD (sadface), but maybe you should try a different PSU. If your PSU is sending crazy voltages to it, that might shorten the lifespan (just common sense, I'm not an SSD expert). Maybe you have a strong graphics card that's drawing all the power and a shoddy PSU or something.

I'm using:

1) Corsair Performance Pro 128GB; Marvell controller; Toshiba toggle-NAND memory; used for 2.5 years; written about 7-8TB.

2) Corsair Neutron GTX 120GB; LAMD controller; Toshiba toggle-NAND memory; used for 1.5 years; written about 4TB.

Both are up and running 24/7 most of the time, both are usually filled to 60-80%, one is system partition, one is for remaining software and games. Both SSDs suffer about 10-20 electricity outages per year.

As a side question, how can you tell how many writes a regular drive has sustained? Say, like the one in my desktop or laptop?

SSDs tend to report this as SMART ID 241, total LBAs written.

I haven't seen this on any rotational drives. I actually find that puzzling, as it's a very interesting stat to track.

It's an interesting stat to track, but total writes were never a physical endurance metric for spinning drives, which is probably why the attribute didn't appear until SSD vendors started tracking it. For flash, where wear from writes is much more obvious, it can be a very serious thing to keep track of.

I thought the main endurance metric for spinning disks is total bytes read or written? AKA time where the head is right up close to the platter. That doesn't seem too far off to track both total read and write IOs and/or bytes.

I believe (but can't cite any sources for that) that unless the head is parked (which only happens when the disk goes to sleep, or at least after a somewhat extended period of inactivity), HDD heads are always right above the platter, regardless of whether something is being read or written, the head is moving, or there is a short pause in activity. Assuming that the number of repolarisations of the magnetic substrate is not the limiting factor, total parked/sleep time should correlate fairly strongly with lifetime (although the correlation might be negative when the periods are too short, due to the cost of parking?), while total bytes read/written should correlate only weakly (through its correlation with the former). I am not a storage device specialist.

sudo smartctl -a /dev/sda | grep Total_LBAs_Written

Please do not depend on ID 241 - it may vary with manufacturer [1]

[1] http://www.samsung.com/global/business/semiconductor/minisit...
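For what it's worth, here is a rough sketch of turning the raw value from that smartctl command into bytes. The sample output is made up, the helper name is hypothetical, and the 512-byte LBA size is an assumption: as noted above, vendors differ in which attribute they use and in what units the raw value counts, so check your drive's documentation before trusting the number.

```python
from typing import Optional

# Made-up sample in the column layout of smartctl's attribute table.
SAMPLE_OUTPUT = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  9 Power_On_Hours          0x0032   099   099   000    Old_age   Always       -       4321
241 Total_LBAs_Written      0x0032   099   099   000    Old_age   Always       -       19531250000
"""


def total_bytes_written(smart_text: str, lba_bytes: int = 512) -> Optional[int]:
    """Return bytes written based on the Total_LBAs_Written row, or None if absent.

    Matches on the attribute name rather than ID 241; the name is also
    vendor-dependent. lba_bytes=512 is an assumption, not a guarantee.
    """
    for line in smart_text.splitlines():
        fields = line.split()
        # Attribute rows have 10 columns; RAW_VALUE is the last one.
        if len(fields) >= 10 and fields[1] == "Total_LBAs_Written":
            return int(fields[9]) * lba_bytes
    return None


written = total_bytes_written(SAMPLE_OUTPUT)
if written is not None:
    print(f"{written / 1e12:.2f} TB written")
```

To use it on a real drive, you would feed it the text printed by `sudo smartctl -a /dev/sda` instead of the sample string.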

Glad I have the 840 Pro, although the sample size of 1 for each drive essentially makes these tests meaningless.

These tests are only meaningless if you completely misinterpret what they're testing. It's not a test of the overall reliability of the drives. They're just testing the write endurance (and occasionally the data retention). The wear leveling and garbage collection algorithms will have zero variance between different drives of the same model, so there's no need for a large sample of controllers. And each drive itself constitutes a large sample of flash memory so any random variation in the lifespan of individual NAND cells is already averaged out.
