The main issue I have with this form of testing is that it's basically measuring the ultimate endurance characteristics of the flash - running program/erase cycles until some piece of the flash becomes completely unusable. The majority of the time the first failure will occur in a user data block, but there's a nonzero chance that it hits a block mapping table or the firmware itself, and that will definitely cause catastrophic failure. The article seems to imply that it's OK to write more data than the manufacturer specifies, but this is not something anyone should ever do in a real-world scenario, because retention is inversely proportional to endurance and also falls off exponentially with temperature. A drive that retains data for a week at 20C may not manage it at 30C, or even 25C.
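For a sense of how steep the temperature effect is, here's a rough Arrhenius-style sketch - the ~1.1 eV activation energy is my assumption (a value often used when modelling NAND retention), not a measured figure:

    # Sketch: Arrhenius acceleration of charge loss with temperature.
    # The ~1.1 eV activation energy is an assumed figure, often used for NAND
    # retention modelling; the real value depends on the process and failure mode.
    import math

    K_BOLTZMANN_EV = 8.617e-5  # Boltzmann constant in eV/K

    def retention_acceleration(t_low_c, t_high_c, ea_ev=1.1):
        """Factor by which retention time shrinks going from t_low_c to t_high_c."""
        t_low_k = t_low_c + 273.15
        t_high_k = t_high_c + 273.15
        return math.exp(ea_ev / K_BOLTZMANN_EV * (1.0 / t_low_k - 1.0 / t_high_k))

    print(retention_acceleration(20, 25))  # ~2x faster charge loss at 25C
    print(retention_acceleration(20, 30))  # ~4x faster charge loss at 30C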
The 840 Pro's reallocated sector count appears to have started rising at 600TB, which is roughly 2400 P/E cycles, on average, of the whole flash - this is not surprising and agrees with the typical endurance figure of 2K-3K for 2x nm MLC.
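The arithmetic behind that, assuming the 256GB model and a write amplification close to 1 (both assumptions on my part):

    # Back-of-the-envelope: host writes -> average P/E cycles per cell.
    # Assumes a 256GB drive and write amplification of ~1 (both assumptions).
    host_writes_tb = 600
    capacity_gb = 256
    write_amplification = 1.0

    avg_pe_cycles = host_writes_tb * 1024 / capacity_gb * write_amplification
    print(round(avg_pe_cycles))  # ~2400, in line with the 2K-3K rating for 2x nm MLC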
I've never fully agreed with the reasoning behind MLC - yes, it's technically twice the capacity for the same die area/price as SLC (or alternatively, half the area/price for the same capacity), but it also has nearly two orders of magnitude less endurance/retention and requires far more controller complexity for error correction and bad-block management. In a storage device, I think reliability is more important than capacity - even with backups, no one wants to lose any data. The tradeoff doesn't make much sense to me - theoretically, you could buy an MLC SSD that wears out after a few years (thus needing to replace it and copy the existing data over to the new one, with all the risks that entails, etc.), or, for only twice as much, an SLC one that probably won't ever need replacing.
A 256GB SLC SSD with 100K P/E cycle flash is conceivably good for 25PB and 5-10 years, or <1PB and over a century... i.e. you could probably use one for archival if stored in a good environment. Part of me thinks the manufacturers just don't want to make such long-lasting products, hence the strong association of SLC to "enterprise" products. (And the much higher pricing of SLC SSDs, more than the raw price of NAND would suggest.)
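That 25PB figure is just capacity times rated cycle count - a quick sanity check, assuming perfect wear levelling and a write amplification of 1:

    # Rough ceiling on total writes for a 256GB SLC drive rated at 100K P/E cycles.
    # Assumes perfect wear levelling and write amplification of 1 (both idealised).
    capacity_gb = 256
    rated_pe_cycles = 100_000

    total_writes_pb = capacity_gb * rated_pe_cycles / 1_000_000
    print(total_writes_pb)  # 25.6 PB of host writes before hitting the rated cycle count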
The rationale is that most people will never come anywhere near those kinds of P/E cycle counts, and so would rather pay for more space, or pay less. That holds even in many enterprise settings.
We have some cheapish SSDs in use for some of our high-traffic database servers. We lost some drives that failed catastrophically, and the company we bought them from "suggested" we might have worn them out, that maybe we didn't have a reason to RMA them, and that perhaps we just ought to buy more expensive enterprise models next time.
So we checked the SMART data, and after a year of what to us is heavy 24/7 use with a large percentage of writes, we'd gone through less than 10% of the P/E cycles.
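A sketch of that kind of sanity check - the numbers below are placeholders, since SMART attribute names, drive sizes and rated cycle counts vary by vendor:

    # Sketch: estimate the fraction of rated endurance consumed from SMART data.
    # total_lbas_written would come from an attribute such as Total_LBAs_Written;
    # the drive size, rated cycles and write amplification below are assumptions.
    lba_size_bytes = 512
    total_lbas_written = 30_000_000_000   # hypothetical reading after a year of use
    capacity_bytes = 256 * 10**9          # hypothetical 256GB drive
    rated_pe_cycles = 3000                # typical rating for consumer MLC
    write_amplification = 1.5             # assumed; depends heavily on workload

    host_bytes = total_lbas_written * lba_size_bytes
    cycles_used = host_bytes * write_amplification / capacity_bytes
    print(f"{cycles_used / rated_pe_cycles:.1%} of rated P/E cycles used")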
(We did our RMA, and it was very clear that this was a problem with the model/batch - all the failed drives were OCZ Vertex drives from when their failure rate shot through the roof before the bankruptcy)
All our other SSDs are chugging along nicely; the oldest have suffered through 3-4 years of heavy database traffic. I'm just waiting for the oldest ones to start failing.
At that rate it doesn't matter that they won't survive as long as SLC anyway: we'll end up replacing them with faster, higher-capacity newer models soon regardless - we usually do so on a 3-5 year cycle, depending on hardware and needs - because it's more cost-effective for us to upgrade regularly and increase our hosting density. That helps us avoid taking more rack space, and colocation space/power/cooling costs us more than the amortised cost of the hardware.
The consumer market is similar: Most people don't ever buy replacements for failed drives - they buy a newer computer.
> The rationale is that most people will never come anywhere near those kinds of P/E cycle counts
The flip side of that is most people could now have drives that don't cost all that much more, but last much longer. Most SLC tends to be rated for 100K cycles and 10 years of retention; assuming a roughly inverse correlation, at 10K or 1K cycles the retention goes up considerably to a century or more.
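Spelling that assumption out (a rough model on my part, not a datasheet relationship):

    # Sketch of the assumed inverse scaling between cycles consumed and retention.
    # This linear inverse relationship is an assumption, not a datasheet spec.
    rated_cycles = 100_000
    rated_retention_years = 10

    for cycles_used in (100_000, 10_000, 1_000):
        est_retention_years = rated_retention_years * rated_cycles / cycles_used
        print(f"{cycles_used} cycles used -> ~{est_retention_years:.0f} years retention")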
> The consumer market is similar: Most people don't ever buy replacements for failed drives - they buy a newer computer.
That is true, but the long-term implications are more subtle; the fact is that most people don't back up, and quite a few of them keep their old drives (which were still working when they were replaced) around as a "backup", with the implicit assumption that the data on them will likely still be there if they ever want, say, an older version of some file. With flash memory, this assumption no longer holds.
On a longer timescale, we've been able to "recover data" from stone tablets, ancient scrolls and books, this being a very valuable source of historical information; and most if not all of that data was probably never considered to be worth archiving or preserving at the time. More recently, rare software has been recovered from old disks ( http://www.chrisfenton.com/cray-1-digital-archeology/ ). Only the default, robust nature of the media made this possible.
Despite modern technology increasing the amount of storage available, and the potential to have it persist for a very long time, it seems we've shifted from "data will persist unless explicitly destroyed" to "data will NOT persist unless explicitly preserved", which agrees well with the notion that we may be living in one of the most forgettable periods in history. It's a little sad, I think.
The fact is, it wouldn't matter even if the data took 10,000 years to degrade off the platter itself. Most consumer hard drives these days are made for laptops, which are probably used for less than 5 years on average. Even if you consider external and desktop HDs, that long a lifetime isn't much use: the control electronics themselves fail fast, and the mechanics even faster.
It's an optimization rule of thumb: in an optimal trade-off for (e.g.) maximum reliability at a given cost, the reliability of each element will tend to end up similar (strictly, the derivative of reliability with respect to cost will be equal across elements, but that tends to imply the former) - i.e. you improve the least reliable element and sacrifice the most reliable one, even if the most reliable element is already very reliable.
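In symbols, for a system reliability R(c_1, ..., c_n) and a cost budget C, it's the usual equal-marginal-return condition (just the standard Lagrangian sketch, nothing specific to drives):

    \[
      \max_{c_1,\dots,c_n} R(c_1,\dots,c_n)
      \quad \text{s.t.} \quad \sum_i c_i = C
    \]
    % With \mathcal{L} = R - \lambda\left(\sum_i c_i - C\right), an interior optimum requires
    \[
      \frac{\partial R}{\partial c_i} = \lambda \quad \text{for all } i,
    \]
    % i.e. the marginal reliability gained per unit of cost is equalized across components.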
OCZ Vertex drives are the cheapest, least reliable SSDs in recent memory; I can't imagine how you could ever justify those for database servers.
FWIW, I'm not advocating expensive drives, but drives that are known to fail in predictable ways are far better than the cheapest consumer SSDs. I put Intel 513's in RAID10 on the databases at my last company, with instructions to replace drives at 60% of their expected life.
Databases are important - for many people they're the heart and soul of a business, and recovering them can be very costly and especially time-consuming.
The average Joe keeps a computer for 4-5 years and then throws the machine away; you can't expect the same out of prod servers. Please, please, please: in future, when purchasing things for servers, check the failure rate, and if there is no real-world data then DO NOT BUY those things. Especially avoid the consumer market - it's cheap and cheerful for a reason.
> OCZ Vertex drives are the cheapest, least reliable SSDs in recent memory,
Vertex 3's are perfectly fine. In fact, none of our Vertex 3's have failed, which makes them one of our better-performing models. Vertex 2's are known to have high failure rates, and so are the 4's (we tried some 4's and won't again).
> I can't imagine how you could ever justify those for database servers.
You answered your own question. Because they were cheap, and failure of individual drives does not matter.
For any given drive, we assume it will die. For any given RAID array, we assume the entire RAID array will die. For any given server, we assume the server will regularly crash or die. For any given data centre, we assume the data centre will eventually lose power or burn to the ground.
When you start out with those assumptions - which reflect real-world risks any business should plan for - you then design your reliability around them:
Everything is in RAID arrays that can afford to lose at least one, and often two, drives. Everything is replicated, so if a RAID array or the server it is attached to dies, another server can take over. Everything is also replicated to a secondary location, so if a data centre loses power (it has happened to us - a suspected fire forced the data centre operator to shut everything down before the fire brigade could enter), we can decide how long to wait before we reroute (we don't do that automatically at the moment, though we could - we have moved traffic transparently between the data centres in some instances).
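To put a toy number on why a single drive failure stops mattering - the failure probabilities below are made up, and independence is assumed (optimistic), so treat it as illustrative only:

    # Toy model: chance of losing data in some window, given redundant copies.
    # All probabilities are made up, and independence is assumed (optimistic).
    p_drive_fails = 0.05            # hypothetical chance one drive dies in the window

    # A RAID10 mirror pair loses data only if both drives in the pair fail together.
    p_mirror_pair_lost = p_drive_fails ** 2

    # With a replicated copy on another server built the same way, every copy
    # has to be lost in the same window before any data is actually gone.
    p_all_copies_lost = p_mirror_pair_lost ** 2

    print(f"one drive: {p_drive_fails:.1%}, "
          f"mirror pair: {p_mirror_pair_lost:.2%}, "
          f"pair + replica: {p_all_copies_lost:.4%}")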
To me, if you worry about data loss from failing drives, then your system is designed wrong.
If a system is designed for resilience, drive reliability becomes a purely economic question: how much it costs us to expend the effort on RMA'ing drives and sending someone down to replace them vs. the price difference between drives. In that calculation, the Vertex 3's do just fine. We're not buying more OCZ at the moment, but we'll see what happens under the new ownership - who knows.
> Databases are important - for many people they're the heart and soul of a business, and recovering them can be very costly and especially time-consuming.
If database recovery is costly, then in most cases someone is not doing their job. Few businesses have data big enough to justify not having database-level replication, regular snapshots, and nightly backups. For every database, we have about 4-5 copies newer than 24 hours old, at a minimum (master, slave, <1 hour old replica of the whole container the master runs in, <1 hour old replica of the whole slave container, and the newest backup image), as well as older snapshots. For some databases we have more copies than that. It costs us peanuts compared to what it costs us to serve up the live versions of the sites those databases are for.
> The average Joe keeps a computer for 4-5 years and then throws the machine away; you can't expect the same out of prod servers.
Yes, you can. For us, if we keep our production servers more than about 3 years, we lose money: as long as we're growing, we can either take more rack space or rotate out our oldest servers and replace them with servers that have many times the capacity in the same space.
Our current oldest-generation servers handle less than 20% of the capacity per 1U of rack space that our newest generation does. With the cost of taking an extra rack what it is, it's an absolute no-brainer for us to throw out those servers and replace them with new ones on roughly a 3-year cycle. Sometimes, if our growth is slower, we'll leave it a bit longer, until we need the space, but 5 years is pretty much the upper limit.
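The shape of that calculation, with entirely made-up numbers (rack cost, server price, and density gain below are all hypothetical):

    # Toy comparison: keep old servers (and take more rack space) vs. replace them.
    # Every figure below is made up; only the shape of the trade-off matters.
    monthly_rack_cost = 1500        # hypothetical colo cost per rack (space/power/cooling)
    new_server_cost = 6000          # hypothetical per-server price
    servers_per_rack = 30
    amortisation_months = 36        # ~3 year replacement cycle
    density_gain = 5                # one new server handles ~5x the load per 1U

    # Same workload on old hardware needs ~5x the rack footprint.
    old_hosting_cost = density_gain * monthly_rack_cost
    # New hardware: one rack plus the amortised hardware spend.
    new_hosting_cost = monthly_rack_cost + servers_per_rack * new_server_cost / amortisation_months

    print(old_hosting_cost, new_hosting_cost)  # 7500 vs 6500 per month under these assumptions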
For businesses with an entirely static, and small, workload, sure, you may prefer to keep the servers for longer, and those have the option of buying more expensive drives, or deal with more failures over the lifetime of their server.
> Please, please, please: in future, when purchasing things for servers, check the failure rate, and if there is no real-world data then DO NOT BUY those things. Especially avoid the consumer market - it's cheap and cheerful for a reason.
Don't assume we don't check. And no, I most definitely will not avoid the consumer market. On the contrary. Enterprise components are sometimes worth it, but often they are priced and designed for people who are terrified of component failures or don't want the "hassle". When you have a system where component failure is an assumed "everyday" event, the consumer versions often have a far lower total cost of ownership.