> During this quarter 4 (four) drive models, from 3 (three) manufacturers, had 0 (zero) drive failures. None of the Toshiba 4TB and Seagate 16TB drives failed in Q1, but both drives had less than 10,000 drive days during the quarter. As a consequence, the AFR can range widely from a small change in drive failures. For example, if just one Seagate 16TB drive had failed, the AFR would be 7.25% for the quarter. Similarly, the Toshiba 4TB drive AFR would be 4.05% with just one failure in the quarter.
Backblaze should consider publishing interval estimates alongside the point estimates; that would help readers understand how uncertain each point estimate really is.
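To make that concrete, here is a minimal sketch of a point estimate plus an exact Poisson confidence interval for AFR. It assumes AFR = failures / drive-years × 100 and treats failures as Poisson-distributed; the drive-day figure is back-solved from the quoted 7.25% AFR and is purely illustrative, not a Backblaze-published number:

```python
from scipy.stats import chi2

def afr_interval(failures: int, drive_days: float, conf: float = 0.95):
    """Point estimate and exact Poisson confidence interval for the
    annualized failure rate, in percent."""
    drive_years = drive_days / 365
    point = 100 * failures / drive_years
    alpha = 1 - conf
    # Exact (Garwood) Poisson bounds via the chi-squared distribution.
    lower = chi2.ppf(alpha / 2, 2 * failures) / 2 if failures else 0.0
    upper = chi2.ppf(1 - alpha / 2, 2 * (failures + 1)) / 2
    return point, 100 * lower / drive_years, 100 * upper / drive_years

# One hypothetical failure in ~5,034 drive days reproduces the quoted
# 7.25% quarterly AFR, but the interval around it is enormous.
print(afr_interval(1, 5034))
```

With so few drive days, the 95% interval spans well over an order of magnitude, which is exactly the uncertainty the point estimate hides.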
This data is significant and appreciated. Thank you so much, and please keep up the good work.
It certainly is useful and interesting.
> I'd guess this is an effective piece of marketing.
Yes, it really has been good to us. :-)
It is data we would collect internally for our own decision making and tracking even if we didn't release it, and it is only a small amount of work for us (mainly for Andy, the author of that blog post) to format it every 3 months and write some observations down. Since it isn't what we "sell", it would have just gone to waste hidden away. And it results in our name getting "out there", and inevitably every quarter somebody asks "hey, what do these people do that requires 130,000 hard drives anyway?" and then we get a little bump in customers. Cost vs. benefit has been WELL worth it to us.
Along the way (unrelated to making money) it makes us happy that people get some use out of it. Like we find out somebody does their PhD thesis on the raw SMART numbers of the drives or something, we enjoy hearing about it.
Until my dying day I will never understand why Google Storage and Amazon S3 and Microsoft Azure don't release their drive failure statistics. I just don't get it. Those companies have SO MANY DRIVES and they employ people with PhDs in statistics, they MUST have the same info internally but at 100x the scale. I don't get why they don't release the numbers?! But hey, if they want to continue to give us this marketing gift of exclusivity, I'll take it. Maybe they like us and are just doing us a solid.
We had a lower failure rate across the board, for all components. I think the only protection that was in place was some blue tarps and a chain link fence.
I can't find a public reference, sorry.
Could be that at that scale they're dealing more directly with manufacturers, and they fear that sharing failure data would sour those relationships?
I mean "doing it well for an influencer and shafting everyone else" is far cheaper and easier than just making a solid and reliable product consistently. That's such old-fashioned thinking these days. :)
I don't know if they're as budget-conscious as they used to be, but Backblaze used to "shuck" drives from externals because they're cheaper.
If this is still the case, then their stats should be similar to hobbyists who are doing this too.
> While the shucked drives might not have had the warranties
We haven't shucked drives in a while. The main advantage came during the artificial price difference of the Thailand drive crisis, when a "raw drive" cost $400 but the same drive inside a $7 USB enclosure suddenly cost only $200, enclosure included. At the time, our business model literally wouldn't survive the cost increase; it was shuck drives or go out of business. :-) We assume the issue was that mostly "rich companies" bought "raw drives" and could afford the increase in price, but "poor consumers" bought drives in USB enclosures and couldn't.
We still watch the prices, though. You can think of it this way: if 3% of the drives fail but have no warranty because they were shucked, then we have to save more than 3% (plus the cost of the shucking process) to make it worth it. That rarely happens anymore, partly because as the volume of drives we purchase has gone up, we can get better bulk discounts than in the earliest days. Also, the process of "drive farming", where retail stores would limit sales to "2 per customer" or whatever, would get much harder at our current scale.
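The warranty trade-off described above can be written as a one-line break-even test. The $5 shucking-labor cost is a made-up placeholder, and the model simply assumes every expected failure of an unwarrantied drive is re-purchased out of pocket:

```python
def shucking_saves_money(raw_price: float, shucked_price: float,
                         shuck_cost: float, afr: float) -> bool:
    """Break-even sketch: a shucked drive forfeits its warranty, so the
    expected cost of replacing failures is added to its price."""
    effective = shucked_price * (1 + afr) + shuck_cost
    return effective < raw_price

# Thailand-crisis-era prices from the comment: $400 raw vs $200 shucked.
print(shucking_saves_money(400, 200, 5, 0.03))
```

At a 2x price gap the answer is obviously yes; at today's near-parity prices the small savings rarely clears the failure-plus-labor hurdle, which matches the "that rarely happens anymore" observation.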
Exactly, as we just saw with the Western Digital debacle.
The unfortunate reality is that only a fortune teller can say which hard drive is the best one to buy right now.
> Our failure rates were consistently much higher compared to Backblaze
> The overall workload on the clusters was extremely heavy
That is the most likely explanation. We see higher failure rates when we are writing to the drives, for example as a new vault fills with customer data. Backup (our oldest business) has a fairly easy workload; it is not as punishing as a database pummeling the drives.
In certain circumstances, when we are down more than one drive in a 20-drive Reed-Solomon group, we explicitly stop putting new data on that drive group until the parity is restored, because this lowers the chances of an additional drive failing in the group. That gives us more time to rebuild the parity with less stress in our lives. When that last parity drive fails and one more drive failure means customers lose data, the fun drains right out of this job. Red alerts are thrown, pagers go off in the middle of the night, and datacenter techs start driving toward the datacenter at 3am to replace drives. We prefer the world nice and calm and relaxed, with a good night's sleep.
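The gating rule described above reduces to a toy predicate. The one-failure threshold comes straight from the comment; the real system is surely more nuanced than this:

```python
def accept_new_data(failed_drives_in_group: int) -> bool:
    """Write-gating sketch for a 20-drive Reed-Solomon group: keep
    accepting new data with zero or one drive down, stop once more
    than one is down, and resume only after parity is rebuilt."""
    return failed_drives_in_group <= 1
```

The point of the rule is load-shedding: not writing to a degraded group reduces the stress on its surviving drives precisely when another failure would be most costly.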
/me goes and checks Backblaze's careers page...
(I suspect driving time to any of their datacenters from Sydney puts me out of the running... At least their 3am emergencies would be a much more reasonable 8pm emergency from here.)
A particular vertical-axis wind turbine project was destroyed by buildup of magnetic dust on the generator magnets.
I'm not saying that's what was killing your drives directly; HDDs are normally well sealed against dust. But the amount of dust you mention is awfully suspicious. It might change your investigative perspective a bit when you consider that some small percentage of all that dust is ferromagnetic.
I have a lot of old Hitachis that still work, but every last one of my Seagates died years ago. Yes, they've gotten better since then, but they're still reliably trounced by Hitachi.
Where are the failures coming from? What did they cheap out on? Has the C-suite decided it's not financially worth increasing reliability? What are the internal feelings on being the outlier, year in and year out, in these stats? Does anyone care?
From an interview last year, it sounds like they don't use SMR drives. I would be interested if there is any good source for failure rates of those.
It's widely deployed on Android, if that counts.
If it weren't for helium-sealed tech packing more platters into HDDs, we would have had zero capacity improvement in the past 4-5 years.
So, while the idea was ridiculed in the press, the advent of SSDs and today's use case for hard drives means it actually makes a lot of sense, given that the area (and therefore the capacity) increases with the square of the platter diameter. Combine that with slower spin rates also increasing the density, and you get a device far more generally useful for bulk storage than SMR.
Bottom line: assuming 20 TB 3.5" drives, you have doubled the front capacity of a 3U case.
OTOH, if you really wanted to play games consider how much capacity could be stored on a 19" platter like the old IBM mainframe disk units.
... or use 4U so you can use the half inch clearance to try and help cool them enough...
NAND prices fluctuate up and down and the current lows are not very sustainable.
It seems the days of computing getting cheaper every two or three years are gone.
What I really wish is that they would make an effort to go beyond just reporting their experiences and treat what they are doing as also providing a service in the form of model-reliability data. AKA toss a few pods of WDs/etc. in there even if they are slightly more expensive/whatever. If the data were broken out by production location and manufacture date, it's likely something they could sell on the open market for a small fee. I know I would have gotten the company I worked for to pay such a fee for a somewhat scientific look at the failure rates of certain models/etc.
AKA, pay a bit more for a broader set of drives, summarize the data for free, and then get people to pay for the detailed data. If you work for a company buying a thousand or so drives a year, avoiding a 10% AFR is going to be worth a lot of money. AKA too small to have a good view of the state of things, but big enough that buying 1k bad drives and fighting daily RAID rebuilds for the next three years is a real nightmare. Think of it as a bit of insurance, or at least validation of a problem when things start to go south.
> AKA toss a few pods of WD's/etc in there even if they are slightly more expensive/whatever.
If you see "low numbers" of one particular drive model in the stats, that is usually us trying out that drive model, either because at some point in the future the price might drop enough to make purchasing it in bulk worth it, or because the price is ALREADY worth it but we're being careful in the rollout to make sure that drive model performs in our particular application - nobody else's application, just ours.
> make an effort to go beyond just reporting their experiences
We are VERY careful to report what we are seeing in our datacenter for our particular application, and no more. This isn't a scientific study, we aren't Consumer Reports, we're just publishing data we would collect whether or not we released it. We have a core business we are super happy focusing on; somebody else is WELCOME to sell drive testing and drive predictions, and we won't compete with them or get in their way. Heck, we would totally subscribe to that service!
I'm wondering why they keep maintaining two product lines that are sufficiently different to result in one basically being consistently the best, the other being consistently the worst, in terms of reliability.
For the same features, I find B2 to be way simpler and cheaper.
Edit -> what are the features you're missing most? We're collecting feedback at: firstname.lastname@example.org if you want to send some notes over!
But perhaps the boot drives are just leftovers that are too small to be used any more for storage?
*Edit -> asked around, here's what's up -> we do currently stream some logs/data off the device, but we also like it to be written to disk - there's something about having multiple copies that we like ;-)
With an attitude like that, you should consider running a backup as a service company! ;-)
> though they're less reliable, it's cheaper to buy more of them
Exactly. Each month our buyers go out and get bids for more drives. The cost is input into a little spreadsheet, and the SPREADSHEET tells us which drive to buy. It isn't about picking the most reliable drive, we have a software layer and redundancy for that! It is about picking the least expensive drive.
The spreadsheet isn't complicated. If drive model X fails 1% less but costs 2% more, then we don't buy it that month. It might change the next month. If you want to win our business, just look at the failure rates and underbid the competitors by the failure rate of your drive.
There are some other things in the spreadsheet I should mention. If a drive is twice the density (say a 16 TByte drive vs an 8 TByte drive), it is still the same physical size, and we pay for physical space rental in datacenters. Another thing is that a drive that is twice as dense uses approximately the same amount of electricity, and power is an ENORMOUS part of our overall cost of operation. So the spreadsheet will choose a denser drive even if it is slightly more expensive per TByte, just because of the other savings it implies. Again, the spreadsheet isn't complicated, but the spreadsheet tells us which drive to buy.
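A toy version of that spreadsheet might fold purchase price, expected failures, power, and rack-slot rent into a single $/TB/year figure. Every constant below is a placeholder for illustration, not a Backblaze number:

```python
def cost_per_tb_year(price: float, tb: float, watts: float, years: int = 5,
                     power_cost_kwh: float = 0.10,
                     slot_cost_year: float = 20.0, afr: float = 0.01) -> float:
    """Illustrative drive-buying model: amortized purchase price
    (inflated by expected failures, assuming no warranty recovery),
    electricity, and rack-slot rent, per usable TB per year."""
    purchase = price * (1 + afr * years) / years
    power = watts / 1000 * 24 * 365 * power_cost_kwh  # kWh/year * $/kWh
    return (purchase + power + slot_cost_year) / tb

# A denser drive can win even at a higher sticker price per TB,
# because the slot rent and most of the power are per-drive costs.
eight_tb   = cost_per_tb_year(price=160, tb=8,  watts=7)
sixteen_tb = cost_per_tb_year(price=340, tb=16, watts=8)
```

With these made-up inputs the 16 TB drive costs more per TB to buy (about $21 vs $20) yet still comes out cheaper per TB-year, which is exactly the effect described above.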
If you are only going to purchase one drive, and you aren't going to back it up, you should sort by reliability. You are also insane -> always back up your drives!! And as long as your drive is backed up and you trust the backup, who cares if the failure rate is 1% or 2%?
I would choose to pay a premium to reduce the chance of me needing to restore from backup as often. Not a big premium, but one that makes sense at the one or two drive purchase scale that might make a whole lot less sense at a one or two thousand drive purchase.
I'd pay an extra $20 or $30 on a ~$300 drive to drop from a 2% to 1% failure rate... (As it turns out, for home I pretty much buy drives in pairs and mirror them for all my not-in-a-laptop storage - so I guess I pay a 100% premium for reliability...)
(I realize in practice you probably do, but in the explanation here you said "just look at the failure rates and under bid the competitors by the failure rate of your drive")
I tried to connect with s3cmd but it wasn't working. Unfortunately, their support was not very helpful either...
Though the cross-over point is still some time away, a decade or so?
It just seems that the spinning-rust guys have to jump through some pretty extreme hoops to get density up, while the flash guys "just" have to reduce cost, which gives them more angles of attack (i.e., either further density increases or process optimization).
Backblaze deploys their storage as storage pods (a server holding drives) in their vault architecture (many pods form a vault, with data sharded across pods in a vault).
So you would want to look at $/B of a vault. A vault includes the rest of the server hardware. The cost of the vault includes not just the server hardware, but also floor space in a data center and electricity to run the hardware. Flash storage is denser than HDDs and more power efficient, so you should expect a monetary breakeven at some point before SSDs are priced at similar $/B as HDDs.
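One hedged way to quantify that breakeven: compute the per-TB opex savings of flash (power plus floor space) over a drive's service life, which is the price premium per TB that flash could carry and still tie on total vault cost. All inputs here are invented for illustration:

```python
def allowable_flash_premium_per_tb(hdd_watts: float = 7.0,
                                   ssd_watts: float = 3.0,
                                   hdd_tb: float = 16, ssd_tb: float = 32,
                                   slot_cost_year: float = 20.0,
                                   power_cost_kwh: float = 0.10,
                                   years: int = 5) -> float:
    """Lifetime opex (power + rack-slot rent) per TB for an HDD minus
    the same for an SSD: the $/TB purchase premium at which flash
    breaks even on total cost. All constants are placeholders."""
    def opex_per_tb(watts: float, tb: float) -> float:
        power = watts / 1000 * 24 * 365 * power_cost_kwh * years
        return (power + slot_cost_year * years) / tb
    return opex_per_tb(hdd_watts, hdd_tb) - opex_per_tb(ssd_watts, ssd_tb)
```

Because flash packs more TB per slot and draws less power, the result is positive: SSDs can break even before reaching $/TB parity with HDDs, which is the point the comment makes.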
It's reasonably well understood how to plug 45 hard drives into a server; I wonder if you can buy hardware to drive 45 M.2 2280 SSDs in one server/motherboard as easily?
Seagate's drive longevity has improved a lot. Compare today's report to Backblaze's 2015Q1 report: