180TB of Good Vibrations – Storage Pod 3.0 (backblaze.com)
207 points by BryantD 1427 days ago | hide | past | web | 88 comments | favorite

I think it's worth (re)reading this comment on the original Storage Pod from several years ago:


I essentially have all of the same questions for Storage Pod 3.0 -- and in particular, what does the software stack look like? (This config is absolutely begging for ZFS, but I have a haunting feeling that something janky is afoot upstack.) I would also be curious as to the specific nature of failures that have been seen with the deployed architecture. Have the concerns from three years ago proven to be alarmist or prescient?

That said: I think it's very valuable to get configs like this out there for public discussion -- and I think it might be inspiring us (Joyent) to similarly publicly discuss our own high density storage config...

Disclaimer: I work at Backblaze. I'm not technically on the server team, but here is what I understand: we have 450-ish Backblaze pods (each with 45 hard drives) deployed in the datacenter. We are JUST NOW starting to see some old age mortality (increased failures) of the drives we deployed about 4 and a half years ago. We're really happy with the longevity, it exceeded everything we were told to expect.

We group the drives into 15-drive RAID6 groups, with 13 data drives and 2 parity drives. This means we can lose 2 drives and not lose any data in that particular RAID6 group. We use the built-in Linux "mdadm" tools to do this.
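
Back-of-the-envelope, the layout described above works out as follows (a quick illustrative sketch; the 4 TB drive size is assumed from the article):

```python
# Usable capacity for one 45-drive pod with the RAID6 layout above.
DRIVES_PER_POD = 45
GROUP_SIZE = 15          # drives per RAID6 group
PARITY_PER_GROUP = 2     # RAID6 tolerates 2 failed drives per group
DRIVE_TB = 4             # assuming the 4 TB drives from the article

groups = DRIVES_PER_POD // GROUP_SIZE                   # 3 groups
data_drives = groups * (GROUP_SIZE - PARITY_PER_GROUP)  # 39 data drives
raw_tb = DRIVES_PER_POD * DRIVE_TB                      # 180 TB raw
usable_tb = data_drives * DRIVE_TB                      # 156 TB usable

print(f"{raw_tb} TB raw -> {usable_tb} TB usable "
      f"({usable_tb / raw_tb:.0%} efficiency)")
```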

The network interface to a pod is HTTPS talking to Tomcat (a Java web server). Java writes the data to disk (ext4 on top of the above RAID6). Our application (backup) is very specific and performance-forgiving: essentially we write data once, then re-read it once every few weeks and recalculate the SHA-1 checksums on the files to make sure the data is completely, totally intact and a bit hasn't been flipped somewhere.

One of the "luxurious" parts of working at Backblaze is we own BOTH the client and the server. On a customer's laptop, the client pre-digests the data, breaks it up into chunks that make sense (more than 5 MBytes and less than 30 MBytes) and then the client compresses it if appropriate (we don't compress movies or audio because it would be silly wasted effort) and the client encrypts the data, then sends it through HTTPS to our datacenter. Because the client computer is supplied by customers, all their CPU cycles are "free" to us. We can conveniently break up files, encrypt them, deduplicate (within that client) all without spending any CPU cycles at Backblaze because it is done on the customer's laptop before being sent.
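
A rough, hypothetical sketch of that client-side pipeline -- chunking, compressing only when it helps, and checksumming before upload -- might look like this (encryption and the HTTPS upload are omitted; the chunk size and names are illustrative, not Backblaze's actual client):

```python
# Split a file into chunks, compress each chunk only when it actually
# shrinks (already-compressed media gains nothing), and SHA-1 it.
import hashlib
import zlib

CHUNK_SIZE = 10 * 1024 * 1024  # within the 5-30 MB range mentioned above

def prepare_chunks(data: bytes, compressible: bool = True):
    """Yield (sha1_hex, payload) pairs ready for upload."""
    for i in range(0, len(data), CHUNK_SIZE):
        chunk = data[i:i + CHUNK_SIZE]
        if compressible:
            packed = zlib.compress(chunk)
            # Keep compression only if it actually shrank the chunk.
            if len(packed) < len(chunk):
                chunk = packed
        yield hashlib.sha1(chunk).hexdigest(), chunk
```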

Again, the Backblaze storage pods really aren't the correct solution for all "off the shelf" type IT projects. For example, it won't meet the performance needs of many applications. But it does work exceptionally reliably in our experience as a backup solution when you have one or two programmers to help implement a custom software layer in Java.

Wow, thanks for the explanation! I would love to learn more about the software you guys use!

One specific question, how do you know if the checksum is correct? Do you keep a database of checksums stored on a specific pod? And if the checksum is not correct, do you have other copies on other pods?

All those criticisms look pretty irrelevant in Backblaze's scale-out cold storage use case. Their performance requirements are essentially zero. They already handle full pod failures, so a partial failure is no worse as long as they can recognize it and fail the whole pod.

Having said that, it is interesting to compare Backblaze's design to Facebook's Open Vault. Facebook's design has much better mechanical serviceability, redundant power supplies, and redundant SAS expanders. Facebook believes that they can afford such features even for cold storage, so why is Backblaze cutting even more corners?

Disclaimer: I work for Backblaze. The answer to almost every question put to us is "total cost of ownership". Facebook stores a lot less data per customer than Backblaze and makes more per customer than we charge. Facebook can afford to waste more money than we can, so they do.

In my humble opinion, there is A LOT of wasted money in making datacenter machines easy to service. We create a little spreadsheet (it isn't rocket science) which includes the employee salaries taking into account different designs and how long it takes to open up and service a pod. For example, to access the Backblaze Pod's hard drives you must pull the pod out like a drawer and open the top. Many servers you can access all the hard drives from the front, without moving the server. Accessing drives from the front is much faster -> but you lose 2/3 of the density!! We pay $600 per month for a cabinet, just for the physical space. We can stack 8 pods in that cabinet, or about $75 per pod per month in space rental. So buying servers that are 1/3 as "dense" but easier to access drives from the front costs Backblaze $50/pod/month or $600/pod/year. We open a particular pod and replace a failed drive or fix some other problem maybe once every 6 months-> so you are paying $300 PER INCIDENT if you have front mounted drives that save the technician maybe 10 minutes of time. The math simply doesn't make any sense at scale.
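
Redoing the arithmetic in that comment with its own stated figures:

```python
# TCO arithmetic from the comment above (all numbers are theirs).
cabinet_per_month = 600      # $ per cabinet of rack space
pods_per_cabinet = 8
space_per_pod = cabinet_per_month / pods_per_cabinet  # $75/pod/month

extra_space_per_pod = 50     # stated extra $/pod/month for front-access designs
incidents_per_year = 2       # one service visit roughly every 6 months

cost_per_incident = extra_space_per_pod * 12 / incidents_per_year
print(f"space: ${space_per_pod:.0f}/pod/month, "
      f"front-access premium: ${cost_per_incident:.0f} per service incident")
```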

I think a lot of the inefficiency is because the datacenter employees recommend these "nice to haves" to their managers to make the datacenter employees lives easier. But their company pays dearly for not doing the math. This isn't important if your company is massively profitable or if it isn't a "datacenter centric" company like Backblaze. But for us, it's the difference between life and death. Remember, we only charge $5/month and we don't have any deep pockets so we have to stay profitable -> there isn't a lot of margin in there to be wasteful.

I totally agree with you in principle. But you're being too generous with your density advantage as compared to hotswap. I can't tell if your pods are 4U or 5U, but your motherboard vendor sells a 4U front & back chassis that has 36 3.5" SAS trays.

Obviously that setup will cost more than yours, but I'd be surprised if the delta was more than $1k once you subbed in your consumer motherboard, controllers, and backplanes.

Thanks for the great info posted here!

That said, one small nitpick: I'm fairly sure Facebook does less revenue/user than you do: They have a billion users, and did (order of magnitude) 5 billion last year: That's $5/user/year. Which you do per month.

Still, it's amazing you can provide this unlimited service for that small fee!

Do you replace a drive as soon as failure occurs? Or do you wait until X number of drives in a chassis go bad to make it worth the time for the tech to pull the chassis out of the rack?

I work for Backblaze -> typically we replace a drive once it fails, but not "as soon as it fails". Because the storage pods go into "read only" mode when one drive goes down, we have some time before we need to take action; sometimes it can be a few days before the drive is replaced. All incoming data is rerouted to a different pod, but the data that was on the pod is still readable and available for restore.

How do you roll up your old pods? By this I mean, do you still have all of your 1.0 pods running, or do you start migrating them to 3.0 pods, and cannibalizing the older pods for parts, until full failure/replacement, or...?

We have migrated the data from smaller pods to larger pods, then re-enabled backups on the new half filled pod to fill it to the brim. But we did not do this because the pod was necessarily old, we do this if a pod is unstable for some reason (usually it is a brand of drive we ended up not liking).

We have done exactly what you mention -> migrated off a pod, then disassembled the pod and used the older drives as replacements. Hard drives often come with a 3-year warranty, so for the first 3 years it is free to get replacements from the manufacturer if they fail. But after 3 years we have to pay out of our own pocket to replace the drives, which can change the cost/math a little.

Disclaimer: I work for Backblaze. I also liked what James Hamilton (VP in Amazon Web Services) had to say: http://perspectives.mvdirona.com/2009/09/03/SuccessfullyChal...

I wonder if it is still too soon to start seeing the worst failure cases given the age of the drives.

I am curious why RAID6 is (still) used given its poor behavior in failure conditions. I am guessing that Backblaze is not as worried about additional drive failure in an array during a rebuild because they are periodically verifying each drive.

Comparing to an X4540 is kind of a stretch. The use cases seem very different. Also, for Backblaze, they wouldn't exist at the price they charge with the price that Sun charged for an X4540.

I hope Joyent does release something about their config, that would be very interesting.

(Yev w/ Backblaze here) -> We have drives that have remained online since the first versions of the storage pods, but we've just recently started collecting drive failure rates. Typically we see that if a drive can stay up for the first two weeks, it will live happily in the pod for a good long while.

I'd love to see more discussions of high density storage config. Perspective is good.

I think it should use ECC memory and an i3 CPU that supports it. Random memory bit flips are going to corrupt data at a steady pace.

Intel i3 processors that support ECC: http://ark.intel.com/search/advanced/?s=t&FamilyText=2nd...

Also, it'd be interesting to hear why Backblaze doesn't use SuperMicro SAS-boards instead with a SAS-expander, like HP SAS Expander.

Oh, about the "random memory flips" -> in our particular application, the client running on a customer's laptop encrypts the data then calculates a SHA-1 checksum THEN transmits the file through HTTPS to the pods. The pods write it to disk with the checksum there. Once every couple of weeks we re-read the file and re-calculate the SHA-1 checksums. If there was ever a problem, we would detect it. These turn out to be VERY rare, but they do happen where a file is fine for many years then a bit is flipped "on disk" (we don't think they are in the RAM, but it doesn't matter, it is an "end-to-end" check). We assume this is happening in consumer systems also, but at the rates we see it would be undetectable in consumer's worlds (1 bit per customer lifetime - it would probably create a tiny mis-spelling in a MS Word document, or maybe one pixel would be wrong in one JPEG).

Or 1 bit flip could corrupt an entire 128-bit block of AES-encrypted data. OpenSSL would complain when trying to decipher the file, giving a "bad magic number" error.

BTW, keep up the great work guys!

Disclaimer: I work at Backblaze. The answer to pretty much any question is "sort by price". :-) The SAS expanders are just a tiny bit more expensive, or at least they were the last time we checked. We were worried early on that many other designs seemed to prefer the expanders vs the port multipliers, but in all these years, across over 450 pods, we've never seen the port multipliers give us any problems. Maybe they aren't as fast as the SAS expanders? But that isn't our current bottleneck, so it wouldn't help us at all.

More worried about the lack of ECC. Have you done tests on (normally) undetected errors?

The serviceability of those things doesn't look pleasant. I used to work in the storage industry and got to play with (what is now) NetApp high-density setups [1]. 60 drives in a 4U setup, compared to 45 drives in a Storage Pod in the same 4U. But I'm guessing the cost is where the Storage Pod really wins out. NetApp gear, even as a JBOD, is really expensive.

The NetApp box has the same type of padding for all the drives, but they are much easier to access (pull-out trays are stable and easy to use).

Fun issues I saw with the NetApp box (at least 3 years ago): fully loaded with drives, it went over the weight limit that FedEx or UPS would ship with standard shipping. It required freight shipping to ship a single, fully-loaded E5400.

[1] http://www.netapp.com/us/products/storage-systems/e5400/

Storage hardware startup here. We can do 72 drives in 4U for standard racks and 120 drives in 4U for deep racks, but it is difficult to service, so we're only pushing it to early adopters. However, high-density serviceability can be addressed. Our next system should have the same drive counts in 5U, and be much easier to swap drives in. Not much can be done about the shipping problem, ha ha.

And what hardware startup might this be? I'm curious ;-)

Sup Yev! Evtron. We're in St. Louis. We have a call with Gleb on Wednesday. Love to chat some time: ivicars@evtron.com / @israelvicars.

Hah, well I'll leave you guys to it ;-)

I'm working on something more similar to the Backblaze Storage Pod (in fact, originally based from their 2.0 design) for my employer. We played around with the hardware from their set up and decided to go a different route. It looks like we'll be getting 48 drives in 4U, but I'm sure we'll have some teething issues as we start to get our first few boxes up & running. 72 drives in 4U sounds like a lot of fun- I'm guessing SuperMicro SAS backplanes. Is there room in the box to make it self-sufficient or is it JBOD?

"Our monthly cost for a full rack of Storage Pods with 3 TB drives is $0.63 per TB, while a full rack of Storage Pods with 4 TB drives is $0.47 per TB. "

Interesting coincidence: 180TB of Google Storage comes out at $8560 per month, which is $47.55 per TB. Almost exactly 100x.
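
A quick check of that comparison, using the figures quoted in the thread:

```python
# Compare Google Storage's quoted monthly price per TB against
# Backblaze's quoted pod cost per TB (4 TB drive configuration).
google_monthly = 8560    # $ per month for 180 TB on Google Storage
tb = 180
pod_per_tb = 0.47        # Backblaze's quoted $/TB/month with 4 TB drives

google_per_tb = google_monthly / tb   # ~$47.56/TB
ratio = google_per_tb / pod_per_tb    # ~101x
print(f"${google_per_tb:.2f}/TB vs ${pod_per_tb}/TB -> {ratio:.0f}x")
```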


And that does not include the network cost.

Google Storage (or Amazon S3) have multi-rack and even multi-room redundancy that the Backblaze setup does not at the quoted price. And that doesn't account for the costs of managing and administering those machines, nor replacing dead disks and machines when they do fail.

I expect that you'll generally find that Google or AWS more pricey than a DIY-setup, partly because of hidden costs and partly because they are actually trying to make money off of their services, even if the margins aren't outrageous.

Well either they're making 99% gross margin, or perhaps they need a few Backblaze Storage Pods.

The very high gross margin is more likely - they're charging as much as they can, not as little as they can.

Unrelated, from their FAQ:

"Look, I'm an Advanced User, and I Already Have a Set of RAID Drives with Perl Scripts to Copy My Files Back and Forth Between My 18 Home Machines that are in a Datacenter I've Built in My Closet. Why Do I Need Backblaze?"

Made me smile, probably because I have more machines than m^2 in my apartment...

And yet they don't have a *nix client...

I love a good storage story. It's interesting that they still put them behind a couple of gigabit network ports. Using the native network (2 x 1GbE), it would take more than a week at full throughput to get a full load off of or onto a pod.

I had an ultimately unrewarding conversation with Sean Quinlan (of Google GFS fame) about the futility of putting a lot of storage behind such a small channel (in Google's case the numbers were epically Google, of course, but the argument was the same). You waste all of the spindles because the operation rate (requests coming into the channel) vs. the amount of data ops needed to satisfy each request basically leaves your disks waiting around for the next request to come in from the network. (BTW, that allows you to make a nearly perfect emission-rate scheduler for disk arms, but that is another story.)

What this means is that petabyte pods are going to be nearly useless, although with an external index they can be dense.

I could see it being a problem for Google, but Backblaze wants these for archival purposes, not something where there is going to be a lot of reading and writing. The write rate is going to be whatever speed their users upload stuff, divided by the total number of their storage pods. I assume this is relatively small. The read rate is going to be whatever speed their users download restores, divided by the total number of storage pods, which is probably much smaller still.

The assumption here is that data is kept for a long time relative to how frequently it's written and read, so the IO speed probably isn't that big of a deal.

No. As you said port speed doesn't matter for data at rest. What matters is ingest/exfil of data due to "exceptional" conditions. Prime cases are cluster/mirror failure. Remirroring existing data to another pod is port limited, as is ingest for pods that are remirror targets.

Is there any reason the sources and targets couldn't both be thoroughly distributed throughout the cluster? Nothing says hard drives have to be perfectly replicated, you just need multiple copies of the data. I'm imagining that a HD dies, and the extra copies of what it contained are scattered all over. You re-replicate them by scattering them further all over. No one pod has to move any substantial amount of data.

Sure. You can absolutely replicate chunks. But you start kicking the problem upstream. A rack down is a couple PB, so you start doing a ton of cross-rack transfers to get your replica counts back up. Now you're gated on NIC/ToR/agg switch throughput. A DC down and you're gated on NICs, ToRs, aggs & the intra-DC network. And this keeps adding up $$$ the further you get.

MS had an interesting paper on data locality in storage last year. Can't recall the title offhand though.

I agree, as big tape drives these things are pretty cool.

I am curious where they purchase 4TB disks for $195. That is less than half of most places. That's quite a price break, even if buying in bulk.

Seagate 4TB external drive is selling for $180 right now: http://www.amazon.com/Seagate-Backup-Desktop-External-STCA40...

Yes, probably not same speed specs, but it's external, which means it usually costs more for the convenience of the housing.

Edit: Unless my math is wrong, the 3TB model is a better price point...$40 for a TB as opposed to $44+/TB for the 4TB version. http://www.amazon.com/Seagate-Backup-Desktop-External-STCA30...
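
The per-TB arithmetic above checks out (approximate list prices from the thread; the 3 TB price is inferred from the quoted $40/TB):

```python
# Price-per-TB comparison for the two external drive models discussed.
prices = {"4TB": 180, "3TB": 120}  # $; 3TB inferred from "$40 for a TB"
sizes_tb = {"4TB": 4, "3TB": 3}

for model, price in prices.items():
    print(f"{model}: ${price / sizes_tb[model]:.2f}/TB")
```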

Sadly, the ones in external cases are cheaper these days, since they're targeted differently. ;( Take a look at http://blog.backblaze.com/2012/10/09/backblaze_drive_farming...

If you're not afraid of shucking cases, there are a number of 4TB drives you can get at that price point:




Yev from Backblaze here -> We're not above liberating drives from their enclosures: http://blog.backblaze.com/2012/10/09/backblaze_drive_farming...

Husking external drives seems like a great way to get a deal.

I've done it myself for my home servers. It tends to get strange looks; I can only imagine what buying at the scale Backblaze did caused for reactions :p

They actually had people buying external hard drives for them last year[1], so they're probably just removing the drives from the housings and then recycling the casing.

[1] http://news.cnet.com/8301-11386_3-57556147-76/startup-paid-b...

Backblaze do say the price per TB for 3TB drives is better, but obviously you can squeeze more TB per 4U case when you use 4TB drives.

That cuts down on the power usage, cooling, and rack space that you need to pay for.

Disclaimer: I work at Backblaze. We make the decision to switch to the more dense drives when the "break even" point is about 1 year of operation. We have a little spreadsheet (it isn't rocket science) of how much electricity a drive uses, what we are paying for the physical space rental, etc. We plug in the prices; if the 4 TB drives pay for the overhead within 1 year, we go ahead and buy those.

The drives seem to last about 5 years in our experience, so technically we should be able to buy the more dense drive if it pays back in 4 years, but cash to run the business is very dear to our hearts, so we don't like going out much more than a year on a payback.
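
The payback rule described in the two comments above could be sketched like this (the function name and sample numbers are hypothetical, not Backblaze's actual spreadsheet):

```python
# "Break even within a year" rule: buy the denser drive when its price
# premium is repaid by saved overhead (space, power) within 12 months.
def payback_months(extra_drive_cost, monthly_overhead_saved):
    """Months until the denser drive's premium is repaid; buy when <= 12."""
    return extra_drive_cost / monthly_overhead_saved

# e.g. a hypothetical $60 premium repaid at $6/month of saved overhead:
months = payback_months(60, 6)
print(f"{months:.0f} months -> {'buy' if months <= 12 else 'wait'}")
```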

You can get them at Costco now for $180. It kinda blew my mind when I was there last weekend.

Costco has them for $20 off right now, so $159.

Yea, if you see my above link, Costco was a big help to us. They have a lot of good deals.

My local Costco has them at $139/ea for the 4TB externals right now. I wonder if they have some sort of dynamic/regional pricing that causes the price to go up in the Bay Area.

And I wonder if that happened automatically in their yield-management software, perhaps triggered by sales data showing that Bay Area stores got completely cleaned out on a regular basis? :-)

That may be the case. It certainly would make sense to me since I imagine demand for denser storage is higher in Silicon Valley.

It's crazy that hard drive prices are still higher than they were two years ago because of the flooding...

It's not THAT crazy; the drive manufacturers saw demand stay constant while supply dwindled, and their response was to raise prices across the board. They really have no need to drop them back to where they had been before the flooding. As soon as the first major producer drops prices back to normal levels, the others will follow suit, but no one wants to be the first, since people are still buying them at inflated prices.

I understand the law of supply and demand, I am merely expressing surprise that the effect has lasted for over two years. :)

Supply and demand.

Disclaimer: I work at a company that makes hard drives. Don't forget about their sensitivity to rotational vibration, most often induced by coupling from other drives. Maybe it's not important in this particular application, but it can be a performance killer. To avoid it, 'vibration absorption' is not always the name of the game :) This link has some good background info, and a reminder that there are actual hardware/firmware reasons why enterprise & near-line drives are more expensive than consumer drives: http://enterprise.media.seagate.com/2010/05/inside-it-storag...

It sounds like they've got enough redundancy that a single drive failure (or even an entire pod) isn't an issue - they use cheap drives and simply swap them out when they fail.

I wish Google's hard drive failure analysis [1] included failure rates and failure scenario statistics for different models, vendors and consumer/enterprise classes too.

[1] http://research.google.com/archive/disk_failures.pdf

Knowing Google, I'm pretty sure that their drive population was consumer-grade - I think they mentioned that in the paper. It was interesting that the paper lamented the usefulness of SMART in predicting failures, because one of the things that enterprise/nearline drives buy you is much much richer SMART diagnostics. I wonder what their results would have been with drives that had better diagnostic reporting. I'm also curious if folks that are building 'cold-storage' on the cheap have looked into using DVR drives, they might have some useful characteristics for this application.

I bought a used HP MDS 600 on eBay for less than $2000 shipped. It doesn't have a computer built in, but unlike the Backblaze pod it has four power supplies (two redundant), four fans, and two SAS interfaces. It also holds 70 drives. I've thought of building a Backblaze pod before, but if you want to meet your storage needs with a single enclosure, there are better solutions for the money. The Storage Pod is really for when you have enough of them to consider any single one redundant.

Obviously a one-off is always going to be sourceable from second-hand bits and pieces. But if you need to build these on a regular basis and you need to maintain them, then uniformity and slightly higher cost are not an issue.

Nice deal though!

Isn't the "use same drives" advice a little dangerous? If you have bad luck with firmware or a specific batch of drives, you could be put in a world of pain fast.

Yev w/ Backblaze here -> This is true, and we've had bad batches before, but all of the drives and chassis are tested before we put them into production, and having arrays with the same drives minimizes the variables throughout the pod.

Based on not much other than their blog postings, I really like the (for want of a better word) vibe of this company.

I wish they supported Linux.

If only there was a unix-oriented backup provider with bulk pricing lower than S3, 12 years of history, and a public, aggressive stance on privacy and civil liberties.

Oh, if only such a provider existed. (rsync.net)

Yeah, but your standard costs for not crazy huge amounts of data aren't as competitive as I've been able to find. We're storing on a competitor for about half what you guys charge and they provide rsync/ssh access.

I'm building a service on top of rsync.net (so far), and the only problem is the lack of an admin API to automate everything. On the plus side, the guys answer emails (faster and better than Azure and S3, IMHO).

I wouldn't say this in most places, but always, always email us about your data sizes and needs. There's usually a way to make the pricing work if you are both:

- technically competent
- have smaller space needs

Care to name names? Inquiring minds, and all that...

Strongspace at http://www.strongspace.com/

Interestingly they use ZFS on FreeBSD I believe.

I'm pretty sure that Crashplan+ is the nearly same price, and they definitely support Linux.

Eh? What makes you think it doesn't already?

The pod itself does but the BackBlaze client software for the BackBlaze service does not.

Sorry, yes, I was thinking about support in relation to the pod itself (the subject of this article) and not BackBlaze in general.

Their website?

"Backup Windows and Mac Backblaze backs up Windows XP 32-bit, Windows Vista 32-bit and 64-bit, Windows 7 32-bit and 64-bit, Windows 8 and Intel-based Mac OSX 10.5 and newer."

They state they don't in the FAQ. What makes you think they do ?

I'm interested in maybe having a couple of those boxes for video data, to keep it online for editing bays. What would be the best solution to backup that amount of data - redundant boxes?

Yes, redundant boxes would probably be the best bet for data of that size. Remember though, RAID is NOT a backup. When possible, make sure to send copies to multiple places!

I wonder when the Hitachi 7K4000 4TB 7200rpm drives will hit $150 in quantity 100 or less. I think they were $180 on Black Friday 2012, and $210 recently.

whoa. 2.5" (boot) drives more reliable than 3.5". why?!

Rotational velocity for one.

I surmise that 2.5" drives are less affected by vibration. I'm surprised they don't use an SD card as the boot drive. I run our ESX servers from them.

I wonder why they use a boot-drive at all instead of netbooting these things (or use a pair of USB-sticks).

Disclaimer: I work at Backblaze. We use the boot drive just because it's a tiny bit "cleaner" to boot off a separate drive (and configure swap on that drive) and have all the "data" alone on the 45 drive array. It allows you to boot and THEN reassemble a failed raid array that is on other drives, stuff like that. You could do it off of USB sticks, but I'm not sure about how well that would work with swapping? Maybe it would be fine? In the end it hasn't been worth focusing on yet, the price difference would be really super small. We spend most of our brain cells trying to find cheaper hard drives which account for 80 percent of our costs.

Netbooting -> we're actively looking into that as an option. But you still probably need a local swap drive.

You don't normally need a swap-drive on a storage node (if it needs to swap then something is wrong to begin with), but of course I don't know the details of what your nodes may do beyond dealing out files.

The main advantage of PXE-booting would be maintainability (rolling out upgrades by a simple reboot, etc.) but I assume at your scale you have that already figured out in one way or another.

Either way, thank you for all the insights that you keep sharing with the public! These hands-on blog posts are priceless both in entertainment and education value. :-)

If you want free, generic info on netbooting large environments, please let me know.

I used to netboot 5500-6000 Linux boxes that crunched collider data from the CMS detector at the LHC (worked at a DOE lab). I learned a lot that year.

That's a pretty impressive price point.
