I essentially have all of the same questions for Storage Pod 3.0 -- and in particular, what does the software stack look like? (This config is absolutely begging for ZFS, but I have a haunting feeling that something janky is afoot upstack.) I would also be curious as to the specific nature of failures that have been seen with the deployed architecture. Have the concerns from three years ago proven to be alarmist or prescient?
That said: I think it's very valuable to get configs like this out there for public discussion -- and I think it might be inspiring us (Joyent) to similarly publicly discuss our own high density storage config...
We group the drives into 15-drive RAID6 groups, where there are 13 data drives and 2 parity drives. This means we can lose 2 drives and not lose any data in that particular RAID6 group. We use the built-in Linux "mdadm" tool to do this.
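Back-of-envelope capacity math for that layout (the 4 TB drive size and 45 drives per pod are my assumptions for illustration, not figures from the comment):

```python
# Usable capacity of the 15-drive RAID6 layout described above.
# Drive size (4 TB) and 45 drives per pod are illustrative assumptions.
DRIVES_PER_GROUP = 15
PARITY_PER_GROUP = 2
DRIVES_PER_POD = 45        # hypothetical pod size
DRIVE_TB = 4               # hypothetical drive capacity

data_per_group = DRIVES_PER_GROUP - PARITY_PER_GROUP   # 13 data drives
groups = DRIVES_PER_POD // DRIVES_PER_GROUP            # 3 groups per pod
raw_tb = DRIVES_PER_POD * DRIVE_TB                     # 180 TB raw
usable_tb = groups * data_per_group * DRIVE_TB         # 156 TB usable
overhead = 1 - usable_tb / raw_tb                      # ~13.3% lost to parity

print(f"{raw_tb} TB raw -> {usable_tb} TB usable ({overhead:.1%} parity overhead)")
```

With those numbers a pod gives up about 2/15 of its raw capacity to parity, in exchange for surviving two drive failures per group.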
The network interface to a pod is through HTTPS talking with Tomcat (Java web server). Java writes the data to disk (ext4 on top of the above RAID6). Our application (backup) is very specific and performance-forgiving: essentially we write data once and then re-read it once every few weeks, recalculating the SHA-1 checksums on the files to make sure the data is completely, totally intact and a bit hasn't been flipped somewhere.
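A minimal sketch of that periodic re-verification, assuming the expected checksums are stored somewhere out of band (the function names here are mine, not Backblaze's):

```python
import hashlib

def sha1_of_file(path, chunk_size=1 << 20):
    """Stream a file and return its SHA-1 hex digest (backup files can be
    large, so never read them into memory at once)."""
    h = hashlib.sha1()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk_size), b""):
            h.update(block)
    return h.hexdigest()

def verify(path, expected_digest):
    """Recompute the checksum and compare with the stored one; a mismatch
    means silent corruption (bit rot) and the file needs repair."""
    return sha1_of_file(path) == expected_digest
```

Sweeping every file through something like this every few weeks is what turns "we wrote it once" into "we know it's still intact."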
One of the "luxurious" parts of working at Backblaze is we own BOTH the client and the server. On a customer's laptop, the client pre-digests the data, breaks it up into chunks that make sense (more than 5 MBytes and less than 30 MBytes), compresses it if appropriate (we don't compress movies or audio because it would be silly wasted effort), encrypts the data, then sends it through HTTPS to our datacenter. Because the client computer is supplied by customers, all their CPU cycles are "free" to us. We can conveniently break up files, encrypt them, and deduplicate (within that client), all without spending any CPU cycles at Backblaze, because it is done on the customer's laptop before being sent.
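A sketch of that client-side pipeline. The chunk splitter, the extension list, and the compression choice below are illustrative guesses; the real client's chunk-boundary logic and its encryption step aren't public, so encryption is deliberately omitted:

```python
import zlib

CHUNK_MIN, CHUNK_MAX = 5 * 2**20, 30 * 2**20   # 5-30 MB chunks, per the comment
ALREADY_COMPRESSED = {".mp3", ".mp4", ".mkv", ".jpg", ".zip"}  # illustrative list

def prepare_chunk(data: bytes, ext: str) -> bytes:
    """Compress a chunk only when it's likely to shrink; media files are
    already compressed, so recompressing them is wasted effort."""
    if ext.lower() in ALREADY_COMPRESSED:
        return data
    return zlib.compress(data)

def chunks(blob: bytes, size: int = CHUNK_MAX):
    """Naive fixed-size splitter; stands in for whatever the real client
    does to pick chunk boundaries."""
    for i in range(0, len(blob), size):
        yield blob[i:i + size]
```

The point of the design stands regardless of the details: every step here burns the customer's CPU, not the datacenter's.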
Again, the Backblaze storage pods really aren't the correct solution for all "off the shelf" type IT projects. For example, it won't meet the performance needs of many applications. But it does work exceptionally reliably in our experience as a backup solution when you have one or two programmers to help implement a custom software layer in Java.
One specific question, how do you know if the checksum is correct? Do you keep a database of checksums stored on a specific pod? And if the checksum is not correct, do you have other copies on other pods?
Having said that, it is interesting to compare Backblaze's design to Facebook's Open Vault. Facebook's design has much better mechanical serviceability, redundant power supplies, and redundant SAS expanders. Facebook believes that they can afford such features even for cold storage, so why is Backblaze cutting even more corners?
In my humble opinion, there is A LOT of wasted money in making datacenter machines easy to service. We keep a little spreadsheet (it isn't rocket science) which factors in employee salaries, the different designs, and how long it takes to open up and service a pod. For example, to access the Backblaze Pod's hard drives you must pull the pod out like a drawer and open the top. In many servers you can access all the hard drives from the front, without moving the server. Accessing drives from the front is much faster -> but you lose 2/3 of the density!! We pay $600 per month for a cabinet, just for the physical space. We can stack 8 pods in that cabinet, or about $75 per pod per month in space rental. So buying servers that are 1/3 as "dense" but allow drive access from the front costs Backblaze an extra $150/pod/month, or $1,800/pod/year. We open a particular pod to replace a failed drive or fix some other problem maybe once every 6 months -> so you are paying $900 PER INCIDENT if you have front-mounted drives that save the technician maybe 10 minutes of time. The math simply doesn't make any sense at scale.
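Redoing that spreadsheet math explicitly (a sketch: cabinet rent, pod count, and incident rate come from the comment; the 3x density factor follows from "lose 2/3 of the density"):

```python
# Rack-space cost of front-accessible (less dense) servers vs. dense pods.
CABINET_RENT = 600.0      # $/month for one cabinet of physical space
PODS_PER_CABINET = 8      # dense, top-loading pods per cabinet
DENSITY_FACTOR = 3        # front-load drive access => roughly 1/3 the density
INCIDENTS_PER_YEAR = 2    # a given pod is opened about once every 6 months

dense_cost = CABINET_RENT / PODS_PER_CABINET    # $75/pod/month
sparse_cost = dense_cost * DENSITY_FACTOR       # $225/pod/month
extra_per_month = sparse_cost - dense_cost
extra_per_incident = extra_per_month * 12 / INCIDENTS_PER_YEAR

print(f"${extra_per_month:.0f}/pod/month extra, "
      f"${extra_per_incident:.0f} per service incident")
```

Paying hundreds of dollars per incident to save a technician ~10 minutes is the "easy serviceability" tax the comment is describing.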
I think a lot of the inefficiency is because the datacenter employees recommend these "nice to haves" to their managers to make the datacenter employees' lives easier. But their company pays dearly for not doing the math. This isn't important if your company is massively profitable or if it isn't a "datacenter centric" company like Backblaze. But for us, it's the difference between life and death. Remember, we only charge $5/month and we don't have any deep pockets, so we have to stay profitable -> there isn't a lot of margin in there to be wasteful.
Obviously that setup will run you more than yours, but I'd be surprised if the delta was more than 1k once you subbed in your consumer mb, controllers and backplanes.
That said, one small nitpick: I'm fairly sure Facebook does less revenue/user than you do: They have a billion users, and did (order of magnitude) 5 billion last year: That's $5/user/year. Which you do per month.
Still, it's amazing you can provide this unlimited service for that small fee!
We have done exactly what you mention -> migrated off a pod, then disassembled it and used the older drives as replacements. Hard drives often come with a 3-year warranty, so for the first 3 years it is free to get a replacement from the manufacturer if they fail. But after 3 years we have to pay out of our own pocket to replace the drives, which changes the cost/math a little.
I am curious why RAID6 is (still) used given its poor behavior in failure conditions. I am guessing that Backblaze is not as worried about additional drive failure in an array during a rebuild because they are periodically verifying each drive.
Comparing to an X4540 is kind of a stretch; the use cases seem very different. Also, for Backblaze, they wouldn't exist at the price they charge if they had to pay what Sun charged for an X4540.
I hope Joyent does release something about their config, that would be very interesting.
Intel i3 processors that support ECC: http://ark.intel.com/search/advanced/?s=t&FamilyText=2nd...
Also, it'd be interesting to hear why Backblaze doesn't use SuperMicro SAS-boards instead with a SAS-expander, like HP SAS Expander.
BTW, keep up the great work guys!
The NetApp box has the same type of padding for all the drives, but they are much easier to access (pull-out trays are stable and easy to use).
Fun issues I saw with the NetApp box (at least 3 years ago): fully loaded with drives, it went over the weight limit that FedEx or UPS would take with standard shipping. It required freight shipping to ship a single, fully-loaded E5400.
Interesting coincidence, 180TB of Google Storage comes out at $8560 per month which is $47.55 per TB. Almost exactly x100.
And that does not include the network cost.
I expect that you'll generally find Google or AWS more pricey than a DIY setup, partly because of hidden costs and partly because they are actually trying to make money off of their services, even if the margins aren't outrageous.
"Look, I'm an Advanced User, and I Already Have a Set of RAID Drives with Perl Scripts to Copy My Files Back and Forth Between My 18 Home Machines that are in a Datacenter I've Built in My Closet. Why Do I Need Backblaze?"
Made me smile, probably because I have more machines than m^2 in my apartment...
I had an ultimately unrewarding conversation with Sean Quinlan (of Google GFS fame) about the futility of putting a lot of storage behind such a small channel (in Google's case the numbers were epically Google, of course, but the argument was the same). You waste all of the spindles because of the operation rate (requests coming into the channel) vs. the amount of data ops needed to satisfy each request: basically, your disks sit waiting around for the next request to come in from the network. (BTW, that allows you to make a nearly perfect emission-rate scheduler for disk arms, but that is another story.)
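A rough sketch of that argument with made-up but plausible numbers (45 drives, ~100 MB/s streaming per spindle, one 1 Gb/s NIC; none of these are published Backblaze specs):

```python
# How little of a dense pod's aggregate disk bandwidth one NIC can drive.
# All figures below are illustrative assumptions.
DRIVES = 45
MB_PER_S_PER_DRIVE = 100          # sequential throughput of one spindle
NIC_GBPS = 1                      # one gigabit uplink

disk_bw = DRIVES * MB_PER_S_PER_DRIVE     # 4500 MB/s aggregate across spindles
nic_bw = NIC_GBPS * 1000 / 8              # 125 MB/s through the channel
utilization = nic_bw / disk_bw            # ~2.8% -- spindles mostly idle

print(f"disks can stream {disk_bw} MB/s, NIC delivers {nic_bw:.0f} MB/s "
      f"({utilization:.1%} of disk bandwidth)")
```

Under those assumptions the network can keep only a few percent of the spindles busy, which is the "small channel in front of a lot of storage" problem in a nutshell. For cold backup data that's fine; for anything IO-hungry it isn't.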
What this means is that petabyte pods are going to be nearly useless, although with an external index they can be dense.
The assumption here is that data is kept for a long time relative to how frequently it's written and read, so the IO speed probably isn't that big of a deal.
Microsoft had an interesting paper on data locality in storage last year. Can't recall the title offhand though.
Yes, probably not the same speed specs, but it's external, which means it usually costs more for the convenience of the housing.
Edit: Unless my math is wrong, the 3TB model is a better price point...$40 for a TB as opposed to $44+/TB for the 4TB version.
If you're not afraid of shucking cases, there are a number of 4TB drives you can get at that price point.
That cuts down on the power usage, cooling, and rack space that you need to pay for.
The drives seem to last about 5 years in our experience, so technically we should be able to buy the denser drive if it pays back in 4 years, but cash to run the business is very dear to our hearts so we don't like going out much more than a year on a payback.
I wish Google's hard drive failure analysis included failure rates and failure-scenario statistics for different models, vendors, and consumer/enterprise classes too.
Nice deal though!
Oh, if only such a provider existed. (rsync.net)
- technically competent
- have smaller space needs
Interestingly they use ZFS on FreeBSD I believe.
"Backup Windows and Mac
Backblaze backs up Windows XP 32-bit, Windows Vista 32-bit and 64-bit, Windows 7 32-bit and 64-bit, Windows 8 and Intel-based Mac OSX 10.5 and newer."
I surmise that 2.5" drives are less affected by vibration. I'm surprised they don't use an SD card as the boot drive; I run our ESX servers from them.
Netbooting -> we're actively looking into that as an option. But you still probably need a local swap drive.
The main advantage of PXE-booting would be maintainability (rolling out upgrades by a simple reboot, etc.) but I assume at your scale you have that already figured out in one way or another.
Either way, thank you for all the insights that you keep sharing with the public!
These hands-on blog posts are priceless both in entertainment and education value. :-)
I used to netboot 5500-6000 Linux boxes that crunched collider data from the CMS detector at the LHC (worked at a DOE lab). I learned a lot that year.