I essentially have all of the same questions for Storage Pod 3.0 -- and in particular, what does the software stack look like? (This config is absolutely begging for ZFS, but I have a haunting feeling that something janky is afoot upstack.) I would also be curious as to the specific nature of failures that have been seen with the deployed architecture. Have the concerns from three years ago proven to be alarmist or prescient?
That said: I think it's very valuable to get configs like this out there for public discussion -- and I think it might be inspiring us (Joyent) to similarly publicly discuss our own high density storage config...
Disclaimer: I work at Backblaze. I'm not technically on the server team, but here is what I understand: we have 450-ish Backblaze pods (each with 45 hard drives) deployed in the datacenter. We are JUST NOW starting to see some old age mortality (increased failures) of the drives we deployed about 4 and a half years ago. We're really happy with the longevity, it exceeded everything we were told to expect.
We group the drives into 15-drive RAID6 groups, where there are 13 data drives and 2 parity drives. This means we can lose 2 drives and not lose any data in that particular RAID6 group. We use the built-in Linux "mdadm" tools to do this.
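Setting up a 13+2 group like that with stock mdadm looks roughly like this. A minimal sketch -- the device names (`/dev/sdb` through `/dev/sdp`), array name, and mount point are hypothetical, not Backblaze's actual layout:

```shell
# Create a 15-drive RAID6 array: any 2 of the 15 drives can fail
# without data loss; usable capacity is 13/15 of raw capacity.
mdadm --create /dev/md0 --level=6 --raid-devices=15 /dev/sd[b-p]

# Put ext4 on top of the array, as described for the pods.
mkfs.ext4 /dev/md0
mount /dev/md0 /mnt/raid6-group1
```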
The network interface to a pod is through HTTPS talking with Tomcat (Java web server). Java writes the data to disk (ext4 on top of the above RAID6). Our application (backup) is very specific and performance-forgiving: essentially we write data once and then re-read it once every few weeks, recalculating the SHA-1 checksums on the files to make sure the data is all completely, totally intact and a bit hasn't been flipped somewhere.
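That periodic re-read-and-verify pass can be sketched in a few lines. This is an illustrative outline only (the file layout and where the expected digests are stored are assumptions, not Backblaze's actual Java code):

```python
import hashlib

def sha1_of_file(path, chunk_size=1 << 20):
    """Stream a file through SHA-1 so large blobs never need to fit in RAM."""
    h = hashlib.sha1()
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            h.update(chunk)
    return h.hexdigest()

def verify(path, expected_hex):
    """True if the on-disk bytes still match the checksum recorded at write time."""
    return sha1_of_file(path) == expected_hex
```

The expected digest would be recorded when the chunk is first written, then checked against the re-read every few weeks.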
One of the "luxurious" parts of working at Backblaze is we own BOTH the client and the server. On a customer's laptop, the client pre-digests the data, breaks it up into chunks that make sense (more than 5 MBytes and less than 30 MBytes) and then the client compresses it if appropriate (we don't compress movies or audio because it would be silly wasted effort) and the client encrypts the data, then sends it through HTTPS to our datacenter. Because the client computer is supplied by customers, all their CPU cycles are "free" to us. We can conveniently break up files, encrypt them, deduplicate (within that client) all without spending any CPU cycles at Backblaze because it is done on the customer's laptop before being sent.
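The client-side pipeline described above (chunk, maybe compress, encrypt, upload) can be outlined as follows. Only the 5-30 MByte chunk window and the skip-compression-for-media idea come from the comment; the extension list, chunk size, and the XOR "encryption" are stand-in assumptions for illustration:

```python
import zlib

MIN_CHUNK = 5 * 2**20    # "more than 5 MBytes"
MAX_CHUNK = 30 * 2**20   # "less than 30 MBytes"

# Already-compressed media where recompressing is silly wasted effort
# (illustrative list, not Backblaze's actual rules)
SKIP_COMPRESSION = {".mp4", ".mov", ".avi", ".mp3", ".aac", ".jpg"}

def chunk_file(data: bytes, size: int = 8 * 2**20) -> list:
    """Split file contents into chunks inside the 5-30 MB window."""
    assert MIN_CHUNK <= size <= MAX_CHUNK
    return [data[i:i + size] for i in range(0, len(data), size)]

def toy_encrypt(chunk: bytes, key: int = 0x5A) -> bytes:
    """Stand-in for real client-side encryption (a toy XOR, NOT what Backblaze uses)."""
    return bytes(b ^ key for b in chunk)

def prepare_chunk(chunk: bytes, extension: str) -> bytes:
    """Compress if worthwhile, then encrypt -- all on the customer's CPU."""
    if extension.lower() not in SKIP_COMPRESSION:
        chunk = zlib.compress(chunk)
    return toy_encrypt(chunk)
```

The point of the sketch is the ordering: all of this runs on the customer's machine, so the datacenter only ever receives pre-digested, encrypted chunks over HTTPS.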
Again, the Backblaze storage pods really aren't the correct solution for all "off the shelf" type IT projects. For example, it won't meet the performance needs of many applications. But it does work exceptionally reliably in our experience as a backup solution when you have one or two programmers to help implement a custom software layer in Java.
Wow, thanks for the explanation! I would love to learn more about the software you guys use!
One specific question, how do you know if the checksum is correct? Do you keep a database of checksums stored on a specific pod? And if the checksum is not correct, do you have other copies on other pods?
All those criticisms look pretty irrelevant in Backblaze's scale-out cold storage use case. Their performance requirements are essentially zero. They already handle full pod failures, so a partial failure is no worse as long as they can recognize it and fail the whole pod.
Having said that, it is interesting to compare Backblaze's design to Facebook's Open Vault. Facebook's design has much better mechanical serviceability, redundant power supplies, and redundant SAS expanders. Facebook believes that they can afford such features even for cold storage, so why is Backblaze cutting even more corners?
Disclaimer: I work for Backblaze. The answer to almost every question put to us is "total cost of ownership". Facebook stores a lot less data per customer than Backblaze and makes more per customer than we charge. Facebook can afford to waste more money than we can, so they do.
In my humble opinion, there is A LOT of wasted money in making datacenter machines easy to service. We create a little spreadsheet (it isn't rocket science) which includes the employee salaries, taking into account different designs and how long it takes to open up and service a pod. For example, to access the Backblaze Pod's hard drives you must pull the pod out like a drawer and open the top. Many servers let you access all the hard drives from the front, without moving the server. Accessing drives from the front is much faster -> but you lose 2/3 of the density!! We pay $600 per month for a cabinet, just for the physical space. We can stack 8 pods in that cabinet, or about $75 per pod per month in space rental. So buying servers that are 1/3 as "dense" but easier to access drives from the front costs Backblaze $50/pod/month or $600/pod/year. We open a particular pod and replace a failed drive or fix some other problem maybe once every 6 months -> so you are paying $300 PER INCIDENT if you have front-mounted drives that save the technician maybe 10 minutes of time. The math simply doesn't make any sense at scale.
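The spreadsheet logic above is simple enough to spell out as arithmetic. All dollar figures are taken directly from the comment; the code just reproduces the calculation:

```python
# Figures quoted in the comment
CABINET_COST_PER_MONTH = 600   # $/month for the physical cabinet space
PODS_PER_CABINET = 8

space_cost_per_pod = CABINET_COST_PER_MONTH / PODS_PER_CABINET
# -> $75/pod/month in space rental

# Extra space cost quoted for lower-density, front-access servers
EXTRA_COST_PER_POD_MONTH = 50        # $/pod/month, per the comment
MONTHS_BETWEEN_INCIDENTS = 6         # a pod is opened roughly twice a year

cost_per_incident = EXTRA_COST_PER_POD_MONTH * MONTHS_BETWEEN_INCIDENTS
# -> $300 paid per service incident, to save ~10 minutes of technician time
```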
I think a lot of the inefficiency is because the datacenter employees recommend these "nice to haves" to their managers to make the datacenter employees' lives easier. But their company pays dearly for not doing the math. This isn't important if your company is massively profitable or if it isn't a "datacenter centric" company like Backblaze. But for us, it's the difference between life and death. Remember, we only charge $5/month and we don't have any deep pockets, so we have to stay profitable -> there isn't a lot of margin in there to be wasteful.
I totally agree with you in principle. But you're being too generous with your density advantage as compared to hotswap. I can't tell if your pods are 4U or 5U, but your motherboard vendor sells a 4U front & back chassis that has 36 3.5" SAS trays.
Obviously that setup will cost more than yours, but I'd be surprised if the delta was more than $1k once you subbed in your consumer motherboard, controllers, and backplanes.
That said, one small nitpick: I'm fairly sure Facebook does less revenue per user than you do. They have a billion users and did (order of magnitude) $5 billion last year -- that's $5/user/year, which you charge per month.
Still, it's amazing you can provide this unlimited service for that small fee!
I work for Backblaze -> typically we replace a drive once it fails, but not "as soon as it fails". Because the storage pods go into "read only" mode when one drive goes down, we have some time before we need to take action; sometimes it can be a few days before the drive is replaced. All incoming data is rerouted to a different pod, but the data that was on the pod is still readable and available for restore.
How do you roll up your old pods? By this I mean, do you still have all of your 1.0 pods running, or do you start migrating them to 3.0 pods, and cannibalizing the older pods for parts, until full failure/replacement, or...?
We have migrated the data from smaller pods to larger pods, then re-enabled backups on the new half-filled pod to fill it to the brim. But we did not do this because the pod was necessarily old; we do this if a pod is unstable for some reason (usually it is a brand of drive we ended up not liking).
We have done exactly what you mention -> migrated the data off a pod, then disassembled it and used the older drives as replacements. Hard drives often come with a 3-year warranty, so for the first 3 years it is free to get a replacement from the manufacturer if they fail. But after 3 years we have to pay out of our own pocket to replace the drives, which can change the cost math a little.
I wonder if it is still too soon to start seeing the worst failure cases given the age of the drives.
I am curious why RAID6 is (still) used given its poor behavior in failure conditions. I am guessing that Backblaze is not as worried about additional drive failure in an array during a rebuild because they are periodically verifying each drive.
Comparing to an X4540 is kind of a stretch. The use cases seem very different. Also, for Backblaze, they wouldn't exist at the price they charge if they paid what Sun charged for an X4540.
I hope Joyent does release something about their config, that would be very interesting.
(Yev w/ Backblaze here) -> We have drives that have remained online since the first versions of the storage pods, but we've just recently started collecting drive failure rates. Typically we see that if a drive can stay up for the first two weeks, it will live happily in the pod for a good long while.