I essentially have all of the same questions for Storage Pod 3.0 -- and in particular, what does the software stack look like? (This config is absolutely begging for ZFS, but I have a haunting feeling that something janky is afoot upstack.) I would also be curious about the specific nature of failures that have been seen with the deployed architecture. Have the concerns from three years ago proven to be alarmist or prescient?
That said: I think it's very valuable to get configs like this out there for public discussion -- and it might inspire us (Joyent) to similarly discuss our own high-density storage config publicly...
We group the drives into 15-drive RAID6 groups: 13 data drives and 2 parity drives. This means we can lose 2 drives in a particular RAID6 group and not lose any data. We use the built-in Linux "mdadm" tools to do this.
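To make the group layout concrete, here is a tiny back-of-the-envelope sketch in Java (the per-drive capacity is just an assumed example, not a figure from the post). With mdadm the group itself would be created with something along the lines of mdadm --create --level=6 --raid-devices=15.

    // Back-of-the-envelope numbers for one 15-drive RAID6 group (13 data + 2 parity).
    // The per-drive capacity below is an assumed example, not a figure from the post.
    public class Raid6GroupMath {
        public static void main(String[] args) {
            final int totalDrives = 15;
            final int parityDrives = 2;    // RAID6: survives any 2 drive failures in the group
            final double driveTB = 4.0;    // hypothetical drive size, for illustration only

            int dataDrives = totalDrives - parityDrives;                   // 13
            double usableTB = dataDrives * driveTB;                        // 52 TB usable per group
            double parityOverhead = (double) parityDrives / totalDrives;   // ~13% of raw capacity

            System.out.printf("usable %.0f TB, %.1f%% of raw capacity spent on parity, tolerates %d failures%n",
                    usableTB, parityOverhead * 100, parityDrives);
        }
    }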
The network interface to a pod is HTTPS, talking with Tomcat (a Java web server). Java writes the data to disk (ext4 on top of the above RAID6). Our application (backup) is very specific and performance-forgiving: essentially we write data once, then re-read it once every few weeks and recalculate the SHA-1 checksums on the files to make sure the data is completely, totally intact and a bit hasn't been flipped somewhere.
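In rough terms, that periodic integrity pass looks something like the sketch below; the file path is made up and where the expected checksum comes from (local index, database, etc.) is simplified to a method argument for illustration.

    import java.io.InputStream;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.security.MessageDigest;

    // Sketch of the periodic re-read pass: recompute SHA-1 on a stored blob and
    // compare it against the checksum recorded at write time.
    public class IntegrityCheck {

        static boolean verify(Path blob, String expectedSha1Hex) throws Exception {
            MessageDigest sha1 = MessageDigest.getInstance("SHA-1");
            try (InputStream in = Files.newInputStream(blob)) {
                byte[] buf = new byte[1 << 20];            // stream the file in 1 MB reads
                int n;
                while ((n = in.read(buf)) != -1) {
                    sha1.update(buf, 0, n);
                }
            }
            StringBuilder hex = new StringBuilder();
            for (byte b : sha1.digest()) {
                hex.append(String.format("%02x", b & 0xff));
            }
            return hex.toString().equals(expectedSha1Hex);
        }

        public static void main(String[] args) throws Exception {
            // Hypothetical path and checksum, for illustration only.
            System.out.println(verify(Path.of("/data/vol1/blob_000123"),
                    "da39a3ee5e6b4b0d3255bfef95601890afd80709"));
        }
    }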
One of the "luxurious" parts of working at Backblaze is that we own BOTH the client and the server. On a customer's laptop, the client pre-digests the data: it breaks it up into chunks that make sense (more than 5 MBytes and less than 30 MBytes), compresses it if appropriate (we don't compress movies or audio because that would be silly wasted effort), encrypts the data, and then sends it through HTTPS to our datacenter. Because the client computer is supplied by the customer, all of its CPU cycles are "free" to us. We can conveniently break up files, encrypt them, and deduplicate (within that client), all without spending any CPU cycles at Backblaze, because it is done on the customer's laptop before anything is sent.
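As a rough illustration of that client-side pre-digestion: the chunk target, the extension list, and the compression level below are simplified placeholders, and the encryption/dedupe/upload steps are only noted in comments.

    import java.io.ByteArrayOutputStream;
    import java.io.InputStream;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.zip.Deflater;
    import java.util.zip.DeflaterOutputStream;

    // Sketch of client-side pre-digestion: split a file into chunks in the 5-30 MB
    // range and skip compression for already-compressed media.
    public class ClientPreDigest {

        static final int CHUNK_BYTES = 10 * 1024 * 1024;   // assumed 10 MB target, within the 5-30 MB range

        static boolean worthCompressing(Path file) {
            String name = file.getFileName().toString().toLowerCase();
            // Already-compressed formats: compressing again is "silly wasted effort".
            return !(name.endsWith(".mp4") || name.endsWith(".mp3")
                    || name.endsWith(".jpg") || name.endsWith(".zip"));
        }

        static List<byte[]> chunkAndCompress(Path file) throws Exception {
            List<byte[]> chunks = new ArrayList<>();
            boolean compress = worthCompressing(file);
            try (InputStream in = Files.newInputStream(file)) {
                byte[] raw;
                while ((raw = in.readNBytes(CHUNK_BYTES)).length > 0) {
                    chunks.add(compress ? deflate(raw) : raw);
                    // Next steps (omitted): encrypt the chunk with the customer's key,
                    // dedupe against this client's existing chunks, POST over HTTPS.
                }
            }
            return chunks;
        }

        static byte[] deflate(byte[] data) throws Exception {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            try (DeflaterOutputStream dos = new DeflaterOutputStream(out, new Deflater(Deflater.BEST_SPEED))) {
                dos.write(data);
            }
            return out.toByteArray();
        }

        public static void main(String[] args) throws Exception {
            // Hypothetical file, for illustration only.
            System.out.println(chunkAndCompress(Path.of("/home/user/photos.tar")).size()
                    + " chunks ready to encrypt and upload");
        }
    }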
Again, the Backblaze storage pods really aren't the correct solution for all "off the shelf" type IT projects. For example, it won't meet the performance needs of many applications. But it does work exceptionally reliably in our experience as a backup solution when you have one or two programmers to help implement a custom software layer in Java.
One specific question: how do you know if the checksum is correct? Do you keep a database of checksums stored on a specific pod? And if the checksum is not correct, do you have other copies on other pods?
Having said that, it is interesting to compare Backblaze's design to Facebook's Open Vault. Facebook's design has much better mechanical serviceability, redundant power supplies, and redundant SAS expanders. Facebook believes that they can afford such features even for cold storage, so why is Backblaze cutting even more corners?
In my humble opinion, there is A LOT of wasted money in making datacenter machines easy to service. We keep a little spreadsheet (it isn't rocket science) that factors in employee salaries, the different designs, and how long it takes to open up and service a pod. For example, to access the Backblaze Pod's hard drives you must pull the pod out like a drawer and open the top. With many servers you can access all the hard drives from the front, without moving the server. Accessing drives from the front is much faster -> but you lose 2/3 of the density!! We pay $600 per month for a cabinet, just for the physical space. We can stack 8 pods in that cabinet, or about $75 per pod per month in space rental. So buying servers that are 1/3 as "dense" but let you access drives from the front costs Backblaze $50/pod/month or $600/pod/year. We open a particular pod and replace a failed drive or fix some other problem maybe once every 6 months -> so you are paying $300 PER INCIDENT if you have front-mounted drives that save the technician maybe 10 minutes of time. The math simply doesn't make any sense at scale.
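Laying that back-of-the-envelope math out in one place (same figures as above, plugged in as givens):

    // Back-of-the-envelope serviceability math, using the figures quoted above.
    public class ServiceabilityMath {
        public static void main(String[] args) {
            double cabinetPerMonth = 600.0;     // $ per cabinet per month, space only
            int podsPerCabinet = 8;
            double rentPerPod = cabinetPerMonth / podsPerCabinet;   // $75/pod/month

            double frontAccessPremium = 50.0;   // quoted extra $/pod/month for front-accessible drives
            double monthsBetweenIncidents = 6.0; // a given pod is opened roughly twice a year

            double costPerIncident = frontAccessPremium * monthsBetweenIncidents;   // $300
            System.out.printf("rent $%.0f/pod/month, premium $%.0f/pod/month, ~$%.0f per service incident%n",
                    rentPerPod, frontAccessPremium, costPerIncident);
        }
    }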
I think a lot of the inefficiency is because the datacenter employees recommend these "nice to haves" to their managers to make their own lives easier. But their company pays dearly for not doing the math. This isn't important if your company is massively profitable or if it isn't a "datacenter centric" company like Backblaze is. But for us, it's the difference between life and death. Remember, we only charge $5/month and we don't have any deep pockets, so we have to stay profitable -> there isn't a lot of margin in there to be wasteful.
Obviously that setup will cost more than yours, but I'd be surprised if the delta was more than $1k once you subbed in your consumer motherboard, controllers, and backplanes.
That said, one small nitpick: I'm fairly sure Facebook does less revenue per user than you do. They have a billion users and did (order of magnitude) $5 billion last year: that's $5/user/year, which you make per month.
Still, it's amazing you can provide this unlimited service for that small fee!
We have done exactly what you mention -> migrated off a pod, then disassembled it and used the older drives as replacements. Hard drives often come with a 3-year warranty, so for the first 3 years it is free to get a replacement from the manufacturer if a drive fails. But after 3 years we have to pay out of our own pocket to replace the drives, which changes the cost/math a little.
I am curious why RAID6 is (still) used given its poor behavior in failure conditions. I am guessing that Backblaze is not as worried about additional drive failure in an array during a rebuild because they are periodically verifying each drive.
Comparing to an X4540 is kind of a stretch. The use cases seem very different. Also, as for Backblaze, they wouldn't exist at the price they charge if they paid what Sun charged for an X4540.
I hope Joyent does release something about their config, that would be very interesting.