This system is definitely optimized for backup. That totally makes sense for Backblaze. However, it's important not to compare it like-for-like with something like S3, which is optimized for much better read/write performance.
At a basic level, the cooling on this system seems minimal. Those tightly packed drives would get hot if they were all spinning a lot. More than that, since they are using commodity consumer hardware and have already used up their PCIe slots for the SATA controllers, there isn't any place to add anything beyond the gigabit (I assume) ethernet jack on the mobo. That means their throughput is limited.
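To put a rough number on that (assuming a single gigabit port, which is what the parts list suggests):

    $ echo '1000 / 8' | bc    # MB/s ceiling of one gigabit link
    125

A single 7200rpm drive can already stream on the order of 100 MB/s sequentially, so the NIC, not the 45 drives, is the bottleneck.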
Again, this is a great system for backup. Most of the data will just sit happily sipping little power. However, if you are thinking of this as equivalent to a filer, that's an unfair comparison.
It would take about a dozen 80mm normal-speed computer fans to reach this.
You can probably get high aggregate IOPS with all these drives in parallel, but probably not as much as the more expensive options (EMC, NetApp).
But I have to agree this is an awesome setup optimized for a specific application. Way to go
That being said - did Seagate ever fix the firmware issue on their 1.5TB drives that caused random corruption? (I heard about it maybe 6 months ago.)
I think you missed the part about testing a dozen SATA cards, etc.
The attention to detail here is a lot more than something you'd slap together in your apartment.
We hacker types love to think that we could do the same thing in no time with little budget, and I'm sure we could get a first approximation. But the devil is in the details. Debugging the complex interaction of 20 different hardware components is not my idea of fun.
Hats off to them, particularly for sharing.
Just because you can in theory hook 40 drives to n cards doesn't mean it will work - well done to them
This seems like an opportunity to disrupt from the low end of the market for some scrappy enterprising engineers. If they don't do it, someone should.
Even if you hired a $100/hr consultant, $100k would buy half a year. I suggest it could be done in half that time by someone with a suitable background. Someone with the specific expertise should be able to do it in 6 weeks elapsed and 2 weeks of full-time work, tops. (Been there, done that.)
I'm sure the Backblaze folks have built something into the system that handles that sort of thing, but just saying: it's something you have to think about before you slap consumer drives into a RAID that needs to keep working when one drive fails.
The whole point of this exercise is reducing cost across the board, and upgrading consumer SATA disks to enterprise ones would make this setup a whole lot more expensive, since the drives make up most of the hardware cost.
ZFS, if it deals with consumer drives as well as it claims to, would solve the problem at the same disk-space cost as RAID5.
If you want a shared-nothing cluster with less than 3x overhead you can use erasure codes. The software complexity is significant, but at scale it should be the cheapest option. http://cleversafe.org/ or http://allmydata.org/trac/tahoe
The big win here is caching... each of the OpenSolaris boxes can use write-back caching, because all data is mirrored to the other OpenSolaris box. Much like a NetApp or EMC with dual heads, I don't have to worry about disk inconsistencies caused by the write-back cache unless both boxes fail. (Of course, a 'real' dual-headed SAN will switch to write-through caching if one head dies, and doesn't require a full rebuild when the other head comes back online, but I can't afford a 'real' dual-headed SAN.)
Rebuild times, I imagine, will be quite significant.
Now, I am very leery of the performance of this system (since these are Xen instances, it wouldn't surprise me if it were unsuitable for anything but tape replacement), but I haven't tried yet. It seems, though, that it would work just fine if I were not too cheap to use real hardware.
If you wanted to add complexity, you could have 3 or more of these OpenSolaris boxes and use software RAID5 on the client to save space.
I always prefer the strategy where mirroring the data happens before the disk is presented to the VM. Have a look at GlusterFS and see if that might be an option? I know it wasn't playing nice with zfs for a while there, though.
And I agree, giving the user two block devices and expecting them to mirror is a very bad idea. I know this from experience. In an earlier setup, I'd give each user two block devices, one from each disk in the box, the idea being the user would mirror or not as appropriate. Well, you lose a customer's data, you lose the customer. Lesson learned.
But this is where my virtual setup gives me an edge. On every Xen server, I have a control guest or driver domain, the dom0. All disk and net I/O goes through the dom0 anyhow, so there wouldn't be much more overhead in doing the mirroring in the dom0 and passing the md device through to the domU (what the user controls). I'm not adding another point of failure, either; if the dom0 chokes, all guests on the system will crash regardless.
With RAID10 again you have to build large arrays to beat triple JBOD for storage efficiency - a 12 x 1TB array yields 5TB usable space while triple JBOD gives you 4TB. BUT you still have a single point of failure - your RAID card can give out and your array is unavailable. Using triple JBOD you'll be able to have your data on different physical machines or even geographically separate data centers if you wish.
The advantages outweigh the reduction of storage density when you're dealing with petabytes of storage - why else would Google et al be doing it?
On top of that, I am kind of dumb compared to the sort of person I'd want writing my block device drivers. I want to take well-tested, open-source software components and plug them together in a clear manner. I don't know of anything off the shelf that will give me a filesystem with reasonable performance on a 'triple JBOD' system, unless you mean running md in a RAID1 with 3 drives. (I'm actually doing that on my more remote servers; the idea is that I can wait longer before replacing a bad disk.)
The system I proposed above basically exports drives that may fail to clients, who can then do their own redundancy. Because the drive is specified as 'may fail', write-back caching may be utilized. (God, RAM has become cheap.) The client, in my case, will be the dom0 of the domU that wants the space, but if I were selling this space to random people on the internet, the client could be some box running md that treats its iSCSI devices synchronously, meaning it waits for the write to return from both md devices before it acknowledges its own write. If the intermediary client device did no caching at all, it seems that I may be able to set up IP failover. (Though that part... sounds dangerous. I'd need to be very careful with 'fencing' it or what have you, so that the two nodes were not active at once.)
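A minimal sketch of that client side, assuming two storage boxes exporting iSCSI targets (the target names, portal addresses, and device names below are made up):

    # log in to the block devices exported by the two storage boxes
    iscsiadm -m node -T iqn.2009-09.example:pod-a -p 10.0.0.11 --login
    iscsiadm -m node -T iqn.2009-09.example:pod-b -p 10.0.0.12 --login

    # md RAID1 across them; by default md acknowledges a write only after
    # every active mirror has it, which is the synchronous behaviour described above
    mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sdb /dev/sdc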
On the other hand, if I handle the 'beta' poorly, like taking it out of beta early, there won't be a long term to worry about.
Seems dangerous to source all your hardware from one vendor over a small timeframe. Would be nice to have some redundancy over manufacturers too.
Raw Drives $81,000
Is Backblaze's solution cheaper than S3? Absolutely. But they're also twisting the numbers a bit.
99.99% uptime means roughly 1 hour of downtime per year. I don't know what the specific failure rates on the components are, but it seems reasonable that A) the data drives are hot-swappable and will not cause downtime when they are replaced, and B) the rest of the components fail once a year (or less) and take ~10 minutes to replace and reboot the system. With 4 main points of failure (PSU, boot drive, motherboard/RAM/CPU, drive controllers), as long as you have staff constantly on call who can respond within 5 minutes of a failure, 99.99% uptime seems reasonable.
I don't know where you got the idea that your data was on 4 different servers when using S3. I can't find even the slightest amount of information on that. Yes, that would be nice, and it is cool to think that, but it's rather doubtful that they're actually doing that (or they could probably add another 9 to their uptime).
"Amazon keeps at least 3 copies of your data (which is what you need for high reliability) in at least 2 different geographical locations. "
I can't find the original source supporting that statement, but I also know it to be true based on direct contact with the AWS team. (I've been using AWS since the private alpha of EC2 in 2005.)
In addition, looking at the actual S3 contract, they're really only guaranteeing 99.9% uptime, which allows for nearly 9 hours of downtime a year, more than enough to completely rebuild an entire server once a year, as long as they can keep the data intact (which they seem more than capable of with their setup, once again assuming their data center is not completely destroyed).
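For reference, the downtime budgets work out to:

    $ echo '8760 * 0.0001' | bc -l    # hours/year allowed at 99.99% uptime
    .8760
    $ echo '8760 * 0.001' | bc -l     # hours/year allowed at 99.9% uptime
    8.760

So four nines is roughly 53 minutes a year, and three nines is closer to 8 hours 45 minutes.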
I believe that S3 replicates to multiple physical locations. So, while you might experience downtime, you probably won't experience data loss.
That is just the cost to buy all the components they have. If you budgeted that in you'd just have a bunch of boxes at your office that weren't even wired up. On the other hand, from Dell or Sun you at least have all the hardware in the chassis, if not some basic configuration and OS install done for you. Go with Amazon and you've already got it racked up in a data center with dedicated techs and replacement parts on hand.
It isn't a fair comparison.
Tell you what, since you're a fellow HNer, I'll save you some money and do it for half that. I'll even throw in some basic configuration. You're welcome!
It might not add up the same for your needs (assuming you want multi-data center redundancy of all data, etc...).
They're going Google's route, which is fine, it's just... a whole different direction than a Sun Thumper. I'd be interested to see how their costs per petabyte stack against Google's.
Google's operation has at most half the disks-per-rack density of Backblaze, but with 4 or 8 times the server density. Google does almost all their computation on the same nodes where the data is stored. Google is also storing each tablet a minimum of 3 times within each cluster, with most systems having multiple replica clusters.
Backblaze's system is going to have several orders of magnitude less bandwidth:
way too many drives in one server (some of them on the PCI bus!)
use of port multipliers (causing 5 drives to share one SATA cable's 300MB/s; rough numbers after this list)
RAID6 with too many drives per array (15-way XOR is no fun)
why use JFS?
only access is via HTTPS, not clear if SSL is done on the pod
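Quick per-drive numbers behind the port-multiplier point above:

    $ echo '300 / 5' | bc                        # MB/s per drive when 5 drives share one 3Gb/s SATA link
    60
    $ echo 'scale=1; 1500000 / 60 / 3600' | bc   # hours just to read one whole 1.5TB drive at that rate
    6.9

Those ~7 hours are also a rough floor on RAID6 rebuild time, since a rebuild has to read every surviving drive end to end at roughly that rate.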
The more painful corollary is that, even if one measures not by total storage but by performance (a more critical aspect of, e.g., databases), the prices are still an order of magnitude or two apart.
I would argue that the reason "No One Sells Cheap Storage" in the backup/archive sense is that there isn't enough demand. Obviously, this is starting to change.
I am, however, startled that a driveless pod is so expensive ($2467 or $54.82 per drive), considering that external performance is, effectively, limited by gigabit Ethernet.
Two SuperMicro 846E1s would be $2k for 48 drive bays, and one could easily connect 10 of them to a single $300 SAS RAID card. Add another $1k for mobo and cables, and that's $11300 for 240 drives or $47.08 per drive, without having to do a custom case.
Granted, it takes up almost twice the rack space, but it's still a decent power density at 50W/U AC (assuming 6.5W per drive DC).
It takes 52 RAID6 groups, each with 13 usable 1.5TB disks (hmm, 13 pops up an awful lot in their setup), to make a PB using their pods. 51 groups means 17 pods, which is 0.9945 PB for $134k (about $134.5k/PB).
Using my pods, it would be 0.936 PB for $120.3k (about $128.5k/PB), but what's $6k/PB between friends?
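Checking that arithmetic (the $7,867 per-pod price is the article's figure; the rest comes from the numbers above):

    $ echo '51 * 13 * 1.5' | bc                  # usable TB in 51 RAID6 groups (13 data disks each)
    994.5
    $ echo 'scale=2; 17 * 7867 / 994.5' | bc     # dollars per usable TB across 17 pods
    134.47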
I don't think a single SAS card can support more than 128 (or maybe 256 if you're lucky) disks. This doesn't affect your math much, but I wouldn't want someone to try this at home and be disappointed. If you're not willing to build a custom case and don't feel like suffering the vagaries of SATA expanders, LSI SAS cards plus Supermicro JBODs are the way to go.
Similarly, from "at home" experience, I can also safely say that luck factors prominently in the vagaries of SATA port multipliers. The reason I even made the attempt was a lower price per port. This is why I'm so started that a solution based on them would end up being more expensive than a SAS-based one.
Of course, if the camel of performance gets its nose into the tent of ones requirements, it has to be the SAS way or the highway.
Personally, I'm working on a system whereby I just fill up otherwise unused slots on my chassis and use that for backup. The system isn't done yet, but I've got 6 disks out there waiting for me to figure it out.
If I only needed one, I'd seriously consider it, as my time for setting up one unit and working out the kinks would unquestionably cost more than that, and they have quite a bit of experience working out the problems by now. Even if I made my system redundant, their system would unquestionably be more reliable on the first iteration.
However, I plan on needing a bunch more, so it probably makes sense for me to roll my own, work out the kinks, and hopefully end up with something cheaper that has a lot more cache.
To address vibration, acoustics and gyroscopic effect, what I've seen done in highly dense enclosures is to rotate every second drive around 180 degrees in a bit of a shotgun approach to balancing stuff.
Do you have some pictures of those enclosures?
Two disks were mounted in a frame linearly, both screwed to the frame, with the power/sata connectors toward the middle of the frame, and one drive upside-down.
These cassettes were removable as a unit for hot-swap, and were inserted linearly into a half-deep 19" rack enclosure.
That the two drives were physically connected to the same frame and removed and replaced as a unit makes it seem as if they would be started in phase. Now, I'm no physicist, but I'm not 100% sure that's so important - if you have two contra-rotating gyroscopes firmly connected and running at the same speed, surely they resist movement by sheer gyroscopic effect?
It'll take 3 days at the theoretical max of the networking equipment to read or write the 67TB. The overhead of HTTPS constrains the network further, so even that estimate is too low.
I'd expect that their internet connection (i.e. in/out of the data center) is the real bottleneck.
I believe that the system is being used as a tape drive replacement.
These guys only use 4GiB of RAM because they are saving money on the motherboard, I bet. Personally, if I were building it, I'd increase the cost by another grand or so and use dual low-power Opterons with 32GiB of RAM. (Of course, that would also increase the space taken by the motherboard, so it would require some case redesign. Still, Opterons and registered ECC DDR2 are both incredibly cheap right now.)
"Seagate ST31500341AS 1.5TB Barracuda 7200.11 SATA 3Gb/s 3.5″
Aargh! Those should definitely be replaced with 2 TByte WD RE4 drives.
Today I built a 32 TByte raw-storage Supermicro box with an X8DDAi board (dual-socket Nehalem, 24 GByte RAM, IPMI) and two LSI SAS3081E-R controllers, and OpenSolaris sees all the (WD2002FYPS) drives so far. (The board refuses to boot from DVD when more than 12 drives are in, though, probably due to some BIOS brain damage, so you have to manually build a raidz-2 with all 16 drives in it once Solaris has booted up.) The drives are about 3170 EUR total sans VAT for all 16, the box itself around 3000 EUR sans VAT. I presume Linux with RAID 6 would work too (haven't checked yet), and if you need more you can use a cluster FS.
Maybe not as cheap as a Backblaze pod, but off-the-shelf (BTO), and you get what you pay for.
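For anyone repeating this, the manual raidz-2 build is a one-liner once the OS sees the disks; the device names here are just examples (check yours with the format utility):

    # 16-disk raidz2 pool: any two drives can fail without data loss
    zpool create tank raidz2 \
        c7t0d0 c7t1d0 c7t2d0 c7t3d0 c7t4d0 c7t5d0 c7t6d0 c7t7d0 \
        c8t0d0 c8t1d0 c8t2d0 c8t3d0 c8t4d0 c8t5d0 c8t6d0 c8t7d0
    zpool status tank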
Do they offer S3-like storage? They should. If they can offer something like S3, but at one third of a penny per gigabyte per month (heck, let's splash out - a whole penny per gigabyte per month) I know quite a few people who'll be interested in talking to them... (including myself)
For example each of these "pods" only has 4GB of RAM. If I was doing lots of random I/O on 67TB of drives, I would want a heck of a lot more ram for caching efficiencies.
That's a strange choice; HTTPS would incur quite a bit of overhead for something that is essentially a (large) drive at the end of a network cable, used internally only. Why the encryption?
quote from the article:
"A Backblaze Storage Pod isn’t a complete building block until it boots and is on the network. The pods boot 64-bit Debian 4 Linux and the JFS file system, and they are self-contained appliances, where all access to and from the pods is through HTTPS. Below is a layer cake diagram.
Starting at the bottom, there are 45 hard drives exposed through the SATA controllers. We then use the fdisk tool on Linux to create one partition per drive. On top of that, we cluster 15 hard drives into a single RAID6 volume with two parity drives (out of the 15). The RAID6 is created with the mdadm utility. On top of that is the JFS file system, and the only access we then allow to this totally self-contained storage building block is through HTTPS running custom Backblaze application layer logic in Apache Tomcat 5.5."
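For the curious, a minimal sketch of that layer cake on the Linux side might look like this (device names and the mount point are placeholders, not Backblaze's actual scripts):

    # one partition per drive is created first with fdisk, as described above;
    # then 15 of those partitions become a single RAID6 array (13 data + 2 parity)
    mdadm --create /dev/md0 --level=6 --raid-devices=15 /dev/sd[b-p]1

    # JFS on top of the md device, mounted for Tomcat to serve over HTTPS
    mkfs.jfs -q /dev/md0
    mount /dev/md0 /mnt/vol0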
That's an odd choice for a storage server protocol stack.
To help prevent a network compromise from resulting in storage management compromise? Just because something is internal doesn't mean it's safe. Once a host/network segment is compromised, you don't want it to be easy to jump to the next.
Otherwise, you've built M&M security. Hard candy shell on the outside, soft gooey chocolate insides. Mmm.
Of course these days there are well-understood solutions to those problems, compared to when S3 first started, but it's still not stuff you can pull off the shelf. (Although with the Cassandra distributed database, the metadata problem is close to that point now. </plug>)
Hardware is cheap, power and developers aren't.
I'm excitedly posting a link to this on my personal site, and today I have a lot of phone calls to make to clients.
Why is it that the best products and services are also the hardest to find when you're looking for them?
Does anybody know of another company that provides this kind of solution? I don't trust other people to crawl my system and "encrypt" it for me.
EDIT: Added Question
rsync does exactly what you're looking for. You can also run it over SSH for encryption on the wire. Type "man rsync" at a shell for a description of how.
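Something along these lines (host and paths are made up) keeps a remote copy in sync, encrypted on the wire:

    # -a preserves permissions and times, -z compresses, -e ssh encrypts the transport
    rsync -az --delete -e ssh /home/me/ backups@backuphost.example.com:/backups/me/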
The need to stagger the power-on of the two supplies poses a problem. What if power to the data center is lost? When power is restored, all the boxes will try to start at once, blowing the fuses. Granted, this is a catastrophic event, so its frequency should be very low. But this also seems like an area that could be automated.
Some of the more expensive managed power strips also support a staggered power-on after a power failure. But I don't worry about it; only using 75% of the power circuit solves that problem for me.
Addressing this would require a little bit of design, but the problem is relatively simple. If they wanted to get fancy, they could add a chaining feature: pods on the same circuit would be connected together so that they'd power on serially. This would get away from their goal of using off-the-shelf parts. It is, as with many things, an engineering trade-off.
From there it would be easy enough to have an automated process turn the servers on one at a time.
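If the pods had IPMI (or sat behind a switched PDU), that automated process could be as dumb as a loop with a delay; the hostnames and credentials below are made up:

    # power pods on 30 seconds apart so the spin-up inrush never stacks
    for pod in pod01 pod02 pod03 pod04; do
        ipmitool -H "${pod}-ipmi" -U admin -P secret chassis power on
        sleep 30
    done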
I presume all accesses to these pods are from within their data center? Or do they directly expose these boxes to clients (whoa!)?
Why a Core 2? An Atom mobo would be lower power and cheaper.
Why 4GB? Seems like overkill.
They are using an HD to boot. Couldn't they boot off a USB key?
The cost comparison between raw drives, their custom solution, and Amazon S3, etc was a little skewed. S3 is designed for pay as you go storage so you're not paying for capacity you don't need. If you just need a few dozen gigabytes, it's a much better deal. If you need terabytes or a petabyte, a dedicated storage solution is more economical.
It's the same argument as vacation house vs timeshare. If you lived in a timeshare all year, it would cost more than buying the house.
Capricorn appears to be in the business of selling standard 4 drive per 1U setups and support. The fact you have to contact their sales department to even get close to a price seems to indicate they're not competing on price.
I didn't see anything in here that discusses that eventuality? And when you have that many servers, it IS going to happen at some point...
From the article:
When you run a datacenter with thousands of hard drives, CPUs, motherboards, and power supplies, you are going to have hardware failures; it's irrefutable. Backblaze Storage Pods are building blocks upon which a larger system can be organized that doesn't allow for a single point of failure.
I'd like to know what levels of warnings and alarms they use with which system, e.g. nagios, etc.
With that much data, single bit errors will happen predictably.
If their usage patterns are anything like ours, disk failures are well under normal rates. Most backup data is stored and ignored. The drives just don't get much stress. Downing a machine for maintenance is probably acceptable to their usage pattern.
I wonder what the reliability stats on this setup are, though. Is it really cheaper to jam all those drives into one unit without redundant PSUs, motherboards, or boot drives?
I'd guess you'd have to build at least 2 of these units and mirror them to get any sort of reliability. And, at that point, how long does it take to copy 58 TB over HTTPS?
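Ballpark answer to that question, assuming ~100 MB/s of sustained end-to-end throughput (gigabit minus HTTPS overhead):

    $ echo 'scale=1; 58 * 10^6 / 100 / 86400' | bc   # days to copy 58 TB at 100 MB/s
    6.7

Call it a week per pod, before any other load on the box.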
data is hard.
Now, for the rest of the hardware, it's not that important if it fails. If one of the other components dies, you're only looking at some downtime (and possibly a dead hard drive or two from a dying PSU, which I assume they monitor regularly). As long as the data is secure and in one piece, it doesn't really matter whether the pod is up or down until someone needs the data. Just send out your repair guy to replace the part and reboot, and it's fine.
If a system goes down and they have to replace a disk or a power supply or motherboard, the data is still safe, and if by chance there's any $5/month users who need to restore data held on that one unit, well they can wait a few hours :-)