
Petabytes on a budget: How to build cheap cloud storage - bensummers
http://blog.backblaze.com/2009/09/01/petabytes-on-a-budget-how-to-build-cheap-cloud-storage/
======
sh1mmer
One of the smart guys in our (Y!) cloud team just pointed out something to me
that hadn't occurred to me.

This system is definitely optimized for backup. That totally makes sense for
Backblaze. However, it's important not to compare it like for like with
something like S3, which is optimized for much better read/write performance.

At the basic level the cooling on this system seems minimal. Those tightly
packed drives would sure get hot if they were all spinning a lot. More than
that, since they are using commodity consumer hardware, and they already used
up their PCIe slots for the SATA controllers, there isn't any place to add
anything more than the gigabit (I assume) ethernet jack on the mobo. That
means their throughput is limited.

Again, this is a great system for backup. Most of the data will just sit
happily sipping little power. However, if you are thinking of this as
equivalent to a filer, that's an unfair comparison.

~~~
Andys
Fast 120mm fans can move a pretty decent amount of air - up to 120 ft^3/min
each, and they used three in parallel.

It would take about a dozen 80mm normal-speed computer fans to reach this.
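
Roughly, assuming an ordinary 80mm fan moves something like 30 CFM (that
figure is an assumption, not from the article):

    # Airflow comparison: three fast 120mm fans vs. ordinary 80mm fans.
    # 120 CFM per 120mm fan comes from the comment above; ~30 CFM per 80mm fan is assumed.
    cfm_per_120mm_fast = 120
    cfm_per_80mm_normal = 30

    total_cfm = 3 * cfm_per_120mm_fast             # three fans in parallel
    print(total_cfm)                               # 360 CFM
    print(total_cfm / cfm_per_80mm_normal)         # ~12 ordinary 80mm fans to match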

------
dmillar
This is unbelievably awesome. Can you imagine what we could achieve if every
company had this level of transparency?

~~~
PStamatiou
Thinking the exact same thing! While the technology isn't more than any typical
hacker can slap together in their apartment (minus maybe the case fab), I
think the application is ingenious and the full blog post about it is, as
dmillar pointed out, rather awe-inspiring. And I was all happy because my
new Core i7 box has 2TB of space and 6GB of RAM... sigh

That being said - did Seagate ever fix their firmware issue on the 1.5TB
drives that would cause random corruption? (I heard about it maybe 6 months ago)

~~~
jbellis
> While technology isn't more than any typical hacker can slap together in
> their apartment

I think you missed the part about testing a dozen SATA cards, etc.

The attention to detail here is a lot more than something you'd slap together
in your apartment.

~~~
Periodic
This is what always gets in my way. It takes a lot of work and a lot of
expertise to put together a home-grown system that works as well as one from
the major vendors. If you're only going to be using one or two, you're much
better off going to one of those major vendors because a large part of the
price is their expertise and testing that went into it. For a large setup like
Backblaze they can spread the cost of design over many systems, but for
smaller companies it just isn't feasible.

We hacker types love to think that we could do the same thing in no time with
little budget, and I'm sure we could get a first approximation. But the devil
is in the details. Debugging the complex interaction of 20 different hardware
components is not my idea of fun.

Hats off to them, particularly for sharing.

~~~
uhgygghhj
I did this as an in-house backup for a data warehousing app. Just slapping 4
IDE cards into a case and putting 16x250GB IDE drives on them resulted in a
system that would copy about one disk's worth of data before hanging with some
fault or suddenly dropping to 1% speed.

Just because you can in theory hook 40 drives to n cards doesn't mean it will
work - well done to them

------
edw519
I always thought it was best to focus on what I did best (application
software) and leave the infrastructure to others. Until I saw this:

    
    
      Raw Drives      $81,000
      Backblaze      $117,000
      Dell           $826,000
      Sun          $1,000,000
      NetApp       $1,714,000
      Amazon       $2,806,000
      EMC          $2,860,000
    

I had no idea. Kinda makes one rethink what business they want to be in.

~~~
cperciva

      Backblaze      $117,000
      [...]
      Amazon       $2,806,000
    

I cry foul. Backblaze's "67 TB" pods actually only hold 58.5 TB, so their
hardware cost per PB of storage is $134k, not $117k; and that's without any
high-level redundancy. Servers fail -- both catastrophically, and by silently
corrupting bits -- and Backblaze's $134k / PB doesn't have any protection
against that. Datacenters also fail -- power outages, cut fibre, FBI raids,
etc. -- and any system which stores all of its data in a single datacenter
lacks any protection against that. Store each file on two different servers in
each of two different datacenters, and suddenly Backblaze's $134k turns into
$536k. The price for Amazon, in contrast, is based on the assumption that
their prices remain fixed for the next 3 years -- which seems a rather radical
assumption.
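
A quick back-of-the-envelope check of those numbers (a sketch; the 58.5 TB
usable figure is from this comment and the $117k headline is from the article):

    # Adjusting Backblaze's headline $/PB for usable capacity and replication.
    pod_raw_tb = 67.0                 # advertised capacity per pod
    pod_usable_tb = 58.5              # usable capacity after RAID6 parity
    headline_cost_per_pb = 117000     # Backblaze's $/PB figure, based on raw capacity

    cost_per_usable_pb = headline_cost_per_pb * pod_raw_tb / pod_usable_tb
    print(round(cost_per_usable_pb, -3))      # ~134,000

    # two servers in each of two datacenters -> four copies of every byte
    print(round(4 * cost_per_usable_pb, -3))  # ~536,000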

Is Backblaze's solution cheaper than S3? Absolutely. But they're also twisting
the numbers a bit.

~~~
skolor
Interesting, but Amazon doesn't guarantee most of those things. They guarantee
99.99% uptime, but that isn't counting a complete datacenter failure. In
fact, it sounds to me as if they have a similar setup to the Backblaze
people.

99.99% uptime means roughly 1 hour per year of downtime. I don't know what the
specific failure rates on the components are, but it seems reasonable that A)
the data drives are hot-swappable, and will not cause downtime when they are
replaced and B) the rest of the components fail once a year (or less) and take
~10 minutes to replace and reboot the system. With 4 main points of failure
(PSU, Boot Drive, Motherboard/Ram/CPU, Drive controllers), as long as you have
staff constantly on call and they can respond within 5 minutes of a failure,
99.99% uptime seems reasonable.
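
For reference, the downtime budget that SLA implies (a quick sketch, ignoring
leap years):

    # Hours of downtime per year allowed by a 99.99% uptime guarantee.
    hours_per_year = 365 * 24
    print((1 - 0.9999) * hours_per_year)   # ~0.88 hours, a little under an hour per year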

I don't know where you got the idea that your data was on 4 different servers
when using S3. I can't find even the slightest amount of information on that.
Yes, that would be nice, and it is cool to think that, but it's rather doubtful
that they're actually doing that (or they could probably add another 9 to
their uptime).

~~~
lecha
Re. geographical redundancy, see comment
<http://news.ycombinator.com/item?id=422574>:

"Amazon keeps at least 3 copies of your data (which is what you need for high
reliability) in at least 2 different geographical locations. "

I can't find the original source supporting that statement, but I also know it
to be true based on direct contact with the AWS team. (I've been using AWS
since the private alpha of EC2 in 2005.)

~~~
skolor
I can't find that anywhere either, and would be interested in seeing it. In
fact, all I can find is: _A bucket can be located in the United States or in
Europe. All objects within the bucket will be stored in the bucket’s location,
but the objects can be accessed from anywhere._ which seems to imply that your
data is located in one location, not 2.

In addition, looking at the actual S3 contract, they're really only
guaranteeing 99.9% uptime, which allows for up to 8 hours of downtime a
year, more than enough to completely re-build the outer server once a year, as
long as they can keep the data intact (which they seem more than capable of
with their setup, once again assuming their data center is not completely
destroyed).

------
gstar
They might have missed a trick here.

To address vibration, acoustics and gyroscopic effect, what I've seen done in
highly dense enclosures is to rotate every second drive around 180 degrees in
a bit of a shotgun approach to balancing stuff.

Still, awesome.

~~~
ciupicri
That sounded OK to me too, but when I thought about it a bit I realized that
the drives would have to be synchronized so that each one would "cancel" the
others' vibrations. The vibrations generated by the drives might indeed have
the same amplitude and (spatial and temporal) frequency, but I doubt that
they'll have the same phase offset. For that to happen they would need to be
started at _exactly_ the same time, which I don't think happens in practice.

Do you have some pictures of those enclosures?

~~~
gstar
I don't have pictures. I can describe, though.

Two disks were mounted in a frame linearly, both screwed to the frame, with
the power/sata connectors toward the middle of the frame, and one drive
upside-down.

These cassettes were removable as a unit for hot-swap, and were inserted
linearly into a half-deep 19" rack enclosure.

That the two drives were physically connected to the same frame, and removed
and replaced as a unit would make it seem as if they would be started in
phase. Now I'm no physicist, but I'm not 100% sure that's so important - if
you have two gyroscopes contra-rotating and firmly connected, running at the
same speed, surely they resist movement by sheer gyroscopic effect?

------
tsuraan
67TB of storage, with 4GB of cache. I'd really love to see some performance
numbers versus the way-too-expensive competition. If the systems are being
used as tape drive replacements, I could see this working well, but as an
actual NAS-like device, I can't imagine how it could perform acceptably. Of
course, if those Intel motherboards have the dual 1Gb/s NICs that Intel boards
generally do, it will probably take a while to fill the drives anyhow.

~~~
notaddicted
<http://www.wolframalpha.com/input/?i=(67+TB)+%2F+(2*(1+Gb%2Fs)+)>

It'll take 3 days at the theoretical max of the networking equipment to
read/write the 67TB. The overhead of HTTPS constrains the throughput further,
so this is an underestimate.
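
The underlying arithmetic, for anyone who doesn't want to follow the link (a
sketch using decimal terabytes and assuming both NICs run at full line rate):

    # Time to move 67 TB over dual gigabit links at theoretical maximum throughput.
    bits = 67 * 1e12 * 8            # 67 decimal TB expressed in bits
    link_bps = 2 * 1e9              # two 1 Gb/s NICs, fully saturated
    print(bits / link_bps / 86400)  # ~3.1 days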

I'd expect that their internet connection (i.e. in/out of the data center) is
the real bottleneck.

I believe that the system _is_ being used as a tape drive replacement.

~~~
lsc
If the load is relatively sequential, then yeah. But if the load is mostly
random, then there is an argument for having more cache.

------
eleitl
This made the Beowulf list yesterday, and below is what I wrote in response:

"Seagate ST31500341AS 1.5TB Barracuda 7200.11 SATA 3Gb/s 3.5″ Aargh! Should be
definitely substituted by 2 TByte WD RE4 drive.

Today I've built a 32 TByte raw storage Supermicro box with X8DDAi (dual-
socket Nehalem, 24 GByte RAM, IPMI), two LSI SAS3081E-R and OpenSolaris sees
all (WD2002FYPS) drives so far (the board refuses to boot from DVD when more
than 12 drives are in, though, probably due to some BIOS brain damage, so you
have to manually build a raidz-2 with all 16 drives in it once Solaris has booted
up). The drives are about 3170 EUR sans VAT total for all 16, the box itself
around 3000 EUR sans VAT. I presume Linux with RAID 6 would work (haven't
checked yet), too, and if you need more you can use a cluster FS.

Maybe not as cheap as a Backblaze, but off-the-shelf (BTO) and you get what you
pay for."
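
For what it's worth, the manual pool creation mentioned above boils down to one
command once the OS is up; here is a minimal sketch, with the pool name and
Solaris-style device names made up for illustration:

    # Build a single raidz-2 pool out of all 16 drives (device names are placeholders).
    import subprocess

    devices = ["c%dt%dd0" % (ctrl, tgt) for ctrl in (1, 2) for tgt in range(8)]
    subprocess.run(["zpool", "create", "tank", "raidz2"] + devices, check=True)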

------
swombat
Wow.

Do they offer S3-like storage? They should. If they can offer something like
S3, but at one third of a penny per gigabyte per month (heck, let's splash out
- a whole penny per gigabyte per month) I know quite a few people who'll be
interested in talking to them... (including myself)

~~~
jacquesm
Apparently they use HTTPS for input/output using Tomcat and some custom
application.

That's a strange choice; HTTPS would incur quite a bit of overhead for
something that is essentially a (large) drive at the end of a network cable
used internally only. Why the encryption?

quote from the article:

"A Backblaze Storage Pod isn’t a complete building block until it boots and is
on the network. The pods boot 64-bit Debian 4 Linux and the JFS file system,
and they are self-contained appliances, where all access to and from the pods
is through HTTPS. Below is a layer cake diagram.

Starting at the bottom, there are 45 hard drives exposed through the SATA
controllers. We then use the fdisk tool on Linux to create one partition per
drive. On top of that, we cluster 15 hard drives into a single RAID6 volume
with two parity drives (out of the 15). The RAID6 is created with the mdadm
utility. On top of that is the JFS file system, and the only access we then
allow to this totally self-contained storage building block is through HTTPS
running custom Backblaze application layer logic in Apache Tomcat 5.5."

That's an odd choice for a storage server protocol stack.
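
For a rough picture of the layer cake that quote describes, here is a minimal
sketch of the storage layers (device names are placeholders, the per-drive
fdisk step is skipped, and this is not Backblaze's actual tooling):

    # 45 data drives, grouped into three 15-drive RAID6 arrays with mdadm,
    # each then formatted with JFS, as described in the quoted article.
    import subprocess

    drives = ["/dev/sd_data%02d" % i for i in range(45)]   # placeholder device names

    for vol in range(3):                                   # three RAID6 volumes per pod
        members = drives[vol * 15:(vol + 1) * 15]
        md = "/dev/md%d" % vol
        subprocess.run(["mdadm", "--create", md, "--level=6",
                        "--raid-devices=15"] + members, check=True)
        subprocess.run(["mkfs.jfs", "-q", md], check=True)  # JFS on top of each array

The HTTPS/Tomcat application layer then sits on top of those three filesystems.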

~~~
antonovka
_That's a strange choice; HTTPS would incur quite a bit of overhead for
something that is essentially a (large) drive at the end of a network cable
used internally only. Why the encryption?_

To help prevent a network compromise from resulting in storage management
compromise? Just because something is internal doesn't mean it's safe. Once a
host/network segment is compromised, you don't want it to be easy to jump to
the next.

Otherwise, you've built M&M security. Hard candy shell on the outside, soft
gooey chocolate insides. Mmm.

------
thaumaturgy
I have been searching high and low for something like this for months -- I
have a number of clients that have been begging me for this, as well as
something for my own needs.

I'm excitedly posting a link to this on my personal site, and today I have a
lot of phone calls to make to clients.

Why is it that the best products and services are also the hardest to find
when you're looking for them?

------
ShabbyDoo
What terrified me was the number of low-level issues they had to address. SATA
protocol problems, custom-designed SATA cards? Wow. I'd have no idea how to
begin with that stuff. As other posters have noted, this company's business is
storage. How many PB must one host for such specialization to make sense?

------
Bjoern
What a pity. I'd like to have a way to store just an encrypted backup directly
(via ssh) instead of installing an app which crawls my system. Oh, did I
mention I use Linux? _sigh_

Does somebody know any other company which provides this solution? I don't
trust other people to crawl my system and to "encrypt" it for me.

EDIT: Added Question

~~~
bestes
Did you look at tarsnap (<http://www.tarsnap.com/>)? I don't use it, but read
about it here quite a bit and it sounds really great.

------
datums
Service Idea: S3 offsite backups. Run <http://www.eucalyptus.com/open/> on
this kind of hardware.

------
ALee
These guys are very very good at what they do. Btw folks, no funding and
they're already profitable. Gleb and crew rock!

------
preview
I wonder about the single point of failure posed by the power supplies. One
failed box is not a big deal (since I assume the data is replicated over
several). But, what if they get a bad batch of supplies and see a relatively
high failure rate? I wonder how high a power supply failure rate they can
handle.

The need to stagger the power-on of the two supplies poses a problem. What if
power to a data center is lost? When power is restored, all the boxes will try
to start at once, blowing all the fuses. Granted, this is a catastrophic event, so its
frequency should be very low. But, this also seems like an area that could be
automated.

~~~
lsc
You should only use 75% of the rated capacity of a circuit, which means you
have enough power to turn them all on at once.

Some of the more expensive managed PDUs also support a staggered
power-on after power fail. But I don't worry about it; only using 75% of the
power circuit solves that problem for me.

~~~
preview
Unfortunately, using 75% of the rated capacity may not be enough to handle the
inrush. The article discusses this point, "...if you power up both PSUs at the
same time, it can draw a large (14 amp) spike of 120V power from the socket."
That would mean one pod per 20A circuit. Ugh. In normal operating conditions,
a 5.6A max load would allow three pods per circuit.
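
The arithmetic, using the figures from the article and this comment (no
continuous-load derating applied):

    # Pods per 20A circuit under spike vs. steady-state load.
    circuit_amps = 20
    spike_amps = 14        # both PSUs powered on at once
    running_amps = 5.6     # steady-state draw per pod

    print(int(circuit_amps // spike_amps))     # 1 pod per circuit if PSUs spike together
    print(int(circuit_amps // running_amps))   # 3 pods per circuit at steady state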

Addressing this would require a little bit of design, but the problem is
relatively simple. If they wanted to get fancy, they could add a chaining
feature--pods on the same circuit would be connected together so that they'd
power on serially. This would get away from their goal of using off-the-shelf
parts. It is, as with many things, an engineering trade-off.

~~~
lsc
The PDUs that support 'staggered power-on' are 'off the shelf' - if that is
not an option (really, we're talking maybe $500 for each 20A circuit, retail),
the next thing I'd do is set 'power on after power fail' to off, then have
some remotely accessible way to trip the power button. (I'm working on a
solution to that particular problem, but that's not 'off the shelf' - yeah,
everything is on PDUs I can trip remotely, but there are reasons why it is
much better to ungracefully reboot with the 'reset' jumper than to
ungracefully reboot via cutting off the power.)

From there it would be easy enough to have an automated process turn on the
servers one at a time.
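
Something like the following, assuming some form of out-of-band power control
is available (IPMI here, but a switched PDU would do); all the host names and
credentials are placeholders:

    # Power pods on one at a time, pausing between them so the inrush spikes
    # never overlap. Purely illustrative; not anyone's actual setup.
    import subprocess
    import time

    PODS = ["pod-01.example", "pod-02.example", "pod-03.example"]

    for host in PODS:
        subprocess.run(["ipmitool", "-I", "lanplus", "-H", host,
                        "-U", "admin", "-P", "changeme",
                        "chassis", "power", "on"], check=True)
        time.sleep(30)   # let the power-on spike settle before the next pod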

------
psranga
Great stuff. Would anybody care to clarify why access is through HTTPS and not
HTTP?

I presume all accesses to these pods are from within their data center? Or do
they directly expose these boxes to clients (whoa!)?

~~~
phsr
Backblaze runs online, off-site backups (like Mozy and Carbonite) at $5/month
for unlimited storage. HTTPS is used to keep client data protected, I assume.

------
durana
The price comparison with other solutions isn't really fair. The cost per PB
of these storage building blocks is directly compared to complete storage
solutions from companies like NetApp and EMC. In the end the cost of the
complete solution they assemble these building blocks into may well be cheaper
than solutions from other companies, and that's the number they should be using
for comparison.

------
modoc
I love how open they are about their solution, and I also want one for
Christmas:)

------
jwilliams
Awesome. Brings up a few questions for me (sure there are answers, just
curious really).

Why a Core 2? An Atom mobo would be lower power and cheaper. Why 4GB? Seems
like overkill. They are using an HD to boot. Couldn't they boot off a USB
key?

~~~
yellowbkpk
I don't know for sure, but there might not be an Atom-based board that has as
many PCI/PCIe slots.

~~~
jwilliams
Good point. The one I've got only has one PCI in fact.

------
byoung2
Very interesting insight! I may try building a scaled-down version just for
fun.

The cost comparison between raw drives, their custom solution, and Amazon S3,
etc was a little skewed. S3 is designed for pay as you go storage so you're
not paying for capacity you don't need. If you just need a few dozen
gigabytes, it's a much better deal. If you need terabytes or a petabyte, a
dedicated storage solution is more economical.

It's the same argument as vacation house vs timeshare. If you lived in a
timeshare all year, it would cost more than buying the house.

------
pmorici
Seems along the same lines as Capricorn Tech.
<http://www.capricorn-tech.com/products.php>

~~~
ggruschow
Not really. BackBlaze's post is about how to get on-line storage up for the
lowest cost per bit. They appear to have done a great job at quadrupling the
density of standard 1U setups for a similar price.

Capricorn appears to be in the business of selling standard 4 drive per 1U
setups _and support_. The fact you have to contact their sales department to
even get close to a price seems to indicate they're not competing on price.

------
dabeeeenster
What happens when one of the PSUs fails catastrophically, pushing a big
electrical surge through the hard disks and frying half of them?

I didn't see anything in here that discusses that eventuality. And when you
have that many servers, it IS going to happen at some point...

~~~
papercrane
It's addressed in the section "A Backblaze Storage Pod is a Building Block"

From the article:

When you run a datacenter with thousands of hard drives, CPUs, motherboards,
and power supplies, you are going to have hardware failures; it's irrefutable.
Backblaze Storage Pods are building blocks upon which a larger system can be
organized that doesn’t allow for a single point of failure.

~~~
dabeeeenster
Sounds to me like they haven't implemented this yet. I wouldn't want my data
on that sort of solution. Backed up data would surely need a level of
geographic redundancy.

~~~
papercrane
In the next section they say they have implemented the machine-level
redundancy. As for geographic redundancy, I think the idea is that you use them
for that. As in, you back up your data locally and then send it to them as a
redundant copy.

------
sh1mmer
I'd love to hear what they are doing to monitor their system. They didn't
mention that. Three out of 15 drives failing and taking out a volume seems
quite likely at that scale.

I'd like to know what levels of warnings and alarms they use with which
system, e.g. nagios, etc.

------
sschueller
How do you replace a drive? With so many drives it seems like a lot of effort
to pull the entire server out to just access one drive.

~~~
papercrane
They have three raid-6 volumes per machine, so they probably leave it in the
rack until two of the volumes are unusable and then refurbish the machine.

~~~
rarrrrrr
I doubt it very much. When drives start failing out of a RAID6 array, you lose
its ability to automatically detect and correct single bit errors on disk.

With that much data, single bit errors will happen predictably.

If their usage patterns are anything like ours, disk failures are well under
normal rates. Most backup data is stored and ignored. The drives just don't
get much stress. Downing a machine for maintenance is probably acceptable to
their usage pattern.

------
tricky
effing love this.

I wonder what the reliability stats on this setup are, though. Is it really
cheaper to jam all those drives in one unit without redundant PSUs, MBs, or
a boot drive?

I'd guess you'd have to build at least 2 of these units and mirror them to get
any sort of reliability. And, at that point, how long does it take to copy 58
TB over HTTPS?

data is hard.

~~~
skolor
I would assume they don't care about any of the hardware, just the data. If
you look at the setup, and how the drives are raided, there 45 drives are sub-
divided into 15 raid arrays, of which it would take 3 to die before they lost
data. Essentially, they would need 20% of their drives to die simultaneously
for them to lose data.

Now, for the rest of the hardware, its not that important to if it fails. If
one of the other components die, you're only looking at some down time (and a
possible dead hard drive or two from a PSU dying, which I assume they monitor
regularly). As long as the data is secure and in one piece, it doesn't really
matter whether the pod is up or down, until someone needs the data. Just send
out your repair guy to replace it and reboot it, and its fine.

------
silverscreen
Great read! Why a separate boot disk? Why can't that be on the storage array?

