
Under the hood: Facebook’s cold storage system - slyall
https://code.facebook.com/posts/1433093613662262/-under-the-hood-facebook-s-cold-storage-system-/?hn=1
======
bkeroack
"...since we were using low-end commodity storage that was by no means
enterprise-quality."

If Facebook (and I believe it's similar with other big players like Google and
Amazon) are not using "enterprise grade" hardware (because it doesn't make
economic sense at scale), who is? And more importantly, _why_? Why do
"enterprise grade" products even exist if the largest and most deep-pocketed
corporations don't find them to be good value propositions?

~~~
slyall
Let's say your enterprise division/company wants to store 100 terabytes of data.
You ring up NetApp and buy a solution for, say, $100,000 and plug it in. You get
support, lots of documentation, and it is usually fairly reliable.

Now if you are Google or Facebook you want 100 petabytes of storage, but
instead of paying NetApp $100 million you employ 20 really smart engineers to
build you something. You put it on cheap hardware and you build exactly what
your software will talk to and exactly what it needs.

It costs you $10 million up front, but the hardware cost is just $15 million
because you use cheaper drives and build thousands of your own servers.
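
Back-of-the-envelope with those (made-up) numbers:

    # Illustrative only: the dollar figures are the hypothetical ones above,
    # not real NetApp or Facebook pricing.
    appliance_cost = 100_000          # $ for a 100 TB turnkey appliance
    appliance_tb = 100

    diy_engineering = 10_000_000      # $ up-front engineering
    diy_hardware = 15_000_000         # $ commodity drives and self-built servers
    diy_tb = 100 * 1000               # 100 PB expressed in TB

    print(appliance_cost / appliance_tb)              # 1000.0 $/TB
    print((diy_engineering + diy_hardware) / diy_tb)  #  250.0 $/TB

Even with the engineering folded in, the in-house build works out roughly four
times cheaper per terabyte at that scale, and the engineering is a one-off cost.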

Of course, none of this means that "Enterprise Storage" won't be the next
premium product to get hit by open-source-style solutions.

~~~
mbubb
I guess it depends how close to bare metal you want to get.

Google/Facebook/AWS (and maybe a Backblaze) have architected their data center
setup to not require the functionality that NetApp provides. If you buy a
NetApp and plug it into your network, it is because you can't architect a
datacenter from the ground up...

Facebook was a founding member of the Open Datacenter Alliance. They have a
cool GitHub repo and put out interesting whitepapers like:

[http://www.opendatacenteralliance.org/docs/architecting_clou...](http://www.opendatacenteralliance.org/docs/architecting_cloud_aware_applications.pdf)

It is not just cheaper but architected better. They would use Cisco and other
enterprisey stuff if it were better, but it is a case of "il y a moins bien
mais c'est plus cher" ("there is worse, but it costs more"), as the French
Linux motto goes...

Which is not to say NetApp or Cisco is bad - just that you can engineer the
datacenter to make it work better without enterprise functionality.

As an object lesson in failure, I had an experience with my first buildout of
Hadoop back in 2011. It was not very successful, because I bought all the
wrong components: RAID cards, 10G networking, HP and Bladenetwork rack
switches, dual power supplies.

It was hard to unlearn how I was used to setting up servers.

~~~
maximilianburke
Off topic but how would you have done your Hadoop buildout differently?

~~~
mbubb
Wow - still a bit painful to remember my first go-around with this. What I am
about to admit is damning enough to keep me unemployed for the foreseeable
future, so hopefully no hiring managers are reading this...

I was a sysadmin and did the datacenter buildouts for the company. The task
was to take the dev setup in Softlayer (on bare metal servers) and reproduce
it in our racks.

The error was to take this somewhat literally. The bare metal servers that we
leased from Softlayer were 2U boxes with redundant power supplies, RAID
controllers, and 10G networking.

I found a Supermicro configuration that almost identically matched and ordered
30 datanodes; 2 name nodes (this was back when Hadoop used the curiously named
primary and secondary name nodes); and a server to launch jobs from.

We already had a Netezza in place, and since that used Bladenetwork switches I
decided to use the same as ToR switches, with a beefier one as the aggregation
switch.

We decided to use Ubuntu as the OS and the Cloudera packages - but without the
support or the console.

Every single one of those choices was a mistake. It is remarkable in hindsight
that it worked at all.

The mistakes:

1) Starting with the power supplies. Since they were redundant, they dictated
an A/B power setup in the racks. What this means is that you cannot use more
than 50% of the power density, because the entire rack is set up to fail over.
Each PDU has to be able to keep the whole rack up, so it alarms at 40%
capacity. By using redundant power supplies I was more than halving the amount
of power I had at my disposal (rough numbers are sketched after this list).

2) RAID. I hadn't yet read the excellent Hadoop Operations O'Reilly book,
which tells you why RAID is a bad idea for HDFS. Further buildouts included
ripping out the RAID card and JBODing the drives, allowing Hadoop to properly
use the raw disks.

3) 10G networking - because more is better, right? Later, after we hired
talented networking folks, I learned a bit about how important buffering is in
switches - particularly for Hadoop. My expensive monster Bladenetwork switches
were falling over because the bursty traffic would saturate them. In addition,
I had the networking set up for a much more conventional network: we were
doing the switch-aware stuff in the config but had a wide-open /16 as the
network. I learned about properly segmenting to avoid unnecessary cross-talk.

4) Ubuntu - never listen to developers. ;) Actually, it was our default OS. I
would later find out that CentOS works better for Hadoop.

5) Lack of support. Cloudera is expensive but - unless you are already an
expert - worth it. We should have gotten help early on.

6) Hardware choices. Supermicro was a bad choice: penny wise, pound foolish.
They changed hardware configurations and it was hard to get replacements, etc.
Can't say enough good things about working with PSSC Labs for hardware,
though. I already mentioned the switches; I learned about Arista switches,
which are amazing for this application.
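
To put rough numbers on the power point in 1) (the feed sizes below are
hypothetical; the 40% alarm threshold is the one mentioned above):

    # Toy model of A/B redundant rack power (both feed sizes are hypothetical).
    feed_kw = 5.0                      # capacity of each of the two PDU feeds
    installed_kw = 2 * feed_kw         # 10 kW of installed capacity

    # With failover, either feed alone must be able to carry the whole rack,
    # so the load can never exceed one feed's capacity (50% of installed)...
    failover_cap_kw = feed_kw

    # ...and if each PDU alarms at 40% of its capacity (to leave headroom for
    # picking up the other feed's share), the practical ceiling is lower still.
    alarm_cap_kw = 2 * 0.40 * feed_kw  # 4.0 kW

    print(f"{alarm_cap_kw / installed_kw:.0%} of installed power is usable")  # 40%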

By the time I left that company, I had learned an enormous amount from really
smart engineers who knew gobs more about Hadoop and networking than I did.

I think the number of datanodes was up over 200, and we had a lean 1U datanode
from PSSC Labs with a single power supply, 12 drives in JBOD, and onboard
flash for the OS. Bonded 1G networking ran up to Arista ToRs and multiple
Arista agg switches in a leaf/spine topology.

It was a thing of beauty.

~~~
neurotech1
Most people wouldn't believe the hacked together servers that Google relied on
in the early days.

[http://en.wikipedia.org/wiki/History_of_Google#Beginning](http://en.wikipedia.org/wiki/History_of_Google#Beginning)

------
meesterdude
I remember par2 from Usenet
([http://www.quickpar.org.uk/](http://www.quickpar.org.uk/)); it seems like a
similar/same concept, and something I've been wondering about for a while for
storage. Glad to see it being implemented at such a scale! It makes good
sense, though I would be curious how corruption/rebuild plays out over time.

Also impressive how they shaved off all those extra power requirements and
built in expected unreliability. And really, utility power is pretty darn
reliable anyway, mixed with non-customer-facing storage. Good move.

~~~
fla
Reed–Solomon error correction is one of these things that still feels like
black magic to me.

~~~
qrmn
They're not even the most arcane. Consider reading up on rapid tornado
(Raptor) codes, other fountain codes, Goppa codes… there are whole families of
error-correcting codes with different sets of tradeoffs. And the decoding can
be enlightening, too: the Viterbi algorithm and its progeny can be
surprisingly useful.

Then blow your mind even further by looking up the principles of the McEliece
cryptosystem (but don't use it as originally specified - modern developments
have weakened and refined it a lot).

I warn you: this is one of those down-the-rabbit-hole subjects that has a
deliciously large amount of literature to soak up, you could find yourselves
with bookshelves full of it. (An alarming number of patents, too, for the
unwary.)
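
If you want a first toehold on Reed-Solomon itself before diving in, the
underlying trick is just polynomial interpolation: k data symbols pin down a
degree-(k-1) polynomial, you store its value at n > k points, and any k
surviving points give everything back. A toy sketch over a small prime field
(purely illustrative; real codes work in GF(2^8) with far faster algorithms):

    # Toy erasure code in the Reed-Solomon spirit.  The k data symbols are the
    # values of a degree-(k-1) polynomial at x = 0..k-1; extra shares are the
    # same polynomial evaluated at x = k, k+1, ...  Any k surviving (x, y)
    # pairs recover all of the data.
    P = 257  # all arithmetic is modulo a small prime

    def eval_at(points, x):
        """Evaluate, at x, the unique polynomial through the given (xi, yi) points."""
        total = 0
        for i, (xi, yi) in enumerate(points):
            num, den = 1, 1
            for j, (xj, _) in enumerate(points):
                if i != j:
                    num = num * (x - xj) % P
                    den = den * (xi - xj) % P
            total = (total + yi * num * pow(den, P - 2, P)) % P  # Fermat inverse
        return total

    def encode(data, n):
        pts = list(enumerate(data))                         # systematic shares
        return pts + [(x, eval_at(pts, x)) for x in range(len(data), n)]

    def decode(surviving, k):
        return [eval_at(surviving[:k], x) for x in range(k)]

    shares = encode([11, 22, 33, 44], n=7)                  # survives any 3 losses
    assert decode(shares[3:], k=4) == [11, 22, 33, 44]      # first 3 shares lost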

~~~
fla
Stop killing my productivity! :)

------
louwrentius
"For example, one of our test production runs hit a complete standstill when
we realized that the data center personnel simply could not move the racks.
Since these racks were a modification of the OpenVault system, we used the
same rack castors that allowed us to easily roll the racks into place. But the
inclusion of 480 4 TB drives drove the weight to over 1,100 kg, effectively
crushing the rubber wheels. This was the first time we'd ever deployed a rack
that heavy, and it wasn't something we'd originally baked into the development
plan!"

That's just fun.

In some ways I'm amazed that none of the smart people at FB thought of the
weight of this stuff. But then again, I have probably made sillier mistakes
myself on a much smaller scale.

~~~
mbubb
Fascinating stuff.

My guess is that they were definitely aware of the weight tolerance of the
floor and the tiles, but it looks like the weak spot was the caster wheels on
the bottom of the rack. I wonder how they eventually got it out. They probably
used one of those rack jacks to get a stronger dolly underneath... or removed
the HDDs and then moved it.

This made me wonder about something else - the power density. They say the
layout:

"can support up to one exabyte (1,000 PB) per data hall. Since storage density
generally increases as technology advances, this was the baseline from which
we started. In other words, there's plenty of room to grow."

I wonder how they account for increased energy consumption as they put in
larger power supplies to handle the power needed to run and cool the larger
HDDs. I imagine in this layout everything is in proportion: CPU / RAM / disk
space. It must be hard to accurately target growth within the constraints of a
limited power footprint.

~~~
andyidsinga
AFAIK, as density per disk increases (e.g. going from 1 TB to 4 TB), power
efficiency per TB should get better too, as long as they are not also
increasing the total number of spindles by adding new storage nodes. One thing
I'm not sure of is whether _total load_ for the hall increases, as more
accesses may occur because of the higher density.
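
Rough numbers to show why (the wattage is a ballpark guess, not a measured
figure):

    # Spindle power is roughly flat per drive, so capacity per spindle is what
    # pushes watts-per-TB down.  The wattage below is a ballpark guess.
    drive_watts = 6.0   # a 3.5" drive draws about the same whether 1 TB or 4 TB

    for capacity_tb in (1, 4):
        print(f"{capacity_tb} TB drive: {drive_watts / capacity_tb:.1f} W/TB")
    # 1 TB drive: 6.0 W/TB
    # 4 TB drive: 1.5 W/TB  -> same spindle count, ~4x less power per TB stored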

The hard disk power management part is really fascinating in terms of keeping
the storage cluster software alive and well, able to respond to requests while
keeping the hard disks off. A low-power SSD cache tier on the front may be
part of the solution.

------
RyJones
It looks like every four days Facebook gets a Flickr's worth of photos.

~~~
dummyfellow
Seems FB will be bigger than Google once they launch the video service.

~~~
thrownaway2424
Doubtful. This article describes only 2EB of storage, while Google was
estimated to have 15EB of storage two years ago.

Another comparison with Google would be regarding the erasure coding technique
described in this blog post. Google offhandedly mentioned using it in a 2010
presentation. Is this really Facebook's first deployment of erasure coding?

~~~
brettproctor
No. We published an article about doing RAID in HDFS in June 2014:
[https://code.facebook.com/posts/536638663113101/saving-
capac...](https://code.facebook.com/posts/536638663113101/saving-capacity-
with-hdfs-raid/)

------
devonkim
Even the companies that have $100M+ allocated for storage won't necessarily be
able, nor even want, to hire the kind of developers who could create their own
data storage system in-house - they've already invested so much of their IT
staff into their SAN infrastructure. Most companies outside tech with $100M+
available for IT will just spend it on better developers for things that offer
better growth potential, like big-data analytics or whatever, because IT is a
cost center if your revenue does not directly depend upon the reliability of
your information systems; with the limited engineering talent available at
non-tech companies, you'd rather have them work on things that directly
contribute to raising the top line. IT is about the bottom line until a
business decides IT is not a cost center for them.

------
jtchang
My guess is this is what Google and Amazon are also doing for Nearline and
Glacier.

When will SSD drives catch up in storage capacity and price to traditional
spinning platters? An SSD consumes very little power, so this type of setup
would not be as necessary.

Also, what about the life cycle of drives that get powered on and off
repeatedly? On one hand you have Backblaze running consumer off-the-shelf
drives 24/7. On the other you have FB/Google/Amazon powering these drives on
and off.

~~~
wmf
_When will SSD drives catch up in storage capacity and price to traditional
spinning platters?_

Never. Someone did calculate that if you could get flash that lasts 15 years,
it would be cheaper than disk (because the disk has to be replaced). I can't
find the link, though.
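
The shape of that argument is roughly this (all prices and lifetimes below are
made-up placeholders, only there to show why replacement cycles matter):

    # Hypothetical $/TB figures; not from any real calculation.
    years = 15
    disk_dollars_per_tb, disk_lifetime_years = 30, 4     # replaced ~3-4 times
    flash_dollars_per_tb = 100                           # assumed to last all 15 years

    disk_total = disk_dollars_per_tb * (years / disk_lifetime_years)
    flash_total = flash_dollars_per_tb

    print(disk_total, flash_total)   # 112.5 vs 100: long-lived flash can win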

 _Also what about the life cycle of drives that get powered on and off
repeatedly?_

We wrote a paper on that topic where we treated start-stop cycles as a
resource to be rationed with a token bucket filter: [http://storage-
conference.org/2011/Papers/Research/10.Felter...](http://storage-
conference.org/2011/Papers/Research/10.Felter.pdf)
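
For anyone who hasn't run into token buckets: the general mechanism looks
roughly like this (a generic sketch of the idea, not the policy from the
paper):

    import time

    class TokenBucket:
        """Ration a scarce resource (e.g. drive spin-ups) to an average rate,
        while still allowing short bursts."""

        def __init__(self, rate_per_sec, burst):
            self.rate, self.capacity = rate_per_sec, burst
            self.tokens, self.last = burst, time.monotonic()

        def try_consume(self, n=1):
            now = time.monotonic()
            # refill in proportion to elapsed time, capped at the burst size
            self.tokens = min(self.capacity,
                              self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens >= n:
                self.tokens -= n
                return True        # within budget: allow the spin-up
            return False           # over budget: defer the start-stop cycle

    # e.g. an average of one spin-up per minute per drive, with bursts of up to 5
    bucket = TokenBucket(rate_per_sec=1 / 60, burst=5)
    if bucket.try_consume():
        pass  # spin the drive up and serve the read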

~~~
JunkDNA
"Never" is a pretty log time. Curious what reasons you have for making that
statement?

~~~
rasz_pl
physics

------
sengork
Facebook went quiet on their Blu-ray library cold storage method.

------
curiousDog
Would be interesting to see stats on how often they have had to resort to cold
storage. Also, how do they go about thoroughly testing such a service? This is
a main advantage of having 3 replicas: you can test HA in a straightforward
way. How does that work with the Reed-Solomon system? Also, kudos to Facebook
for publishing articles like this to the public. A lot of companies keep this
stuff secret.

~~~
brettproctor
This quote from the article talks about data verification. Although we all
know that a backup isn't really a backup until you've tested restoring from it
:-)

""" To tackle this, we built a background “anti-entropy” process that detects
data aberrations by periodically scanning all data on all the drives and
reporting any detected corruptions. Given the inexpensive drives we would be
using, we calculated that we should complete a full scan of all drives every
30 days or so to ensure we would be able to re-create any lost data
successfully.

Once an error was found and reported, another process would take over to read
enough data to reconstruct the missing pieces and write them to new drives
elsewhere. This separates the detection and root-cause analysis of the failure
from reconstructing and protecting the data at hand. As a result of doing
repairs in this distributed fashion, we were able to reduce reconstruction
from hours to minutes. """
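
Read as pseudocode, that detect/repair split looks roughly like this (a toy
model where plain replication stands in for the erasure-coded reconstruction;
the structure and names are invented for illustration, not the actual
implementation):

    import hashlib

    def checksum(data):
        return hashlib.sha256(data).hexdigest()

    # Three "drives", each holding a copy of every block.  Replication stands
    # in here for the Reed-Solomon reconstruction in the real system.
    blocks = {f"block-{i}": bytes([i]) * 64 for i in range(4)}
    drives = [dict(blocks) for _ in range(3)]
    expected = {bid: checksum(data) for bid, data in blocks.items()}

    drives[0]["block-2"] = b"\x00" * 64      # simulate silent corruption

    def scrub(drives, expected):
        """Anti-entropy pass: only *detect* and report (drive, block) mismatches."""
        return [(d, bid) for d, drive in enumerate(drives)
                for bid, data in drive.items() if checksum(data) != expected[bid]]

    def repair(found, drives, expected):
        """Separate path: rebuild each bad block from surviving good data."""
        for d, bid in found:
            for drive in drives:
                if checksum(drive[bid]) == expected[bid]:
                    drives[d][bid] = drive[bid]  # in reality: write to a new drive
                    break

    bad = scrub(drives, expected)                # [(0, 'block-2')]
    repair(bad, drives, expected)
    assert scrub(drives, expected) == []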

------
carb
Resurrecting a dead comment:

"the amount of engineering at this scale is just insane. kudos to facebook
engineering." \- emocakes

~~~
akurilin
Yeah, it's pretty unbelievable. Is the idea here that they're keeping this
storage as general-purpose as possible to serve many different use cases and
survive the test of time and change?

Or do they specialize these data centers for ONE specific feature? I wouldn't
want to be in the shoes of the people designing the full integration of the
system in the latter scenario; imagine your use case changes... that's a whole
new level of refactoring, especially once you go into special-purpose
hardware.

~~~
brettproctor
Yes, the idea is to be as general purpose as possible. That was one of the
motivations for avoiding tape.

------
krig
There was an interesting presentation about the cold storage system used at
Alibaba at openSUSE Conference 2015. Unfortunately the video isn't online yet, but
they have a system that allows them to turn off power to the cold storage
servers and only wake the disks up on a monthly schedule. They also use
erasure coding and have calculated that they won't have to replace any disks
for 5 years if the MTBF numbers from the vendors are correct:

[https://events.opensuse.org/conference/osc15/proposal/556](https://events.opensuse.org/conference/osc15/proposal/556)
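
Back-of-the-envelope, that kind of claim comes down to expected failures over
powered-on hours (every number below is made up for illustration; none of them
are Alibaba's):

    # Expected failures from a vendor MTBF figure (illustrative numbers only).
    drives = 10_000
    mtbf_hours = 2_500_000            # a typical vendor-quoted MTBF
    powered_on_fraction = 1 / 30      # disks woken up roughly one day per month

    powered_on_hours = 5 * 365 * 24 * powered_on_fraction
    expected_failures = drives * powered_on_hours / mtbf_hours

    print(round(expected_failures, 1))   # ~5.8 across the whole fleet in 5 years

A handful of expected failures across a fleet that size is exactly what the
erasure coding is there to absorb, which is presumably how they get to "no
replacements" rather than "no failures".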

~~~
philipw
[http://events.linuxfoundation.org/sites/events/files/slides/...](http://events.linuxfoundation.org/sites/events/files/slides/LFVault2015_Alibaba.pdf)

The slides from Vault 2015 talk about Sheepdog on low power hardware and SMR
drives at Alibaba.

------
amelius
So in what respects are generic solutions such as GlusterFS lacking? Only the
cold storage part? And would that be easy to fix?

------
tmdp
What a waste of electricity

~~~
pbhjpbhj
I wonder - looking at the data they have now - how much of the stored data has
never been used by ordinary users. There's got to be a lot of cruft that could
be trimmed. In the past, people would keep letters they felt were important.
Now, with the network, it's hard to tell what's important: a user might prune
items that other users want to keep, or might prune items that later become
more important. Is there really enough value in keeping everything?

I'm a consummate hoarder - I've got two decades of emails saved (but not all
emails!) ... and even I think perhaps we've gone over the top.

Perhaps it's good for social history. Or maybe in the future for a Black
Mirror like reconstruction of people's personalities in an AI.

Doesn't there need to be limits on what we keep?

[Meta: the parent is a valid remark and one that I feel adds to the complexion
of the scenario under consideration; perhaps it could have been made better,
fleshed out, but still. I wish HN - as a population - would value those who add
to the conversation in this way.]

------
revelation
You know, Facebook, you can just _buy_ these things from companies that spend
all of their R&D exclusively on _making_ them.

Not invented here..

~~~
cmurf
Those are arguably antiquated models. Even if you don't agree with that, those
models are predicated on companies that don't actually know what they're doing
when it comes to storage, which is why they defer to others and pay the markup
that comes with it. For Google, Amazon, Microsoft, Apple, Facebook, and a
growing number of others, this _is_ their business now: knowing how to
reliably store things and doing it in-house in a way that works for their use
case.

I don't use Facebook at all, but I appreciate the many investments they've
made in open hardware and open source software. They don't give away
everything. But they definitely contribute in significant ways.

