

When It Comes to Facebook Scale, You Can Throw Out the Rulebook - coreymgilmore
http://techcrunch.com/2014/09/24/when-it-comes-to-facebook-scale-you-can-throw-out-the-rulebook/

======
eitally
We are the integration partner for Facebook, at least in Europe, and one of
our divisions helped them work through some design challenges re: heat
dissipation, PDU routing, and HDD density. We are _NOT_ the EMS that
manufactures the servers for them, and the well-known company they're using
is providing crap quality. We see roughly a 10% failure rate (to the point of
having to scrap the server) after we rack the machines and start burn-in and
system test.

Honestly speaking, one of the things that interested me the most was the fact
that FB is running a custom Linux kernel ... and they update it far more
frequently than our test engineers would like. I'm happy to answer non-NDA
questions.

~~~
monksy
I'm not sure if you're able to answer this or not. Is it an FB-specific
configuration? Or are they writing their own drivers/CPU scheduling?

~~~
eitally
The latter. It's extremely irritating because they expect us to write the
system tests for everything from the HDDs to RAM to Fusion I/O cards to
network interfaces, and all we have to start from is the standard stuff ...
so when they change things and don't provide tests, it is unpleasant.
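
A minimal sketch of the kind of component exercise involved (purely
illustrative: the paths, sizes, and pass criteria below are hypothetical, not
anyone's actual test suite):

    # Hypothetical burn-in style checks -- illustrative only.
    import hashlib
    import os
    import tempfile

    def exercise_disk(path, size_mb=64):
        """Write pseudo-random data, read it back, and compare checksums."""
        data = os.urandom(size_mb * 1024 * 1024)
        expected = hashlib.sha256(data).hexdigest()
        fd, tmp = tempfile.mkstemp(dir=path)
        try:
            with os.fdopen(fd, "wb") as f:
                f.write(data)
                f.flush()
                os.fsync(f.fileno())  # force the data out to the device
            with open(tmp, "rb") as f:
                actual = hashlib.sha256(f.read()).hexdigest()
            return expected == actual
        finally:
            os.remove(tmp)

    def exercise_ram(size_mb=64):
        """Fill a buffer with a known pattern and verify it reads back intact."""
        pattern = bytes([0xAA, 0x55]) * (size_mb * 1024 * 1024 // 2)
        return bytes(bytearray(pattern)) == pattern

    if __name__ == "__main__":
        ok = exercise_disk(tempfile.gettempdir()) and exercise_ram()
        print("burn-in pass" if ok else "burn-in FAIL")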

------
ChuckMcM
Nice puff piece :-) The interesting bit is that if you're going to be
servicing a lot of machines, then even if it costs you more to build/acquire
them, you can more than make it back in operating savings. This is called
"trading capex for opex." And while it takes a reasonably savvy CEO to "get
it," the number of CEOs who just "don't get computers" seems to be on the
decline (at least in spaces that use a lot of computers).

In one way this also puts pressure on folks like Amazon, since their margin
on EC2 and other services is a mix of over-provisioning and opex savings;
having other folks able to do this for themselves is a win. Perhaps the
saddest thing was talking with the HP "Gen8" folks (trying to sell me servers
to replace my Supermicro ones), who don't get this at all. But then, they
came from a one server / one app world, not 500 servers / one app. I expect
they could reclaim some market share that way, but I don't expect they will.
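
A toy back-of-the-envelope makes the capex-for-opex trade concrete (every
number below is made up, just to show the shape of the math):

    # Hypothetical per-server figures; only the structure of the trade matters.
    custom_capex = 4500.0   # build/acquire cost per custom server
    cots_capex = 3500.0     # cost per off-the-shelf server
    custom_opex = 600.0     # yearly power/cooling/service per custom server
    cots_opex = 1100.0      # yearly power/cooling/service per COTS server

    extra_capex = custom_capex - cots_capex     # premium paid up front
    yearly_savings = cots_opex - custom_opex    # recovered every year
    breakeven_years = extra_capex / yearly_savings

    print(f"break-even after {breakeven_years:.1f} years")  # 2.0 with these numbers
    # At fleet scale, anything comfortably inside the hardware's service life
    # turns the up-front premium into net savings.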

------
aristus
I used to be a server monkey long long ago, and played with the mockups of
these servers that were set up at FB HQ. One thing to realize is they are
_heavy_. Loaded with hard drives, one of these servers is roughly the size,
shape and weight of a long 120mm mortar shell.

Easier to rack than the Dells of yesteryear, sure. But there can be up to 120
of these bad boys in a single standard rack, which means 60 of them have to be
lifted above waist height during rack and stack. That is pretty strenuous, and
technically you should use a moving lift and / or a lifting partner.
Especially when you're talking about hundreds or thousands of racks.

Thinking outside the box and keeping the poor floor staff in mind, I've always
wondered why we don't build little catwalks or half-height rack units, or even
moving racks, like a dry cleaner's.

~~~
ianamartin
I don't see anything in this article that sounds like throwing out the
rulebook. It sounds a lot like throwing out a few stupid rules. But not the
whole book.

~~~
forgottenpass
Of course it sucks, It's techcrunch, they can't actually write about tech. A
bunch of vaguely clued in non-techies writing about tech companies for AOL
don't target an audience that wants to hear about server hardware. They want
to hear about how great (innovative, rule-breaking, etc...) facebook is. If
server hardware is the vehicle for that story, so be it.

This article is "company finds business value in customizing a critical
component instead of using COTS." And that's only interesting to people who
understand the components. So they represent it poorly, but only a little
bit, not badly enough to merit the weight we typically associate with the
word "misrepresentation".

------
hackuser
I don't see much that's novel in the article, other than perhaps the rack
dimensions. Corporate servers and workstations have been designed for easy,
screwless maintenance for a long time, going back to the 1990s I think. I'm
pretty sure I also hot-swapped array drives back then just by pulling a
lever. I can't remember the last time I needed a screwdriver for a corporate
server/desktop; I even see laptops now (the HP Elite line) that are
screwless, at least in the parts I've seen, and designed for easy maintenance.

~~~
tgflynn
The drive sleds are easy to insert and remove, but to remove the drive itself
from the sled for replacement you have to remove screws. At least that was my
experience. It was typically four easily accessible screws, though, so not
really as big a deal as the article makes it out to be.

~~~
ControlledBurn
You're correct, OEM hot swap trays are only tool-less once it's screwed into
the tray (Though the OEMs do tend to ship most drives preinstalled to their
trays). The OCP stuff is designed to drop the bare drive into the JBOD/Server.

------
AndrewKemendo
I remember seeing a video from the Facebook server engineering/maintenance
crews and was really impressed with all that they were doing.

 _“Many silo these engineering teams –server, storage, database, [and so
forth]. We don’t create these barriers,”_

Does anyone know how they are structurally organized?

I also wonder who directed that all of these innovations be standard, e.g.
was it Zuckerberg, Sandberg, Corddry, or what?

~~~
judk
Maybe the Googler teams they recruited?

------
jsabo
Does anyone have a link to more info on their differently sized racks or the
storage arrays described in the article? I'm not too familiar with the Open
Compute Project, but the Open Vault storage design they have is 2U:

[http://www.opencompute.org/projects/open-vault-storage/](http://www.opencompute.org/projects/open-vault-storage/)

which doesn't match the article.

~~~
ControlledBurn
Don't have anything to link to, but you're right, everything still ends up in
a 2U, technically OU (Open U, 2 inches per OU instead of 1.75), chassis. Each
ODM has their own unique sled setup for the server nodes, but everything ends
up in a 2OU framework for the busbar connection.
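
A quick conversion shows the height difference (the 2-inch OpenU figure is
from above; 1.75 inches is the standard EIA rack unit):

    OU_INCHES = 2.00  # OpenU height, per the comment above
    U_INCHES = 1.75   # standard EIA rack unit height

    print(f"2OU chassis: {2 * OU_INCHES:.2f} in vs 2U chassis: {2 * U_INCHES:.2f} in")
    # 4.00 in vs 3.50 in -- OpenU gives each chassis a bit more vertical room.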

------
segmondy
Nothing new, but there are still lessons to take away. As a software engineer,
I agree with removing constraints/rules to see what new things could happen,
bringing in people across different domains to tackle new problem instead of
just software engineers, and absolutely watching your users use your product.

------
siliconc0w
Come on, Facebook, screwless design is old hat; let's see a fully automatic
drive replacement system.

~~~
ojbyrne
Indeed. As I read this article I found myself thinking of my CoolerMaster case
at home.

------
arbuge
>>Facebook has found when engineers work together instead of in isolation
interesting things begin to emerge.

Well, let's face it, this is hardly a groundbreaking new discovery...

