An Inside Look at Facebook’s Approach to Automation and Human Work (hbr.org)
93 points by riqbal 894 days ago | 32 comments

I would not say that FB compares favourably to Google in terms of infra, and I worked for both, in infra. Culture is way more relaxed though, especially if you're not into cults.

Can you speak more about this? I'd be interested in some of the details (you don't have to throw either under the bus, just curious about the nitty-gritty stuff).

I don't work for either but I do follow their projects a bit. I'd say you can start by comparing things like Hadoop or HDFS to things like Google File System and their Spanner-based F1 RDBMS (strong consistency + NoSQL scale). There's clearly a difference in talent, or at least in applying it to solve the toughest problems. Both companies try to work around problems too hard to solve, but Google seems to solve the hard ones more often.

Pretty much that. FB has a strong culture of trying to use FOSS 1st, no matter what. The problem is that at the scale FB operates this simply produces 2nd rate solutions with a ton of duct tape. If you're into solving hard problems Google is a way better place to be. In FB it feels like you're solving a lot of problems so it kind of feels way more productive but you quickly realise that most of these are obvious self-inflicted wounds. There just isn't much political will to invest in large non-customer-facing projects (like Google's GFS and its successor, BigTable, Spanner etc.) and perhaps talent is spread a bit too thin.

That being said Google inside too often feels like a cult and I don't miss hearing the word 'googliness' at least 10 times a day.

It would be cool if you could share when you worked at FB and what general area of the infrastructure you worked on. Your statements don’t match the Facebook infrastructure I know. I recently left after spending 8.5+ years on infra, so I saw it both in the really early days, as well as how it is today.

Facebook actually built many major pieces of its infrastructure itself after finding no suitable solution in open source or commercial vendors. Efforts like this were happening even back in the early days of FB infra. I could list out many publicly known Facebook infrastructure projects that solved Hard Problems. And not just in recent years either. :-)

Hard problems like massive scale with strong consistency which three other firms solved? Or "hard" problems like managing COTS components or accelerating PHP that just require time + labor to pull off? Feel free to name projects solving Hard problems whose capabilities had little to no precedent outside Facebook. I mean that straight up rather than sarcastically in case it comes off that way. I'd like to know the best Facebook pulled off.

I’ll talk about one abstract thing and two concrete things that I think show that Facebook has been able to solve Hard problems that required the development of new capabilities rarely or never seen before. I find these three things to be very impressive and notable myself, although I am obviously very biased. I should note that this is not an exhaustive list, it’s more just the three examples I am most familiar with and wanted to take the time to write up.

First, the abstract thing. I don’t know if most engineers outside Facebook appreciate how much activity there is on the site each day. Hundreds of billions of likes and comments. Billions of photos uploaded. Trillions of photos consumed. And growing each day.

And this is for an “online application” in the sense that the data is live and constantly changing. We’re not talking about crawling the web, storing it, doing offline processing, and then building a bunch of indices (which is a different but still legit kind of hard). This is an immense amount of live data producing an even more immense stream of live events; trillions and trillions and trillions that need processing, live and on the fly, every day. It is hard to overestimate how difficult it was to build an application and backend that could drive this kind of social platform. In terms of liveness and scale, there really is nothing out there that can touch Facebook, by orders of magnitude. That is a solved Hard problem in my opinion, if not a meta one. But in my opinion, the largest one.

Here are two concrete things that I think are good examples of Hard problems that Facebook has solved:

1. A global media storage platform that each day is capable of ingesting billions of new photos and videos and delivering trillions as well. This includes Haystack, F4, and Cold Storage, which store hot/warm/cold media objects for Facebook. Each storage layer has specialized, custom software running on distinct storage hardware designed to take advantage of the requirements for each layer. Facebook is the largest photo sharing site in the world by orders of magnitude, and they had to build a very custom photos backend to handle the immense load. I’m not even mentioning their terabit class Edge platform which has a global constellation of POPs that accelerates application traffic and caches popular content close to users. Facebook’s global media storage and delivery platform is truly a unique asset.

2. A datacenter architecture focused on high power efficiency and flexible workloads, composed of custom-built datacenters, racks, servers, storage nodes, network fabric, rack switches, and aggregation switches. Plus some other things that haven’t been made public yet. This architecture let application developers stop working around various physical performance bottlenecks with compute, storage, cache, and network, and instead just focus on optimizing what was best for the application. At the same time, this new architecture also greatly reduced infrastructure costs.

There are other solved Hard problems I can talk about that are public, and others that I wish I could talk about that aren’t yet. However, I’m not trying to exhaustively defend what Facebook has built, I more just wanted to respond and show a few examples of the very special scale Facebook has and some of the extremely hard problems they’ve had to solve.

re live data. Processing that much live data without changes breaking the whole thing is a Hard problem. Very impressive work on their part. I'll give them that.

re storage platform. I'd have to look into Haystack or F4 to see if they were truly unique or incremental improvements on existing stuff. Cold Storage, though, I *did* read up on. The work and decisions that went into that were brilliant. They chased every little detail and optimization they could. Plus, although minor precedents existed, I don't think I saw anyone else thinking or working in the direction they worked. Truly innovative.

re datacenter. I count most of that as more incremental. I've looked at their publications on their architecture. They might strip unnecessary stuff out of a blade, put in a RAM sled for a cache, use a simpler protocol than TCP/IP for internal comms, and so on. The kind of stuff everyone in the whole datacenter (and supercomputer) industry does. Most of it is obvious & has plenty of precedent. Now, if you know a few specific techniques they used that were extremely clever (little to no precedent), please share. An example would be whoever started installing servers into shipping containers for mass production and then easy installation... that was brilliant stuff. Was either Facebook or Microsoft.

So, you've given two good examples of Hard problems Facebook solved. Brilliant thinking went into solving them. No doubt a company working some miracles on a regular basis.

I built Haystack with two other engineers and then founded the Everstore team which works on Haystack/F4/Cold Storage. I moved off of managing Everstore to focus on Traffic in fall of 2011, so F4 and Cold Storage happened after my time, but I worked on Haystack and the storage management layer above it for five years.

There were a few things that were interesting or novel about Haystack. CDNs were only giving us an 80% hit rate on photos, and even back in 2007 when we first started working on Haystack, that was an immense number of misses to serve from the origin. That heavily informed the design goals of Haystack, which were to have a very I/O-efficient userland filesystem that used direct I/O to squeeze as many IOPS from the underlying drives as possible. We also wanted zero indirect I/O, so we used an efficient in-memory index so that all seeks on disk were to service production reads versus indirect index blocks or other metadata. Lastly, Haystack is very very simple. We made it as simple as we could and eschewed anything that looked too clever. I actually think its simplicity is one of the things that made it Hard. It would have been much easier to design something more complex.
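For readers who want a concrete picture of the "in-memory index, one seek per read" idea, here is a minimal sketch in Python. It is not Facebook's code: the class name `NeedleStore`, its methods, and the file name are hypothetical, and it skips direct I/O, checksums, and crash recovery entirely. It only shows that when the whole key-to-location map lives in RAM, a read touches the disk once, for the data itself.

```python
import os
import tempfile


class NeedleStore:
    """Toy append-only blob store with an in-memory index (hypothetical names)."""

    def __init__(self, path):
        self.index = {}  # key -> (offset, size); lives entirely in RAM
        # "a+b": every write is appended to the end, reads may seek anywhere.
        self.f = open(path, "a+b")

    def put(self, key, data):
        self.f.seek(0, os.SEEK_END)
        offset = self.f.tell()          # where this blob will start
        self.f.write(data)
        self.f.flush()
        # Re-putting a key just repoints the index; the old bytes stay on
        # disk as garbage until some later compaction pass (not shown).
        self.index[key] = (offset, len(data))

    def get(self, key):
        offset, size = self.index[key]  # RAM lookup, no metadata I/O
        self.f.seek(offset)             # the only seek for this read
        return self.f.read(size)

    def close(self):
        self.f.close()


# Usage: two photos stored back to back, then read by key.
with tempfile.TemporaryDirectory() as tmp:
    store = NeedleStore(os.path.join(tmp, "volume.dat"))
    store.put("photo1", b"hello")
    store.put("photo2", b"world!")
    first = store.get("photo1")
    second = store.get("photo2")
    store.close()
```

The trade-off in this sketch is that the index must fit in memory and has to be rebuilt or persisted separately after a restart, which is one reason keeping the per-object metadata small matters.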

If you have any questions about the Haystack paper after reading it, I'd be happy to answer what I can. Just send me an email.

Regarding the datacenters, you're kind of simplifying it, but that's okay. There's a lot more orchestration, automation, and integration, and I wish I could give you a tour of one of our OCP datacenters so you can see it firsthand. There's plenty of things I wish I could talk about on the datacenter front that I feel are truly amazing and without precedent, but I can't until they are public. There's an excellent chance that they will become public and likely even part of OCP in the future though. I'm really proud of how much of Facebook's datacenter architecture has been shared with the world for free. It gives me warm fuzzies all the time.

Anyways, thanks for the fun thread, and let me know if you ever want to talk storage sometime.

The paper was straightforward on what mattered. I liked that your team did their homework on prior efforts, used object-based storage (I pushed it in the late '90s), used the most proven filesystem (one I'm using now), kept data structures fairly simple for modern stuff, and wisely used compaction. All good stuff.

I actually do have ideas on where to go from there, and they're general. I'd probably charge Facebook for the info, though, given the increase in performance and savings they might generate. I might contact you in the future on that.

> I don't think I saw anyone else thinking or working in the direction they worked. Truly innovative.

The best thing? The guy who designed it previously co-designed Google's cold storage system ;)

If true, that would be an epic burn of my comment. Yet, the only person's name that kept popping up has a LinkedIn that doesn't mention Google. He worked at Microsoft, Amazon, etc. Might not be the main inventor, though. Who are you referring to?

Gary Colman. Left Facebook for a startup earlier on this year.

I'll be darned. Yes, hiring innovators is a great innovation tactic. ;)

> There just isn't much political will to invest in large non-customer-facing projects

How about OpenCompute? It was promoted by FB (even if inspired by Google) and had so much impact on the industry that even HP is now having FoxConn build HP-branded OpenCompute-style servers.

My company (Newisys, under Sanmina) helped with hardware refinements because they were having cooling/airflow problems. They then contracted with Compal to manufacture the servers and then us to do system assembly. Compal's QC was horrifically bad -- we had about a 10% failure rate of servers they sent us. I'm happy OpenCompute is starting to receive traction, but it's not because FB is awesome at hardware. They've gotten a ton of both raw engineering and also DFM assistance from the pros.

I'm very grateful for you confirming my intuition on how they're doing hardware. It's so specialist and difficult I figured they had to be outsourcing most of it to pros. Not a negative on their part, though, as it's what I'd do as well. There are just some holdovers in discussions who think these companies are all doing (rather than funding) hardware development.

EDIT to add: The Horus product is interesting. Formerly, I only knew about the NUMAscale for AMD NUMA. Too bad competition rarely survives in this segment. Things were more interesting in the MPP and DSM days.

Jonathan Heiliger launched the OpenCompute initiative while at Facebook: https://www.facebook.com/notes/facebook-engineering/building...

I had the pleasure to meet him now that he is a VC at Vertex Ventures. Very smart and nice guy.

That matches what I've seen in descriptions of their architectures. Duct tape especially. They are pretty good at gluing together a bunch of FOSS stuff and expanding on it, though. I'll give them that.

The one technician for 25,000 servers part stands out the most to me. That's quite an achievement. Seems like they're ahead of supercomputers in management efficiency. Another along these lines is AOL's "lights out" datacenter [1] that they ignore for long periods of time except for efficient, batch jobs of maintenance.

[1] http://www.zdnet.com/article/aol-shrinks-its-lights-out-data...

My intuition is that there's a Moore's Law-like labor rule for IT labor due to its high cost and the explosion of computing power:

"Every 18-24 months, an organization must be able to manage an order of magnitude more servers (virtual and physical) with the same number of staff."

Unusually for the HBR there is a lot of interesting stuff in there. For example having the compiler team mixed in with the front end development team.

The compiler team wasn't mixed in. The front end team developed a compiler within its own team over a multi-year period - this was in response to the question "Do you still separate the teams working on innovations from the teams maintaining today’s core business?"

That's what I meant. Sorry I wasn't clear.

Those having trouble reading, try copying the title into a web search and going direct from there and/or using incognito mode.

The paywall is popping up with a message "You've hit your limit of 5 free articles as an anonymous user this month.", despite the fact I haven't been reading anything on HBR for as long as I can remember.

Incognito mode does the job for me. Have you tried?


This blog post is interesting and it's good to see some numbers (one technician per 25,000 servers). Also, it touches upon one interesting aspect - "When do you automate a thing?". It would be interesting to know the actual metrics they use. Shameless plug: http://stackstorm.com/ is very similar to FBAR and is helping customers reap the benefits of automation. We share the vision and we are completely open source. Feedback is appreciated. (I am a full time developer with StackStorm.)

Folks interested in this kind of approach may want to join the Event Driven Automation meet-up in the Bay Area. Facebook is discussing FBAR and hosting an upcoming one (already "sold out"). LinkedIn is speaking at the next one (in SF 6/18) http://www.meetup.com/Auto-Remediation-and-Event-Driven-Auto...

Someone, OP maybe, post a mirror please.
