That being said, the inside of Google too often feels like a cult, and I don't miss hearing the word 'googliness' at least 10 times a day.
Facebook actually built many major pieces of its infrastructure itself after finding no suitable solution in open source or from commercial vendors. Efforts like this were happening even back in the early days of FB infra, not just in recent years. I could list out many publicly known Facebook infrastructure projects that solved Hard Problems. :-)
First, the abstract thing. I don’t know if most engineers outside Facebook appreciate how much activity there is on the site each day. Hundreds of billions of likes and comments. Billions of photos uploaded. Trillions of photos consumed. And growing each day.
And this is for an “online application” in the sense that the data is live and constantly changing. We’re not talking about crawling the web, storing it, doing offline processing, and then building a bunch of indices (which is a different but still legit kind of hard). This is an immense amount of live data producing an even more immense stream of live events; trillions and trillions that need processing, live and on the fly, every day. It is hard to overstate how difficult it was to build an application and backend that could drive this kind of social platform. In terms of liveness and scale, there really is nothing out there that can touch Facebook, by orders of magnitude. That is a solved Hard problem, if a meta one; in my opinion, the largest one.
Here are two concrete examples of Hard problems that Facebook has solved:
1. A global media storage platform that each day ingests billions of new photos and videos and delivers trillions of media objects. This includes Haystack, F4, and Cold Storage, which store hot, warm, and cold media objects for Facebook (there’s a toy sketch of that split right after this list). Each storage layer runs specialized, custom software on distinct storage hardware designed around the requirements of that layer. Facebook is the largest photo-sharing site in the world by orders of magnitude, and they had to build a very custom photos backend to handle the immense load. I’m not even mentioning their terabit-class Edge platform, a global constellation of POPs that accelerates application traffic and caches popular content close to users. Facebook’s global media storage and delivery platform is truly a unique asset.
2. A datacenter architecture focused on high power efficiency and flexible workloads, comprising custom-built datacenters, racks, servers, storage nodes, network fabric, rack switches, and aggregation switches, plus some other things that haven’t been made public yet. This architecture let application developers stop working around physical performance bottlenecks in compute, storage, cache, and network, and instead focus on optimizing what was best for the application. At the same time, it greatly reduced infrastructure costs.
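To make the hot/warm/cold split from the first example concrete, here’s a toy routing sketch. This is my illustration, not Facebook’s actual placement policy: the tier names map loosely onto Haystack/F4/Cold Storage, and the age and read-rate thresholds are made-up assumptions.

```python
# Toy illustration of hot/warm/cold media tiering -- NOT Facebook's
# actual policy. Tier names map loosely to Haystack / F4 / Cold
# Storage; the age and read-rate thresholds here are made up.

def pick_tier(age_days: float, reads_per_day: float) -> str:
    """Route a media object to a storage tier by age and access rate."""
    if age_days < 30 or reads_per_day > 10:
        return "hot  (Haystack-like: IOPS-optimized, replicated)"
    if age_days < 365 or reads_per_day > 0.1:
        return "warm (F4-like: erasure-coded, cheaper per byte)"
    return "cold (Cold-Storage-like: mostly-powered-off media, rare reads)"

print(pick_tier(age_days=2,    reads_per_day=500))  # hot
print(pick_tier(age_days=120,  reads_per_day=0.5))  # warm
print(pick_tier(age_days=2000, reads_per_day=0.0))  # cold
```

The real systems obviously decide placement from measured access patterns rather than two hardcoded thresholds; the point is just that each tier trades IOPS for cost per byte.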
There are other solved Hard problems I could talk about that are public, and others that I wish I could talk about that aren’t yet. However, I’m not trying to exhaustively defend what Facebook has built; I just wanted to respond with a few examples of the very special scale Facebook operates at and some of the extremely hard problems they’ve had to solve.
re storage platform: I'd have to look into Haystack and F4 to see if they were truly unique or just incremental improvements on existing stuff. Cold Storage, though, I *did* read up on. The work and decisions that went into that were brilliant; they squeezed out every little detail and optimization they could. Plus, although minor precedents existed, I don't think I saw anyone else thinking or working in the direction they did. Truly innovative.
re datacenter: I count most of that as more incremental. I've looked at their publications on their architecture. They might strip unnecessary stuff out of a blade, put in a RAM sled for a cache, use a simpler protocol than TCP/IP for internal comms, and so on. That's the kind of stuff everyone in the whole datacenter (and supercomputer) industry does. Most of it is obvious & has plenty of precedent. Now, if you know of a few specific technologies they used that were extremely clever (little to no precedent), please share. An example would be whoever first started installing servers in shipping containers for mass production and easy installation... that was brilliant stuff. It was either Facebook or Microsoft.
So, you've given two good examples of Hard problems Facebook solved. Brilliant thinking went into solving them. It's no doubt a company working some miracles on a regular basis.
There were a few things that were interesting or novel about Haystack. CDNs were only giving us an 80% hit rate on photos, and even back in 2007, when we first started working on Haystack, that was an immense number of misses to serve from the origin. That heavily informed the design goals of Haystack, which were to have a very I/O-efficient userland filesystem that used direct I/O to squeeze as many IOPS from the underlying drives as possible. We also wanted zero indirect I/O, so we used an efficient in-memory index so that every seek on disk served a production read rather than indirect index blocks or other metadata. Lastly, Haystack is very, very simple. We made it as simple as we could and eschewed anything that looked too clever. I actually think its simplicity is one of the things that made it Hard. It would have been much easier to design something more complex.
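To make that design concrete, here’s a minimal toy sketch of the idea: one big append-only file plus an in-memory index, so every read is a single positioned read of the photo bytes themselves. This is my illustration, not Haystack’s code; the real system adds checksums, deletion flags, compaction, index recovery, and so on.

```python
import os

# Toy sketch of the Haystack idea described above -- an append-only
# "needle" file plus an in-memory index, so a read never touches
# on-disk index blocks or other metadata. Not Facebook's code.

class ToyHaystack:
    def __init__(self, path: str):
        self.f = open(path, "a+b")
        self.index = {}  # photo_id -> (offset, size), kept entirely in RAM

    def put(self, photo_id: str, data: bytes) -> None:
        self.f.seek(0, os.SEEK_END)
        offset = self.f.tell()
        self.f.write(data)          # append-only: no in-place updates
        self.f.flush()
        self.index[photo_id] = (offset, len(data))

    def get(self, photo_id: str) -> bytes:
        offset, size = self.index[photo_id]  # lookup costs zero disk I/O
        self.f.seek(offset)                  # one seek...
        return self.f.read(size)             # ...one read, all payload

store = ToyHaystack("needles.dat")
store.put("p1", b"jpeg bytes here")
assert store.get("p1") == b"jpeg bytes here"
```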
If you have any questions about the Haystack paper after reading it, I'd be happy to answer what I can. Just send me an email.
Regarding the datacenters, you're kind of simplifying it, but that's okay. There's a lot more orchestration, automation, and integration, and I wish I could give you a tour of one of our OCP datacenters so you could see it firsthand. There are plenty of things I wish I could talk about on the datacenter front that I feel are truly amazing and without precedent, but I can't until they are public. There's an excellent chance they will become public, and likely even part of OCP, in the future though. I'm really proud of how much of Facebook's datacenter architecture has been shared with the world for free. It gives me warm fuzzies all the time.
Anyways, thanks for the fun thread, and let me know if you ever want to talk storage sometime.
I actually do have ideas on where to go from there, in general terms. I'd probably charge Facebook for the info, though, given the performance gains and savings they might generate. I might contact you in the future on that.
The best thing? The guy who designed it previously co-designed Google's cold storage system ;)
How about OpenCompute? It was promoted by FB (even if inspired by Google) and had so much impact on the industry that even HP is now having Foxconn build HP-branded OpenCompute-style servers.
EDIT to add: The Horus product is interesting. Previously, I only knew about Numascale for AMD NUMA. Too bad competition rarely survives in this segment. Things were more interesting in the MPP and DSM days.
I had the pleasure of meeting him now that he is a VC at Vertex Ventures. Very smart and nice guy.
"Every 18-24, an organization must be able to manage an order of magnitude more servers (virtual and physical) with the same number of staff."