Hacker News new | past | comments | ask | show | jobs | submit login
Google reveals details about its datacenters (highscalability.com)
281 points by toddh on Aug 11, 2015 | hide | past | web | favorite | 53 comments

The key takeaway for algorithms research seems to be that "[w]e don’t know how to build big networks that deliver lots of bandwidth". This is exactly what S. Borkar argued in his IPDPS'13 keynote [1]. An exa-scale cluster can't be cost-efficient unless the bisection bandwidth is highly sublinearly in the cluster's computing power.

We need new algorithms that - require communication volume and latency significantly sublinear in the local input size (ideally polylogarithmic) - don't depend on randomly distributed input data (most older work does)

It's really too bad that many in the theoretical computer science community think that distributed algorithms were solved in the 90s. They weren't.

[1] http://www.ipdps.org/ipdps2013/SBorkar_IPDPS_May_2013.pdf

One of the biggest things I've had to unlearn as an SRE leaving google is this: RPC traffic is not free, fast, and reliable (So long as you don't go cross-datacenter). For most companies it is expensive and slow. Facebook's networks are still designed for early-2000s era topologies and their newer topologies won't fix that; They've still got way too little top-of-rack bandwidth to the other racks nearby.

Microsoft hasn't even caught on yet, and is still designing for bigger and bigger monolithic servers. I can't tell what Amazon is doing, but they seem to have the idea with ELBs at multiple layers.

This is what continues to shock me about the tech industry. Google started revolutionizing the way data centers work more than 15 years ago, and it's a huge reason why they are able to execute on ambitious projects so well and serve up content that has a ton of compute behind it yet do so incredibly fast yet also incredibly cheaply. That cuts right to the bottom line, yet nobody else has copied google yet. I think it's a huge indication of just how immature the entire industry truly is. It's difficult enough to get anything done these days, let alone get something exceptional done well. We're only just a little bit out of the deep depths of the "software crisis" where the ability to run a successful software project was a hit or miss thing. But we're still in the software crisis shallows, and it'll be a long time before we make land fall.

> nobody else has copied google yet

Why would they ?

Almost nobody is operating at Google's scale and interconnects like Infiniband are rarely saturated to the point that exotic solutions are needed. Fast RPC is nice and all but (a) not that many companies are using microservices and (b) not that many are scaling out to the billions of requests/second.

And the 20 or so companies who are operating at Google's scale are hardly floundering around blind. Facebook has done very interesting work in the infrastructure space, LinkedIn is amazing on the core software side and Netflix has pioneered the use of the cloud.

The two examples I think that matter are Amazon and Microsoft. I don't think they copied Google- I think they independently realized, when they decided to enter the cloud service, that they had to scale up and out their data centers tremendously. And when you do that, you quickly learn that deploying more hosts is easier than deploying the ports to interconnect those hosts with a globally nonblocking fabric.

Amazon and MSFT have both published nice papers showing that they "got it" about 3-5 years ago. I think they operate within an order of magnitude of Google in terms of total external and internal traffic, but that's just a guess because none of these companies publish enough detail to say for sure.

As a googler, I would agree with this bucketing. Amazon and Microsoft are in the same ballpark scale-wise, but companies like LinkedIn and Netflix have lower scaling needs (this isn't a knock on them, just a technical reality). Netflix would likely be in the same ballpark if they didn't leverage Amazon. Facebook is also hovering near the line, and has probably crossed over.

I think this is short-sighted. Everyone who operates with any portion of their stack on any cloud provider is "operating at Google's scale" since even a small service is affected both by the infrastructure on which it is run and all of the other services run on that infrastructure. If you're running in a cloud, you need to plan and develop like you're a product team at facebook or google, and cloud providers need to act like they are providing infrastructure for a cohesive system, not a bunch of individual pieces.

It's not short sighted. It's called not prematurely optimising.

There is a world of difference between the issues Facebook, Google, LinkedIn etc are dealing with and the rest of us. Nobody is worrying whether 10% of their CPU/Memory/Network capacity is going unused if we only have a handful of servers. Or overly worrying about interconnect bandwidth beyond a certain point.

It is fascinating and useful to see how to scale but we aren't all in the dark ages simply because we don't follow their architectural approaches.

No one is saying you should build this in your back office. My comment was limited (by the first bit) to people utilizing public clouds and those who run public clouds. If you are, you are affected by these issues; if you are not, then there's an opportunity cost there, whether or not you believe the early optimization is greater or lesser.

> RPC traffic is expensive and slow, not free, fast, reliable

If i've understood your point, you should have rather said this is what you've learned, not unlearned.

You are correct; I've edited my comment to (hopefully) make it a little clearer. Google RPCs are fast. The RPC trace depth of many requests is >20 in miliseconds. Google RPCs are free - Nobody budgets for intradatacenter traffic. Google RPCs are reliable - Most teams hold themselves to a SLA of 4 9s, as measured by their customers, and many see >5 as the usual state of affairs.

Is it opposite day? You seem to mean <20ms right?

I think he means that over 20 remote procedure calls can be made within milliseconds, which is game-changing.

I parsed his comment as "a request that results in 20 nested RPCs completes in a few ms".

Try this link for what MS is doing :



> ... then they hire bottom-feeders...

I don't work for Microsoft, but I'm kinda disgusted by that comment. By all means criticise their work, but calling them derogatory names is deeply deeply uncool.


Come on. You know that's not welcome here.

It would be nice if the SREs (and SWEs) still at Google would also unlearn these incorrect facts. RPCs fail, bandwidth costs money.

I once saw the co-founder of Cloudera saying that Google exists in a time-warp 5-10 years in the future, and every now and then it gives the rest of us a glimpse of what the future looks like.

Felt exaggerated at the time, but it often seems like the truth.

I actually thought you were exaggerating but 5-10 years is actually very true in some areas.

GFS only translated into HDFS after 5 years, same with MapReduce, BigTable etc.

And still doesn't work nearly as well.

I have always suspected so since Google still does not open source its MapReduce, but I haven't found any actual evidence directly comparing Hadoop MapReduce and Google's. There could also be technical issues preventing Google from open sourcing its implementation, although I don't know any. Now that we are moving on to more sophisticated models and implementation I am hoping more details can emerge.

I don't know anything about MR in particular, but the open-sourcing process at Google often gets bogged down by how interconnected the codebase and ecosystem are. Releasing an open source version of a component means reimplementing a lot of stuff that Google isn't ready to release yet but that that component depends on. Look at Bazel/Blaze - it took them a long time to get it out, and the open version doesn't have feature parity. Also take a look at their governance plan going forward, which hints at the difficulties.[1]

That said, I suspect a large part of the reason is that Google's MapReduce is highly optimized for running in a Google datacenter, and they're generally pretty tight-lipped about the real secret sauce in their infrastructure.


> There could also be technical issues preventing Google from open sourcing its implementation, although I don't know any.

My understanding is that sharing details takes time off very valuable employees, away from fixing important internal issues – a lot more than the company being afraid those details might leak. For the handful of companies who can and needs to implement equivalent, poaching said employees is simpler than back-engineering from blogposts and it makes implementation easier.

MapReduce and associated infrastructure at Google is arguably at the center of their massive and proprietary codebase. I suspect opening that would involve opening much of their core code.

It is in SOME ways an exaggeration—they're also tackling familiar problems—but GFS/Colossus and the data center administration are probably accurate.

However, Facebook and Amazon almost certainly have similar challenges. I don't think they are necessarily unique so much as so huge the rest of us don't have the money or the problems to solve.

Some parts of Google definitely, but others not so much. If anything they were late on social networks and they still haven't really caught up.

I think he was referring to the technology Google is using to build products not the products per se.

So every cluster machine has 40gbit ethernet (?) - does anywhere else do that?

Looking at Table 2 http://conferences.sigcomm.org/sigcomm/2015/pdf/papers/p183....

Some perhaps, but it is likely broken down into 4x10G. A 16x40G ASIC would get you 48x hosts and 4x40G uplinks.

IBM has had 4x DDR (20Gb) and QDR (40Gb) Infiniband as an option on their blades and frame nodes for about seven years.

I used to work on a program that is using IBM blades with 4x DDR as an MPI processing cluster. The cluster was significantly smaller than what Google is discussing, though.

Yes, this. Infiniband has scaled up significantly since then as well. I believe you can do 4x25 (100Gb) today.

Having this background knowledge gives you a much better idea of why the interconnects is going to be such a big problem going forward.... When I was working with QDR IB 5 years ago, it didn't matter if they had faster cards - the PCI bus at the time couldn't support anything faster. So you were literally riding the edge of the I/O capability all the way through the stack.

Perhaps not related but I remember LHC's raw data rate is about PB/s and they use ASIC hardwares to filter it.


This is different to the sort of thing Google is doing in the data centre. Most of the PBs/sec don't really see the light of day since you may have a multi-megabyte "image" being captured at something like 40MHz, but practically much of the image is zeros for any given capture. So zero-suppression in the first instance already brings the data down to much more manageable numbers, before they hit off-the-shelf computing hardware.

(At least, that was the case a few years ago. I don't know how much it has changed but I would be surprised if they had totally overhauled things).

It is not uncommon for storage nodes in distributed storage systems, especially when they're stuffed with SSDs.

4x10G per node is not totally uncommon from what I've read. Easier to do in smaller clusters though.

Ironically, if you look at the data center as a computer, this looks very much like scaling up, not scaling out.

I wonder if one day we will find that sending all data to a data center for processing doesn't scale. I think that's already a given for some realtime'ish types of applications and it could become more important.

Obviously, the success of decentralised computing depends a lot on the kinds of connected devices and whether or not data makes sense without combining it with data from other devices and users.

With small mobile devices you always have battery issues. With cars, factory equipment or buildings, not so much. But management issues could still make everyone prefer centralisation.

> I wonder if one day we will find that sending all data to a data center for processing doesn't scale.

Of course it scales. Some problems demand more scale. Through more datacenters at it. Tweak the algorithms to distribute data more conservatively.

What could happen is that both processing capacities and sensor capacities grow faster than networking capacities for years. That could make latencies untenable for large classes of applications.

Latency is bounded by the speed of energy transfer (i.e. speed of light under superconducting situations). Even if bandwidth IS a big problem—as it is—latency will always be a sticking point for distributed systems. As someone else pointed, we need to start analyzing distributed algorithms with a) geographical locality and b) complexity on round trips, which have a constant cost determined by latency.

Speed of light determines the lower bound of roundtrip latency, but the sticking point is the upper bound, and that depends on how congested networks are.

If sensors produce more data and data centers can process more data but network capacities can't keep up, then the analysis you are suggesting will show that we need more decentralised processing, not more data centers.

Ahh, I see.

I don't think we were disagreeing; I was just pointing out that data centers scale very well for what data centers were built for. It's not that surprising that we are outgrowing them.

> The amount of bandwidth that needs to be delivered to Google’s servers is outpacing Moore’s Law.

Which means, roughly, that compute and storage continue to track with Moore's Law but bandwidth doesn't. I keep wondering if this isn't some sort of universal limitation on this reality that will force high decentralization.

> I keep wondering if this isn't some sort of universal limitation on this reality that will force high decentralization.

To be fair, bandwidth has a high cap. Not all problems will need to be decentralized.

> Not all problems will need to be decentralized.

Our use of bandwidth is growing without bounds. That means most problems will need to be decentralized, which is akin to saying we're going to need to decentralize. :)

Well, I have full faith humans can adapt. We just haven't needed to, yet, except with e.g. slower memory latency reduction vs cpu speedup.

> The I/O gap is huge. Amin says it has to get solved, if it doesn’t then we’ll stop innovating.

I can imagine you can solve the throughput problem with relative ease, but the speed of light limits latency at a fundamental level, so proximity will always win there.

I tend to think that storage speed/density tech rather than networking is where the true innovations will eventually need to happen for datacenters. You can treat a datacenter as a computer, but you can't ignore the fact that light takes longer to travel from one end of a DC to another than it would from one end of a microchip.

> The end of Moore’s Law means how programs are built is changing.

I feel like this is just not true. Yes we have slowed down significantly over the past couple of years. BUT I do believe that once something big happens (My bet is the replacement of Silicon in chips or much stronger batteries). I feel like Moores law will pick up right where it left off.

Betting on the technological advances we haven't thought of yet to happen is pretty safe - but betting on them solving a particular current engineering challenge is a good way to fail.


Since you add no value to the discussion, this is just spam -- imho pretty shameless indeed.

Next time, at least share a story of something cool you have done, that would make your post much more appealing.

Shameless plugs like this were at one time OK on HN.

Also, it does add value to the discussion. At my company we've been using Cumulus Linux on white box switches to be able to build huge Clos networks at a fraction of the cost of traditional vendors like Juniper or Cisco. Such white box solutions are essential for building networks like this because traditional vendors are far too cost prohibitive.

Also, if you still doubt the relevance of Cumulus, the top comment on HN's previous thread specifically mentioned Cumulus, and that comment was from someone involved in building the first three generations of network:


Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | Legal | Apply to YC | Contact