
Espresso – Google’s peering edge architecture - vgt
https://www.blog.google/topics/google-cloud/making-google-cloud-faster-more-available-and-cost-effective-extending-sdn-public-internet-espresso/
======
itchyjunk
"We defined and employed SDN principles to build Jupiter, a datacenter
interconnect capable of supporting more than 100,000 servers and 1 Pb/s of
total bandwidth to host our services."

This type of scale boggles my mind. I've found I can no longer keep up with
all the terminology popping up every day; posts like these are my only
connection to learning about the massive scaling that makes modern networks
work.

"We leverage our large-scale computing infrastructure and signals from the
application itself to learn how individual flows are performing, as determined
by the end user’s perception of quality." Is this implying they are using
machine learning to improve their own version of a content delivery network?

~~~
sroussey
The Google network is gold plated; it lacks the jitter inherent in the
internet at large or inside other competitors' networks. That makes it
tempting to ignore some aspects of distributed computing, if only for a
moment.

~~~
felipemnoa
Could you expand a bit more on your comment? I feel I'm missing some context.
Specifically, what do you mean by gold plated? Why is it tempting to ignore
some aspects of distributed computing? I'm missing a lot of the context that
you're implying, so could you elaborate?

~~~
walrus01
It's gold plated because they basically built their own ISP by acquiring:

a: dark fiber IRUs between cities/metro areas

b: N x 10 Gbps and 100 Gbps wavelengths as L2 transport services from city to
city, from a major carrier such as Level 3 or Zayo

c: some combination of a and b

and they use that to build backbone links between their own network equipment,
which they fully control. Google is its own AS and operates its own transport
network across the lower 48 US states and around the world.

The exact design of what they're doing within their own AS at layers 1 and 2
is pretty opaque unless you happen to be a carrier partner willing to violate
a whole raft of NDAs. But basically they've built their own backbone at
massive scale, without the huge capital expense of actually laying their own
fiber between cities.

Their network has incredibly low jitter because they don't run their links to
saturation, and they know EXACTLY what the latency is supposed to be from
router interface to router interface between the pairs of core routers
installed in each major city. Down to five decimal places, most likely. When
you have your own dark fiber IRUs and operate your own WDM transport
platforms, you are in possession of things like OTDR traces that tell you the
length of your fiber path, in km, to four decimal places.

It also helps that the sort of people who have 'enable' on the AS15169 routers
and core network gear are recruited from the top tier of network engineers and
appropriately compensated. If they weren't working for Google they would be
working for another major global player like NTT, DT, France Telecom/Orange,
SingTel or Softbank.
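
To put numbers on that: propagation delay in fiber is just path length divided
by the speed of light in glass, so an OTDR-measured length translates directly
into an expected latency. A back-of-the-envelope sketch in Python (the group
index of ~1.468 is a typical figure for standard single-mode fiber, not a
Google-specific number):

    # Back-of-the-envelope: light in standard single-mode fiber travels at
    # roughly c / 1.468, i.e. about 4.9 microseconds per km.
    C_VACUUM_KM_PER_S = 299_792.458   # speed of light in vacuum, km/s
    GROUP_INDEX = 1.468               # typical for G.652 single-mode fiber

    def one_way_delay_ms(fiber_km):
        """Expected one-way propagation delay, in milliseconds."""
        return fiber_km / (C_VACUUM_KM_PER_S / GROUP_INDEX) * 1000.0

    # A hypothetical 1234.5678 km span from an OTDR trace:
    print(f"{one_way_delay_ms(1234.5678):.5f} ms one-way")  # ~6.045 ms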

~~~
puzzle
Where do you get the crazy idea that Google doesn't run its links to
saturation? It's crazy because all that idle capacity would cost an enormous
amount of money.

The B4 paper states multiple times that Google runs links at almost 100%
utilization, versus the standard 30-40%. That's accomplished through the use
of SDN technology and, even before that, through strict application of QoS.

[https://web.stanford.edu/class/cs244/papers/b4-sigcomm2013.pdf](https://web.stanford.edu/class/cs244/papers/b4-sigcomm2013.pdf)

A few more details about strategies here:

[https://research.google.com/pubs/archive/45385.pdf](https://research.google.com/pubs/archive/45385.pdf)

Then there's a whole bunch of other host-side optimizations, including the use
of new congestion control algorithms.

[http://queue.acm.org/detail.cfm?id=3022184](http://queue.acm.org/detail.cfm?id=3022184)

You might recognize the name of the last author...
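
That Queue article describes BBR. For anyone who wants to experiment: a Linux
kernel with the tcp_bbr module available lets an application opt into a
congestion control algorithm per socket. A minimal illustration, not Google's
production setup:

    # Opt a socket into a congestion control algorithm (Linux-only; needs
    # kernel 4.9+ with tcp_bbr loaded and allowed).
    import socket

    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.setsockopt(socket.IPPROTO_TCP, socket.TCP_CONGESTION, b"bbr")
    # Read it back to confirm; prints something like b'bbr\x00...'
    print(s.getsockopt(socket.IPPROTO_TCP, socket.TCP_CONGESTION, 16))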

~~~
walrus01
What I mean is that they do not run their links to saturation in the same way
an ordinary ISP does. Because their traffic patterns are very different from
an ordinary ISP's, and _much, much more_ geographically distributed, they can
do all sorts of fun software tricks. The end result is the same: low/no jitter
and no packet loss.

Contrast that with what would happen if you had a theoretical hosting
operation behind 2 x 10 Gbps transit connections to two upstreams and tried to
run both circuits at 8 to 9 Gbps outbound, 24x7.

~~~
danpalmer
For clarity, do you mean that Google can, for example, run at 99% saturation
all the time, whereas a typical ISP might average 30-40%, with peaks to full
saturation that cause high latency/packet loss when they occur?

~~~
pas
Yes, that's about right. Since they control both sides of the link, they can
manage the flow from higher up the [software] stack. Basically, if the link is
getting saturated, the distributed system simply throttles some requests
upstream, by diverting traffic away from the places that would send it over
that link. (Of course this requires a very complex control plane, but it's
doable, and with proper [secondary] controls it probably stays understandable
and manageable, and doesn't go haywire when shit hits the fan.)
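
A toy sketch of that upstream-throttling move (all names and thresholds
invented for illustration): a controller watches per-link utilization and
shifts serving weight away from a hot link before queues build:

    # Toy controller (hypothetical names): shed load from links running hot.
    HIGH_WATERMARK = 0.95   # start diverting at 95% utilization

    def rebalance(links, weights):
        """links: name -> utilization (0..1); weights: name -> share of new
        requests routed over that link. Returns adjusted weights."""
        hot = {l for l, u in links.items() if u > HIGH_WATERMARK}
        cold = [l for l in links if l not in hot]
        if not hot or not cold:
            return weights
        new = dict(weights)
        for l in hot:
            shed, new[l] = new[l] * 0.2, new[l] * 0.8  # shed 20% of its share
            for c in cold:
                new[c] += shed / len(cold)
        return new

    # Link "a" is running hot, so a slice of its traffic moves to "b":
    print(rebalance({"a": 0.97, "b": 0.60}, {"a": 0.5, "b": 0.5}))
    # -> {'a': 0.4, 'b': 0.6}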

~~~
collinmanderson
So I wonder if that means they can do TCP flow control without dropping
packets.

~~~
pas
I guess they do drop packets (it's the best - easiest/cheapest/cleanest - way
to propagate pressure back upstream, aka backpressure), but they watch for it
much more vigilantly. Also, as I understand it, they try to separate
long-lived connections (between DCs) from internal short-lived traffic.
Different teams, different patterns, different control structures.

------
mmaunder
"Google has one of the largest peering surfaces in the world, exchanging data
with Internet Service Providers (ISPs) at 70 metros and generating more than
25 percent of all Internet traffic. "

Wow.

~~~
fiatjaf
70 metros?

~~~
cakeface
I wonder if they are referring to a peering Internet exchange point (IXP) when
they say metro. Basically a building where networks converge and ISPs connect
to each other.

~~~
walrus01
yes, though "metro" is a better way to define it since many IXes are
geographically distributed throughout their city. For example DE-CIX in
frankfurt is in many different datacenters, with their core switches connected
by DE-CIX controlled dark fiber. AMS-IX in amsterdam is in many facilities in
the same metro area, all the same L2 peering fabric. The SIX in Seattle is in
three facilities in the same metro and several local ISPs have built their own
extensions of it to Vancouver BC.

------
Apocryphon
The official Android testing framework from Google is also named Espresso. Are
we running into a classic hard computer science problem?

~~~
Y_Y
Off-by-one errors?

~~~
grkvlt
I'm sure he means something to do with caches; I had it on the tip of my
tongue a moment ago, but the doorbell rang.

~~~
bitwiseand
"There are only two hard problems in CS : 1\. Naming things. 2\. Cache
invalidation. "

~~~
coderholic
"There are only two hard problems in CS : 1. Naming things. 2. Cache
invalidation. 3. Off-by-one errors"

------
smaili
The essence of what Espresso is begins towards the end of the post:

 _Espresso delivers two key pieces of innovation. First, it allows us to
dynamically choose from where to serve individual users based on measurements
of how end-to-end network connections are performing in real time._

 _Second, we separate the logic and control of traffic management from the
confines of individual router “boxes.”_

~~~
EE84M3i
I found this article confusing. It doesn't really say anything about what
"Espresso" actually does, let alone _how_ it does it.

~~~
zaroth
For some reason they didn't actually link to the talk, which I haven't
watched, but which presumably starts to answer those questions.

A quick search turns up the 2015 ONS keynote that Amin gave; I haven't found
the 2017 one yet...

[1] - 2015 ONS Keynote
[https://www.youtube.com/watch?v=FaAZAII2x0w](https://www.youtube.com/watch?v=FaAZAII2x0w)

~~~
mikecb
And here's 2014, which was on the Andromeda NFV stack:
[https://www.youtube.com/watch?v=n4gOZrUwWmc](https://www.youtube.com/watch?v=n4gOZrUwWmc)

And B4:
[https://www.youtube.com/watch?v=tVNlXg0iN-g](https://www.youtube.com/watch?v=tVNlXg0iN-g)

------
dkhenry
I think with platforms like this it is now safe to say that the systems and
services Google is deploying are no longer in the same category as classical
networked systems. This is as foreign to traditional networking and the
seven-layer OSI model as non-von Neumann computing is to von Neumann
computing.

~~~
insaneirish
> This is as foreign to traditional networking and the seven-layer OSI model
> as non-von Neumann computing is to von Neumann computing

Not really. The OSI model doesn't say anything about where I run my routing
algorithm and BGP application vs. where my actual switches are.

"Classical" networking is an artifact of viewing routers/switches as
monolithic blocks that embed all of their functionality in one black box. I
said BGP _application_ above because that's what it is, an application for
distributing/communicating state. The same can be said for many other parts of
networking traditionally embedded in the monolithic blob we often call a
router.

Label-switched fabrics provide inherent NFV and security functions, and allow
you to influence paths (i.e., to traffic-engineer) from applications that are
equipped to make decisions based on your priorities, not some rigid vendor
implementation.

You will see more of this.
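
A toy of what that separation looks like (topology and names invented for
illustration): one central "application" computes routes over the whole graph
and hands each forwarding element a dumb next-hop table:

    # Toy control plane (invented topology): compute routes centrally, push
    # plain next-hop tables to dumb forwarding elements.
    import heapq

    TOPOLOGY = {  # node -> {neighbor: link cost}
        "sw1": {"sw2": 1, "sw3": 4},
        "sw2": {"sw1": 1, "sw3": 1},
        "sw3": {"sw1": 4, "sw2": 1},
    }

    def next_hops(src):
        """Dijkstra from src; returns {dest: first hop} - a forwarding table."""
        dist, table = {src: 0}, {}
        pq = [(0, src, None)]
        while pq:
            d, node, hop = heapq.heappop(pq)
            if d > dist.get(node, float("inf")):
                continue  # stale queue entry
            if hop is not None:
                table.setdefault(node, hop)
            for nbr, cost in TOPOLOGY[node].items():
                nd = d + cost
                if nd < dist.get(nbr, float("inf")):
                    dist[nbr] = nd
                    # first hop is inherited, or the neighbor itself at the source
                    heapq.heappush(pq, (nd, nbr, hop or nbr))
        return table

    # The "controller" programs every switch; switches just match and forward.
    for sw in TOPOLOGY:
        print(sw, "->", next_hops(sw))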

~~~
walrus01
> The OSI model doesn't say anything about where I run my routing algorithm
> and BGP application vs. where my actual switches are.

If you're $BIGASN and you set up an intra-building single-mode cross-connect
in $BIGCITY to establish settlement-free peering with $OTHERBIGASN (let's say,
for example, a 4 x 10 Gbps bonded 802.3ad circuit), they most assuredly are
going to notice if your BGP session and router are not directly on the other
end of that cable.

Because they are going to be expecting sub-1ms latency to your router, not
"we're taking this session, stuffing it into some sort of tunnel or
encapsulation, and sending it somewhere else, to wherever the thing that
actually speaks BGP is located." It's bad juju to practice deceptive peering.

~~~
insaneirish
> Because they are going to be expecting sub-1ms latency to your router, not
> "we're taking this session, stuffing it into some sort of tunnel or
> encapsulation, and sending it somewhere else, to wherever the thing that
> actually speaks BGP is located."

Why should they care?

> It's bad juju to practice deceptive peering.

I don't understand applying moral judgment to a technical design choice.

~~~
walrus01
I'm guessing you don't handle peering for a medium-to-large AS, so it's hard
to explain briefly. First: they should care because the point of establishing
peering in a given city is to give inter-AS traffic the absolute shortest path
and the smallest number of hops between two points. If I put a router in
Portland, OR, buy a 10 Gbps MPLS tunnel to Vancouver, BC, join the VANIX, and
ask to set up with peers there, all traffic will be taking a multi-hundred-km
round trip through Portland.

Second: it's not moral judgment, it's a technical best practice to actually
put routers in the city in which you set up new edge BGP sessions. Pretty
basic ISP stuff, in fact.
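
And the detour is trivially measurable from the far end: the handshake RTT
across an in-building cross-connect should be well under a millisecond, while
a multi-hundred-km tunnel adds several. A toy check (the address below is a
documentation placeholder, not anyone's real router):

    # Toy sanity check: TCP handshake RTT to a peer across an in-building
    # cross-connect should be well under 1 ms; a tunneled detour to another
    # city shows up immediately.
    import socket, time

    def connect_rtt_ms(host, port=179):   # 179 = BGP
        t0 = time.perf_counter()
        with socket.create_connection((host, port), timeout=2):
            return (time.perf_counter() - t0) * 1000.0

    # rtt = connect_rtt_ms("198.51.100.1")  # needs a live, reachable peer
    # print("suspicious detour" if rtt > 1.0 else "looks local", rtt)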

~~~
londons_explore
Just because they tunnel the BGP back to some software system to make routing
decisions doesn't mean they tunnel all the user data to that same location. To
do so would be silly.

In fact, a good system would have a couple of systems handling BGP, with their
physical location fairly irrelevant, but _acting_ as if they were local to the
peer they are talking to.

------
hueving
These presentations from Google are pretty irritating. If you're familiar with
the SDN field (as most ONS attendees would be), this presentation is
essentially nothing but bragging about the scale at which they operate.

There is no useful information in here to advance the state of the art, no new
ideas, and no publicly available implementations (closed or open source). It's
just a very high-level architectural view of a large network, given by people
who are incentivized to present it in the most favorable light. And thanks to
the lack of any concrete details, it's immune to critical analysis.

>Espresso delivers two key pieces of innovation. First, it allows us to
dynamically choose from where to serve individual users based on measurements
of how end-to-end network connections are performing in real time. Second, we
separate the logic and control of traffic management from the confines of
individual router “boxes.”

The first has been done before at many levels of the network:

* BGP anycast

* DNS responses based on querier

* Steering done in the load balancer

* IGP protocols to handle traffic internally while taking into account link congestion

I assume their framework gives them much nicer primitives to work with than
the above, which would be an advancement in the field if we could actually see
an API or something.
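
To illustrate how crude the older mechanisms are by comparison, here is
DNS-based steering in miniature (prefixes and VIPs are documentation
addresses, invented for illustration); Espresso's claim is to make this
decision per flow from real-time end-to-end measurements rather than a static
map:

    # DNS-style steering in miniature (all prefixes/VIPs are documentation
    # addresses): answer the same name differently depending on who asks.
    import ipaddress

    POP_BY_PREFIX = {
        ipaddress.ip_network("203.0.113.0/24"): "192.0.2.10",   # -> POP A
        ipaddress.ip_network("198.51.100.0/24"): "192.0.2.20",  # -> POP B
    }
    DEFAULT_VIP = "192.0.2.30"

    def resolve(client_ip):
        addr = ipaddress.ip_address(client_ip)
        for net, vip in POP_BY_PREFIX.items():
            if addr in net:
                return vip
        return DEFAULT_VIP

    print(resolve("203.0.113.77"))  # -> 192.0.2.10 (served from POP A)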

The second is very far from "innovation". This is the essence of SDN, which
has been the hottest thing since sliced bread in the networking world since at
least 2008 [1], and even earlier if you look at things like the Aruba wireless
controller.

1. [http://archive.openflow.org/documents/openflow-wp-latest.pdf](http://archive.openflow.org/documents/openflow-wp-latest.pdf)

~~~
runeks
I very much agree. I was hoping Espresso would be a framework for letting GCP
user applications leverage Google's SDN, rather than just letting Google offer
its own services on this technology. I hope that's the next step.

For example, it would be cool if it were possible to move shared client/server
secret checking (e.g. for an HTTP API) out to the edge of Google's network,
such that a DDoS attack with invalid secrets never even reaches the
application VM/cluster. DDoS attacks that force applications offline (by
driving the app to scale up to an unsustainable cost level) could be prevented
this way.
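
That edge check in miniature (the secret, body, and signature format are
invented for illustration): verify a shared-secret HMAC at the edge and drop
non-matching traffic before the backend ever sees it:

    # Hypothetical edge filter: verify a shared-secret HMAC before the
    # request is allowed anywhere near the backend.
    import hashlib, hmac

    SHARED_SECRET = b"example-secret"  # provisioned to client and edge

    def edge_should_forward(body, signature_hex):
        expected = hmac.new(SHARED_SECRET, body, hashlib.sha256).hexdigest()
        return hmac.compare_digest(expected, signature_hex)

    body = b'{"action": "ping"}'
    good = hmac.new(SHARED_SECRET, body, hashlib.sha256).hexdigest()
    print(edge_should_forward(body, good))      # True  -> forward
    print(edge_should_forward(body, "0" * 64))  # False -> drop at the edge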

~~~
cobookman
You can do that by using Google Cloud's HTTPS load balancer.

------
filereaper
Yup, this is the kind of thing you get when you put $30B into infrastructure.

[https://youtu.be/j_K1YoMHpbk?t=7472](https://youtu.be/j_K1YoMHpbk?t=7472)

------
NTDF9
Can someone with more expertise summarize how this differs from commercial SDN
solutions like Cisco ACI, Juniper Contrail, etc.?

~~~
djrogers
Unfortunately, no - at least not without quite a few more details. As this
stands, it could be the high-level marketing overview for pretty much any SDN
solution available, commercial or open.

~~~
NTDF9
Thanks!

As someone who recently entered this field professionally, I find it amusing
that most SDN solutions out there are just permutations of each other,
differentiated mainly by marketing buzzwords.

Not too different from "cloud computing" a few years ago.

~~~
djrogers
Yeah, once you get past the hype though, SDNs can be great.

------
apanda
This vision seems very similar to the 2011 talk by Scott Shenker:
[https://www.youtube.com/watch?v=YHeyuD89n1Y](https://www.youtube.com/watch?v=YHeyuD89n1Y)

~~~
SaveTheRbtz
This is a good talk about the decomposability of the control plane and about
creating proper abstractions for it.

PS. "The ability to master complexity is not the same as the ability to
extract simplicity" is a good takeaway.

PPS. This is part of EE 122:
[https://inst.eecs.berkeley.edu/~ee122/fa12/class.html](https://inst.eecs.berkeley.edu/~ee122/fa12/class.html)

PPPS. PDF of the SDN lecture:
[https://inst.eecs.berkeley.edu/~ee122/fa12/notes/25-SDN.pdf](https://inst.eecs.berkeley.edu/~ee122/fa12/notes/25-SDN.pdf)

------
danm07
Distracting aside: it's amusing how many names programming languages borrow
from the coffee industry. I wonder how long it will take to fill a Starbucks
menu.

------
piyushpr134
The biggest takeaway from this, for me, is that they can have multiple
machines behind the same IP address. That is just awesome, and it also
explains how they have probably managed to scale up services like 8.8.8.8
without needing load balancers.

~~~
biokoda
Anycast is pretty standard.

~~~
mmarx
Especially so for DNS servers, and since long before Google's nameservers[0].

[0] [https://tools.ietf.org/html/rfc3258](https://tools.ietf.org/html/rfc3258)

------
soVeryTired
What does "peering edge" mean? A google search only brings up this article.

~~~
dubcroster
A network is often described as having an edge and a core, and there can be
several types of "edges".

For a company like Google, you would most likely have an edge towards your
servers as well as an edge towards your peering partners. The peering edge is
therefore the part of the network used to connect to BGP peering partners.

------
KaoruAoiShiho
This is pretty impressive.

------
s73ver
Two Google products named Espresso? That won't be confusing at all.

~~~
knorker
Sure. "Two"

~~~
londons_explore
Found the employee...

------
neduma
tl;dr?

------
sebnap
Wow. Isn't this a trojan horse? People start to use it because of its
convenience, and then it will spread and spread and spread. I mean, what
happens when Google runs more or less everything?

------
Bud
Damn, for a moment there I was hoping that Google made some sort of really
cool espresso machine. Perhaps with Alexa built-in.

