
A peek at the massive scale of AWS - k4jh
http://www.enterprisetech.com/2014/11/14/rare-peek-massive-scale-aws/
======
pixelmonkey
The thing I learned that blew me away is that with their new networking stack,
the cross-AZ communication within a region is "always <2ms latency and usually
<1ms".

As mentioned in the slides, that's the same latency usually associated with
SSDs and 100X better than the latency typically associated with cross-region
networking. This suggests to me that running distributed databases across
multiple AZs has almost no latency penalty (e.g. you won't even be paying the
eventual consistency / replication lag taxes that you might think are a
danger). That's pretty damn cool.
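
If you want to sanity-check that figure yourself, a crude probe is enough.
Here's a minimal sketch, assuming two Linux instances you control in different
AZs; the peer address below is a placeholder, and the far side is assumed to
run a trivial TCP echo loop on that port:

    import socket
    import time

    PEER = ("10.0.1.23", 9000)  # placeholder: private IP of the instance in the other AZ
    SAMPLES = 100

    rtts = []
    with socket.create_connection(PEER) as s:
        s.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
        for _ in range(SAMPLES):
            start = time.perf_counter()
            s.sendall(b"ping")
            s.recv(4)  # wait for the 4-byte echo from the peer
            rtts.append(time.perf_counter() - start)

    rtts.sort()
    print("median RTT: %.3f ms" % (rtts[len(rtts) // 2] * 1000))

TCP adds a little overhead over a raw ping, so treat the number as an upper
bound on the network latency itself.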

~~~
jpgvm
I was a big Xen guy back in the day.

The big takeaway from Hamilton's talk was actually something that dawned on me
when they released the c3/m3/i2 instances: what they have done with SR-IOV.

The software stack that implements the AWS VPC fast path must be running on
the NIC itself. This is a big deal. Being able to implement the required
isolation/tunneling/encapsulation logic and routing in the NIC is a huge
performance boost and drastically simplifies the hypervisor. Well, the
software portion of the hypervisor at least.

If you listened closely to the talk, he laid out the latency impact of each
layer in the networking stack, from the fibre (nanoseconds) up to the software
stack (milliseconds), several orders of magnitude worse than anything else in
the stack.

The reason why the quoted latency is so large is how para-virtualized
networking is implemented. In order for it to be (bandwidth) performant it
needs to use large ring buffers with many segments. This is very throughput
optimised and latency suffers a ton as a result.

By moving all of the queuing to the NIC you get a bunch of benefits, namely
the dom0 (basically the host in Xen terminology) is no longer involved in
pushing packets and you are not incurring the cost of the Linux networking
stack 2x. In the paravirt model the skbs are transferred across the circular
buffer into the host OS where they are injected into a virtual interface, thus
traversing the full net stack again.

In the SR-IOV model the address space of the virtual NIC is mapped into the
guest OS using Intel's IO-MMU extensions and the guest is then able to
communicate directly with the NIC, thus 100% bare metal performance.

If SR-IOV were the only improvement it would be impressive, however it's the
consequence of its existence which makes the biggest difference. If the guest
is talking directly to the NIC, then all of the encap/decap is HW accelerated
too, and in theory this means the full networking stack is end-to-end in HW.
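
One easy way to see which model a given instance is using, from inside the
guest, is to look at which driver backs the NIC: on the SR-IOV "enhanced
networking" instance types the VF typically shows up under Intel's ixgbevf
driver, whereas a plain paravirt instance uses the Xen netfront device. A
rough sketch, assuming a Linux guest where eth0 is the primary interface:

    import os

    IFACE = "eth0"  # adjust for your instance

    def nic_driver(iface):
        link = "/sys/class/net/%s/device/driver" % iface
        try:
            return os.path.basename(os.readlink(link))
        except OSError:
            return None  # paravirt netfront devices may not expose this link

    driver = nic_driver(IFACE)
    if driver == "ixgbevf":
        print("SR-IOV VF (enhanced networking) in use")
    else:
        print("driver %r; probably paravirtualized networking" % driver)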

~~~
Andys
Note: standard out-of-the-box SR-IOV allows VLAN tagging/stripping outside of
the guest's control; maybe they are simply using this in conjunction with a
layer 3 switch to handle the VPC stuff?

~~~
electrum
Amazon VPC is far more complex:
[https://www.youtube.com/watch?v=Zd5hsL-JNY4](https://www.youtube.com/watch?v=Zd5hsL-JNY4)
(an excellent talk from one of the creators of VPC)

------
mino
> This is a course of action, Hamilton said laughing, where people “would get
> you a doctor and put you in a nice little room where you were safe and you
> can’t hurt anyone.”

On a much much much smaller scale, this is also what I'm working on during my
current sabbatical:
[https://ripe69.ripe.net/archives/video/177/](https://ripe69.ripe.net/archives/video/177/)

I thought it may be relevant to post, as our code is under the GPL and we're
willing to collaborate with anyone.

~~~
contingencies
I really liked your RIPE talk, good job. Can I ask which tools, if any, you
use for network simulations?

------
ghshephard
Awesome Article/Review.

This is the first reference I've heard to economies of scale and "blast
radius" concerns (i.e. how much damage occurs if a data center goes down).
Apparently Amazon feels that at around 80,000 (or so) servers, it makes more
sense to build new data centers than to make existing ones bigger.

This is why Availability Zones have multiple data centers (as many as 6 (10?)
in US-East).

Also, while I was aware that Amazon was looking at building their own network
stack - I wasn't aware that they'd replaced all their Cisco/Juniper gear with
white-label ODMs with their own custom software stack. Now that's a company
that takes networking seriously.

~~~
pyvpx
bespoke network equipment and associated software stacks are what everything
in the datacenter, and hopefully in the office/home, will be running in the
next five years.

~~~
epistasis
Rather than bespoke, hopefully commodity. We've already switched to commodity
networking hardware, and will never go back where possible. (Currently big
edge routers still need to be from proprietary vendors, I believe.)

------
patman81
I hope Hamilton's talk from this year's AWS re:Invent conference will be
released as video. His slides are online
([http://mvdirona.com/jrh/work/](http://mvdirona.com/jrh/work/)), and many
other talks have already been released
([https://reinvent.awsevents.com/sessions.html](https://reinvent.awsevents.com/sessions.html)),
but Hamilton's talk seems to be missing.

~~~
Twirrim
It has been, or at least the talk that this article appears to be based on has:

[https://www.youtube.com/watch?v=JIQETrFC_SQ](https://www.youtube.com/watch?v=JIQETrFC_SQ)

~~~
patman81
Perfect, just what I was looking for!

------
softdev12
I'm a longtime AWS user. Recently I decided to see how the competitors' cloud
offerings stack up, because I've read a lot about how the second-movers in the
space have caught up to Amazon, and that now there's basically no difference
between the offerings.

I decided first to try Google - specifically Google App Engine. Just to see if
I could get a plain vanilla base case working quickly. And my initial reaction
was that AWS is still head-and-shoulders above everyone (or at least Google).
The Google UI and setup process seemed ridiculously complicated and
unfriendly. With AWS, I was up and running almost immediately. Not so with
Google.

So I immediately ran back to AWS and dropped Google. I'm not sure if my bad
experience with Google was because I had framed my expectations through my AWS
experience and thus wasn't able to use Google the way it's intended to be used.
But it just seemed way too unfriendly. It seemed to require needless dev
installs that should just be automated.

When I went back to AWS, the sheer number of services they offer seemed
staggering by comparison. Google still has products in beta that AWS offers as
mature services.

I was going to try out other competitors like Azure, Digital Ocean, etc. but
now feel like there's no need. AWS is just good.

~~~
matthewmacleod
I'm finding this testimonial a bit difficult to believe. App Engine is very
obviously a different product from the whole AWS stable, and that's something
that you should be aware of if you're in a position to be comparing them.

AWS is a high-quality, extensive offering, but it's not suitable for every
situation. The 'sheer amount of services' are in some cases lacklustre
reimplementations of services you could run yourself on EC2, for example.

What this boiled down to is "I wanted to compare things to AWS, so I had a
half-assed look at something that really isn't a competitor then immediately
stopped looking." That's not really convincing.

~~~
jpgvm
It's also missing a lot of important functionality if you are trying to
implement anything other than the basic case.

A perfect example would be the difference between GCE networking and AWS VPC.
GCE networks support routes with priorities, and if you insert multiple routes
into the routing table with equal priorities it does what you would expect:
equal-cost multi-pathing.

This makes it really easy to implement proper scalable NAT for private
instances, which is just pure pain in AWS VPC.
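
For what it's worth, here's roughly what that looks like against the GCE v1
API; this is a sketch from memory rather than a drop-in recipe: the project,
zones, instance names and tag are all placeholders, and auth/error handling is
omitted. Two default routes with the same priority, each pointing at a
different NAT instance, spread egress from the tagged private instances across
both:

    from googleapiclient import discovery

    PROJECT = "my-project"
    NETWORK = "projects/%s/global/networks/default" % PROJECT
    NAT_INSTANCES = [
        "projects/%s/zones/us-central1-a/instances/nat-a" % PROJECT,
        "projects/%s/zones/us-central1-b/instances/nat-b" % PROJECT,
    ]

    compute = discovery.build("compute", "v1")
    for i, nat in enumerate(NAT_INSTANCES):
        body = {
            "name": "default-via-nat-%d" % i,
            "network": NETWORK,
            "destRange": "0.0.0.0/0",
            "priority": 800,           # same priority on both routes
            "nextHopInstance": nat,
            "tags": ["no-public-ip"],  # only applies to instances carrying this tag
        }
        compute.routes().insert(project=PROJECT, body=body).execute()

Reproducing that in a VPC route table means a single NAT instance per route,
with no equal-cost spreading, which is where the pain comes in.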

There are many more examples of this, and AWS is not the only culprit; both
GCE and Azure have either missing features or mis-features that make me want
to flip a desk sometimes.

------
StillBored

> But the surprising thing, even to Hamilton, was that network availability
> went up, not down.

I'm not sure why that was a surprise; KISS applies just as well to networking
as to most other parts of the data center.

Networking, and storage (to a lesser extent), seem to have enterpriseitis. By
that, I mean the companies involved in it do everything in their power to
maintain their high margins. One of the ways they do this is adding features
that everyone just has to have (or so sales will tell you), even when many of
those features are things that should _NEVER_ actually be used in practice,
due to their effects on performance or reliability (take WAN accelerators for
example, otherwise known as how to make your network appear faster 90% of the
time and completely non-functional for the remaining 10%).

What I really wish is that the Amazons, Microsofts, Googles, Facebooks, etc.
would get together and actually release these switches/etc. they are building,
though I imagine they never will because they view it as a competitive
advantage. It sure would be nice to have some of this stuff available for
medium-sized data centers without having to spend $$$$$ with the established
vendors just to get something that can do 100Gbit.

~~~
electrum
Facebook has released designs for everything from servers and racks to entire
data centers as part of the Open Compute Project (which has many more member
organizations than just Facebook):
[http://www.opencompute.org/](http://www.opencompute.org/)

A design for network switch hardware and software was recently announced:
[https://code.facebook.com/posts/681382905244727/introducing-...](https://code.facebook.com/posts/681382905244727/introducing-wedge-and-fboss-the-next-steps-toward-a-disaggregated-network/)

------
petercchang
Great article. Amazon is able to produce a more reliable datacenter by
creating network gear specifically for their use case. By focusing on just
what they need, they can get rid of all the bloat, complications, and expense
in general network systems. The simpler the system, the more reliable it is.
Another reason general computing will all shift to the cloud.

------
hga
A fantastic overview, ranging from regions (the number of them and how they're
organized) all the way down to their new latency-reducing network interface
card.

------
bbrian
In Ireland, Amazon bought a Tesco distribution centre [0] for AWS. Tesco
vacated it for another building [1] which has one of the largest floor spaces
in Europe.

[http://wikimapia.org/#lang=en&lat=53.293772&lon=-6.349669&z=...](http://wikimapia.org/#lang=en&lat=53.293772&lon=-6.349669&z=14&m=b)
[http://www.punchconsulting.com/our-projects/logistics/tesco-...](http://www.punchconsulting.com/our-projects/logistics/tesco-distribution-centre-donabate/)

------
polskibus
I wonder how much of this custom tweaking ends up back in the Linux community
due to the GPL? If I am interacting with a GPL hypervisor, can I request the
sources for it, for instance? I mean, AWS is a huge Linux success story; does
anyone know if Amazon gives anything back?

~~~
aragot
With the GPL, you only have to distribute the source to those to whom you
distributed the binaries, which means Amazon isn't required to give back.

Besides, I don't remember examples of Amazon being nice - not that it doesn't
happen, but they don't communicate about it.

------
Sven7
I think Amazon is just riding the tiger here. One thing to remember is that
back in the day they were massively over-provisioning just to support their
peak loads during Thanksgiving and Christmas. All that spare capacity became
AWS. I bet their utilization rates aren't any different today... quite
possibly worse.

~~~
jeffbarr
> massively over provisioning just to support their peak loads during
> Thanksgiving and Christmas.

Not really true. As part of an annual capacity planning exercise each team was
required to plan and scale for holiday peaks. Infrastructure is not "free" and
each team has to optimize for good performance at an affordable cost.

> All that spare capacity became AWS.

Not true, never was. I regularly reviewed and provided feedback on the
original narrative document for AWS. While I don't have a copy handy, I am
absolutely certain that the document was focused on providing infrastructure
services to developers.

> I bet their utilization rates aren't any different today ...quite possibly
> worse.

I don't have access to those numbers, and have no permission to share them
even if I did. Your thought model for utilization needs to take the EC2 Spot
Market into account. Savvy users of EC2 have learned to optimize their large-
scale compute jobs and their bidding process to gain access to what would
otherwise be (to your point) underutilized capacity.

The recent "Gojira" run by Cycle Computing (details at
[http://www.marketwired.com/press-release/cycle-computing-sof...](http://www.marketwired.com/press-release/cycle-computing-software-deploys-50000-cores-23-minutes-largest-fortune-500-cloud-cluster-1966843.htm)
) is a great example of how the Spot Market can be used to great advantage by
clever developers.

~~~
Sven7
I am sure they don't release numbers on how many "clever users" they have,
either.

The way I think about it is: can the utilization rate grow at the same rate or
higher than the machine count at their data centers, and for how long?

With all the levels of virtualization available and their market leadership
_today_, that curve can look quite magical, I accept. But for how long? It
seems quite a shaky curve to be betting 5 million machines on.

