The same panel also has stats like "max number of hosts simultaneously receiving a deployment (3000)". Depending on how you run the numbers those 3000 hosts could all take nine hours to receive a single deployment and average out to 11.6 deploys / second / host.
Unless their systems are heavily modularized, I have a bit of a hard time believing that something new at Amazon goes live every 11.6 seconds. Maybe I'm wrong, but I'd love to have a better grasp on the context involved here.
Our systems are extremely modular. We've previously disclosed that in excess of a hundred discrete services may be called to generate a single page on our web site. You can find more info about that at the following link.
When we refer to a deployment at Amazon it means a single code push to one or more servers. For example, if you deploy a new piece of code to a thousand hosts that counts as one deployment. In other words a distinct update is pushed every 11.6 seconds.
This. I also used to work at Amazon until recently. Amazon understands decoupling, deeply. And it's applied everywhere. In the code. In the architecture. How teams are organized. How processes are designed.
I think that decoupling teams boils down to giving teams complete ownership. And Amazon got parts of it right. It means that your team owns everything it builds. You own the code, you own the testing and you own the operations: you own the product. Various tools are laid at your feet, and you are asked to build.
Clearly, a benefit is that you can move fast. You don't need permissions from someone half a building away to do something. You don't need to touch code that needs another team's approval. There are no committees that decides on global rules. Your team decides on your team's rules.
Like a shared nothing architecture, there's very little that is shared between teams. Teams are often connected only via their service interfaces. Not much else beyond common tooling.
But even their tooling reflects decoupling. Every tool follows the self-service model ("YOU do what you WANT to do with YOUR stuff"). Their deployment system (named Apollo, mentioned in the slides) and their build system, and their many other tooling, all reflect this model.
Cons. What happens is that you might be reinventing the wheel at Amazon. Often. Code reuse is very low across teams. So there's no shared cost of ownership at Amazon, more often than not. It's the complete opposite at Google w.r.t. code reuse. There are many very high-quality libraries at Google that are designed to be shared. Guava (the Java library) is a great example.
Another con. You may not know what you're doing. But as a team you will still build a rickety solution that gets you to a working solution. This is the result of giving a team complete ownership: they'll build what they know with what they have. Amazon is slowly correcting some of these problems by having teams own specific Hard Problems. A good example is storage systems.
And a lack of consistency is a common issue across Amazon. Code quality and conventions fluctuate wildly across teams.
Overall, Amazon has figured out how to decouple things very well.
There's several different communication methods between services, including REST, SOAP, message queues, and an internal service framework. Its a perfect example of bonafidehan's post.
As for the second question, a page generally doesn't have to make hundreds of requests. You're thinking of a flat architecture. Think of it more like a pipeline: data goes in at A, flows from A->B->C->D, page reads D. So you end up having to call a handful of services. That can be scaled by 1) caching, 2) careful selection of service calls (don't call ordering service unless you're placing an order), 3) asynchronous requests (you're typically going to be IO bound on the latency, so just spin up X service requests and then wait on them all). There are also other tricks that are fairly well known for reducing latency, such as displaying a limited set of information and loading the rest via AJAX.
As a disclaimer for the above, my work doesn't involve working with the Amazon.com website directly, so its based on my limited view in my domain space.
If you own a page or service that calls a bunch of other service, you typically collect metrics on latency of your downstream services. Amazon has libraries to facilitate this, and a good internal system for collecting and presenting this data. If one service is particularly troublesome, then you can reach out to that other team and get them to lower their latency. The other option is to pull in their data closer to you, in a format that you can consume quickly.
Just to clarify, the "deployment every 11.6 seconds" refers to all prod deployments, including internal applications and services. This doesn't mean that a deployment to the retail website (and dependent services) is done every 11.6 seconds, just that some production deployment is done at Amazon every 11.6 seconds.
disclaimer: I work for Amazon, and used to work in the presenter's org.
I think the most impressive stat is a little later in the presentation. Only ~0.001% of deployments actually cause an outage.
Anyone can create a system that generates a lot of deployments, but what really matters is that you can complete all of those deployments safely. Of course, that is still ~0.001% too many outages due to deployments and we are working hard to make that number zero.