
The same panel also has stats like "max number of hosts simultaneously receiving a deployment (3000)". Depending on how you run the numbers, those 3000 hosts could each take nine hours to receive a single deployment and the total would still average out to roughly one deploy every 11.6 seconds.

Unless their systems are heavily modularized, I have a bit of a hard time believing that something new at Amazon goes live every 11.6 seconds. Maybe I'm wrong, but I'd love to have a better grasp on the context involved here.

Wow. Look at me on the front page of Hacker News!

Our systems are extremely modular. We've previously disclosed that in excess of a hundred discrete services may be called to generate a single page on our web site. You can find more info about that at the following link.


When we refer to a deployment at Amazon it means a single code push to one or more servers. For example, if you deploy a new piece of code to a thousand hosts that counts as one deployment. In other words a distinct update is pushed every 11.6 seconds.
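To make the counting rule concrete, here is a rough sketch of the arithmetic. The 11.6 second figure is the one quoted in this thread; the push log, service names, and host counts are made up purely for illustration.

```python
SECONDS_PER_DAY = 24 * 60 * 60
MEAN_INTERVAL_S = 11.6  # mean time between deployments, as quoted in the thread

# Implied number of deployments per day if the mean interval holds all day.
deployments_per_day = SECONDS_PER_DAY / MEAN_INTERVAL_S

# Hypothetical push log: each entry is one code push, however many hosts it hits.
pushes = [
    {"service": "svc-a", "hosts": 1000},
    {"service": "svc-b", "hosts": 3},
    {"service": "svc-c", "hosts": 1},
]

# Under the definition above, each push counts once, regardless of host count.
deployment_count = len(pushes)

# The number of individual host updates is a separate, much larger figure.
host_updates = sum(p["hosts"] for p in pushes)

if __name__ == "__main__":
    print(f"~{deployments_per_day:.0f} deployments per day")
    print(f"{deployment_count} deployments, {host_updates} host updates")
```

So a push to a thousand hosts and a push to one host each move the deployment counter by exactly one.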

Hopefully that makes sense.

Yeah, it does, and thanks for taking the time to point me at something a little meatier than slides.

I guess it's just hard to imagine that kind of situation when I'm on a two-man web dev team that pushes to the testing server 10-20 times a day and to production once a week, if that.

> "Unless their systems are heavily modularized"

I used to work for Amazon. This is exactly how things are, to a scale that's hard to comprehend.

Knowing how their stuff works internally, a prod deployment every 11.6 seconds is not hard to imagine at all.

This. I also used to work at Amazon until recently. Amazon understands decoupling, deeply. And it's applied everywhere. In the code. In the architecture. How teams are organized. How processes are designed.

I'd love to see a pros-and-cons of this approach from a team organizational standpoint.

I think that decoupling teams boils down to giving teams complete ownership. And Amazon got parts of it right. It means that your team owns everything it builds. You own the code, you own the testing and you own the operations: you own the product. Various tools are laid at your feet, and you are asked to build.

Clearly, a benefit is that you can move fast. You don't need permission from someone half a building away to do something. You don't need to touch code that needs another team's approval. There are no committees that decide on global rules. Your team decides on your team's rules.

Like a shared-nothing architecture, there's very little that is shared between teams. Teams are often connected only via their service interfaces. Not much else beyond common tooling.

But even their tooling reflects decoupling. Every tool follows the self-service model ("YOU do what you WANT to do with YOUR stuff"). Their deployment system (named Apollo, mentioned in the slides), their build system, and their many other tools all reflect this model.

Cons. What happens is that you might be reinventing the wheel at Amazon. Often. Code reuse is very low across teams. So there's no shared cost of ownership at Amazon, more often than not. It's the complete opposite at Google w.r.t. code reuse. There are many very high-quality libraries at Google that are designed to be shared. Guava (the Java library) is a great example.

Another con. You may not know what you're doing. But as a team you will still build a rickety solution that gets you to a working solution. This is the result of giving a team complete ownership: they'll build what they know with what they have. Amazon is slowly correcting some of these problems by having teams own specific Hard Problems. A good example is storage systems.

And a lack of consistency is a common issue across Amazon. Code quality and conventions fluctuate wildly across teams.

Overall, Amazon has figured out how to decouple things very well.

How do these services communicate with each other? How can a single page make hundreds of requests and yet get it all together in a fraction of a second?

There are several different communication methods between services, including REST, SOAP, message queues, and an internal service framework. It's a perfect example of bonafidehan's post.

As for the second question, a page generally doesn't have to make hundreds of requests. You're thinking of a flat architecture. Think of it more like a pipeline: data goes in at A, flows from A->B->C->D, page reads D. So you end up having to call a handful of services. That can be scaled by 1) caching, 2) careful selection of service calls (don't call ordering service unless you're placing an order), 3) asynchronous requests (you're typically going to be IO bound on the latency, so just spin up X service requests and then wait on them all). There are also other tricks that are fairly well known for reducing latency, such as displaying a limited set of information and loading the rest via AJAX.

As a disclaimer for the above, my work doesn't involve working with the Amazon.com website directly, so it's based on my limited view in my domain space.

If you own a page or service that calls a bunch of other services, you typically collect metrics on the latency of your downstream services. Amazon has libraries to facilitate this, and a good internal system for collecting and presenting this data. If one service is particularly troublesome, then you can reach out to that other team and get them to lower their latency. The other option is to pull in their data closer to you, in a format that you can consume quickly.
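The general idea of timing downstream calls can be sketched in a few lines. This is not the Amazon library mentioned above, which I can't show; `timed`, `latencies_ms`, and `slowest_service` are hypothetical names, and a real system would ship the samples to a metrics backend and look at percentiles rather than a plain mean.

```python
import time
from collections import defaultdict
from contextlib import contextmanager
from statistics import mean

# Hypothetical in-memory store: service name -> list of latency samples (ms).
latencies_ms = defaultdict(list)


@contextmanager
def timed(service: str):
    """Record how long the wrapped downstream call takes, even if it raises."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed_ms = (time.perf_counter() - start) * 1000.0
        latencies_ms[service].append(elapsed_ms)


def slowest_service() -> str:
    """Return the service with the highest mean recorded latency."""
    return max(latencies_ms, key=lambda s: mean(latencies_ms[s]))


if __name__ == "__main__":
    with timed("reviews"):
        time.sleep(0.02)   # pretend this is a slow downstream call
    with timed("catalog"):
        time.sleep(0.001)  # pretend this is a fast one
    print("slowest:", slowest_service())
```

Once you have numbers like these per dependency, it's obvious which team to go talk to.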

Just to clarify, the "deployment every 11.6 seconds" refers to all prod deployments, including internal applications and services. This doesn't mean that a deployment to the retail website (and dependent services) is done every 11.6 seconds, just that some production deployment is done at Amazon every 11.6 seconds.

disclaimer: I work for Amazon, and used to work in the presenter's org.

It would be fun to update those numbers once in a while to see how much we're speeding up, and team numbers to see who's fastest :)

Minor correction: the slides had the max number of hosts receiving a deployment at 30,000, not 3,000.

I think the most impressive stat is a little later in the presentation. Only ~0.001% of deployments actually cause an outage.

Anyone can create a system that generates a lot of deployments, but what really matters is that you can complete all of those deployments safely. Of course, that is still ~0.001% too many outages due to deployments and we are working hard to make that number zero.
