Amazon deploys every 11.6 seconds (oreilly.com)
135 points by DanielRibeiro on Sept 7, 2011 | 24 comments

I highly recommend watching the talk: 15 minutes packed with insights for sysops. Thank you for posting the link!

The same panel also has stats like "max number of hosts simultaneously receiving a deployment (3000)". Depending on how you run the numbers, those 3000 hosts could all take nine hours to receive a single deployment and still average out to roughly 11.6 seconds / deploy / host.

Unless their systems are heavily modularized, I have a bit of a hard time believing that something new at Amazon goes live every 11.6 seconds. Maybe I'm wrong, but I'd love to have a better grasp on the context involved here.

Wow. Look at me on the front page of Hacker News!

Our systems are extremely modular. We've previously disclosed that in excess of a hundred discrete services may be called to generate a single page on our web site. You can find more info about that at the following link.


When we refer to a deployment at Amazon it means a single code push to one or more servers. For example, if you deploy a new piece of code to a thousand hosts that counts as one deployment. In other words a distinct update is pushed every 11.6 seconds.
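
To put the definition above into numbers, here is a minimal sketch of the arithmetic. It assumes the 11.6-second mean interval holds around the clock, which the thread does not confirm, and the `Push` type and function names are hypothetical, made up for illustration only.

```python
from dataclasses import dataclass

SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


@dataclass
class Push:
    """Hypothetical record of one code push (not an Amazon data model)."""
    service: str
    host_count: int


def count_deployments(pushes: list[Push]) -> int:
    # One push is one deployment, no matter how many hosts receive it.
    return len(pushes)


def deployments_per_day(mean_interval_seconds: float) -> float:
    # Assumes the mean interval applies uniformly across the whole day.
    return SECONDS_PER_DAY / mean_interval_seconds
```

Under that assumption, `deployments_per_day(11.6)` comes to a little under 7,500 deployments a day, and a push to a thousand hosts still adds only 1 to `count_deployments`.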

Hopefully that makes sense.

Yeah, it does, and thanks for taking the time to point me at something a little meatier than slides.

I guess it's just hard to imagine that kind of situation when I'm on a two-man web dev team that pushes to the testing server 10-20 times a day and to production once a week, if that.

> "Unless their systems are heavily modularized"

I used to work for Amazon. This is exactly how things are, to a scale that's hard to comprehend.

Knowing how their stuff works internally, a prod deployment every 11.6 seconds is not hard to imagine at all.

This. I also used to work at Amazon until recently. Amazon understands decoupling, deeply. And it's applied everywhere. In the code. In the architecture. How teams are organized. How processes are designed.

I'd love to see a pros-and-cons of this approach from a team organizational standpoint.

I think that decoupling teams boils down to giving teams complete ownership. And Amazon got parts of it right. It means that your team owns everything it builds. You own the code, you own the testing and you own the operations: you own the product. Various tools are laid at your feet, and you are asked to build.

Clearly, a benefit is that you can move fast. You don't need permission from someone half a building away to do something. You don't need to touch code that needs another team's approval. There are no committees that decide on global rules. Your team decides on your team's rules.

Like a shared nothing architecture, there's very little that is shared between teams. Teams are often connected only via their service interfaces. Not much else beyond common tooling.

But even their tooling reflects decoupling. Every tool follows the self-service model ("YOU do what you WANT to do with YOUR stuff"). Their deployment system (named Apollo, mentioned in the slides), their build system, and their many other tools all reflect this model.

Cons. What happens is that you might be reinventing the wheel at Amazon. Often. Code reuse is very low across teams. So there's no shared cost of ownership at Amazon, more often than not. It's the complete opposite at Google w.r.t. code reuse. There are many very high-quality libraries at Google that are designed to be shared. Guava (the Java library) is a great example.

Another con. You may not know what you're doing. But as a team you will still build a rickety solution that gets you to a working solution. This is the result of giving a team complete ownership: they'll build what they know with what they have. Amazon is slowly correcting some of these problems by having teams own specific Hard Problems. A good example is storage systems.

And a lack of consistency is a common issue across Amazon. Code quality and conventions fluctuate wildly across teams.

Overall, Amazon has figured out how to decouple things very well.

How do these services communicate with each other? How can a single page make hundreds of requests and yet pull it all together in a fraction of a second?

There are several different communication methods between services, including REST, SOAP, message queues, and an internal service framework. It's a perfect example of bonafidehan's post.

As for the second question, a page generally doesn't have to make hundreds of requests. You're thinking of a flat architecture. Think of it more like a pipeline: data goes in at A, flows from A->B->C->D, page reads D. So you end up having to call a handful of services. That can be scaled by 1) caching, 2) careful selection of service calls (don't call ordering service unless you're placing an order), 3) asynchronous requests (you're typically going to be IO bound on the latency, so just spin up X service requests and then wait on them all). There are also other tricks that are fairly well known for reducing latency, such as displaying a limited set of information and loading the rest via AJAX.
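
Points 1) and 3) above can be sketched roughly like this. It is only an illustration using Python's asyncio, not how Amazon's internal framework works: the service names, the `call_service` stand-in (which just sleeps), and the in-memory cache are all assumptions.

```python
import asyncio
import time

# Naive in-process cache keyed by service name (assumption: responses are
# safe to reuse; a real cache would need expiry and invalidation).
_cache: dict[str, str] = {}


async def call_service(name: str, latency: float) -> str:
    """Hypothetical stand-in for a network call to a downstream service."""
    await asyncio.sleep(latency)  # simulates IO-bound wait
    return f"{name}-data"


async def fetch(name: str, latency: float) -> str:
    # 1) caching: skip the call entirely when the answer is already known.
    if name in _cache:
        return _cache[name]
    result = await call_service(name, latency)
    _cache[name] = result
    return result


async def render_page(services: dict[str, float]) -> dict[str, str]:
    # 3) asynchronous requests: start every call, then wait on them all.
    names = list(services)
    results = await asyncio.gather(*(fetch(n, services[n]) for n in names))
    return dict(zip(names, results))


def timed_render(services: dict[str, float]) -> tuple[dict[str, str], float]:
    """Render once and return the page data plus wall-clock seconds taken."""
    start = time.perf_counter()
    page = asyncio.run(render_page(services))
    return page, time.perf_counter() - start
```

With three simulated services of 0.1 s each, the first `timed_render` call should take about as long as the slowest single call rather than the sum of all three, and a repeat call is served from the cache.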

As a disclaimer for the above, my work doesn't involve working with the Amazon.com website directly, so it's based on my limited view in my domain space.

If you own a page or service that calls a bunch of other services, you typically collect metrics on the latency of your downstream services. Amazon has libraries to facilitate this, and a good internal system for collecting and presenting this data. If one service is particularly troublesome, then you can reach out to that other team and get them to lower their latency. The other option is to pull their data in closer to you, in a format that you can consume quickly.
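
The kind of per-dependency latency tracking described above could look something like this minimal sketch. Amazon's actual libraries are not shown in the thread, so every name here (`record`, `timed`, `slowest_service`) is hypothetical, and a plain in-memory dict stands in for a real metrics system.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

# Latency samples in seconds, grouped by downstream service name.
latencies: dict[str, list[float]] = defaultdict(list)


def record(service_name: str, seconds: float) -> None:
    """Store one latency sample for a downstream service."""
    latencies[service_name].append(seconds)


@contextmanager
def timed(service_name: str):
    """Time the wrapped block and record it, even if the call raises."""
    start = time.perf_counter()
    try:
        yield
    finally:
        record(service_name, time.perf_counter() - start)


def mean_latency(service_name: str) -> float:
    samples = latencies[service_name]
    return sum(samples) / len(samples)


def slowest_service():
    """Return the service with the highest mean latency, or None if no data."""
    names = [name for name, samples in latencies.items() if samples]
    return max(names, key=mean_latency) if names else None
```

Wrapping each downstream call in `with timed("ordering"): ...` builds up the samples, and `slowest_service()` then points at the team to go talk to.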

Just to clarify, the "deployment every 11.6 seconds" refers to all prod deployments, including internal applications and services. This doesn't mean that a deployment to the retail website (and dependent services) is done every 11.6 seconds, just that some production deployment is done at Amazon every 11.6 seconds.

disclaimer: I work for Amazon, and used to work in the presenter's org.

It would be fun to update those numbers once in a while to see how much we're speeding up, and team numbers to see who's fastest :)

Minor correction: the slides had the max number of hosts receiving a deployment at 30,000, not 3,000.

I think the most impressive stat is a little later in the presentation. Only ~0.001% of deployments actually cause an outage.

Anyone can create a system that generates a lot of deployments, but what really matters is that you can complete all of those deployments safely. Of course, that is still ~0.001% too many outages due to deployments and we are working hard to make that number zero.

Even though amazon.com is all on EC2 and capacity is demand driven, someone is still buying servers and has some capacity overhead, right? They've just shifted the spend from the amazon.com business unit to the AWS business unit (assuming that's how it's set up)?

You can flatten the demand easily with EC2 by having the cost vary dynamically based on overall load. So (using his example) during the end of November Amazon themselves would want more servers and so the cost could go up slightly. A drug company doing discovery might decide then not to run their computations and to wait until the price drops back down. Likewise with someone cracking passwords, or mining bitcoins.

They have consistent pricing year-round for their on-demand and reserved instances. The spot instances are priced dynamically by auction, and their supply would be reduced when Amazon is using more instances itself. The price of the spot instances will never exceed that of the on-demand rate, since no-one would bid greater than a fixed rate for the same service. At peak usage, spot instances reach the same price as the on-demand rate.

> "The price of the spot instances will never exceed that of the on-demand rate, since no-one would bid greater than a fixed rate for the same service"

Checking the price history in the AWS console reveals that the prices for spot instances occasionally exceed the on-demand rate. In particular t1.micro instances reached $0.05/hr (vs the on-demand $0.02/hr). One possible explanation is that spot instances are more valuable because you can run more of them at a time than on-demand instances (100 total vs 20 total) without having to get an exemption for your use case. Another possible explanation is that people bid higher amounts to guarantee that their instances will run uninterrupted, knowing that even if the price briefly exceeds the on-demand price, the average should still be lower overall.
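
The second explanation is easy to check with a little arithmetic. The sketch below uses a simplified billing model as an assumption: you pay the market price for each hour in which your bid is at or above it, and your instance is interrupted otherwise. The $0.05 spike and $0.02 on-demand rate are the figures quoted above; any other prices fed in are hypothetical.

```python
def hours_interrupted(hourly_prices: list[float], bid: float) -> int:
    """Count the hours in which the market price exceeded the bid."""
    return sum(1 for price in hourly_prices if price > bid)


def average_hourly_cost(hourly_prices: list[float], bid: float):
    """Mean price paid over the hours actually run, or None if never run.

    Assumption: the bidder pays the market price, not the bid itself.
    """
    paid = [price for price in hourly_prices if price <= bid]
    return sum(paid) / len(paid) if paid else None
```

For a made-up day of 23 hours at $0.007 and one spike hour at $0.05, a $0.05 bid runs all 24 hours at an average well below the $0.02 on-demand rate, while a $0.02 bid gets interrupted during the spike.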

I don't know about "easily". If you take a higher level view, like the entire internet, then you will see that most sites have a similar usage pattern. How many have the opposite problem?

I guess as we move towards a more global economy it will level out somewhat on a day-to-day basis, but I don't know if that is a realistic expectation. The seasonal spikes probably won't change.

Is there a recording of the talk associated with this?

per the first comment: http://www.youtube.com/watch?v=dxk8b9rSKOo :)

Here are two other great talks by one of my friends at Amazon.



Bad choice with the Toyota F1. Should have gone with a McLaren or a Ferrari.
