Ask HN: How do big web sites roll out new versions?
110 points by Bluem00 on Dec 27, 2008 | 44 comments
The operations of big websites fascinate me. Sites such as http://highscalability.com/ give a glimpse into the architecture of Amazon, Twitter, and other large systems, but leave out the operations side. So, can anyone explain how they deploy new versions of their sites? For instance, I can't imagine Amazon choosing to go down for even a minute, so what do they do? Am I wrong? Thanks!

They'll take a few webservers and their associated middle tier boxes out of service on their front-end loadbalancers, wait for all the sessions to migrate/fail over to others, upgrade them and put them back into service. The loadbalancers will be smart enough to do affinity (put this customer onto this pool of servers if possible). At any one time after the roll-out begins, x% of the customer base will be on the new code where x->100 if things go well, or if something unexpected happens, (100-x)% of customers will never even see it, and the rest (x << 100) will see only a glitch before the loadbalancers swing them back across onto the old code.
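The drain-upgrade-restore cycle described above can be sketched in a few lines of Python. The `LoadBalancer` class and `deploy()` function here are illustrative stand-ins, not a real load balancer API:

```python
# Hypothetical sketch of a rolling deploy: drain a pool from the load
# balancer, upgrade it out of rotation, then return it to service.
class LoadBalancer:
    def __init__(self, pools):
        self.active = set(pools)

    def drain(self, pool):
        """Stop sending new sessions here; existing ones fail over."""
        self.active.discard(pool)

    def restore(self, pool):
        self.active.add(pool)

def deploy(pool, version):
    print(f"upgrading {pool} to {version}")

def rolling_deploy(lb, pools, version):
    for pool in pools:
        lb.drain(pool)          # sessions migrate to the remaining pools
        deploy(pool, version)   # upgrade while out of rotation
        lb.restore(pool)        # x% of customers now land on new code

lb = LoadBalancer(["pool-a", "pool-b", "pool-c"])
rolling_deploy(lb, ["pool-a", "pool-b"], "v2")
```

If something breaks mid-roll, the real system swings traffic back by draining the upgraded pools instead, which is the "(100-x)% never even see it" property.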

In apps like this the policy is to only add columns to tables and new tables, never to remove anything, so the database can be upgraded hot, and the old code can continue to run on it, only the new code will see any new columns/tables.
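The add-only policy is easy to demonstrate with sqlite3 standing in for the production database (the table and column names are made up): new code adds a column with a default, and the old code's queries keep working untouched.

```python
# Sketch of the add-only schema policy: only ever ADD columns,
# never drop or rename, so the database can be upgraded hot.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, total REAL)")
db.execute("INSERT INTO orders (total) VALUES (9.99)")

# Hot upgrade: existing rows pick up the default for the new column.
db.execute("ALTER TABLE orders ADD COLUMN currency TEXT DEFAULT 'USD'")

# Old code still runs: it names only the columns it knows about.
old = db.execute("SELECT id, total FROM orders").fetchone()

# Only the new code sees the new column.
new = db.execute("SELECT id, total, currency FROM orders").fetchone()
```

The key is that old code must select columns explicitly (no `SELECT *`), otherwise the new column leaks into it.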

I've done rolling deployments, but at Yahoo! we also do it a similar but different way.

We have access to multiple colos of slightly over-capacity machines. It's not uncommon to migrate by switching all traffic over to one colo. Then upgrade the next colo to the new code and switch all traffic there. Finally upgrade the first colo and then start cross loading the traffic between colos again.

Since we are big enough to have at least a couple of colos per geographic region this means we can avoid rolling migrations which are pretty horrible when you get trapped in a failure mid-roll.

And when all the servers run the new code, are the databases changed to remove old columns or is it an absolute no-no with ever-expanding tables?

At Yahoo we drop columns from the previous version the next time we upgrade the database, e.g. if you stopped using a 1.0 column in version 1.1, then you drop that column as part of the upgrade going to 1.2. The key thing is always being able to roll back if things go seriously wrong; hardware is much, much cheaper than downtime.

Unless you really need the space, I suggest leaving the columns there. Fewer chances for error. If your columns are nullable, the amount of space taken up by an obsolete column tends to be one byte, particularly if old data is deleted.

It's different for every site, db backend, etc.

If it's just 1 column, and it doesn't store much data, then probably no. If it's a bunch of columns and it's taking up lots of disk, then they probably migrate what they need to a new table and drop the old one. Altering tables is usually not a good idea. In mysql it can stop replication and can mean hours or days of downtime depending on the situation and amount of data.

Once you are sure you are never going to roll back to the old code you can wipe out the old columns.

In practical terms you generally don't tho'. You might need them again, for example, for some as-yet-unplanned new feature. And doing so is of limited value anyway in many databases, you won't get the space back without a re-org and that's a costly operation to do hot, if you even can at all.

If the upgrade is expected to change the user experience, this is also a great time to do some A/B testing across versions.

For a moderate-sized upgrade, the description given by gaius is pretty accurate. Working at Yahoo, I've been around for a couple pretty huge property changes, and then the deployment process is very different.

Basically, hardware is much cheaper than downtime, and very big web companies have lots more money. So we don't swap out old servers gradually: instead we buy and set up an entirely new set of servers, deploy to them several weeks in advance of the planned launch, and run QA against these production-level boxes. Then, when we are ready to "launch", all we're really doing is a relatively low-risk DNS change: all the potentially tricky deployment issues having been ironed out beforehand.

After a couple of weeks/months of operation on the new hardware, if there have been no major problems, the old boxes are decommissioned -- either re-imaged and put back into service to expand capacity, or more often taken out of service entirely (I've no idea what we do with old boxes when we stop using them, funnily enough).

What is the magnitude of this, and how many servers do you have to buy to do the deployment?

Also, while the DNS is switching there would be some major database inconsistencies between the two server stacks. How do you deal with that?

The number of servers you have to buy is totally property-dependent. It can be anywhere from 2 to several hundred, but is usually less than 20 in total for the median-sized Yahoo property.

Could you clarify what you mean by database inconsistencies between stacks?

instead we buy and set up an entirely new set of servers

What property do you work on? How is it more efficient to buy or reallocate entirely new servers than update your old ones?

How many times per year do you even update your application? The cost of getting the servers ready and installing packages would be prohibitive if you had to roll out, say, an urgent security fix.

I think it depends on the kind of upgrade. Seldo is describing a major migration to a new platform. In the case of a point release a rolling migration or the type I described above are the norm.

When doing a major release seldo is spot on. Especially given the small amount of time it takes for hardware to get out of date it's often cheaper to provision new hardware prior to the major release and then deploy straight onto that. Mostly the old boxes would then get cycled back in to production, something less strenuous like staging, or retired.

Exactly right. As I said at the beginning of my post, the "new boxes" method is only used for major platform upgrades, of a magnitude that would happen once a year or less for most properties.

For the really huge properties, the cross-colo upgrade pattern you (sh1mmer) describe is the way to go.

Ok, I see what you are saying. Maybe it makes sense if you were switching operating systems or were upgrading PHP from say 4.x to 5.x. I'd still like to see the numbers behind it though.

Also, cross-colo is not a factor of being big, it's a factor of having a BCP (business continuity plan). Every team at Yahoo is urged to have BCP for practically everything. I used to do cross-colo upgrades for a relatively small project once we had more than two frontend servers.

I think cross-colo is a size thing. BCP is good for all business, but most start ups can't afford to pay for hosting in several data centers.

I imagine a cheap way to fake multiple colos now is to use virtualization, Amazon-style. Heck, you could do exactly what Seldo was suggesting every time on EC2.

name one yahoo property that bought completely new hardware for a new version rollout

Widgets, Profiles, YDN and Mobile, in the last 2 years. And those are just the ones that I know about because I worked on them. YAP was also new hardware but it was also a new product.

seldo's right.

Also, the new Buzz site used all new hardware (and was a new product) and Games rolled out on new hardware when it was upgraded from HF2K to PHP. The old servers were reimaged and folded in with their PHP siblings once it was clear that the build was holding up properly.

haha, that's the way it was done at Yahoo. Go ask HRC now; they'll probably tell you that you are being migrated to a VM-based platform. I'm not speaking derisively, it's long past time Yahoo and other web companies backed off the ridiculous policy of acquiring servers on a daily basis. The $40mln spend on hardware for Panama won't be repeated.

What are you talking about? Seldo works at Yahoo right now, and has been a part of several product roll-outs.

i remember reading somewhere that eBay changed the background color of their homepage from gray to white, gradually, over 6 months, so as not to alienate customers who were used to the gray. Can't provide links, and sorry because this isn't really related; i just thought that was interesting. :)

Wow, that's fascinating. If anyone has any more info on this, I'd really like to learn more.

found it here:


"In a nutshell, a meaningless background was removed from a seller page. Pandemonium. After strong resistance the background was reinstated, to everyone’s satisfaction. In fact, the rebellious users were so placated that they failed to notice the designers slowly adjusting the background’s hex values over the next few months. The background got lighter and lighter until one day—pop!—it was gone."

More info is here, linked from the above article: http://www.uie.com/articles/death_of_relaunch/

Look under the title "eBay.com: Satisfying their Users with Gradual Evolution"

The original of this story was in Adam Cohen's book The Perfect Store.

Here is basically what one of our big financial clients [1] does twice per year:

Integration, or system, testing had taken place over the previous 1-2 month period.

They have 2 large datacenters, let's call one "production" and the other one "disaster recovery" (abbreviated "DR"). Between these and the internet are some large routers (the Cisco ones that cost about what a house costs). At the beginning of "migration" the routers are switched to point from production to DR, so that all internet traffic points to DR.

At this time, the servers are being updated with the new code. This can take some time, especially if large databases need to be restored as part of the migration, or if the update scripts take a long time. There are a lot of servers involved, a ballpark is about 100 servers: some Sun, some WinTel, some IBM mainframes. Some mirrored, some clustered, some all by their little old lonesome. Some applications are Java, some .NET, and some Cobol[2].

Approximate timeline [3]:

People start dialing into the main conference call about 5:30PM eastern [4].

Switch to DR about 6pm eastern.

Code in production is migrated/installed, servers rebooted if necessary. Done about 9-11pm eastern.

Testing [5] starts and continues until about 3AM.

GO/NO-GO decision is made sometime between 3 and 4AM[6].

Rollback if necessary.

Switch routers to point to production at 6AM.

Preliminary postmortem report generally done by 2PM.

If no rollback, then repeat the following evening for DR.

Notes: 1 - They're a Fortune 100 company, I'm not telling who they are.

2 - That I'm aware of. It would not surprise me at all that there are a number of other "brands" of servers or programming languages involved.

3 - The actual timeline is usually a spreadsheet that's at a minimum, 50 pages long, plus about 10 more pages of first, second and third contacts in case something bursts into flames.

4 - It is common to have 100+ people monitoring the main conference line, and 1-2 dozen other conference lines used for individual components/products. One has to be awake and alert in case you're "called out" on the conference line.

5 - In general, because the main URL/URI/Hostnames are now pointing to DR, hosts files are changed so that configuration files don't get edited.

6 - Sometimes the decision gets delayed until almost 6AM if there are some problems.

When I worked at a fairly large "adult dating" website, we would:

1) Up the code from our development environment to our staging server.

2) Personally test the shit out of it. We didn't have any QA people, so have fun finding your own bugs after looking at the same screen for X minutes/hours/days/weeks.

3) When all is good, have one of the senior developers run our sync script which would rsync the code off of our staging server to our master production server into a directory called something like: live_20081228.

4) Log into the master production server and run another sync script that would copy the code from the folder created in step #3 into the live directory. Sync script would then sync that directory to all of our other production servers.

5) If there are problems: 5a) immediately roll back; 5b) pray that you still have a job.
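Steps 3 and 4 above amount to a staged-copy deploy. Here is a self-contained sketch, with `shutil` standing in for rsync and local directories standing in for the staging and production hosts (all paths and filenames are illustrative):

```python
# Sketch of a staged-copy deploy: code goes into a timestamped
# release directory first, then gets copied over the live directory.
import shutil, time
from pathlib import Path

def stage_release(src: Path, releases: Path) -> Path:
    """Step 3: copy code into a dated directory, e.g. live_20081228."""
    dest = releases / f"live_{time.strftime('%Y%m%d')}"
    shutil.copytree(src, dest, dirs_exist_ok=True)
    return dest

def go_live(release: Path, live: Path):
    """Step 4: copy the staged release over the live directory."""
    shutil.copytree(release, live, dirs_exist_ok=True)

staging = Path("staging")
staging.mkdir(exist_ok=True)
(staging / "index.php").write_text("<?php echo 'v2'; ?>")

release = stage_release(staging, Path("releases"))
go_live(release, Path("live"))
```

One nice property of keeping the dated directories around: rolling back (step 5a) is just `go_live()` pointed at the previous release directory.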

The process, overall, was pretty terrible and allowed for a lot of problems to arise (which happened regularly).

From their whitepapers you get the sense that much of the point of Amazon Dynamo, Google Protocol Buffers and BigTable, YAHOO PNUTS and UDB, Facebook Thrift objects, etc is to release big systems from the tyranny of the SQL schema.

When you can run multiple versions of code against the same dataset, upgrading the application layer one bit at a time becomes much easier.

Here's a couple of interesting videos, obviously very specific to the particular apps:

"Developing Erlang at Yahoo" http://bayfp.blip.tv/file/1281369/ - Talks about the transition of Delicious from Perl to Erlang, including live migrations of the DB.

"Taking Large-Scale Applications Offline" http://www.youtube.com/watch?v=cePFlJ8sGj4 - in particular, the section on "Versioning and phased rollouts: what to do when you can't assume all servers have the same code"

for the last time, delicious does not run on erlang. erlang was used in part of the migration of the data

Gradually. Except in very rare circumstances, not every visitor has to see the changes immediately. The main challenge is consistency of the data model, which should be separate from the presentation layer anyway, in an MVC style. If you can upgrade the data model without affecting existing versions of the presentation layer, your presentation layer can be rolled out gradually.

It's totally different for different types of sites. Facebook is different than Amazon is different than Yahoo is different than Flickr.

In general, I'd say, most sites' data is sharded or clustered. Amazon, I'm guessing, basically has many many different instances of their app running on different clusters all over the world (multiple clusters per datacenter). So they upgrade a cluster at a time, and their databases all synchronize with others of the same version.

Facebook's data is, obviously, sharded by network. So any given cluster runs x number of networks. Upgrades are then, again i'm guessing here, rolled out network by network. The data layer can be different from network to network.

Most of the time though, updates to large scale services aren't changing db schemas or making huge changes to the data layer, so it's as simple as updating the code and rebooting some app servers.

(Obviously I don't know exactly how they do it, 'cause I don't work at any of these places, but deploying isn't that hard of a problem to solve)

On FB I've often seen "account unavailable" while they do maintenance, but I've never been unable to log into Amazon or place an order. Sharding is very much overrated. However it's physically implemented, Amazon have one logical database and a customer's data is always available.

How is this contrary to what I said? I didn't say Amazon sharded. What/why would they shard? Clustering != sharding.

They most definitely have multiple instances of the app running around the world, which synchronize data with each other, which is why you've never been unable to buy something. It's highly unlikely that all their instances would be down or overloaded at the same time.

Also, they most certainly don't have a single logical database. They use all kinds of things, including SimpleDB. You think product information is stored in the same database, or even in the same way, as customer information?

The product data will be in multiple physical replicated shared-nothing databases each of which has the entire dataset - a single logical database. The principle of sharding is that each database has a subset of the data and you place some logic in front of it to direct the query to the right place. Now if I'd been able to buy kitchenware but not garden tools one day, then I might say their product database was sharded. But Amazon is smarter than that.

Are you still arguing that I said Amazon sharded their database? Because I've re-read what I said, and what you said like 4 times and I can't see where I said they shard.

Actually, Amazon is kind of sharding their database: They encapsulate every kind of data into a program that manages it:

>What I mean by that is that within Amazon, all small pieces of business functionality are run as a separate service. For example, this availability of a particular item is a service that is a piece of software running somewhere that encapsulates data, that manages that data.

(see http://queue.acm.org/detail.cfm?id=1388773)

You roll out incrementally, and keep interfaces between components backwards compatible for all versions presently out and any you may need to roll back to, if you possibly can.

When you cannot, I, personally, believe in partitioning traffic across concurrent versions. This can be done dynamically or statically -- really it depends on the nature of your system, in general something will stand out as obviously right for your situation.

To take an example, if your service is primarily user centric, you can partition the system by user and roll out accordingly. Let's say you have four interacting systems: a front-end proxy which understands the partition boundaries, an appserver, a caching system, and a database -- pretty typical.

The front end proxy in this system is shared by all users (this need not always be true as you can do subdomain and dns games, but that is a different major headache), but everything behind it can be dedicated to the partition (this is not necessarily efficient, but it is easy).

Now, let's say we need to make a backwards-incompatible, coordinated change to the appserver and databases associated with the partition. As we cannot roll these atomically without downtime we pick an order, let's say appserver first. In this case we will wind up rolling two actual changes to the appserver and one to the databases.

The appserver will go from A (the initial) to an A' which is compatible with both A and B databases, then the databases will go from A to B, and the appservers from A' to B. You'll do this on one small partition and once done, let it bake for a while. After that, you'll roll the same across more. Typically going to exponentially more of the system (ie, 1 partition, 2 partitions, 4 partitions, 8 partitions, etc).
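The A -> A' -> B sequence and the exponentially widening rollout can be sketched as follows. The partition records and function names here are illustrative, not a real deployment tool:

```python
# Sketch of the A -> A' -> B upgrade per partition, rolled out over
# exponentially growing slices (1, 2, 4, 8, ... partitions).
def upgrade_partition(p):
    p["appserver"] = "A'"  # interim appserver: talks to A and B schemas
    p["database"]  = "B"   # migrate this partition's database
    p["appserver"] = "B"   # final appserver, free to rely on B schema

def exponential_rollout(partitions):
    done, batch = 0, 1
    while done < len(partitions):
        for p in partitions[done:done + batch]:
            upgrade_partition(p)
        done += batch
        batch *= 2         # let it bake, then double the blast radius

parts = [{"appserver": "A", "database": "A"} for _ in range(8)]
exponential_rollout(parts)
```

The interim A' state is exactly the "grossly inefficient but stable" release the comment describes: it exists only so the database flip is never observed by code that can't handle it.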

This means you have a, hopefully short lived, interim release of one or more components, which is probably grossly inefficient, but you wind up in a stable state when complete. The cost of doing this is not pleasant, as you basically triple QA time (final state, interim state, two upgrade transitions) and add a non-trivial chunk of development time (interim state). That said, this is why most folks just take the downtime until the cost of the downtime is greater than cost of extra development.

This is, of course, a pain in the ass to coordinate. It is easy to do with relatively small big systems (less than a few hundred servers, say, assuming you have good deployment automation), and probably the pain of coordinating is still less than the pain of baking component versioning into everything... for a while.

An alternate model, which requires significantly more up front investment, is to support this in a multi-tenant system where you don't (for upgrade purposes) dedicate a clone of the system to each partition. Instead you can bake version awareness into service discovery and dynamically route requests accordingly.

A very traditional (of the blue-suited variety) is to use an MQ system for all RPCs and tag versioning into the request, then select from the queue incorporating the version. This makes the upgrade almost trivial from a code-execution point of view, and can even help with data migration as you can queue up updates during the intermediate state database and play a catch-up game to flip over to the end state database. This is the subject for a blog post, though, rather than a comment, as it is kind of hairy :-)
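The version-tagged MQ idea above can be sketched with a plain in-memory queue; `collections.deque` is an illustrative stand-in for a real message broker, and the message shape is made up:

```python
# Sketch of version-tagged RPC over a queue: each request carries a
# version tag, and workers select only messages matching their code.
from collections import deque

queue = deque()

def send_rpc(version, payload):
    queue.append({"version": version, "payload": payload})

def receive_for(version):
    """A worker running `version` takes only matching requests."""
    for msg in list(queue):
        if msg["version"] == version:
            queue.remove(msg)
            return msg["payload"]
    return None

send_rpc("v1", "old-client request")
send_rpc("v2", "new-client request")

new_work = receive_for("v2")  # new workers handle only v2 traffic
old_work = receive_for("v1")  # old workers drain remaining v1 traffic
```

During the cutover, old and new workers coexist on the same queue; once the v1 backlog drains, the old workers can simply be retired.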

Haven't done anything of the like, but perhaps using something like apache mod_proxy to redirect visitors to machines that are already updated?

Right idea, wrong technology: this is done in hardware.

The "secret sauce" is an invisible layer in front of the webservers. The client's connection terminates in this layer, and the loadbalancers establish new sessions with the web servers - the client never actually touches the server. This extra layer lets you do all sorts of clever stuff that was unimaginable back in the day when all we had was Squid and round-robin DNS.

Well, when Google App Engine comes into play, this problem isn't big. They have something called versioning (in the admin panel) which allows you to change the app to the next version or revert back to an old version.

From an amateur POV: when usual servers come into play, I have no clue about the big ones.

In fact, a couple of months back, when my home server had a burst of traffic from my mobile app, it couldn't handle it (256kbps connection, 256MB RAM with a 1.3 GHz P3 processor, well suited to serve a high bandwidth mobile app of 1000 users and also for constantly collecting data from APIs). But due to the sudden traffic burst, I had to rope a friend's computer in with mine.

My procedure: I first stopped updating data from APIs to my DB and kept it static. No updating the DB. Copied it to my friend's fast comp. I had my computer interfaced with a mobile which accepts input from the user mobile phones. That was the app's requirement. So I had an advantage of lining up requests right there in the mobile phone while I was switching the server. That trick should help if you are developing a mobile app on a very limited resource (a hobby project).


A regex of machines? Every odd machine in the first datacenter? Software load balancers?

Uh, I'm guessing most places do it much more scientifically than that, and I would be very surprised if anyone in the top 100 sites is using a software loadbalancer (I only know of Wordpress using nginx http://barry.wordpress.com/2008/04/28/load-balancer-update/).

Most places run dedicated hardware loadbalancers from F5 or Cisco or Juniper.
