

Ask HN: How do big web sites roll out new versions? - Bluem00

The operations of <i>big</i> websites fascinate me. Sites such as http://highscalability.com/ give a glimpse into the architecture of Amazon, Twitter, and other large systems, but leave out the operations side. So, can anyone explain how they deploy new versions of their sites? For instance, I can't imagine Amazon choosing to go down for even a minute, so what do they do? Am I wrong? Thanks!
======
gaius
They'll take a few webservers and their associated middle tier boxes out of
service on their front-end loadbalancers, wait for all the sessions to
migrate/fail over to others, upgrade them and put them back into service. The
loadbalancers will be smart enough to do affinity (put this customer onto this
pool of servers if possible). At any one time after the roll-out begins, x% of
the customer base will be on the new code, with x climbing toward 100 if things
go well. If something unexpected happens, (100-x)% of customers will never even
see it, and the rest (x still well below 100) will see only a glitch before the
loadbalancers swing them back across onto the old code.
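
In rough code, the loop is something like this (a minimal sketch; the
load-balancer client and the upgrade step are hypothetical stand-ins for the
vendor's own tooling):

    import time

    # Minimal sketch of a rolling upgrade behind a loadbalancer. The `lb`
    # object and `upgrade` callable are hypothetical placeholders.
    def rolling_upgrade(lb, pools, upgrade):
        for pool in pools:                       # a few web + middle-tier boxes
            lb.disable(pool)                     # stop sending new sessions here
            while lb.active_sessions(pool) > 0:  # wait for sessions to drain
                time.sleep(30)
            upgrade(pool)                        # push new code, restart services
            lb.enable(pool)                      # pool rejoins on the new version
            # stop here (and swing traffic back) if anything looks unexpected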

In apps like this the policy is to only _add_ columns and tables, never to
remove anything, so the database can be upgraded hot: the old code can continue
to run against it, and only the new code will see the new columns/tables.
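
Concretely, a hot additive migration is just this sort of thing (a sketch
using sqlite3 purely for illustration; the table and columns are made up):

    # Sketch of an additive-only "hot" migration. sqlite3 stands in for
    # the real database; names are invented.
    import sqlite3

    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, total REAL)")

    # We only ADD, never DROP or rename, so old code keeps running.
    db.execute("ALTER TABLE orders ADD COLUMN gift_message TEXT")

    # Old code still works -- it never selects the new column.
    db.execute("INSERT INTO orders (total) VALUES (9.99)")
    # New code can start using the new column immediately.
    db.execute("UPDATE orders SET gift_message = 'hi' WHERE id = 1")
    print(db.execute("SELECT id, total, gift_message FROM orders").fetchall())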

~~~
Timothee
And when all the servers run the new code, are the databases changed to remove
old columns or is it an absolute no-no with ever-expanding tables?

~~~
brianm
Once you are sure you are never going to roll back to the old code, you can
wipe out the old columns.

~~~
gaius
In practical terms you generally don't, tho'. You might need them again, for
example, for some as-yet-unplanned new feature. And doing so is of limited
value anyway: in many databases you won't get the space back without a re-org,
and that's a costly operation to do hot, if you can do it at all.

------
seldo
For a moderate-sized upgrade, the description given by gaius is pretty
accurate. Working at Yahoo, I've been around for a couple of pretty huge
property changes, and for those the deployment process is very different.

Basically, hardware is much cheaper than downtime, and very big web companies
have lots more money. So we don't swap out old servers gradually: instead we
buy and set up an entirely new set of servers, deploy to them several weeks in
advance of the planned launch, and run QA against these production-level
boxes. Then, when we are ready to "launch", all we're really doing is a
relatively low-risk DNS change: all the potentially tricky deployment issues
having been ironed out beforehand.
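
The QA step amounts to hitting the new boxes directly before DNS knows about
them, something like this sketch (the IP, hostname, and paths are
placeholders):

    # Sketch: smoke-test the new stack by IP before flipping DNS.
    # The IP, hostname, and paths below are invented.
    import http.client

    NEW_STACK_IP = "10.0.0.42"     # new servers, not yet in DNS
    SITE_HOST = "www.example.com"  # what DNS will eventually point here

    def smoke_test(paths):
        for path in paths:
            conn = http.client.HTTPConnection(NEW_STACK_IP, 80, timeout=10)
            # Send the real Host header so virtual hosting behaves as it
            # will after the DNS change.
            conn.request("GET", path, headers={"Host": SITE_HOST})
            status = conn.getresponse().status
            assert status == 200, f"{path} returned {status}"
            conn.close()

    smoke_test(["/", "/login", "/search?q=test"])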

After a couple of weeks/months of operation on the new hardware, if there have
been no major problems, the old boxes are decommissioned -- either re-imaged
and put back into service to expand capacity, or more often taken out of
service entirely (I've no idea what we do with old boxes when we stop using
them, funnily enough).

~~~
dcurtis
What is the magnitude of this, and how many servers do you have to buy to do
the deployment?

Also, while the DNS is switching there would be some major database
inconsistencies between the two server stacks. How do you deal with that?

~~~
seldo
The number of servers you have to buy is totally property-dependent. It can be
anywhere from 2 to several hundred, but is usually less than 20 in total for
the median-sized Yahoo property.

Could you clarify what you mean by database inconsistencies between stacks?

------
brianm
You roll out incrementally, and keep interfaces between components backwards
compatible for all versions presently out and any you may need to roll back
to, if you possibly can.

When you cannot, I, personally, believe in partitioning traffic across
concurrent versions. This can be done dynamically or statically -- really it
depends on the nature of your system, in general something will stand out as
obviously right for your situation.

To take an example, if your service is primarily user-centric, you can
partition the system by user and roll out accordingly. Let's say you have four
interacting systems: a front-end proxy which understands the partition
boundaries, an appserver, a caching system, and a database -- pretty typical.

The front end proxy in this system is shared by all users (this need not
always be true, as you can do subdomain and DNS games, but that is a different
major headache), but everything behind it can be dedicated to the partition
(this is not necessarily efficient, but it is easy).

Now, let's say we need to make a backwards-incompatible, coordinated change to
the appserver and databases associated with the partition. As we cannot roll
these atomically without downtime we pick an order, let's say appserver first.
In this case we will wind up rolling two actual changes to the appserver and
one to the databases.

The appserver will go from A (the initial) to an A' which is compatible with
both A and B databases, then the databases will go from A to B, and the
appservers from A' to B. You'll do this on one small partition and once done,
let it bake for a while. After that, you'll roll the same across more,
typically going to exponentially more of the system (i.e., 1 partition, 2
partitions, 4 partitions, 8 partitions, etc.).
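
A compressed sketch of that ordering and the exponential waves (the deploy
call and all the names are stand-ins):

    # Sketch of the A -> A' -> B ordering rolled out in exponentially
    # growing waves of partitions. `deploy` stands in for real tooling.
    def deploy(hosts, version):
        print(f"deploying {version} to {hosts}")

    def upgrade_partition(appservers, databases):
        deploy(appservers, "A'")  # interim: compatible with A and B databases
        deploy(databases, "B")    # schema change; A' keeps working throughout
        deploy(appservers, "B")   # final state; drop interim compatibility

    def rollout(partitions):
        done, wave = 0, 1
        while done < len(partitions):
            for p in partitions[done:done + wave]:
                upgrade_partition(p["app"], p["db"])
            done += wave
            wave *= 2  # 1 partition, then 2, then 4, 8, ...
            # ...letting each wave bake before starting the next

    rollout([{"app": ["app1"], "db": ["db1"]},
             {"app": ["app2"], "db": ["db2"]},
             {"app": ["app3"], "db": ["db3"]}])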

This means you have a (hopefully short-lived) interim release of one or more
components, which is probably grossly inefficient, but you wind up in a stable
state when complete. The cost of doing this is not pleasant, as you basically
triple QA time (final state, interim state, two upgrade transitions) and add a
non-trivial chunk of development time (interim state). That said, this is why
most folks just take the downtime until the cost of the downtime is greater
than the cost of the extra development.

This is, of course, a pain in the ass to coordinate. It is easy to do with
relatively small big systems (less than a few hundred servers, say, assuming
you have good deployment automation), and probably the pain of coordinating is
still less than the pain of baking component versioning into everything... for
a while.

An alternate model, which requires significantly more up front investment, is
to support this in a multi-tenant system where you don't (for upgrade
purposes) dedicate a clone of the system to each partition. Instead you can
bake version awareness into service discovery and dynamically route requests
accordingly.
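
In miniature, the version-aware routing is just a lookup (a toy sketch; the
registry is a plain dict and every name is invented):

    # Toy sketch of version-aware service discovery: each user partition
    # maps to the stack version it should hit. All names are invented.
    REGISTRY = {
        "v1": ["app-old-1:8080", "app-old-2:8080"],
        "v2": ["app-new-1:8080"],
    }
    PARTITION_VERSION = {0: "v2", 1: "v1", 2: "v1", 3: "v1"}  # flip as you roll

    def route(user_id, n_partitions=4):
        partition = hash(user_id) % n_partitions
        backends = REGISTRY[PARTITION_VERSION[partition]]
        return backends[hash(user_id) % len(backends)]

    # Which stack a user hits depends only on their partition's mapping.
    print(route("alice"))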

A very traditional approach (of the blue-suited variety) is to use an MQ
system for all RPCs and tag versioning into the request, then select from the
queue incorporating the version. This makes the upgrade almost trivial from a
code-execution point of view, and can even help with data migration, as you
can queue up updates destined for the intermediate-state database and play a
catch-up game to flip over to the end-state database. This is the subject for
a blog post, though, rather than a comment, as it is kind of hairy :-)
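
Something like this, with an in-memory queue standing in for the MQ and its
selectors (a sketch only; real MQ products spell the version filtering
differently):

    # Sketch: version-tagged RPC over a queue. queue.Queue stands in for
    # a real MQ; there, selectors would do the version filtering.
    import queue

    mq = queue.Queue()
    mq.put({"version": 1, "op": "get_user", "args": ["alice"]})
    mq.put({"version": 2, "op": "get_user", "args": ["bob"]})

    def consume(my_version):
        kept = []
        while not mq.empty():
            msg = mq.get()
            if msg["version"] == my_version:
                print(f"v{my_version} handling {msg['op']}{tuple(msg['args'])}")
            else:
                kept.append(msg)  # leave it for the other version's workers
        for msg in kept:
            mq.put(msg)

    consume(1)  # old workers drain only v1 traffic
    consume(2)  # new workers pick up v2 as clients are upgraded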

------
Tangurena
Here is basically what one of our big financial clients [1] does twice per
year:

Integration (or system) testing has taken place over the previous 1-2 month
period.

They have 2 large datacenters, let's call one "production" and the other one
"disaster recovery" (abbreviated "DR"). Between these and the internet are
some large routers (the Cisco ones that cost about what a house costs). At the
beginning of "migration" the routers are switched to point from production to
DR, so that all internet traffic points to DR.

At this time, the servers are being updated with the new code. This can take
some time, especially if large databases need to be restored as part of the
migration, or if the update scripts take a long time. There are a lot of
servers involved, a ballpark is about 100 servers: some Sun, some WinTel, some
IBM mainframes. Some mirrored, some clustered, some all by their little old
lonesome. Some applications are Java, some .NET, and some Cobol[2].

Approximate timeline [3]:

People start dialing into the main conference call about 5:30PM eastern [4].

Switch to DR about 6pm eastern.

Code in production is migrated/installed, servers rebooted if necessary. Done
about 9-11pm eastern.

Testing [5] starts and continues until about 3AM.

GO/NO-GO decision is made sometime between 3 and 4AM[6].

Rollback if necessary.

Switch routers to point to production at 6AM.

Preliminary postmortem report generally done by 2PM.

If no rollback, then repeat the following evening for DR.

Notes: 1 - They're a Fortune 100 company; I'm not telling who they are.

2 - That I'm aware of. It would not surprise me at all if there were a number
of other "brands" of servers or programming languages involved.

3 - The actual timeline is usually a spreadsheet that's, at a minimum, 50
pages long, plus about 10 more pages of first, second, and third contacts in
case something bursts into flames.

4 - It is common to have 100+ people monitoring the main conference line, and
1-2 dozen other conference lines used for individual components/products. You
have to stay awake and alert in case you're "called out" on the conference
line.

5 - In general, because the main URLs/hostnames now point to DR, hosts files
are changed so that configuration files don't have to be edited.

6 - Sometimes the decision gets delayed until almost 6AM if there are some
problems.

------
cheez80
I remember reading somewhere that eBay changed the background color of their
homepage from gray to white, gradually, over 6 months, so as not to alienate
customers who were used to the gray. Can't provide links, and sorry because
this isn't really related; I just thought that was interesting. :)

~~~
dcurtis
Wow, that's fascinating. If anyone has any more info on this, I'd really like
to learn more.

~~~
cheez80
found it here:

<http://www.cennydd.co.uk/2008/can-we-avoid-redesign-backlash/>

"In a nutshell, a meaningless background was removed from a seller page.
Pandemonium. After strong resistance the background was reinstated, to
everyone’s satisfaction. In fact, the rebellious users were so placated that
they failed to notice the designers slowly adjusting the background’s hex
values over the next few months. The background got lighter and lighter until
one day—pop!—it was gone."

~~~
dcurtis
More info is here, linked from the above article:
<http://www.uie.com/articles/death_of_relaunch/>

Look under the title "eBay.com: Satisfying their Users with Gradual Evolution"

------
jonursenbach
When I worked at a fairly large "adult dating" website, we would:

1) Upload the code from our development environment to our staging server.

2) Personally test the shit out of it. We didn't have any QA people, so have
fun finding your own bugs after looking at the same screen for X
minutes/hours/days/weeks.

3) When all is good, have one of the senior developers run our sync script
which would rsync the code off of our staging server to our master production
server into a directory called something like: live_20081228.

4) Log into the master production server and run another sync script that
would copy the code from the folder created in step #3 into the live
directory. The sync script would then sync that directory to all of our other
production servers. (See the sketch after this list.)

5) If there are problems: 5a) immediately roll back; 5b) pray that you still
have a job.
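
From memory, the sync flow in steps 3-4 amounted to something like this
(every path and hostname here is made up):

    # Sketch of the sync scripts from steps 3-4; paths and hostnames
    # below are invented.
    import datetime
    import subprocess

    STAGING = "staging:/var/www/site/"
    RELEASES = "/var/www/releases/"
    LIVE = "/var/www/live/"
    PROD_SERVERS = ["web1", "web2", "web3"]

    # Step 3: snapshot staging into a dated release dir on the master.
    release = RELEASES + "live_" + datetime.date.today().strftime("%Y%m%d")
    subprocess.run(["rsync", "-a", STAGING, release + "/"], check=True)

    # Step 4: copy the release into the live dir, then fan it out to the
    # other production servers.
    subprocess.run(["rsync", "-a", "--delete", release + "/", LIVE], check=True)
    for host in PROD_SERVERS:
        subprocess.run(["rsync", "-a", "--delete", LIVE, f"{host}:{LIVE}"],
                       check=True)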

The process, overall, was pretty terrible and allowed for a lot of problems to
arise (which happened regularly).

------
aristus
From their whitepapers you get the sense that much of the point of Amazon
Dynamo, Google Protocol Buffers and BigTable, Yahoo PNUTS and UDB, Facebook
Thrift objects, etc., is to release big systems from the tyranny of the SQL
schema.

When you can run multiple versions of code against the same dataset, upgrading
the application layer one bit at a time becomes much easier.
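
The trick in miniature (a sketch; the record layout is invented):

    # Sketch: two code versions reading the same schemaless record.
    # The field names are invented.
    record = {"user": "alice", "email": "a@example.com",
              "email_verified": True}  # new field, written by v2 code

    def render_v1(r):
        # Old code ignores fields it doesn't know about.
        return f"{r['user']} <{r['email']}>"

    def render_v2(r):
        # New code uses the new field, with a default for old records.
        flag = "ok" if r.get("email_verified", False) else "unverified"
        return f"{r['user']} <{r['email']}> ({flag})"

    print(render_v1(record))
    print(render_v2(record))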

------
tlrobinson
Here are a couple of interesting videos, obviously very specific to the
particular apps:

"Developing Erlang at Yahoo" <http://bayfp.blip.tv/file/1281369/> - Talks
about the transition of Delicious from Perl to Erlang, including live
migrations of the DB.

"Taking Large-Scale Applications Offline"
<http://www.youtube.com/watch?v=cePFlJ8sGj4> - In particular, the section on
"Versioning and phased rollouts: what to do when you can't assume all servers
have the same code"

~~~
unrealwh
For the last time, Delicious does not run on Erlang. Erlang was used in part
of the data migration.

------
bjclark
It's totally different for different types of sites. Facebook is different
than Amazon is different than Yahoo is different than Flickr.

In general, I'd say, most sites' data is sharded or clustered. Amazon, I'm
guessing, basically has many many different instances of their app running on
different clusters all over the world (multiple clusters per datacenter). So
they upgrade a cluster at a time, and their databases all synchronize with
others of the same version.

Facebook's data is, obviously, sharded by network. So any given cluster runs x
number of networks. Upgrades are then, again I'm guessing here, rolled out
network by network. The data layer can be different from network to network.

Most of the time though, updates to large scale services aren't changing db
schemas or making huge changes to the data layer, so it's as simple as
updating the code and rebooting some app servers.

(Obviously I don't know exactly how they do it, 'cause I don't work at any of
these places, but deployment isn't that hard of a problem to solve.)

~~~
gaius
On FB I've often seen "account unavailable" while they do maintenance, but I've
_never_ been unable to log into Amazon or place an order. Sharding is very
much overrated. However it's physically implemented, Amazon have _one_ logical
database and a customer's data is _always_ available.

~~~
bjclark
How is this contrary to what I said? I didn't say Amazon sharded. What/why
would they shard? Clustering != sharding.

They most definitely have multiple instances of the app running around the
world, which synchronize data with each other, which is why you've never been
unable to buy something. It's highly unlikely that all their instances would
be down or overloaded at the same time.

Also, they most certainly don't have a single logical database. They use all
kinds of things, including SimpleDB. You think product information is stored
in the same database, or even in the same way, as customer information?

~~~
gaius
The product data will be in multiple physical replicated shared-nothing
databases each of which has the _entire_ dataset - a single logical database.
The principle of sharding is that each database has a _subset_ of the data and
you place some logic in front of it to direct the query to the right place.
Now if I'd been able to buy kitchenware but not garden tools one day, then I
might say their product database was sharded. But Amazon is smarter than that.
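
The distinction in a few lines (a toy sketch, not Amazon's actual layout):

    # Toy sketch: replicas each hold everything; shards each hold a
    # subset, with routing logic in front. Names are invented.
    REPLICAS = ["db1", "db2", "db3"]  # each has the entire dataset

    SHARDS = {"kitchenware": "db-a", "garden": "db-b"}  # each has a subset

    def query_replicated(product_id):
        # Any replica can answer; pick one however you like.
        return REPLICAS[hash(product_id) % len(REPLICAS)]

    def query_sharded(category):
        # The router must hit the shard that owns the data: if db-b is
        # down, garden tools are unavailable but kitchenware isn't.
        return SHARDS[category]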

~~~
bjclark
Are you still arguing that I said Amazon sharded their database? Because I've
re-read what I said, and what you said like 4 times and I can't see where I
said they shard.

~~~
soult
Actually, Amazon is kind of sharding their database: they encapsulate every
kind of data in a program that manages it:

>What I mean by that is that within Amazon, all small pieces of business
functionality are run as a separate service. For example, this availability of
a particular item is a service that is a piece of software running somewhere
that encapsulates data, that manages that data.

(see <http://queue.acm.org/detail.cfm?id=1388773>)

------
pmjordan
Gradually. Except in very rare circumstances, not every visitor has to see the
changes immediately. The main challenge is consistency of the data model,
which should be separate from the presentation layer anyway, in an MVC-stylee.
If you can upgrade the data model without affecting existing versions of the
presentation layer, your presentation layer can be rolled out gradually.

------
wesley
Haven't done anything of the sort, but perhaps you could use something like
Apache mod_proxy to redirect visitors to machines that are already updated?

~~~
gaius
Right idea, wrong technology: this is done in hardware.

The "secret sauce" is an invisible layer in front of the webservers. The
client's connection terminates in this layer, and the loadbalancers establish
new sessions with the web servers - the client never actually touches the
server. This extra layer lets you do all sorts of clever stuff that was
unimaginable back in the day when all we had was Squid and round-robin DNS.
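
In miniature, that layer is doing no more than this (a toy software version
of what the hardware does; the addresses are made up):

    # Toy sketch of a terminating proxy: accept the client's TCP
    # connection, open a fresh one to a backend, and shuttle bytes.
    import socket
    import threading

    BACKENDS = [("10.0.0.1", 8080), ("10.0.0.2", 8080)]  # made-up addresses

    def pump(src, dst):
        while data := src.recv(4096):
            dst.sendall(data)

    def handle(client, backend_addr):
        backend = socket.create_connection(backend_addr)
        # The client never actually touches the backend server.
        threading.Thread(target=pump, args=(backend, client),
                         daemon=True).start()
        pump(client, backend)

    def serve(port=8080):  # low ports need privileges, so 8080 here
        lb = socket.create_server(("", port))
        n = 0
        while True:
            client, _ = lb.accept()
            n += 1
            # Affinity/health logic would go here; round-robin for the sketch.
            threading.Thread(target=handle,
                             args=(client, BACKENDS[n % len(BACKENDS)]),
                             daemon=True).start()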

------
SingAlong
Well, when Google App Engine comes into play, this problem isn't big. It has
something called versioning (in the admin panel) which allows you to switch
the app to the next version or revert to an old one.

From an amateur POV: when usual servers come into play, I have no clue about
big ones.

In fact, a couple of months back, when my home server had a burst of traffic
from my mobile app, it couldn't handle it (256kbps connection, 256MB RAM, and
a 1.3 GHz P3 processor: well suited to serving a high-bandwidth mobile app
with 1000 users while also constantly collecting data from APIs). Due to the
sudden traffic burst, I had to rope a friend's computer in alongside mine.

My procedure: I first stopped updating data from the APIs into my DB and kept
it static. No updating the DB. Then I copied it to my friend's fast computer.
I had my computer interfaced with a mobile phone which accepted input from
users' mobile phones; that was the app's requirement. So I had the advantage
of queuing up requests right there on the phone while I was switching servers.
That trick should help if you are developing a mobile app on very limited
resources (a hobby project).

