
Why are websites sometimes “down for maintenance”? - arachnids
http://softwareengineering.stackexchange.com/questions/71526/why-are-websites-even-this-one-sometimes-down-for-maintenance
======
sametmax
No server NEEDS to go down for maintenance. You can avoid doing so for
anything, at any scale, DB change, server updates, etc.

The problem is that a 0-downtime system, at a certain scale, is very costly to
create and maintain. You need redundancy everywhere, load balancing
everywhere, data replication, synchronization. Those are hard problems.

Basically you need to reach the level of being able to release the Netflix
Chaos Monkey in prod to be sure it works even if part of your system is busy
with the update, or just out of sync. This is certainly doable. It's also very
expensive, requires a lot of time and many experts to work on the problem.

Putting a site on maintenance mode can be a middle ground you choose, because
you don't want to invest that much just to avoid taking down your site for a
little time once in a while.

Economics.

Of course, if you do choose the road of zero downtime, your site will gain more
than just availability, it will gain reliability as well, since those best
practices serve both purposes.

~~~
amelius
One of the biggest problems is migrating your data to match updated code,
while the system keeps running.

It's like changing the engine of a driving car.

~~~
sametmax
Yes, but again you can create a system that allows that from the beginning.
It's just very, very costly to develop. And if you have legacy code, then a
rewrite is even more prohibitive.

------
zer00eyz
Reasons I have been "down for maintenance" in the past.

\- Moving from AWS to our own datacenter.

\- Payment processor issues. We weren't making money with the payment
processor down... “down for maintenance” meant lower customer service costs.

\- Because the CEO told me to. I shit you not. Be wary of working for someone
who has a name that sounds like it belongs to a Bond villain.

\- Because sometimes you NEED all the resources to get something done quickly.

\- In the days before AWS and "cloud computing" you only had hardware on hand.
It is hard to get your boss to budget for a traffic spike of one hour that is
greater than the sum of the previous 6 months of traffic.

\- Because non-technical people have access to technology: "It was just some
JavaScript" -or- "I didn't think I needed to tell you before I emailed 5
million people with an offer for free stuff" -or- "why is everything on sale
for 25% off ...."

\- Because load and time and complex systems sometimes do funny things
together; "maintenance" means we're finally getting enough data to reproduce
it.

\- The very beginning of a DDoS attack (only for some industries & sites).

------
greenleafjacob
Always avoidable if that's a priority - schema changes can be done online in
MySQL. Patches can be done on subsets of servers. Erlang even supports hot
code reloading so that even if you had a single point of failure you can
upgrade without losing file descriptors or in memory state. It is a lot
simpler if you have the choice though, since you don't have to have multiple
versions online at the same time. "Divisions of Ericsson that do [hot code
reloading] spend as much time testing them as they do testing their
applications themselves." [1]

[1]:
[http://learnyousomeerlang.com/relups](http://learnyousomeerlang.com/relups)

~~~
endymi0n
There are more, and more nuanced, reasons actually:

1\. Companies don't know how to do the engineering for maximum uptime, like
you describe. It's way more complicated than the usual CRUD operations.

2\. Companies know how to but they decide not to invest this time (we often
traded one hour of downtime against 2-3 man-days for preparing online schema
changes with nasty and inconsistent backfilling in the early days).

And 3. Don't forget disaster recovery. I've seen some of the smartest
companies go down for hours due to a DB misconfiguration, or a rack PSU
faulting with only one side of the servers connected, even with a reasonably
highly available setup. Stuff like this happens - and then you better have a
proper 503 Maintenance page up and running to prevent Google from delisting
your site. In this case though, "maintenance" is rather a euphemism :)
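
For reference, a 503 page of that kind can be very small. The following is a minimal sketch using only plain Python, not a production setup: the `maintenance_app` name and the one-hour `Retry-After` value are assumptions made for the example. The 503 status together with a `Retry-After` header is what signals to crawlers that the outage is temporary.

```python
RETRY_AFTER_SECONDS = 3600  # assumed length of the maintenance window

BODY = b"<h1>Down for maintenance</h1><p>Please try again later.</p>"


def maintenance_app(environ, start_response):
    """WSGI app that answers every request with 503 Service Unavailable."""
    headers = [
        ("Content-Type", "text/html; charset=utf-8"),
        ("Content-Length", str(len(BODY))),
        # Tells clients and crawlers when it is worth trying again.
        ("Retry-After", str(RETRY_AFTER_SECONDS)),
    ]
    start_response("503 Service Unavailable", headers)
    return [BODY]
```

Any WSGI server can host this; for a local try-out, `wsgiref.simple_server.make_server("127.0.0.1", 8080, maintenance_app).serve_forever()` would do (the port is arbitrary). In practice such a page is usually served from the load balancer or front-end web server, so that it stays up while the application behind it is down.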

~~~
gaius
_Companies know how to but they decide not to invest this time_

Cost increases exponentially for diminishing returns once you get into serious
availability. For most businesses, the investment in moving from 99.9% to
99.99% or 99.99% to 99.999% uptime just isn't worth it - most customers are
quite willing to "try again later" in practice, especially if you give them
advance notice or have a regular maintenance slot.
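
To put rough numbers on those percentages, here is a small sketch that converts an availability target into the downtime it allows per year. The function name is made up for the example, and a 365-day year is assumed.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def allowed_downtime_minutes(availability_percent):
    """Minutes per year a site may be down and still meet the target."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


for target in (99.9, 99.99, 99.999):
    minutes = allowed_downtime_minutes(target)
    print(f"{target}% uptime allows about {minutes:.1f} minutes of downtime per year")
```

That works out to roughly 8.8 hours a year at 99.9%, about 53 minutes at 99.99%, and a little over 5 minutes at 99.999%, so each extra nine leaves a tenth of the previous budget for every planned and unplanned outage combined.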

------
ploggingdev
I don't recall sites like Google or Facebook ever being down for maintenance.
Are there any articles that discuss how they manage application layer and
database layer migrations?

~~~
endymi0n
A good start would be all of
[http://highscalability.com](http://highscalability.com) \- but it more or
less boils down to being able to roll back, and that rules out hard schema
changes. So the proper and hard way is always a variant of: 1) Create another
column, 2) Write to both columns at the same time from the database, 3) Create
code to run on the new column, 4) Enable feature switch to run everything on
the new column, 5) Remove the code dealing with the old column, 6) Remove the
old column.

If that looks complicated, it is - and you better only start with these things
if your site earns more money per minute than you need to pay engineers and
project managers to pull that off.
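
The six steps above can be sketched against an in-memory SQLite database. This is only an illustration under stated assumptions: the `users` table, the `name` and `full_name` columns and the helper functions are hypothetical, the dual write is done in application code rather than inside the database, and a backfill of untouched rows is added between steps 2 and 3.

```python
import sqlite3


def save_user(db, user_id, name):
    # Step 2: every write goes to the old and the new column.
    db.execute(
        "UPDATE users SET name = ?, full_name = ? WHERE id = ?",
        (name, name, user_id),
    )


def load_user(db, user_id, use_new_column):
    # Steps 3 and 4: the feature switch picks which column is read.
    # The column name comes from this fixed pair, never from user input.
    column = "full_name" if use_new_column else "name"
    row = db.execute(
        f"SELECT {column} FROM users WHERE id = ?", (user_id,)
    ).fetchone()
    return row[0]


def migrate():
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
    db.executemany(
        "INSERT INTO users (id, name) VALUES (?, ?)",
        [(1, "Ada"), (2, "Grace")],
    )

    # Step 1: add the new column next to the old one.
    db.execute("ALTER TABLE users ADD COLUMN full_name TEXT")

    # Step 2: live traffic now writes both columns.
    save_user(db, 1, "Ada Lovelace")

    # Backfill rows that no dual write has touched yet.
    db.execute("UPDATE users SET full_name = name WHERE full_name IS NULL")

    # Before flipping the switch, both read paths should agree.
    for user_id in (1, 2):
        assert load_user(db, user_id, False) == load_user(db, user_id, True)

    # Steps 5 and 6: once nothing reads the old column, drop it.
    # DROP COLUMN needs SQLite 3.35 or newer, so older builds skip it here.
    if sqlite3.sqlite_version_info >= (3, 35, 0):
        db.execute("ALTER TABLE users DROP COLUMN name")
    return db
```

Up to the final drop, rolling back is just a matter of turning the switch off again, because the old column has been kept in sync the whole time.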

~~~
jholman
This is correct, except your step 4. It should say something like: 4a) Enable
feature switch on 1% of requests, ensure that they're working correctly. 4b)
go to 10%. 4c) start rolling it out across all requests.
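
One common way to do that kind of staged rollout is to hash a stable identifier into one of 100 buckets and compare the bucket with the current percentage. A minimal sketch, where `in_rollout` and the choice of a request or user id as the key are assumptions for the example:

```python
import hashlib


def in_rollout(key: str, percent: int) -> bool:
    """Return True if this key falls inside the first `percent` buckets."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100  # 0..99
    return bucket < percent


# The same key always lands in the same bucket, so anything enabled at 1%
# stays enabled when the switch is raised to 10% and later to 100%.
for percent in (1, 10, 100):
    enabled = sum(in_rollout(f"request-{i}", percent) for i in range(10_000))
    print(f"{percent}% switch: {enabled} of 10000 sample keys use the new column")
```

Hashing a user id instead of a request id keeps each user on one side of the switch, which makes it easier to trace a bug report back to the new code path.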

------
sigi45
Because of things they had not thought of.

You don't see 'Maintenance' on the systems of companies that have been doing
this for a long time. You might see it at 'normal' companies: smaller ones who
used the 'wrong' database and had to migrate it.

If you start with one database and 'forget', or just don't think, to set up a
master, slave, slave combination, you have to fix that once.

When you make a mistake, you have to fix it once.

Also, today you are able to maintain quite a big page with a very small number
of people. The chance that one of them didn't think about all the necessary
elements of an always-online system is not far-fetched.

------
e0m
"They're replacing the vacuum tubes in the servers"

------
curt15
I've always wondered whether Apple takes its website "down for maintenance"
before a product launch out of necessity or simply to build excitement.

------
seanwilson
Common causes are things like software upgrades and database changes. There's
probably always a way to avoid it but going down for maintenance might be less
effort and cheaper overall depending on the site. For example, if you can do
it during a known time of low traffic or when you know users will just come
back later. I've noticed several UK bank websites go down for maintenance
during the night.

------
tyingq
The short answer is cost versus benefit.

For some types of websites, zero-downtime upgrades and maintenance are costly.

Online banking is a good example. I have accounts with several banks, and all
of them periodically "go down for maintenance". I assume that's because the
talent and infrastructure needed to do those tasks with zero downtime are more
expensive than whatever customer service hit they take for planned outages.

------
petters
Because it is much easier than performing complicated modifications while the
site is running.

For example, at Google "down for maintenance" is not on the table. That can in
some cases lead to lots of extra work or time, e.g. dual writes for a period
of time followed by mapreduces to fix the remaining part.

My internet bank is often down for maintenance on Sunday nights. I assume it
is because they have a very old system.

------
heisenbit
PHP board software:

\- occasionally benefits from clean-up tasks which can be long-running and
would result in an irritating experience. While slow read operations may in
theory be possible, it is better to tell the users to come back later than to
erode their confidence.

\- sometimes the database of a board can become corrupted. The repair
operations (sort of a disk fsck for the board) require exclusive access to the
database.

\- software upgrades

------
fuzzfactor
Not every aircraft has all the expertise, tools, and spares on board at all
times to be able to service or replace their engines in flight.

If the system has not been designed from the ground up for that type of
service, then the on-board expertise would also have to be gifted at
developing workarounds on-the-spot that reliably work the first time.

------
formula_ninguna
Because computers also need to rest sometimes.

------
nickjackson
I really don't think there is any excuse for it in this day and age,
especially when building sites from scratch. There are so many different
techniques and technologies for doing zero-downtime deploys, not to mention
the numerous PaaS that will do it out of the box if you don't know how.

~~~
viraptor
There's still cost to it. It basically boils down to: do you lose more money
during a manual maintenance period, or by hiring extra people to do all
changes in zero-downtime style. (Or doing slower development with the existing
team) The technology for transparent changes has been available for decades,
although it's true - it's much easier to use today. But it still needs extra
work. And someone has to pay for that work in the end.

------
protomyth
Mistakes were made during the deploy of the new website to production. A
failed website deploy is a bit more noticeable to the public than the failed
deployment of an internal only system.

------
visarga
They need to change the oil.

~~~
dingaling
Which tangentially is why the USAF's E-4B airborne command posts have to land
after about three days. Fuel isn't a problem but they don't have a way to
replenish engine-oil in-flight.

------
Demcox
"petrabytes"...really?

The most upvoted comment forgot (or doesn't know) how to spell petabyte.

~~~
welly
Or it was a typo or they were thinking of something else at the time or they
were typing their response on their mobile or any other number of reasons
other than "forgot how to spell" or "doesn't know how to spell".

I swear criticising someone's spelling is the last bastion in an
argument/debate/discussion. When you haven't got anything else, attack their
spelling.

I'm not saying that you're getting into an argument or debate but come on, you
know what the guy meant.

~~~
Demcox
I'm sorry that you feel this way, but as a CS student, one knows the horrors
such simple spelling errors can unleash in code, machine architecture and
programming, to name a few.

Correct spelling and grammar are what make your OS function and the doses of
medicine prescribed correct, and they can sometimes be the difference between
life and death.

I view it as the foundation for anything important that you want to
communicate.

