
Stack Overflow: How we upgrade a live data center - Nick-Craver
http://blog.serverfault.com/2015/03/05/how-we-upgrade-a-live-data-center
======
3princip
I love reading about StackOverflow, particularly their infrastructure.

The site has been a useful resource for so many years, and it works so well.
It was a joy to discover that it all ran on like two racks' worth of servers,
and still does. Having seen corporate intranet portals, with maybe a thousand
daily active users, running on excessive* hardware (needlessly, of course),
it's like a breath of fresh air.

*EDIT: Removed hyperbole. Not more hardware, but too much nonetheless.

~~~
GeorgeBeech
We take a lot of pride in how much we get out of each piece of hardware. You
don't need 1,000 servers to run a large site, just a performance-first
mindset.

------
jmccree
Perhaps off topic, but this post just reminds me so much of why I love AWS
EC2, etc. Not ever having to think about hardware again is wonderful.

~~~
rdoherty
I agree. I read this post and was shocked at the amount of planning, process,
and man-hours that go into a hardware upgrade, and at all the problems that
come with owning hardware. I've worked at places with ~500 EC2 machines in a
dozen autoscaling groups across 3 AZs with many ELBs, databases, SQS queues
and other AWS infrastructure and never had to deal with anything like this
when upgrading.

Upgrading hardware in EC2 is as simple as changing a launch configuration and
updating an auto-scaling group. Maybe an hour of my time to update configs,
verify and deploy. Updating something like a database or caching servers is
more work for sure, but with 0 time needed to get to the DC, unpack, rack and
configure servers you do save time with 'the cloud'.
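
Roughly, that flow looks like this - a minimal boto3 sketch, assuming a
hypothetical "web-asg" group and made-up AMI/instance-type values (launch
configurations are immutable, so an upgrade means creating a new one and
pointing the group at it):

  import boto3

  autoscaling = boto3.client("autoscaling")

  # Launch configurations can't be edited in place, so "new hardware"
  # means a new configuration with a bigger instance type.
  autoscaling.create_launch_configuration(
      LaunchConfigurationName="web-v2",  # hypothetical name
      ImageId="ami-12345678",            # hypothetical AMI
      InstanceType="c4.2xlarge",         # the upgraded "hardware"
  )

  # Point the auto-scaling group at it; instances launched from now on
  # use the new type, and the old ones age out or get terminated.
  autoscaling.update_auto_scaling_group(
      AutoScalingGroupName="web-asg",
      LaunchConfigurationName="web-v2",
  )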

I get that you do pay more for EC2 instances, especially if you keep hardware
for 4 years. But AWS prices drop every year or two, along with (generally)
faster hardware behind the instance types, so your overall costs do drop.

How many ops employees would you need for a fleet of 500 servers in a
datacenter? We managed it all with 4 people with AWS.

~~~
bri3d
The Stack Exchange philosophy is that because they can buy truly mega hardware
(each one of those two blade chassis they bought has 72 cores and 1.4TB of
RAM, remember!), they don't need those 500 servers to start with. Plus the
hardware is an asset and you get to depreciate it.

Everywhere I've ever worked we've had the "big spreadsheet" of projected cloud
costs, projected ops costs, and hardware costs. In general the "scale
horizontally" philosophy will favor the cloud while the "scale vertical"
philosophy still seems to favor owned hardware in local datacenters. Which is
superior is a crazy, long-standing debate with no clear answer.

~~~
MrBuddyCasino
Can you elaborate? I thought the answer to that question was to scale up if
you can, because it's much simpler and therefore cheaper. Similar to how you
don't give up ACID unless the scale you're working at no longer permits it.

~~~
bri3d
There's never really a "one size fits all" answer, which is why it's a long-
running debate and depends heavily on the product.

Scaling horizontally _can_ let you use smaller, cheaper hardware on average
and burst to higher capacity more easily if you need to, at the expense of a
lot of complexity. It also (done right, which is rare) tends to gain you a
greater degree of fault tolerance, since hardware instances become rapidly-
replaceable commodities.

Most web apps have spiky but relatively predictable load. For example, a
typical enterprise SaaS startup gets more traffic during work hours than on
weekends. For these companies, the complexity of developing a horizontally
scaled architecture can be offset by the savings of not having to buy really
big machines sized for peak load: scale out for the peak, then back to a
couple of small instances for periods of below-average load.

That's (ostensibly) why AWS exists in the first place: Amazon had to buy a lot
of peak capacity for Black Friday and Christmas and found it going unused the
rest of the year. They never meant to sell their excess capacity, but they
realized the tools that they built to dynamically scale their infrastructure
were valuable to others.

Plus, a lot of work is offline data analytics, ETL, and so on. It's very cost
effective to scale these workloads horizontally on-demand - spin up extra
workers to run your reporting each hour/night and keep costs down the rest of
the time when you don't need the capacity.
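
As a sketch of that pattern - scheduled scaling actions against a
hypothetical "etl-workers" group, using boto3 (names and times are made up):

  import boto3

  autoscaling = boto3.client("autoscaling")

  # Scale the reporting workers up ahead of the nightly ETL window...
  autoscaling.put_scheduled_update_group_action(
      AutoScalingGroupName="etl-workers",
      ScheduledActionName="nightly-scale-up",
      Recurrence="0 1 * * *",  # 01:00 UTC, cron syntax
      DesiredCapacity=20,
  )

  # ...and back down to zero once the window has passed, so you only
  # pay for the capacity while the jobs actually run.
  autoscaling.put_scheduled_update_group_action(
      AutoScalingGroupName="etl-workers",
      ScheduledActionName="nightly-scale-down",
      Recurrence="0 5 * * *",
      DesiredCapacity=0,
  )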

On the flip side, companies like Stack Exchange and Basecamp have high,
relatively stable traffic worldwide. For companies like this it makes more
sense to scale vertically - if they were in the cloud, they would never scale
down or shut down their instances anyway.

Personally, I agree that horizontal scalability is oversold and most people
can, indeed, scale up instead of out. However, plenty of smart people
disagree with me and have valid reasons to scale horizontally, too.

~~~
chx
> a typical enterprise SaaS startup gets more traffic during work hours than
> on weekends.

You still need to compare what you can get renting dedicated hardware vs.
renting virtual machines. E.g. a dual Xeon X5670 machine w/ 96GB RAM and
4x480GB SSD can be had for $249 per month (just something random I found for
demo purposes). Even with a one-year reserved instance on EC2, that kind of
money gets you an m3.2xlarge, and that's only 30GB RAM and 2x80GB SSD.

It might be worth it to rent this sort of iron instead of spinning EC2
instances up and down, especially if you can reasonably buy a large enough
machine to cut a lot of the headaches arising from distributed computing. The
right tool for the job.

Owning hardware is again a different bag of hurt.

------
cagenut
Woah, the datacenter you guys moved to is across the street from my apartment.
Just a heads up, I considered it for a project myself but ruled it out
because... well this is the back corner of that building:
http://40.media.tumblr.com/tumblr_lqnh988WRS1qzpdb2o4_1280.jpg

~~~
GeorgeBeech
That's nothing: https://gigaom.com/2012/10/31/how-good-prep-and-a-bucket-brigade-kept-peer-1-online-during-hurricane-sandy/

We were hosted at that Peer 1 facility during Sandy. Also, if the water makes
it up to the 16th floor it's time to give up anyway ... no one will care at
that point :-D

~~~
cagenut
Yea, we got hit hard by it too (33 Whitehall); that's why we went shopping for
a second location in Jersey. We wound up in Newark at 165 Halsey.

------
import
I'm really wondering how SO uses Redis.

For example, how are failure scenarios handled, and what is the role of the
slaves?

~~~
Nick-Craver
We have a hot slave at all times for all instances. In the event of a failure
the second server will kick in. All applications are already connected to
both servers via the StackExchange.Redis library
(https://github.com/StackExchange/StackExchange.Redis). The mechanisms for
failover, etc. are built in there. We use this library via the dashboard
(pictured at the bottom of the post) in Opserver
(https://github.com/opserver/Opserver) to do quick swapping of master/slave,
etc. We can do this during the day without anyone noticing.

Slaves are not just for backups - they also serve as pub/sub mechanisms. Every
publish propagates to slaves. Since we use pub/sub for things like web
sockets, we can easily move that entire concurrent connection load to another
data center with a simple DNS change. Yep, we've tested this - it worked well.
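
(Their stack talks to Redis through StackExchange.Redis, but the underlying
behavior is plain Redis and easy to demo with redis-py - hostnames and the
channel name here are made up:)

  import redis

  # A PUBLISH on the master replicates to every slave, so subscribers
  # can hang off a slave - even one in another data center.
  master = redis.Redis(host="redis-master.example.com")  # hypothetical host
  slave = redis.Redis(host="redis-slave.example.com")    # hypothetical host

  pubsub = slave.pubsub()
  pubsub.subscribe("websocket-events")  # hypothetical channel

  master.publish("websocket-events", "user 1234 went online")

  for message in pubsub.listen():
      # The first item is the subscribe confirmation; skip non-messages.
      if message["type"] == "message":
          print(message["data"])  # b'user 1234 went online'
          break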

It is of course worth noting: Redis just doesn't fail. We had one
out-of-memory failure for the Q&A web sites when forking, and that's it. It
has been rock solid here.

------
rasur
That's a really nice cabling job. I've inherited 2 DCs with just-awful-enough
cabling that I'd be exceptionally happy to set up new racks again and migrate,
just to _sort the bloody cables out_.

Sigh, don't think it's gonna happen.

As a customer of Dell, it was interesting to see someone else's insight on
managing a Dell farm.

So neat! I am mildly envious! ;)

~~~
Nick-Craver
Thanks! We honestly do this for us and my mild cabling OCD, but if we can help
_anyone_ by sharing the details then it's worth it.

~~~
rasur
I'm curious to know what your perceptions of the FX2s are, now that they've
been racked up.

(EDIT: Over and above what's in the article - I'm asking because they look
quite nice, and we haven't got any... yet! Still performing well?)

~~~
GeorgeBeech
They are nice; we honestly haven't had to do much with them after we got them
set up, since they are VM hosts.

The most annoying part was that the Force10 OS the IOAs run is awkward to
work with when you are used to working on Cisco gear. I'd almost rather
people stopped doing "it's close to Cisco" and did their own thing, because
the cognitive dissonance is jarring when something is close to, but not
quite, what you expect.

~~~
rasur
We've got 2 M1000e's, but they're slowly being retired and TBH they're in a
'do-less not do-more' situation (mostly!), so I haven't honestly had much
experience past run-of-the-mill blade config, and that particular issue
happily hasn't really bitten me :)

But thanks (to both you and Nick) for responding, the FX2s look interesting
enough that I might have to see about an eval unit :)

------
xtrumanx
> We learned that accidentally sticking a server with nothing but naked IIS
> into rotation is really bad. Sorry about that one.

Wait, does that mean what I think it means? Did someone get an IIS splash page
when visiting SO?

I was going to say they should live-blog during their next upgrade, but it
turns out they did it on Twitter, which is awesome.

~~~
aalear
Basically, yes. This happened:
https://twitter.com/atimmer10/status/560568418945236995.

~~~
xtrumanx
How long did it take to figure out it occurred? Do you have any stats on how
many people hit that machine?

Would be hilarious if someone who went to SO to figure out why the IIS splash
page was showing instead of their app happened to hit that page.

~~~
GeorgeBeech
It took about 20 minutes to notice and fix. The Twitter alerting system let us
know really quickly.

------
zaroth
I think what impressed me most was the power density that data center is
supporting. That looks like nearly 10kW of power in a single rack! Many
providers top out at 4kW and HE2 in Fremont was just 1.8kW per rack when I was
there.

------
stinos
_Most companies depreciate hardware across 3 years_

Can somebody explain why, exactly? Was there ever some large-scale study with
the outcome that this was somehow the cheapest? Or is it a mere byproduct of
the consumption-based society? I'm asking because I've used lots of different
types of hardware (both computing and non-computing) and >90% lasted _well_
over 3 years. It seems a waste of time/money to get rid of it after 3 years.
Which matches SE's findings: they chose 4 years. Still not that much, but
already 'better'.

~~~
sarnowski
Most leasing contracts are on a 3 year basis.

~~~
stinos
Ok yes, but why 3?

------
kfcm
Reading this brought back fond memories. Or nightmares. Sometimes it's
difficult to distinguish between the two.

------
chippy
If anything, you can see the enthusiasm and passion they have for their jobs.
It's inspiring!

------
aristotle
Thanks for sharing! Happy to see some of my past work in action.

