

Stack Overflow Architecture - timf
http://highscalability.com/stack-overflow-architecture

======
banditaras
"Joel boasts that for 1/10 the hardware they have performance comparable to
similarly size sites. He wonders if these other sites have good programmers.
Let's see how they did it and you be the judge."

Let me be the judge then. For a site serving 13 million pageviews per month
(80% of them are _uncachable_ searches) we use 2 servers with about the same
configuration (same memory, cpus). The database server has an average load of
1 and the application server (that is serving a bunch of other sites as well)
is under 2 most of the day.

We have 1/2 of their capacity running an equally heavy site (all sites running
on those servers make up 16-18 million pageviews per month). So if they run on
1/10 of similar sites, we run on 1/2 of their 1/10. Even better, we pay 100%
less than what they are paying. I wonder how smart Atwood is.

I don't intend to be a smart ass here. I would never say "Hey we run with 1/10
of your capacity, you are stupid" because performance heavily depends on the
application. StackOverflow probably has a 90% cache hit ratio (86% of visitors
are from google that land on some question asked some days or months ago). So
3 servers for a cache and forget site (logins and bits for pages that change
often can be cached too) serving 16M pageviews per month is below average.
They may be doing a whole lot of other things in the backend that we don't know
of, but the same goes for the other sites that "their programmers are stupid
and use 10x hardware."
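
To put the numbers above side by side, here is a rough back-of-the-envelope sketch. It assumes a 30-day month and perfectly even traffic (both simplifications), treats the 90% hit ratio as the guess it is, and the helper name is made up for illustration.

```python
SECONDS_PER_MONTH = 30 * 24 * 60 * 60  # assumption: 30-day month, traffic spread evenly


def backend_requests_per_second(pageviews_per_month: float, cache_hit_ratio: float) -> float:
    """Average pageviews per second that miss the cache and reach the backend."""
    uncached = pageviews_per_month * (1.0 - cache_hit_ratio)
    return uncached / SECONDS_PER_MONTH


# Stack Overflow: 16M pageviews/month with a guessed 90% cache hit ratio.
so_rate = backend_requests_per_second(16_000_000, 0.90)

# The search site described above: 13M pageviews/month, 80% uncacheable,
# which means a cache hit ratio of at most 20%.
search_rate = backend_requests_per_second(13_000_000, 0.20)

print(f"Stack Overflow (guess): {so_rate:.2f} uncached pageviews/s")
print(f"Search site:            {search_rate:.2f} uncached pageviews/s")
```

Under these assumptions the guess for Stack Overflow comes out well under one uncached pageview per second on average, against roughly four for the search site. These are averages only; peak load will be higher in both cases.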

I would expect them to say what problems they solved and how instead of
bragging about how awesome programmers they (he?) are.

~~~
spolsky
I never asserted we have better performance because we have great programmers
(although we do). I have stated that there is a performance benefit to using
the Microsoft Stack relative to other common platforms like PHP, because C#
just performs way better than PHP. And I've stated that the savings you get
from the small number of servers we require relative to a typical PHP site
more than pay for the Microsoft licenses.

~~~
jganetsk
Why pay the Microsoft licenses when you could have Java or Scala serve the
site?

------
kevinpet
I think a key item in Stack Overflow's success is that they can predict with
high reliability how big it will get. They are safe going with scale up
because they know they will not have to scale out.

Compare this to a more mainstream consumer website, say FriendFeed, which is
currently on roughly the same scale as Stack Overflow. FriendFeed's business
model is more "swing for the bleachers" and success would mean scaling to the
size of Twitter. They need an architecture that can handle fast growth if it
comes.

Stack Overflow is a much less risky proposition. They had excellent knowledge
of the market, knew they could pull all the traffic from expertsexchange
almost overnight, and they knew that scaling to tens of millions of monthly
visitors was something they needed to worry about.

------
dasil003
Great high-level summary. My impression, given my current work scaling up a
smaller site (3 million pageviews/month) with more social networking
functionality, is that Stack Overflow ought to be pretty amenable to caching. Obviously it's pretty
interactive so it's somewhere middle of the road. But what's a real cache
killer are the types of per-user customization that Facebook does. Not to take
anything away from Stack Overflow, but it seems like it ought to be served
pretty well by standard techniques whereas something like Facebook obviously
needs some juicy custom middleware. Would love to see an article about that.

~~~
alxp
StackOverflow does some pretty clever things like using JavaScript to dim or
highlight questions on the front page based on the user's Ignored or Interested
tags, instead of generating a unique set of front page items HTML for every
user.
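
A rough sketch of why this helps caching: the front page HTML is identical for every user and can be cached once, while the per-user part shrinks to a small set of tag preferences applied afterwards. On the real site this runs as JavaScript in the browser; the Python below only models the decision logic. All names (`row_class`, the CSS class names, the sample data) are hypothetical, and how a question carrying both kinds of tag is treated is an assumption.

```python
def row_class(question_tags: set, interested: set, ignored: set) -> str:
    """Pick a CSS class for one front-page question from the user's tag preferences."""
    if question_tags & interested:  # assumption: an Interested tag wins over an Ignored one
        return "highlighted"
    if question_tags & ignored:
        return "dimmed"
    return ""


# The same question list (and hence the same cached HTML) is served to everyone.
front_page = {
    101: {"c#", "linq"},
    102: {"php", "mysql"},
    103: {"python"},
}

# Only this small per-user mapping differs between visitors.
classes = {
    qid: row_class(tags, interested={"c#"}, ignored={"php"})
    for qid, tags in front_page.items()
}
print(classes)
```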

~~~
garcara
This might be clever for them but I hate seeing these dimmed items, dimmed !=
ignored.

~~~
ryne
True for me as well even here on HN; I go out of my way to read downvoted
comments when they're barely legible by highlighting them.

------
biohacker42
_My impression is they pay about $11K for OS and SQL licensing._

11K probably "cost" them less than the time and effort to come up to speed on
open source solutions. They were MS experts already and it would have taken
quite a bit to reach the same level of expertise in *nix land.

If you're young and just starting out and are wondering if you should become
an expert in free or commercial software, keep their situation in mind.

~~~
pbz
If they're using BizSpark like the article claims, wouldn't it be free?

~~~
rewind
That's correct. They get three years of production licenses, even for SQL
Server Enterprise Edition.

------
NoHandle
Stack Overflow certainly deserves some credit. I was unaware of how much
growth they have seen and that is mostly due to the fact that the service has
rarely diminished for me. Taking on that kind of increased load, while
preemptively scaling to meet it, is no easy task.

~~~
marcusbooster
Eh, I could do it in a weekend. (I'm so sorry)

~~~
NoHandle
That made me smile so I forgive you.

------
sstrudeau
StackOverflow's traffic numbers:

* 16 million page views a month
* 3 million unique visitors a month
* 6 million visits a month

... which is very similar to what my sites are doing (we do more than 20m page
views w/ about the same # of uniques). Our content is very image heavy, though
_maybe_ a little more amenable to caching.

StackOverflow is running on 2 quad-core 8GB boxes and one 8-core 48GB db box.
We're keeping up with VPS "slices" at SliceHost that add up to less than two
full 4-core 16GB standard SliceHost boxes; and I expect to reduce capacity
when I finish moving our image assets off to S3+CDN.

We have one full time developer/sysadmin/etc.: me. Are other similarly
trafficked sites really using a lot more iron? I thought we were typical for
this scale.

------
brown9-2
This article kind of feels like it was just scraped from
<http://blog.stackoverflow.com/2008/09/what-was-stack-overflow-built-with/>
and <http://blog.stackoverflow.com>, doesn't it?

I'd be much more interested in knowing about the internal architecture of the
software running the site rather than just "they threw this server together
with that one and then v1.2.3 of this other one too".

~~~
toddh
It was indeed scraped from all the sources listed in the article. I would have
liked more too, but for me the emphasis was more on the viability of scale up.

------
jrockway
_The refactorings will be to avoid excessive joins in a lot of key queries.
This is the key lesson from giant multi-terabyte table schemas (like Google’s
BigTable) which are completely join-free._

Congratulations, you just implemented your key/value store in MSSQL.

~~~
rewind
Denormalizing in some areas for performance is hardly the same as implementing
a key/value store in MSSQL.
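
To make the distinction concrete, here is a minimal sketch using SQLite from Python's standard library as a stand-in for SQL Server. The schema, the table names, and the `answer_count` column are made up for illustration and are not Stack Overflow's actual schema. One hot query reads a denormalized counter instead of joining, while the rest of the schema stays relational.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE questions (
        id           INTEGER PRIMARY KEY,
        title        TEXT NOT NULL,
        answer_count INTEGER NOT NULL DEFAULT 0  -- denormalized copy of COUNT(answers)
    );
    CREATE TABLE answers (
        id          INTEGER PRIMARY KEY,
        question_id INTEGER NOT NULL REFERENCES questions(id),
        body        TEXT NOT NULL
    );
""")


def add_answer(question_id: int, body: str) -> None:
    """Insert an answer and keep the denormalized counter in sync in one transaction."""
    with conn:
        conn.execute(
            "INSERT INTO answers (question_id, body) VALUES (?, ?)", (question_id, body)
        )
        conn.execute(
            "UPDATE questions SET answer_count = answer_count + 1 WHERE id = ?",
            (question_id,),
        )


with conn:
    conn.execute("INSERT INTO questions (id, title) VALUES (1, 'How do I scale up?')")
add_answer(1, "Buy a bigger box.")
add_answer(1, "Add more RAM.")

# Hot path (e.g. a question listing): no join needed.
fast = conn.execute("SELECT title, answer_count FROM questions").fetchall()

# Everything else can still use joins and aggregates as usual.
joined = conn.execute("""
    SELECT q.title, COUNT(a.id)
    FROM questions q LEFT JOIN answers a ON a.question_id = q.id
    GROUP BY q.id
""").fetchall()

print(fast, joined)
```

The price is that every write path has to keep the copy consistent, which is the usual trade-off of denormalizing selectively.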

~~~
tumult
Actually, yeah, it is. :)

~~~
rewind
Right, because I'm sure they stopped using joins (as well as all the other
RDBMS benefits) completely after they made those changes ;-)

------
mukyu
"To get around these problems Salesforce's Craig Weissman, Chief Architect,
created an innovative approach where tables are not created for each customer.
All data from all customers is mapped into the same data table, including
indexes. The schema for that table looks something like orgid, oid, value0,
value1...value500. "orgid" is the organization ID and is how data is never
mixed up. It's a very wide and sparse table, which Oracle seems to handle
well. Hundreds and hundreds of "tables" and custom fields are mapped into the
data table."

I thought I was on thedailywtf for a second there. So they took Oracle, and
implemented an RDBMS in it?
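
For anyone who has not seen this pattern, here is a toy sketch of the layout the quote describes, using SQLite from Python's standard library rather than Oracle. Only `orgid`, `oid`, and the `valueN` columns come from the quote. The `field_map` metadata table, the `object_name` column, and the `fetch` helper are guesses at what such a design needs, and three value columns stand in for the roughly 500.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- One shared, wide table for every tenant's rows (3 slots here, not ~500).
    CREATE TABLE data (
        orgid       INTEGER NOT NULL,
        oid         INTEGER NOT NULL,
        object_name TEXT NOT NULL,       -- assumption: which logical "table" the row is in
        value0 TEXT, value1 TEXT, value2 TEXT,
        PRIMARY KEY (orgid, oid)
    );
    -- Hypothetical metadata: which slot holds which logical field for which tenant.
    CREATE TABLE field_map (
        orgid       INTEGER NOT NULL,
        object_name TEXT NOT NULL,
        field_name  TEXT NOT NULL,
        slot        INTEGER NOT NULL,
        PRIMARY KEY (orgid, object_name, field_name)
    );
""")
conn.executemany(
    "INSERT INTO field_map VALUES (?, ?, ?, ?)",
    [(1, "invoice", "customer", 0), (1, "invoice", "amount", 1), (2, "ticket", "subject", 0)],
)
conn.executemany(
    "INSERT INTO data (orgid, oid, object_name, value0, value1) VALUES (?, ?, ?, ?, ?)",
    [(1, 1, "invoice", "ACME", "120.50"), (2, 1, "ticket", "Printer on fire", None)],
)


def fetch(orgid: int, object_name: str, field: str) -> list:
    """Read one logical field of one tenant's 'table' by looking up its slot first."""
    row = conn.execute(
        "SELECT slot FROM field_map WHERE orgid = ? AND object_name = ? AND field_name = ?",
        (orgid, object_name, field),
    ).fetchone()
    if row is None:
        raise KeyError(f"unknown field {field!r} on {object_name!r} for org {orgid}")
    column = f"value{int(row[0])}"  # slot comes from our own metadata, not from user input
    query = f"SELECT {column} FROM data WHERE orgid = ? AND object_name = ? ORDER BY oid"
    return [r[0] for r in conn.execute(query, (orgid, object_name))]


print(fetch(1, "invoice", "amount"))
print(fetch(2, "ticket", "subject"))
```

In this toy version every value is stored as text, and column names, types, and per-tenant isolation all live in application code and metadata rather than in the database's own schema.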

~~~
toddh
It is quite strange, but it does reflect the needs of finding a real solution
to a real problem. I'm not aware, at least, of a similar multitenant site that
serves so many customers, with so much data, with so few servers, and such
extensive customization. Having said that, Oracle seemed non-negotiable in his
mind, and I wonder whether a different solution might have evolved if that
weren't so.

------
rbanffy
Is every Microsoft-based heavy-traffic database-driven site using this much
hardware for just a couple dozen million page-views a month?

Seriously, the hardware they have has a lot of room for growth.

------
StrawberryFrog
"All data from all customers is mapped into the same data table, including
indexes. The schema for that table looks something like orgid, oid, value0,
value1...value500. "

They make it sound like that's a good thing. It is not. Show that to any
halfway competent SQL guy and you'll get a disgusted response. Implementing a
database on top of an existing RDBMS is an antipattern.

------
blasdel
Their handling of multi-tenancy is bound to be hilarious:
<http://news.ycombinator.com/item?id=67839>

~~~
johns
Huh?

~~~
blasdel
What will be the distinction between 'shared' and 'dedicated' hosting?

How are they going to get 'shared' hosting working with such a silo-ed
vertical setup? Are they going to be assigning customer sites to machines à la
Dreamhost et al.?

Will they end up paying for separate Windows Server and SQL Server licenses to
run each 'dedicated' site in its own VM? Are they going to be manually
migrating customers who exceed the size of what they can handle in one VM?

~~~
johns
Fog Creek is running the hosted version and they already have the
infrastructure in place running FogBugz on Demand. And I'm confident enough to
guess that none of the hosted instances will have anywhere near the traffic of
SO.

~~~
blasdel
Totally forgot about FogBugz on Demand, that mostly explains how the 'shared'
hosting will end up working (though doesn't FogBugz use mysql?)

~~~
mhp
On Demand uses SQL Server, but yeah, licensed FogBugz can run against mysql.

