
Why does digg need so many servers? - latch
http://twitter.com/spolsky/statuses/27244766467
======
tibbon
Just as perplexing, why do they need ~68 employees
(<http://about.digg.com/team>).

My counting may be off by 1-2, but last I checked Reddit is coming up on them
traffic-wise (<http://siteanalytics.compete.com/reddit.com+digg.com/>), and
only has 7 employees? And I'm pretty sure Reddit doesn't have 500 servers
either, although I could be wrong.

Both sites offer very similar functionality.

While past its peak, Slashdot only has 18 servers.
(<http://slashdot.org/faq/tech.shtml#te050>)

~~~
Lewisham
My understanding (which is, to say, belief built up over watching the
companies from outside) is that Reddit is currently running what could best be
described as a skeleton crew: Conde Nast won't fund them, so they're just
trying to keep the ship afloat and don't have time for anything else.

Digg has scaled to the size of a company that, for a couple of years now, has
considered itself just on the tipping point of "something big happening." As
in there being a sea change in the way everyone consumes their news, and Digg
is at the center. You'll need lots of employees when that happens, right? I
think Digg has partly grown fat due to non-existent leadership; the struggle
between Jay Adelson and Kevin Rose has been well-documented, and having your
two lead guys check out for at least a year is not a good way to run a
business. A lot of the people there are "sales", which I guess helps Digg stay
profitable, but I don't really think you need as many as they have. Five
community managers on a community that is supposed to manage itself is also
excessive.

I expect the truth of the matter lies somewhere in the middle of Reddit and
Digg. More than Reddit so you're not stagnant, less than Digg so you're not
fat.

~~~
ianb
Everything that is cool about Reddit at this point is what Redditors are doing
_with_ Reddit, not what Reddit is doing with its site. And isn't that how
social media is supposed to work? Not just some notion that users are tools,
that you'll crowdsource your way to being yet-another-traditional-media source
(except without journalists)... social media _should_ be individuals forming
their own communities... communities of interest, communities of practice,
communities of support -- all of which exist in Reddit (/r/SuicideWatch both
disturbs and impresses me).

So the fact Reddit-the-software isn't changing that much doesn't seem like
that big an issue, it's more like infrastructure. A major redesign would be
negatively disruptive to the communities that are building there. Not that
there aren't great things Reddit could do but isn't, but I think they are
doing well at the most important stuff. If Reddit had 68 employees they not
only would be fat, they'd probably fuck up a good thing.

~~~
Lewisham
Do not take my comment to mean I'm anti-Reddit, I like Reddit way more than I
ever did Digg :) But Reddit doesn't really _do_ a lot apart from keeping
everything going.

I agree that altering Reddit now is probably not a good idea, and they don't
need a large headcount, but it's good that Reddit isn't answering to
shareholders directly, because if it wasn't for Digg's implosion, their growth
would have been far too slow.

------
schulz
The difference is that Digg has network information (i.e., followers). Stack
Overflow doesn't.

Think through the ramifications of each of these page flows:

1\. Give me this question and everybody who commented on it.

2\. Give me every post that anyone this user is following has made over the
last n days.

That's why.

~~~
mhill
That would increase the load on the database. Why does that increase the load
on the webservers?

~~~
schulz
It depends how you approach the problem. I'm guessing here, from my own
experience implementing social graph features, but here goes:

There are a couple of ways you can go about this. The first is the database:
join the network of people against the land of content and bring it back. This
doesn't work; your database will cry. (As an aside, this is what people mean
when they say web scale: it has nothing to do with web traffic, it's social
graphs.) Not at first, though: in development it works fine, and you feel
fine, and for a while you're ok, but you start growing....

Another way you can go about it is by denormalizing. In this world you store a
pointer to each content item for each user. So anytime I do something all the
people [following|watch|connected|friended] to me get a record indicating I
did this. This works, but now you have lots of data (lots and lots of data!)
spread all crazy around. You need some kind of system to push that data out to
everybody. It's those last two that drive up your hardware usage, it's not
necessarily web boxes, but it's boxes in the background broadcasting the
events out to the world, and the datastores to hold it all. Depending on how
your web code works you could also have a lot of overhead on the webservers
putting all that stuff together.
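
To make the two approaches concrete, here is a minimal sketch, assuming plain in-memory dictionaries in place of real datastores. Every name in it (`follow`, `publish`, `feed_by_join`, `feed_denormalized`) is hypothetical and made up for illustration; it is not how toolbox.com or Digg actually do it.

```python
from collections import defaultdict

# Hypothetical in-memory stand-ins for the real datastores.
followers = defaultdict(set)         # author -> users following that author
following = defaultdict(set)         # user -> authors that user follows
posts_by_author = defaultdict(list)  # author -> [(timestamp, post_id)]
feeds = defaultdict(list)            # user -> [(timestamp, post_id)], denormalized


def follow(user, author):
    """Record that `user` follows `author` in both directions."""
    following[user].add(author)
    followers[author].add(user)


def publish(author, post_id, timestamp):
    """Store a post, then fan it out on write: one pointer per follower.

    Returns the number of feed records written, which is the extra work
    the background boxes have to do for every single action.
    """
    posts_by_author[author].append((timestamp, post_id))
    for user in followers[author]:
        feeds[user].append((timestamp, post_id))
    return len(followers[author])


def feed_by_join(user, since):
    """Approach 1: join the user's network against the content at read time."""
    rows = [
        (ts, post_id)
        for author in following[user]
        for ts, post_id in posts_by_author[author]
        if ts >= since
    ]
    return sorted(rows, reverse=True)


def feed_denormalized(user, since):
    """Approach 2: read only the pointers already pushed to this user."""
    return sorted((row for row in feeds[user] if row[0] >= since), reverse=True)
```

The read in `feed_by_join` touches every author the user follows, while `feed_denormalized` touches only that user's own records. The price is paid in `publish`, which writes one record per follower, so an author with 750k followers turns one post into 750k writes.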

My experience here comes from building the social features into toolbox.com. A
good example is this page
(<http://it.toolbox.com/people/george_krautzel/posts-connections>).
That's all the posts from users connected to our CEO (all 750k of them).
Getting that to return in near real time is super fun (and you can probably
tell that I went down the DB join path before it all fell apart).

------
Goladus
Where does the "500 servers" number come from?

Are they compute nodes? Are they webservers(5)? Webservers and appservers(10)?
Web, app, database(12)? Web, app, database, back-end processing(15)? How much
redundancy is built in(30)? Staging environments(60)? Development
environments(120)? Standard IT infrastructure like DNS and email(125)?

"We run over 500 servers" sounds a lot like "we have 500 servers in our
datacenters" not "it takes 500 servers to handle 200MM page views."

~~~
sfrench
I used to work at Digg as an engineer. I won't say your distribution numbers
are correct, but you've got the gist. There are multiple environments,
reserved capacity, research and testing machines, etc.

------
jswinghammer
My last job served around 30-40 million pages a month on 4 servers that used
ASP.Net and SQL Server 2000. The only server whose CPU utilization was over
20% was the database server. I'm not sure the CPU utilization on the web
servers ever went above 10%.

A friend of mine's company has a tough time keeping up with 500,000 page views
a month using more equipment. They're using PHP and MySQL.

~~~
rythie
Apples and oranges comparison, really.

\- What do the sites do?

\- How optimised are they?

\- How much static/cachable content is there?

Really there is too little data for this comparison.

~~~
jswinghammer
Both are simple sites by any definition. The database queries for almost all
the page views are clustered index seeks (on SQL Server) and in MySQL they
are mostly looking up on the primary keys for the tables used (these are
MyISAM tables so no clustered indexes here).

Both are equally unoptimized outside of the database queries which I believe
are fairly well optimized.

No real cacheable content on either outside of the usual suspects (JS, CSS,
some images)

------
mrtron
Why does it take 20 feet of land to grow a gallon of orange juice and 200 feet
of land to grow a gallon of apple juice?

~~~
njharman
A site that posts links and has voting and comments vs a site that posts
questions and has voting and comments is hardly apples vs oranges.

------
marcrosoft
One thing is for sure, Spolsky sure knows how to scale his ego.

~~~
jswinghammer
I'm not sure what that's supposed to mean.

------
jbail
Maybe Digg uses small servers whereas StackOverflow uses big servers?

~~~
fauigerzigerk
Using big servers would be what Microsoft's licensing model makes you want to
do.

------
jgrahamc
Non-linearity.

~~~
nkassis
Even then, 500 servers? That's an incredible amount.

EDIT: Then again Facebook is using 60000 servers at the moment.

~~~
ceejayoz
Facebook has a lot more than 100 times digg's registered users, though.

~~~
dasil003
Not to mention a basically non-existent cachability profile, and an order of
magnitude more functionality.

~~~
nkassis
I'm assuming you mean Facebook has more functionality and no cacheability.
That might be true, but I wonder how expensive rendering a Digg discussion page
(with 100-200 posts) is compared to, say, a 5-answer SO question or even just
a simple Facebook profile.

~~~
ceejayoz
As I understand it, a Facebook profile is anything but simple.

------
wanderr
I mentioned this to Spolsky already but depending how you want to count
servers (web servers or all front facing servers or all servers in your
infrastructure?), we have 10-20 servers servicing 350 million page views per
month. That is of course not counting all the ajax requests that we also
service, which don't count as page views. So we service 17-35 million page
views per server, compared to StackExchange's 12 million page views per
server.
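
For anyone checking the arithmetic, a quick sketch using only the figures quoted above (350 million page views a month, 10-20 servers); the function name is made up for illustration:

```python
def page_views_per_server(monthly_page_views, servers):
    """Average monthly page views handled by each server."""
    return monthly_page_views / servers


MONTHLY_PAGE_VIEWS = 350_000_000  # figure quoted in the comment above

# More servers means fewer page views each, so 20 servers gives the low end.
low = page_views_per_server(MONTHLY_PAGE_VIEWS, 20)
high = page_views_per_server(MONTHLY_PAGE_VIEWS, 10)

print(f"{low / 1e6:.1f}-{high / 1e6:.1f} million page views per server")
# prints "17.5-35.0 million page views per server"
```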

We use PHP and Apache, so the gloating about ASP being way more efficient than
PHP seems to be unfounded.

------
benologist
Digg must be pluggr's biggest customer... <http://pluggr.info/>

------
ojbyrne
Where does the "500" come from?

~~~
sfrench
Kevin mentioned it somewhere a few weeks ago. I forget where though.

------
nozepas
bad server administrators? no optimization? no cache? static content served
from apache servers instead of, for example lighttpd?

there could be many reasons

------
fauigerzigerk
What's the cost per page view in each case?

------
robryan
Depends on the servers too; I'm sure it would take more than 5 of the servers
that Google runs to run SO.

------
eogas
Digg is using Eee PCs as servers?

------
mhill
[Sunglasses On] Sounds like the .Net stack performs much better than the LAMP
stack.

~~~
mhill
I thought people on HN are more open to technology discussion. It seems people
can't handle their bubble being pierced when presented with facts.

It would be more interesting if you had facts to refute my statement rather
than downvotes.

