
StackOverflow Update: 560M Pageviews a Month, 25 Servers - quicksilver03
http://highscalability.com/blog/2014/7/21/stackoverflow-update-560m-pageviews-a-month-25-servers-and-i.html
======
noelwelsh
StackOverflow shows just how powerful using a fast language can be. Compare
the TechEmpower benchmarks for, say, Java vs Rails on a very simple JSON
serialization benchmark (we can assume .Net would achieve comparable
performance to the JVM):

[http://www.techempower.com/benchmarks/](http://www.techempower.com/benchmarks/)

The Java servers get around 900K requests/s on beefy hardware while Rails
squeezes out 6K. That's a 150x difference! Any real application will be slower
than this, and on cloud hardware you can expect to be an order of magnitude
slower as well. You just don't have much headroom if you use languages like
Ruby before you have to scale out. And once you scale out you have to worry
about sys. admin. and all the problems of a distributed system.

It's a one off cost to learn an efficient language, but it pays returns
forever.

~~~
Maarten88
> StackOverflow shows just how powerful using a fast language can be. Compare
> the TechEmpower benchmarks [...] (we can assume .Net would achieve
> comparable performance to the JVM)

Except those TechEmpower benchmarks show .NET is not nearly as fast as Java. I
think StackExchange proves that the platform is NOT the most important thing: it's
much more important to make performance a priority in all engineering
decisions, to benchmark everything beforehand and develop new technology where
there's no good standard solution available. (think Dapper, protobuf-net,
StackExchange.Redis)

~~~
bhauer
I'd like to add one caveat with respect to our C# tests: we did not run the
tests on Windows in Round 9. With the help of contributors, we've recently
revamped our benchmark toolset, and have yet to pull the Windows
implementation of the toolset up to date. The C# data in Round 9 is
exclusively from Mono on Linux.

To more accurately judge the request routing performance of C# tests on
Windows, see Round 8 data. For example, see Round 8 limited to Java, C#, and
Ruby tests on i7 hardware [1].

Other important notes on that data:

* The http-listener test is a low-level test implementation.

* The rack-jruby test is running on Torqbox (TorqueBox 4), which is based on Undertow and has very little Ruby code [2]. This test is mostly about Undertow/Torqbox, with a bit of JRuby.

Another interesting test is Fortunes on i7 in Round 8 [3]. Fortunes involves
request routing, a database query, ORM translation of multiple rows, dynamic
addition of rows, in-memory sorting, server-side templating, XSS
countermeasures, and UTF-8 encoding. Here you will see Java frameworks at
about 45K rps, ASP.NET on Windows at about 15K rps, and Ruby at about 2.5K
rps.

[1]
[http://www.techempower.com/benchmarks/#section=data-r8&hw=i7...](http://www.techempower.com/benchmarks/#section=data-r8&hw=i7&test=json&l=39e)

[2]
[https://github.com/TechEmpower/FrameworkBenchmarks/blob/mast...](https://github.com/TechEmpower/FrameworkBenchmarks/blob/master/rack/config.ru)

[3]
[http://www.techempower.com/benchmarks/#section=data-r8&hw=i7...](http://www.techempower.com/benchmarks/#section=data-r8&hw=i7&test=fortune&l=39e)

------
kbenson
> The cost of inefficient code can be higher than you think. Efficient code
> stretches hardware further, reduces power usage, makes code easier for
> programmers to understand.

I'm curious what the reasoning is for "Efficient code ... makes code easier
for programmers to understand". To my mind, efficient code (in this case, I
assume coding to the hardware, as they mention elsewhere), has many benefits,
but making it easier to understand is _not_ one of them. A useful comment by
the code may help, but that's not a result of efficient coding, that's a
result of good commenting practice.

~~~
shurcooL
Sometimes "efficient" and "easy to understand and be sure is correct" don't
have to be mutually exclusive; see this example [1] of Java and Go.

Note that the code is autogenerated, so it should be equally efficient. The Go
version also happens to be very simple and no different than most humans would
write by hand (without trying very hard to optimize).

[1]
[https://gist.github.com/shurcooL/9f94bbd021b4693cf584](https://gist.github.com/shurcooL/9f94bbd021b4693cf584)

~~~
nostrademons
The code samples in your gist aren't doing the same thing. The Java version
decodes the UTF-8 stored in the protobuf into Java's native UTF-16 Strings on
first access, while Go strings are native UTF-8 and so only a nil check is
necessary.

Or are you saying that languages should always use UTF-8 natively? I would
agree with you on that, but disagree that this proves your point that
"efficient" and "easy to understand and verify correctness" aren't mutually
exclusive. pb.getSomeId().charAt(1000) runs in constant time in Java (albeit
failing to give correct data if getSomeId() contains codepoints higher than
\uFFFF), but pb.GetSomeId()[1000] will give you garbage data in Go if your
field contains non-ASCII text. To get a valid codepoint in Go, you'd need to
do string([]rune(pb.GetSomeId())[1000]), which runs in linear time and omits
the check for valid UTF-8 bytes.

~~~
shurcooL
That's a good point, it seems the Java code is checking if the entire string
is valid UTF-8 while the Go version isn't. I wonder why the behavior of
generated protobufs is different between the two.

------
lnanek2
Really hate the separate sites thing. I use a half dozen of them and they are
computer related, so it is a pain in the ass always having to register and not
being allowed to comment/answer at first, etc. They have a few improvements
now like importing your profile from other sites, but they shouldn't even have
so many separate sites in the first place, just a tag or a category or
something.

~~~
recursive
You get +100 rep bonus for linking your accounts, so you shouldn't have a
problem being allowed to comment.

~~~
rayshan
How? I can't seem to find this option in account admin panels.

~~~
recursive
It happens automatically when you use the same account on multiple SE sites.

------
tomblomfield
"One problem is not many tests. Tests aren’t needed because there’s a great
community... If users find any problems with it they report the bugs that
they’ve found."

I'm often surprised at the paucity of test-coverage in relatively large
companies.

[http://nerds.airbnb.com/testing-at-airbnb/](http://nerds.airbnb.com/testing-at-airbnb/)

~~~
mseebach
I think we're going through a thesis/antithesis cycle on tests - in the
beginning, there was militant testing: 100% coverage, testing getters and
setters, etc. (as well as more complex stuff, obviously). Then some people
started coming around to the idea that there are actually large swathes of
code that are simple enough that testing doesn't add much value, especially
compared to the effort of writing the tests; then that probably got a bit out
of hand (which is what you're referring to). Maybe the pendulum will swing
back and we'll find a good heuristic for just how much testing is the right
amount, rather than all or nothing?

~~~
yazaddaruvala
It really also does depend on the language, or rather compiler. Specifically,
tests are far more useful when static analysis doesn't exist.

------
habosa
> With their SQL Servers loaded with 384 GB of RAM and 2TB of SSD, AWS would
> cost a fortune.

I have next to zero experience with server administration, but 384GB seems
like a lot to me. Is that common for production servers for popular web
services? Do you need a customized OS to address that much memory? Seems like
you'd really need to beef up the cache hierarchy to make 0.38TB of RAM fast.

~~~
bradyd
I don't believe that much RAM is uncommon for large scale database servers.
384GB RAM is only about $5000 from Dell. They also have a new server model
coming out that supports up to 6TB of RAM [0].

[0]
[http://www.dell.com/us/business/p/poweredge-r920/pd](http://www.dell.com/us/business/p/poweredge-r920/pd)

~~~
superuser2
How common is it to buy _one_ extremely powerful database server? Wouldn't you
need the high availability properties of something like a Galera cluster? Or
are most people accepting the single point of failure of a huge database server?

(I honestly don't know, it's just surprising.)

~~~
emilv
While you do want HA, you also want to keep things simple. You get great HA
with just one backup machine. The times you lose two machines are so few that
you might crash the site with software failures far more often anyway.

Databases have inherent locking problems that take more resources to resolve
with more machines, more so in some database systems than others. When scaling
you often hit a point where it comes down to macro logistics, so imagine a
highway: you can get more throughput by adding more lanes. But what if all
cars have to merge into one lane at some point in the middle of the trip
(because of a tunnel, a bridge or something; it's impossible to have more than
one lane at that point)? Now you won't get any throughput benefit from the
multiple lanes after all - just more latency, because queues build up before
the single lane, and because of the queues all cars are going slowly and need
to accelerate on the single lane, so the average speed is low too. You also
have too much capacity after the single lane, because you can never fill all
those lanes. It might be better to have one beefed-up single-lane road all the
way that people can go fast on. Basically: remove locks and you get better
overall performance.

Yes, this is at the expense of HA. Yes, the cost of scaling up grows
asymptotically faster than the cost of scaling out. So this is definitely a
trade-off in some sense.

------
chiph
I've recently started using their micro-ORM Dapper, and I like it a lot. I get
the performance of hand-coded SQL, but without the tedious mapping from
SqlDataReader to my entity.

~~~
Aaronontheweb
Dapper is great - I promptly dumped NHibernate and Entity Framework for it.
It's the only ORM I've found that makes it trivially easy (as it should be) to
use stored procedures and views in queries.

~~~
atonse
+1 for Dapper - I'm a big fan. It is very fast and we use it to serve our live
(mostly read-only) traffic, while using something like Entity Framework for
the backend Administrative UI, which has a lot of inserts and updates.

~~~
chiph
There's a couple of user-contrib projects that do similar things for inserts &
updates. Dapper Extensions is the one I'm using.

[https://github.com/tmsmith/Dapper-
Extensions](https://github.com/tmsmith/Dapper-Extensions)

~~~
Aaronontheweb
I wrote a Dapper extension for working with SQL Server's geospatial queries
and types a couple of years ago - have they added anything like that yet?
Otherwise I'd be happy to add it.

~~~
Nick-Craver
Marc Gravell added in the ability to put in pretty much any custom type
without adding any dependency weight about a month ago. You can see the commit
here: [https://github.com/StackExchange/dapper-dot-
net/commit/e26ee...](https://github.com/StackExchange/dapper-dot-
net/commit/e26ee0abe5bdff561ff59fded87cabb9f5d983a1)

Look towards the end at the tests for example of how to hook up a custom type
(it's pretty simple).

~~~
Aaronontheweb
Cool - the one I wrote was years ago. Looks like it's not necessary any more.

------
MichaelGG
So this is what, a 2000 request/sec peak? Over 11 servers, that's like 200
requests/sec peak per frontend?

The problem with scale-up is if you actually have to get a few times larger,
it becomes super expensive. But fortunately hardware is increasing so much
that you can probably just get away with it now. There's probably a crossover
point we're rapidly approaching where even global-scale sites can just do all
their transactions in RAM and keep it there (replicated). I know that's what
VoltDB is counting on.

~~~
Nick-Craver
Peak is more like 2600-3000 requests/sec on most weekdays. Remember that
programming, being a profession, means our weekdays are significantly busier
than weekends (as you can see here:
[https://www.quantcast.com/p-c1rF4kxgLUzNc](https://www.quantcast.com/p-c1rF4kxgLUzNc)).

It's almost all over 9 servers, because 10 and 11 are only for
meta.stackexchange.com, meta.stackoverflow.com, and the development tier.
Those servers also run around 10-20% CPU which means we have quite a bit of
headroom available. Here's a screenshot of our dashboard taken just now:
[http://i.stack.imgur.com/HPdtl.png](http://i.stack.imgur.com/HPdtl.png) We
can currently handle the full load of all sites (including Stack Overflow) on
2 servers...not 1 though, that ends badly with thread exhaustion.

We could add web servers pretty cheaply; these servers are approaching 4 years
old and weren't even close to top-of-the-line back then. Even current
generation replacements would be several times more powerful, if we needed to
go that route.

Honestly the only scale-up problem we have is SSD space on the SQL boxes due
to the growth pattern of reliability vs. space in the non-consumer space. By
that I mean drives that have capacitors for power loss and such. I actually
just wrote a lengthy email about what we're planning for storage on one of our
SQL clusters...perhaps I should echo it verbatim as a blog post? I'm not sure
how many people care about that sort of stuff outside our teams.

Nick Craver - Stack Exchange Sysadmin & Developer

~~~
voltagex_
>I actually just wrote a lengthy email about what we're planning for storage
on one of our SQL clusters...perhaps I should echo it verbatim as a blog post?
I'm not sure how many people care about that sort of stuff outside our teams.

I'm sure some DBAs and devs here would find it interesting.

~~~
bagels
This is definitely the case. Every writeup from those in the trenches I read
and share with coworkers.

------
lemcoe9
Makes me wonder why smaller website teams need dozens of engineers to keep
their infinitesimally smaller app running.

~~~
daigoba66
I personally think it's due to overly complex applications and systems with
many moving (and breaking) parts.

Also StackOverflow often practices "scale-up" instead of "scale-out".

~~~
theandrewbailey
More specifically, I think it's due to long feature lists and multiple problem
domains. I'm also thinking about in-house line-of-business software.

------
sz4kerto
Extremely interesting read, especially because it goes against many
fashionable engineering practices. They see (correctly) that abstractions and
DI have their own compromises. Also, they're not religious about technology,
which is also quite rare.

------
ternaryoperator
"SO goes to great lengths to reduce garbage collection costs, skipping
practices like TDD, avoiding layers of abstraction, and using static methods."

I don't understand this at all. What does TDD have to do with reducing garbage
collection?

~~~
numbsafari
My guess is that they feel the layers of indirection and abstraction often
needed to make TDD work lead to an object creation pattern that results in
heavy GC load during normal operation. The reference to "using static
methods" is probably related to this.

ps. That's my guess, but I'd encourage you to post your question to the meta
site for SO.

~~~
bjourne
IME that always happens when you try to do TDD in combination with Javaesque
encapsulation. The good solution to the problem is to not be so afraid of
classes seeing each other's internals. The bad solution is to add factory
patterns, dependency injectors and other useless layers just to try to keep
your design both well encapsulated and testable.

~~~
edwinnathaniel
You do realize that most efficient DI frameworks only inject the dependency
once?

Separating Controller, Repository, and Services is good practice as well, and
let's be honest, we're looking at three layers of methods at most.

Here's what happened in Java:

1\. When you deploy your WAR, the DI will inject the necessary components
_once_ (and these components are only instantiated _once_ for the whole
web-app, so there you go: Singletons without hardcoding).

2\. A request comes in and gets processed by the pumped-up Servlet (already
DI-ed). If the servlet has not been instantiated, it will be instantiated
_once_ and the instance of the Servlet is kept in memory.

3\. Another request comes in and gets processed by the same pumped-up Servlet
that is already instantiated and has already been injected with the components
(no more injection, no more instantiation, no more Objects to create...)

So I've got to ask this question: what GC problem are we trying to solve here?

Some of the static methods are understandable, but if Component A requires
Component B and both of them have already been instantiated _once_ and have
been properly wired up together, we have 2 Objects for the lifetime of the
application.

I'd pay for a wee bit extra hardware for the cost of maintainable code ;)

Discipline (and knowledge of the object graph) definitely helps to reach that
point.
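
As a sketch of the inject-once point: wiring the object graph a single time at startup means steady-state requests allocate nothing for the wiring itself. The layer names below (Repo, Service, Handler) are hypothetical, and this is Go rather than Java, but the lifecycle is the same:

```go
package main

import "fmt"

// Hypothetical layers standing in for Repository / Service / Controller.
type Repo struct{}

func (r *Repo) Get(id int) string { return fmt.Sprintf("row %d", id) }

type Service struct{ repo *Repo }

func (s *Service) Answer(id int) string { return "answer: " + s.repo.Get(id) }

type Handler struct{ svc *Service }

func (h *Handler) Handle(id int) string { return h.svc.Answer(id) }

func main() {
	// "Deploy time": the object graph is built exactly once.
	h := &Handler{svc: &Service{repo: &Repo{}}}

	// "Request time": every request reuses the same three objects;
	// no injection, no instantiation, nothing new for the GC.
	for req := 1; req <= 3; req++ {
		fmt.Println(h.Handle(req))
	}
}
```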

~~~
bjourne
You probably meant to respond to Marco's comment? And AFAIK Stack Overflow is
written in ASP.NET, not Java.

~~~
edwinnathaniel
C# or Java, the whole request pipeline processing should be more or less the
same. Unless one platform does things less efficiently than the other.

Rails and Django do things differently as to my knowledge they do it by
spawning processes instead of threads. There are app-server for Rails or
Django that may use threads for efficiency/performance reason but I am under
the impression the whole LAMP stack is still 1 request 1 process (even though
they re-use those processes from a pool of already allocated processes).

------
rakoo
What I'd like to see is what cache hit rate they get at each level of cache.
I remember some presentation on the Facebook images stack where less than 10%
(if I remember well) of the requests actually hit the disks; it would be
interesting to see the patterns for the whole SE galaxy.

~~~
skeletonjelly
As someone mentioned above, SQL Server hits memory first, and as they have
384GB of memory, most of the requests would sit there. That's just on the DB
server. In the linked article it says all authenticated requests hit the DB, with
anon users getting a cached copy.

------
jonhmchan
Relatively new Stack dev here. I came in on the other side of the fence from a
lot of these technologies (my bread and butter is Python in Flask with Mongo
on Heroku on a Mac), but since I started here I've been constantly and
pleasantly surprised by how performant everything is, despite my biases. It's
mighty fun.

------
AlisdairO
Stack Overflow is the example I'm forever using when arguing against premature
scale-out. For non-trivial applications scale-out has substantial complexity
costs attached to it, and the overwhelming majority of applications will
_never truly need it_.

It's frustrating to see time wasted obsessing over trying to maintain eventual
consistency (or chasing bugs when you failed to do so) on systems that could
quite happily run on a single, not that beefy machine.

~~~
samstave
> __ _For non-trivial applications scale-out has substantial complexity costs
> attached to it_ __

Forgive me if I am misunderstanding you - but non-trivial applications can
actually require scale-out.

From my perspective, StackExchange is not technically that complex. They have
built a very efficient, cost-effective and performant stack for their singular
application, and that works very well for them, but their forum is not an
extraordinarily complex problem.

~~~
AlisdairO
By nontrivial I mean 'application which, in your document database, requires
multi-document updates to perform some individual logical operations'. This is
a rather low bar :-).

The fact is, the vast majority of projects that programmers are working on are
less computationally complex than stack overflow. That's not to say that forum
software is all that complex, more that most problems are pretty simple. Of
course there are real reasons to use scale out - I simply advocate thinking
hard about whether your problem will ever truly need it before taking the
substantial complexity hit of coding for it.

~~~
AlisdairO
Er, I don't seem to be able to edit, but I guess I should specify that I'm
really talking about scale-out of writes here. Read scalability is an
obviously easier problem.

------
nmjohn
> AWS would cost a fortune.

Isn't that comparing apples to oranges? Taking any application built for large
servers and putting the same application onto a cloud-based architecture will
be more expensive.

A cloud-based architecture requires ground-up differences in how the
application is built. Now, whether it is better to use a cloud-based approach
or traditional bare metal is highly subjective and isn't my point.

~~~
incision
_> 'Isn't that comparing apples to oranges?'_

On another site, in another context, it probably would be, but here it's
really presented as a contrast of scale-up versus scale-out - something the
regular audience of highscalability will certainly grok.

In context...

 _' Stack Overflow still uses a scale-up strategy. No clouds in site. With
their SQL Servers loaded with 384 GB of RAM and 2TB of SSD, AWS would cost a
fortune. The cloud would also slow them down, making it harder to optimize and
troubleshoot system issues. Plus, SO doesn’t need a horizontal scaling
strategy. Large peak loads, where scaling out makes sense, hasn’t been a
problem because they’ve been quite successful at sizing their system
correctly.'_

It's an acknowledgement of their relatively unique strategy and the short list
of caveats that make it possible.

------
fideloper
I love that all the traffic is running through a single HAProxy box - a
software load balancer (vs a hardware LB like F5).

And they've moved SSL termination to it.

That's a great quality product, and easy to set up.

Edit: I work in software that supports MySQL, PG and SQL Server. SQL Server
seems to be the most stable and consistent in performance - they're hard to
kill! One of my few liked MS products :D

------
ck2
This is the fascinating part to me, their SSD have not failed:

 _Failures have not been a problem, even with hundreds of Intel 2.5" SSDs in
production, a single one hasn't failed yet. One or more spare parts are kept
for each model, but multiple drive failure hasn't been a concern._

~~~
Nick-Craver
Yep, still true. We lost one Intel 910 drive (PCIe SSD), and that was very
abnormal - it died so soon it was almost DOA. We hooked up directly with Intel
on that one. The replacement is still going strong, as is another 910 we have.

All of those 2.5" Intels though, still trucking along! We're looking at some
P3700 PCIe NVMe drives now; blog post coming about that.

------
programminggeek
One smart thing they are doing is putting different sites on different
databases. It effectively acts as a kind of partitioning, allowing for
horizontal scaling if needed.

If nothing else, it keeps less data in each table, so queries should be faster
due to smaller datasets.

Sounds like they know what they are doing.

~~~
beachstartup
beware that you can over-optimize this.

for example, do not put every single customer of a saas solution into a
separate database.

~~~
csharrison
I'm curious, why not? If your saas was small enough and had few enough
customers with large data needs, separating them each out into a separate
database seems like a viable solution.

~~~
beachstartup
yeah, if you have an experienced team, you can easily refactor your one-
customer-per-db model into something that will handle 1,000 new customers
signing up every month. given a good enough team, you can even do this while
doing other sort-of important things like customer support, bug fixes, and new
development.

however, inexperienced developers and teams are not good enough, or fast
enough, to do that, and end up painting themselves into a corner when they
have thousands of live customers to support and need to change their entire
application architecture + database schema when their backup, replication, and
housekeeping tasks choke on 1,000+ databases.

and oftentimes, one-customer-per-db also means
one-instance-of-application-per-customer, another antipattern to avoid.

we see this kind of stuff ALL the time. it happens a lot - people make
terrible decisions and then are stuck with them 5 years down the line, looking
at a monumental cost to redo. not everyone is experienced enough to work their
way out of an awful situation like that.

------
ksec
It is fascinating to see how well their servers are holding up, with plenty of
headroom to spare!

With higher-capacity DDR4, Haswell or even Broadwell Xeons, and PCIe SSDs
getting cheaper and faster, it is not hard to imagine they could handle 3 - 4
times the load when they upgrade their servers two years down the line.

But I would love to see Joel's take on it. He is a Ruby guy now, and I don't
think you could even achieve 20% of SO's performance with RoR.

~~~
farresito
It would definitely be several orders of magnitude slower. That's the price
you pay when you are using a language like ruby: it's very productive and fast
to iterate in, but way less performant. At the end of the day, what you use
depends on your needs.

~~~
ksec
Arh... I actually meant Jeff, not Joel.

Hardware is cheap, programmers are expensive! But I am sure there is a point
where this crosses over.

~~~
judk
The point of this article is that it costs more in programmers to maintain a
cloud than to maintain a single master database.

------
tjdetwiler
Can anyone provide context on "Heavy usage of static classes and methods, for
simplicity and better performance."

Is performance _really_ an issue here?

~~~
MichaelGG
I find the opposition to static classes odd. Unless you need the data
encapsulation of an object, why force people to create an object to call
functions? I find too much C# code loves creating objects for no real purpose
other than that's how it works.

For SO, if we guess they max out around 2000 req/sec, then if a bunch of
objects are being allocated for each request, there could be added GC
pressure. From my own experience developing much higher-scale managed code,
allocating objects is a real pain and people will take a lot of effort to
avoid it.
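
One common allocation-avoidance technique (an illustration in Go, not a claim about SO's codebase) is pooling reusable buffers so the per-request hot path stops creating garbage:

```go
package main

import (
	"bytes"
	"fmt"
	"sync"
)

// Reuse buffers across requests instead of allocating one per request,
// trading a little bookkeeping for near-zero steady-state garbage.
var bufPool = sync.Pool{
	New: func() interface{} { return new(bytes.Buffer) },
}

func greet(name string) string {
	b := bufPool.Get().(*bytes.Buffer)
	b.Reset()
	b.WriteString("hello ")
	b.WriteString(name)
	out := b.String()
	bufPool.Put(b) // hand the buffer back for the next request
	return out
}

func main() {
	fmt.Println(greet("world")) // hello world
}
```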

~~~
Griever
I think the real issue (although it doesn't seem to be an issue for SO) is
testability. AFAIK, static classes cannot be mocked, and you are pretty much
boxing yourself into only doing integration tests for everything. Not
necessarily a bad way to go, just lots more setup per test.

~~~
MrBuddyCasino
Yep, that used to be one of the main reasons to have DI in Java. But now that
everything (even static methods) can be mocked, that reason is no longer
valid. Is this not possible in .NET?

------
guiomie
Can someone help me understand what they mean by: "Garbage collection driven
programming. SO goes to great lengths to reduce garbage collection costs,
skipping practices like TDD, avoiding layers of abstraction, and using static
methods. While extreme, the result is highly performing code."

~~~
louthy
I assume it means using C# more as a functional language than an OO language,
so that most garbage collection is of short-lifetime objects and therefore
cheaper to collect (generation 0).

~~~
virmundi
I take that statement to mean using C# as if it were C. Functional languages
are terrible for garbage collection. Static methods are great since they are
loaded once and probably inlineable by the compiler/JIT.

~~~
thomasz
The usual wisdom in CLR-land is that the generational garbage collector can
cope just fine with a lot of small, short-lived objects. It absolutely doesn't
like big, long-lived objects that reach Gen 1, Gen 2 or the large object heap
and must be traversed for a full GC.

This means that normally, you want to avoid data structures that introduce a
lot of objects: trees, linked lists and stuff like that. In some circumstances
you might lose some speed, but you win big time when the full GC comes.
Additionally, it may be worthwhile to consider avoiding heap allocations by
organizing your data as value types instead of ordinary objects.
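
The value-type point translates directly to Go, where the same trade-off exists: a slice of struct values is one contiguous allocation, while a slice of pointers is many small heap objects the collector has to trace individually. A minimal sketch:

```go
package main

import "fmt"

type Point struct{ X, Y int }

func main() {
	// One backing allocation; elements are plain values, so a full GC
	// has no per-element pointers to chase here.
	vals := make([]Point, 1000)
	vals[0] = Point{X: 1, Y: 2}

	// 1001 allocations: the slice plus one heap object per element,
	// each of which the GC must mark individually on every cycle.
	ptrs := make([]*Point, 1000)
	for i := range ptrs {
		ptrs[i] = &Point{X: i}
	}

	fmt.Println(vals[0].X, ptrs[999].X) // 1 999
}
```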

~~~
Nick-Craver
It depends on what your definition of "a lot" is. It can scale pretty high by
default, but when you're pushing hundreds of millions of objects through in a
short window, you can actually measure pauses in the app domain while the GC
runs. These have a pretty decent impact on requests that hit up against that
block.

The vast majority of programmers need not worry about this, but if you're
putting millions of objects through a single app domain in a short time, then
it's a concern you should be aware of. This applies to pretty much any managed
language.

~~~
thomasz
Are you sure that this is really because of Gen0 collections all by
themselves, and not because the short cadence of collections leads to a lot of
premature Gen1 promotions? I could imagine something like

    var a = InitGiantArray<Foo>();
    var b = InitGiantArray<Bar>();

could leave you with a lot of foo garbage in Gen1 or even Gen2, right were it
really hurts. On the other hand I'd be surprised if something akin to

    
    
       do {
           Tuple.Create(new object(), new object());
       } while (true);
    

but with something useful thrown in would suffer much under GC pauses. Looks
like I have to test it :)

------
ilaksh
So funny because the most important application for programmers uses the
complete opposite of almost every web development trend out there.

Also the SSDs are funny to me because I recently had someone who was supposed
to be an expert say that SSDs were a fatal mistake with a high likelihood of
massive failure.

------
epsteinbargraph
Soooo... an average of 200 pageviews a second - I suppose that doesn't sound
as cool.

Or, to put it another way: 1 server per 8 pageviews per second.

That does not seem very efficient when you put it like that. Even a resource
hog like a WordPress install can eclipse that level.

*made correction - still not impressed

~~~
jre
I think you're missing a 0 somewhere : 560000000 / (60 * 60 * 24 * 31) = 209

------
yblu
> Redis has 2 slaves, SQL has 2 replicas, tag engine has 3 nodes, elastic has
> 3 nodes - any other service has high availability as well (and exists in
> both data centers)

Does this mean the 2 Redis slaves, 2 SQL replicas, etc. are in a secondary
data center?

~~~
Nick-Craver
We have 2 SQL Clusters. In the primary data center (New York) there is usually
1 master and 1 read-only replica in each cluster. There's also 1 read-only
replica (async) in the DR data center (Oregon). If we're running in Oregon
then the primary is there and both of the New York replicas are read-only and
async.

For redis, both data centers have 2 servers, and the chain would typically
look like this:

    ny-redis01
     - ny-redis02
       - or-redis01
         - or-redis02

Not everything is slaved between data centers (very temporary cache data we
don't need to eat bandwidth by syncing, etc.), but the big items are, so we
don't come up without any shared cache in case of a hard down in the active
data center. We can start without a cache if we need to, but it isn't very
graceful.

Here's a screenshot of our redis dashboard just a moment ago (counts are low
because we upgraded to 2.8.12 last week, which needed a restart):
[http://i.stack.imgur.com/IgaBU.png](http://i.stack.imgur.com/IgaBU.png)

Nick Craver - Stack Exchange Sysadmin & Developer

~~~
yblu
Great info, Nick. Thanks for sharing.

------
doczoidberg
I would love to see a comparison of the costs of their solution against cloud
solutions such as Azure.

They say that hardware is cheaper than programmers. Wouldn't that speak for a
cloud solution (software-based scaling instead of admins)?

~~~
Nick-Craver
Hardware is cheaper than developers and efficient code. But you're only as
fast as your slowest bottleneck and all the current cloud solutions have
fundamental performance or capacity limits we run into.

Could you do it well if building for the cloud from day one? Yeah, sure, I
think so. Could you consistently render all your pages, performing several
up-to-date queries and cache fetches across that cloud network you don't
control, and get sub-50ms render times? That's another matter, and unless
you're talking about _substantially_ higher cost (at least 3-4x), the answer
is no - it's still more economical for us to host our own servers.

And yes I realize that's a fun claim to make without detail. As soon as I get
a free day (I think there's one scheduled in 2017 somewhere), I'll get with
George to do another estimate of what AWS and Azure would cost and what
bottlenecks they each have.

Nick Craver - Stack Exchange Sysadmin & Developer

------
zobzu
So doing it right works. Nice :) I like these articles as they give good
examples and help support decisions when you don't want the crappy hack but
want to do it right instead.

------
Perceptes
I'm grateful to them for being so transparent about how their systems work.
It's rare to get this level of detail on how a large, successful site
operates.

------
UVB-76
Anyone else not quite expecting StackExchange to be using a Microsoft stack?

~~~
stonemetal
Joel worked at MS on Office (Excel, if I recall correctly). Jeff's blog is
named for a feature from a book published by MS. I would be more surprised if
it were MS-free.

~~~
Alupis
You mean to tell me Joel is partially responsible for the abomination known as
Excel 2007, where I can't pull-drag the damn window from screen to screen
(even though all of the other Office 2007 products let me!)

~~~
tim333
He seems to have left Microsoft in 1995, so no, not really.

------
bikamonki
I never ever ever never use SO's search feature: I land on the result through
Google. So I guess they should give Google some credit for the claimed
performance?

~~~
cruise02
Google isn't serving Stack Exchange's pages to you. They published stats on
pages served by their own servers. Google has nothing to do with it.

~~~
bikamonki
You missed the point: I never use the site's search feature; I go to the
answer/subject directly from Google's results. I don't think Google is running
my keywords against SO's servers in 'real time' and returning the answers as
results; Google returns indexed/cached copies. I'm sure that many users behave
as I do, no? They might as well keep a static copy of all pages in RAM and
only update when new comments/answers are posted.

------
phkahler
I haven't been able to use SO since they shut off myopenid last year. I can't
log in, so can't change my credentials ;-( Any ideas HN?

~~~
Shog9
You can email us (as George said) or you can just punch your email address
into the account recovery tool and get an automated email telling you what to
click within the space of a few minutes:
[http://stackoverflow.com/users/account-
recovery](http://stackoverflow.com/users/account-recovery)

We... probably need to make that link a bit more obvious.

~~~
phkahler
Thanks, that was easy. So now I just have a password instead of linking to
another account @facebook or google+. I guess that works ;-)

------
Nux
"Stack Overflow still uses Microsoft products. Microsoft infrastructure works
and is cheap enough, so there’s no compelling reason to change." <- there's
your problem :-)

~~~
serge2k
54th ranked site in the world with a small number of servers and apparently
very maintainable and stable systems.

I would love that problem.

------
josephscott
"with just 25 servers"

This really needs to stop. Go to stackexchange.com and you'll find that more
than half of the HTTP requests are to cdn.sstatic.net.

Looking up the IP addresses for cdn.sstatic.net returned five entries for me,
all owned by CloudFlare. None of the CloudFlare servers that they are using
seem to be in that 25 count.

Sure, these are all for static assets, that isn't the point. There are way
more than 25 servers being used to serve the StackOverflow sites.

~~~
tiernano
Stack Exchange don't own their CDN. If you really start counting external
servers, you would be adding DNS, possible client-side proxy servers,
client-side network infrastructure, etc... these are things that OTHER people
run, not SO.

~~~
duderific
I would say, though, that a CDN doesn't fall into the same category as DNS.
Serving static assets off other networks does take a huge load off the SE
infrastructure and should probably be mentioned in the article.

~~~
mschuster91
Only in terms of bandwidth, though. The CPU load from processing is the
interesting figure, not the side noise from images/JS/CSS.

