Stack Overflow: The Architecture – 2016 Edition (nickcraver.com)
328 points by Nick-Craver on Feb 17, 2016 | 29 comments

"During peak, we have about 500,000 concurrent websocket connections open. That’s a lot of browsers. Fun fact: some of those browsers have been open for over 18 months. We’re not sure why. Someone should go check if those developers are still alive."

It's impressive that SO held the connection open for 18 months! That's some seriously good uptime for a process that manages a TCP connection.

Strange. No wonder some browser stats are stuck on old versions.

I am betting those persistent connections are bots, but I'm not sure what kind of software they'd use. PhantomJS crashes every 50 or so requests.

They have 11 IIS servers but interestingly he says this:

What do we need to run Stack Overflow? That hasn’t changed much since 2013, but due to the optimizations and new hardware mentioned above, we’re down to needing only 1 web server. We have unintentionally tested this, successfully, a few times. To be clear: I’m saying it works. I’m not saying it’s a good idea. It’s fun though, every time.

Interesting to consider how some companies doing much less work on their machines than SO need clusters of hundreds of servers, while SO can serve the 57th site by global traffic (according to Alexa) from one physical machine.

Sure, there is other stuff backing that one, but the next time you hear someone talk about big clusters and hundreds or thousands of nodes, just take a moment to appreciate how much can be done with one rack of gear these days.

Granted it's a machine with 768 GB RAM.

It's only got 64GB, if I read the data center upgrade post from 2015 correctly.

"The new web tier has dual Intel 2687W v3 processors and 64GB of DDR4 memory. We re-used the same dual Intel 320 300GB SSDs for the OS RAID 1."

To be fair, they cache the hell out of their content.

We don't really cache that much, for example: every post, comment, user, etc. on the page is pulled and rendered live. We output cache certain pages for 60 seconds for anonymous users, but it has almost no performance impact (I think it's a 4% hit rate?) Some things are cached, but each page render is very dynamic - more so than most I think? I don't have a great source of comparison for similar sites though.

Nick, first of all, great write-up. It's really nice when people take time from their schedule to write such detailed information about how their internal systems work!

I could be wrong, but from the article and the numbers you posted, it seems that you do cache 89% of your DB query results, so maybe this is what jldugger was referring to:

504,816,843 (+170,244,740) SQL Queries (from HTTP requests alone)
5,831,683,114 (+5,418,818,063) Redis hits

Ah I see the confusion now. To clarify: we use redis for a great many things, not just serving HTTP requests. For example, we use redis lists and sets to continually recalculate mobile feeds for users - that's roughly 4 billion of the hits every day.
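
For anyone who wants to check the arithmetic on the numbers quoted above, here is a rough sketch. It assumes, only for the sake of the calculation, that every Redis hit stands in for a SQL query that was avoided, which is exactly the reading the reply above pushes back on. The function name and the rounded 4 billion figure for mobile feed hits are illustrative, not from the article.

```python
# Daily figures quoted in the thread.
SQL_QUERIES = 504_816_843        # SQL queries from HTTP requests alone
REDIS_HITS = 5_831_683_114       # all Redis hits
MOBILE_FEED_HITS = 4_000_000_000 # "roughly 4 billion" per the reply; rounded assumption


def hit_share(redis_hits: int, sql_queries: int) -> float:
    """Share of lookups answered by Redis, ASSUMING each hit replaced one SQL query."""
    return redis_hits / (redis_hits + sql_queries)


# Naive reading: count every Redis hit as a cached DB result.
naive_share = hit_share(REDIS_HITS, SQL_QUERIES)

# Adjusted reading: drop the mobile feed traffic, which is not HTTP request caching.
adjusted_share = hit_share(REDIS_HITS - MOBILE_FEED_HITS, SQL_QUERIES)

print(f"naive: {naive_share:.1%}, excluding mobile feeds: {adjusted_share:.1%}")
```

Under that assumption the naive share lands in the low 90s and falls into the high 70s once the mobile feed hits are removed, so the headline ratio depends heavily on what you count as a cache hit.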

At one point, I got the impression, perhaps mistaken, that something akin to varnish was set to cache pages for 5 minutes.

I interpret this as: Page render latency probably degrades, but the site doesn't go down. Slower loads certainly have a monetary cost, making 11 machines worthwhile, but in a pinch at least they won't go fully offline if 10 die.

That is quite the pinch

Wish more sites/companies/employees published this sort of information - really interesting stuff.

Are there any other examples of big (largely) Windows stacks? Stack Exchange is the only one I've seen discussed.

Todd Hoff's High Scalability [1] is a very nice blog to follow if you like this kind of information about real-world stacks. He's been writing it for years. (Caveat: It's a while since I followed it closely.)

[1] http://highscalability.com/

Stuff the Internet Says on Scalability is an excellent weekly article he does with plenty of great resources. Lots of stuff that's just reposted from HN, but also many pieces on real-world architectures that I never see on HN or /r/programming.

Not at this level of detail, but in the event you don't know about the site, try http://stackshare.io/stacks . You can see the app stack for the included companies at the levels: Application & Data, Utilities, DevOps, Business Tools.

Most of the places that have highly scaled Windows infrastructure and highly fault tolerant architecture tend to be the kinds of places with very paranoid middle managers.

Source: I work at one.

Match.com and PoF come to mind -- Match on the very large infrastructure side, PoF more along the lines of "punching above its weight."

There are tons of big .NET stacks; you just don't hear about them in the startup world.

In the corporate world however....

> Fun fact: some of those browsers have been open for over 18 months. We’re not sure why. Someone should go check if those developers are still alive.

How is it actually possible to keep a single TCP/IP connection open for over 18 months?!

Someone probably opened up a browser on a server to look something up and then left it running. With a static IP and battery backup, only a reboot is likely to take it down. The connection should be able to survive short outages due to a firewall restart or momentary problem upstream, especially if there's no NAT in between.
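
One thing that can help an idle connection last is TCP keepalive, where the OS sends periodic probes so that middleboxes keep seeing traffic. Whether keepalive is involved in the connections Nick describes is purely an assumption here; this is just a minimal sketch of turning it on for a client socket. The helper name and the timing values are made up for illustration.

```python
import socket


def make_keepalive_socket(idle_s: int = 60, interval_s: int = 10, probes: int = 5) -> socket.socket:
    """Create a TCP socket with keepalive probes enabled (hypothetical helper)."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # Ask the OS to send keepalive probes on an otherwise idle connection.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # The tuning knobs below are platform-specific, so set them only where they exist.
    if hasattr(socket, "TCP_KEEPIDLE"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle_s)       # idle time before first probe
    if hasattr(socket, "TCP_KEEPINTVL"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval_s)  # gap between probes
    if hasattr(socket, "TCP_KEEPCNT"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, probes)        # failed probes before giving up
    return sock
```

The socket is returned unconnected, so the caller still has to call `connect()` on it. Browsers and websocket servers typically also exchange their own ping/pong frames at the application level, which does a similar job.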

I'm using SSH's ControlMaster in persistent mode on my desktop at work. The connections stay open for as long as the desktop is up, which can be months at a time.

Web crawler that investigates websockets.

So it's 125 million SQL queries or 6 million page loads per machine per day.

It also seems like the average query takes 1.2ms - am I missing something?
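
To put those two figures in perspective, here is a back-of-the-envelope sketch that takes the 125 million queries per machine per day and the 1.2ms average at face value (both are the commenter's numbers, not verified) and assumes load is spread evenly over the day, which real traffic is not.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

queries_per_day = 125_000_000   # per machine, as stated in the comment above
avg_query_seconds = 1.2 / 1000  # 1.2 ms, as stated in the comment above

# Average arrival rate, assuming perfectly even load across the day.
queries_per_second = queries_per_day / SECONDS_PER_DAY

# Little's law: average number in flight = arrival rate * average time in system.
avg_in_flight = queries_per_second * avg_query_seconds

print(f"{queries_per_second:,.0f} queries/s, about {avg_in_flight:.2f} in flight on average")
```

That works out to roughly 1,450 queries per second with fewer than two running at any instant on average, so a 1.2ms average is not implausible on its face. Peak hours would of course be well above the daily mean.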

Why the Windows stack though?

Jeff Atwood and Joel Spolsky have their roots in the Windows community, so they used what they were familiar with.

Because it clearly works for them?

Why not? It appears to be working well.
