"During peak, we have about 500,000 concurrent websocket connections open. That’s a lot of browsers. Fun fact: some of those browsers have been open for over 18 months. We’re not sure why. Someone should go check if those developers are still alive."
It's impressive that SO held the connection open for 18 months! That's some seriously good uptime for a process that manages a TCP connection.
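For the curious, the client side of a connection like that is roughly this shape; a minimal Python sketch using the websockets library with a made-up URL (SO's real client is JavaScript in the browser, this just illustrates the keep-it-open-forever pattern):

    # Hypothetical long-lived websocket client; URL and message handling are invented.
    import asyncio
    import websockets  # pip install websockets

    async def listen(url):
        # Periodic pings keep the connection from idling out; the loop
        # reconnects if the server or network ever drops it.
        while True:
            try:
                async with websockets.connect(url, ping_interval=20, ping_timeout=20) as ws:
                    async for message in ws:
                        print("update:", message)
            except (websockets.ConnectionClosed, OSError):
                await asyncio.sleep(5)  # back off, then reconnect

    asyncio.run(listen("wss://example.com/updates"))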
They have 11 IIS servers, but interestingly he says this:
What do we need to run Stack Overflow? That hasn’t changed much since 2013, but due to the optimizations and new hardware mentioned above, we’re down to needing only 1 web server. We have unintentionally tested this, successfully, a few times. To be clear: I’m saying it works. I’m not saying it’s a good idea. It’s fun though, every time.
Interesting to consider how some companies doing much less work on their machines than SO need clusters of hundreds of servers, while SO can serve the 57th site by global traffic (according to Alexa) from one physical machine.
Sure, there is other stuff backing that one machine, but the next time you hear someone talk about big clusters and hundreds or thousands of nodes, just take a moment to appreciate how much can be done with one rack of gear these days.
We don't really cache that much, for example: every post, comment, user, etc. on the page is pulled and rendered live. We output cache certain pages for 60 seconds for anonymous users, but it has almost no performance impact (I think it's a 4% hit rate?)
Some things are cached, but each page render is very dynamic - more so than most I think? I don't have a great source of comparison for similar sites though.
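That 60-second anonymous output cache is roughly this pattern; a minimal sketch, not SO's actual ASP.NET implementation (render_page and the anonymous check are placeholders):

    # Sketch of "output cache certain pages for 60 seconds for anonymous users".
    # render_page() is a placeholder stand-in for the real (live) render.
    import time

    TTL = 60       # seconds
    _cache = {}    # path -> (cached_at, html)

    def render_page(path):
        return "<html>rendered %s at %s</html>" % (path, time.time())

    def get_page(path, is_anonymous):
        if not is_anonymous:
            return render_page(path)          # logged-in users always get a live render
        hit = _cache.get(path)
        if hit and time.time() - hit[0] < TTL:
            return hit[1]                     # the ~4% of requests that hit the cache
        html = render_page(path)
        _cache[path] = (time.time(), html)
        return html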
Nick, first of all great write-up, it's really nice when people take time from their schedule to write such detailed information about how their internal systems work!
I could be wrong, but from the article and the numbers you posted it seems that you do cache about 89% of your DB query results (rough math below), so maybe this is what jldugger referred to:
504,816,843 (+170,244,740) SQL Queries (from HTTP requests alone)
5,831,683,114 (+5,418,818,063) Redis hits
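Taking just those two counters at face value (I assume the 89% came from folding in other numbers from the full post, so treat this as back-of-the-envelope):

    # Back-of-the-envelope "served from Redis vs. hit SQL" ratio
    sql_queries = 504_816_843
    redis_hits  = 5_831_683_114
    ratio = redis_hits / (redis_hits + sql_queries)
    print("%.1f%%" % (ratio * 100))   # ~92.0% of these reads never touched SQL Server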
Ah I see the confusion now. To clarify: we use redis for a great many things, not just serving HTTP requests. For example, we use redis lists and sets to continually recalculate mobile feeds for users - that's roughly 4 billion of the hits every day.
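That kind of feed recalculation with lists and sets looks roughly like the sketch below; key names and logic are invented for illustration, not SO's actual code:

    # Sketch of a per-user feed maintained with Redis lists and sets.
    # Every call here counts as one of those billions of daily "Redis hits",
    # even though nothing about it is request caching.
    import redis

    r = redis.Redis(host="localhost", port=6379)

    def push_to_feed(user_id, post_id, max_len=100):
        if r.sadd("feed-seen:%d" % user_id, post_id):   # set: skip posts already in the feed
            r.lpush("feed:%d" % user_id, post_id)       # list: newest items at the head
            r.ltrim("feed:%d" % user_id, 0, max_len - 1)

    def read_feed(user_id, count=30):
        return r.lrange("feed:%d" % user_id, 0, count - 1)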
I interpret this as: Page render latency probably degrades, but the site doesn't go down. Slower loads certainly have a monetary cost, making 11 machines worthwhile, but in a pinch at least they won't go fully offline if 10 die.
Todd Hoff's High Scalability [1] is a very nice blog to follow if you like this kind of information about real-world stacks. He's been writing it for years. (Caveat: it's been a while since I followed it closely.)
Stuff the Internet Says on Scalability is an excellent weekly article he does with plenty of great resources. Lots of stuff that's just reposted from HN, but also many pieces on real-world architectures that I never see on HN or /r/programming.
Not at this level of detail, but in case you don't know about the site, try http://stackshare.io/stacks. You can see the app stack for the included companies broken down into Application & Data, Utilities, DevOps, and Business Tools.
Most of the places that have highly scaled Windows infrastructure and highly fault tolerant architecture tend to be the kinds of places with very paranoid middle managers.
Someone probably opened up a browser on a server to look something up and then left it running. With a static IP and battery backup, only a reboot is likely to take it down. The connection should be able to survive short outages due to a firewall restart or momentary problem upstream, especially if there's no NAT in between.
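For what it's worth, the thing that keeps an idle connection like that from being silently dropped by middleboxes (and lets each end notice if the other disappears) is TCP keepalive; a minimal sketch of turning it on in Python (the TCP_KEEP* options are Linux-specific and the intervals are arbitrary):

    # Enabling TCP keepalives so an idle connection survives long quiet periods.
    import socket

    s = socket.create_connection(("example.com", 443))
    s.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, 60)    # idle time before probes
    s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, 10)   # seconds between probes
    s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, 5)      # probes before giving up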
I'm using SSH's ControlMaster in persistent mode on my desktop at work. The connections stay open for as long as the desktop is up, which can be months at a time.
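For anyone who wants the same thing, the relevant ~/.ssh/config bits are roughly this (the Host pattern and socket path are just examples):

    # ~/.ssh/config
    Host *
        ControlMaster auto
        ControlPath ~/.ssh/cm-%r@%h:%p
        ControlPersist yes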