The architecture of Stack Overflow [video] (dev-metal.com)
98 points by schmylan on Jan 13, 2014 | 48 comments

Some points that I find interesting:

[1] StackOverflow has VERY FEW tests. He says that StackOverflow doesn't use many unit tests because of their active community and heavy usage of static code.

[2] Most StackOverflow employees work remotely. This is very different from a lot of companies that are now trying to force employees back into an office.

[3] Heavy usage of Static classes and methods. His main argument is that this gives them better performance than a more standard OO approach.

[4] Caching even simple pages in order to avoid performance issues caused by garbage collection.

[5] They don't worry about making a "Square Wheel". If their developers can write something more lightweight than an already developed alternative, they do! This is very different from the normal mindset of "don't reinvent the wheel".

[6] Always using multiple monitors. I love this. I feel like my productivity is nearly halved when I am working on one tiny screen.

Overall, I was surprised at how few of the "norms" they follow. Either way, seems like it could be a pretty cool place to work.
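The caching point ([4]) boils down to rendering a page once and serving the stored result until it expires. A minimal, language-neutral sketch in Python (the site itself is C#; `render_page` and `CACHE_TTL` are illustrative names, not anything from the talk):

```python
import time

CACHE_TTL = 60.0  # seconds to serve a cached page before re-rendering
_cache: dict[str, tuple[float, str]] = {}

def render_page(route: str) -> str:
    # Stand-in for an expensive render (DB queries, templating, ...).
    return f"<html>{route}</html>"

def get_page(route: str) -> str:
    now = time.monotonic()
    hit = _cache.get(route)
    if hit and now - hit[0] < CACHE_TTL:
        return hit[1]          # served from cache: no rendering, no new allocations
    html = render_page(route)
    _cache[route] = (now, html)
    return html
```

Serving the cached string avoids re-allocating the page on every hit, which is the GC-pressure argument the talk makes.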

It's not that no one follows the norms or tests. On the Careers team we do much more automated testing because there's money and literally people's jobs at stake. We have unit tests, integration tests and UI tests that all run on every push. All the tests must succeed before a production build run is even possible.

[7] Millions of page views and just 25 servers for the whole infrastructure, which includes everything: load balancing, cache, DBs, etc.

Seems very, very optimized and cost effective. It's brilliant.

I've seen more and more static methods and classes over the last two years or so. It's probably more about stateless design and fewer side effects, but it definitely helps garbage collection if you avoid classes at session scope or smaller. In OOP there is another pattern that helps, object pools, but it's a lot of work to get them working correctly and they're not as efficient.
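The object-pool pattern mentioned above can be sketched in a few lines. This is a language-neutral Python sketch (the thread's context is C#/the JVM); a production pool would add thread safety, size limits, and resetting objects on release:

```python
class ObjectPool:
    """Reuse instances instead of allocating a fresh one per use."""

    def __init__(self, factory, size):
        self._factory = factory
        # Pre-allocate the pool up front, so steady-state use allocates nothing.
        self._free = [factory() for _ in range(size)]

    def acquire(self):
        # Hand out a pooled instance; fall back to a fresh one if exhausted.
        return self._free.pop() if self._free else self._factory()

    def release(self, obj):
        self._free.append(obj)
```

Usage: `pool = ObjectPool(dict, 8)`, then `buf = pool.acquire()` per request and `pool.release(buf)` when done. The "lot of work" in practice is making sure released objects are fully reset and never used after release.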

If most of your classes are small, I don't see why people have to resort to static methods.

If your methods are static, there is a temptation to use static member variables (hence stateful), which will cause side effects.

Don't forget the following points too:

1) You still have pooled objects somewhere (stateless business logic classes like XYZServices, repository classes that may be backed by pooled DB connections and Transaction Managers) provided/managed by your Application Server or by 3rd-party framework (Spring does this).

2) Your Application Server tends to have beefy hardware, good enough not to worry about GC hiccups.

There are other reasons to use static methods but I don't think they're strong enough in this case.

Small classes are actually worse GC-wise, because they fill up the GC graph with many small nodes as opposed to fewer large nodes, which are released in bulk with little fragmentation. Small and large nodes have essentially the same GC overhead. In general you want your objects to be large. When they are small, once the GC realizes what's going on (usually at some high threshold, 90% or so), it will have to run some O(n^x) graph reduction or defragmentation algorithm. Special tuning is required for such cases. Beefy hardware doesn't help in many cases due to locks. There are very few production-ready lock-less GCs.

Object lifetime is a much more important factor than class size for most server request / response style processing.

Typically there are three lifetimes for objects in server processes. Those that are allocated around startup and are never deallocated; those that are allocated per-request and become garbage once the response goes out; and lifetimes that span multiple requests, like objects in caches.

The first are normally ultra-cheap to "collect": with a generational GC, you simply don't scan them at all, because they haven't changed.

The second group, per-request, are also fairly cheap to collect. Every so often, you GC the youngest generation, and you only need to keep track of references in registers and on the stack. Ideally many requests will have occurred between collections, and the only objects that get kept alive are objects that are in-flight for the current request. And this is why you need at least three generations; you really don't want to have to scan the oldest generation to collect these ephemeral objects after they've built up over a number of youngest generation collections.

It's the third group that kills you. You can save on the cost of scanning the whole heap, using write barriers to track new roots buried in the oldest generation; but that adds accounting costs, and eventually overtakes the cost of a whole heap GC. These guys can also cause the fragmentation you're worried about - they need to be compacted down, copied possibly multiple times. On the CLR, last time I checked, you need a full gen2 GC in order to get rid of them, as they've likely survived a gen1 collection.

With these guys, it's worthwhile doing the big object thing. In fact, it may be worthwhile not having any GC heap storage for them at all, and refer to them using different techniques, like ephemeral keys that look up in Redis, or native pointers stored in statically allocated arrays.
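One way to keep a large, long-lived data set out of the object graph is a single flat buffer indexed by id, sketched here with Python's `array` module (on the CLR the analogue would be an array of structs; the names and the reputation example are made up for illustration):

```python
from array import array

N = 1_000_000
# One large, long-lived buffer instead of a million tiny heap objects:
# the collector tracks a single node no matter how many records it holds.
reputation = array("q", [0] * N)   # 64-bit ints, indexed by user id

def add_reputation(user_id: int, delta: int) -> None:
    reputation[user_id] += delta
```

The whole million-record structure is one GC node, so full-heap scans and compaction never have to walk or move a million separate objects.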

In app servers I've designed, I've never seen GC CPU usage over 5% or so, even with heavy usage of tiny short-lived objects. But you need to care about lifetime.

We're talking in the context of stateless Request <-> Response of the Web-Application nature here.

When a Request comes in, the App-Server will allocate (or take from the pool) a thread to serve that Request (in the .NET/JVM world; Ruby/Python use processes unless you use a different App Server).

If you create small objects within the scope of that Request (which usually live inside a method) and those objects are self-contained and don't hold references to any long-lived objects, they will be GC-ed quickly (potentially much quicker) once the method finishes.

The thread is GC-ed as well once it's finished (unless you release it back to the 'unused' pool).

My feeling is that their use of static methods has nothing at all to do with GC.

> StackOverflow employees work from home.

Many do, but they have a fairly large office in NYC and a smaller one in London.

The Stack Overflow Q&A dev team has 2 people in New York, out of a team of 10. The Careers dev team is more New York-heavy: 3 remote and 5 in New York. The sysadmin team is also quite remote, though I don't know the breakdown offhand.

I believe at this point most new technical hires are remote.

Our offices are mostly sales, Denver and London exclusively so.

I saw that Jason went remote recently. Any particular reason so many devs are going remote? Is it people making individual decisions or the company providing new incentives to do so? My impression when you were at 55 was that most devs worked at the office (I've been at Fog Creek since a little before you guys moved. Hi!).

People making individual decisions. All else equal, we'd slightly prefer to have people in NYC, because we think the in-person time is a plus for the casual interaction that happens in between "getting things done". But we've set ourselves up to make real work and official team collaboration work almost entirely online. We've learned that the in-person benefit is more than outweighed by how much you get from being able to hire the best talent that loves the product anywhere, not just the ones willing to live in the city you happen to be in.

The most common reason for someone going remote (that I'm aware of) is starting a family. New York's great, but spacious it is not.

I can think of 3 devs who have gone remote, and 2 devs (including myself) who have moved to NYC since I've been here. Most people stay wherever they were hired. The only location-specific policy I'm aware of is a cost-of-living adjustment in NYC (though that may also apply to London/SF/etc., I don't honestly know).

You are correct. I edited my post and actually found a good blog post on the subject.


Here are the slides for anyone who's interested: https://speakerdeck.com/sklivvz/the-architecture-of-stackove...

The most important thing, technically, is having great developers who ship.

For pith's sake, I want to say "Everything else is noise," but that isn't true. Everything else can help or hurt, depending on the application, how doctrinaire the application of a given approach/methodology is, the organizational knock-on effects (e.g. "Mr Tough Guy Testalot" holds up the release train or nukes your architecture to make it 'testable'), etc. But seriously, "great developers who ship" is really what moves the needle.

Having a great Ops staff also helps ;) Of note: Thomas Limoncelli, who wrote "The Practice of System and Network Administration" [1] and "Time Management for System Administrators" [2], works for Stack Exchange (formerly at Google). The Practice of System and Network Administration is basically the bible for most sysadmins, myself included.

ps. I only singled out Thomas Limoncelli as an example to highlight the caliber of their Ops staff.

[1] http://www.amazon.com/Practice-System-Network-Administration...

[2] http://www.amazon.com/Management-System-Administrators-Thoma...

Violently agree.

Vehemently? Or do you want to punch someone?


It's funnier.

Why not combine both?

Every feature/user story has to go through a workflow of selected for development -> UX design (if required) -> Development -> Unit Tests (or the other way around) -> Staging -> Load Test -> Acceptance Test -> Production -> Analytics (to see if people actually use it) -> Learn from analytics -> back to start if required.

The goal is to get as many issues through the workflow as quickly and rigorously (no shortcuts) as possible at a sustainable pace. Have a continuous flow of features rolling out through this process, ideally with continuous delivery to automate the majority of it.

Everything else is noise. If you have great developers who ship, then by definition you don't have doctrinaire methodology or "Mr Tough Guy Testalot" (I generally find "Mr No Test" to be a much bigger problem anyway). You might get the situation where you have great devs but bad management, but that's next to impossible in the real world.

There's really only two steps to great software development.

1. Hire good developers.

2. Don't hire bad developers.

He mentioned that they use the servicestack.text library. I've looked into servicestack recently (using the nuget packages), but then found the library to be pay-to-play. There's an older version (v3) that is BSD licensed that is being maintained. Do any of you have experience with it? I have grown tired of Microsoft pushing new solutions to the same problem (REST service with WCF and then Asp.net web api).

We used it at the time I gave that talk, we don't anymore. We only used JSON serialization and we have rolled out our own free solution, Jil.


Technically we use Newtonsoft and Jil, Jil replacing Newtonsoft as we become increasingly confident in it.

I wouldn't suggest anyone use Jil in a production role unless you're at Stack Overflow. It's too untested at the moment, and the typical person can't get me on the horn to fix whatever just broke.

Why would I use Jil over Newtonsoft?

You wouldn't right now (Kevin doesn't recommend it). But eventually you'll want to use it if JSON serialization is a performance bottleneck for you.

ServiceStack is just plain awesome when it comes to developing web services, though it's gone commercial from v4 onward. Nancy is another popular alternative - it's basically Sinatra for .Net. Every time I go back to WCF I want to stab myself in the face.

I would love to know more about the Databases:

- Are they used for different things on the sites?

- Is data partitioned across tables?

- Are they all SQL Server instances?

I would like to know more about this as well.

It sounds like they are all SQL Server instances. However, he made it seem like they are reproducing the schema once per site? I.e., a separate database per site rather than sharding the shared data to multiple hosts per site. Did I hear this right in the question/answer portion?

Stack Exchange has one database per site, so Stack Overflow gets one, Super User gets one, Server Fault gets one, and so on. The schema for these is the same.

There are a few wrinkles. There is one "network wide" database which has things like login credentials and aggregated data (mostly exposed through stackexchange.com user profiles, or APIs). Careers, stackexchange.com, and Area 51 each have their own unique database schema.

All databases are MS SQL Server.

How do you manage schema changes with release deployments across all of the databases that are meant to be standard?

All the schema changes are applied to all site databases at the same time. They need to be backwards compatible, so, for example, if you need to rename a column - a worst case scenario - it's a multi-step process: add a new column, add code which works with both columns, back-fill the new column, change the code so it works with the new column only, remove the old column.
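Those steps can be sketched concretely. Here is the backwards-compatible rename run against SQLite via Python's stdlib (Stack Exchange is on SQL Server, but the sequence is the same; the table and column names are made up):

```python
import sqlite3

con = sqlite3.connect(":memory:")
# Original schema: OwnerId is the column we want to rename to AuthorId.
con.execute("CREATE TABLE Posts (Id INTEGER PRIMARY KEY, OwnerId INTEGER)")
con.execute("INSERT INTO Posts VALUES (1, 42)")

# Step 1: add the new column. Old code keeps working untouched.
con.execute("ALTER TABLE Posts ADD COLUMN AuthorId INTEGER")
# Step 2: deploy code that writes both columns and can read either.
# Step 3: back-fill the new column from the old one.
con.execute("UPDATE Posts SET AuthorId = OwnerId WHERE AuthorId IS NULL")
# Step 4: deploy code that reads and writes only AuthorId.
# Step 5: finally drop the old OwnerId column.
row = con.execute("SELECT AuthorId FROM Posts WHERE Id = 1").fetchone()
```

Every intermediate state is valid for both the old and the new code, which is what lets the change roll out to all site databases at once without a coordinated downtime window.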

Thanks for the reply. We have a similar architecture where I work so this is interesting to me. A couple more questions if you don't mind:

- Do you use any tools for orchestrating the rollout of those schema changes or do you just have some homegrown scripts?

- Do you separate your schema versioning and deployment process from your application versioning and deployment process?

- How do you handle cases where backwards-compatibility is not possible? For example, a new application feature that depends on a brand new table.

Before the title was moderated there was an important tidbit. StackOverflow doesn't unit-test. Fascinating.

tl;dw: He says he doesn't advocate it, but they get away with it by having the community test things out for them on their meta site. Then the community writes up the bugs.

He actually says "I'm not advocating that you shouldn't put in tests. [The reason we can get away with this] is that we have a great community."

I take this to mean that he feels that StackOverflow doesn't need tests. Not that tests are useless.

User community as testers presents some interesting pros and cons.

Pros:
* Tests are self-updating. Add a new feature: tests come in for free. Change a feature: tests automatically update. Fail to document a change: tests fail.

* Tests are unusually thorough

* Eventually consistent testing. If nobody ever complains, it probably wasn't a bug worth fixing.

Cons:
* Tests cannot be run offline. Feature must be committed and deployed before tests can be run.

* Potentially large quantity of false positives (bad bug reports)

* Potentially large quantity of false negatives (nobody notices particular bug, release considered good)

* Does not work for non-user-visible features

So basically you trade the reliability of your tests for a substantial build/release speedup. Some users experience each bug, but they are the users who are actively using the meta-community and have signed up to experience more bugs. Still, lack of pre-release unit testing must radically increase the importance of VERY careful code reviews.

Not the decision I would have made, but it definitely has the sorts of advantages that a small team of engineers would drool over.

Remember that our community writes bug reports but also vets bug reports. We rarely have to deal with bad reports. Interestingly, large quantities of false negatives are a non-issue.

Presumably the same reason they don't have a ton of bad questions on Stack Overflow: their community scoring applies just as much to bug reports.

That's an accurate read.

- Stack Exchange employee

Dear any Stack Overflow Developers,

Can you describe the network infrastructure in finer detail? Specifically what type of load balancer are you running?

And what's peak RPS? Where are your network peaks? (I'm guessing major peak US Pacific and minor US Atlantic?)

IIRC at first they had an entire Microsoft stack (I may be mistaken on that).

But nowadays, from what I've read here on HN by SE devs in other threads, they're using lots and lots and lots of Linux: HAProxy, Redis, Nagios, etc.

I just double-checked the slide and although I didn't notice it at first, you can see that 'HA Proxy' and 'Redis' are mentioned.

The core Q&A is in C#/MS-SQL so that's probably not going to move to Linux anytime soon.

This might be a stackoverflow question, so what is a static code?

Is there an open-source, self-hosted version of Stack Overflow that you can deploy on your own domain?

To be clear: there is no version of the actual Stack Overflow code that is publicly available. There are, however, numerous open-source reimplementations of portions of the site code.

Also (as the video perhaps mentioned), the Stack Overflow developers have often been able to spin off pieces of the code as open-source libraries. See http://blog.stackoverflow.com/2012/02/stack-exchange-open-so...

I wouldn't trust Joel Spolsky's code expertise -- just look at Excel internals! Nevertheless, Stack Overflow is super cool. But that tells nothing about its architectural quality.

