Disqus: Scaling the World’s Largest Django Application (ontwik.com)
101 points by ahmicro 2552 days ago | 31 comments

Disqus is a classic case study when it comes to scalability.

If there is one thing I learned from Disqus, it is the power of keeping a lightweight stack. Disqus keep it simple, and prove that all the myths that "Django/SQL/whatever doesn't scale" are nonsense.

Even for an app with requests per second in the 5 digit range - they do pretty damn good with the basic Django stack with no more than some small tweaks.

That was exactly my thought.

No NoSQL. And they use transactions for write queries.

They use apache (not even nginx!).

Only 25% of their servers are pure (no snapshot) caching servers (not 50% or 75%).

They prefer vertical partitioning over sharding (but they still use sharding).

I understand they are looking at redis for some of their features, but really, their main stack is traditional and proven. Very enlightening.
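For anyone who hasn't met the distinction between vertical partitioning and sharding, here is a minimal sketch. It is not Disqus's actual routing: the database aliases, table names, and the choice of CRC32 as the hash are all assumptions made up for illustration.

```python
import zlib

# Hypothetical database aliases; none of these names come from the talk.
VERTICAL_MAP = {
    "comments": "db_comments",
    "users": "db_users",
}
SHARDS = ["db_comments_0", "db_comments_1", "db_comments_2", "db_comments_3"]


def vertical_db(table: str) -> str:
    """Vertical partitioning: a whole table (or group of tables) lives on its own database."""
    return VERTICAL_MAP.get(table, "default")


def shard_db(key: int) -> str:
    """Sharding: rows of the same table are spread across databases by key."""
    # crc32 gives the same answer in every process, unlike Python's
    # built-in hash() on strings, which is randomised per interpreter run.
    digest = zlib.crc32(str(key).encode("utf-8"))
    return SHARDS[digest % len(SHARDS)]
```

Vertical partitioning keeps every query for a table on one machine, so joins within that table group still work; sharding is what is left when a single table outgrows one machine.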

I actually find it astonishing that they're using Apache for this. I have had a lot of problems with Apache behaving weirdly in the past, especially with mod_wsgi.

You are aware that Apache + mod_wsgi is the recommended way of deploying Django apps, right?

Yes, I'm aware of that. I've been using mod_wsgi since modpython was the recommended way for deploying with Django. The problem is that when you're running a high number of Django instances (via a large number of daemons) you can get all sorts of problems with Apache itself. Some of these issues have been due to mod_wsgi, and Graham has, in my past experiences, been very responsive about fixes. Other issues are simply due to Apache, and when you're running multiple instances in a memory-restricted environment you're left with a tough configuration job. I've even had a case where thread contention in a low-memory setup caused a complete Apache lock-up.

So, as you can see, for the normal case the recommended setup is fine, but for extreme cases you should use the combination with caution.

YMMV, of course, but if you're pushing your setup to the limit and don't have any options for extra servers, I would heartily recommend uwsgi+nginx or fcgi+lighttpd instead.

Not like this is a problem I have to worry about. But where on earth does one learn this stuff?

The talk is useful - as an overview of what they use - but I know nothing of how to implement a single step.

It's called experience.

Which perhaps sounds rude, but it's not meant to be.

This stuff isn't taught per se, you learn it bit by bit as you solve each problem that you face.

I learned about HAProxy when my site load exceeded that which a single web server could manage.

I learned about heartbeat when I had to update my HAProxy and it knocked the site offline.

I learned about master/slave replication of databases when a site I worked on had considerably more reads than writes and scaling vertically (buying a bigger box) cost more than scaling horizontally (adding cheap read slaves).

I learned of sharding when I worked on a graph stored in an Oracle database and performing calculations on the whole graph exceeded that one physical box.

I learned of one-hop replication to solve the problem of a sharded graph.

I learned of partitioning to solve the problem of having one big database where the compute wasn't maxed out but the storage was.

I learned of memcached when I wanted to reduce page generation times and realised going to the database was more expensive than keeping it in cheap RAM elsewhere on the network: http://www.buro9.com/blog/2010/11/18/numbers-every-developer...

I learned of reverse proxy caches when I wanted to make sure requests for things already served never reached the web layer again.

I learned about Varnish when I considered that most reverse proxies use disk storage for their cache.
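The memcached item in the list above boils down to the cache-aside pattern, roughly sketched here. `TTLCache` is a hypothetical in-process stand-in for a real memcached client, and `cached_fetch` and the `loader` callback are names invented for this example.

```python
import time
from typing import Any, Callable, Dict, Tuple


class TTLCache:
    """In-process stand-in for a memcached client (hypothetical, for illustration)."""

    def __init__(self) -> None:
        self._store: Dict[str, Tuple[float, Any]] = {}

    def get(self, key: str) -> Any:
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if time.monotonic() >= expires_at:
            # Entry has expired: drop it and report a miss.
            del self._store[key]
            return None
        return value

    def set(self, key: str, value: Any, ttl: float) -> None:
        self._store[key] = (time.monotonic() + ttl, value)


def cached_fetch(cache: TTLCache, key: str, loader: Callable[[], Any], ttl: float = 60.0) -> Any:
    """Cache-aside: try the cache first, fall back to the loader on a miss."""
    value = cache.get(key)
    if value is None:
        # Miss: do the expensive work once, then keep the result in RAM.
        # Note that a loader returning None is never cached in this sketch.
        value = loader()
        cache.set(key, value, ttl)
    return value
```

The hard part in practice is not this lookup but deciding when a cached value has gone stale, which is exactly the kind of thing that depends on your own application.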

We can go on and on here, but the message is that you learn these things one at a time, solving real problems that you come up against. There is always the next hurdle to jump over, and when you get there you too will learn how to get past it.

I'd emphasise that you cannot attempt to do this prematurely; that premature optimisation quote really applies well to architecture too. Keep things as simple as they can be, and just know that when you get to a hurdle, someone else has already solved it and you've just got to find out where it's written down (if anywhere), what they used, how they approached it, the upsides, downsides, what they'd do differently, etc.

You could try sites like http://highscalability.com/ but I would urge you not to implement things without knowing why you're implementing them. Don't cargo cult ( http://en.wikipedia.org/wiki/Cargo_cult ) this stuff, it's really key to do only what you need to do, when you need to do it.

seriously dude... write that book...

Cause i can't find it anywhere...

This is one of those books that the market is basically incapable of publishing, because it's like "Hey, why don't you do 6 months of hard work and then we'll pay you a $10,000 ~ $20,000 advance which you will never earn out" when Plan B looks like "Save some large company whose system being down is costing $X00,000 a day, make client look like hero, get compensated accordingly."

And as I was trying to emphasise in my post, there's not a "right way" to do things, some problems just make some approaches more right than others in those instances.

Just as in Bentley's Programming Pearls, where he underlines again and again that knowing your problem is more important than knowing the best algorithms: the one that will work for you depends on your problem, and only you know that.

Highscalability.com and shared slidedecks act as a community generated set of architectural patterns, but no-one should implement them without knowing what their problems will be.

What I don't understand - and perhaps will be incapable of understanding until I face these issues myself (and hopefully I will in the future) - is why there is this view that scaling is in a sense un-documentable...

What is it about scaling an application that can't be reduced to an algorithm - at some level of abstraction at least - so that newbies can get an idea of how they should start thinking about it?

To put the question another way - could a framework like Django ever come to provide scaling tools out of the box? Or is it just something that fundamentally can't be reduced? Might it be that not enough people have faced scaling problems yet for repeatable patterns to become obvious?

It's undocumentable because as strange as it seems, no two problems are the same. There is no one-size-fits-all approach, not even a one-size-fits-many. At best we have a series of one-size-may-fit-you-if-you're-lucky options. Disqus isn't using NoSQL or Nginx, something a good number of scaled web applications have switched to. Why? It doesn't solve their problem. Why is their problem different from others? That's a long and complex answer that revolves around almost every aspect of how their applications run, access data, what types of data they're accessing, and so on, and so forth.

Is the problem their algorithm? Does it spend a large amount of CPU time working away? Could it be written a different way? Is it data access; is the lag caused by queries taking too long? Is that down to badly formed queries, an inefficient schema, server problems or something else? Is it even a single problem, or a multitude of minor niggles that combine into a big headache? Do you do thousands of little queries, or smaller numbers of big ones? Are your tables narrow or wide? What size and types of data do you store in the fields?

A number of the things that Disqus have done to scale out aren't appropriate for other environments, by nature of the fundamentals of the app. All they can advise on is how to scale your python/Django/MySQL based commenting system, but even then your approach to writing one might be different to theirs.

Quite simply, no one can tell you how to scale your application and its infrastructure, because every application and infrastructure is unique by the very nature of every problem being unique, and every solution more so.

That's not to say there is no value in the information that Disqus has provided. Quite the contrary, there is every bit of value there, and I greatly appreciate them posting it. There is a good chance that whilst some of what they've done won't be of use to you, some of it may be. It may even be of use for other reasons that are entirely different from those that benefit Disqus.

Quick example. You want to load balance web traffic; what do you choose for it? Is software or hardware best? Do you do lots of SSL (would an SSL accelerator be of use)? Do you want the servers to respond directly to the client, or respond through the load balancers?

Apache httpd + mod_proxy, Nginx, HAProxy, ldirectord, lighttpd

You want to add in caching: Varnish, Apache Traffic Server, Squid, Polipo

and so on! Each software package has its own particular strengths and weaknesses, and it's more a matter of gut instinct and intimate knowledge of the way your code and site work than anything else that can help you find the right way to scale.

Thanks for that in depth reply... I get ya in the abstract... but yeah I guess I have to go through it myself to really see it.

Can anyone speak to how close these books come to this? Or recommend other books?

Building Scalable Web Sites http://oreilly.com/catalog/9780596102357

The Art of Capacity Planning http://oreilly.com/catalog/9780596518585

Web Operations http://oreilly.com/catalog/0636920000136

John Allspaw's books are great. I'd recommend these also:

The Art of Scalability http://www.amazon.com/Art-Scalability-Architecture-Organizat...

Scalable Internet Architectures http://www.amazon.com/Scalable-Internet-Architectures-Theo-S...

Enterprise Cloud Computing http://www.amazon.com/Enterprise-Cloud-Computing-Architectur...

Both of John Allspaw's books (the latter two on your list) look good from their table of contents.

And if you're in doubt, John is now VP of Ops at Etsy and came from Flickr before that: http://www.kitchensoap.com/about-me/

His blog is interesting too: http://www.kitchensoap.com/

So without having read the books, I would shoot for the latter 2 if I wanted to have hard copies around to introduce me to this kind of stuff.

I've found "Scalable Internet Architectures" by Theo Schlossnagle to also be quite valuable. It contains general advice on how to approach problem solving when it comes to building, uh, scalable architectures.


I'm a big fan of Building Scalable Web Sites - it's a bit old now (it predates cloud-computing-for-everything) but still very relevant if you're just starting to learn about this stuff. It's basically everything the author learned scaling Flickr from a tiny site to several hundred million photos.

This stuff isn't taught per se, you learn it bit by bit as you solve each problem that you face.

Some of it is taught. I learned about HAProxy second-hand, before any problem had to be solved.

I do agree that much of this has to be learned initially through experience, but I also believe much of it can be taught. Teaching it in a classroom wouldn't work, though, as Ops only exists when there's something to operate.

It's my perpetual dilemma: few startups have enough [1] scale to warrant having a senior enough sysadmin to be a mentor, let alone another one or two more junior ones, but large companies are awkward environments for experimentation.

I try and dump some bits of wisdom/experience to my blog. Shameless plug: http://blog.maxkalashnikov.com

[1] Even Disqus is borderline in terms of traffic. Moore's law and even its more linear corollary for I/O can take one a long way.

I find in practice that even more important than figuring out which tools to use to solve your problem is figuring out what your problem is in the first place. The importance of measuring tools is often vastly under-represented in this discussion. Doing so properly almost always involves writing some custom profiling code, and having a good understanding of where resource bottlenecks are likely to be in your systems (and then ignoring that when the numbers tell you otherwise).
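As a rough idea of what "some custom profiling code" can look like at its very simplest, here is a sketch of a timing context manager that collects wall-clock durations per label. The names `timed`, `timings` and `summary` are invented for this example, and a real setup would ship the numbers to a metrics system rather than hold them in a module-level dict.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

# label -> list of observed durations in seconds
timings = defaultdict(list)


@contextmanager
def timed(label):
    """Record how long the wrapped block takes, even if it raises."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[label].append(time.perf_counter() - start)


def summary():
    """Return {label: (call count, total seconds, slowest call in seconds)}."""
    return {label: (len(s), sum(s), max(s)) for label, s in timings.items()}
```

Usage is along the lines of `with timed("render_thread"): ...` around a suspect code path, followed by a look at `summary()` to see whether the numbers agree with your intuition.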

I have some experience with optimizing, scaling, and administering less-than-web-scale systems. Much of the knowledge comes from reading posts like this or presentations (thank God for Slideshare), or attending conferences where they give talks with titles like How We Scaled LinkedIn (that was JavaOne if I remember correctly, and that talk was single-handedly worth the price of sending me to America for my company).

You can also learn from observation when working with people who are better than you. This is one of the best reasons to have some industry experience prior to raring off into startup land. I am a much, much, much better engineer in 2011 than I was in 2007 partly because I am older and wiser but mostly because I sat next to the second best engineer I've ever met and took notes when he told me that everything I knew was wrong.

Then there is the last option: your lack of expertise with X bites you in the keister, you fix X, you now have expertise with X and hopefully write up your experience somewhere to decrease the net amount of keister-biting in the world.

I think most of it comes from experience, sticking to fundamentals and learning from the mistakes of others but don't listen to me. I'm a noob to all this stuff myself.

Thank you so much for submitting that. I'm creating a Django application that has the potential to store and work with even more data than Disqus, so I've always been worrying about how to scale this to such a huge scale. Thanks to your submission, I'm no longer as crazed about it.

As an aside, AFAIK douban.com is still using Python and Quixote[1]. Back in 2007 they were doing 2 million pageviews per day[2,3]. According to Alexa they are busier yet now. They use the SCGI protocol as well.

1. http://quixote.ca/

2. http://mail.mems-exchange.org/durusmail/quixote-users/5441/

3. http://mail.mems-exchange.org/durusmail/quixote-users/5657/

Good to hear Quixote is still going. It was my favourite framework in the pre-WSGI Django/Pylons days.

How does Disqus make money? Has IntenseDebate made money since being acquired? Does revealing this information make Disqus a more likely acquisition?

Disqus has premium add-ons: http://disqus.com/addons

I know Disqus does loads of traffic, but sheesh, ~100 servers - that's exceptionally non-trivial.

On the upside, there's long phases of "just add more of the same" in scaling these things. The fun comes in waves, every time when you hit one of the various physical barriers (namely latency and bandwidth).

That is to say the host-count in isolation is not the most interesting figure. I've seen large sites run on 20 machines or on 500, depending on the skills of the management- and developer-team, and how much they care about the infrastructure cost in the big picture.

The host-count becomes more interesting when you relate it to the request rate. 17k/sec is absolutely a worthwhile workload, even when (as likely in the disqus case) reads dominate writes by far.

That said the relation of 100 hosts / 17k rps seems about reasonable.
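The back-of-the-envelope arithmetic behind that ratio, using only the two figures quoted in this thread (about 100 hosts and 17k requests per second). It is an average across all hosts; since not every host is a web server (caching and database machines are in that count too), the load on each web host would be higher than this.

```python
hosts = 100                   # approximate host count mentioned in the thread
requests_per_second = 17_000  # request rate mentioned in the thread

# Average request rate if the load were spread evenly over every host.
per_host = requests_per_second / hosts
print(per_host)
```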

However (not meaning to narrow their achievement) the engineer in me can't help but wonder if perhaps even a little more could be squeezed out on the caching front? I was a bit surprised to not see varnish on the slides; fragment caching on the perimeter can achieve mind-boggling results.

Varnish wasn't in production at that point. We're testing/using Varnish now for some things. It definitely is helping.

On the general caching front, though, I want to note that Disqus is particularly hard to cache -- there's a very long tail, leading to a relatively high miss:hit ratio per pound of caching.
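To illustrate why a long tail hurts the hit ratio, here is a toy simulation, not a model of Disqus's real traffic. It assumes requests follow a Zipf-like popularity curve (weight 1/rank^skew) and go through a small LRU cache; the function name and every parameter value are made up for the example. A lower `skew` means a flatter, longer tail.

```python
import random
from collections import OrderedDict
from itertools import accumulate


def hit_ratio(skew, n_keys=10_000, cache_size=100, n_requests=20_000, seed=0):
    """Fraction of requests served from a fixed-size LRU cache."""
    rng = random.Random(seed)
    # Popularity of the key at a given rank falls off as 1 / rank**skew.
    weights = [1.0 / (rank ** skew) for rank in range(1, n_keys + 1)]
    cum_weights = list(accumulate(weights))
    requests = rng.choices(range(n_keys), cum_weights=cum_weights, k=n_requests)

    cache = OrderedDict()
    hits = 0
    for key in requests:
        if key in cache:
            hits += 1
            cache.move_to_end(key)         # mark as most recently used
        else:
            cache[key] = True
            if len(cache) > cache_size:
                cache.popitem(last=False)  # evict the least recently used key
    return hits / n_requests
```

Comparing something like `hit_ratio(1.2)` with `hit_ratio(0.6)` shows the same cache doing far less work once requests are spread thinly over many rarely-read keys.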

Great points. We've actually deployed additional caching on the frontend, and Varnish should be appearing on future slides.
