
Disqus: Scaling the World’s Largest Django Application (2010) [video] - pajju
http://ontwik.com/python/disqus-scaling-the-world%E2%80%99s-largest-django-application/
======
zeeg
As these slides are very old, here's some updates:

* We use Flask and nginx in various areas now (the main app is still Django). Our realtime app, for example, is powered off of uwsgi and Flask.

* There are nearly 1b monthly uniques across the network serviced by the platform.

* ~300 servers

* Still Postgres (with Slony, multiple clusters), Redis, Memcache, and some Cassandra for newer things (not comments).

Also mostly confident we're still the "largest Django app" in terms of
traffic.

~~~
reinhardt
Virtual servers or bare metal? What do you use for herding them?

~~~
lucian303
I second that. Would be interesting to know a bit more about the systems
setup.

Also why the decision to use an ORM? To manage the partitions?

~~~
zeeg
Eventually you need abstraction. Python is fast enough, as is Django, that the
abstraction costs us less than the value it provides us.

That said, we do manage our partitions (currently) using an ORM layer (at the
application level). We do want to break this out into a proxy middleware at
some point though.
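
For readers unfamiliar with the idea, here is a minimal sketch of application-level partition routing. It is not Disqus's actual code: the class name `PartitionRouter`, the `partition_key` hint, and the alias names are all hypothetical. The `db_for_read`/`db_for_write` method names mirror Django's database-router hooks, but the sketch runs without Django installed.

```python
import zlib


class PartitionRouter:
    """Picks a database alias for a row based on a partition key (hypothetical)."""

    def __init__(self, aliases):
        self.aliases = list(aliases)

    def alias_for(self, partition_key):
        # crc32 is stable across processes, unlike Python's built-in hash().
        index = zlib.crc32(str(partition_key).encode()) % len(self.aliases)
        return self.aliases[index]

    def db_for_read(self, model, **hints):
        # "partition_key" is an assumed hint name, not a Django convention.
        key = hints.get("partition_key")
        # Returning None means "no opinion", so a default database would be used.
        return self.alias_for(key) if key is not None else None

    # Reads and writes for the same key go to the same partition.
    db_for_write = db_for_read


router = PartitionRouter(["shard0", "shard1", "shard2"])
chosen = router.db_for_read(None, partition_key=42)
```

A proxy middleware, as mentioned above, would move this key-to-partition decision out of the application and into a separate network layer.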

~~~
lucian303
Right, just interested in the choice of such an anti-pattern.

What proxies are you considering?

------
lewispb
Largest? Is Disqus really 'larger' than Instagram?

~~~
simonw
I think it was in 2010 when they gave this talk.

~~~
saevarom
Yeah these figures are really outdated. This one from PyCon 2011 is newer:
[http://www.slideshare.net/zeeg/pycon-2011-scaling-disqus-725...](http://www.slideshare.net/zeeg/pycon-2011-scaling-disqus-7251315)

~~~
tangue
Interesting: according to these slides [14], there is no difference between
Apache + mod_wsgi and Nginx + uWSGI.

------
zerop
All with apache and haproxy? No nginx or uwsgi/gunicorn or Redis? How old is
this article?

~~~
recuter
In the newer (2011) slides they mention growth by ~100 Million comments in 6
months. Most comments are tiny, I'd be surprised if this whole dataset was
much larger than ~100GB.

I don't really understand why Disqus and Reddit for that matter don't just
switch to redis. There has got to be a good reason because I can't think of
one. Sounds like you _could_ run the whole thing from just one box and only
have slaves for redundancy.

Why isn't vertical scaling more in vogue?

~~~
bretthoerner
If you really think that's the case then you don't understand Disqus/Reddit OR
Redis.

~~~
recuter
Well I opened with "I don't understand". I'm not saying that they could, I'm
saying I don't understand why they can't.

I'd love for you to elaborate. As far as I can tell they are not using
PostgreSQL as a relational database, but rather, as a column store. So why not
use Cassandra or even Redis (if the amount of data can totally fit into RAM
easily, maybe it can't)?

In fact I think Reddit moved to Cassandra... anyway, I am not an expert, I'm
asking.

~~~
bretthoerner
I can't speak much to Reddit. I know they moved some things to Cassandra, but
as a user I'll say I haven't been impressed with their latency and uptime
since.

As a developer, I can speak a bit about Disqus. I don't speak on their behalf,
but I did work there for two years (ironically I was also the first to use
Redis there[1]) so I can at least explain why I think using Redis for the
whole site is a terrible idea. I'll also note upfront that where I currently
work we use a ton of Cassandra, Redis, and very little MySQL on the side, so
hopefully I won't be pegged as some kind of "RDBMS only" guy.

Anyway, some reasons:

1) Relational data. Disqus really _is_ relational. You have users, users have
posts, posts belong to a thread, threads belong to sites, sites belong to an
account (imagine one account for the different CNN websites). And that's a
very, very small subset of the number of tables and foreign keys involved.
People don't realize how many features Disqus really has above and beyond
"post a text blob to a thread."

Being able to write a query that uses joins is huge. The alternative in
Redis/Cassandra is having to denormalize your data into every single possible
way you may want to do a "query" on later. Oh, by the way, I promise you will
forget a few ways and regret having to backfill/fix all the broken
denormalizations.

Even if you don't forget to denormalize anything upfront, the biggest joke k/v
and document stores ever played on the developer community was convincing them
that they save development time by being "schema free". When Disqus wants to
add a new feature it's often only a new JOIN/INDEX away. If you realize a year
into your Redis deployment that you want to be able to tell a user how many
comments they made per month in the year... what do you do? In Postgres you
hit the datetime column index and call it a day.
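
A minimal sketch of that kind of query, using SQLite from the Python standard library as a stand-in for Postgres. The `posts` table, its columns, and the index name are hypothetical, not the real Disqus schema.

```python
import sqlite3


def comments_per_month(conn, user_id, year):
    """Return {'YYYY-MM': count} for one user's comments in a given year."""
    rows = conn.execute(
        """
        SELECT strftime('%Y-%m', created_at) AS month, COUNT(*)
        FROM posts
        WHERE user_id = ?
          AND created_at >= ? AND created_at < ?
        GROUP BY month
        ORDER BY month
        """,
        (user_id, f"{year}-01-01", f"{year + 1}-01-01"),
    ).fetchall()
    return dict(rows)


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE posts (id INTEGER PRIMARY KEY, user_id INTEGER, created_at TEXT)"
)
# The range filter on created_at can use this index; no backfill is needed.
conn.execute("CREATE INDEX posts_user_created ON posts (user_id, created_at)")
conn.executemany(
    "INSERT INTO posts (user_id, created_at) VALUES (?, ?)",
    [
        (1, "2012-01-05 10:00:00"),
        (1, "2012-01-20 12:30:00"),
        (1, "2012-03-02 09:15:00"),
        (2, "2012-01-07 08:00:00"),
        (1, "2013-01-01 00:00:00"),
    ],
)
monthly = comments_per_month(conn, user_id=1, year=2012)
```

The equivalent in a key/value store would need a per-user, per-month counter that was maintained from day one, or a backfill over every existing comment.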

2) Memory (and cost). The Disqus network is actually pretty huge. Storing the
entire dataset in RAM (Redis) would cost a lot more than using an efficient DB
like Postgres that is a pro at moving data between disk and RAM. Cassandra
would work better than Redis here, but the other problems I list still hold.

Also, as soon as you have to break from one Redis instance to two (either to
scale CPU or to live on another box to increase available RAM) you lose a lot
of server-side functionality like being able to union sets, or use the
embedded Lua to fake 'queries' because now you have keys that live on separate
systems. Before anyone says you should shard by "site", see my link below. I
did just that, but you have to understand that Disqus is more than just
"comments for my website". Say you shard by website, now how do I run a
union across sets that involve a single user who has posted to 100 different
websites? I can't. Back to backfilling and denormalizing tons of data that
also needs to be resident in RAM and kept in sync.
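
To illustrate the cross-shard problem, here is a rough sketch in which plain Python dicts stand in for independent Redis instances. The key layout, the function names, and the shard count are all made up for the example.

```python
import zlib

NUM_SHARDS = 2
# Each dict stands in for one independent Redis instance: key -> set of post ids.
shards = [dict() for _ in range(NUM_SHARDS)]


def shard_for(site):
    # Stable hash so a given site always maps to the same shard.
    return shards[zlib.crc32(site.encode()) % NUM_SHARDS]


def add_post(site, user, post_id):
    key = f"site:{site}:user:{user}:posts"
    shard_for(site).setdefault(key, set()).add(post_id)


def posts_by_user(user, sites):
    """Union a user's posts across sites by visiting each site's shard in turn."""
    result = set()
    for site in sites:
        key = f"site:{site}:user:{user}:posts"
        # One lookup per site; no single server can compute this union itself.
        result |= shard_for(site).get(key, set())
    return result


add_post("cnn.com", "alice", 1)
add_post("bbc.co.uk", "alice", 2)
add_post("cnn.com", "bob", 3)
alice_posts = posts_by_user("alice", ["cnn.com", "bbc.co.uk"])
```

Note that the caller has to already know which sites the user posted on. That list is itself another denormalized index that must be stored and kept in sync.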

\---

I could add more, but I just realized that the linked talk probably spoke
about the big sharded Postgres K/V type store that they built. Here's the
thing: all of the core stuff (like from point 1) isn't stored in there. It's
used when it can be, for scalability, but the majority of the app is still in
a behemoth Postgres instance that is replicated many times over. As to why not
use Redis for _that_ part? I'd say because it's memory only and because Disqus
has Postgres expertise. Also, it's not truly "key value" because it still has
indexes for say, datetimes or post_id or site_id which make doing a lot of
non-relational queries handy without having to denormalize. Now, why not use
Cassandra for that? Well, I would. :)

[1]
[https://github.com/bretthoerner/blog/blob/master/2011/2/21/r...](https://github.com/bretthoerner/blog/blob/master/2011/2/21/redis-at-disqus.rst)

~~~
recuter
Ah, terrific response, thanks very much. :)

So when I said switch to Redis I meant to replace the 'big sharded Postgres
K/V type store that they use' not the part where they actually use relational
features of the database.

I'm always curious about the idea of scaling UP versus OUT -- like you mention,
going from one Redis instance to two mucks up the waters. So why do it at all?
(Maybe a year from now Redis Cluster will finally come out and solve this)

1TB of RAM is going to dip below five figures soon, I guess if you can't fit
into that it is moot.

"If you realize a year into your Redis deployment that you want to be able to
tell a user how many comments they made per month in the year... what do you
do? In Postgres you hit the datetime column index and call it a day."

I would do bloom filters, but point _very_ well taken. No silver bullets.
Thanks again for the reply.

~~~
zeeg
Disqus wouldn't fit into 1TB of memory as a denormalized data set.

It doesn't even fit (indexed, at least) into 1TB of memory as a normalized data
set.

At the scale we're at, you're required to make tradeoffs and come up with less
than standard solutions to problems. Our solution, as many others have done
before us, is to shard datasets (both Redis and SQL).

~~~
recuter
Congrats on the growth, that's a lot of comments! What I really want to know
is how you guys solved being Googlebot-friendly -- I have a foggy memory of a
blog post or HN comment from around the time of the new version coming out
that said there was something interesting that will be shared about that in
the future.

~~~
zeeg
All I can really say (not being on the Google side of things) is: iframes

------
hahainternet
What struck me instantly was the use of Slony. I haven't listened to the whole
thing yet but I am interested in their justification. Perhaps they just
haven't moved to 9.2 yet.

~~~
sugarcode
slony offers some compelling advantages over streaming replication - even on
new databases we setup, we still like slony for several reasons:

* Version upgrades (streaming replication requires the same PG version between master/slave, slony does not)

* Logical replication gives us finer-grained control over how data is replicated across the cluster

* Ability to create additional indexes on slaves

slony isn't perfect and it has caused us some headaches, but its flexibility
makes it our go-to replication tool for postgres.

------
Kilimanjaro
Using django for a distributed commenting system?

Hammer and screws.

~~~
legutierr
Just wondering (I'm not necessarily objecting to your statement):

* what would you see as the right type of app for Django, if not this? Obviously Django is handling it, so what's the issue?

* what is the right tool to build a distributed comment system?

~~~
Kilimanjaro
What is a distributed commenting system? An HTML script tag to load a static
file that will ping an api that will serve you comments.

What is django? A framework with strong routing, OR-mapping, templates and
admin modules.

You need none to build it. Your main concern shouldn't be a coding framework,
but load balancing, caching, failover control, more sysadmin stuff than code.

Any language could do. Framework? not needed.

Now, for the backend system to control that monster, then yes, you may use
django.

So, use django for complex apps that need routing, data management and UI
presentation, plus a powerful admin module.

* I am a python/django developer.

~~~
tkaemming
> What is a distributed commenting system? An HTML script tag to load a static
> file that will ping an api that will serve you comments.

This is how Disqus works. Something still has to power the API that serves the
data for it and we happen to like Django a lot so that's what we use. There
are also a lot of parts of Disqus that are not the embed (moderation panel,
account management, etc.)

~~~
Kilimanjaro
Exactly, that's my point. Not meant to diminish your great work.

The admin/crud stuff, moderation panel, account management, etc, that's what
django was developed for. Great choice.

If you ask me to start Disqus from scratch again, I'd probably use django too,
for the admin part, but for the API I'd go commando, closer to the metal,
without a framework at all. No need to load a 100MB routing/modeling/templating
monster just to perform an invisible API call.

Highly optimized plain python scripts would work better.

Sometimes frameworks are more like handcuffs. Just sometimes.

~~~
zeeg
We have never ever used the Django admin, and I would choose Django any day of
the week for a project that is on the web and using a database.

(In fact almost every single project I've ever built has used Django, and
there's never been a limiting factor of that choice)

------
stef25
Painful spam in the disqus comments at the bottom of that page. Don't they
have something in place against this?

~~~
beaumartinez
"+1"s are hardly spam, they're just low quality comments. If they filtered
those, the web would be a very commentless place

~~~
d0ugal
+1

------
TommyDANGerous
Great watch.

