Hacker News
Some quick Django optimisation lessons (lukeplant.me.uk)
96 points by pajju on May 19, 2012 | 20 comments


I've also run into O(n) and even O(n*m) query-count situations. They're entirely avoidable by simply sticking closer to the relational model than to the object model. Sadly, this is almost always impossible with just Django's ORM; you very quickly end up writing unmanaged models and raw SQL.

SQLAlchemy is so much better when it comes to this, since things like joins are always explicit. However, it's rather hard to use it together with the Django ORM bits, especially the admin. One method I want to explore is replacing only the raw SQL bits with SQLAlchemy, and using its reflection to map to the Django ORM models.
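
To illustrate that explicitness, here is a minimal sketch in SQLAlchemy's expression language (the tables and columns are made up to mirror the Clan/Player example below):

    from sqlalchemy import (MetaData, Table, Column, Integer, String,
                            ForeignKey, select)

    metadata = MetaData()
    clans = Table('clan', metadata,
                  Column('id', Integer, primary_key=True),
                  Column('name', String(100)))
    players = Table('player', metadata,
                    Column('id', Integer, primary_key=True),
                    Column('name', String(100)),
                    Column('clan_id', Integer, ForeignKey('clan.id')))

    # The join is written out explicitly; nothing is fetched lazily
    # behind your back.
    query = select([players.c.name, clans.c.name]).select_from(
        players.join(clans))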


It's also important to remember that the Django ORM's select_related call only goes one way. Consider the case:

    class Clan(models.Model):
        ...

    class Player(models.Model):
        clan = models.ForeignKey(Clan)
Doing Player.objects.all().select_related() will preload the Clan object for each player via a JOIN. Doing Clan.objects.all().select_related() will _not_ pull in the related Player objects (select_related only follows forward foreign keys), and should you touch clan.player_set in any way, you will trigger a query per clan you access.

In my case, where I ordered a list of players by their clans, pages with ~200 players were doing ~200 queries. Django provides a workaround for this (the {% regroup x by y %} template tag), and 1.4 now supports doing the join in Python (prefetch_related) so that the ORM can cope with this case... but this behaviour caught me off guard, and is something to keep in mind.
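
Concretely, for the models above (player_set being Django's default reverse accessor for Player):

    # Forward FK: a JOIN pulls in each player's clan, so player.clan
    # costs no extra queries.
    players = Player.objects.all().select_related('clan')

    # Reverse FK (Django 1.4+): one extra query fetches all the related
    # players up front, instead of one query per clan.
    clans = Clan.objects.all().prefetch_related('player_set')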

You're absolutely correct that hand-writing the SQL would fix this issue too, but then, as you say, you lose database portability.


Not Django-specific:

Personally, after years of trying to solve the Object/Relational mismatch, I realized that it's better to drop the object side and stay with the relational side if you need a database.

The price is a little bit of syntactic sugar. The return is the ability to reason about your database and how it accesses data.

web2py eschewed an ORM in favour of a DAL (Database Abstraction Layer). I do not know how it compares to SQLAlchemy, but it is independent of the rest of web2py; I've been using it in non-web projects and it is excellent, although a little slow if you have large result sets.
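
For anyone curious, standalone usage looks roughly like this (the table and connection string are illustrative):

    from gluon.dal import DAL, Field   # web2py's DAL, usable on its own

    db = DAL('sqlite://storage.db')
    db.define_table('brand', Field('name'))

    # Queries are built relationally; no model classes involved.
    rows = db(db.brand.name != None).select(db.brand.ALL)
    for row in rows:
        print row.name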


SQLAlchemy is philosophically similar to DAL from what I'm reading, but closer to pure set algebra than SQL. On top of that, it builds SQL-specific features and declarative models (as objects).


Agree with the "stick to the relational model" advice. In my team we stopped writing a class for each table and use simple modules and functions instead; the result is less code, less state, more explicit parameter passing, fewer circular dependencies, fewer classmethods, fewer properties, more atomic tests, and more decoupled code.
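
A rough sketch of that style in Django terms (the table and column names are assumptions):

    # players/queries.py -- a plain module of functions, no model classes
    from django.db import connection

    def players_with_clans(limit):
        cursor = connection.cursor()
        cursor.execute(
            "SELECT p.name, c.name FROM player p"
            " JOIN clan c ON c.id = p.clan_id"
            " ORDER BY c.name LIMIT %s", [limit])
        return cursor.fetchall()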


Another thing you can do is build caching into managers.

Here is one such effort (shameless plug): https://github.com/stucchio/Guacamole
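
The general shape of the idea, as a sketch rather than Guacamole's actual API (the key scheme and timeout are illustrative):

    from django.core.cache import cache
    from django.db import models

    class CachingManager(models.Manager):
        def get_cached(self, pk, timeout=300):
            key = 'brand:%s' % pk
            obj = cache.get(key)
            if obj is None:              # miss: hit the db, then cache it
                obj = self.get(pk=pk)
                cache.set(key, obj, timeout)
            return obj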


Also consider using Johnny Cache or CacheMachine backed by memcached.

Johnny Cache does automatic invalidation, but will invalidate an entire table's cache on one write. If you have a site with a small number of writes, this is an easy and instant win. Setup is under ten lines of configuration.
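
From memory, the configuration is roughly this (check the docs for the exact backend paths):

    # settings.py
    MIDDLEWARE_CLASSES = (
        'johnny.middleware.LocalStoreClearMiddleware',
        'johnny.middleware.QueryCacheMiddleware',
    ) + MIDDLEWARE_CLASSES

    CACHES = {
        'default': {
            'BACKEND': 'johnny.backends.memcached.MemcachedCache',
            'LOCATION': ['127.0.0.1:11211'],
            'JOHNNY_CACHE': True,
        }
    }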

CacheMachine, on the other hand, tracks the objects that were returned for individual queries. A query is invalidated when one of its associated objects changes, but not when a new object is added to the db. I believe addons.mozilla.org uses this.

http://packages.python.org/johnny-cache/

https://github.com/jbalogh/django-cache-machine
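
As I remember it from CacheMachine's README, usage is a mixin plus a manager on each model you want cached:

    from caching.base import CachingMixin, CachingManager
    from django.db import models

    class Brand(CachingMixin, models.Model):
        name = models.CharField(max_length=100)

        objects = CachingManager()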


I can confirm that CacheMachine is being used by at least two large-scale production projects. We haven't faced any critical problems so far.


Correct me if I'm wrong, but I don't think Johnny Cache works with Django 1.4 yet.


Thanks, I'll check that out. How does it compare to johnny-cache, which is what I've been using? It has the advantage for me that I only need to include it in settings.py (I don't need to manually plug a manager into my models), so it's easy to enable and disable for testing. It also supports different caching backends, so I can have multiple Django instances using a common memcache for the models even if the models are highly mutable.


Well, I'm not very familiar with johnny-cache, so I don't want to say a lot. It's quite possible that it does everything guacamole does and more.

The point of guacamole is that you probably don't want to cache every database read. You probably just want to cache a few models. I.e., if your database has 1 million items (which are frequently updated) and 1000 brands (which are more or less static), you probably want to cache those 1000 brands but not the million items.

Once you've done that, actions like

    {% for item in items %}
        {{ item.brand.name }}
    {% endfor %}
no longer involve sending 50 queries to the db. (Or even to memcached - 50 hits to memcached can still be 25-50ms.)

Guacamole doesn't support other cache backends, though that would be pretty easy to add.


Also worth looking at prefetch_related(), which landed in 1.4 and can be used to alleviate common O(n) query situations:

https://docs.djangoproject.com/en/dev/ref/models/querysets/#...

I've also used this in the past to do something similar: grab the ids of all specified related fields (related to the same model type) and pull them back in a single query, with the option to fetch only certain columns as dicts instead of full model instances when a template needs just a little extra data:

http://djangosnippets.org/snippets/1117/
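
The manual version of that idea, reusing the hypothetical Player/Clan models from upthread, looks something like:

    players = list(Player.objects.all())
    clan_ids = set(p.clan_id for p in players)

    # One query fetches every clan; in_bulk returns a {pk: Clan} dict.
    clans_by_id = Clan.objects.in_bulk(clan_ids)
    for p in players:
        print p.name, clans_by_id[p.clan_id].name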


"Django makes it extremely easy to generate database queries, which is both a good and bad thing"

I experienced this when a junior developer wrote queries in a loop inside a template without realizing that it would issue a separate SQL query on every iteration.


Related question: is there an easy built-in way in Django to write to stdout a count of DB queries made for a request?



I second django-debugserver. This is one of those nifty packages you wish were included with the default Django distribution. More than once, it has helped me find and fix query-related issues.

Highly recommended.


I haven't worked with Django in a bit so my memory is foggy, but if settings.DEBUG is True, you should be able to do something like:

    from django.db import connection
    print len(connection.queries)
Put that in a middleware class's process_response() and it should print the query count for every request.
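
i.e. something like this (old-style middleware; connection.queries is only populated while DEBUG is True):

    from django.db import connection

    class QueryCountMiddleware(object):
        def process_response(self, request, response):
            print len(connection.queries)   # queries made for this request
            return response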


One option is to use the debug_toolbar package. You can set it up to display only when testing or only when an admin is logged in, and it will give you a convenient bar on the right side that shows exactly how long everything took, along with all SQL queries. This allowed me to greatly cut down the number of queries by judiciously using prefetch_related.
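
Setup is just a few settings (paths from memory):

    # settings.py
    INSTALLED_APPS += ('debug_toolbar',)
    MIDDLEWARE_CLASSES += (
        'debug_toolbar.middleware.DebugToolbarMiddleware',
    )
    INTERNAL_IPS = ('127.0.0.1',)   # the toolbar only shows for these IPs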


Maybe it is just a matter of style, but I really dislike it when advice is repeated three times; I regard that as disrespectful of the reader.


That seems like an unholy number of queries. It would be nice to have an explanation of why those code paths generate so many queries.



