

MongoDB, count() and the big O - dpeck
http://www.coffeepowered.net/2011/07/17/mongodb-count-and-the-big-o/

======
benologist
MongoDB _really_ sucks at count. It's a problem I've run into repeatedly with
games that automatically submit high scores (i.e. without a name field/submit
button) using an API call of ours that saves a score and then returns its rank
by checking how many scores are higher or lower than it... on games that get
millions of plays.

My solution ultimately ended up being to put the leaderboards that are doing
that particular thing into a special "truncate" list that periodically deletes
everything after their 50,000th score.
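
A minimal sketch of that truncation pass, using a plain Python list of dicts in place of the scores collection (the 50,000 cap matches the comment above; the `score` field name and function names are hypothetical stand-ins for whatever the real schema uses):

```python
# Periodic truncation pass: keep only the top N scores per leaderboard,
# so rank lookups never have to count across millions of rows.
# An in-memory list of dicts stands in for the MongoDB collection here.

TRUNCATE_AFTER = 50_000  # everything past the 50,000th score gets deleted

def truncate_leaderboard(scores):
    """Sort descending by score and drop everything past the cap, in place."""
    scores.sort(key=lambda doc: doc["score"], reverse=True)
    del scores[TRUNCATE_AFTER:]
    return scores

def rank_of(scores, score):
    """Rank = 1 + number of scores strictly higher (what the API call computes)."""
    return 1 + sum(1 for doc in scores if doc["score"] > score)
```

The point of the truncation is that `rank_of` then scans (or counts) at most 50,000 documents instead of millions.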

~~~
latch
I run/built mogade.com, which does the same thing. I blogged about how I
switched to Redis for this specific feature (it made it top of HN, weren't you
paying attention that day!?). Maybe you'll find it useful; a simple test went
from over 5 minutes to 147ms:

[http://openmymind.net/2011/5/8/Practical-NoSQL-Solving-a-Rea...](http://openmymind.net/2011/5/8/Practical-NoSQL-Solving-a-Real-Problem-w-Mongo-Red/)

~~~
benologist
I distinctly remember The Little MongoDB Book; I must have missed the
leaderboard one (or not had the problem at that time), heh.

------
dpeck
I've currently painted myself into a corner with this and am trying to find a
workaround that I'm happy with. Using Mongoid as my ORM and subclassing models
seemed like a great idea, that is, until I had 20M instances of these models.
It's currently making my dashboard functionality useless for the user and
slowing the system down considerably whenever it's requested.

Understandably this is the cost of using things like MongoDB. Until now I've
been very happy with the benefits and tradeoffs, but the problems with count()
on reasonably sized collections were a particularly nasty surprise.

~~~
cheald
Author of the article here. The count caching solved the problem for most
practical production purposes (CRITICAL ALLITERATION), but it's still uglier
than I'd like. However, if the use case is pagination, then "faking it" once
you reach a certain size works well, and is unlikely to be noticed by the
user. If your counts are on relatively static pieces of your dataset, then
keeping counts manually somewhere (via $inc) will get you that performance
back, but obviously at the cost of having to maintain it all manually.
Incremental map/reduce isn't a bad idea, but again, depends on the complexity
of what's being counted.

All in all, it's something of a sore spot in an otherwise very enjoyable
development experience.

------
wisty
Count can be done with an incremental map-reduce. CouchDB does this
automatically, but you could easily (read: with some difficulty) replicate
this with any stack.

This would be much easier to do if Mongo had triggers.

------
scg
On a related note, skip(n) is also O(n) so the farther you paginate the slower
it gets.

~~~
cheald
Yup, but the slow part of count() is the value comparison over _n_ documents,
moreso than the actual iteration. So while the point holds true, it's less of
a practical concern.
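
One standard workaround for the skip(n) cost (not mentioned in the thread, but a common range-query technique) is keyset pagination: instead of skipping n documents, remember the last value seen and query for everything past it, which an index can seek to directly. A sketch against a sorted Python list standing in for an indexed collection, with `bisect` playing the role of the index seek:

```python
import bisect

# Keyset pagination sketch: page through a sorted dataset by "last seen key"
# instead of skip(n). A sorted list stands in for an indexed collection.

def page_after(sorted_keys, last_seen, page_size):
    """Return the next page_size keys strictly greater than last_seen."""
    start = bisect.bisect_right(sorted_keys, last_seen)
    return sorted_keys[start:start + page_size]
```

Each page costs O(log n) to locate plus the page size to read, regardless of how deep into the dataset you are, which is why range queries on `_id` or a timestamp are the usual fix for deep pagination.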

------
sideshownz
I had the same problem with a dataset of 1M+ records in Mongo and really slow
pagination with will_paginate. I have since swapped it out for the 'kaminari'
gem (<https://github.com/amatsuda/kaminari>). This should solve your problem
and is _almost_ as simple as will_paginate to drop in. FYI, I'm using Mongoid
and not MongoMapper.

~~~
cheald
Kaminari should suffer from the same issues; it's still invoking count() on
the scope (at least, it appears to, from the source), which is going to
trigger the problem. If you're counting the whole collection, it's going to be
very fast, since mongo will just return the number of records in the
collection, but if you're viewing a subset of the collection, it's still going
to iterate over each record and do a value comparison.

------
kevin1024
You can combine limit() with count() by passing true to the count function. So
if you don't care about exact counts beyond, say, 10000, you can do
db.collection.find({criteria: '123'}).limit(10000).count(true) and you will
only get back a max of 10000. This puts an upper bound on waiting for some
large counts to come back.
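
The same bounded-count idea expressed as a generic helper, with a plain iterator standing in for the MongoDB cursor (the 10,000 cap is just the example figure above):

```python
def bounded_count(cursor, cap):
    """Count matches, but stop once cap is reached, the way
    find(...).limit(cap).count(true) bounds the scan server-side."""
    n = 0
    for _ in cursor:
        n += 1
        if n >= cap:
            break
    return n
```

Anything at the cap can then be displayed as "10000+", which keeps worst-case latency flat no matter how large the matching set really is.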

~~~
cheald
That's awfully handy. It unfortunately doesn't solve the pagination problem,
though. :/

~~~
taf2
It does, if you change your pagination interface to only care whether there
are more results after this page...

~~~
cheald
If your pagination interface is simply "Next", then why not skip the count
altogether, fetch limit(N+1) documents, and offer the "Next" button if the
result set is N+1 large?
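
A sketch of that N+1 trick, with a list slice standing in for the limit(N+1) query:

```python
def fetch_page(items, offset, page_size):
    """Fetch page_size + 1 items; the extra one only signals whether a
    'Next' button should be shown, and is never rendered itself."""
    window = items[offset:offset + page_size + 1]
    has_next = len(window) > page_size
    return window[:page_size], has_next
```

No count() is ever issued: the presence of the (N+1)th document is the whole signal.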

------
artsrc
I think that SQL databases are bad at count too.

I think the right map reduce view in Couch is good for count.

~~~
cheald
Postgres is notoriously bad at count. MySQL seems to handle it quite well,
though, for some reason.

~~~
saurik
This is only true of the MyISAM backend (which has poor concurrency and has
more complex table repair). If you use InnoDB, count(*) will be a full table
scan, just as in PostgreSQL (as this is a direct tradeoff on those other
properties).

~~~
onethumb
That's only true if you don't have any WHERE conditions. If you do, and
they're indexed (as described by this author), it'll return the count fast.

~~~
saurik
Right... but PostgreSQL does that as well, so it really isn't relevant to a
PostgreSQL vs. MySQL comment, which is what I was responding to. ;P PostgreSQL
and InnoDB have the exact same constraints and tradeoffs with respect to
count() performance, AFAIK.

(That said, an index doesn't make this magically faster... the fast
performance that people credit MyISAM with is that asking for the total number
of rows in a table was O(1), as its lack of intricate transaction concurrency
allowed it to just keep a global counter; one which of course was difficult
and costly to repair after an unclean outage. In comparison, that indexes can
optimize a count() over a where is obvious and true of any sane database
engine, whether it be PostgreSQL or any of the common MySQL backends.)

~~~
HarrisonFisk
There is a big difference with regards to PostgreSQL and InnoDB when you have
an indexed COUNT(*) clause.

MySQL can do index-only queries, whereas PG currently cannot (it is part of
the next release, IIRC). This can often result in a fraction of the disk I/O
being done for MySQL.

~~~
saurik
Interesting; I did not realize that InnoDB supported covering indexes. That
said, it is still "only" a constant factor: the difference between O(n)
(looking at just the index) and O(2n) (scanning the index, but having to also
pull that original record, which you now know the exact position of, to see if
it is alive for the current transaction).

Thanks for pointing this out, though! I'm going to go look into how InnoDB
manages to handle that.

------
siculars
/rant/ You know what I think is "a big fat pile of suck"? This guy and his
hack blog post. Congrats, you figured out what O(n) means. You know, this
stuff is not magic. You can't magically say, hey database, give me everything
I want and do it for me immediately. Oh, and make it open source, aka free.
Oh, and on and on and on.

This stuff is hard. Nothing comes for free. I don't even use mongodb in any
production project and am really feeling for them. All these pseudo functional
developers think they know how things should work and then get all indignant
when stuff doesn't work the way they want it to or think it should.

In this season of Thanks, let us give thanks to the real engineers who make
these databases (and other software) and then give them out to everyone for
free. Go send a polite email to their mailing lists and say "Dear OpenSource
Developer, Thank you for all your hard work."

Re. this specific problem - cache your data and stop crying about it or simply
don't present counts to users.

~~~
cheald
Hi, big fat pile of hack-posting author here. I know it's not magic. I'm a
huge fan of Mongo. I blog about the issues I have with it, because the issues
I don't have with it are far less interesting. The solution wasn't "AMG, MONGO
SUX", it was "here's how I band-aided the problem".

The impetus for this particular post was that it was somewhat surprising that
counts were still doing full scans, even if the query conditions were on an
indexed field. My expectation was that the index might have some information
on how big a particular part of the index tree is (given two bounds), but it
turns out that it doesn't, so you end up looking at each document and doing
inspection per row. This isn't normally a problem, but the comparison is
relatively slow, so the whole thing starts to drag down faster than you'd
normally expect.

When the problem is hidden below several levels of library/ORM candy coating,
it's not going to be immediately obvious that this _is_ a problem, especially
if the queries against that field are otherwise performant thanks to proper
use of indexes. It's non-obvious, and given that someone else found it
interesting enough to post here, I'm apparently not the only person in the
world to ever have this problem. There are multiple open tickets in the Mongo
JIRA system, as well as multiple discussions on this topic on the mailing
list. This isn't me whining that 10gen won't fix my software, it's an
acknowledgment of the problem and steps towards fixing it.

10gen themselves have acknowledged via the mailing lists that counts are
slower than they should be, and it's on the list of things to improve. Until
then, we have to make do with workarounds.

~~~
nnythm
I think that more than the content of your post, siculars is responding to the
somewhat inflammatory title. As soon as I saw the title I thought, "Someone is
going to be complaining about mongoDB having a slow operation, and it is going
to be upvoted endlessly because mongoDB is webscale." Upon closer inspection,
it became clear that this title is only inflammatory because it is on HN, and
also that the title was not designed for HN: on your blog, it is simply a
description of what the post is about. On HN, it becomes a honeypot.

@cheald, clearly this isn't your fault, and you don't control the tenor of the
discussion of mongoDB. What follows isn't a criticism of you, but more a
general comment.

I don't think that criticism of mongoDB should be stifled on HN (it clearly
still has huge flaws--the lack of granularity when grabbing the write lock
comes to mind), but it seems like it's a little hard to have an intelligent
conversation in this context. It might make sense to encourage HN to discuss
Mongo in contexts where it can be compared against other databases, rather
than by criticizing an isolated performance detail (no database can do
everything for everyone). When criticizing a performance detail, it might make
sense to move that discussion to the MongoDB JIRA, where your voice can
actually help influence decision making.

