Records in a collection mostly use the same set of keys. Every record stores each key name in full, which makes the storage that much larger. Strings can be interned and replaced with a far smaller numerical token to reduce storage. (I'd be happy for it to be attempted for string values as well.)
The other is compression, which is a straightforward CPU versus storage tradeoff.
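To make the key interning idea concrete, here is a rough sketch in Python. It is not how MongoDB stores anything; `intern_keys` and `restore_keys` are made-up names for illustration, and the JSON size comparison is only a stand-in for real on-disk size.

```python
import json

def intern_keys(records):
    """Replace repeated string keys with small integer tokens.

    Returns the shared key table plus the re-encoded records.
    (Hypothetical helper, not a MongoDB API.)
    """
    key_to_token = {}
    encoded = []
    for record in records:
        row = {}
        for key, value in record.items():
            # First time we see a key, hand out the next free token.
            token = key_to_token.setdefault(key, len(key_to_token))
            row[token] = value
        encoded.append(row)
    return key_to_token, encoded

def restore_keys(key_to_token, encoded):
    """Invert intern_keys using the shared key table."""
    token_to_key = {token: key for key, token in key_to_token.items()}
    return [{token_to_key[t]: v for t, v in row.items()} for row in encoded]

records = [
    {"customer_name": "alice", "order_total": 10},
    {"customer_name": "bob", "order_total": 25},
    {"customer_name": "carol", "order_total": 7},
]

key_table, encoded = intern_keys(records)
original_size = len(json.dumps(records))
interned_size = len(json.dumps(encoded))  # key table is stored once, separately
```

The key table is stored once per collection rather than once per record, so the saving grows with the number of records and the length of the key names.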
Compression (well, lack of) is probably the thing that'll cause me to migrate off MongoDB one of these days.
Mongo's behaviour when the working set exceeds memory size used to be pretty terrible. (I moved on from the relevant project so I don't know how much of an issue it still is.) See https://jira.mongodb.org/browse/SERVER-574. Essentially, Mongo keeps trying to execute new queries rather than throttling to let existing ones finish, which causes even more thrashing and longer query times.
Reducing the memory consumption would curtail the onset of that.
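For what "throttling to let existing ones finish" could look like, here is a minimal sketch of admission control with a semaphore. This is only an illustration of the general technique, not what Mongo does or what the ticket proposes; `QueryThrottle` and `fake_query` are invented names, and the limit of 2 is arbitrary.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

class QueryThrottle:
    """Cap the number of queries executing at once (hypothetical sketch)."""

    def __init__(self, max_concurrent):
        self._slots = threading.BoundedSemaphore(max_concurrent)
        self._lock = threading.Lock()
        self._running = 0
        self.peak_running = 0  # highest concurrency actually observed

    def run(self, query_fn, *args):
        # Block here until a slot frees up, instead of piling on more work.
        with self._slots:
            with self._lock:
                self._running += 1
                self.peak_running = max(self.peak_running, self._running)
            try:
                return query_fn(*args)
            finally:
                with self._lock:
                    self._running -= 1

def fake_query(n):
    # Stand-in for a query that touches pages outside the working set.
    time.sleep(0.01)
    return n * 2

throttle = QueryThrottle(max_concurrent=2)
with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(lambda n: throttle.run(fake_query, n), range(8)))
```

Eight callers arrive at once, but only two queries ever run at the same time; the rest wait their turn rather than competing for the same memory.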
Resyncing will drop disk usage from 200GB to 50GB(!). So much wasted space.
Come talk to us if you're interested. We are stabilizing things and hope to open up evaluations more in a week or two.
Though you are right, the architecture we tried earlier, when we were doing a proof of concept, would not have helped with compression. Back then we were trying to play nice with mongodb's data format and keep its legacy indexes around. It turned out that for many reasons (concurrency, recovery, compression, and a few others), it was a lot better to replace all of the storage code, so that's what some of us have been doing the past couple of months.
But small fields won't compress so well on their own. Often there's a lot of redundancy across records for the same fields, which is great for compression, so I'd like to see block compression rather than field compression. Block compression might be a great way to achieve the benefits of field name tokenization too (which is similar to part of how most compressors work).
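A quick way to see the difference is to compress the same records both ways with zlib. This is a toy comparison under made-up assumptions: the sample records, the block size of 100, and JSON as the serialization are all mine, and the field-level variant compresses only the values (keys are not even counted), which if anything flatters it.

```python
import json
import zlib

# Synthetic records with lots of cross-record redundancy (assumed data).
records = [
    {"user_id": i, "status": "active", "country": "US", "plan": "free"}
    for i in range(1000)
]

def field_compressed_size(records):
    """Compress every field value on its own and sum the sizes."""
    total = 0
    for record in records:
        for value in record.values():
            total += len(zlib.compress(str(value).encode("utf-8")))
    return total

def block_compressed_size(records, block_size=100):
    """Serialize records in blocks and compress each block as a unit."""
    total = 0
    for start in range(0, len(records), block_size):
        block = json.dumps(records[start:start + block_size]).encode("utf-8")
        total += len(zlib.compress(block))
    return total

raw_size = len(json.dumps(records).encode("utf-8"))
field_size = field_compressed_size(records)
block_size_total = block_compressed_size(records)
```

Each tiny value pays the fixed zlib framing overhead and gives the compressor nothing to match against, while a block lets repeated key names and values across records be encoded as back-references.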
_Hopefully_ the mmap interface would provide the best of all worlds: mongodb continues to be simple with respect to how it handles getting data from disk, and the kernel/fs can do its magic behind the scenes of mmap. Of course, it could be that mmap + compressed filesystem leads to some unexpected (and bad) perf results. But then again, I've never tried :) Have you?