Records in a collection mostly use the same set of keys, yet every record stores the full key, which makes the storage that much larger. Keys could be interned and replaced with a far smaller numerical token to reduce storage. (I'd be happy to see it attempted for string values as well.)
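The interning idea is roughly this (a minimal sketch; `KeyInterner` and its methods are hypothetical names, not anything MongoDB actually ships): each distinct field name is assigned a small integer token once, and records store the token instead of repeating the string.

```python
class KeyInterner:
    """Hypothetical sketch: map repeated field names to small integer tokens."""

    def __init__(self):
        self.key_to_token = {}   # field name -> integer token
        self.token_to_key = []   # integer token -> field name

    def intern(self, record):
        """Replace string keys with integer tokens, registering new keys as seen."""
        out = {}
        for key, value in record.items():
            token = self.key_to_token.get(key)
            if token is None:
                token = len(self.token_to_key)
                self.key_to_token[key] = token
                self.token_to_key.append(key)
            out[token] = value
        return out

    def restore(self, interned):
        """Invert intern(): recover the original string-keyed record."""
        return {self.token_to_key[t]: v for t, v in interned.items()}

interner = KeyInterner()
records = [{"first_name": "Ada", "last_name": "Lovelace"},
           {"first_name": "Alan", "last_name": "Turing"}]
packed = [interner.intern(r) for r in records]
# Each stored record now carries small integers instead of full key strings;
# the key dictionary is stored once per collection rather than per record.
assert interner.restore(packed[1]) == records[1]
```

The key dictionary grows with the number of distinct field names, not the number of records, which is why this wins when records mostly share keys.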
The other is compression, which is a straightforward CPU-versus-storage tradeoff.
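That tradeoff is easy to see with any general-purpose compressor; here's a small illustration using Python's `zlib` (the payload is a made-up repetitive record, not real MongoDB data) where higher compression levels spend more CPU time to shave more bytes:

```python
import time
import zlib

# Hypothetical repetitive JSON-ish payload standing in for a batch of records.
payload = b'{"first_name": "Ada", "last_name": "Lovelace", "active": true}' * 10000

for level in (1, 6, 9):  # zlib levels: fast/weak -> slow/strong
    start = time.perf_counter()
    compressed = zlib.compress(payload, level)
    elapsed_ms = (time.perf_counter() - start) * 1000
    print(f"level {level}: {len(payload)} -> {len(compressed)} bytes "
          f"in {elapsed_ms:.2f} ms")

# Stronger levels should never produce a larger result on redundant data.
assert len(zlib.compress(payload, 9)) <= len(zlib.compress(payload, 1))
```

On record-like data the ratio is usually dramatic even at the cheapest level, which is why the storage savings tend to pay for the CPU.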
Compression (well, lack of) is probably the thing that'll cause me to migrate off MongoDB one of these days.
Mongo's behaviour when the working set exceeds memory size used to be pretty terrible. (I moved on from the relevant project, so I don't know how much of an issue it still is.) https://jira.mongodb.org/browse/SERVER-574 - essentially what happens is that Mongo keeps trying to execute new queries rather than throttling to let existing ones finish, which causes even more thrashing and longer query times.
Reducing memory consumption would delay the onset of that.
Resyncing will drop disk usage from 200GB to 50GB(!). So much wasted space.
Come talk to us if you're interested. We are stabilizing things and hope to open up evaluations more in a week or two.
Though you are right: the architecture we tried earlier, when we were doing a proof of concept, would not have helped with compression. Back then we were trying to play nice with MongoDB's data format and keep its legacy indexes around. It turned out that for many reasons (concurrency, recovery, compression, and a few others) it was a lot better to replace all of the storage code, so that's what some of us have been doing for the past couple of months.
But small fields won't compress well on their own. Often there's a lot of redundancy across records for the same fields, which is great for compression. This might also be a great way to get the benefits of field-name tokenization (which is similar to part of how most compressors work). I'd like to see block compression rather than per-field compression.
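The block-versus-field point is easy to demonstrate: compressing each tiny record alone pays per-record overhead and can't exploit redundancy between records, while compressing them as one block lets the compressor see the repeated field names. A rough sketch with `zlib` and made-up records:

```python
import zlib

# Hypothetical records that all share the same field names and similar values.
records = [b'{"host": "web-%03d", "status": "ok", "region": "us-east"}' % i
           for i in range(1000)]

# Per-record compression: each tiny payload is compressed in isolation,
# so every copy of every field name is paid for again.
per_record = sum(len(zlib.compress(r)) for r in records)

# Block compression: one compressor sees the whole stream, so repeated
# field names across records cost almost nothing after the first copy.
block = len(zlib.compress(b"".join(records)))

print(f"per-record: {per_record} bytes, block: {block} bytes")
assert block < per_record  # the block form should win by a wide margin
```

This is also why block compression subsumes much of what explicit field-name tokenization would buy: a dictionary-based compressor effectively builds the token table itself.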
_hopefully_ the mmap interface would provide the best of all worlds: MongoDB continues to be simple with respect to how it gets data from disk, and the kernel/fs can do its magic behind the scenes of mmap. Of course, it could be that mmap + compressed filesystem leads to some unexpected (and bad) perf results. But then again, I've never tried :) have you?
Unfortunately, the better customer service doesn't provide better answers or solutions to problems, and the improvements aren't targeting long-standing, basic issues with the platform.
This feature announcement fits the pattern: an improvement, but not in a critical area like indexing, document-level locking, or sharding stability. There are basic fundamentals that need addressing, like overcoming the 20K maximum connections per server or mongos CPU usage. These are the things I deal with in production that are business critical, but those feature requests sit untouched in JIRA for months or years.
This feature seems interesting, but it solves a problem I don't have. I'd prefer them to solve real world problems.
Right now my application stack is woefully underutilized due to this completely arbitrary decision on their part not to let the end user set the connection limit. They've even admitted that they just picked that number four years ago and haven't looked at it since.
Just like they limit replica sets to 12 members maximum. What if I have higher read requirements than that limitation allows? Well, too bad.
I think this points to a fundamental issue with MongoDB at the design level - they don't allow end users to make any decisions about how to configure the product, even if those decisions might turn out badly.
Every Enterprise-level DB allows end user tuning of parameters, so clearly MongoDB isn't Enterprise-level.
I think it is a little much to equate the problems you have with being the only real-world problems. Perhaps this is a real-world problem that you don't have?
These aren't edge cases specific to my environment. They are very real, very visible issues that are discussed on the mailing lists all the time.
To be fair, I educated the client on the alternatives in case 10Gen did not improve the aggregation framework to do arbitrary transforms without failing when the output size exceeds 16MB. [Use other MongoDB mechanisms, a custom compilation with a higher per-document size cap, etc.] What annoys me is that 10Gen did not mention this incredibly important limitation when they were touting the planned features of 2.2. My client would not have minded had queries with large result sets simply been somewhat slow. What this client could not tolerate were failures to deliver any results at all. In retrospect, I wish that I had pushed harder for a solution based on the Hadoop stack. While it seems to have its own demons, at least there is an ecosystem dedicated to fixing the most blatant of its limitations.
When I inquired about that, they suggested using an ssh tunnel. I can see why they like mmapped files for storing data.