I wish the 10gen folks would do some work on various issues that have been in MongoDB for a long time. For me the most serious one is data size: storage efficiency is pretty terrible, especially given how cheap CPU is these days. I would far prefer to spend more CPU in order to reduce storage.
Records in a collection mostly use the same set of keys, yet every record stores the full key names, which makes the storage that much larger. Key strings could be interned and replaced with far smaller numerical tokens to reduce storage. (I'd be happy for it to be attempted for string values as well.)
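The interning idea can be sketched in a few lines. This is just an illustration of the technique being asked for, not anything MongoDB actually does; the class and names are made up for the example:

```python
# Sketch of field-name interning: replace repeated string keys with
# small integer tokens, keeping one shared key dictionary (in a real
# system this would live per-collection, on disk).
class KeyInterner:
    def __init__(self):
        self.key_to_token = {}
        self.token_to_key = []

    def intern(self, doc):
        """Return a copy of doc with key strings replaced by int tokens."""
        out = {}
        for key, value in doc.items():
            token = self.key_to_token.get(key)
            if token is None:
                token = len(self.token_to_key)
                self.key_to_token[key] = token
                self.token_to_key.append(key)
            out[token] = value
        return out

    def restore(self, doc):
        """Invert intern(): map tokens back to the original key strings."""
        return {self.token_to_key[t]: v for t, v in doc.items()}

interner = KeyInterner()
packed = interner.intern({"first_name": "Ada", "last_name": "Lovelace"})
print(packed)  # {0: 'Ada', 1: 'Lovelace'}
assert interner.restore(packed) == {"first_name": "Ada", "last_name": "Lovelace"}
```

Since most records in a collection repeat the same keys, the dictionary stays tiny while every stored record sheds the full key strings.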
The other is compression, which is a straightforward CPU versus storage tradeoff.
The lack of compression has a huge effect on working set size, especially as the BSON representation doesn't really save space, just CPU.
Mongo's behaviour when the working set exceeds memory size used to be pretty terrible. (I moved on from the relevant project so I don't know how much of an issue it still is.) https://jira.mongodb.org/browse/SERVER-574 - essentially what happens is that Mongo keeps trying to execute new queries rather than throttling to let existing ones finish, which causes even more thrashing and longer query times.
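The throttling that ticket asks for is just admission control. A minimal sketch of the idea (my own illustration, not MongoDB code; the limit is an assumed tunable) would cap concurrent queries so new ones queue instead of piling onto an already-thrashing working set:

```python
import threading

MAX_CONCURRENT_QUERIES = 4  # assumed tunable, for illustration only
query_slots = threading.BoundedSemaphore(MAX_CONCURRENT_QUERIES)

def run_query(execute, *args):
    # Block here instead of starting yet another query that would
    # fight the in-flight ones for memory and make thrashing worse.
    with query_slots:
        return execute(*args)

# Usage: wrap whatever actually executes the query.
result = run_query(lambda x: x * 2, 21)
print(result)  # 42
```

Under memory pressure the queued queries wait briefly; without the gate, every new query lengthens everyone's tail latency.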
Reducing the memory consumption would curtail the onset of that.
I've seen your posting before, but don't see the relevance. Our indices are a trivial portion of our data size, and even if they were zero bytes in size it wouldn't make an appreciable difference to the data size.
Though you are right, the architecture we tried earlier, when we were doing a proof of concept, would not have helped with compression. Back then we were trying to play nice with mongodb's data format and keep its legacy indexes around. It turned out that for many reasons (concurrency, recovery, compression, and a few others), it was a lot better to replace all of the storage code, so that's what some of us have been doing for the past couple of months.
What we are doing now is replacing all of mongodb's storage. So all of the data and all the indexes are in fractal trees. This means we can compress everything (including the field names!) and keep everything nicely defragmented.
That sounds great if you've got large fields with lots of redundancy. In fact, we do this.
But small fields won't compress so well on their own. There's often a lot of redundancy across records for the same fields, which is great for compression. This might also be a great way to achieve the benefits of field-name tokenization (which is similar to part of how most compressors work). I'd like to see block compression rather than per-field compression.
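The difference is easy to demonstrate with a toy example (the records and zlib are just stand-ins for illustration): compressing each small record on its own gains almost nothing, while compressing a block of records lets repeated field names and values across records become cheap back-references.

```python
import json
import zlib

# Small records that share the same field names and similar values.
records = [
    json.dumps({"user_id": i, "status": "active", "country": "US"}).encode()
    for i in range(100)
]

# Per-record compression: tiny inputs, so the compressor sees little
# redundancy and header overhead dominates.
per_record = sum(len(zlib.compress(r)) for r in records)

# Block compression: the cross-record repetition is now visible to
# the compressor within a single input.
block = len(zlib.compress(b"".join(records)))

print(per_record, block)
assert block < per_record  # the block compresses far better
```

This is the same reason field-name tokenization falls out for free: the repeated key strings are exactly the kind of cross-record redundancy a block compressor eliminates.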
Hmm...interesting. I don't know if this will work, but you could try storing your MongoDB database on a compressed ZFS partition. Since MongoDB uses mmap, this would have the nice side-effect of your working set remaining uncompressed, and only being compressed when written to or read from disk.
You're not the first person to suggest compression to me :) although I hadn't thought about using ZFS for it. Why that and not compression in btrfs (or something else entirely)?
_Hopefully_ the mmap interface would provide the best of all worlds: mongodb continues to be simple with respect to how it gets data from disk, and the kernel/fs can do its magic behind the scenes of mmap. Of course, it could be that mmap + compressed filesystem leads to some unexpected (and bad) perf results. But then again, I've never tried :) have you?
That then means all access has to be done via said wrapper. For example I couldn't continue using the Mongo shell as I occasionally do, or if someone comes up with a handy tool written using a different language than my usual (Python).
I don't disagree with your points, but do keep in mind they are a large company, and work on new features (such as this query matcher) doesn't imply they don't also have people working on the features that are important to you.
I had btrfs on my systems, but gave up and got new SSDs with ext4. The copy-on-write approach is terrible for Mongo's database performance. btrfs also doesn't have tools that report compression ratios, so you can't know how much you're saving until you fill up storage. It is possible to turn off data copy-on-write, but that also turns off compression! And MongoDB likes to use lots of 2GB files instead of a single file for the database, which makes this sort of thing even harder to apply.
As a large commercial user of MongoDB (almost 300 instances of it running in production), I've seen some big shifts in 10gen's focus lately. They've really ramped up the customer service, and they are making more regular releases that seem like improvements.
Unfortunately, the better customer service doesn't come with better answers or solutions to problems, and the improvements aren't targeting long-standing, basic issues with the platform.
This feature announcement fits the pattern: an improvement, but not in a critical area like indexing, document-level locking, or sharding stability. There are basic fundamentals that need addressing, like overcoming the 20K maximum connections per server or mongos CPU usage. These are the things I deal with in production that are business critical, but those feature requests sit untouched in JIRA for months or years.
This feature seems interesting, but it solves a problem I don't have. I'd prefer them to solve real world problems.
Because I want to run more than 100 instances of a webserver per mongos process, but I can't because it causes the connections per server to go over 20K.
Right now my application stack is woefully under utilized due to this completely arbitrary decision on their part not to allow the end user to set the connection limit. They've even admitted that they just picked that number four years ago and haven't looked at it since.
Just like they limit replica sets to 12 members maximum. What if I have higher read requirements than that limitation allows? Well, too bad.
I think this points at a fundamental issue with MongoDB at the design level - they don't allow end users to make any decisions about how to configure the product, even if those decisions might turn out badly.
Every Enterprise-level DB allows end user tuning of parameters, so clearly MongoDB isn't Enterprise-level.
To be fair, I educated the client on the alternatives in case 10Gen did not improve the aggregation framework to handle arbitrary transforms without failing when the output size exceeds 16MB. [Use other MongoDB mechanisms, a custom build with a higher per-document size cap, etc.] What annoys me is that 10Gen did not mention this incredibly important limitation when they were touting the planned features of 2.2. My client would not have minded had queries with large result sets simply been somewhat slow. What this client could not tolerate were failures to deliver any results at all. In retrospect, I wish I had pushed harder for a solution based on the Hadoop stack. While it seems to have its own demons, at least there is an ecosystem dedicated to fixing the most blatant of its limitations.
Do I regret choosing MongoDB a couple of years ago to augment Postgres for several data analysis applications? Not really. But I do wish Mongo had been more mature. I'm happy to see each step forward, but it does bother me that serious issues often get overlooked in favour of press-friendly feature additions. Until Mongo adopts some form of compression, I won't be using it for new projects, aside from personal ones where ease of development trumps everything else.