My company uses MongoDB. Our biggest pain points are:
1. MongoDB has massive storage overhead per field due to the BSON format. Even if you use single-character field names, you're still paying for a type byte and a null terminator on every field, and fixed-length int32s bloat your storage further (see the byte-level sketch after this list). We work around this by serializing our objects as binary blobs into the DB, and only breaking out extra fields when we need an index.
2. Mongo memory-maps the entire DB, so it all eventually gets paged into memory and performance depends on the OS paging system. That's fine while the working set fits in RAM; for a humongous DB it doesn't, and the constant paging murders performance.
3. Sharding, which #1 and #2 force on you early. MongoDB requires deploying a "config cluster" - 3 additional config server instances just to manage sharding metadata (annoying that the data nodes themselves cannot manage this, and expensive from an ops/cost standpoint).
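To make the overhead in point 1 concrete, here's a rough sketch (not from the thread) using pymongo's bson module, assuming a recent pymongo where bson.encode is available; the byte counts follow directly from the BSON spec.

```python
# Rough illustration of BSON's per-field overhead (assumes pymongo's bson module).
import bson

doc = bson.encode({"a": 1})   # a single one-character field holding an int32
print(len(doc))               # 12 bytes total

# Breakdown per the BSON spec:
#   4 bytes  document length prefix (int32)
#   1 byte   element type tag (0x10 = int32)
#   2 bytes  field name "a" plus its null terminator
#   4 bytes  the int32 value itself
#   1 byte   trailing document terminator (0x00)
# Even a 1-character field name costs 3 bytes of per-field overhead on top of the value.
```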
What I would like to know is:
1. What is the storage overhead per field of a document in RethinkDB? If it's greater than 1 byte, I'm wary.
2. Where is the .Net driver?
1. In the coming release we'll be storing documents on disk via protocol buffers, which, unlike BSON, have extremely low per-field overhead (see the wire-format sketch after this list). A few releases after that we'll be able to do much better via compression of attribute-name information (though this feature isn't specced yet).
2. No ETA yet, but we're about to publish an updated, better documented, better architected client-driver-to-server API spec, so we'll be seeing many more drivers soon.
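For comparison with the BSON numbers above, here's a minimal sketch of the protobuf wire format being referred to, with the varints hand-rolled purely to show the byte layout (nothing here is RethinkDB's actual on-disk code): field names live in the schema, so each field on the wire is just a varint tag plus its value.

```python
# Minimal sketch of protobuf's wire format for a small integer field.
# Field names are not stored; the tag is (field_number << 3) | wire_type, varint-encoded.

def encode_varint(n):
    """Standard protobuf base-128 varint encoding (non-negative n only)."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)

def encode_int_field(field_number, value):
    """Encode one varint-typed field: tag byte(s) followed by the value."""
    tag = (field_number << 3) | 0  # wire type 0 = varint
    return encode_varint(tag) + encode_varint(value)

print(encode_int_field(1, 1))  # b'\x08\x01' -- 2 bytes vs. the 7 BSON spends on the same field
```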
If you use proto-bufs, it means you already have a system for internal auto-schematization. Why not pack all the fields together and use a bit-vector header to signify which fields are present and which fields have default values? I'd LOVE to see a document DB with ~1 bit overhead per field.
Yes, that's pretty much what we're going to do. It's a bit hard to guarantee everything in a fully concurrent, sharded environment, so it'll take a bit of time, but that's basically the plan.
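As a rough illustration of that plan, here's a hypothetical sketch of the bit-vector layout; the schema, field names, and helpers are all invented for the example, not anything RethinkDB has specified. The idea: a fixed, schema-ordered field list, a presence bitmap up front, and only the non-default values packed after it.

```python
# Hypothetical sketch of packing a document behind a presence bitmap (~1 bit per field).
import struct

# Assumed fixed schema: (field name, struct format, default value), in a fixed order.
SCHEMA = [("id", "q", 0), ("age", "i", 0), ("score", "d", 0.0)]

def pack(doc):
    """Write a presence bitmap, then only the fields that differ from their defaults."""
    bitmap = 0
    payload = b""
    for i, (name, fmt, default) in enumerate(SCHEMA):
        value = doc.get(name, default)
        if value != default:
            bitmap |= 1 << i
            payload += struct.pack("<" + fmt, value)
    n_bytes = (len(SCHEMA) + 7) // 8
    return bitmap.to_bytes(n_bytes, "little") + payload

def unpack(blob):
    """Reverse of pack(): read the bitmap, then decode only the fields it marks present."""
    n_bytes = (len(SCHEMA) + 7) // 8
    bitmap = int.from_bytes(blob[:n_bytes], "little")
    offset = n_bytes
    doc = {}
    for i, (name, fmt, default) in enumerate(SCHEMA):
        if bitmap & (1 << i):
            (doc[name],) = struct.unpack_from("<" + fmt, blob, offset)
            offset += struct.calcsize("<" + fmt)
        else:
            doc[name] = default
    return doc

print(pack({"id": 42, "score": 9.5}))          # 1-byte bitmap + two packed values
print(unpack(pack({"id": 42, "score": 9.5})))  # {'id': 42, 'age': 0, 'score': 9.5}
```

With a 3-field schema the bitmap is a single byte, so the overhead works out to roughly the ~1 bit per field mentioned above; the hard part the reply alludes to is keeping such a packed layout consistent under concurrency and sharding, not the encoding itself.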
10gen have been thinking about compression, but nothing specific has happened yet (https://jira.mongodb.org/browse/SERVER-164). ZFS + compression is interesting, but ZFS isn't 'production' quality on Linux, and the last time I tried to get MongoDB running on Solaris I gave up...
The issue has been open for over two and a half years and is one of the most highly voted issues, yet it hasn't even reached active engineering status.
Agree with you that compression is just a workaround for the awful BSON format.