
What MongoDB got right - reqres
https://blog.nelhage.com/2015/11/what-mongodb-got-right/
======
s_kilk
> Let's start with the simplest one. Making the developer interface to the
> database a structured format instead of a textual query language was a clear
> win.

I think this is the most significant factor, by far. With Mongo it's turtles
(or at least Maps/Hashes) all the way down, without a strange pseudo-english
layer near the bottom that forces you to translate back and forth. For some
devs that's a big deal.

For the last while I've been experimenting with bringing the same feature to
PostgreSQL ([http://bedquiltdb.github.io](http://bedquiltdb.github.io)); it
turns out it's very do-able, but I don't have enough time to make it as
featureful as it needs to be.

~~~
collyw
SQL is still one of the most readable languages in my opinion. It's the one
language where I find it easier to read queries than write them.

~~~
jerf
SQL's fine on its own. The problem arises when you try to manipulate it. A
query to get messages that were sent between now and a week ago, joining in
the user table for both the sender and the receiver to get their full names,
is sensible enough when written out as SQL. I can quibble about some of the
affordances of SQL, but it's good enough for most queries. The problem is if
you try to create some sort of data that represents "get the message body",
"messages sent since last week", "get the sender's name", "get the receiver's
name", and somehow programmatically assemble a query out of the pieces.
(Slight quibble, this isn't actually about OO, this is about "anything that
isn't SQL". A Haskell library will have the exact same problem.) An AST-based
approach doesn't guarantee that this will be easy, but it's still a step above
trying to assemble an SQL query from those pieces.
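To make the shape of the problem concrete, here is a toy Python sketch (all names invented) of assembling a query from structured pieces rather than string fragments:

```python
# Each fragment contributes columns, joins, and filters; the SQL text is
# only produced at the very end, from the merged structure.

def combine(*fragments):
    """Merge fragments into one structured query description."""
    query = {"select": [], "joins": [], "where": []}
    for frag in fragments:
        for key in query:
            query[key].extend(frag.get(key, []))
    return query

def render(query, table):
    """Turn the structured form into SQL text."""
    sql = "SELECT " + ", ".join(query["select"]) + " FROM " + table
    for join in query["joins"]:
        sql += " JOIN " + join
    if query["where"]:
        sql += " WHERE " + " AND ".join(query["where"])
    return sql

body      = {"select": ["m.body"]}
last_week = {"where": ["m.sent_at >= datetime('now', '-7 days')"]}
sender    = {"select": ["s.full_name"],
             "joins": ["users s ON s.id = m.sender_id"]}

q = combine(body, last_week, sender)
print(render(q, "messages m"))
```

The small case looks trivial, which is exactly the trap: the pain starts once fragments carry their own subqueries, aliases, and ordering constraints, none of which this toy version handles.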

It's totally possible. There's a ton of libraries that do it in various ways.
It's just not even remotely _easy_ , like it should be. If you look into the
inside of those libraries you'll find a C'thulian monstrosity of special
cases interacting with special cases until the whole thing just explodes into
a brain-consuming mess, because SQL was very, very clearly not written for
this use case.

In another 10 or 20 years I look forward to data-based analysis that tries to
determine how much of the "NoSQL" movement was because the relational data
model doesn't work for all use cases, and how much of it was the entirely
accidental (in the Brooks sense) problems with A: SQL, the language itself,
not its capabilities and B: schema migrations with no _essential_ reason to be
as painful as they are. (And something something column stores, but I'm not
sure where they fit into this story exactly.) And to be clear on tone, I
really am interested. I'm pretty sure the answer won't be either extreme but
I'm pretty uncertain about where in the middle we'll fall.

~~~
virtualwhys
> and somehow programmatically assemble a query out of the pieces.

You can do this today, in various statically typed languages[1] (perhaps
dynamic languages as well, though the composition will likely be more ad hoc).

Agreed re: under the hood complexity, but that's more in supporting multiple
database engines (and related corner cases) than the transformation of Query >
AST > SQL.

[1]
[https://news.ycombinator.com/item?id=10525040](https://news.ycombinator.com/item?id=10525040)

~~~
jerf
Having written some of this code myself, I have to disagree; transforming SQL
fragments is legitimately frustratingly challenging. I have to admit I don't
know how to show a small example that captures the problem, though, because
this is one of those cases where the small examples always look easy. It isn't
until you're trying to support all of them at once that it is a problem.
Combining two fragments that each specify tables, joins, where clause filters,
and potentially subqueries with each of those recursively is nontrivial when
you get down to it.

It also _really_ doesn't help that "SQL is declarative" is basically a lie,
and it very frequently _totally matters_ which "synonymous" query you actually
throw at the database, thus eliminating a lot of the obvious clean answers in
any practically-useful library.

~~~
virtualwhys
Definitely not an easy task but the degree of difficulty depends on the
language.

Both Haskell and Scala, for example, have sufficiently powerful type systems
to allow for building up a typed query expression of arbitrary complexity,
which can then be deconstructed via pattern matching to assemble the sql
statement. Easy? Not at all, but very much doable, and incredibly
elegant...until you need to support various database engines and their
limitations/extended features; then the implementation hacks begin :\

Personally I think the work of Stefan Zeiger on the Slick library in Scala is
groundbreaking. Also, Wadler et al's recent-ish paper on a composable query
DSL in F# is worth checking out[1]

[1] [http://homepages.inf.ed.ac.uk/wadler/papers/yow/dsl-long.pdf](http://homepages.inf.ed.ac.uk/wadler/papers/yow/dsl-long.pdf)

~~~
jerf
You seem to persist in believing the problem is on the _input_ side. It's not;
it's on the _output_ side. My entire point is that the resulting SQL
generation code is what ends up quite hairy, because the way we have found to
separate concerns in 2015 and SQL are very, very different.

Let me put it this way... it is _precisely because_ a fluent, Haskell-native
SQL querying interface little resembles SQL in either syntax or usage that
there is the problem. It is precisely that these libraries have to exist _at
all_ that is the problem. If it were easy, these libraries wouldn't even exist,
or would be little more than drivers, but they're not just drivers... they do
a _lot_ of real work.

If SQL didn't stink by the standards of modern separation of concerns, we
wouldn't _need_ "groundbreaking" work!

Or, to put it another another way:

    
    
        $ git clone git@github.com:slick/slick.git
        $ cd slick
        $ cloc .
        ---------------------------------------
        Language   files  blank  comment  code
        ---------------------------------------
        Scala      261    3589   3844     23129
    

Cut down to just the Scala. 23 kilolines of Scala is a lot of Scala! This is
not an "easy" task. "Easy" would be something that just wrapped up the
existing syntax in a slightly more native form and would clock in somewhere in
the several hundred range.

~~~
virtualwhys
No, I'm saying input is easy, type system does virtually all the work. Output
is where the effort is spent (i.e. pattern match on query expression to
assemble the statement).

Why these libraries exist is because of string-ly typed programming; in the
case of SQL: 1) it doesn't compose; 2) it is not safe (SQL injection attacks);
3) it is difficult to refactor; 4) it is untyped, so a whole class of bugs
arises.

And yes, these libraries do a ton of work, well beyond just generating sql
statements, which, in the case of Slick pushes the LOC count way higher (non-
blocking IO, supports basically every database engine, native function
support, jdbc modeled in scala, etc., etc., it's a huge engineering effort,
somehow by one person).

Anyway, I'd like to see a much smaller composable query DSL with fewer
features and opt-in database support. Compiling that to JavaScript and running
it in the browser against a local database would be very interesting. I think
this can be done, but would probably be pretty restrictive in terms of
features supported.

------
bryanlarsen
"So while MongoDB today may not be a great database, I think there's a good
chance that the MongoDB of 5 or 10 years from now truly will be."

Either MongoDB will be, or other databases that have learned the lessons, both
good and bad, of MongoDB.

RethinkDB appears to have captured the "MongoDB done right" mindshare, and
PostgreSQL has gained JSON and is gaining better replication in order to cover
the same niches.

~~~
threeseed
> RethinkDB appears to have captured the "MongoDB done right" mindshare

Mindshare is irrelevant. MongoDB is killing it in the enterprise right now.
They have integration with Oracle, Teradata, Hadoop and countless partnerships
with other vendors. You can guarantee MongoDB will still be around in 20 years
the way it is positioning itself. Can't say the same about RethinkDB (as great
as it is).

> PostgreSQL has gained JSON and is gaining better replication in order to
> cover the same niches

The PostgreSQL replication story is pretty pathetic given how old/mature it
is. And I've seen nothing to suggest that anything is really improving in this
area. There is a range of add-ons, none of which is supported or built in.
Basic replication is confusing, the documentation is non-existent in parts,
and good luck getting any support.

You compare it to MongoDB (or really any of the newer NoSQL databases) and
it's like night and day. It takes minutes to set up a replica set and there is
plenty of documentation and official support for any issues.

~~~
virmundi
It really is killing it in the Enterprise, and I'm trying to do my part to
remove it. I'm at a client that wants to use MongoDB. It's on the approved
product list. They have little to no experience with it.

Every chance I get, I advocate ArangoDB. It also is Mongo Done Right. You get
joins, graphs and a thoughtful future plan from the ArangoDB team. To help bridge
the gap I've written an ArangoDB Hadoop connector [1]. Unlike the MongoDB one,
you can read and write.

I've also added better Clojure support to it: from a driver to a Ragtime
migrator.

Sadly as it stands Mongo has a better Ops story than ArangoDB. Until that
improves, I don't think that ArangoDB will make it into many Fortune 1000's
outside of some small prototype style applications. Maybe micro-services in
the enterprise will change this, but I don't think a large insurance company
wants to support multiple database standards in general, and definitely not
within a family.

1 -
[https://github.com/deusdat/guacaphant](https://github.com/deusdat/guacaphant)

~~~
tracker1
You should give RethinkDB a look... it has a great ops interface, and now that
it has automatic failover, is probably my first pick... I wanted to like
Arango, but they tend to lag behind in node support.

I happen to like MongoDB, warts and all.. that said, I would choose other
options over it, depending on the need.

~~~
virmundi
I did. At the time (haven't looked lately) RethinkDB didn't have GEO support
while ArangoDB did. Turns out that I don't really need it right now (different
project). I stay because it's a great community. The Devs watch StackOverflow
for questions. They are respectful and helpful in the user group.

I know that RethinkDB has a good rep in those areas too. It's just that
ArangoDB is a good general fit for what I need even now. I guess you could say
I came for the GEO, I stayed for the warm hearted underdogs that are the
ArangoDB community.

~~~
tracker1
That's funny, my first production use of MongoDB was because their geo support
was better than ElasticSearch, which at the time, I couldn't get working
correctly... I'm toying around with RethinkDB today for a hobby project,
liking it so far.

------
yummyfajitas
Counting arguments very carefully? Nearly every SQL library does this for you.

    
    
        cur.execute("INSERT INTO a (b,c) VALUES (%(b)s, %(c)s);",
            { 'b' : b, 'c' : c })
    

Also, SQL is typed, so even if you did fail to count arguments there is a good
chance you'd just detect it the first time you ran it.

The article acts as if treating the DB like native structures is somehow
innovative and new - it's not.
[https://en.wikipedia.org/wiki/Object_database](https://en.wikipedia.org/wiki/Object_database)

We mostly abandoned object databases because they sucked. SQL was a huge
improvement over them. SQL is a great way to organize and preserve the
integrity of a lot of business data.

It's also a fantastic way to avoid repeated trips to the DB:

    
    
        SELECT * FROM employees AS e
            WHERE e.department_id = (SELECT id FROM departments WHERE name = 'engineering');
    

In Mongo, I'm pretty sure you need to first lookup engineering, then lookup
the employees in engineering. That could be O(# employees in engineering)
queries rather than 1.
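The one-round-trip version is easy to show end to end with SQLite in memory (schema and data invented for the example):

```python
# One query: the subquery resolves the department id inside the database,
# so the application never sees the intermediate lookup.
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE departments (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE employees (id INTEGER PRIMARY KEY, name TEXT,
                            department_id INTEGER);
    INSERT INTO departments VALUES (1, 'engineering'), (2, 'sales');
    INSERT INTO employees VALUES (1, 'ada', 1), (2, 'bob', 2), (3, 'eve', 1);
""")

rows = db.execute("""
    SELECT e.name FROM employees AS e
    WHERE e.department_id = (SELECT id FROM departments
                             WHERE name = 'engineering')
    ORDER BY e.name
""").fetchall()
print(rows)  # [('ada',), ('eve',)]
```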

~~~
lloyd-christmas
> In Mongo, I'm pretty sure you need to first lookup engineering, then lookup
> the employees in engineering. That could be O(# employees in engineering)
> queries rather than 1.

The problem with that summary boils down to bad architecture. The point of
document storage is storage with purpose, the intent being to make querying
EASIER. This could easily be structured as a single query. You can structure a
document countless ways to represent that query, all of which would likely
differ based on the purpose of the app.

~~~
yummyfajitas
Whereas with SQL there is more or less a single canonical way to do it and
it's mostly independent of the app. I.e. the data design is minimally coupled
to the specific use cases.

Right now I'm building a data store and I _don't know_ the app(s) that are
going to be built on it.

It would be really great if computing could stop forgetting its history.
Object databases failed for a reason.

~~~
lloyd-christmas
That doesn't make sense to me. A single query is faster than a join. Designing
things for your application's purpose seems fairly tangible to me. It's
application dependent. I use both for different B2B businesses I work with.
One works very well with Mongo and would simply be slower with SQL. The Mongo
app has one point of access for writes, while everything else is reads. It's
near impossible to become inconsistent. The other application would be an
absolute shit-show if it used Mongo. I'd never sleep at night with a fear of
it failing. Nothing is black and white, and choosing the wrong technology
isn't a failure of the technology.

~~~
yummyfajitas
Obviously a single query is faster than a join, but it's very hard to
guarantee that you'll always be making that single query.

Data inconsistency is not about concurrency. See this example on Wikipedia
illustrating why 3NF is necessary:
[https://en.wikipedia.org/wiki/Third_normal_form#.22Nothing_b...](https://en.wikipedia.org/wiki/Third_normal_form#.22Nothing_but_the_key.22)

In that example, "Al Fredrickson" can potentially have 2 birthdays even in an
entirely single threaded app.
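A toy version of that anomaly (data approximated from the Wikipedia example) is easy to reproduce: duplicate the winner's date of birth per tournament row, then miss one update.

```python
# Denormalized table: winner_dob is repeated in every row a person wins,
# so one missed update gives the same person two birthdays -- no
# concurrency involved.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE tournament_winners
              (tournament TEXT, year INT, winner TEXT, winner_dob TEXT)""")
db.executemany("INSERT INTO tournament_winners VALUES (?, ?, ?, ?)", [
    ("Cleveland Open", 2004, "Al Fredrickson", "1975-07-21"),
    ("Des Moines Masters", 2005, "Al Fredrickson", "1975-07-21"),
])

# Correct the birthday in one row but forget the other...
db.execute("""UPDATE tournament_winners SET winner_dob = '1975-07-12'
              WHERE tournament = 'Cleveland Open'""")

dobs = db.execute("""SELECT DISTINCT winner_dob FROM tournament_winners
                     WHERE winner = 'Al Fredrickson'""").fetchall()
print(len(dobs))  # 2 distinct "birthdays" in a single-threaded program
```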

I'm really curious to hear a use case where Mongo was actually significantly
faster than Postgres. Could you give a toy example that illustrates the flavor
of the problem?

~~~
lloyd-christmas
One of our applications I jokingly call "mad libs". It's effectively a
document generator. A single point of entry for constructing the document
skeleton (admin). The data structure for it is recursive, while user data is
meta driven. Any necessary data on any given page is a single query. The only
directly related multiple-query is the skeleton construction on the admin
side. That part is fairly complicated (and swarmed with tests). There are
potential inconsistency issues which could only come about if the admin was
putting in a fair amount of effort into destroying their own app (spam
POST/PUT wouldn't even cause it). Since the possibility exists even in the
near-impossible, there are eventual consistency tasks running. But again, it
would be pretty impressive to actually get it inconsistent. Given that the
user side is meta driven anyway, there aren't any lasting effects on their
side. Had we gone with SQL, the query for the (recursive) skeleton would have
been 3 tables, one of which would be recursive. My use case is where reading
drastically trumps writing. This tends not to happen in the unicorn industry,
so I fully agree that it's silly in many _publicly discussed_ applications.
But many B2B applications can fit the use case.

~~~
Jweb_Guru
You didn't respond to the question about speed. Probably because Mongo has
been considerably slower than Postgres in virtually every apples-to-apples
comparison I've seen.

------
krisdol
I don't understand the recent backlash against NoSQL here.

First off, almost all of the complaints would have been valid years ago.
Secondly, there is so much more choice out there today if mongodb wasn't the
right answer for your project, and so many NoSQL stores have had time to
mature and get polished APIs and docs.

We use various data stores for different purpose across microservices, mostly
ES, couchbase, and datomic, and "use the right tool for the job" and "do one
thing and do it well" feels like the right approach to take. For most
applications, a SQL DB feels like a really big hammer that is put to a lot of
things that don't look like nails.

~~~
yummyfajitas
The main problem MongoDB solves is "I don't want to learn SQL". The backlash
is against this use case.

(This article certainly seems to be appealing to this use case, cf. "counting
arguments really carefully".)

~~~
bsg75
Alternatively, "I want to do everything in JavaScript", and not learn _any_
other languages.

A lot of recent "innovation" is mislabeled laziness.

------
rwmj
Just a note that in PG'OCaml (an OCaml interface to PostgreSQL), you _can_
write:

    
    
        "insert into foo (col1,col2,col3) values ($a, $b, $c)"
    

and it creates the safe prepared statement with ? placeholders. At compile
time. Type-checked against the database to make sure your program types match
your column types.

[http://pgocaml.forge.ocamlcore.org/](http://pgocaml.forge.ocamlcore.org/)

~~~
annnnd
I would be very careful with such SQL statements. I am guessing it relies on
some intrinsic field order? That could change anytime. The order of fields
shouldn't have any impact on your app, but I think in your case it does.

~~~
rwmj
The "..." wasn't literal. I have amended the post to make this clear.

------
ngrilly
I agree that the three areas outlined in the article are things that MongoDB
got right: a structured query language (instead of a textual query language),
replica sets, and the oplog.

But the lack of transactions over multiple documents (in the same shard at
least) and the lack of joins over multiple collections are a big showstopper
for the kind of applications I develop.

I note that solutions like YouTube's Vitess provide something similar to
MongoDB's replica sets.

I also note that PostgreSQL's logical decoding provides the same
functionality as MongoDB's oplog tailing.

~~~
progx
I always wonder what kind of simple apps most people must be writing if they
don't need joins.

I would be happy to get such simple tasks :)

~~~
threeseed
How exactly do you think eBay, GMail, Facebook etc. work? They aren't relying
on relational database joins.

If you want to write a truly scalable application you structure everything
such that you do joins in your application layer.

[http://highscalability.com/ebay-architecture](http://highscalability.com/ebay-architecture)

And in the case of MongoDB you avoid joins since it is a document database.
You embed data instead.
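An "application-layer join" in the small looks something like this (plain lists stand in for two separate data stores; names invented):

```python
# Fetch the two result sets independently, then stitch them in memory:
# index one side once, and do a hash lookup per row on the other.
users  = [{"id": 1, "name": "ada"}, {"id": 2, "name": "bob"}]
orders = [{"user_id": 1, "total": 30}, {"user_id": 1, "total": 5},
          {"user_id": 2, "total": 12}]

by_id = {u["id"]: u for u in users}           # index one side once
joined = [{"name": by_id[o["user_id"]]["name"], "total": o["total"]}
          for o in orders]                    # hash lookup per order row

print(joined)
```

The trade-off is that the application now owns consistency between the two fetches, which is exactly what the database's join would otherwise guarantee.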

~~~
ngrilly
Not everybody works at eBay, GMail or Facebook scale.

Most applications fit very well in a single server. For example, Stack
Overflow runs on a single instance of SQL Server, replicated to a slave in
another data center. In such a case, the convenience of joins and transactions
is priceless.

And even at scale, it makes sense to rely on joins and transactions. The
perfect example is AdWords that runs of F1 and Spanner:

"Our users needed complex queries and joins, which meant they had to carefully
shard their data, and resharding data without breaking applications was
challenging."

[http://static.googleusercontent.com/media/research.google.co...](http://static.googleusercontent.com/media/research.google.com/fr//pubs/archive/41344.pdf)

~~~
threeseed
Exactly. But many of us do have apps that are beyond the capabilities of a
single instance.

And in this situation the rule of thumb is to do joins in your application
layer so you can store different types of data (e.g.
graph/document/relational/unstructured) in different systems and easily cache
where needed.

The fact that so many new databases have been created in the last decade
suggests that there are a lot of people who do fall into this camp.

~~~
theseoafs
A lot of us have apps that are beyond the performance requirements of Stack
Overflow?

~~~
Pyxl101
Yes. It's not an especially exotic level of performance. Especially if you're
talking about applications that handle traffic from other machines, as opposed
to humans. As websites go, Stack Overflow is of reasonable size, but as
_systems_ go it's small.

They have about 7.5m visits per day to Stack Overflow. That's about 86
requests per second, which perhaps at peak is several multiples larger than
the average. I wouldn't consider that a particularly gargantuan website. That
sounds like traffic that you could service with a reasonable fleet of web
servers and database fleet, given caching. Quite a lot of actions on the site
can be applied with eventual consistency, I'd imagine.

[http://stackexchange.com/sites?view=list#traffic](http://stackexchange.com/sites?view=list#traffic)

I don't mean to say that the system or the problem that it's solving is
trivial - I am sure it is difficult to get right. As websites go, it might be
large, but as _systems_ go it's not particularly high traffic among systems
that receive traffic from machines rather than humans. Imagine that you
operate a data center, and you want to sample CPU, memory, etc. from your
machines every minute. If you collect 50 samples per machine per minute, and
you have about 103 machines, you'll be handling 86 samples per second. Storing 86
samples per second into a time series database is probably considerably easier
than SO's website rendering, but it goes to show that high-traffic or high-
frequency systems are common in companies beyond small to medium size. It is
easy for cross-cutting concerns like this to have _massive_ request volumes,
far greater than the human-generated traffic to any website.

~~~
theseoafs
7.5m visits per day. What does that mean, page loads? Stack Overflow isn't a
static website. One page load is a lot of requests to the service. Stack
Overflow is a very dynamic site, and a lot of requests are made after you
actually load the page. I'm not sure you're accurately characterizing the kind
of load that Stack Overflow is subjected to.

To anyone else reading: no, your use case probably isn't so special that the
solution Stack Overflow arrived at just doesn't work for you.

------
bsg75
> You can argue, and I would largely agree, that this is actually part of
> MongoDB's brilliant marketing strategy, of sacrificing engineering quality
> in order to get to market faster and build a hype machine, with the idea
> that the engineering will follow later.

Author nearly lost me here with this logic. Placing Marketing ahead of quality
in something that is supposed to store a _very_ valuable asset (data) is near
insanity.

I get the mindset of "break fast", "release often", etc. in terms of customer
facing _features_ , but in something that is supposed to be a core part of
your foundation, stability is of utmost importance. Otherwise nothing else
works - and you lose customers, business, opportunities - because you can't
look them up later.

It's not "brilliant marketing", it's just marketing.

~~~
smacktoward
This is all true, but the success of MySQL shows pretty clearly that just
because something is insane doesn't mean it's not good business.

~~~
bsg75
I think the success of MySQL is due to there being fewer options for a period
of time (the "dot.com boom"), and thus it became a popular choice to avoid
commercial RDBMS costs.

I'm no MySQL fan when things like PostgreSQL are an option, but its probably
more sane than some other currently popular choices.

------
emilburzo
I have to agree with the author, especially since the points he raises are the
ones that helped me greatly on my first "serious" personal project[1].

Coming from PostgreSQL land, I would never have thought you could have such
great replication with automatic failover. I've had literally 100% uptime for
the past year.

And that's on commodity servers (one of them being in a room in my apartment,
the other two in a proper datacenter) going through the usual upgrades,
downtime, reboots, going from mongo 2 to mongo 3 and such.

Speaking of which, the migration from mongo2 to mongo3 was another pleasant
surprise: they've made it backwards compatible. So I could do the upgrade on
the servers, one by one, checking everything was ok and after that I could
focus on updating the drivers and rewriting the deprecated queries, no need to
have everything ready at once.

The accessible oplog was another gem that fit my project really well. Gone was
the need to poll the database, I could just "watch" the oplog. That, coupled
with long polling on the browser side meant I'd have very little chatter
between the db/server/web client when idle. Websockets would have been nice,
but adoption wasn't high enough that I'd be comfortable going forward with it.

And all this considering MongoDB was my first NoSQL experience.

I agree it doesn't fit every project, but when it does, it's a really nice
experience.

[1] [https://graticule.link/](https://graticule.link/)

~~~
ngrilly
I agree that MongoDB has a great replication story.

But I don't understand that part:

> The accessible oplog was another gem that fit my project really well. Gone
> was the need to poll the database, I could just "watch" the oplog.

Coming from PostgreSQL, you could do the same using LISTEN/NOTIFY?

~~~
emilburzo
> Coming from PostgreSQL, you could do the same using LISTEN/NOTIFY?

I have to admit I was not aware of this feature.

However, from the docs[1]:

> Commonly, the channel name is the same as the name of some table in the
> database, and the notify event essentially means, "I changed this table,
> take a look at it to see what's new".

From what I understand, you just know that _something_ has changed, the actual
change is not included in the event, so you need at least another query to see
what changed.

Did I understand correctly?

In MongoDB you get the operation (insert, update, delete), the document and
another few details right in the event.

[1] [http://www.postgresql.org/docs/9.4/static/sql-notify.html](http://www.postgresql.org/docs/9.4/static/sql-notify.html)

~~~
ddorian43
When you notify, you can also include a payload (ex json) which can be
whatever you want.
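For instance, a sketch of a trigger that pushes the changed row as JSON into the payload, so listeners get the data without a follow-up query (the `messages` table name is invented; `pg_notify`, `row_to_json`, and `json_build_object` are stock PostgreSQL):

```sql
-- Fire on every insert/update and notify listeners on the 'table_change'
-- channel with the operation and the affected row as JSON text.
CREATE OR REPLACE FUNCTION notify_change() RETURNS trigger AS $$
BEGIN
  PERFORM pg_notify('table_change',
                    json_build_object('op', TG_OP,
                                      'row', row_to_json(NEW))::text);
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER messages_notify
  AFTER INSERT OR UPDATE ON messages
  FOR EACH ROW EXECUTE PROCEDURE notify_change();
```

The payload is plain text and size-limited, so very large rows would still need a follow-up query.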

~~~
anentropic
I looked into this before and the Postgres docs say somewhere the payload size
is limited and intended to be small metadata... in some cases you will not be
able to fit the contents of the update into it.

~~~
danneu
Looks like the default limit is 8000 bytes.

------
_yy
RethinkDB took all the good parts of MongoDB and added proper engineering.

[https://www.rethinkdb.com/](https://www.rethinkdb.com/)

~~~
ngrilly
But still no transactions over multiple documents (at least in the same
shard)?

~~~
jmakeig
ACID transactions in a highly available distributed system are hard and often
fail in subtle ways when done wrong at the edges. Any implementation will take
years to mature in the lab and in actual production usage. This isn’t a knock
on the Rethink guys; their product looks pretty awesome and is moving quickly.

For a solution today, MarkLogic is a transactional distributed document
database. Cross-document and cross-partition transactions have been a key
tenet of the architecture from the beginning (like, 2002 beginning). Take a
look at [https://developer.marklogic.com/blog/how-marklogic-supports-...](https://developer.marklogic.com/blog/how-marklogic-supports-acid-transactions)
for details.

Full disclosure: I’m a Product Manager at MarkLogic.

------
angelbob
I love the point about the Oplog.

There are a few equivalents for common SQL DBs (see LinkedIn's Databus for
Oracle and MySQL), but in general, getting access to the write log is really
hard. Even though it's sitting there!

It would be wonderful if there were some kind of established API or library
that would let you parse the MySQL write log without doing hideous, fragile
operations that change from version to version. Sure, change the format, but
at least version and document it!

------
sriku
When we chose MongoDB for a project, a dominant criterion was out of the box
geo queries. It helped that the storage and query approach had good impedance
match with NodeJS. From a query perspective, we wouldn't have benefited much
from SQL anyway, since much of the reading is free text or social graph or
location based search which we moved to Solr.

------
franzwong
It has become much simpler to set up replication in PostgreSQL than before.

reference: [https://www.digitalocean.com/community/tutorials/how-to-set-...](https://www.digitalocean.com/community/tutorials/how-to-set-up-master-slave-replication-on-postgresql-on-an-ubuntu-12-04-vps)

