
Why you should never use MongoDB (2013) - wheresvic1
http://www.sarahmei.com/blog/2013/11/11/why-you-should-never-use-mongodb/
======
chatmasta
My first introduction to databases was with PHP/MySQL, where normalization was
the name of the game. The whole point of normalization is that there is _no_
duplication of data anywhere. If it's possible for duplicate data to exist,
that's a symptom of a design flaw in the schema.

I've been using Mongo recently, and every time I raise criticism of it, the
counterargument I hear is "forget about normalization! Duplication is okay."

Like the author, I really cannot wrap my head around this. While I understand
that duplicating documents across collections may make querying faster, what
about when you want to change the document? You need to propagate the change
across every duplicate of the document in every collection where it exists.
This means that any "de-duplication" logic needs to happen at the application
level, rather than the database level.
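
To make the update problem concrete, here is a minimal sketch of application-level "de-duplication" (collection and field names are invented for illustration): two in-memory "collections" each embed a copy of the same user, and a helper has to fan every change out by hand.

```python
# Two "collections" that each embed a copy of the same user document.
streams = [
    {"stream_owner": {"id": 1, "name": "Joe"}, "posts": []},
]
likes = [
    {"post_id": 99, "liker": {"id": 1, "name": "Joe"}},
]

def rename_user(user_id, new_name):
    """Application-level 'de-duplication': every embedded copy of the
    user must be found and rewritten by hand."""
    for stream in streams:
        if stream["stream_owner"]["id"] == user_id:
            stream["stream_owner"]["name"] = new_name
    for like in likes:
        if like["liker"]["id"] == user_id:
            like["liker"]["name"] = new_name

rename_user(1, "Joseph")
print(streams[0]["stream_owner"]["name"], likes[0]["liker"]["name"])
```

Every new collection that embeds user data grows this function; forget one and the copies silently diverge, which is exactly the propagation burden a normalized schema pushes down into the database.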

Parse.com provided an interesting abstraction around that with "cloud code"
and beforeSave and afterSave triggers. To maintain consistency, they encourage
propagating updates to a document within its collection's beforeSave function.
So if a document changes in one collection, you write code to change all
instances of the document in other collections. That's nice in that it almost
_feels_ like you're writing the deduplication logic in the database layer,
because you can view the "beforeSave" and "afterSave" functions as extensions
of the schema. As long as those functions are up to date, the schema and any
pseudo-linked documents will stay up to date.

But I really don't buy it. The strengths of mongo encourage a design that
necessitates complexity for any significant write operation.

I think the real issue is that "if you have a hammer, everything looks like a
nail." Mongo and other NoSQL stores have some real use cases, but people who
are more familiar with Mongo than RDBMS are too trigger-happy to employ it as
a solution to problems where an RDBMS is the clear choice.

~~~
lmm
> Like the author, I really cannot wrap my head around this. While I
> understand that duplicating documents across collections may make querying
> faster, what about when you want to change the document? You need to
> propagate the change across every duplicate of the document in every
> collection where it exists. This means that any "de-duplication" logic needs
> to happen at the application level, rather than the database level.

You're assuming that you need to change things. Almost all big-data approaches
work better when you abandon that and go for a log-structured model where you
only ever append.
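
A minimal sketch of the log-structured model lmm describes (event shapes invented for illustration): nothing is ever updated in place; instead you append immutable events and derive the current state by replaying the log.

```python
# Append-only event log: records are never updated or deleted.
log = []

log.append({"type": "user_created", "id": 1, "name": "Joe"})
log.append({"type": "user_renamed", "id": 1, "name": "Joseph"})

def current_name(user_id):
    """Derive the latest state by folding over the log. Duplication is
    harmless here because no copy is ever edited in place; later events
    simply supersede earlier ones."""
    name = None
    for event in log:
        if event["id"] == user_id and "name" in event:
            name = event["name"]
    return name

print(current_name(1))  # Joseph
```

In a real system you would snapshot or index the fold rather than replay from the start, but the point stands: an append-only model sidesteps the "propagate this change to every copy" problem entirely.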

> I think the real issue is that "if you have a hammer, everything looks like
> a nail." Mongo and other NoSQL stores have some real use cases, but people
> who are more familiar with Mongo than RDBMS are too trigger-happy to employ
> it as a solution to problems where an RDBMS is the clear choice.

True enough, but I think the converse is more true. If you're using a
traditional RDBMS you're accepting a big series of constraints in exchange for
functionality you often don't use. Indices are updated synchronously on every
insert, slowing your writes, for the sake of transactional guarantees that
most applications aren't written to take advantage of. Queries have to be
passed over the wire as strings, so your application will spend a significant
chunk of its time building them (hopefully using a library without
vulnerabilities to send them over the wire so the database can spend a
significant chunk of its time parsing them), or else you get to deal with the
db-specific and per-connection quirks of prepared statements, for the sake of
supporting ad-hoc querying in a language that frankly isn't great for humans.
Tables, materialized views and the like occupy this awkward in-between
condition where it's not clear whether you're supposed to create and modify
them ad-hoc (and manipulate them programmatically) or not; if you really do
need to do ad-hoc reporting then they're what you want, but often the
performance implications of that make it unacceptable to do in production.

I don't think a lot of cases are good fits for RDBMS. If you don't need ad-hoc
reporting then you're better off building a data processing pipeline where you
produce your results directly rather than the sort of semi-aggregating you end
up with in an RDBMS. If you don't need full ACID then it's not worth paying
the performance cost of it. And I don't know why there aren't RDBMSes with
better query languages and schema definition languages, but there aren't.

~~~
breischl
>>Queries have to be passed over the wire as strings, so your application will
spend a significant chunk of its time building them... or else you get to deal
with the db-specific and per-connection quirks of prepared statements...

Or you could use stored procedures. Think of it as making your RDBMS into a
microservice, if you like.

>>I don't know why there aren't RDBMSes with better query languages and
schema definition languages

Huh... I always thought SQL was pretty
straightforward for most cases. It only gets really arcane when you get into
advanced cases (eg, recursive queries) and/or vendor extensions. It does take
a bit of a mental shift, but no more so than imperative-style to functional-
style, IMO.

There are certainly tradeoffs to using an RDBMS. But I think a lot of people
jump to the conclusion that they have a big-data situation when they really
don't. I also think a lot of people underestimate the value of very easy ad-
hoc querying and already-solved backup/restore.

~~~
lmm
> Or you could use stored procedures. Think of it as making your RDBMS into a
> microservice, if you like.

The trouble is testability, versioning, deployment, and SQL just not being a
pleasant language for expressing business logic. And the lack of a library
ecosystem. And poor IDE support. And...

~~~
breischl
>>SQL just not being a pleasant language for expressing business logic

No kidding - don't do that. Sprocs are great to present a query interface and
avoid sending queries over the wire every time. They can be handy to decouple
the actual storage architecture from the query interface, so you can do tricky
stuff in SQL without screwing up the clients. That doesn't mean you should
stick your business logic in there.

>>testability, versioning, deployment... library ecosystem...

And how would any of those problems be solved by using raw SQL strings or a
document DB? That just moves the problems back into the schema or the data,
where it's even harder to deal with.

>>poor IDE support

So your code editor has a better IDE for SQL? Or are we talking about a
document DB? I haven't seen any of them that were anywhere close to SSMS.

~~~
lmm
> And how would any of those problems be solved by using raw SQL strings or a
> document DB? That just moves the problems back into the schema or the data,
> where it's even harder to deal with.

I'm arguing for using a non-SQL interface. Either a structured binary query
protocol (which leaves you shipping a lot of data around sure, but at least
removes the constructing-and-parsing SQL overhead), or a map-reduce style
setup where you can run your queries where the data is but in a first-class
programming language.
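
As a sketch of that second option (document shape and function names invented for illustration), "running your queries where the data is, in a first-class language" can be as plain as an ordinary map and reduce over documents, testable with normal tooling:

```python
from collections import defaultdict

docs = [
    {"user": "joe", "likes": 3},
    {"user": "jane", "likes": 5},
    {"user": "joe", "likes": 2},
]

def map_phase(doc):
    # Emit (key, value) pairs, as a map-reduce framework would.
    yield doc["user"], doc["likes"]

def reduce_phase(pairs):
    # Combine all values for each key.
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)

pairs = [pair for doc in docs for pair in map_phase(doc)]
print(reduce_phase(pairs))  # {'joe': 5, 'jane': 5}
```

The query here is just code: it can be unit-tested, versioned, and profiled like anything else, with no SQL string to build on one side and parse on the other.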

~~~
breischl
It seems to me that you're advocating changing implementation details (binary
query protocol instead of SQL, map-reduce queries instead of sprocs) and
claiming that it is somehow fundamentally different. But I don't see how that
would be the case.

How is a binary query protocol fundamentally different than calling a sproc?

How is a map-reduce job in Javascript any different on the attributes you
mentioned than ad-hoc SQL or a sproc?

Obviously map-reduce is a fundamentally different approach to data processing,
but it seems to have similar traits in regards to source control, testability,
etc.

~~~
lmm
Binary query protocol is just a tweak, but it's a tweak that most RDBMSes are
missing, and it matters for some workloads.

Switching to a model where you supply your own map-reduce or pipeline is a
real shift, I think, from the DB as a framework that manages querying for you
to more of a library/toolkit you can use to write your own computations. The
indexed tables model is an incredibly effective compromise, but it's still a
compromise - if you know specifically what you need to do, you can do it
better.

Running a first-class programming language with full support for usual
programming language tools is a major difference for developability,
testability, libraries and so on. Deployment model depends on the datastore -
plenty of non-SQL ones have room for improvement here - but I think the
traditional RDBMS still has the worst of it.

~~~
breischl
Huh, interesting. I wonder if this is down to us having used different
RDBMSes?

Most of my experience has been with SQL Server, which I think does use a
(mostly) binary wire protocol, at least when calling sprocs. It also has a
pretty good query optimizer - I've spent a lot of time trying to beat it with
hand-tuned queries and only come up with something better about 50% of the
time. And there are a lot of tools for source controlling the DB scripting and
testing everything. It's not hard to do unit test style runs that set up and
tear down tables and DBs.

Has that not been your experience?

~~~
lmm
> which I think does use a (mostly) binary wire protocol, at least when
> calling sprocs

Hmm. I'm used to invoking stored procedures via "select mysproc(param1, param2,
...)" which still has to be formed into a string and then parsed on the DB
side using the arbitrary-SQL parser (because there's no way for the DB to know
a priori that it's not a "regular" query). Does SQL server have some special
case binary protocol for invoking them directly?

> It also has a pretty good query optimizer - I've spent a lot of time trying
> to beat it with hand-tuned queries and only come up with something better
> about 50% of the time.

The query planner can usually run the best query possible with the indices
that exist, sure. But you have to fit your calculation into this model of
indices that are built on insert and everything else happening at query time.
Or you go down the route of lazy materialized views that make use of other
lazy materialized views, specifying indexing strategies... I mean I think you
can ultimately express any data processing pipeline in a RDBMS if you try hard
enough (though you have to use database-specific features that tend to be
less-well supported by the ecosystem), but at some point it's easier to just
have a first-class programming language that has access to the data, and write
the code that you want to run.

> And there are a lot of tools for source controlling the DB scripting and
> testing everything. It's not hard to do unit test style runs that set up and
> tear down tables and DBs.

Up to a point. It's easy to end up with "unit" tests that take a second for
each test, which means it's not really practical to get good coverage of logic
there.

In terms of source controlling and so on I guess the big problem is that you
now have a distributed system written in two quite different technologies. So
you've got to figure out a release and deployment process that handles both,
and a lot of shops don't seem to bother. If you're already running a multi-
language microservice architecture then this is probably a lot less of an
issue.

~~~
breischl
TBH I've never really looked at how TDS (the SQLServer wire protocol) works.
From a code level, in ADO.NET you would create a query object, set the type to
"sproc", set the name of the sproc, and attach parameters as objects. I bet
the sproc name is going across as text, and maybe the param names, but that's
probably it.

>>So you've got to figure out a release and deployment process that handles
both, and a lot of shops don't seem to bother.

Yeah, that's true. Setting up something along the lines of Rails' migrations
isn't really _that_ hard, but many don't bother.

------
ThePhysicist
The post should probably be titled "Why you should not pick a technology based
on hype and without evaluating it first". While MongoDB has (and probably
still has) some flaws and is not the perfect DB system, there are valid use
cases where it can be a good choice. Building a social networking site that
requires rich queries along a relationship graph is most definitely not one of
them.

~~~
enraged_camel
The HN consensus seems to be that MongoDB is good only for toy projects, and
that you should switch to a _real_ database as soon as things start getting
more complex.

~~~
rch
Actually, when you have a toy project that needs an RDBMS, go ahead and use
SQLite.

The consensus is that relatively few applications are a good fit for MongoDB,
irrespective of scale.

~~~
alexchantavy
What's an example of an application that is a good fit for MongoDB?

~~~
kolme
I actually don't have any experience, but I heard this from wise people:

If you have a file store service, where you're storing relatively big files
(big images, videos, etc.) and some metadata associated with them which needs
to be queried.

It's easier to implement than "traditional" solutions (metadata goes in a
RDBMS and actual files go to some directory, NAS, or whatever).

Also, apart from being a nicer solution from a programmer's perspective,
you'll be able to easily scale horizontally because it supports sharding.

Like I said, I haven't tried this myself, so take it with a grain of salt. Also,
I don't know how that compares to storing blobs in a RDBMS.

~~~
KMag
> Also, apart from being a nicer solution from a programmer's perspective,
> you'll be able to easily scale horizontally because it supports sharding.

Sharding file storage is pretty easy.

On the other hand, I'm not familiar with the MongoDB APIs, but I assume it
doesn't support handing a socket file descriptor from the webserver over to
MongoDB so Mongo can sendfile(2) the data directly from the kernel's page
cache to the TCP socket for locally resident data.

With the files stored directly on disk and metadata in an RDBMS, your
webserver can sendfile(2) those files that are permanently stored locally or
cached to local disk, and act as a proxy for other shards. Extra context
switches and copying your data one or two times more than necessary can add up
quickly.
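
The zero-copy path KMag is referring to can be sketched with Python's binding to sendfile(2). A local socket pair stands in for the client's TCP connection; this illustrates the syscall, not how any real webserver or MongoDB is wired up.

```python
import os
import socket
import tempfile

def serve_file_zero_copy(sock_fd, path):
    """Ship a file to a connected socket with sendfile(2): the kernel
    moves pages straight from the page cache to the socket, with no
    round trip through userspace buffers."""
    with open(path, "rb") as f:
        size = os.fstat(f.fileno()).st_size
        sent = 0
        while sent < size:
            sent += os.sendfile(sock_fd, f.fileno(), sent, size - sent)
    return sent

# Demo: a socket pair stands in for the client's TCP connection.
server_sock, client_sock = socket.socketpair()
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"bytes from the page cache")
    path = tmp.name

sent = serve_file_zero_copy(server_sock.fileno(), path)
server_sock.close()
received = client_sock.recv(4096)
os.unlink(path)
print(sent, received)
```

A store that insists on owning the bytes and handing them back through its own wire protocol forfeits this path, which is the extra-copies cost described above.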

------
geophile
More than a story about MongoDB, this is a story about hype. The
unquestioning, unthinking acceptance of some technology just because someone
enjoying his 15 minutes of fame has tweeted about it. The designers of
Diaspora set out to build a distributed database that needed to support
complex queries and transactional updates. Only they didn't realize it. And
they chose MongoDB because of hype. And of course it failed.

TFA says: "In 2010, when the Diaspora team was making this decision, Etsy’s
articles about using document stores were quite influential, although they’ve
since publicly moved away from MongoDB for data storage. Likewise, at the
time, Facebook’s Cassandra was also stirring up a lot of conversation about
leaving relational databases. Diaspora chose MongoDB for their social data in
this zeitgeist. It was not an unreasonable choice at the time, given the
information they had."

Yes, it was an unreasonable choice. This is basically saying that the cool
kids are using document stores, so we should too. There is no discussion of
requirements, no recognition of the obvious conflict between Diaspora's data
model and MongoDB's, and no discussion at all of the need for transactions,
let alone distributed transactions!

TFA says: "In this post I’ve talked about how we used MongoDB vs. how it was
designed to be used. I’ve talked about it as though all that information were
obvious, and the Diaspora team just failed to research adequately before
choosing. But this stuff wasn’t obvious at all."

Of course it was obvious. If you have a highly recursive data structure, and
you decide to basically store the result of a seven-way join in a single
document, then there is duplication. You should know, just from your earliest
programming days, that duplicating data in data structures leads to grief. How
are updates going to be handled? And if you decide against duplication, you
have to do the joins, and clearly MongoDB doesn't solve that problem. So even
before worrying about transactions it is _obvious_ that you have a problem.

Finally, they migrate to MySQL and (then also? in addition?) Postgres.
Although I don't quite see how this fits their database per pod architecture.
I'm guessing that each pod talks to -- what -- one centralized database? Maybe
they will eventually figure out what's wrong with that.

Oy.

~~~
enraged_camel
>>Yes, it was an unreasonable choice. This is basically saying that the cool
kids are using document stores, so we should too. There is no discussion of
requirements...

You hit the nail on the head. I posit that this problem is caused by the Agile
mindset that says "we don't need to collect or discuss requirements upfront.
They change too often! Instead, we will just start hacking and change things
as we go along." Two years later, the team is running into all kinds of
complex, hard-to-reproduce problems, and they realize that they have made a
terrible mistake picking MongoDB and they have to undertake a very painful
migration and rewrite of the backend.

~~~
geophile
Yes, I agree, it is easy to see how Agile could lead to this particular,
disastrous, dead end.

------
yongjik
> Here we have copies of user data inlined. This is Joe’s stream, and it has a
> copy of his user data, including his name and URL, at the top level. His
> stream, just underneath, contains Jane’s post. Joe has liked Jane’s post, so
> under likes for Jane’s post, we have a separate copy of Joe’s data.

> You can see why this is attractive: all the data you need is already located
> where you need it.

That doesn't sound attractive at all. That sounds more like a recipe for "we
use five hundred times more RAM than the deduplicated data that we actually
have" type of disaster.

...All for a production data set which could be "turned into about 1.2 million
rows in MySQL." I think the moral is "If you think your data is too big for a
relational database, you're almost certainly wrong."

------
electricEmu
This article has little to do with the specific failings of MongoDB at all.
The author takes issue with document databases and denormalization. I
disagree with the blanket statement.

Denormalization. It's not always the answer, but sometimes making reads easy
and writes more difficult isn't bad for the problem set. The funny thing is a
social network at scale isn't going to be able to use a traditional SQL
database without sharding it and killing joins. It will be a document DB with
limited SQL features.

As for Mongo, it is sold as the answer to everything just like any other
product. It actually fits some problem sets. I had great success
with it, at scale, in an Amazon division. It ain't the answer for a lot of
problem sets though.

I give this blog post two of five for bringing up a mildly interesting topic
HN already beat into the ground, with misguided conclusions.

------
manishsharan
Not this FUD again! It is true that MongoDB created a lot of unexpected
problems for the early adopters who did not do enough due diligence and
testing. I encountered them during my POC with MongoDB, but I was able to get
past them easily thanks to a robust user community. I have experienced no such
issues with the current version of MongoDB.

Today I am fighting my enterprise bureaucracy to get MongoDB for our
enterprise applications. FUD like this allows armchair architects to quote
this article to spook the management, and we are stuck with stuffing JSON into
RDBMS tables.

~~~
hitchhiker999
Finally, a great comment - had to scroll too far to get to this!

1) The developers clearly weren't experienced enough to know what document
storage is and when/where it shines/fails.

2) The author jumped on the ever-boring 'i hate mongo' easy train.

3) I am so sick of technology becoming a religious war.

------
mack73
At this point in time, making a statement about MongoDB on HN should almost be
considered trolling. We already know about the shortcomings of that codebase.
It will work fine until it doesn't and you'll lose some of your data.
Some folks are using it and enjoying it. Others aren't. "You shouldn't use
it" is like saying you shouldn't use javascript for things other than UI.
No one cares anymore.

~~~
brandur
To be fair, I think your comment shows why this sort of post _can_ still be
useful. You're very unlikely to lose data on a modern Mongo system using
default configuration (their troubles with durability have largely been
solved), but there are many other good reasons not to use it, and it's
enlightening to read about and understand what they are.

~~~
mack73
If "their troubles with durability have largely been solved" were true then
I'm sure HN would be flooded with posts about "MongoDB solves durability with
an entirely new architecture". I might be wrong here. How did they solve the
"your data is lost when a disk fails"?

~~~
brandur
I'm talking about "durability" here in the context of the "D" in "ACID".

Previously, Mongo had very serious problems in that area because its client
would assume that any message sent to the outgoing socket buffer was persisted
"well enough", which was an obvious untruth [1]. This was also one of the
somewhat underhanded techniques they used to achieve their early benchmarks.

As of version 3, they have defaulted their client's "write_concern" value to
"1", which means that it will wait for confirmation from a replica set's
primary before considering a value persisted [2]. This puts Mongo roughly on
the same level as any other database in terms of durability guarantees.

Disk failure is entirely tangential to your original premise that "Mongo loses
data". I'm not getting into it, but there are a variety of techniques that
Mongo (and every other known database) can use to protect against that.

[1] [http://hackingdistributed.com/2013/01/29/mongo-ft/](http://hackingdistributed.com/2013/01/29/mongo-ft/)

[2] [https://docs.mongodb.com/manual/reference/write-concern/](https://docs.mongodb.com/manual/reference/write-concern/)

~~~
mack73
MongoDB was initially designed to beat other nosql systems in benchmarks, is
what I take away from reading about it. Someone took issue with that and wrote
an article. "MongoDB lies" and "is slow" were some of the claims made from
your link #1.

Now that these issues are gone, by having reasonable default settings for
write concern and journaling, how well does MongoDB do in the benchmarks
today?

Regarding default settings, how is "w=1" considered a safe write? The data
exists in a single node and has not been propagated. If you only have one node
then I guess it's as safe as can be. Is MongoDB suitable as a single node
installation though? I would have thought "w=2" or "w=majority" would be the
"safe" setting.
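
The distinction between these settings can be sketched with a toy replica set where a write succeeds only once `w` nodes have acknowledged it. This is an illustrative model, not the real driver API; names and shapes are invented.

```python
class ReplicaSet:
    """Toy model of write-concern semantics: a write is acknowledged
    once at least `w` replicas have persisted it."""

    def __init__(self, nodes):
        self.logs = {name: [] for name in nodes}

    def write(self, doc, w=1, reachable=None):
        # Persist to every reachable node; succeed only if `w` acked.
        if reachable is None:
            reachable = list(self.logs)
        acks = 0
        for name in self.logs:
            if name in reachable:
                self.logs[name].append(doc)
                acks += 1
        return acks >= w

rs = ReplicaSet(["primary", "s1", "s2"])
# w=1: acked by the primary alone -- the write "succeeds" even though
# it is lost if the primary dies before replicating.
print(rs.write({"x": 1}, w=1, reachable=["primary"]))  # True
# w=2 (a majority of 3): the same isolated write is not acknowledged.
print(rs.write({"x": 2}, w=2, reachable=["primary"]))  # False
```

This is why "w=1" only means "one node has it": the durability you get is a function of how many independent copies must confirm before the client is told the write succeeded.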

~~~
brandur
> Now that these issues are gone, by having reasonable default settings for
> write concern and journaling, how well does MongoDB do in the benchmarks
> today?

Reports differ by benchmark, but the answer can be summarized as "not well".
From [1] above:

> MongoDB is now a lot slower compared to v2.0. On the industry-standard YCSB
> benchmark, MongoDB used to be competitive with Cassandra, as seen in the
> performance measurements we did when benchmarking HyperDex. Ever since the
> change, MongoDB can no longer finish the entire benchmark suite in the time
> allotted.

I'm not sure I'd call what they were doing "cheating" per se because I
honestly don't think they understood what they were doing, but it's fair to
say that even if performance has improved since those benchmarks were run,
Mongo definitely doesn't have any secret sauce.

------
franciscop
Is it a joke?

> Error establishing a database connection

Or is it truly an error, on a page whose very title bashes another DB?

~~~
KirinDave
Unfair comparisons aside, no one is "bashing" MongoDB. These criticisms have
been levied for years (as the article demonstrates by its byline). These
complaints have been levied not just because we should expect more reliability
from corporate-backed database products, but because _they run counter to the
advertising and technical literature_ that 10gen produced.

It is not "bashing" to say, "I do not like this and by the way, it's unsafe."
Especially when the Mongo project has been so resistant to fixing these core
issues and instead are happy to sell expensive consulting and training on
workarounds for the issue rather than addressing their core technology's
issues.

~~~
franciscop
Sorry, I didn't read the article due to the error, and I truly wasn't sure if
it was an error or a joke; so from my limited knowledge I wrongly assumed it
was an article _bashing_ MongoDB.

------
avitzurel
I think it's ironic that the page says "Error establishing a database
connection".

As someone who used MongoDB extensively in the past (8TB+ of data) and also
managed the devops side of things, I can tell you straight out that MongoDB
has a place in a lot of startup stacks.

I would likely _not_ use it as a main source of truth for any application.
However, for a lot of things, it's a good database.

Since I can't read the post I can't really address the points made in it, so
at this point enough said.

~~~
devishard
> I would likely not use it as a main source of truth for any application but
> for a lot of things, it's a good database.

I really don't understand this. In what case is it ever acceptable for a data
store to lose data? And that's not even MongoDB's only problem: it leaks
memory!

~~~
avitzurel
I have to say that I did not experience a single data loss that was a result
of the database misbehaving.

Memory leaks weren't a huge issue for us either. After stabilizing the setup,
it was basically a fire-and-forget part of the stack for us.

The role of MongoDB was to act as a fast-insert and aggregation framework for
other parts of the system.

So, we would insert BIG amounts of data at a time and aggregate it into a K/V
store where we pulled the data from.

After a while, the setup became too expensive to run, at which point we turned
it off in favor of a cheaper solution, but in terms of functionality, it worked
pretty well.

~~~
mack73
Fire-and-forget is awesome for when you do not care at all about your data.
I'm sure those types of writes are super (duper) fast. What exactly is the use
case for that? Serious question.

~~~
vidarh
Don't know about the guy you replied to, but e.g. consider any application
that regularly crawls feeds, APIs, etc., where the data is rapidly changing and
only a portion of the data is necessary to give good output. There are lots of
applications like that where you just need "enough" data to give good results
and/or where any loss will auto-heal next time you crawl the original source.

~~~
mack73
That makes sense to me. If your MongoDB cluster under pressure will only
actually persist 90% of your writes and this is something you anticipate, then
MongoDB seems like a good choice - if writing in this style is faster than
other nosql systems (that make grander promises about persistence), that is.

~~~
vidarh
Exactly - the important thing is you need to actually understand the risk and
make an informed decision about what level of loss is OK for you (and you should
understand whether or not it's actually saving you anything - as you say, it
makes sense _if_ it is faster; there's no point losing data if you don't gain
something from accepting the risk).

This is also perhaps the biggest problem with MongoDB: It's fast but unsafe
"out of the box", and not everyone will know that when they use it. I think
that's a large part of the problem a lot of people have with it.

------
20years
My company got hired for 2 very large MongoDB to MySQL migration projects this
year alone totaling over $170k. These apps should have never been on MongoDB
in the first place. Total mess that cost them lots of $$.

I am okay if others continue to use MongoDB. It will keep my team gainfully
employed ;)

~~~
icc97
Wow, that is truly impressive. I'd be surprised if there's any examples of
people prepared to spend the same amount of money to go the opposite way.

~~~
20years
To be fair, the projects pretty much required a full code re-write too.
MongoDB in both cases was one of many bad decisions. Choosing all hipster
stuff is what got them into trouble.

------
cromulent
Previous discussion:
[https://news.ycombinator.com/item?id=6712703](https://news.ycombinator.com/item?id=6712703)

------
hitr
I ran into exactly the same problem as the author. I started working on mongo
as it was already chosen as the backend for a startup. I tried to avoid it,
but they did not listen. We understood duplication is the only way forward and
you need to unlearn your RDBMS skills. Then features creep in and you realize
that you need joins badly. Then you would like to have some kind of
transaction support, and it's high time you realize you are really screwed. I
always used to wonder what the real use case for mongo is, and that is
explained nicely in the blog:

 _The only thing it’s good at is storing arbitrary pieces of JSON.
“Arbitrary,” in this context, means that you don’t care at all what’s inside
that JSON. You don’t even look. There is no schema, not even an implicit
schema_

------
BukhariH
Cached copy:
[https://webcache.googleusercontent.com/search?q=cache:RFqcOb...](https://webcache.googleusercontent.com/search?q=cache:RFqcOb8xm2EJ:www.sarahmei.com/blog/2013/11/11/why-you-should-never-use-mongodb/&num=1&hl=en&gl=uk&strip=0&vwsrc=0)

------
SmellTheGlove
This is from 2013 - honest question from someone that doesn't know, is it
still relevant?

~~~
hobs
If you read the post, the problem wasn't Mongo as much as they chose the new
hotness (document db/graph model) when in reality their problem was already
well solved by a RDBMS and tabular model, they had classic problems like
duplicate data and performance issues when they tried to use a document db as
a RDBMS.

They changed over and everything was fine.

~~~
weddpros
Before: mongodb is bad because it forces us to handle the case where we'd like
to shard the db (which breaks joins).

After: mysql is sooo much easier with joins. Who cares about sharding anyway?

I hope people understand distributed systems a bit better today, but I don't
have high hopes.

~~~
toast0
Without any knowledge of mongodb; if using it means you have to handle
sharding right away, and using MySQL means you have to handle sharding when
you outgrow a single instance, isn't that a big deal? MySQL scales pretty well
these days, I think, so you can get a huge box -- a lot of stuff will fit in
2TB of ram.

~~~
weddpros
MySQL can't scale horizontally if you're using it like it's a relational
database.

No joins between shards means Sarah Mei could throw away her data model if she
needs sharding.

She would then need to denormalize her data to avoid joins, which is the main
argument in the article.

The project they were working on was a social network, a free alternative to
Facebook. So I guess it's safe to say a single box, even the biggest you can
imagine, could not ensure scalability. Beyond certain limits, sharding is the
only option.

Facebook has implemented exactly the denormalization despised in the article,
and they're using MySQL, albeit just as a dumb key/value store ;-)

~~~
toast0
I meant scale on a single box, sorry for confusion. It used to not be a great
idea to get a really big SMP box to run MySQL because locking would kill
performance; I think the locks are fine grained enough these days.

------
pmelendez
Tl;dr: they tried to use a relational data model in a document database and
failed.

I am not sure the author realizes that a relational model is actually a graph.

------
nailer
I like document-oriented databases. It seems odd that I should need to change
the data structure of an item in order to persist it.

After about five years of the many and varied cases of Mongo losing data, I
switched to RethinkDB for my current project. It's an uninteresting,
surprise-free database and I like it very much.

------
gwbas1c
It takes that article a long time to get to the point. What I'd like to know
is, "why is emulating joins bad?"

Specifically, why is it bad to load a MongoDB document, and then load
downstream documents that it links to? What kind of problems does this lead
to?

Granted, when I worked with MongoDB, I encountered problems due to its lack
of transactions. (It's surprisingly unreliable if you end up needing to
update multiple documents, whereas a relational database can do that easily
in a transaction.)

But, assuming you can pick an application design that does single-document
updates; why is following links in a document bad?

~~~
phamilton
Lack of isolation and of multi-document atomic updates is one of the biggest
difficulties. If you are updating multiple associated documents under a
semantic transaction, there can be a window in which the partial update is
visible. Sometimes this is harmless; other times it's quite harmful (changing
the constraints and budget on a bidding platform can result in overpaying
for undesired inventory, for example).

This can be worked around somewhat. If you only ever update the parent
documents, you can create new associated documents and then atomically
update the parent document to point to all the new ones. It's a bit of a
song and dance, though, and it definitely has limitations and gets
complicated fast.
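That parent-pointer swap can be sketched in a few lines. This is a
hypothetical illustration with plain Python dicts standing in for the
collections (all names are invented); in real MongoDB the final step would
be a single update/replace on the parent document:

```python
# Write brand-new child documents first, then switch the parent's child-ID
# list in one step. Readers who follow the parent's pointers see either the
# complete old set or the complete new set, never a mix.
import uuid

children = {}                                     # child_id -> document
parents = {"p1": {"name": "campaign", "child_ids": []}}

def replace_children(parent_id, new_child_docs):
    new_ids = []
    for doc in new_child_docs:
        cid = str(uuid.uuid4())
        children[cid] = doc        # step 1: new children exist but are unreferenced
        new_ids.append(cid)
    # step 2: one update on the parent makes them all visible at once
    # (in MongoDB this would be a single atomic document update)
    parents[parent_id] = {**parents[parent_id], "child_ids": new_ids}
    return new_ids

ids = replace_children("p1", [{"budget": 100}, {"budget": 200}])
print([children[c]["budget"] for c in parents["p1"]["child_ids"]])  # -> [100, 200]
```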

~~~
gwbas1c
I'm painfully aware of the "Lack of isolation and multi document atomic
updates is one of the biggest difficulties" problem, as I encountered it when
I tried working with MongoDB. (I chose MongoDB because I was working with very
fluid requirements and needed a very flexible schema.)

But that's not really what the article complains about!

The context is that I met some of the Diaspora leads in the summer of 2010.
At the time, they were ambitious but very inexperienced. (They did teach me
some valuable lessons about encryption!) I mention this because, without
some kind of data, it's hard to know whether "don't emulate joins in
MongoDB" is a conclusion born of inexperience, or whether a more experienced
developer would understand how to do it correctly.

For example, in a message board application, if a join is emulated between a
discussion and user objects that hold just a username and avatar, the
penalty of seeing an incomplete update is inconsequential: either the user
sees the old name/avatar, or the new name/avatar. So the incomplete update
problem can (in theory) be handled by restricting "joins" to only when
it's okay to see old data.
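A minimal sketch of that emulated join, with plain dicts standing in for the
MongoDB collections (all names and fields here are invented for
illustration):

```python
# A discussion document stores user IDs; the "join" is a second lookup.
# If a user document changes between the two lookups, the reader simply
# sees the old or the new username/avatar -- harmless for this use case.
users = {
    "u1": {"username": "alice", "avatar": "alice.png"},
    "u2": {"username": "bob", "avatar": "bob.png"},
}

discussions = {
    "d1": {"title": "Emulating joins", "author_id": "u1",
           "comment_author_ids": ["u2"]},
}

def load_discussion(discussion_id):
    """Load a discussion, then follow its user links (the emulated join)."""
    doc = dict(discussions[discussion_id])          # shallow copy
    doc["author"] = users[doc.pop("author_id")]
    doc["comment_authors"] = [users[u] for u in doc.pop("comment_author_ids")]
    return doc

print(load_discussion("d1")["author"]["username"])  # -> alice
```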

So, assuming that the incomplete update problem is solvable, at what point do
emulated joins stop scaling?

~~~
phamilton
> by restricting "joins" to only when it's okay to see old data.

So what do you do when it's not OK to see old data? That's when it breaks
down.

There are definitely ways around the issues, but you have to make compromises
in how you build the application. If you don't make those design decisions
early, you find yourself doing it wrong.

In describing large scale systems, a friend of mine said "Imagine a scenario
that has a one in a billion chance of happening. On a 10k qps system, it
happens almost daily."
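The arithmetic behind that quote is easy to check, taking its figures at
face value:

```python
# "One in a billion" per query, at 10,000 queries per second.
P_FAILURE = 1e-9          # chance of hitting the corner case on one query
QPS = 10_000              # queries per second
SECONDS_PER_DAY = 86_400

expected_per_day = P_FAILURE * QPS * SECONDS_PER_DAY
print(f"expected corner-case events per day: {expected_per_day:.3f}")
# -> about 0.86 events per day, i.e. "almost daily"
```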

That's how this breaks down at scale. Awkward corner cases that don't really
seem likely suddenly become daily events.

~~~
gwbas1c
So, would you say that it's a solvable problem, but one so time-consuming or
difficult that a traditional relational database is a more appropriate
choice for average developers? (Remember, the Diaspora guys were right out
of college and inexperienced.)

It's as if MongoDB is great for prototyping, and great for lossy data at
scale, but it can't hit the middle.

~~~
phamilton
Mongo's problem isn't really the lack of schema or joins or anything. It's the
mountain of false assumptions people make around its behavior. If you
understand the limitations of the libraries and systems you use, you can solve
most problems. The issue is that most people don't understand those
limitations. Mongo gets an especially bad rap because it actively pushed
misleading information as part of its sales pitch.

------
codingdave
I've always found that any blanket statement about why you should never use
a tool really comes from someone who hasn't grokked the tool well enough to
understand its appropriate use cases.

------
coding123
ArangoDB would have been a good fit; it's basically MongoDB with graph
features.

By the way, 99.99999% of the time, when a WordPress blog can't establish a
database connection, it's because it used MySQL.

~~~
feld
The database backend of a WordPress blog has nothing to do with this
conversation. Any software can be poorly tuned or configured; NoSQL wouldn't
magically solve this problem.

------
nnain
However, MongoDB seems to be marching on fine in comparison to other popular
NoSQL DBs (viz. CouchDB and RethinkDB) for the past decade -
[https://www.google.com/trends/explore?date=2006-08-16%202016...](https://www.google.com/trends/explore?date=2006-08-16%202016-08-16&q=MongoDB,CouchDB,RethinkDB)

------
ChicagoDave
I would have switched to a graph database like Neo4j. Most, if not all, of
your problems would be solved.

------
GickRimes
"Shards are the secret ingredient in the webscale sauce, they just work.." :)
[https://www.youtube.com/watch?v=b2F-DItXtZs](https://www.youtube.com/watch?v=b2F-DItXtZs)

------
spynxic
What about TokuMX? -- [https://www.percona.com/software/mongo-database/percona-tokumx](https://www.percona.com/software/mongo-database/percona-tokumx)

------
simonebrunozzi
I don't know if the author is reading this, but the "fault tolerance"
picture of the two coffee machines is something I used when I was at AWS.

Curious to know whether she was somewhat inspired by it.

------
jmccay
This article was posted almost three years ago and went through the same
love/hate cycle on Hacker News then. Why the repeat? I am sure a number of
technical points could be outdated by now.

------
cabalamat
I got "Error establishing a database connection", maybe this wouldn't have
happened if they had used MongoDB :-)

~~~
danpalmer
Yeah, the post would have never even have been saved to disk in the first
place :P

------
jMyles
> Error establishing a database connection

...but this is not a reason to stop using a RDBMS, just a reason to use it in
the right places and for the right things. Using it to retrieve objects to
fulfill every request is not the proper way. There are caching solutions which
shine at performing this task.

...just as there are applications for which a variety of key-value stores
shine.

------
danjc
Non-relational database for a million rows of data? Madness.

------
andrewclunn
An interesting article. I'm always wary of new approaches in technology due
to the "trendy" factor that often pushes the novel for no reason other than
its novelty. That said, I am curious whether there are use cases where
non-relational databases like MongoDB make sense as data caches for quick
read-only access, since the data could already be formatted the way API
calls would expect it post-extraction.

------
dangerboysteve
I stopped reading when "Babylon 5" was used in an example.

------
Ridikule
"Error establishing a database connection" The irony is thick.

------
mattlondon
Blank page with "Error establishing database connection."

Is that why you should never use MongoDB? :-)

------
NathanKP
Ironic: [http://i.imgur.com/RGApTVz.png](http://i.imgur.com/RGApTVz.png)

------
tszming
> Error establishing a database connection

Why you should never use _database_

~~~
virmundi
I know there is a sense of biting irony here, but it's really a valid point
for relatively static sites. Look at the Git-based systems out there. If you
can generate your site without a database tier, generate it without one.

------
zippoxer
I stopped reading where it said "never".

~~~
devishard
I'm usually against absolutes, but I really agree here. It's almost always
better to use a relational database, and in the rare cases where you
wouldn't want one, MongoDB is the _worst_ of the major options. It literally
leaks memory and drops data without warning or provocation. If you're not
going with a relational database, RethinkDB, Cassandra, Redis, or BerkeleyDB
would all be better choices (which one depends on the situation).

~~~
andyana
Do you know which roles each of those (RethinkDB, Cassandra, Redis, and
BerkeleyDB) excels at?

~~~
devishard
I'd usually start from a problem and choose a store rather than looking at
what stores excel at. So I won't speak to places where I haven't seen
something solved, but here are some examples:

1\. Redis is great for caching in front of a relational database, and also
for running task queues; I've used it personally in both cases. Unlike
MongoDB, it drops data based on cache-invalidation criteria rather than
dropping it randomly. Caching is one of the use cases other people are
proposing MongoDB for, but it's ridiculous to use MongoDB here when Redis
exists.

2\. Cassandra is used by Reddit and Twitter -- search around and you can
find lots of good writing about how they use it. Personally, I've only used
it indirectly via Stream.io.
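The Redis use in (1) is the classic cache-aside pattern; here's a rough
sketch with an expiring dict standing in for Redis (real code would use a
Redis client's get/set-with-expiry calls, and the "database" here is a stub
function -- all names are illustrative):

```python
# Cache-aside: try the cache first; on a miss, query the database and
# store the result with a TTL so it expires by invalidation criteria
# rather than being dropped at random.
import time

CACHE_TTL = 60.0                 # seconds until an entry expires
cache = {}                       # key -> (expires_at, value)

def db_fetch_user(user_id):
    # stand-in for the real relational query
    return {"id": user_id, "name": f"user-{user_id}"}

def get_user(user_id):
    key = f"user:{user_id}"
    entry = cache.get(key)
    if entry and entry[0] > time.time():
        return entry[1]                           # cache hit
    value = db_fetch_user(user_id)                # cache miss: hit the DB
    cache[key] = (time.time() + CACHE_TTL, value)
    return value

print(get_user(42)["name"])  # -> user-42
```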

------
weddpros
aka. the MongoDB hater's bible

EDIT: guys, be honest... if this article appears on HN again today, it's
because MongoDB haters are still alive and well. Downvoting my comment
doesn't make the arguments in this article any more valid. At least my
comment downplays the importance of this article, which is on HN's front
page; that's what makes my comment relevant.

