Why You Should Never Use MongoDB (sarahmei.com)
568 points by hyperpape on Nov 11, 2013 | 337 comments



> Seven-table joins. Ugh.

What? That's what relational databases are for. And seven is nothing. Properly indexed, that's probably super-super-fast.

This is the equivalent of a C programmer saying "dereferencing a pointer, ugh". Or a PHP programmer saying "associative arrays, ugh".

I think this attitude comes from a similar place as JavaScript-hate. A lot of people have to write JavaScript, but aren't good at JavaScript, so they don't take time to learn the language, and then when it doesn't do what they expect or fit their preconceived notions, they blame it for being a crappy language, when it's really just their own lack of investment.

Likewise, I'm amazed at people who hate relational databases or joins because they never bothered to learn SQL, how indexes work, and how joins work; they discover that their badly-written query is slow and CPU-hogging, and then blame relational databases, when it's really just their own lack of experience.

Joins are good, people. They're the whole point of relational databases. But they're like pointers -- very powerful, but you need to use them properly.

(Their only negative is that they don't scale beyond a single database server, but given database server capabilities these days, you'll be very lucky to ever run into this limitation, for most products.)
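To put a number on it: a seven-table join over a sanely indexed, completely made-up schema is just something like the query below, and any decent planner will chew through it:

  SELECT o.id, c.name, a.city, p.title, w.name, r.name
  FROM orders o
  JOIN customers   c ON c.id = o.customer_id
  JOIN addresses   a ON a.id = c.address_id
  JOIN order_items i ON i.order_id = o.id
  JOIN products    p ON p.id = i.product_id
  JOIN warehouses  w ON w.id = i.warehouse_id
  JOIN carriers    r ON r.id = o.carrier_id
  WHERE o.id = 12345;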


People hate joins because at some point they get in the way of scaling, and getting past that is a huge pain.

Or at least, that's where the original join-hate comes from.

In reality of course, most of us don't have that problem, never had and never will, and it's just being parroted as an excuse for not bothering to understand RDBMSs.

Relational database design is a highly undervalued skill outside the enterprise IT world. Many of the best programmers I've worked with couldn't design a proper database if their lives depended on it.


People hate joins because at some point they get in the way of scaling...

No, in fact, they don't.

Poor relational modeling gets in the way of scaling, and that can be geometrically exacerbated by JOINs. A JOIN, in and of itself, is neither good nor bad. It's just a tool, and like all tools, how you use it is what makes it "good" or "bad" — just like you can build a house or bash in a skull with a hammer.


In most relational database implementations, joins stop scaling after 10-50 million rows or so assuming an online transactional site.

A time series data warehouse could go into the billions of rows with scalable joins, using partitioning and bitmap indices ... but that is also only applicable in the unlikely case you could afford Oracle at $60-90k/CPU list price.

Also, most databases that aren't Oracle don't have high performance materialized views to "preprocess" joins at upsert time, therefore people resort to demoralized tables and their own custom approach to materializing those views.

Then even denormalized tables begin to stop scaling at around 250 million to 500 million rows. So people resort to sharding managed in a custom way.

I haven't even begun to express the scalability impacts of millions of users on an LRU buffer cache used in most RDBMSs - that usually is resolved through an in-memory cache (Memcached, Redis) whose coherency is also managed in a custom manner. Or you could spend $$$ for Coherence, Gigaspaces, Gemfire, etc. but that's also unlikely in most web companies.

At the end of all this, even if you bought a cache, you wonder why you're using an RDBMS at all since you're so constrained in your administrative approaches. Cue NoSQL.

Of course in practice many devs ignore all of this history and "design by resume", assuming their new social-mobile-chat-photo-conbobulator will be at Facebook scale tomorrow.


This is not true at all. I've worked on several databases with billions of rows in several tables. A good solution for improving your query performance is to use a multi column index http://www.postgresql.org/docs/9.3/static/indexes-multicolum...
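For example (table and column names invented), one composite index can cover both the join key and the filter/sort column:

  -- covers lookups that filter on customer_id and range/sort on created_at
  CREATE INDEX orders_customer_created_idx
      ON orders (customer_id, created_at);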


What part isn't true?

I'll restate my narrative: single-instance, normalized, unpartitioned databases run into scaling problems in the several-hundred-million-row range, especially when under heavy concurrent load.

But once you start moving to multi-instance, partitioned databases, you start to lose the benefits of the relational model as most databases have to restrict how you accomplish things -- e.g. joins are severely restricted.


Oracle will handle anything you throw at it, assuming you have the $$$. Ebay uses it for 2 Petabytes of data:

http://www.dba-oracle.com/oracle_news/news_ebay_massive_orac...


That's a link discussing an analytic database from SEVEN YEARS ago. eBay has moved on.

Please understand what I am saying:

- Traditional database architectures have limitations on what you can express in SQL for highly available and scalable online transaction processing once you introduce partitioning and clustering.

- Oracle has probably the best support for partitioning and clustering out of all RDBMS, but even that has limits in the billions of rows

- Many companies do not use Oracle for business reasons (licensing/sales/pricing practices)

What I am not saying:

- Oracle sucks (it's the most feature-complete and robust RDBMS out there);

- Oracle is not used (Amazon, Yahoo, eBay, etc. all use Oracle in various contexts);

- Oracle does not scale (it does, though it requires you, the SQL developer, to intimately know the database physical design at a certain point of scale, which defeats much of why SQL exists to begin with)


I routinely deal with joins on a 100 million row table and they work just fine. Other than that, I also use a 10 billion row table for searches. This is in Oracle.


I've also used databases with more than 100 million rows in a single table and received realtime query performance in multi-table joins. And this is using SQLite! No expensive DB licenses - but it was using a high-end SAN, since we actually ran thousands of these multi-million row databases in parallel on the same server.


"demoralized tables" :)


For me, it was a database schema with 38 joins (and 2 additional queries) to effectively get the data to display a single page. For that use case, mirroring the data on save to MongoDB was a no-brainer... with geospatial queries out of the box, and a few other indexing features, it made a lot of sense.

I wouldn't even think to use MongoDB for certain use cases... but for others, it's a great fit. I think that Cassandra, Riak, Couch, Redis and RethinkDB all have serious advantages and disadvantages relative to each other and to SQL.

I do find that MongoDB is a very natural fit for node.js development, but am not shy about using the right tool for a job.

Another thing that tends to irk me, is when people use SQL in place of an MQ server.


> For that use case, mirroring the data on save to MongoDB was a no-brainer

I think you just confirmed the OP's point -- MongoDB makes a good cache, not a good primary store. I'm guessing you didn't do updates into that MongoDB store, and always treated the SQL source as "authoritative" when it became necessary. Am I right?


I no longer work at the company in question, but the plan was to displace SQL for the records that were being used in MongoDB, for mongo to become the authority. NOTE: this was for a classified ads site for cars. Financial and account transactions and data would remain in the dbms, but vehicle/listing records would have become mongo authoritative.

The transition was difficult because of the sheer number of systems that imported/updated listing records in the database... there wasn't yet 100% certainty that all records were tagged properly on update so that they could be re-exported to mongo... each day, all records were tagged ... it took about 24 minutes to replicate the active listings (about 50K records), and we're not talking "Big Data" here, but performance was much better doing search/display queries against MongoDB.


> In reality of course, most of us don't have that problem, never had and never will

Maybe you never had any problems, but I don't believe "most of us" can say the same. Speaking for myself, I've encountered problems derived from join abuse in almost every job I've had.


That's funny, because I've mostly encountered problems with people who prefer to nest SQL queries inside a succession of loops in their code, rather than learn how to use SQL properly.


Yeah, me too. But having said that, I've also seen problems with mongodb and they're much, much, much, much harder to solve.


7 isn't necessarily nothing. Each join is O(log(n)), so I believe you're stuck with O(log(n)^7) as a worst case, although in practice it will probably not be so bad since one of the joins will probably limit the result set significantly.

The other problem is that with 7 joins, that's 7! permutations of possible orders in which the database can perform the join. That's a lot of combinations, and often you can run into the optimizer picking a poor plan. Sometimes it picks a good plan initially, and then as your data set changes it can choose a different, suboptimal plan. This leads to unpredictable performance.

I think that in practice, you're best off sticking with only a few joins...
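If you're worried the optimizer picked a bad order, it's at least easy to see the plan it actually chose (Postgres syntax, made-up tables):

  EXPLAIN ANALYZE
  SELECT u.name, p.title
  FROM users u
  JOIN posts    p ON p.user_id = u.id
  JOIN comments c ON c.post_id = p.id
  WHERE u.id = 42;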


If you're regularly doing 7 joins, it's a good sign of an over-normalized database.


Nonsense. It very much depends on the problem domain.


> A lot of people have to write JavaScript, but aren't good at JavaScript, [...] they blame it for being a crappy language, when it's really just their own lack of investment.

I think it's pretty much an accepted fact that JS has its problems. Even Brendan Eich has been quoted as admitting it.

(Note: I am a JS developer myself)


This is true, but the "wtf js is such a fucked up language" meme is outsized compared to the actual problems of javascript. Having worked full time in python for a couple years I could easily show you just as many weird python semantics that will inevitably bite you[1]. I think the grandparent's point has merit, that people expect to invest in their primary language for a project, but when circumstances dictate that they need to use a bit of javascript they find it annoying.

[1] What does this program do?

    print object() > object()


  >>> print(object() > object())
  Traceback (most recent call last):
    File "<stdin>", line 1, in <module>
  TypeError: unorderable types: object() > object()
Yet another reason to upgrade to Python 3! :)


About your code snippet, it prints a "random" boolean value

You're creating two objects at arbitrary memory addresses; CPython 2's default ordering for objects of the same type falls back to comparing those addresses, so the comparison returns an essentially arbitrary True or False.


That's nothing compared to the "Perl is Satan!!1" meme that us Perl programmers usually have to put up with ;)


Actually, Perl is many different Satans, depending on the particular stylistic quirks of the programmer in question.


TMTOWTDI: There's more than one way to damn it!


Not so much with the usual coding standards (see "Best Practices" book etc).


IMHO, it's well deserved. Start by dropping those $, @, % sigils for variables and then we'll talk again about how many more daemons you need to impale.


Sigils are what make Perl stand out, in syntax and behaviour, from (most) other languages, so it would be silly to get rid of them!

i.e. they're a differentiator between what is and isn't Perl.


That one is somewhat reasonable (and relatively obscure code you'd probably never write).

A more realistic example: inner classes can't see class variables from their enclosing classes. (Why enclose classes? - builder pattern)


Every language has its quirks. Python is not perfect either. But JS shows clear signs of bad design decisions, such as the behavior of the == operator.


What == does is pretty simple and easy to understand. If you have a hard time with it, use ===. Problem solved.


The problem is that JavaScript does not exist in isolation, there are other languages that use this operator. If you're familiar with any other C-derived language, the way == acts in JavaScript is very unexpected.

Yes, you can learn to deal with it, but that doesn't mean it wasn't a bad design choice. If both forms of equality are required to be operators, == and === should have been swapped. Too late to do it now, woulda, coulda, shoulda, but it certainly is, IMO, a "bad design" smell in the language, and hardly the only one that still bites people.

Another example: the way 'this' scoping works is similarly busted in that while the rules for it are reasonably straightforward in isolation, it is different enough compared to other languages that share the same basic keywords and syntax that it should have been called something else.

To be fair, I don't think much of this has to do with Brendan Eich being a bad PL designer as much as it has to do with the odd history of JavaScript that is still represented in the name of the language ("Take this client language you made which has no connection to Java, and make it look kinda like Java, please!").


I agree with you to a point. Joins are your friend. But trying to pull out all of the information about a graph of 'objects' using a single query with multiple one-to-many and many-to-many joins is just as foolish in SQL as in Mongo.


Do you have any resources you would recommend to understand or at least give an overview of indexes?

I learned basic SQL once-upon-a-time and understand the relational algebra side of things, but only truly picked up the finer details and specific engines in a piecemeal manner, as needed in various projects.


http://use-the-index-luke.com/

This is a really good resource for understanding how the queries you do relate to the actual actions that the database engine takes.


http://use-the-index-luke.com/ is a great online book (free) targeted at programmers and developers. It's practically required reading in my opinion.



Star, constellation, snowflake, flat... Developers (not the author) would benefit from an introductory database course even if they are not using databases. I think Stanford did one that was open to everyone.


This article ends up agreeing with you at the end, by the way.


When program errors pass silently, that is a legitimate problem in the toolchain.


There is a good reason that relational databases have long been the default data store for new apps: they are fundamentally a hedge on how you query your data. A fully-normalized database is a space-efficient representation of some data model which can be queried reasonably efficiently from any angle with a bit of careful indexing.

Of course, relational databases, being a hedge, are not optimal for anything. For any particular data access pattern there's probably an easy way to make it faster using x or y nosql data store. However as the article points out, before you decide to go that route you better be pretty certain that you know exactly how you are going to use your data today and for all time. You also should probably have some pretty serious known scalability requirements, otherwise it could be premature optimization.

Neither of these things is true for a startup, so I'd say startups should definitely stay away from Mongo unless they really know what they are doing. Being ignorant of SQL and attracted by the "flexibility" of schema-less data stores powered by javascript is definitely the wrong reason to look at Mongo.


I actually started to laugh as I was reading, because I knew what problems they were going to run into. I was basically drawing up my schema for a MongoDB app (yes, you still need a schema) when I started scratching my head and started reading through the MongoDB guides. I quickly realised that I should use a relational store, and my problems were solved quickly with Postgres.

The title of this article should honestly be changed, as it does not do MongoDB justice; there are a lot of uses for it, but relational data is not one. Regarding the TV example, this is a classic relational problem, and I enjoyed this exact example in a PyCon tutorial, SQL for Python Developers - http://www.youtube.com/watch?feature=player_embedded&v=Thd8y...

I see a lot of people thinking MVP ---> schemaless to save time ---> MongoDB, but you will always need a schema unless you are just dumping a list of stuff. I would like to add that another cool solution is an RDF data store; I have been using Fuseki with SPARQL.


I use SPARQL a lot, although not Fuseki. I really like the flexibility it gives in the schema (flexible, not less, schema). On top of that, one query language can be used on radically different implementations, i.e. I don't need to change my datamodel or queries to try different storage models.

Although we also use BerkeleyDB JE + Lucene for indexing, as well as a number of existing relational databases. Yet, considering the youth of the SPARQL ecosystem (version 1.1 of the standard has only been out since the beginning of this year), there is some fantastic performance possible for both hard and easy queries. I think it will be a bit like Java: not pretty, but fast enough and extremely robust in the long run. With a similar marketing pitch: "Query Once Store Anywhere".

I also evaluated MongoDB, and I understand the value of a document store. I just don't think that MongoDB is a good document store; imho it's just a slow /dev/null.


> Being ignorant of SQL and attracted by the "flexibility" of schema-less data stores powered by javascript is definitely the wrong reason to look at Mongo.

It's usually the only reason. And 10gen were good at marketing it.


I used it for exactly one production app and it was a huge success. The reason I used it was because the data we needed to represent was actually a document, in this case a representation of fillable form fields in a pdf document. The basic structure was that documents had sections and sections had fields and fields had values, types, formatters, options, etc.

Initially trying to come up with a schema in SQL was somewhat painful as what I was really looking for was an object store. Switching to mongo gave me a way to do a very clean, simple solution that worked quite well for the problem at hand (representing pdf forms). That said, we also played it very safe and used mongo for only the document portion, with every other part of the system being in an sql database. But for the documents mongo worked really well as a basic object store without the complexity of something like Neo4j.


Of course, a better choice now would be to use PostgreSQL's new JSON support. Postgres has also had XML document types for a long time, though I'm not sure of their indexing story.

If you don't need indexing into the document, you can easily just store it as serialized bytea data. I've done this quite frequently and it works wonderfully.
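A rough sketch of what that looks like (made-up names); Postgres will even let you index into the document with an expression index:

  CREATE TABLE documents (
      id   serial PRIMARY KEY,
      body json NOT NULL
  );

  -- expression index on one field inside the JSON document
  CREATE INDEX documents_title_idx ON documents ((body->>'title'));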


Sorry but PostgreSQL's JSON query syntax is insanely complicated compared to MongoDB.

And that is a big deal for a lot of developers.


I just took a look, and if I'm getting this right, it looks about as simple as it gets:

SELECT json_data FROM people WHERE (json_data->>'age')::int > 15

vs

db.people.find( { age: { $gt: 15 } } )

Personally, I prefer the postgres syntax. It's much clearer. I also don't buy your claims below about performance. Can you provide a real benchmark? Are you running with the safeties off meaning you lose data?


What is the syntax on finding an array value matching some key? E.g. given a user with field "favFoods":[String], how do I determine that pizza is in there?


select ... where "pizza" in person -> favFoods


Sorry I'm stupid, I was thinking Python

select ... where person -> favFood in ("pizza")
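Though honestly, neither of those is quite real syntax. With the jsonb type in newer Postgres and its containment operator, I believe the check would look roughly like this (table and column names made up):

  SELECT *
  FROM people
  WHERE json_data @> '{"favFoods": ["pizza"]}';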


Your ORM will give you whatever syntax is natural, eventually.


For all the talk about how "MongoDB totally has SOME use cases", I've never before heard of a use case where it would be unambiguously better to use a document store. Thanks for explaining that so well.


I've used it well in the past as well.

MongoDB was 5-10x faster than PostgreSQL, Cassandra etc.

If your domain model is structured like a document then MongoDB is a pretty great fit.


Even if it is better, why Mongo? Why not Riak?


In my opinion, datalog (via Datomic) strikes a good balance between schema flexibility and queryability. It's my preferred way of working with data now, after using Mongo for a year (I was also initially attracted by the flexibility of schema-less data stores).


I think a lot of people (at least in the Node community) love the interfaces provided for it, specifically Mongoose, which allows you to enforce a schema and thus model relational data.

I'll admit that I originally got into it because I didn't really know SQL, and I'm still not very talented with it. But, for example: joins in Mongoose? Say there is a comment, and this comment has an author. If that author field is of type ObjectId, when I run a query I can do this:

model.comment.find({_id: <some id>}).populate('author').exec(function (err, result) { /* author will be populated with that author's data instead of just the ID */ });


I use MongoDB only when I have to write an app in Node. Not because it's the better solution, but because Mongoose is the only mature data library on npm. The rest are beta at best, and the drivers for RDBMSs are not mature enough.


The mysql driver on NPM is indeed mature. Been using it without issue for a couple years now in a large app. https://npmjs.org/package/mysql


I had thought the MySQL drivers were all fairly mature - or is it that you lean more toward PostgreSQL?


Putting it another way:

Document stores are supposedly "more agile", but by conflating queries, the logical model and the physical structure, they are actually less agile. You've mixed the three things together and ossified around a single model of the domain. When the required view or model changes, you have to write workarounds.


spot on. MongoDB is a step backwards in abstraction. The RDBMS geeks had this figured out in the 70s. Hence the "relational MODEL" vs "document STORE" transparent step backwards in generalization.

MongoDB is just locking you into a specific materialization of an ill-specified data structure. PostgreSQL's team hacked out JSON support in about a year or so, since they are working at a higher level of abstraction and could insert the "MongoDB model" at the proper place in their system. Now if you really need to store "documents" in your database and query lazily-defined fields you can do that for those edge cases (and let's face it, those are edge cases) and use proper relational modeling for the rest of your model.


I once tried to insert a screw using a hammer. I'll be writing my article "Why You Should Never Use A Hammer" shortly.

And here's the crux of the problem with this article, and of so many articles like it:

"When you’re picking a data store"

"a", as in singular. There is no rule in building software that says you have to use 1 tool to do everything.


>I once tried to insert a screw using a hammer. I'll be writing my article "Why You Should Never Use A Hammer" shortly.

More like: "I once tried to insert a screw with a fish-shaped, peanut butter and jelly covered, see-through tv set".


In his defense, the Kickstarter campaign for the TV was impressive.


I always find comparisons to tools disingenuous, because people take simple tools (a hammer) and compare them to complex software tools that, if you misunderstand them, can ruin your company.

Your database isn't a hammer. It's closer to 19th century industrial machine with hundreds of buttons and levers that will cut your hand off if you use it incorrectly.


I think this is the first post on HN I wish I had a downvote button for, just for the reason you list. There is a reason there are different flavors of databases, and MongoDB most definitely would not be my choice for representing graph like relationships.

It's also scary that it has 217 points because it bashes Mongo.


I think you are missing the point of the article. If you read down to the Epilogue it explains how the "perfect" application still didn't work with MongoDB once the clients started asking for more features.

My read was that even when you think you don't have "graph like relationships" in your data, you actually do.

The original author did say this, but I would like to add: if you don't have "graph like relationships", then your data is pretty trivial and any data store will do.


From another comment I made, on why I don't think this is a good article even granting the proposed thesis of "mongo doesn't work for graph like relationships":

Even though their data doesn't fit well in a document store, this article smacks so much of "we grabbed the hottest new database on hacker news and threw it at our problem", that any beneficial parts of the article get lost.

The few things that stuck out at me:

* "Some folks say graph databases are more natural, but I’m not going to cover those here, since graph databases are too niche to be put into production." - So you did absolutely no research

* "What could possibly go wrong?" - the one line above the image saying those green boxes are the same gets lost. Give the image a caption, or better yet, use "Friends: User" to indicate type

* "Constructing an activity stream now requires us to 1) retrieve the stream document, and then 2) retrieve all the user documents to fill in names and avatars." - Yep, and since users are indexed by their ids, this is extremely easy.

* "What happens if that step 2 background job fails partway through?" - Write concerns. Or in addition to research, did you not read the mongo documents (write concern has been there at least since 2.2)

Finally, why not post the schemas they used? They make it seem like there are joins all over the place, when what I mainly see is: look at some document, retrieve the users that match an array. Pretty simple mongo stuff, and extremely fast since user ids are indexed. Even though graph databases are better suited for this data, without seeing their schemas I can't really tell why it didn't work for them.

I keep thinking "is it too hard to do sequential asynchronous operations in your code?".


I'm pretty ignorant of MongoDB so I'm genuinely interested in your response: How would you solve the problem in the epilogue, namely "a chronological listing of all of the episodes of all the different shows that actor had ever been in"?

Did Sarah model the data poorly ("We stored each show as a document in MongoDB containing all of its nested information, including cast members").

Or is there an easy way to extract that information that Sarah just doesn't know about yet?

Keep in mind the constraints in the article, for example: some shows have 20,000+ episodes, actors show up in 100s of shows, and "We had no way to tell, aside from comparing the names, whether they were the same person".

The last part seems like a really straightforward relational critique to me. If you don't break the actors out into unique entities then you can't compare them across shows. But if you do break them out into unique entities, then how do you present the show information without doing joins?


  > Did Sarah model the data poorly ("We stored each show as a 
  > document in MongoDB containing all of its nested 
  > information, including cast members").
Yes, they modeled the data poorly.

In this example, we have a TV Show, which is modeled as an entity (document). This TV Show has a list of cast members, each one modeled by a nested object.

In a relational database, this type of relationship would be modeled by having a TV_SHOWS table, a CAST_MEMBERS table with a foreign key to the TV_SHOWS table, and a CASCADE DELETE relationship to ensure that if a TV_SHOW is deleted, the related CAST_MEMBER records are also deleted.
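Concretely, the cascade-delete version described above would look something like this (column names purely illustrative):

  CREATE TABLE tv_shows (
      id    serial PRIMARY KEY,
      title text NOT NULL
  );

  CREATE TABLE cast_members (
      id         serial PRIMARY KEY,
      tv_show_id integer NOT NULL
          REFERENCES tv_shows (id) ON DELETE CASCADE,
      name       text NOT NULL
  );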

This is obviously too strong a relationship between CAST_MEMBERS and TV_SHOWS. (In OO we'd call this a "component" relationship, that is, we're saying that a tv show is composed of cast members, and if we destroy the tv show we destroy the cast members as well.)

They should have modeled CAST_MEMBERS as true entities, by making them documents in their own collection, and storing a list of Cast Member IDs in each TV Show.

  > But if you do break them out into unique entities, then 
  > how do you present the show information without doing 
  > joins?
You must join, albeit in MongoDB you do this in the application layer, not the database, so:

1. Query the cast members collection to find the cast member id.

2. Query the tv shows collection to find all tv shows with cast member id in the cast members set.

Those of us who sharpened our teeth using relational databases have trouble seeing past "two trips to the database" in the above strategy, and that's probably why there's an urge to embed documents rather than to query two collections sequentially. Resist this urge, as it's as bad as the urge to denormalize, i.e. there'd better be a damn good reason to do it.


> This is obviously too strong a relationship between CAST_MEMBERS and TV_SHOWS.

... huh?

> They should have modeled CAST_MEMBERS as true entities, by making them documents in their own collection, and storing a list of Cast Member IDs in each TV Show.

So instead of a one-to-many relationship, they should use a one-to-many relationship expressed in a different notation?


MongoDB doesn't forbid you from having entities and relations. It just doesn't support them in the same way that SQL databases do. Ditto for CouchDB, etc.

You end up having to do some joins yourself still, but this is often appropriate. Imagine that the "actor" entity contains a complete bio, including family history with relationship to other actors, links to wikipedia & fan sites, etc. When you're displaying the page for episode #202 of "Everyone Loves MongoDB", you don't want to retrieve all that data for all the actors. You're not going to display it all on the episode page anyway. Instead, you just need an ID (to href an a and src an img) and probably a small amount of denormalized stuff (name, for the img alt ...). Since that's what you need, that's what you store.

There's a limit to how far you can denormalize schemas before it is no longer helpful. The author explores this limit, and finds that MongoDB doesn't make the limit go away.


You're basically saying: don't use mongo. It's trivial to emulate a blob of data in a relational database; just use a... blob of data. Or any of the many, many other options at your fingertips. Conversely, manually implementing efficient joins is a total hassle and it'll probably end up slow and brittle. At the very least you'll need indexes and that means an (implied) schema.

So in the normal mongo usecase for storing (as opposed to caching) data with relations, let me see if I can summarize:

- you can have relations, it's just mongo won't help you deal with them: you just need to implement them yourself.

- you can have (actually need) a schema, it's just mongo won't help you deal with that; you'll need to implement that yourself. Have lots of fun with schema-changes, especially because...

- Since you're changing decoupled entities, you need to keep them in sync. You can (and probably should) use transactions, but mongo won't help you with that. You also probably want foreign keys, but mongo won't help you with that either. Migrations on mongo are a special kind of terrifying.

But hey, on the upside, it can store structured blobs, and it's probably hardly any slower than your filesystem, which could do that too.


You could absolutely do the same thing with Postgres (or SQL Server) and computed indexes over JSON (or XML) blobs. Of course, then you'd have exactly the same schema migration issues.

My point was more that a lot of the time, if you structure your data right (and get the right balance of denormalization) you don't need joins very much and so the lack of them isn't really a big disadvantage.


> Keep in mind the constraints in the article, for example: some shows have 20,000+ episodes, actors show up in 100s of shows, and "We had no way to tell, aside from comparing the names, whether they were the same person".

As others have pointed out, it requires two trips to the database. Given their architecture (distributed nodes), network latency is minimal, so this is essentially two calls to the database.

  show  { _id, title }
  actor { _id, appearedIn : [id] }

  db.shows.find({ "title": "awesomeshow" }, { "_id": 1 })
  db.actors.find({ "appearedIn": showId })

Each actor is unique in the database, when you query, you get back unique actors. I'm not sure why they're scared of joins (or multiple queries in mongo).

The question you ask yourself is not whether you're joining, but how often you're joining. If you're not joining often on actors and shows, document databases can work better, since you represent the show and all its episodes without having to join.


Another "issue" occurs to me. It seems likely that the data coming in about TV shows, especially old ones with decades of episodes, would be a bit "dirty". This sort of thing just slides right into a document store, but a relational one would have some problems with that. How do we know e.g. that "Bryan Cranston", "Bryan Lee Cranston", and "Brian Cranston" are the same (or different) actors? Of course these things can be fixed with enough manual (or, even better, user) intervention, but the time and place for that are after you've got the data in the system, not before.


> How do we know e.g. that "Bryan Cranston", "Bryan Lee Cranston", and "Brian Cranston" are the same (or different) actors?

In the USA, the various professional creative guilds enforce uniqueness.

Your general musing is right, but the problem of source-data quality is generally considered to be distinct from the design of schemata.


Yeah, the comment on graph databases seemed a bit too flippant.


I often upvote articles because I'm interested in the discussion. It does not always indicate agreement.


Well said sir. I only skimmed the article, but afaict the author still has not discovered graph stores, an appropriate way to store social graphs.

I remember downloading Diaspora back in the day. The idea behind it was great. But the code looked quite awful and insecure.


From the article:

> Some folks say graph databases are more natural, but I’m not going to cover those here, since graph databases are too niche to be put into production.

Have you used a graph database to good effect? Which one, and for what?

I have a friend who as a learning exercise wrote a toy search engine implementing PageRank — inherently a graph problem. We paired on setting up Neo4j, the only open-source graph database we could find with a working Python API, but found it fiddly and hard to get help. She then switched to SQL (Postgres, I think) and reported faster progress.

Facebook themselves use MySQL[1], so between that and my own first/second-hand experience, I'd call it far from obvious that a graph database is the most appropriate way to store social information. If you're going to criticize the OP for not considering them, it would be nice to offer some justification.

1. https://www.facebook.com/notes/facebook-engineering/mysql-an...


>>Have you used a graph database to good effect? Which one, and for what?<<

I played around with several. But the project never got off the ground due to layoffs that killed projects.

I know lots of people who have implemented graph stores with great success. One example:

http://www.bbc.co.uk/blogs/bbcinternet/2012/04/sports_dynami...

Another is a multibillion dollar retailer (not sure if it's public so I'll leave the name out) uses stardog to good effect. LOTS more out there.

>>We paired on setting up Neo4j, the only open-source graph database we could find with a working Python API, but found it fiddly and hard to get help.<<

The Graph Stores do seem to play better with Java. Neo's getting a lot of ink these days but they are far from the only game in town.

>>Facebook themselves use MySQL[1], so between that and my own first/second-hand experience, I'd call it far from obvious that a graph database is the most appropriate way to store social information.<<

They aren't using MySQL the traditional way. They undoubtedly would have made different choices had they started when Diaspora did. And they also use TAO, a homegrown graph store of sorts, FYI:

http://dl.acm.org/citation.cfm?id=2213957

It is sitting on top of MySQL at some level, as this is where objects are stored as "source of ultimate truth".

If I recall the quote correctly from when I invited a couple of FB DBAs (pre-Mark Callaghan) to speak at a meetup: "I don't think there's a single join in the facebook codebase". That might have been a slight exaggeration, but MySQL is not at Facebook because their recent needs are for a relational db.


Thanks, this is why I read all the bad comments on HN: in hopes of seeing a very informative one like this :)

I'll just point out that this:

> The Graph Stores do seem to play better with Java.

was likely a dealbreaker for Diaspora, since they were a small team without, I'd assume, Java experience. Also the nature of the project virtually requires an open-source database so Stardog would've been out. With SQL you have not one but several free and open-source implementations that are battle-tested and work well with just about any programming language out there. That makes SQL a better choice for many projects even if a graph store would map more neatly onto their problem domain.


True. I think an (even more) ambitious attempt could have attracted core developers to the project, which could have solved all the technical hurdles. That said, I was pretty excited by the idea, and I hope something new along those lines gains momentum one day.


As I recall, FB mostly use MySQL as a glorified K/V store. So I'm not sure if this is a win for relational algebra.


Reddit does that as well with PostgreSQL. It surely doesn't show a win for NoSQL if two of the biggest sites on the internet would rather use traditional SQL RDBMSs as KV stores.


In my experience, MySQL works better as a K/V store than Mongo under load - another point against Mongo for very simple data.


It's more like using a very good screwdriver instead of a swiss army knife that does an OK job at everything.

Yes there is no rule that one tool has to work for everything, but there is a rule in Agile that you should push off making assumptions about the future as far as possible, because you will never know less than right now


I actually liked the article and thought it was interesting. But the title is complete clickbait. It does not even say that "you should never use MongoDB"; it points out some situations where MongoDB is a good match. I know a title like "Think hard about whether MongoDB applies to your case" is not as attractive, but it is less sensationalist.


I know very little about MongoDB, or NoSQL in general, but I'm very interested in it. Are there any good sites/articles I should start looking at to see where it would be the right tool?


The difference is that many people are trying to insert a screw with this particular hammer today.


I don't know much about MongoDB, but I've been using a lot of CouchDB for my current project. Am I correctly assuming that MongoDB has no equivalent for CouchDB views? Because if it had, all these scenarios shouldn't be a problem.

Here's how relational lookups are efficiently solved in CouchDB:

- You create a view that maps all relational keys contained in a document, keyed by the document's id.

- Say you have a bunch of documents to look up, since you need to display a list of them. You first query the relational view with all the ids at once and you get back a list of relational keys. Then you query the '_all' view with those relational keys at once and you get a collection of all related documents - all pretty quickly, since you never need to scan over anything (couchDB basically enforces this by having almost no features that will require a scan).

- If you have multiple levels of relations (analogous to multiple joins in RDBMSs), just extract the keys from the above document collection and repeat the first two steps, updating the final collection. You therefore need two view lookups per relational level.

All this can be done in RDBMSs with less code, but what I like about Couch is how it forces me to create the correct indexes and therefore be fast.

However, if my assumption about MongoDB is correct, I have to ask why so many people seem to be using it - it would obviously only be suitable in edge cases.


Spot on about CouchDB. I haven't used MongoDB for anything of decent scale, but I must say I was shocked to read in the OP that they store huge documents like the ones from the Movie example in MongoDB. In CouchDB you can use Views to sort of recursively find all of the other documents that your current document has a document ID for. This takes advantage of CouchDB's excellent indexing. I'm not trying to start a CouchDB vs MongoDB war here, but again, I'll just say I'm surprised at the types of documents the OP was storing in MongoDB.


What I still don't understand about MongoDB is where it actually shines compared to Couch. The performance advantage would have to be quite big to offset the loss in flexibility as a general purpose DB. I'm also not trying to start a war but I'd like to get a picture about why Mongo seems to be used more often than Couch.


> What I still don't understand about MongoDB is where it actually shines compared to Couch

Marketing. They shipped with unacknowledged writes for a long time, and it made them look really good in write benchmarks. Couch was actually trying to keep your data safe, but it didn't look fast enough, so those who didn't read the fine print on page 195 of the manual, where it tells you how to enable safe data storage for MongoDB, jumped on the bandwagon.

Oh and mugs, always the mugs. I have 3 I think.


My one and only reason to use Mongo over Couch is geo indexes. As far as I can tell this doesn't exist natively in Couch. I'm also not sure how GeoCouch fits in with this.


Cloudant will soon offer geo-spatial queries. It's in beta now: https://cloudant.com/product/cloudant-features/geospatial/.


As the original article said, I think where MongoDB shines is as a glorified, souped-up caching tier, competing directly with Redis, Couchbase, and similar. It's not really a good general purpose DB.

> I'd like to get a picture about why Mongo seems to be used more often than Couch.

Very good marketing from 10gen on the one hand. On the other, CouchDB is older (and we techies love the new hotness), and the CouchDB/Couchbase split confused a lot of people. Having your original founder found a different and incompatible project with almost the same name but very different goals would cause almost any project to stutter.


> CouchDB/Couchbase split confused a lot of people

Yes, that really didn't inspire confidence in the longevity of the project


Interesting question, I can point out a few interesting differences I know of. Take note, I have more experience with Couch and its ilk than MongoDB, but I know some of Mongo's feature set.

tl;dr: You'd probably see the most difference with how a) the data is distributed and replicated and b) how you query data.

CouchDB (as of the 1.5 release) offers master-master (including offline) replication. It does not offer sharding. Cloudant's BigCouch does implement a dynamo-like distribution system for data that is slated for CouchDB 2.x iirc. Mongo on the other hand does support sharding via mongos, and you can build replica sets within each shard. It does not as far as I know support master-master. This is probably the biggest data-distribution difference between the two.

MongoDB supports a more SQL-like ad hoc querying system, so you could query for drink recipes with 3 or fewer ingredients that have vodka in them, for instance. You'd still need indexes on the data you are querying for performance.

CouchDB queries are facilitated via javascript or erlang map reduce views, which serve as indexes you craft. An additional 'secondary-index' like query facility is to use a lucene plugin and define searchable data. Cloudant has this baked into their offering, and their employees maintain the plugin on github (https://github.com/rnewson/couchdb-lucene)

MongoDB has the ability to do things like append a value to a document array. In Couch, you'd likely read the entire document, append to the array in your app, and put the document back on Couch. It does have an update functionality that can sometimes isolate things more than this, but I haven't seen it used as much. Mongo can also do things like increment counters, while Couch cannot (though CouchBase can).

There's a host of other differences. Mongo has a much broader API, while CouchDB takes a more simple http verb like approach (get, put and delete see the heaviest use). Depending on your situation, one might be a better fit, or you might simply grok one more than the other.

As far as why Mongo gets used more often, I think the closer-to-SQL ad-hoc queries made more sense to people transitioning from stores like MySQL. The CouchDB view/map-reduce stuff is a bit more of a mindset shift (see the View Collation wiki entry for an example of this at http://wiki.apache.org/couchdb/View_collation). CouchDB has also taken hits for being slower than Mongo, but I suspect it was the map-reduce stuff that really steered some folks the other way.


Querying, and querying immediately after insertion. If you want queries right after insertion (which require views), this can be slow in couch. Also, if you want to run an ad hoc query but don't want to keep a view around, you still have to add the view, wait for it to populate (causing a performance hit while it builds, plus while it is up), and then remove it.

If you're doing primarily insertion with querying via id, and using views in which stale data is ok, then couch is far superior to mongo. But that's not a use case everyone has.


Also CouchDB has better safety. Its append-only files allow you to make hot backups and safely pull the plug on your server if need be, without worrying about corrupting data.

Plus, change feeds and peer-to-peer replication are first-class citizens in CouchDB. Once you start having a large number of clients needing realtime updates, having to periodically poll for data updates can get very expensive.


Offline capable peer to peer replication was the main reason I chose CouchDB - we needed something that would realistically run on clients, even mobile devices. NoSQL we mainly chose because we needed schema-less data (the whole system relies on ad-hoc design updates). It's basically an information system IDE with rapid application development.


I immediately wondered why Diaspora didn't try CouchDB, since replication seems to be one of the key features they were after.


In Diaspora as it exists now, replication - really, federation - is between pods. There's a protocol for transferring data between pods that is deliberately database-agnostic:

https://wiki.diasporafoundation.org/Federation_protocol_over...

So CouchDB's replication doesn't really help.

If the day comes that any single pod is big enough to need replication between clustered machines within it, then CouchDB should certainly be a contender for storing its data.


It looks like the CouchDB vs. MongoDB debate in the document store world is the equivalent of the PostgreSQL vs. MySQL debate in the relational world.


Not really. They handle querying and aggregation much differently.

For people coming from SQLServer/MySQL/Postgresql, the functionality differences between the NoSQL flavors are something they don't expect, and often don't explore. There are a number of heavily used NoSQL solutions because they're focused on specific use areas.


db.find({"field":"value"},{"field":1,"someotherfield":2})

Finds all documents with field having value, returning only field and someotherfield. That part is similar to the map portion of a CouchDB/Couchbase view. No reduce portion though.

If field is what the index is built off of, it should be similar performance wise to a view. Just like views have to be created beforehand, so do mongo indices.

The difference is the find of a mongo document will happen much more quickly after insertion than the find of a couch value by view. Views require rebuild in couch which is not instantaneous.


If I understand couch correctly, it will run all map/reduce functions on a DB after insertion, thus updating all views right away - except if a view has never been queried, in which case it would happen at the first query. I don't quite understand how mongo could do a better job there - do you mean because mongo's indices are less complex than couch views, so the updates after insertion are quicker? I guess if that's the case it would perform better in insertion heavy cases, but then again I could just not use many map/reduce operations in couch and thus reduce the insertion overhead.


Almost right!

For various complicated reasons, CouchDB updates views on read, not on write. So you write some data, then you query a view, CouchDB notices the view is stale, recalculates everything, and then gives you the updated data. That can be a problem if your view is quite heavy, because every time you write, the next read will be slow.

However! You can query with "stale=ok" (which means "just give me the old data, and don't kick off a view update"), and then update your views manually (eg, cron job that hits your view every so many minutes, or if you want to be smarter, a very lightweight daemon that monitors the _changes feed and hits your view every X updates, or whenever some key document is touched, or whatever).


From my tests with couch, the view isn't populated immediately after a document has been inserted, and may take some time. I think I tried this by doing a bulk insert, waiting for the view, inserting 1, then querying, but I'd have to double check.


You and Lazare are right, I just checked the documentation and Couch indeed updates on first view query after a write.


Cloudant (based on CouchDB) automatically triggers map-reduce and auto-compacts your database for you. This is my second post about Cloudant - note that I am employed by Cloudant. :)


Make it a product, and I'd be more interested. A number of companies have their own hosting, so the hosting part is not only unneeded, but is also usually a non starter.


I must have read a dozen (conservative estimate) articles now all called "Why you should never use MongoDB ever" - or permutation thereof. Each and every one of them ought to have been called "I knew fuckall about MongoDB and started writing my software as if it was a full ACID compliant RDBMS and it bit me."

There are essentially two points that always come up:

1. Oh my God it's not relational!

Well, you could argue that if you move from a type of software that is specifically called RELATIONAL Database Management System to one that isn't, one of the things you may lose is relation handling. Do your homework and deal with it.

2. Oh my God it doesn't have transactions!

This is, arguably, slightly less obvious, and in combination with #1 can cause issues. There are practices to work around it, but it is hardly to be considered a surprise.

I keep stumbling on these stories - but still these are the two major issues that are raised. I'm starting to get a bit puzzled by the fact that these things are still considered surprises.

In either case, I'm happily using MongoDB. It has its fair share of quirks and limitations, but it also has its advantages. Learn about the advantages and disadvantages, and try to avoid locking too large parts of your code to the storage backend and you'll be fine.

FWIW, I think the real benefit of MongoDB is flexibility w/r to schema and datamodel changes. It fits very, very well with a development process which is based on refactoring and minor redesigns when new requirements are defined. I much prefer that over the "guess everything three years in advance" model, and MongoDB has served us well in that respect.


> I must have read a dozen (conservative estimate) articles now all called "Why you should never use MongoDB ever"

Strange statistical oddity if you ask me right? How many "don't use PostgreSQL" or "don't use Cassandra" or "don't use SQLite" have you seen? Not as many. It is just very odd isn't it...

So either everyone is crazy or maybe there is something to it. I lean towards the latter here.

> 1. Oh my God it's not relational! ... > 2. Oh my God it doesn't have transactions!

Maybe those, but you forget about:

3. Claim "webscale" performance while having a database wide write lock.

4. Until 2 years ago it shipped with unacknowledged writes as the default. Talk about craziness. A _data_base product shipping with an unacknowledged send-and-pray protocol as the default option. Are you surprised people criticize MongoDB? Because I am not at all. Sorry, but after that I cannot let them within 100 feet of my data. They are cool guys perhaps, and maybe having beers with them would be fun, but trusting them with data? -- sorry, can't do.


> A _data_base product shipping with unacknowledged send-and-pray protocol as a default option.

MongoDB had a default initial fire-and-forget mentality, but that was on purpose for their initial use cases. Just because someone else uses the tool for a different purpose doesn't mean the software is to blame.

Also, if you're complaining about the default settings and you were running this in production, RTFM.


> MongoDB had a default initial fire-and-forget mentality, but that was on purpose for their initial use cases

Yes, I call that deceitful marketing. It wasn't an accidental bug or an "oops". I don't know how someone can be considered honest or trusted with data when they ship a d_a_t_a_base product with those defaults. Call it random storage for 'gossakes, that would be ok, anything but "database".

> Also, if you're complaining about the default settings and you were running this in production, RTFM

Yes, and I also don't expect to have to read the fine print on the last page of a manual to enable the brakes when I buy a car. I expect cars to have brakes enabled by default, even if it somehow makes them not go as fast in benchmark tests.


It still boils down to RTFM and don't trust marketeers, right?


Mostly, "don't trust marketeers with you data", which I don't.


Mongo is absolutely TERRIBLE for schema changes. It is a terrible fit for refactorings and minor redesigns. (I have implemented and watched several mongo migrations and refactorings)

Because you have a schema, but mongo doesn't model it, you're left to your own devices to implement the migration. If you have real data, and a normal legacy situation, you can't assume all data will necessarily follow the "schema" you think you have - after all, it's implicit. But that means that writing the migration can be quite tricky. There are no validations, no foreign key checks, no constraints you can use to validate your migration did what you think it did. You'll need to do all that in your own code. This is short for: you're going to be lazy and not check it quite as well as you would have otherwise, and the checks you do implement might be buggy.

Furthermore, if a particular entity does fail a migration... what then? In postgres, which supports transactional DDL, you can roll back schema changes - so even if the last entity failed to migrate because your assumptions were wrong - and even if the validation had to be in code, not in the database - you can revert to the initial situation. In mongo? Uh-oh; you're in trouble. You had better be working on a copy of your production database; but if you are, that means that your main database needs to be offline or in read-only mode so that writes aren't lost. Does mongo have a read-only flag? By contrast, postgres (and other SQL databases) are transactional and support snapshots - you can do the safe migration with rollback support, all while online, for as long as there aren't any conflicts; and when there are, it's detected, and you have a range of options from locking to retries to avoid the conflicts.
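For anyone who hasn't seen transactional DDL in action, a sketch (Postgres; table and columns made up) of how the schema change and the data backfill live or die together:

  BEGIN;
  ALTER TABLE users ADD COLUMN display_name text;
  UPDATE users SET display_name = full_name;
  -- run validation queries here; if anything looks wrong:
  ROLLBACK;   -- both the ALTER and the UPDATE vanish
  -- otherwise: COMMIT;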

In practice, I can't imagine a worse tool for schema and datamodel changes. If your schema change is trivial, it's not so bad; but then, if you're only renaming a property that has no external references to it or adding a property or whatever - sql is trivial too.

MongoDB for schema changes is sort of like writing an automated refactoring for a dynamic-language codebase that's too large to manually inspect, without unit tests or a VCS. You won't necessarily know what goes wrong or even if anything goes wrong; you won't get system support for guaranteeing at least minimal consistency; and if something goes wrong you'll have a corrupted half-way state.


I'm not sure what these two strawmen have to do with the article. Perhaps you've been reading another article?


Given it is relatively new tech, who is qualified to write about it? Presumably by the end the authors did know a bit. Don't you think such articles, assuming they are objective, might be useful to others that are thinking of dipping their toes in the water?


SQL is actually a ridiculously elegant language for expressing data and relationships. NOT a good general-purpose language. So I tend to always favor SQL for relational data - actually, most data.

Queues, caches, etc. = NoSQL solutions. They tend to have many more features around performance to handle the needs of these problems, but not much in terms of relational data.

If you study relational databases and what they do, you will quickly find the insane amount of work done by the optimizer and the data joiner. That work is not trivial to replicate even on a specific problem, and ridiculously hard to generalize.

And so this article's assertion that mongodb is an excellent caching engine, but a poor data store is very accurate in my eyes.


No. SQL is actually pretty third-rate at expressing data and relationships. My preferred way of expressing data and relationships is the programming language I am writing in.

The problem with SQL is that it is not an API, it's a DSL. Which usually means source-code-in-source-code, string concatenation/injection attacks, and crappy type translations ('I want to store a double, what column type should I use? FLOAT? NUMERIC(16,8)?'). Even as a DSL it's pretty low-brow: just look at how vastly different the syntax is between insert and update, or 'IS NULL'.

For all those who love SQL, consider having to address your filesystem with it. Directories are tables, foreign-keyed to their parent, files are rows. There's a good reason why this nightmare isn't real: APIs are preferred over DSLs for this use case. And so too for databases, because they are the same abstraction.

Don't get me wrong, I love relational algebra and the Codd model, but SQL just ain't it. SQL has survived because of its one and only strength: cross-platform. And like all cross-platform technologies, such as Java bytecode and JavaScript, its rightful place is a compilation target for saner, richer, more expressive technologies. This is why I always use an ORM and have vowed to never, ever, write a single line of SQL again.


I like your comparison of SQL to JavaScript. However, personally I love SQL and always use an ORM. My vow is to never have a line of SQL in my application source code. This is perfectly doable with SQLAlchemy, though not with crappy ORM such as ActiveRecord.

Indeed, I blame ActiveRecord for making NoSQL popular. When your ORM doesn't create foreign keys for you, it is a slippery slope to blatant denormalization and eventually NoSQL.
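
A minimal SQLAlchemy sketch of that foreign-key point (models and column names invented; recent SQLAlchemy, where declarative_base lives in sqlalchemy.orm): declare the FK in the model and the constraint actually exists in the database, instead of living only in application code.

    from sqlalchemy import Column, ForeignKey, Integer, String, create_engine
    from sqlalchemy.orm import declarative_base, relationship

    Base = declarative_base()

    class User(Base):
        __tablename__ = "users"
        id = Column(Integer, primary_key=True)
        name = Column(String, nullable=False)

    class Post(Base):
        __tablename__ = "posts"
        id = Column(Integer, primary_key=True)
        user_id = Column(Integer, ForeignKey("users.id"), nullable=False)
        body = Column(String)
        user = relationship(User, backref="posts")

    engine = create_engine("postgresql:///app")
    Base.metadata.create_all(engine)  # emits CREATE TABLE with a real FK constraint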

EDIT: The other party to blame would be MySQL with its painfully slow "must-make-a-copy-of-everything" ALTER TABLE.


I like ORMs but in my experience the best approach is to use a hybrid of ORM, views and sprocs. Ideally each sproc will return the results of querying a view or at a minimum the identical columns, then the views become 1st class entities in your ORM like anything else (except for updatable views which I shy away from).

So personally I vow never to write an insert, update or delete again, but I am certainly happy to write queries and tune them if necessary.

The one thing that trumps nosql / denormalisation in my opinion is materialised views. Materialised views are a thing of beauty that allow for the design integrity of normalized data and the performance of denormalised data. It seems most people don't use them / understand them because they use b-grade free database engines.
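
For anyone who hasn't used them, a rough sketch of the idea (Postgres 9.3+ syntax, run here through psycopg2; the tables are invented): the base tables stay normalised, the view holds the precomputed denormalised answer, and you refresh it on whatever schedule you can tolerate.

    import psycopg2

    conn = psycopg2.connect("dbname=app")
    with conn, conn.cursor() as cur:
        cur.execute("""
            CREATE MATERIALIZED VIEW user_post_counts AS
            SELECT u.id, u.name, count(p.id) AS post_count
            FROM users u LEFT JOIN posts p ON p.user_id = u.id
            GROUP BY u.id, u.name
        """)

    # later, on a schedule or after bulk writes:
    with conn, conn.cursor() as cur:
        cur.execute("REFRESH MATERIALIZED VIEW user_post_counts")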

You never stop hearing people complain about nulls, types/precedence, and joins in SQL, but seriously, it isn't that hard to learn. These are the main things that people complain about and regurgitate endlessly, so a little effort would be a big reward.


How about a hybrid SQL-builder / data-grouper solution?

Not limited to ORM methods - get the full power of SQL instead. Not string concatenation - get the full power of the language to build queries. Also, the ability to get join results in either flat or grouped form.

For example https://github.com/doxout/anydb-sql (shameless plug)


Nice. This is exactly what I mean when I talk about ORMs. See how everything's nicer when it's an API?


I wrote a bayesian document classifier for one project I was working on - in SQL. Training the system took one INSERT and a small word-split function. Classifying the documents took one SELECT. Even in F#, I couldn't have written a more elegant or more performant solution. In a procedural language it would have been a mess of loops and roundtrips. Good SQL is almost a pure description of your desired result, with none of the "this is how you should do it" cruft.

I don't have anything against ORMs - they're almost mandatory due to the O-R impedance mismatch - but too often I see them used instead of operations that should rightly be server-side. And none of US have injection problems, because we're binding our parameters, right? ;-)


This is a reasonable comment, but nosql databases do nothing to address it. Nor do ORM libraries.


What rubbish. NoSQL is only good for queues and caches? Who on earth uses a database for this?

NoSQL works well when you are modelling your data in ways that fit each store's particular use cases. Cassandra is great with CQRS, Riak for key/value, MongoDB for documents.


As others have pointed out, this article can basically be summarized as, "don't use MongoDB for data that is largely relational in nature."

Mongo (or most document stores) is good for data that is naturally nested and siloed. An example would be an app where each user account is storing and accessing only its own data. Something like a todo list or a note-taking app would be an example of where Mongo may be beneficial to use.

A distributed social network, I would have assumed, would be the antithesis of the intended use-case for a document store. I would have to imagine a distributed social network's data would be almost entirely relational. This is what relational databases are for.


> As others have pointed out, this article can basically be summarized as, "don't use MongoDB for data that is largely relational in nature."

Disclaimer: I'm a founder of RethinkDB[1] (which is a distributed document database that does support joins).

The fact that traditional databases use the term "relational" has probably caused more confusion than anything else in software. In this context "relational" doesn't mean "has relationships". The term is just a reference to mathematical relations[2]. This is an important distinction because almost all data has relationships, whether it's hierarchical data, graph data, or more traditional relational data.

To me it's pretty clear that ten years from now, every general purpose document database left standing will support efficient joins. It helps to frame the debate from this perspective.

[1] www.rethinkdb.com [2] http://en.wikipedia.org/wiki/Relation_(mathematics)


Totally agree. I was more using "relational" to mean "cross-relational". I.e. Consider plotting your data on a 2-dimensional space, connecting your "related" data with lines. If your data looks like a spiderweb, probably some graph-type database is most appropriate. If your data resembles an inverted funnel (hierarchical) more than a spiderweb, then a document-store probably is more appropriate. More traditional relational databases are probably more appropriate somewhere in between (which is probably why they're still the most popular type of database being used).

Of course, I can't think of any real-world scenario where your data wouldn't resemble a bit of both. Even very hierarchical data usually has some cross-relationships between un-nested documents, which is why it's still awesome to have a document-store database that supports join-type relationships.


It's funny how, after all that "NoSQL everywhere, for everything" hype, people start discovering that maybe those guys in the 70s were onto something when they invented relational databases, and were not just too stupid to come up with a key-value store. Some data is relational, and answering "why don't you use the latest fashionable NoSQL" with "because our data is relational" is a perfectly fine answer.


Document stores != Key-value stores. Well, I guess they are similar but I prefer to separate DBs like MongoDB and CouchDB from other key-value databases like Riak and Redis.


I think your summary is only half. The other half is, I think, "Think real hard about whether your data is relational or not."


Indeed, the article makes the point that most interesting data is relational, or at any rate contains valuable relations. Discarding efficient relationship management may be a mistake.


"An example would be an app where each user account is storing and accessing only their own data. E.g. something like a todo list, or a note-taking app, would be examples where Mongo may be beneficial to use."

Until you want some analytics.


Many people already build data warehouses for analytics purposes; you don't want to be running reports against your live database if you don't have to. Why add extra load?


> Until you want some analytics.

I can actually respond to this specifically, as we recently had a project that needed us to build some decently-sized and complex analytics into their app. I spent about a month researching how most analytics solutions are structured and work, and became very familiar with the codebase for FnordMetric, which is one such open-source analytics solution.

You wouldn't initially think it (I certainly didn't), but Mongo is actually a good fit for analytics data. Here's why...

Most analytics platforms don't query live-data and build reports on the fly. It's terribly inefficient and doesn't scale. If something like Google Analytics did this, it'd take forever for your Analytics dashboard to load, especially at their scale.

What most analytics platforms do, is they know before-hand what data you want to aggregate and at what granularity, and they perform calculations (such as incrementing a counter) and then store the result in a separate analytics database/table. In fact, there are several presentations and articles about doing things like this with Mongo:

http://blog.mongohq.com/first-steps-of-an-analytics-platform...

http://www.10gen.com/presentations/mongodb-analytics

http://blog.tommoor.com/post/24059620728/realtime-analytics-...

And then, this is an interesting article that discusses the difference between processing data into buckets on the way in, and creating an analytics platform that does more ad-hoc processing on the way out:

http://devsmash.com/blog/mongodb-ad-hoc-analytics-aggregatio...

Let's take something as simple as aggregate pageviews for example (for simplicity's sake, we'll say you want total pageviews for your app, not per-page). Normally you'd think, simple, I'll just store my pageview events, and then when I want to view pageviews, I'll issue a `COUNT` command on the database. Even this gets terribly slow, for a couple reasons:

* You may just have a ton of pageview event entries to query.

* Each pageview has a datetime-stamp, and you don't just run one `COUNT` query for a given time-range; rather, your analytics dashboard needs to show a graph of counts over time, e.g. pageviews per day for the last week, pageviews per week for the last year, or pageviews per hour for the past day. Each of these would require several distinct COUNT queries (or one more-complex GROUP query), which is even slower, especially for large datasets.

So generally, analytics platforms will have different aggregate buckets for pageviews in the database, which each keep a different granular tally. For example, I'd have a bucket for each day, which keeps a tally of pageviews that day, and a bucket for each week, which tallies pageviews for that week, etc. When a pageview comes in, they increment each bucket - a really fast operation in Mongo, since its `$inc` update operator combined with upserts can bump all the relevant buckets with a handful of cheap queries.
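
A rough pymongo sketch of that bucket approach (collection and field names invented): one upsert per granularity, and $inc creates the bucket if it doesn't exist yet.

    from datetime import datetime, timezone
    from pymongo import MongoClient

    db = MongoClient()["analytics"]

    def record_pageview(now=None):
        now = now or datetime.now(timezone.utc)
        day = now.strftime("%Y-%m-%d")
        hour = now.strftime("%Y-%m-%dT%H")
        # upsert=True creates the bucket document on its first hit; $inc bumps the tally
        db.pageviews_daily.update_one({"_id": day}, {"$inc": {"count": 1}}, upsert=True)
        db.pageviews_hourly.update_one({"_id": hour}, {"$inc": {"count": 1}}, upsert=True)

    record_pageview()

The dashboard's "pageviews per day for the last week" then reads seven tiny documents instead of counting millions of raw events.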

So why is Mongo pretty good for analytics? Because 1) each time-interval bucket is a silo of data for that time-interval, and 2) usually analytics are for patterns and aggregate data, so they don't normally require extremely high reliability (i.e. it's usually okay if an event is dropped here or there).

Of course neither #1 or #2 above are always the case, so this doesn't always apply, but my point was just that Mongo is actually a better fit for analytics than you might imagine.


I haven't done the kind of analytics you're talking about, but it sounds like the implementation is basically a round robin database.


Thanks for the great references.

I really needed some good resources on doing analytics in MongoDB.


But then you can take it out of Mongo into something made for analytics. This is a challenge I'm currently facing, but I feel the flexibility Mongo has offered in letting us iterate on our data collection is paying off in the end.


I'm using CouchDB hopefully for the right reasons. Each user is storing and accessing only their own data. I need that data to be easily stored offline in localstorage in the browser (sqlite/indexedDb not being supported in all browsers), and similar key/value stores for iOS/Android apps. On top of that I need synchronization when the user does come online. This is the type of app you'd want to use on the go as well as on your home computer, so easy synchronization is very important, which the CouchDB changes feed provides.


I haven't used CouchDB yet, but I have a good friend who's an amazing developer, and he swears by CouchDB, mainly for the reasons you mentioned. So, I don't have any context for your app, but it sounds like you picked a well-suited database to me.


That sounds like a good use for CouchDB; I'm doing something similar. CouchDB shines at that stuff (and as a bonus, avoids some of the issues the OP was having with MongoDB. CouchDB views aren't magic, but they're powerful and functional; more than capable of doing some basic joins).


Or maybe something like this: A Graph Database

http://en.wikipedia.org/wiki/Graph_database


Yep, the "graph databases are too niche to be put into production" bit urked me- Neo4j et al are in plenty of large production systems. OTOH maybe, due to the distributed nature of the project, they didn't want to distribute a less-known database?


I guess in 2010 Neo4j did not have as much exposure as it has today. Still, I concur that the author should not brush graph databases aside for something like a social network - they seem a better solution than an RDBMS.


I suspect there isn't actually a lot of need for graph operations in a social network. At least, not in implementing features for the users. A distinctive thing about social networks is that although the users form a network, they are primarily social - they're interacting with their friends.

They will end up interacting with friends of friends via their friends (eg having a flamewar with your cousin's neighbour in the comments of your cousin's post about potatoes), but not with friends of friends of friends or any degree of separation further out. The queries needed are overwhelmingly local, and a boring old relational database will handle them fine.

Where a graph database might shine is in analytics over the whole network, looking for trends, hubs, clusters, etc., although I'm not convinced it would be any better than a relational database which supports recursive queries (as PostgreSQL does). However, this is exactly the sort of privacy-busting awfulness that Diaspora was built to escape from!
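
For the curious, the friends-of-friends case in plain PostgreSQL looks roughly like this (a sketch via psycopg2; the friendships table and the user id are invented):

    import psycopg2

    conn = psycopg2.connect("dbname=social")
    with conn, conn.cursor() as cur:
        cur.execute("""
            WITH RECURSIVE reachable(user_id, depth) AS (
                SELECT friend_id, 1 FROM friendships WHERE user_id = %s
                UNION
                SELECT f.friend_id, r.depth + 1
                FROM friendships f JOIN reachable r ON f.user_id = r.user_id
                WHERE r.depth < 2
            )
            SELECT DISTINCT user_id FROM reachable
        """, (42,))
        friend_of_friend_ids = [row[0] for row in cur.fetchall()]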


I think the distinction is captured in "largely". The author seems to be saying that unless you need only the absolute most minimal relational queries, don't use Mongo. That's more extreme than what I realized (and I can't tell if you're agreeing or not).


> An example would be an app where each user account is storing and accessing only their own data. E.g. something like a todo list, or a note-taking app, would be examples where Mongo may be beneficial to use.

You don't need MongoDB to store todo-list data. My opinion is, on some platforms like Node.js where ORMs and RDBMS drivers are not mature, it's quicker to stuff your app with MongoDB rather than a relational database, because they both use JavaScript and JSON data structures. But does MongoDB scale more easily than a MySQL database? Is it even easier to manage? I don't think so.


Relational just means tabular in the context of relational databases. For storing large-scale social networks I would think of specialized graph databases before anything else.


Linkbait title aside, it's actually a helpful example for directing a database novice on when to not use a document store. I could have used this post a few times in the past few years.


Agreed. Despite the unfortunate title, this is an informative, well written and entertaining article that I might refer to in the future. It would be better if there was a followup on when it would in fact be appropriate to introduce a document store to a project.


I agree that the OP is lengthy, and putting together this well-illustrated post is no easy feat. However, I don't think the OP should be the one to write about when you should use a document store.

Maybe I'm too annoyed by the poorly chosen title. Or that I read that entire post and was thinking where's the punchline? On one hand, I credit the author for thinking things through. On the other, the fact that she unequivocally attributes this issue to MongoDB shows that she currently lacks the domain knowledge to consider appropriate use cases. It's not a MongoDB problem, it's a problem inherent to this data structure, and someone more well-versed in this topic would not conflate the issue...just as a decent IT person would not blame "Windoze" for the fact that she can't get good Wifi reception in the office.

OK, to be even more petty...I think what really aggravates me is how the OP says she's not a database expert -- which is a good disclosure, but self-evident -- but attempts to assert authority by saying "I build web applications...I build a lot of web applications"...Uh, OK, so what you're saying is that it's possible to be an experienced web developer and yet be a novice at data design?

If that was the angle of the OP, I'd give it five stars. Such sentiment cannot be overstated.


Well you're right of course that web developers (and business analysts, and politicians, etc.) can absolutely get by for a staggeringly long time with novice-level abilities. That problem is only getting worse as the tools get better. Luckily I don't have to judge the OP on that basis since that's what markets are for.

And maybe someone else, who has tackled enough difficult problems over time to evolve a nuanced and technically informed opinion of various data modeling and management options, should write the response I mentioned. I'd argue there are plenty of examples of that material available already.

The OP, on the other hand, would be writing from the perspective of a professional user who might choose a tool off the shelf at the recommendation of a colleague, and whack it against the problem du jour to see if it works or not. This is a common enough approach that there is at least a chance that a followup would have some value. I can't really expect everyone who makes a living writing web applications to understand CS fundamentals, any more than I would expect it from chemical engineers or physicians. It is nice to be able to point representative members of that audience to an article that resonates with them, and not have to try to translate my opinions into similar language (with or without cat gifs).

Edit: I actually think Journeyman would be a more appropriate term than novice.


Aren't there things other than 'experts' and 'novices'?

It is possible to be an experienced web developer without being an expert at databases, for some reasonable definition of 'expert', sure. I think so anyway. Do you find that aggravating?

Whether it's possible to be an experienced web developer while being a novice at either 'databases' or 'data design' (are those the same thing? you said the second, OP said the first) is open to debate I suppose, but is not implied by the OP.


Even though their data doesn't fit well in a document store, this article smacks so much of "we grabbed the hottest new database on hacker news and threw it at our problem", that any beneficial parts of the article get lost.

The few things that stuck out at me:

* "Some folks say graph databases are more natural, but I’m not going to cover those here, since graph databases are too niche to be put into production." - So you did absolutely no research

* "What could possibly go wrong?" - the one line above the image saying those green boxes are the same gets lost. Give the image a caption, or better yet, use "Friends: User" to indicate type

* "Constructing an activity stream now requires us to 1) retrieve the stream document, and then 2) retrieve all the user documents to fill in names and avatars." - Yep, and since users are indexed by their ids, this is extremely easy.

* "What happens if that step 2 background job fails partway through?" - Write concerns. Or in addition to research, did you not read the mongo documents (write concern has been there at least since 2.2)

Finally, why not post the schemas they used? They make it seem like there are joins all over the place, when I mainly see "look at some document, retrieve users that match an array". Pretty simple mongo stuff, and extremely fast since user ids are indexed. Even though graph databases are better suited for this data, without seeing their schemas, I can't really tell why it didn't work for them.

I keep thinking "is it too hard to do sequential asynchronous operations in your code?".


Really, in some places it hurts

* We stored each show as a document in MongoDB containing all of its nested information, including cast members*

I've seen this with people using MongoDB who bought the BS that because "it's a document store" there should be no links between documents.

People leave their brain at the door, swallow "best practices" without questioning and when it bites them then suddenly it's the fault of technology.

" or using references and doing joins in your application code (double ugh), when you have links between documents"

1) MongoDB offers MapReduce so you can join things inside the DB. 2) What's the problem with having links between documents? Really? Looks like another case of "best practice BS" to me.


Links in mongo aren't really links though; it's up to the application to handle the "joins", which really means making an extra query for every linked item. It's like SQL joins except without any of the supporting tools or optimizations that exist in an RDBMS.


Yes, it is manual

But you can query for a list of ids for example, using the 'in' operator and a list. http://docs.mongodb.org/manual/reference/method/db.collectio...
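
i.e. something like this in pymongo (names invented) - one extra round trip for the whole batch rather than one per linked document:

    from pymongo import MongoClient

    db = MongoClient()["app"]

    post = db.posts.find_one({"slug": "hello-world"})
    friends = list(db.users.find({"_id": {"$in": post["friend_ids"]}}))  # single query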


Isn't this done client side? Without joins in the db engine itself locality is much worse along with lost opportunities for optimization leading to much worse performance.


Yes, you have to build the list of IDs to pass to the $in operator and then send out a second query but grandparent post said you had to make an extra query for each linked item which is incorrect.


At Mongo training we were told map/reduce did not offer good performance and to avoid it for online use. You must use the "aggregation framework".


> What's the problem to have links between documents? Really? Looks like another case of "best practice BS" to me

I think the main problem is that it becomes difficult to maintain consistency, due to Mongo's lack of transactions.


Do NOT use MongoDB unless you understand it and how your data will be queried. Joins like the author mentions by ID is not a bad thing. If you aren't sure how you are going to query your data, then go with SQL.

With a schemaless store like Mongo, I've found you actually have to think a LOT more about how you will be retrieving your information before you write any code.

SQL can save your ass because it is so flexible. You can have a shitty schema and make it work in the short term until you fix the problem.

I wrote many interactive social apps (fantasy game apps) on Facebook and it worked incredibly well and this was before MongoDB added a lot of things like the aggregation framework.

The speed of development with MongoDB is remarkable. The replica sets are awesome and admin is cake.

It sounds like the author chose it without thinking about their data and querying upfront. I can understand the frustration but it wasn't MongoDB's fault.

This is a big deal for MongoDB: https://jira.mongodb.org/browse/SERVER-142.

Let's say you have comments embedded on a document and you want to query a collection for matches based on a filter. If you do that, you'll get all of the embedded comments back for each match and then have to filter on the client. IMO, when the feature above is added, MongoDB will become more usable for more use cases that web developers see.


I've seen a fair number of articles over the last couple of years comparing the strengths and weaknesses of relational/document-store/graph databases. What I've never seen adequately addressed is why that tradeoff even has to exist. Is there some fundamental axiom, like the CAP theorem, explaining why a database like MongoDB couldn't implement foreign keys and indexing, or why a SQL database couldn't implement document storage to go along with its relational goodness?

In fact, as far as I can tell (never having used it), Postgres's Hstore appears to offer the same advantages as a document store, without sacrificing the ability to add relations when necessary. Where's the downside?


> why an SQL couldn't implement document storage to go along with its relational goodness? (…) Postgres's Hstore appears to offer the same advantages as a document store, without sacrificing the ability to add relations when necessary. Where's the downside?

PostgreSQL can store arbitrary unstructured documents just fine: hstore, json, … Each comes with the possibility to index arbitrary fields within the documents using a BTREE index on an expression, and to index arbitrary documents wholesale using a GIST index.
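
A small sketch of what that looks like (psycopg2; the table is invented, and it uses jsonb, which arrived in 9.4 shortly after this discussion - hstore works similarly):

    import psycopg2

    conn = psycopg2.connect("dbname=app")
    with conn, conn.cursor() as cur:
        cur.execute("CREATE TABLE events (id serial PRIMARY KEY, doc jsonb NOT NULL)")
        # btree expression index on one field pulled out of the document
        cur.execute("CREATE INDEX events_user_idx ON events ((doc ->> 'user_id'))")
        # GIN index over the whole document, for containment queries such as
        #   SELECT * FROM events WHERE doc @> '{"type": "click"}'
        cur.execute("CREATE INDEX events_doc_idx ON events USING gin (doc)")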

Besides the need to know a thing or two on query optimization, the only downside I can think of is that ORMs are usually broken (Ruby's Sequel is a notable exception). But this isn't a problem with Postgres itself; it's a problem with ORMs (and training, admittedly).


Typically as your data model complexity ("relatedness") increases, it's more difficult to scale. I'm not sure about anything like CAP, but I do know that in graph-database land we have to remind ourselves that general graph partitioning is NP-Hard, and that our solutions will need to be domain-specific.



>> Some folks say graph databases are more natural, but I’m not going to cover those here, since graph databases are too niche to be put into production.

Is this really true? It sounds like both relational DBs and document DBs are a poor choice for the social network problem posed. I've actually dealt with this exact problem at my last job when we started on Mongo, went to Postgres, and ultimately realized we traded one set of problems for another.

I'd love to see a response blog post from a Graph DB expert that can break down the problem so that laymen like myself can understand the implementation.


At my current employer, we're working on a product that relies heavily on a graph DB (Titan, in this case). Performance characteristics vary dramatically based on the type of query you're trying to run so you have to be careful about how you use it. There are certain types of things you might do in a relational DB with no worries but that would perform horribly in Titan. The converse is also true, of course. For example, a query along the lines of "give me a list of friends of friends of person X" is very fast indeed on a graph database, whereas a query like "give me a random person" tends to perform horribly. But we've been able to get impressively fast, real-time performance on graphs with millions of vertices and tens of millions of edges. They're still niche products compared to NoSQL systems like Mongo, Redis, etc. But I don't see any reason to think that Titan or Neo4J aren't production ready.

Here's a good intro to Titan and how it works: http://www.slideshare.net/knowfrominfo/titan-big-graph-data-...


I would look at Neo4j. I originally came across it when vetting Grails (it has a Grails plug-in) and it seems to be one of the heavy contenders in terms of a production-ready graph DB. People (this article's included) seem to say that production-ready graph DBs don't exist. Maybe these projects are still trying to gain traction? I expect some stable builds will be out there soon if they aren't already...

http://www.neo4j.org/


My experience with Neo4j (this year) was abysmal. The take-away I had was: it's only good for very small graphs.

Generally, I'd spend some time writing a script to load data into it, start loading data, respond to it crashing a few hours later, increase the memory available to the process, start up again, and respond to it crashing a few hours later. I was never able to get any reasonably-sized graph[0] working reliably well without using an egregious amount of memory, and knowing that I would continue to face memory issues, I gave up on Neo4j and found another way to solve my problem.

It may be that I simply was not competent at setting it up properly, but no other data store I've worked with has been as hard to get stable over a moderately sized data set. I spoke with some other people who had worked with Neo4j at the time, and they expressed the same issues - they couldn't make it work for any reasonably-sized dataset and had to find another solution.

[0] Not big, mind you, just reasonably-sized. E.g. 4 million nodes, with each node having an average of 5 edges and 2-4 properties.


Hm, I assume you reached out to the mailing list and what not? I know a number of installations with numbers well above that. Were you using the batch insertion API?


No, I'm sure there are some great running instances out there - but I was put off by the difficulty of getting it reliably running without being an expert in its configuration. Additionally, the fact that I'd have to spend at least $12k/year to have only 3 nodes in a cluster, knowing we'd need a lot more than that as time went on sealed the deal.

We found that we could do everything we needed with secondary processing against our document store at runtime for so much less without adding another layer of complexity to the architecture.

Edit: forgot to mention - no, we weren't using batch-insertion in all cases. IIRC, we had issues with duplication and had to do check-if-exists -> create-if-not, as we were reading from raw data sources that were heavy with duplicates.


Many heavy duty production customers of Neo4j run with just a 3 node cluster, no need to scale out as with other NoSQL datastores. And actually they replaced larger clusters with a small Neo4j one.

I would love to learn about your Neo4j setup, and the issues in detail, I want to make it easier for people in your circumstances in the future to get quickly up and running with Neo4j in a reliable manner. If you're willing to help out, please drop me an email at michael at neotechnology dot com.


And I remember in flipping through a book on graph DB engines that some can be mounted as an extra layer on top of relational stores, so there is always that backdoor back into it.


Yeah, FlockDB (https://github.com/twitter/flockdb) comes to mind. I think Titan (https://github.com/thinkaurelius/titan) should / will be able to handle this too.


I can attest that Neo4j is production-ready- I know they're being used at companies like Adobe and Cisco, and we were happy with it at Scholrly.


More than 30 of the Global 2000 now use Neo4j in production for a wide range of use cases, many of them surprising, even to us! (And we invented the property graph.)

A partial list of customers can be found below:

www.neotechnology.com/customers

The "too niche" comment might have been true a few years ago. I won't speak for all graph databases, since many are clearly very new and haven't had much time to mature yet. But Neo4j has been in 24x7 production for 10 years, and while the product has of course evolved significantly since then, it'd built on a very solid foundation.

Most of the companies moving to graph databases--speaking for Neo4j, which is what I know about--are doing so because either a) their RDBMSs weren't able to handle the scope & scale of their connected query requirements, and/or b) the immense convenience and speed that comes from modeling domains that are a graph (social, network & data center management, fraud, portfolios, identity, etc.) as a graph, not as tables.

For kicks, you can find a number of customer talks here, from the four (soon five) GraphConnect conferences that were held this year in major cities around the world:

http://watch.neo4j.org/

If you're in London, the last one will be held next week:

www.graphconnect.com

You'll find a summary below of some of the technology behind it, with some customer examples.

www.neotechnology.com/neo4j-scales-for-the-enterprise/

One of the world's largest postal delivery services does all of their real-time package routing with Neo4j. Several customers have more than half of the Facebook social graph running 24x7 on Neo4j clusters behind web applications with millions of members. Railroads are building routing systems on Neo4j. Some of the world's largest companies are using them for HR and data governance, alternate-path routing, etc. etc.

The best way to really understand why graph databases are amazing is to try. Check out the latest v2.0 M06 beta milestone of Neo4j (www.neo4j.org) which includes a brand-new query environment. I've seen connected queries ("shortest path", "find all dependencies", etc.) that are four lines in the Cypher query language and 50-100 lines in SQL. I've seen queries go from minutes to milliseconds. It's convenient and fast. Glad to see you exploring graphs!


> Is this really true?

Facebook's TAO is a giant graph database built on top of MySQL [1]. I'd say it's pretty production-ready, because Facebook's social graph probably has at least hundreds of vertices.

[1] https://www.facebook.com/notes/facebook-engineering/tao-the-...


There's a difference between ready-for-production and ready-for-production-if-you-have-the-entire-team-of-developers-that-wrote-it-on-hand-all-the-time.


See http://blog.neo4j.org/2013/11/why-graph-databases-are-best-t... on how to use Neo4j for the mentioned Diaspora cases (Neo4j was actually proposed back in 2010 to the team). Comments are very welcome.


Exactly what I thought. Mongo has its purpose. But it's a tree. If your data is a graph with many nodes, it's going to take some elbow grease. Don't use mongo in that case, use something that is built for that, like Neo4j ...


What's wrong with symlinks on a transactional filesystem?


Millions of small files is the worst workload for pretty much every filesystem. Data locality and fragmentation can end up becoming real problems too.


I hate link bait like this.

The real title should be "Why you should never use a tool without investigating its intended use case".


But the point is that there is no use case. Relational databases and normalisation didn't arise because a load of neckbeards wanted bad performance and extra complexity.

The point of the article is that the world is relational, and because Mongo isn't, it'll bite you in the ass eventually. Sure, that's a specialisation of what you said, but still a useful one, as it allows you to immediately know you shouldn't use Mongo (unless your data is all truly non-relational, and you know you'll never integrate it with any relational data, which, without a crystal ball, you can't know, so don't use it).


There is a use case, but internet hype has gotten everyone wanting to use Mongo when there's no real reason to. Postgres scales nearly as well as Mongo while being a lot more flexible. That said, Mongo has some real benefits for non-relational computing (see mapreduce) that could make some of the abstraction headaches and lack of data model flexibility worth it for very large data sets.

But I sort of agree; Mongo tends to be overused by startups who are trying to solve a scalability / performance problem before they have one. In the process they often end up running into data model limitations because stuff moves fast early on and you can't foresee what you'll need in a year.


As soon as you have users, you'll want to handle relationships between users, whether that's outward-facing or for internal analytics. All products have users by definition. Therefore...


Yes. There seem to be a lot of people with quite poor reading comprehension commenting here. The case that the article makes is something like:

1. Document stores are no good for data with non-strictly-hierarchical structure.

2. All interesting data has some non-strictly-hierarchical structure.

The first point is common knowledge nowadays. It's really the second point that is interesting. Moreover, interesting and correct.


Can anyone explain what are some actual real-life good uses for MongoDB?


I was on a team that built a web app for primary school standardized testing. The amount of data presented and collected per student per test is large and perfect for a document store. MapReduce operations allow the app to quickly produce cacheable reports across cross-sections based on requested criteria.

Even the tests themselves are composed of multiple parts that randomize for each student, and lend themselves to the document structure that MongoDB provides. Individualized tests could be assembled from components based on student criteria and stored uniquely for a user as of that time, a thing which would be unnecessarily complex within a relational system.

Could this all have been done with a relational database? Yes, I suppose, but I cringe at the complexity of relating test questions with test answers with users with other data elements ad infinitum using JOINs on both read and write. And this doesn't even touch the topics of sharding and replication, which Mongo made easy in comparison to MySQL or MSSQL.

Choosing MongoDB was the correct decision for this dataset and application. I don't advocate it for every app, but for this one, it was the appropriate fit.


How would you go about getting out data that answered a question like "give me the average score for all maths questions of female students aged 6-7"?


The aggregation framework is ideal for answering these kinds of questions. Mongo has a bunch of aggregation routines that are useful for producing reports on demand, but not on the fly. The trade-off is possible because we know that the collected data (test answers) won't change after it is finalized, and the output of any individual report can be cached virtually indefinitely. (See http://docs.mongodb.org/manual/core/aggregation/)
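
For the grandparent's example question, the pipeline would look roughly like this (pymongo; the collection and field names are invented):

    from pymongo import MongoClient

    db = MongoClient()["testing"]

    pipeline = [
        {"$match": {"subject": "maths",
                    "student.gender": "F",
                    "student.age": {"$gte": 6, "$lte": 7}}},
        {"$group": {"_id": None, "avg_score": {"$avg": "$score"}}},
    ]
    result = list(db.test_answers.aggregate(pipeline))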

Also keep in mind that unlike something like a web analytics package that gives you the option to filter and sort your data on any combination of criteria imaginable (for no good reason), the questions that academics/educators tend to need the answers to are generally the same for every new set of tests.

In other words, it's not necessary to enable every possible combination, filter, and sort of output, but merely (ha!) to optimize for the specific results that we know we will need (with a nod toward those results we might expect people to want in the future), and to codify the formulae that will produce those results.

Working with not a school but a testing research company (think of a company like "College Board" vs a "Smallville School District") leads you to produce reports that are significantly more detailed and statistically more valuable than "average score" questions like this, all of which is possible within MongoDB. Though, obviously this treads a little more into work product than would be comfortable to expound upon here. ;)


Sounds reasonable but not ideal. Probably some sort of ETL into a data warehouse would be needed for adhoc analysis.


That's what the aggregation framework is great for. I don't have the code handy, but the free MongoDB for Developers course covers almost this exact use case.


Something like Imgur might be a good use-case for MongoDB. There's basically no relations between images, so each image can easily be thought of as a lone document.

That said, even if somebody was building something like Imgur, I would still advise that they start with a SQL database. SQL is very well-understood, and you will have no problem finding developers that have deep experience in your SQL engine of choice.

More importantly, by the time you hit the point where you need a NoSQL solution to handle scaling issues, you will have achieved product-market fit, and can make a sane technology decision based on your vastly greater understanding of the business needs.


> by the time you hit the point where you need a NoSQL solution to handle scaling issues

See, people keep saying that NoSQL databases give you a performance boost over traditional relational solutions (MySQL and Postgres), but exactly where does this performance boost come from? I can understand the appeal of in-memory databases or using caching (Memcached) to supplement the relational solution, but it seems like the vast majority of Mongo's performance benefits come from eschewing ACID guarantees rather than document databases being inherently faster.


Imgur is set up to have almost the same structure and features as reddit. Users have images, there are sections duplicating the subreddit structure, images have comments, comments have votes and voters. That's a lot of related info that isn't strictly hierarchical, and I believe you'd run into the same problem described in the article - the need to manually do joins and associate types of data in your application code to ensure consistency and lack of duplication.


Except there are relationships in Imgur. For example, when it groups images from different subreddits or creates albums.


Seconded. I see a lot of "But there are use cases!" and, so far at least, not even a little bit of "Here's a use case..."


Check my comment history, I've given several use cases. Basically, Mongo is nice if you don't need a lot of relational stuff but do have lots of arbitrary data to store. A good example is time series data where the format changes over time -- often it's not a good idea to go back and convert old data (sometimes it's not even possible). Mongo makes it really easy to support multiple schemas if your business requires it, rather than having to maintain arbitrary numbers of different tables.


Sure. But why use Mongo for that, instead of a PostgreSQL table with a JSON column for the data, and perhaps a denormalized version identifier so your application code knows what to do with the format of a given row's data field? I can see a speed argument, but I can't see how that militates for pure Mongo, instead of Mongo as a cache in front of something that provides a reliably (not "eventually") consistent backing store.


Alright, so I am not the only one who's been very curious too. Please, someone write up a real use case where MongoDB is used as the only/main data store, and not as a persistent cache in front of a relational database?

Thanks.


TL;DR: Mongo works for me ('us'), and when I get some time I'll write a post on what we're doing with it, and why it works for us.

I am ('we are') using Mongo for a public transit planner in South Africa. It's not yet production-ready, but beta testing is going well.

Let me paint the picture before I go on to justify our use of Mongo.

In South Africa there are trains, buses, minibuses and other services (metered taxis, shuttles). Trains run on stop-by-stop schedules, the bus services on a normal departure-based schedule, and minibuses completely dynamically. In order to implement a well-organised integrated planner, you have to view all of these as one 'type' of service. Then there's how pricing is calculated for each of the services. There are many different ways in which pricing is calculated (distance-based, zone-based, pricing matrix, fixed minimum with variable charge, etc.), then there's ticketing, discounts, etc. That too we needed to represent in a simple structure.

Now, my SQL is pretty good, I don't frown upon indices nor joins, but what I can say is that in my initial implementation of the whole idea, I faced a number of problems, being:

(1) what level of normalisation/denormalisation is necessary? (i.e. what should I join, what should I keep in the same table)

(2) I'm essentially using a graph, except that nodes aren't always connected, so how do I traverse the graph when it's actually 'broken'? (excuse me if I get the terms wrong, I'm actually a technical accountant, programming is my second love)

(3) I'm working with location data, I can't expect to use the Haversine formula or equivalents, how can I index both locations, and routes? How do I even store routes? (blob of serialised arrays?)

(4) How do I reduce development time, to reduce the amount of time I spend refactoring schemas/code when I want to implement some new 'shiny' feature?

These were my main concerns, as they were the problems that I had with MySQL. PostgreSQL would have done a good job for (1) and (3), and maybe (2), but I was still concerned with (4) as working with PHP/MySQL isn't the friendliest of things. Doing 'in()' queries is one example, as I have to parse an array to a string before using an in() function.

Mongo initially appealed to me because it was marketed as 'schemaless', but even someone with little knowledge as I knew that it should be taken with a grain of salt. The benefit here was that I could store different services with different attributes in one collection. If a service is a minibus, I add all the fields that I need for the minibus, and omit the ones for a train for example. Similarly with the pricing structures.

At first it was difficult grasping the 'store everything in one document' model, but I got the hang of it, and now my schema is 'frozen', so (1) has been solved. I don't use joins or any simulation thereof. Because I store what I need in one document, I don't need to go back to Mongo to find joining data.

(2) A graph database wouldn't work well for me, because even though services link together, there are instances where the commuter will have to walk to join another service. How do I do that? I initially created manual walking links in MySQL but that was naive and stupid (trying to avoid Haversine). Obviously PostgreSQL distance-based queries would also work here. Another thing against a graph database is that my project doesn't just rely on traversing graphs all day, there are other things which I need to do, like analytics.

(3) To be honest, even though I'm confident with my SQL knowledge, all the PostgreS/PostGIS functions felt a bit intimidating at first. I can't afford a $250k/year DBA, so I have to know what I'm doing on the database as well as the client/server. I find learning how to use Mongo to be easy, even though people say their query 'language' is (insert bad word), I find it quite user-friendly. Mongo's geospatial support, and GeoJSON as of 2.4, made things like storing a route, and running typical queries on it, very easy.

One other thing that we discount is the transparency of data in Mongo. Yes, they should use field compression and other space-saving techniques, but until then, I see the benefit in being able to do a db.collection.find() and look at the whole document without having to join any tables (it's just a bit quicker, I guess). Which brings me to (4): the most fun thing that I had to implement was scheduling, bearing in mind that there are scheduled services and variable/random ones. Let's say I have a separate table for schedules, and I want to find: x services that start/pass at [lat,lon], which are operating during ab:cd AM, and which have the next schedule/estimate within m and n.

Sure, you could do it with a JOIN, but why do that when you can do it all from a single document?

Lastly, experienced programmers tend to take for granted the benefit of simplifying certain things for novices/beginners. The reason why JSON is taking over, besides that it's a more compact and readable expression than XML, is that it's also easy to work with. Why should I worry about converting associative arrays to strings in order to do an in()? Instead of saying something like "in(array)" directly? Even though I was forward thinking in my schema design, there were a few changes along the road when I realised that something wasn't working. Making a change in the schema was quick, and I didn't have to spend a lot of time making sure that my data is still fine.

Please note that I didn't talk about 'web scale, speed' or all those other things. Under my current hardware I would need to cover 3 countries' transit systems before I need to shard. I am running a single-node replica so I can enable backups.

That's just my view, I wrote this in pieces, so I might appear to be all over the place. I'll write a thought-out post detailing why Mongo is currently working for us/me.


We have a CMS application that supports creating custom web forms, which each have a different set of fields which hold different types of data. Email addresses, multi-select radio buttons, text areas, etc. Some forms only gets submitted once or twice, others are submitted many thousands of times. To store this in a normal relational database you either need many tables, or you need to normalize your data (probably Entity-Attribute-Value (EAV) style). We didn't want hundreds of tables, and designing a good EAV system can be tough (Magento anyone?), so we looked at other options.

We've settled currently on using MongoDb, with one collection to hold the (versioned) definition of each form's fields (data type, order, validation rules, etc), and another collection to hold all of the submissions (this is a laaarge collection). There _is_ a "relation" between the form definition and the submissions, but because you always query for submissions based on the form (by form_id), you don't really need to do "JOINs" (you just query out the form definition once, before the submissions). Also, because the forms are versioned, and each submission is associated with a particular version of a form, there is no need to retroactively update the de-normalized schema of past submissions (although this does limit your ability to query the submissions if a form is updated frequently - or drastically). It's not perfect, but this use case for MongoDb has been working well for us so far.
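
Concretely, reading a form and its submissions looks something like this (pymongo; the names are illustrative, not our actual schema):

    from pymongo import MongoClient

    db = MongoClient()["cms"]

    form = db.form_definitions.find_one({"form_id": "contact-us", "version": 3})
    submissions = db.form_submissions.find(
        {"form_id": "contact-us", "form_version": 3}
    )
    # two queries, no join: every read is scoped to a single form (and version)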

My answer to this prompt was starting to get long, so I actually wrote it up in more detail on my blog (the first update in months!). Some other drawbacks and tradeoffs are included there. Check it out here if you are interested:

http://chilipepperdesign.com/2013/11/11/versioned-data-model...

I would love to hear how other folks have solved similar issues like this? Or if anyone sees a way to improve on our current solution? Feel free to respond here or on the blog post. Cheers


Reminds me of the NYT "Stuffy" app, which was built in a similar way: http://open.blogs.nytimes.com/2010/05/25/building-a-better-s...


A good simple use-case for a document database (could be MongoDB, but not necessarily) is configuration and system "schema" type data. For example, storing all of a user's settings and preferences into a document keyed by the user's Id.


We use it for event storage in the event sourced parts of our app.

For the rest of our data, we're currently migrating off of Mongo to Postgres due to an experience similar to the OP's.


This is ridiculous linkbait bullshit.

Anyone who dismisses document stores entirely has lost all my respect. It wasn't the right solution for your problem, but it might be the right solution for many others.


> but it might be the right solution for many others.

The author made the example of the movie database and explained why it was a good idea when they started, and why it didn't work out. Can you point out an example of data you would store in a document database, which is not purely for caching purposes?


Collecting structured log data like monitors or exception traces or user analytics. Lots of documents, no fixed schema, they're all self-contained with no relations. Map reduce makes query parallelism crazy magic.

A content management system. Some stuff may want data from across relations (who owns this thing, and what is their email), but that's pretty infrequent and having nice flexible-schema documents that contain all relevant information that's being CRUD'ed simplifies things hugely - particularly in MVCC systems like Couch that put stuff like multi-master/offline-online sync and conflict resolution in the set of core expectations.

Edit: That said, Postgres is also MVCC, and hstore makes schema an option the same way that relations and transactions were already, so I think it could do pretty well. I haven't gotten the chance to play with it in recent history, unfortunately.


> Map reduce makes query parallelism crazy magic.

Isn't that only true if you have lots of shards? Otherwise, you have one process doing the mapping.


> Some stuff may want data from across relations (who owns this thing, and what is their email), but that's pretty infrequent

That might be a shaky assumption. Speaking as someone who works on a CMS, content usually has an author, and people accessing that content might be interested in them.


Yeah, but in most of those cases, it's as easy as getting the author based on a key from their content.

It's only when you want joins (e.g. give me all of the titles of all the content and their author's information at the same time) that things get hairy.

Agreed it's not always going to be true for many CMSes. I meant it as a particular CMS, not the general class of CMSes but didn't make that clear at all.


>Can you point out an example of data you would store in a document database, which is not purely for caching purposes?

I've been a pretty vocal critic of document databases in the past[1] (indeed, I get a little bit of a chuckle recalling the prevailing HN wisdom a couple of years ago and comparing it to now), however I recently had a project where added data was immutable and additive and non-relational: MongoDB was the perfect choice, and provided a zero friction, easy to deploy and scale solution.

[1] - http://dennisforbes.ca/index.php/2010/03/24/the-impact-of-ss... -- this went seriously against the prevailing sentiment at the time, and there was this strong "only relics use SQL" sentiment, including here on HN.


> I recently had a project where added data was immutable and additive and non-relational: MongoDB was the perfect choice

Technically that sounds a lot like the TV database example in the OP. MongoDB was the perfect choice until a feature was required that required a relation.


Agreed. Also, this looks way more like a case where the author mis-structured his data for his intended use case, and is blaming the tool instead of the skill level used to implement it. Nesting deeper than one level in a document is rarely going to result in sufficient query capability with respect to the nested items. Even MySQL can't nest rows inside other rows, which is what he seems to have wanted. Maybe he chose MongoDB because he wasn't ready to think around the architecting issues that an SQL-based database would require, which happen to be, although not immediately obvious, similar to those in Mongo.


Despite being a programmer, I believe Sarah is a woman.


The more reason this article should be taken with a grain of salt.


Seriously, dude? It's attitudes like that which makes women hesitant to become coders and computer engineers. It doesn't matter what's between a person's legs; just that a person can code, enjoys doing it, and knows what they're talking about.


Cut the guy some slack, he obviously hates women because his mother named him "Pear Juice". Either that, or he's yet another insecure man-child hiding behind a pseudonym.


For years men were dominant in computer science and engineering and I really see no reason why this should change. Only in recent time with this whole third wave of post-feminism certain groups of females think it is their task to overthrow men supremacy in said fields. They are way too emotional for this profession and this results in drama and absolute shit code in production.

I am not saying the author of the article submitted has fallen victim to the described wrongdoing, I am only saying that at all costs unknown territory should be approached with extreme care. You don't know what is subliminally hidden until you realize. Too late.


This attitude is both hateful and harmful to our profession. It is not ok and I wish more of our peers would step up to tell you that this is not acceptable.

You are also making wildly inaccurate statements to justify your abuse, and I hate to think that other readers might accept them uncritically. We have been actively pushing women out of computer science for the past 30 years (and doing a fine job of excluding and ignoring the contributions of other minority groups in the process). Suggesting that men "were" dominant misrepresents the direction of this change and is willful ignorance of history at best.

I know replying to trolls is not particularly effective, and other users have already called this out for being hateful, but I don't want to see us accept either the premises or the tone presented here.


> They are way too emotional for this profession and this results in drama and absolute shit code in production.

So that is where all of the shit code in production (which is the majority) comes from? It's really insidious, because the commits are made using male names. This must have been going on far longer than I imagined, because I've dealt with really old legacy code that is shit.

Since you made a provable statement, I'm sure we will soon see a tremendous number of papers documenting this coming gynepocalypse of bad code.


Get out of my profession, you repugnant sexist asshole. This kind of rhetoric is not acceptable.


> third wave of post-feminism

Hint: /r/TheRedPill is going to make your life worse, not better.


Oh my god, I did not need to know that existed. I need brain bleach now.


Yeah, stay away from that stuff... can mess you up.


how does it feel to be a piece of shit?


Fuck you.

...Was that emotional enough? Or not emotional enough? It's so hard to tell.


I don't understand why your comment is being buried. It's absolutely idiotic and backwards in substance, sure, but sweeping it under the rug doesn't help anyone.


While your namesake is sweet and delicious, your opinions are questionable.


They say never jump into an argument late... but here goes...

There are a lot of people arguing in favor of polyglot persistence. The arguments sound pretty appealing - hammer, nail, etc. - on the face of it.

But as you dig deeper into reality, you start to realize that polyglot persistence isn't always practically a great idea.

The primary issue to me boils down to safety. This is your data, often the only thing of value in a company (outside the staff/team). Losing it cannot happen. Taking it from there, we stumble across the deeper issue that all databases are extremely difficult to run. DBAs don't get paid $250k/year for nothing. These systems are complex, have deep and far-reaching impacts, and often take years to master.

Given that perspective, I think the decision to use a single database technology for all primary storage needs is totally practical and in fact the only rational choice.


I'm going to reiterate what others have said - this is an area where a good graph database would blow all the others out of the water. I am currently using Neo4j for a web app and find it to be extremely good in terms of performance. There is really only one downside to using a graph database - they are not as horizontally scalable as you might want, and they need a fair bit of resources. But in terms of querying, they would be unparalleled in this particular use case.

They are also not in infancy - they are in use in many places where you wouldn't expect them and which aren't discussed. One big area is network management - at least one major telecom uses a particular graph db to manage nodes in real-time.
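
For anyone curious what that querying looks like: a friends-of-friends lookup is a couple of lines of Cypher. A sketch using the current official Neo4j Python driver, with made-up labels, properties, and connection details (so treat it as illustrative, not a recipe):

    # Hypothetical friends-of-friends query; assumes a local Neo4j instance
    # with a (:User)-[:FRIEND]->(:User) graph already loaded.
    from neo4j import GraphDatabase

    driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "secret"))

    query = """
    MATCH (me:User {id: $uid})-[:FRIEND]->()-[:FRIEND]->(fof:User)
    WHERE fof <> me AND NOT (me)-[:FRIEND]->(fof)
    RETURN DISTINCT fof.name AS suggestion
    """

    with driver.session() as session:
        for record in session.run(query, uid=42):
            print(record["suggestion"])

    driver.close()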


> they are not really scalable horizontally as you might want

Seems like this would be a huge drawback for a project whose entire raison d'etre is horizontal scaling.


This is a well-known and well-documented "downside" of Mongo. Frankly, the analysis in your article is jeopardized by your first line stating, "I am not a database designer". Mongo has downsides that are well known, but there are also very good reasons to use MongoDB. Although it's a lengthy article with good examples, it states nothing more than an obvious, well-documented caveat that Mongo has.


My advice to the OP: Re-jigger this article and retitle it: "The Data-Design of Social Networks". That would be a worthwhile read and I appreciate the detail that the OP goes into.

One of the subheads should be: "Why we picked the wrong data store and how we recovered from it"

And not to be snarky about it, but an alternative title is: Why Diaspora failed: because a Ruby on Rails programmer read an Etsy blog and thought they understood databases


OP should have been using a graph database. Ranting about MongoDB because it doesn't support what it's not designed to support is a bit silly. A RDBMS would have been just as poor of a choice here.


Facebook seems to be doing quite fine by combining SQL and Memcached.


They probably don't use relational databases the way you do for smaller projects that don't need to scale to the millions.


Though the title sounds like link bait, this is actually an eye-opening article for a database layman like me. Very clearly written.

Now, what is MongoDB a good fit for? Most web applications are like the author's example: complex, with lots of inter-relationships. Can someone shed some light?


Like the article says, it can be suitable as a caching layer in front of a DB, especially for web apps that deal in ephemeral JSON documents.


So in other words, they misused MongoDB and because of that are telling people not to use it? Wow. Seems to be a case of "a bad mechanic will always blame his tools".

In the right hands MongoDB can be a great asset. The problem here is that Diaspora chose MongoDB when it was very immature, and it seems the choice was based on hype more than on mapped-out requirements. This is where proper planning for a large-scale application will spot these kinds of problems before they get to the development stage.

Later versions of MongoDB are much better and the upcoming planned changes will take it many steps in the right direction towards being a viable alternative to a traditional RDBMS. Having said that, it's not a silver bullet and MongoDB is not for everything.

10Gen are exceptionally great at marketing Mongo and I get the feeling they have kind of trapped Foursquare who have been using it in production for a couple of years or so now. Having said that with exception of that one 11 hour outage Foursquare encountered, MongoDB seems to be working really well for them and they seem to be capturing more than just check-in data.

I still think with proper planning pulling off a social network with MongoDB is possible. I am currently building a social networking type application, not on the scale of Facebook but it does share some parallels. I've planned and mapped out a viable structure and how it all connects, prototyping and testing seems to indicate that MongoDB is up to the task, but we'll see.


"So in other words, they misused MongoDB and because of that are telling people not to use it? Wow. ... The problem here is that Diaspora chose MongoDB when it was very immature"

So the specific misuse of MongoDB was to use it? I guess that was also the point of the story.


Yes, exactly. They chose the wrong tool for the job; that's not a fault of MongoDB. It's like eating soup with a fork: you'll eventually eat the soup, but if you had used a spoon in the first place there would never have been a problem. Is the company who made the soup to blame, or are you to blame for not consuming it correctly?

The problem with Diaspora was that it was a poorly executed good idea. Had they actually sat down, mapped out their requirements, and chosen an appropriate database, they would have realised that a traditional relational database was the right choice to make.

Databases are hard. For a project as ambitious as Diaspora, proper planning is key, and judging by the issues Diaspora had when it debuted (a delete controller with no auth checking...) it's apparent they had no clue what they were doing, not just from a database-planning point of view but from a code one as well.

I would take whatever anyone who had anything to do with Diaspora had to say with a grain of salt. The real lesson here is to not get caught up in hype and trends. NoSQL isn't a magical solution that will make scaling problems disappear, traditional databases like MySQL have been battle-tested over a very long period of time. If MongoDB and <insert X NoSQL database here> were so great, the likes of Facebook and whatnot would use them on the scale Diaspora tried to use them on.

People have to realise that when Diaspora used MongoDB it was a very early version of the database. A lot has changed since then. Is it suitable as the sole database for a large-scale Facebook clone? Definitely not, but using it for aspects like messaging and notifications I think would definitely be a good use case.


If MongoDB was the wrong tool because it was still too immature, i.e. broken, wasn't the problem not that they were eating soup with a fork, but that they were eating it with a broken spoon?

But yes, they were wrong to use a broken tool for the job.


The best thing about this article is it demonstrates the problem with pg's "The submission must match the title" policy.


"You should never use MongoDB [the way we did]". For some use cases, it would be a terrible decision, as this project learned. In others, it works fine, even with the "eventually consistent" sort of thing. I knew it would go bad when the author immediately started talking about web apps from the get-go, because (as unusual as this may seem to some subset of developers) not everything is a webapp.


I'm not arguing against this article - they seem to have made some poor choices along the road - but to say "you should never use MongoDB" is silly, even if you add "...for relational data". It has its obvious drawbacks, of course, but MongoDB is way more than just a caching layer.

Here's some social networks that are running MongoDB: http://www.mongodb.org/about/production-deployments/#social-...

The list includes Foursquare. They have been running Mongo as their main storage engine for about four years now. They migrated away from MySQL, as discussed in this video:

http://youtu.be/GBauy0o-Wzs?t=2m30s


This should read as one of:

1. Why you shouldn't use things you don't understand.
2. I wish I'd RTFM'd with Mongo.
3. Schema design for MongoDB: what I wish I'd known.


Funny thread!

MongoDB's query performance is typically sub-millisecond, and rarely as much as 5ms. I don't think you can even get through the postgres/mysql parser in 5ms, much less the optimizer, planner, and execution stack. Couple that speed with a dead simple API, and you've got a thing of beauty.

So yeah, if you don't care about ms latencies, if you've got a fixed rectangular schema, if you want to write queries with lots of joins, if you need ACID guarantees ... etc ... then by all means pick your favorite RDBMS.

OTOH, when you need to manipulate relatively self-contained objects quickly, when those objects don't have a fixed schema, when availability is more important to you than immediate consistency, then why would you choose anything other than MongoDB?


> I don't think you can through the postgres/mysql parser in 5ms, much less optimizer, planner, and execution stack.

Yeah... except no. I just set `log_min_duration_statement` to 0, and can see that PostgreSQL typically takes less than 0.1 ms to parse a query.

Quickly parsing a query to come up with an optimized plan is actually a great strength of PostgreSQL, when compared to other RDBMS. MSSQL, for example, has this complex query plan caching mechanism to compensate for its slow parsing. PostgreSQL doesn't need that.

Also, with EXPLAIN ANALYZE, I can see that PostgreSQL typically takes less than 0.1 ms to do an index lookup as well.
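
If anyone wants to reproduce that check, it's only a few lines - sketch below, where the table name is a placeholder and setting log_min_duration_statement per session typically needs superuser:

    # Quick-and-dirty timing check with psycopg2; 'users' is a stand-in table.
    import psycopg2

    conn = psycopg2.connect("dbname=scratch")
    cur = conn.cursor()
    cur.execute("SET log_min_duration_statement = 0")  # log every statement's duration
    cur.execute("EXPLAIN ANALYZE SELECT * FROM users WHERE id = 42")
    for (line,) in cur.fetchall():
        print(line)  # the plan shows the actual time spent on the index lookup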

You seem to believe that MongoDB has some kind of magic that makes it the only database that can perform sub-millisecond queries. 10gen is doing a great job there.


Your argument here is in favour of premature optimisation. You are wagering performance now against the assumptions that your schema will never change, that you will never need joins, and that there are not, and will never be, entities that appear in common at multiple points of the document graph ...


Many comments attacking the author of the original article, but not a single one addressing the arguments.


That's because her arguments don't make a lot of sense. It's like titling a blog post "Never use a hammer!" and following it with line after line of "I'm not a refrigerator repairman but my refrigerator uses Phillips screws so I used a hammer to remove them and it kind of worked but it took far too long to do the job and damaged some of the screws..."

Short of "use the right tool for the job" (which a lot of people here have already said), what do you expect?

Mongo is a document store. If she'd used it to store documents then I think her blog post would have been quite positive.


Her point was that things that often look like documents on the surface (eg. activity streams, TV show listings) are anything but once you dig into the problem area. It's worth remembering that the next time you're faced with something that looks like a document.


If you're right, it would be helpful to point to an unambiguous description of what constitutes the kind of 'document' that mongo is suited for, and what does not.


That's like asking for an unambiguous description of what makes a language good. Even if I were to write a book it would be oversimplified and unsatisfying.

So, OK, here's the oversimplified version... basically, if you have large amorphous chunks of data, easily denormalized, so that very few joins are required for typical queries, then you have a document-friendly dataset. Medical records, court docket entries, things where a SQL representation has many tens of often-NULL columns and only a few foreign keys - those are probably good document-based data.

What Sarah describes is exactly the opposite: small, deeply nested, tightly joined, difficult to denormalize, nuggets of data. Expecting a document-oriented database to handle this is like expecting a SQL database to handle complex graph data: it can be done, but it's slow and the workarounds are not pretty. It doesn't make a lot of sense to complain about that.
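
A contrived example of the "document-friendly" shape, just to make the contrast concrete (all field names here are invented):

    # A happy document: self-contained, amorphous, almost no relations.
    chart_entry = {
        "patient_mrn": "A-10433",           # the one real foreign key
        "encounter_date": "2013-11-11",
        "vitals": {"bp": "118/76", "pulse": 64, "temp_c": 36.8},
        "notes": "Follow-up visit; no changes to medication.",
        "attachments": [
            {"kind": "lab_result", "test": "CBC", "values": {"wbc": 5.9}},
        ],
        # Dozens of optional, often-absent fields that would be mostly-NULL
        # columns in SQL.
        "allergies": ["penicillin"],
    }

Everything needed to render that record lives inside it. The TV-show and activity-stream data in the article keeps pointing out of itself, which is exactly the opposite situation.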

Hope that makes sense.


It makes sense, although Sarah seems to suggest that any linking between documents will lead to the problems she describes.

I am willing to accept that for a sufficiently stable set of 'documents' this might be surmountable, but it does seem like a major limitation of mongo as a general purpose store, and the quick dismissal of her point seems unwarranted.


Her quick dismissal of Mongo is unwarranted. For many datasets it works great.

Here's my question: are the (hundreds? thousands?) of production Mongo deploys wrong, or is Sarah wrong? Given her blog title it's gotta be one or the other.

I'm not using Mongo now, but I'm happy the site I created with it a few years back is still running strong. Once I got used to denormalizing the heck out of everything I had zero complaints. But it was for medical records: easily denormalized, mongo-friendly.

It did lead to this: https://github.com/bronson/valid (probably never get around to polishing it, alas...)


Her dismissal is anything but quick. It's a carefully crafted argument which deserves real rebuttal.

I'm not in any way trying to bash Mongo. I have a consumer-facing production app that has been running it since 2011 with great performance and no problems, but my data model is carefully simplified, and I've been uneasy about how well it would work for a more complex schema.

I found Sarah's analysis useful, but I'd find more precise guidelines for the conditions of what does and doesn't work even more useful.


"Why You Should Never Use MongoDB" is about as quick as it gets. If that's true, there's not much point to reading the rest of the article is there? Don't use Mongo.

You might as well be asking for precise guidelines on the right amount of normalization to have in a SQL schema, or which language or editor to choose. You could spend weeks reading articles and blog posts and come away with a really lopsided view of the subject, or just dive in and figure out what works for you.

I found Sarah's analysis trite and one-dimensional. About as useful as "never eat peanuts" with a lot of scare words about allergies. That's great linkbait, interesting to anyone who's never eaten a peanut before, and might even have a thing or two to be learned. Unfortunately, it overstates its point so far that it just isn't useful in the real world.


The headline is provocative, and certainly not absolutely true. It would be better phrased as 'Why mongodb is unsuitable for most use cases'.

However she then wrote a full piece justifying her position. We might disagree but she provided valuable insight.

Most people here did nothing of the sort. And the argument that 'it would be better to dive in rather than read blog posts' is a generic argument against anyone reading or writing technical blog posts.

Surely you can't really mean that.


Her headline is more accurate than yours. Why attach a reasoned headline to an unreasoned article?

If you truly believe that MongoDB is unsuitable for most use cases then I hope you're basing that on more evidence than this blog post. Citation needed, please. And recognize that your statement doesn't agree with my experience. (Unless you're saying "(insert any database technology) is unsuitable for most use cases" -- that's probably true but not useful.)

So, is it better to dive in rather than read linkbaity one-sided blog posts? Yes. Yes, I mean exactly that. Surely you don't disagree?


I give Diaspora and the author kudos for the effective hatebait title, though I would have preferred "Study the documentation/source of all mission critical components carefully."

Reading through the article, I saw similarities between their issues and ones I've encountered. However, having skeptically studied MongoDB's docs and tested its behavior, I had a very good idea early on that MongoDB preferred idempotent, append-only, and eventually consistent data, with two-phase commit being the nuclear option (and that, boxen-wise, the high-memory ones are preferred).

Regarding denormalized storage, invalidation is indeed a tricky issue. A CQRS approach, with a durable event queue between the command/domain side and an idempotent denormalization system, elegantly yields an exceptional scalability-to-complexity ratio. In the end, I built a performant, flexible app that's been a joy to work on and operate.
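
Hand-waving over the specific queue and stores, the denormalization half of that pattern looks roughly like the sketch below - the event shape, the 'feed_items' collection, and the queue API are all placeholders, not the real system:

    # Sketch of an idempotent read-model updater fed by a durable event queue.
    def apply_event(read_db, event):
        if event["type"] == "comment_added":
            # Upsert keyed on the event id, so replays and retries are harmless.
            read_db.feed_items.update_one(
                {"_id": event["event_id"]},
                {"$set": {
                    "post_id": event["post_id"],
                    "author": event["author_name"],   # denormalized copy
                    "body": event["body"],
                    "created_at": event["created_at"],
                }},
                upsert=True,
            )

    def run(queue, read_db):
        # The queue is durable, so crashing mid-batch just means reprocessing,
        # which the idempotent upsert above absorbs.
        for event in queue:      # stand-in for whatever consumer API is in use
            apply_event(read_db, event)
            queue.ack(event)     # hypothetical acknowledgement call

The useful property is that replaying the queue from any point converges to the same read model, so a denormalized copy that goes wrong is recoverable rather than inconsistent for all time.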

I see pain similar to Diaspora's as a result of "move fast, break things" culture taken to the extreme. No planning or basic research of any kind is expected anymore. I know (from experience watching people pick technologies they barely understand and try to IntelliSense their way to delivery) that I'm in the minority, but I've always preferred to read a book cover-to-cover (or traverse a website depth-first) before adopting a technology in earnest, because I know I don't know what I don't know.


All things aside, the main problem I had with MongoDB when I tried to use it in one of my projects (in Node.js) was the way you query data. I know there are libraries for that, but it's just a pain in the ass to make complex queries - something which would be an easy JOIN in SQL is, in MongoDB, a big pile of callbacks and long, chained method calls. It was just hard to maintain, and the data structure flexibility didn't make it easier, which led to a completely messed-up database.


The whole post hinges on the statement:

"When MongoDB is all you have, it’s a cache with no backing store behind it. It will become inconsistent. Not eventually consistent — just plain, flat-out inconsistent, for all time."

OK, here is a chance to share a little insight. At what point did Mongo become hopelessly inconsistent? Were you ever able to determine why? Why bother with cute pictures and verbose explanations of simple schemas when the conclusion is just that Mongo breaks no matter what without further explanation?


Probably because the project was forced to keep copies of data all over the place, and was distributed to boot. There was probably a reasonable bug or edge case which caused some copies to conflict with each other, and since writes are destructive, it became impossible to reconcile the conflicts.


I thought this statement was confusing as well. I can see how the given schema could cause problems with inconsistency, but there seem to be a few logical steps missing in the jump from that to "Mongo will never be consistent and we need to migrate to mysql".


That's not a property of Mongo, it's a property of any non-trivial caching layer that you can't regenerate. Something will go wrong, and then you're stuffed.


The 'main' app at my place of work uses MongoDB this way; it even implements a 'relationships' collection that is used for one-to-many 'joins' (ugh, just thinking about it makes me feel ill). Unfortunately, I joined when the initial write was nearing completion and was unable to steer the team in the right direction in time.

I just submitted a proposal for a ground-up rewrite; unless it's accepted I'll be leaving promptly.


In fairness to MongoDB, I've seen this done in the relational world. I once worked at a place that had a schema where everything joined through a single table -- "TableRow_TableRow", which had six fields: two IDs (which were varchars) and metadata_1 through metadata_4 (also varchars).

They couldn't understand why it was so slow. But hey, it's super flexible, right?


Link bait title aside, I'm a little bored with these "We don't like x, so you shouldn't use it" articles.

The main downside of MongoDB is that it's new. This means less knowledge of best practices, incomplete or missing support in third party integrations, and feature-lacking tools. It also takes a different approach to architecting systems than you would take when using a SQL approach.


There's also the fact that what they were trying to do in Mongo was not what you should do in Mongo. Use a relational or graph database for data that is best represented as a relation or graph.

Nothing like using a hammer to paint a wall and then saying you should never use a hammer.


Over the past year we have built a significant web application from scratch with MongoDB. We also have a social graph. We build infinite-scroll activity feeds on the fly. We handle multiple writes per second and tens of reads per second. We have hundreds of thousands of users. We have 100 million documents, growing nicely. We've forced square pegs into round holes on occasion, but nothing we were too surprised about.

It seems like you walked into your database technology choice with your eyes fully shut. Given even a modicum of preparation - e.g. reading the MongoDB documentation - you would have recognized the social graph use case as a challenging one for a data store in which relations are unnatural.

Then - because you realized you may have a sub-optimal solution, you optimize it by changing technology. And then decide to join this ridiculous anti-MongoDB internet bandwagon.

For somebody who builds "4-6 web-applications per year" and has deployed "most of the data-stores you've heard about - and some you haven't" this seems surprising.

Or perhaps not, actually.


The most interesting part is to distinguish between your primary data store and your secondary ones. For "friends of friends" queries you could definitely use a graph db; for "tree hierarchical" data a document-based store is good. Secondary dbs are chosen depending on the queries. For analytics, choose columnar dbs or map/reduce-friendly data stores, etc...

But the primary db needs powerful querying capabilities, strong consistency and durability, as well as really good admin tooling. That's the "fallback", to be used as a last resort, but it's also the point of truth of your system.

Then come optimizations, and pipe-like data processing to move data from the primary to the secondary dbs (or to both in parallel).

That's why I've never been fond of using new technologies such as MongoDB or CouchDB or anything like that for my primary db.


There are very few use cases that can be solved with mongo that can't be solved equally well using nosql inspired features of Postgres and other simple tools.

Data often has relational characteristics we don't anticipate. That's why we build flexible schemas. Mongo schemas have none of that flexibility.


+1 on all the linkbait comments. MongoDB was not a good fit for this project; but there are a great many projects where MongoDB is a great fit. If you were dumb enough to just start using MongoDB (or mySQL, or whatever) without matching it to your data model, that's on you.


I was about to go all in with MongoDB (w/ Node) on my next project, including for the user table. But after this, I'm going to have to re-evaluate it.

"Once we figured out that we had accidentally chosen a cache for our database, what did we do about it?"

"The only thing it’s good at is storing arbitrary pieces of JSON. “Arbitrary,” in this context, means that you don’t care at all what’s inside that JSON. You don’t even look. There is no schema, not even an implicit schema"

"I’ve heard many people talk about dropping MongoDB in to their web application as a replacement for MySQL or PostgreSQL. There are no circumstances under which that is a good idea."

"I suggest taking a look at PostgreSQL’s hstore"


"...for relational data" is what the title should be. But that is far less effective linkbait.


What a FANTASTIC write up! So much clarity here.


(1) The author knew social data is a graph topology. (2) Also knew there's no true production-level graph database solution on the market. (I doubt this, but anyway, the author did.) (3) And tried to store it in many ISOLATED trees - what MongoDB offers. (4) Realized that's impossible. (5) Blamed MongoDB for the lack of graph connectivity features. (6) Went back to an RDBMS for consistent graph connectivity.

None of these steps makes sense.

I want to ask the author: did you really know what a GRAPH is?


This seems to be a case of incompetent programmers doing jobs which are way over their heads. Just saying. Well, if it doesn't kill them, they'll learn from it.


Not sure why it has to be one or the other.

It seems perfectly reasonable to store the user's profile information and media in MongoDB and your likes and comments in a relational db.


Putting together a social network schema and keeping performance in check is far from trivial. It's a problem that many have had to tackle, such as Facebook, Twitter, LinkedIn, Flickr, etc. Here is a blog article discussing some of the issues and approaches: http://www.kylehailey.com/facebook-schema-and-peformance/


I hope the Mozilla guys will also learn that lesson. They helped Microsoft kill WebSQL. So instead of SQLite, we have the terrible IndexedDB inside browsers.


TL;DR would be nice =/

All in all, I do not see all the problems you see. We're running Mongo with Elasticsearch at 30k unique pages per day, and we do not have big problems.


This was the best explanation of MongoDB I've read.


How about this for an article: Why you should never use a Microwave.

I tried toasting my toast in the microwave and it didn't work so good.

How about you use a document store for documents, and a graph db for graphs? Of course mongo won't work for data with loads of relationships in it, because it's not meant for that.

Next time take your bread out of the microwave and put it in the toaster - I guarantee you'll get better results.


I wonder why this is such a surprise, and why all this hoopla ... http://docs.mongodb.org/manual/faq/fundamentals/#what-are-ty...

The big advantage of Mongo (as with other NoSQL dbs) is the dynamic schema. Try adding a new column to your SQL database ..
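
To be fair, that comparison is pretty even these days. A rough sketch with made-up table/collection names:

    # Adding a "new column" in each world; both are cheap at reasonable sizes.
    import psycopg2
    from pymongo import MongoClient

    # SQL: one DDL statement. Adding a nullable column without a default is a
    # quick catalog change in PostgreSQL, not a table rewrite.
    pg = psycopg2.connect("dbname=scratch")
    pg.cursor().execute("ALTER TABLE users ADD COLUMN twitter_handle text")
    pg.commit()

    # MongoDB: nothing to declare; just start writing the field
    # (here we backfill it as null on existing docs).
    MongoClient()["scratch"].users.update_many(
        {}, {"$set": {"twitter_handle": None}})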


Many boosters of NoSQL forget that relational database engines are not all the same. They evolve, and are no longer what they were 10 years ago when people started building NoSQL engines. In particular, PostgreSQL is not the same as Oracle or MySQL, and PostgreSQL has been evolving quite a lot in recent years to make it a stronger competitor with NoSQL.


Consider Neo4J.

The fail mode here for MongoDB (complex non-hierarchical relationships) is a win mode for Neo4j. Hierarchies are just a special case of graphs. Use a proper graph database and you can actually represent the relationships based on the domain model, without the clumsiness of hierarchical NoSQL or the even worse clumsiness of an RDBMS.


It's certainly not a one size fits all solution, but saying NEVER use it for anything is a little strong. I've been using it in production now for 6 months on a fairly well trafficked site without a single disruption that wasn't the fault of application layer code.


The article mentions that they considered a graph database, but considered it too niche for production. Is that the general opinion on graph databases (like neo4j) at this point? Not production ready? This project seems like a perfect application for a graph database.


To be fair, I work in SQL constantly, but when someone says "seven-table joins, ugh" all I can think is: really?

Maybe I have spent too much time in the TSQL world, but I regularly see things with 20+ joins without blinking.


The problem described here is about the lack of transactions only... The relational nature of pgsql or mysql is not what makes a difference when you want "all or nothing" insertions or updates...


Like mongo queries but use SQL with node.js? Check out https://github.com/goodybag/mongo-sql


I was never fully sold on the advantages of jumping on the NoSQL bandwagon. This article echoes a lot of the concerns I had when people were first getting excited about MongoDB.


You thought you wanted to use a document store until you realized that redis+postgres was actually what you needed :P


Someone misused MongoDB, and now they're blaming MongoDB for it.

A social network will have a lot of relations -- so you should use a relational database.

Other web apps may not need as many relations (most mobile apps), in which case MongoDB is a superior solution.


you keep using that word "relations". i do not think it means what you think it means.

http://en.wikipedia.org/wiki/Relational_algebra


Is this also the case for all other NoSQL databases, like Cassandra or CouchDB?


This is just a naive article from a naive developer who shouldn't be responsible for choosing data stores in any project. 4-6 projects is a lot? I think you need to get your head out of your a (like most ruby devs ;).


We don't treat well the kind who write 1,000 lines of code for adding two integers over here, mister.


Don't blame it on MongoDB if you are using the wrong tool for the job.


The point of the article was that there isn't a job that MongoDB is the tool for. (See the last example where the model fit perfectly until a feature was needed that blew everything up.)


Wasn't the point actually that even in that second example, they didn't have the foresight to ask the client ahead of time about the need to cross-reference TV shows?

And for the record, how on earth is this MongoDB specific? Almost all NoSQL solutions fall into this same situation.

I was at least expecting more complaints about the early days of mongo's writes not being durable... At least that argument has merit.


If MongoDB promotes itself for the wrong jobs, then the shoe fits.


Absolutely Agree.


>Seven-table joins. Ugh.

Where does this attitude come from in the first place? Even when I was just first learning SQL the notion of doing multiple joins was never off-putting or scary. Quite the contrary, the fact that joining two relations produces a relation which I can then use in more joins seemed like a perfectly elegant abstraction.

>On my laptop, PostgreSQL takes about a minute to get denormalized data for 12,000 episodes

http://www.postgresql.org/docs/9.3/static/sql-createindex.ht...
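
To spell that out: a minute for 12,000 rows almost certainly means missing indexes, not a slow database. Something along these lines usually drops it to milliseconds - the column names below are guessed from the article's TV-show example, so treat it as a sketch:

    # Hypothetical indexes for the shows -> seasons -> episodes lookup.
    import psycopg2

    conn = psycopg2.connect("dbname=tv")
    cur = conn.cursor()
    cur.execute("CREATE INDEX episodes_season_id_idx ON episodes (season_id)")
    cur.execute("CREATE INDEX seasons_show_id_idx ON seasons (show_id)")
    conn.commit()

    # Re-check the plan: the joins should now be index scans, not seq scans.
    cur.execute("""
        EXPLAIN ANALYZE
        SELECT e.title
        FROM shows s
        JOIN seasons  sn ON sn.show_id  = s.id
        JOIN episodes e  ON e.season_id = sn.id
        WHERE s.id = %s
    """, (1,))
    print("\n".join(row[0] for row in cur.fetchall()))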


I'm not sure either, but I think it's the procedural mindset. People learning about relational databases hear that a join is equivalent to a cartesian product, then imagine the nested-loops implementation and think that a k-way join query fundamentally has O(n^k) performance.

They haven't wrapped their brain around the declarative mindset that a join doesn't mean looping any more than multiplication implies a loop of additions.

I think it would help if relational databases acted less like black boxes and exposed worst-case performance guarantees for particular queries. AFAIK no DBMS actually promises that an equi-join on an indexed column has constant-time overhead despite being implemented that way.


While we're dreaming of database ponies, I'd love to see this taken even further - a database that lets you register queries (read-only sprocs) in a language that allows specifying worst-case asymptotic run-time. (Eg, "this should be logarithmic in the size of table foo".) The database would be responsible for establishing any indexes needed to meet the requested performance bounds.


It's not quite what you're asking for, but SQL Server Tuning Advisor will recommend indices. Hell, SSMS recommends a missing index if you display the execution plan for a query. Oracle has a tuning wizard too (though I've no experience with that one).


In my limited experience, the Oracle version is a lot more difficult to use than SSTA. (I think) it requires setting up a sampling of a particular query or table, through Grid, and then analysing the results offline.

I love the "missing index" feature of the explain plan in SSMS also - and missed it heavily when I moved to a shop that does Oracle.


It doesn't help when a co-worker writes a query that just left outer joins every table on the server and uses the where clause to filter out the excess...

(found one of those this morning)


Just make sure they don't find out about recursive common table expressions.


Throw a "distinct" in there for good measure


Joins are potentially explosive (cartesian explosion :). So they have to be treated with respect, just as other explosive substances should be. That doesn't mean they should never be used - modern industries use thousands of dangerous things routinely. But proper handling is necessary - indexes, watching the data consistency, etc. So it is a good instinct to feel uneasy when there are seven joins around. It's not necessarily fatal, but it is something that requires attention.


I've never written an application that didn't involve multiple queries with 7+ joins. There is absolutely no reason that should make anyone feel uneasy, it is perfectly normal and expected.


>On my laptop, PostgreSQL takes about a minute to get denormalized data for 12,000 episodes

I have seen numerous relational databases where people complained they were slow and they had zero indexes. 12k records is nothing to any RDBMS. Which touches on another pet peeve of mine: people talking about how much data they have, so they have to go NoSQL, when they only have thousands or millions of records.


It's because so many of us have seen joins go bad, either due to poor code or other factors. I regularly hear people talk of never doing joins again. Not that I agree with this attitude, but it is certainly out there.


I've been using relational databases for 15 years and I've never "seen joins go bad". I don't even know what that means. I know some people avoid joins, but from my interactions with those people it has always been because they heard the myth that joins are slow, and foolishly figure writing a naive join implementation in PHP will be faster.


Oh, God, an advisory article by people who have just learned what a relational db is. What's next?


LMFAO at startups using hot new technology so they can throw the technical debt back in their willfully ignorant investors' laps. Then the business suits come back thinking "yeah, we want a feature that makes us as useful and powerful as IMDB; that site looks so dated, we'll steal their market easy. Yeah, just add actors. Piece of cake, right? Shouldn't take you more than a day." Hahahahhaha.


I find it fascinating that people misunderstand data so much, even in 2013, that they end up implementing a totally inefficient solution without actual data modeling and design. The most missed point is usually querying. Focusing first on "how do I write data?" can lead to disasters like this. Why doesn't everybody start with the question "how do I query this dataset in this shape?" first?


These days I tend to use SQL for data, and Redis for complex relationships and graphs. Combining those is good enough for me. Usually the relationships are stored as ids in sorted sets, and I just do an id lookup in PostgreSQL.
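
Roughly what that combination looks like, for anyone who hasn't tried it (key names and schema are made up, and this assumes a recent redis-py):

    # Sketch: follower edges in Redis sorted sets, profile rows in PostgreSQL.
    import time
    import redis
    import psycopg2

    r = redis.Redis()
    pg = psycopg2.connect("dbname=app")

    def follow(follower_id, followee_id):
        # Score by timestamp so the set doubles as a chronological edge list.
        r.zadd(f"followers:{followee_id}", {str(follower_id): time.time()})

    def recent_followers(user_id, limit=20):
        ids = r.zrevrange(f"followers:{user_id}", 0, limit - 1)
        if not ids:
            return []
        cur = pg.cursor()
        cur.execute("SELECT id, name, avatar_url FROM users WHERE id = ANY(%s)",
                    ([int(i) for i in ids],))
        return cur.fetchall()

The graph traversal stays in Redis, and the heavy, consistent profile data stays relational behind a plain id lookup.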


I stopped clicking on titles like this... Titles with NEVER/ALWAYS/something in that direction are just there to make you click, and mostly they are not good articles.

Maybe this one is different, I don't know... I didn't click on it...


It is a flaming title, but an actual good read if you want to know what kind of problems you might have when using Mongo.



