Why You Should Never Use MongoDB (sarahmei.com)
568 points by hyperpape on Nov 11, 2013 | 337 comments



> Seven-table joins. Ugh.

What? That's what relational databases are for. And seven is nothing. Properly indexed, that's probably super-super-fast.

This is the equivalent of a C programmer saying "dereferencing a pointer, ugh". Or a PHP programmer saying "associative arrays, ugh".

I think this attitude comes from a similar place as JavaScript-hate. A lot of people have to write JavaScript, but aren't good at JavaScript, so they don't take time to learn the language, and then when it doesn't do what they expect or fit their preconceived notions, they blame it for being a crappy language, when it's really just their own lack of investment.

Likewise, I'm amazed at people who hate relational databases or joins because they never bothered to learn SQL, how indexes work, or how joins work. They discover that their badly-written query is slow and CPU-hogging, and then blame relational databases, when it's really just their own lack of experience.

Joins are good, people. They're the whole point of relational databases. But they're like pointers -- very powerful, but you need to use them properly.
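
To make that concrete, here's a rough sketch (made-up tables and columns, Postgres-flavoured) of the kind of join chain that stays fast when the join columns are indexed:

    -- index the foreign-key columns the joins walk through
    CREATE INDEX idx_orders_customer_id   ON orders (customer_id);
    CREATE INDEX idx_order_items_order_id ON order_items (order_id);

    -- each join step is then an index lookup, not a table scan
    SELECT c.name, o.placed_at, i.sku
    FROM customers c
    JOIN orders o      ON o.customer_id = c.id
    JOIN order_items i ON i.order_id    = o.id
    WHERE c.id = 42;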

(Their only negative is that they don't scale beyond a single database server, but given database server capabilities these days, for most products you'll be very lucky to ever run into this limitation.)


People hate joins because at some point they get in the way of scaling, and getting past that is a huge pain.

Or at least, that's where the original join-hate comes from.

In reality of course, most of us don't have that problem, never had and never will, and it's just being parroted as an excuse for not bothering to understand RDBMSes.

Relational database design is a highly undervalued skill outside the enterprise IT world. Many of the best programmers I've worked with couldn't design a proper database if their lives depended on it.


People hate joins because at some point they get in the way of scaling...

No, in fact, they don't.

Poor relational modeling gets in the way of scaling, and that can be geometrically exacerbated by JOINs. A JOIN, in and of itself, is neither good nor bad. It's just a tool, and like all tools, how you use it is what makes it "good" or "bad" — just like you can build a house or bash in a skull with a hammer.


In most relational database implementations, joins stop scaling after 10-50 million rows or so, assuming an online transactional site.

A time series data warehouse could go into the billions of rows with scalable joins using partitioning and bitmap indices ... but that is also only applicable in the unlikely case you could afford Oracle at $60-90k/CPU list price.

Also, most databases that aren't Oracle don't have high performance materialized views to "preprocess" joins at upsert time, so people resort to demoralized tables and their own custom approach to materializing those views.

Then even denormalized tables begin to stop scaling at around 250 million to 500 million rows. So people resort to sharding managed in a custom way.

I haven't even begun to describe the scalability impact of millions of users on the LRU buffer cache used in most RDBMSes - that is usually resolved through an in-memory cache (Memcached, Redis) whose coherency is also managed in a custom manner. Or you could spend $$$ for Coherence, Gigaspaces, Gemfire, etc., but that's also unlikely in most web companies.

At the end of all this, even if you bought a cache, you wonder why you're using an RDBMS at all since you're so constrained in your administrative approaches. Cue NoSQL.

Of course, in practice many devs ignore all of this history and "design by resume", assuming their new social-mobile-chat-photo-conbobulator will be at Facebook scale tomorrow.


This is not true at all. I've worked on several databases with billions of rows in several tables. A good solution for improving your query performance is to use a multi-column index: http://www.postgresql.org/docs/9.3/static/indexes-multicolum...
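
For instance, a minimal sketch (hypothetical table and columns) of a multi-column index and the kind of query it serves:

    CREATE INDEX idx_events_user_created ON events (user_id, created_at);

    -- the index covers both the equality filter and the range/sort
    SELECT *
    FROM events
    WHERE user_id = 123
      AND created_at >= DATE '2013-01-01'
    ORDER BY created_at;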


What part isn't true?

I'll restate my narrative: Single-instance, normalized, unpartitioned databases run into scaling problems in the several-hundred-million-row range, especially when under heavy concurrent load.

But once you start moving to multi-instance, partitioned databases, you start to lose the benefits of the relational model as most databases have to restrict how you accomplish things -- e.g. joins are severely restricted.


Oracle will handle anything you throw at it, assuming you have the $$$. eBay uses it for 2 petabytes of data:

http://www.dba-oracle.com/oracle_news/news_ebay_massive_orac...


That's a link discussing an analytic database from SEVEN YEARS ago. eBay has moved on.

Please understand what I am saying:

- Traditional database architectures have limitations on what you can express in SQL for highly available and scalable online transaction processing once you introduce partitioning and clustering.

- Oracle has probably the best support for partitioning and clustering out of all RDBMS, but even that has limits in the billions of rows

- Many companies do not use Oracle for business reasons (licensing/sales/pricing practices)

What I am not saying:

- Oracle sucks (it's the most feature-complete and robust RDBMS out there);

- Oracle is not used (Amazon, Yahoo, eBay, etc. all use Oracle in various contexts);

- Oracle does not scale (it does, though it requires you, the SQL developer, to intimately know the database physical design at a certain point of scale, which defeats much of why SQL exists to begin with)


I routinely deal with joins on a 100 million row table and they work just fine. Other than that, I also use a 10 billion row table for searches. This is in Oracle.


I've also used databases with more than 100 million rows in a single table and received realtime query performance in multi-table joins. And this is using SQLite! No expensive DB licenses - but it was using a high-end SAN, since we actually ran thousands of these multi-million-row databases in parallel on the same server.


"demoralized tables" :)


For me, it was a database schema with 38 joins (and 2 additional queries) to effectively get the data to display a single page. For that use case, mirroring the data on save to MongoDB was a no-brainer... with geospatial queries out of the box, and a few other indexing features, it made a lot of sense.

I wouldn't even think to use MongoDB for certain use cases... but for others, it's a great fit. I think that Cassandra, Riak, Couch, Redis and RethinkDB all have serious advantages and disadvantages relative to each other and to SQL.

I do find that MongoDB is a very natural fit for node.js development, but am not shy about using the right tool for a job.

Another thing that tends to irk me is when people use SQL in place of an MQ server.


> For that use case, mirroring the data on save to MongoDB was a no-brainer

I think you just confirmed the OP's point -- MongoDB makes a good cache, not a good primary store. I'm guessing you didn't do updates into that MongoDB store, and always treated the SQL source as "authoritative" when it became necessary. Am I right?


I no longer work at the company in question, but the plan was to displace SQL for the records that were being used in MongoDB, for mongo to become the authority. NOTE: this was for a classified ads site for cars. Financial and account transactions and data would remain in the dbms, but vehicle/listing records would have become mongo-authoritative.

The transition was difficult because of the sheer number of systems that imported/updated listing records in the database... there wasn't yet 100% certainty that all records were tagged on update properly, so that they could be re-exported to mongo... each day, all records were tagged... it took about 24 minutes to replicate the active listings (about 50K records), and we're not talking "Big Data" here, but performance was much better doing search/display queries against MongoDB.


> In reality of course, most of us don't have that problem, never had and never will

Maybe you never had any problems, but I don't believe "most of us" can say the same. For me at least, I've encountered problems derived from join abuse in almost every job I've had.


That's funny, because I've mostly encountered problems with people who prefer to nest SQL queries inside a succession of loops in their code, rather than learn how to use SQL properly.


Yeah, me too. But having said that, I've also seen problems with mongodb and they're much, much, much, much harder to solve.


7 isn't necessarily nothing. Each join is O(log(n)), so I believe you're stuck with O(log(n)^7) as a worst case, although in practice it will probably not be so bad since one of the joins will probably limit the result set significantly.

The other problem is that with 7 joins, that's 7! permutations of possible orders in which the database can perform the join. That's a lot of combinations, and often you can run into the optimizer picking a poor plan. Sometimes it picks a good plan initially, and then as your data set changes it can choose a different, suboptimal plan. This leads to unpredictable performance.
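
At least it's easy to check which plan the optimizer actually chose; a quick sketch in Postgres syntax, with made-up tables:

    -- shows the chosen join order and estimated vs. actual row counts
    EXPLAIN ANALYZE
    SELECT *
    FROM a
    JOIN b ON b.a_id = a.id
    JOIN c ON c.b_id = b.id
    WHERE a.id = 42;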

I think that in practice, you're best off sticking with only a few joins...


If you're regularly doing 7 joins, it's a good sign of an over-normalized database.


Nonsense. It very much depends on the problem domain.


> A lot of people have to write JavaScript, but aren't good at JavaScript, [...] they blame it for being a crappy language, when it's really just their own lack of investment.

I think it's pretty much an accepted fact that JS has its problems. Even Brendan Eich has been quoted as admitting it.

(Note: I am a JS developer myself)


This is true, but the "wtf js is such a fucked up language" meme is outsized compared to the actual problems of javascript. Having worked full time in python for a couple years I could easily show you just as many weird python semantics that will inevitably bite you[1]. I think the grandparent's point has merit, that people expect to invest in their primary language for a project, but when circumstances dictate that they need to use a bit of javascript they find it annoying.

[1] What does this program do?

    print object() > object()


  >>> print(object() > object())
  Traceback (most recent call last):
    File "<stdin>", line 1, in <module>
  TypeError: unorderable types: object() > object()
Yet another reason to upgrade to Python 3! :)


About your code snippet, it prints a "random" boolean value

You're creating two objects at essentially random memory addresses; since they define no comparison methods, CPython 2 falls back to comparing those addresses, so the result is effectively arbitrary (True or False).


That's nothing compared to the "Perl is Satan!!1" meme that us Perl programmers usually have to put up with ;)


Actually, Perl is many different Satans, depending on the particular stylistic quirks of the programmer in question.


TMTOWTDI: There's more than one way to damn it!


Not so much with the usual coding standards (see "Best Practices" book etc).


IMHO, it's well deserved. Start by not having those $, @, % sigils on variables and then we'll talk again about how many more daemons you need to impale.


Sigils are what make Perl stand out, in syntax and behaviour, from (most) other languages, so it would be silly to get rid of them!

i.e. it's a differentiator to what is or isn't Perl.


That one is somewhat reasonable (and relatively obscure code you'd probably never write).

A more realistic example: inner classes can't see class variables from their enclosing classes. (Why enclose classes? - builder pattern)


Every language has its quirks. Python is not perfect either. But JS shows clear signs of bad design decisions, such as the behavior of the == operator.


What == does is pretty simple and easy to understand. If you have a hard time with it, use ===. Problem solved.


The problem is that JavaScript does not exist in isolation, there are other languages that use this operator. If you're familiar with any other C-derived language, the way == acts in JavaScript is very unexpected.

Yes, you can learn to deal with it, but that doesn't mean it wasn't a bad design choice. If both forms of equality are required to be operators, == and === should have been swapped. Too late to do it now, woulda, coulda, shoulda, but it certainly is, IMO, a "bad design" smell in the language, and hardly the only one that still bites people.

Another example: the way 'this' scoping works is similarly busted in that while the rules for it are reasonably straightforward in isolation, it is different enough compared to other languages that share the same basic keywords and syntax that it should have been called something else.

To be fair, I don't think much of this has to do with Brendan Eich being a bad PL designer as much as it has to do with the odd history of JavaScript that is still represented in the name of the language ("Take this client language you made which has no connection to Java, and make it look kinda like Java, please!").


I agree with you to a point. Joins are your friend. But trying to pull out all of the information about a graph of 'objects' using a single query with multiple one-to-many and many-to-many joins is just as foolish in SQL as in Mongo.


Do you have any resources you would recommend to understand or at least give an overview of indexes?

I learned basic SQL once upon a time and understand the relational algebra side of things, but only truly picked up the finer details and specific engines in a piecemeal manner, as needed in various projects.


http://use-the-index-luke.com/

This is a really good resource for understanding how the queries you do relate to the actual actions that the database engine takes.


http://use-the-index-luke.com/ is a great online book (free) targeted at programmers and developers. It's practically required reading in my opinion.



Star, constellation, snowflake, flat... Developers (not the author) would benefit from a database introductory course even if they are not using databases. I think Stanford did one that was open to everyone.


This article ends up agreeing with you at the end, by the way.


When program errors pass silently, that is a legitimate problem in the toolchain.


There is a good reason that relational databases have long been the default data store for new apps: they are fundamentally a hedge on how you query your data. A fully-normalized database is a space-efficient representation of some data model which can be queried reasonably efficiently from any angle with a bit of careful indexing.

Of course, relational databases, being a hedge, are not optimal for anything. For any particular data access pattern there's probably an easy way to make it faster using x or y NoSQL data store. However, as the article points out, before you decide to go that route you'd better be pretty certain that you know exactly how you are going to use your data, today and for all time. You also should probably have some pretty serious known scalability requirements; otherwise it could be premature optimization.

Neither of these things is true for a startup, so I'd say startups should definitely stay away from Mongo unless they really know what they are doing. Being ignorant of SQL and attracted by the "flexibility" of schema-less data stores powered by javascript is definitely the wrong reason to look at Mongo.


I actually started to laugh as I was reading because I knew what problems they were going to run into. I was basically drawing up my schema for a MongoDB app (yes, you still need a schema) when I started scratching my head and started reading through the MongoDB guides. I quickly realised that I should use a relational store, and my problems were solved quickly with Postgres.

The title of this article should honestly be changed, as it does not do MongoDB justice; there are a lot of uses for it, but relational data is not one. Regarding the TV example, this is a classic relational solution and I enjoyed this exact example in a PyCon tutorial, SQL for Python Developers - http://www.youtube.com/watch?feature=player_embedded&v=Thd8y...

I see a lot of people thinking MVP ---> schemaless to save time -----> mongodb, but you will always need a schema unless you are just dumping a list of stuff. I would like to say that another cool solution is an RDF data store; I have been using Fuseki with SPARQL.


I use SPARQL a lot, although not Fuseki. I really like the flexibility it gives in the schema (a flexible schema, not less schema). On top of that, one query language can be used on radically different implementations, i.e. I don't need to change my data model or queries to try different storage models.

We also use BerkeleyDB JE + Lucene for indexing, as well as a number of existing relational databases. Yet, considering the youth of the SPARQL ecosystem (version 1.1 of the standard has only been out since the beginning of this year), there is some fantastic performance possible for both hard and easy queries. I think it will be a bit like Java: not pretty, but fast enough and extremely robust in the long run. With a similar marketing pitch, "Query Once, Store Anywhere".

I also evaluated MongoDB, and I understand the value of a document store. I just don't think that MongoDB is a good document store; imho it's just a slow /dev/null.


> Being ignorant of SQL and attracted by the "flexibility" of schema-less data stores powered by javascript is definitely the wrong reason to look at Mongo.

It's usually the only reason. And 10gen were good at marketing it.


I used it for exactly one production app and it was a huge success. The reason I used it was because the data we needed to represent was actually a document, in this case a representation of fillable form fields in a pdf document. The basic structure was that documents had sections and sections had fields and fields had values, types, formatters, options, etc.

Initially, trying to come up with a schema in SQL was somewhat painful, as what I was really looking for was an object store. Switching to mongo gave me a way to do a very clean, simple solution that worked quite well for the problem at hand (representing PDF forms). That said, we also played it very safe and used mongo for only the document portion, with every other part of the system being in an SQL database. But for the documents, mongo worked really well as a basic object store without the complexity of something like Neo4j.


Of course, a better choice now would be to use PostgreSQL's new JSON support. Postgres has also had XML document types for a long time, though I'm not sure of their indexing story.

If you don't need indexing into the document, you can easily just store it as serialized bytea data. I've done this quite frequently and it works wonderfully.
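
For what it's worth, a rough sketch of what the JSON route can look like in Postgres 9.3 (hypothetical table, untested):

    CREATE TABLE documents (
      id   serial PRIMARY KEY,
      body json NOT NULL
    );

    -- expression index on a field inside the document
    CREATE INDEX idx_documents_title ON documents ((body ->> 'title'));

    SELECT body FROM documents WHERE body ->> 'title' = 'Babylon 5';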


Sorry but PostgreSQL's JSON query syntax is insanely complicated compared to MongoDB.

And that is a big deal for a lot of developers.


I just took a look, and if I'm getting this right, it looks about as simple as it gets:

SELECT json_data FROM people WHERE (json_data->>'age')::int > 15

vs

db.people.find( { age: { $gt: 10 } } )

Personally, I prefer the postgres syntax. It's much clearer. I also don't buy your claims below about performance. Can you provide a real benchmark? Are you running with the safeties off meaning you lose data?


What is the syntax on finding an array value matching some key? E.g. given a user with field "favFoods":[String], how do I determine that pizza is in there?


select ... where "pizza" in person -> favFoods


Sorry I'm stupid, I was thinking Python

select ... where person -> favFood in ("pizza")
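
A hedged, untested sketch of how this might actually look with Postgres 9.3's json functions (made-up column names):

    -- expand the favFoods array to one row per element, then filter
    SELECT p.json_data
    FROM people p,
         json_array_elements(p.json_data -> 'favFoods') AS food
    WHERE food::text = '"pizza"';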


Your ORM will give you whatever syntax is natural, eventually.


For all the talk about how "MongoDB totally has SOME use cases", I've never before heard of a use case where it would be unambiguously better to use a document store. Thanks for explaining that so well.


I've used it well in the past as well.

MongoDB was 5-10x faster than PostgreSQL, Cassandra etc.

If your domain model is structured like a document then MongoDB is a pretty great fit.


Even if it is better, why Mongo? Why not Riak?


In my opinion, Datalog (via Datomic) strikes a good balance between schema flexibility and queryability. It's my preferred way of working with data now, after using Mongo for a year (I was also initially attracted by the flexibility of schema-less data stores).


I think a lot of people (at least in the Node community) love the interfaces provided for it, and specifically Mongoose, which allows you to enforce a schema and thus model relational data.

I'll admit that I originally got into it because I didn't really know SQL, and I'm still not very talented with it, but for example... joins in Mongoose? Say there is a comment... this comment has an author. If that author is of type ObjectId, when I run a query I can do this:

model.comment.find({_id: someId}).populate('author').exec(function (err, result) { /* author will be populated with that author's data instead of just the ID */ })


I use mongodb only when I have to write an app in Node. Not because it's the better solution, but because mongoose is the only mature library on npm that deals with data. The rest are beta at best, and the drivers for RDBMSes are not mature enough.


The mysql driver on npm is indeed mature. Been using it without issue for a couple of years now in a large app. https://npmjs.org/package/mysql


I had thought the mysql drivers were all more mature - or is it that you lean more toward PostgreSQL?


Putting it another way:

Document stores are supposedly "more agile", but by conflating queries, the logical model and the physical structure, they are actually less agile. You've mixed the three things together and ossified around a single model of the domain. When the required view or model changes, you have to write workarounds.


Spot on. MongoDB is a step backwards in abstraction. The RDBMS geeks had this figured out in the 70s. Hence the "relational MODEL" vs "document STORE" naming: a transparent step backwards in generalization.

MongoDB is just locking you into a specific materialization of an ill-specified data structure. PostgreSQL's team hacked out JSON support in about a year or so, since they are working at a higher level of abstraction and could insert the "MongoDB model" at the proper place in their system. Now if you really need to store "documents" in your database and query lazily-defined fields, you can do that for those edge cases (and let's face it, those are edge cases) and use proper relational modeling for the rest of your model.


I once tried to insert a screw using a hammer. I'll be writing my article "Why You Should Never Use A Hammer" shortly.

And here's the crux of the problem with this article, and of so many articles like it:

"When you’re picking a data store"

"a", as in singular. There is no rule in building software that says you have to use 1 tool to do everything.


>I once tried to insert a screw using a hammer. I'll be writing my article "Why You Should Never Use A Hammer" shortly.

More like: "I once tried to insert a screw with a fish-shaped, peanut butter and jelly covered, see-through tv set".


In his defense, the Kickstarter campaign for the TV was impressive.


I always find comparisons to tools disingenuous because people take simple tools (a hammer) and compare them to complex software tools that if you misunderstand can ruin your company.

Your database isn't a hammer. It's closer to a 19th-century industrial machine with hundreds of buttons and levers that will cut your hand off if you use it incorrectly.


I think this is the first post on HN I wish I had a downvote button for, just for the reason you list. There is a reason there are different flavors of databases, and MongoDB most definitely would not be my choice for representing graph like relationships.

It's also scary that it has 217 points because it bashes Mongo.


I think you are missing the point of the article. If you read down to the Epilogue it explains how the "perfect" application still didn't work with MongoDB once the clients started asking for more features.

My read was that even when you think you don't have "graph like relationships" in your data, you actually do.

The original author did say this, but I would like to add: if you don't have "graph like relationships", then your data is pretty trivial and any data store will do.


From another comment I made, on why I don't think this is a good article even granting the proposed thesis of "mongo doesn't work for graph-like relationships":

Even though their data doesn't fit well in a document store, this article smacks so much of "we grabbed the hottest new database on hacker news and threw it at our problem", that any beneficial parts of the article get lost.

The few things that stuck out at me:

* "Some folks say graph databases are more natural, but I’m not going to cover those here, since graph databases are too niche to be put into production." - So you did absolutely no research

* "What could possibly go wrong?" - the one line above the image saying those green boxes are the same gets lost. Give the image a caption, or better yet, use "Friends: User" to indicate type

* "Constructing an activity stream now requires us to 1) retrieve the stream document, and then 2) retrieve all the user documents to fill in names and avatars." - Yep, and since users are indexed by their ids, this is extremely easy.

* "What happens if that step 2 background job fails partway through?" - Write concerns. Or in addition to research, did you not read the mongo documents (write concern has been there at least since 2.2)

Finally, why not post the schemas they used? They make it seem like there are joins all over the place, when what I mainly see is: look at some document, retrieve the users that match an array. Pretty simple mongo stuff, and extremely fast since user ids are indexed. Even though graph databases are better suited for this data, without seeing their schemas I can't really tell why it didn't work for them.

I keep thinking "is it too hard to do sequential asynchronous operations in your code?".


I'm pretty ignorant of MongoDB so I'm genuinely interested in your response: How would you solve the problem in the epilogue, namely "a chronological listing of all of the episodes of all the different shows that actor had ever been in"?

Did Sarah model the data poorly ("We stored each show as a document in MongoDB containing all of its nested information, including cast members").

Or is there an easy way to extract that information that Sarah just doesn't know about yet?

Keep in mind the constraints in the article, for example: some shows have 20,000+ episodes, actors show up in 100s of shows, and "We had no way to tell, aside from comparing the names, whether they were the same person".

The last part seems like a really straightforward relational critique to me. If you don't break the actors out into unique entities then you can't compare them across shows. But if you do break them out into unique entities, then how do you present the show information without doing joins?


  > Did Sarah model the data poorly ("We stored each show as a 
  > document in MongoDB containing all of its nested 
  > information, including cast members").
Yes, they modeled the data poorly.

In this example, we have a TV Show, which is modeled as an entity (document). This TV Show has a list of cast members, each one modeled by a nested object.

In a relational database, this type of relationship would be modeled by having a TV_SHOWS table, a CAST_MEMBERS table with a foreign key to the TV_SHOWS table, and a CASCADE DELETE relationship to ensure that if a TV_SHOW is deleted, the related CAST_MEMBER records are also deleted.
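
A rough sketch of that modeling (hypothetical column names):

    CREATE TABLE tv_shows (
      id    serial PRIMARY KEY,
      title text NOT NULL
    );

    CREATE TABLE cast_members (
      id         serial  PRIMARY KEY,
      tv_show_id integer NOT NULL REFERENCES tv_shows (id) ON DELETE CASCADE,
      name       text    NOT NULL
    );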

This is obviously too strong a relationship between CAST_MEMBERS and TV_SHOWS. (In OO we'd call this a "component" relationship, that is, we're saying that a tv show is composed of cast members, and if we destroy the tv show we destroy the cast members as well.)

They should have modeled CAST_MEMBERS as true entities, by making them documents in their own collection, and storing a list of Cast Member IDs in each TV Show.

  > But if you do break them out into unique entities, then
  > how do you present the show information without doing
  > joins?
You must join, albeit in MongoDB you do this in the application layer, not the database, so:

1. Query the cast members collection to find the cast member id.

2. Query the tv shows collection to find all tv shows with the cast member id in the cast members set.

Those of us who cut our teeth on relational databases have trouble seeing past "two trips to the database" in the above strategy, and that's probably why there's an urge to embed documents rather than to query two collections sequentially. Resist this urge, as it's as bad as the urge to denormalize, i.e. there'd better be a damn good reason to do it.


> This is obviously too strong a relationship between CAST_MEMBERS and TV_SHOWS.

... huh?

> They should have modeled CAST_MEMBERS as true entities, by making them documents in their own collection, and storing a list of Cast Member IDs in each TV Show.

So instead of a one-to-many relationship, they should use a one-to-many relationship expressed in a different notation?


MongoDB doesn't forbid you from having entities and relations. It just doesn't support them in the same way that SQL databases do. Ditto for CouchDB, etc.

You end up having to do some joins yourself still, but this is often appropriate. Imagine that the "actor" entity contains a complete bio, including family history with relationship to other actors, links to wikipedia & fan sites, etc. When you're displaying the page for episode #202 of "Everyone Loves MongoDB", you don't want to retrieve all that data for all the actors. You're not going to display it all on the episode page anyway. Instead, you just need an ID (to href an a and src an img) and probably a small amount of denormalized stuff (name, for the img alt ...). Since that's what you need, that's what you store.

There's a limit to how far you can denormalize schemas before it is no longer helpful. The author explores this limit, and finds that MongoDB doesn't make the limit go away.


You're basically saying: don't use mongo. It's trivial to emulate a blob of data in a relational database; just use a... blob of data. Or any of the many, many other options at your fingertips. Conversely, manually implementing efficient joins is a total hassle and it'll probably end up slow and brittle. At the very least you'll need indexes, and that means an (implied) schema.

So in the normal mongo usecase for storing (as opposed to caching) data with relations, let me see if I can summarize:

- you can have relations, it's just that mongo won't help you deal with them: you need to implement them yourself.

- you can have (actually need) a schema, it's just mongo won't help you deal with that; you'll need to implement that yourself. Have lots of fun with schema-changes, especially because...

- Since you're changing decoupled entities, you need to keep them in sync. You can (and probably should) use transactions, but mongo won't help you with that. You also probably want foreign keys, but mongo won't help you with that either. Migrations on mongo are a special kind of terrifying.

But hey, on the upside, it can store structured blobs, and it's probably hardly any slower than your filesystem, which could do that too.


You could absolutely do the same thing with Postgres (or SQL Server) and computed indexes over JSON (or XML) blobs. Of course, then you'd have exactly the same schema migration issues.

My point was more that a lot of the time, if you structure your data right (and get the right balance of denormalization) you don't need joins very much and so the lack of them isn't really a big disadvantage.


> Keep in mind the constraints in the article, for example: some shows have 20,000+ episodes, actors show up in 100s of shows, and "We had no way to tell, aside from comparing the names, whether they were the same person".

As others have pointed out, it requires two trips to the database. Given their architecture (distributed nodes), network latency is minimal, so this is essentially two calls to the database.

show { _id, title }

actor { _id, appearedIn : [id] }

db.show.find({"title" : "awesomeshow"}, {"_id" : 1})

db.actor.find({"appearedIn" : showId})

Each actor is unique in the database, when you query, you get back unique actors. I'm not sure why they're scared of joins (or multiple queries in mongo).

The question you ask yourself is not whether you're joining, but how often you're joining. If you're not joining often on actors and shows, document databases can work better, since you represent the show and all its episodes without having to join.


Another "issue" occurs to me. It seems likely that the data coming in about TV shows, especially old ones with decades of episodes, would be a bit "dirty". This sort of thing just slides right into a document store, but a relational one would have some problems with that. How do we know e.g. that "Bryan Cranston", "Bryan Lee Cranston", and "Brian Cranston" are the same (or different) actors? Of course these things can be fixed with enough manual (or, even better, user) intervention, but the time and place for that are after you've got the data in the system, not before.


> How do we know e.g. that "Bryan Cranston", "Bryan Lee Cranston", and "Brian Cranston" are the same (or different) actors?

In the USA, the various professional creative guilds enforce uniqueness.

Your general musing is right, but the problem of source-data quality is generally considered to be distinct from the design of schemata.


Yeah, the comment on graph databases seemed a bit too flippant.


I often upvote articles because I'm interested in the discussion. It does not always indicate agreement.


Well said sir. I only skimmed the article, but afaict the author still has not discovered graph stores, an appropriate way to store social graphs.

I remember downloading Diaspora back in the day. The idea behind it was great. But the code looked quite awful and insecure.


From the article:

> Some folks say graph databases are more natural, but I’m not going to cover those here, since graph databases are too niche to be put into production.

Have you used a graph database to good effect? Which one, and for what?

I have a friend who as a learning exercise wrote a toy search engine implementing PageRank — inherently a graph problem. We paired on setting up Neo4j, the only open-source graph database we could find with a working Python API, but found it fiddly and hard to get help. She then switched to SQL (Postgres, I think) and reported faster progress.

Facebook themselves use MySQL[1], so between that and my own first/second-hand experience, I'd call it far from obvious that a graph database is the most appropriate way to store social information. If you're going to criticize the OP for not considering them, it would be nice to offer some justification.

1. https://www.facebook.com/notes/facebook-engineering/mysql-an...


>>Have you used a graph database to good effect? Which one, and for what?<<

I played around with several. But the project never got off the ground due to layoffs that killed projects.

I know lots of people who have implemented graph stores with great success. One example:

http://www.bbc.co.uk/blogs/bbcinternet/2012/04/sports_dynami...

Another is a multibillion-dollar retailer (not sure if it's public, so I'll leave the name out) that uses Stardog to good effect. LOTS more out there.

>>We paired on setting up Neo4j, the only open-source graph database we could find with a working Python API, but found it fiddly and hard to get help.<<

The Graph Stores do seem to play better with Java. Neo's getting a lot of ink these days but they are far from the only game in town.

>>Facebook themselves use MySQL[1], so between that and my own first/second-hand experience, I'd call it far from obvious that a graph database is the most appropriate way to store social information.<<

They aren't using MySQL the traditional way. They undoubtedly would have made different choices had they started when Diaspora did. And they also use TAO, a homegrown graph store of sorts, FYI:

http://dl.acm.org/citation.cfm?id=2213957

It is sitting on top of MySQL at some level, as this is where objects are stored as "source of ultimate truth".

If I recall the quote correctly from when I invited a couple of FB DBAs (pre-Mark Callaghan) to speak at a meetup: "I don't think there's a single join in the Facebook codebase". That might have been a slight exaggeration, but MySQL at Facebook is not there because their recent needs are for a relational db.


Thanks, this is why I read all the bad comments on HN: in hopes of seeing a very informative one like this :)

I'll just point out that this:

> The Graph Stores do seem to play better with Java.

was likely a dealbreaker for Diaspora, since they were a small team without, I'd assume, Java experience. Also the nature of the project virtually requires an open-source database so Stardog would've been out. With SQL you have not one but several free and open-source implementations that are battle-tested and work well with just about any programming language out there. That makes SQL a better choice for many projects even if a graph store would map more neatly onto their problem domain.


True. I think an (even more) ambitious attempt could have attracted core developers to the project, which could have solved all the technical hurdles. That said, I was pretty excited by the idea, and I hope something new along those lines gains momentum one day.


As I recall, FB mostly use MySQL as a glorified K/V store. So I'm not sure if this is a win for relational algebra.


Reddit does that as well with PostgreSQL. It surely doesn't show a win for NoSQL if two of the biggest sites on the internet would rather use traditional SQL RDBMSes as KV stores.


In my experience, MySQL works better as a K/V store than Mongo under load - another point against Mongo for very simple data.


It's more like using a very good screwdriver instead of a Swiss Army knife that does an OK job at everything.

Yes, there is no rule that one tool has to work for everything, but there is a rule in Agile that you should push off making assumptions about the future as far as possible, because you will never know less than you do right now.


I actually liked the article and thought it was interesting. But the title is complete clickbait. It does not even say that "you should never use mongodb"; it points out some situations where MongoDB is a good match. I know a title like "Think carefully about whether MongoDB applies to your case" is not attractive, but it is less sensationalist.


I know very little about MongoDB, or NoSQL in general, but I'm very interested in it. Are there any good sites/articles I should start looking at to see where it would be the right tool?


The difference is that many people are trying to insert a screw with this particular hammer today.


I don't know much about MongoDB, but I've been using a lot of CouchDB for my current project. Am I correctly assuming that MongoDB has no equivalent for CouchDB views? Because if it had, all these scenarios shouldn't be a problem.

Here's how relational lookups are efficiently solved in CouchDB:

- You create a view that maps all relational keys contained in a document, keyed by the document's id.

- Say you have a bunch of documents to look up, since you need to display a list of them. You first query the relational view with all the ids at once and you get back a list of relational keys. Then you query the '_all' view with those relational keys at once and you get a collection of all related documents - all pretty quickly, since you never need to scan over anything (CouchDB basically enforces this by having almost no features that will require a scan).

- If you have multiple levels of relations (analogous to multiple joins in RDBMSes), just extract the keys from the above document collection and repeat the first two steps, updating the final collection. You therefore need two view lookups per relational level.

All this can be done in RDBMs with less code, but what I like about Couch is how it forces me to create the correct indexes and therefore be fast.

However, if my assumption about MongoDB is correct, I have to ask why so many people seem to be using it - it would obviously only be suitable in edge cases.


Spot on about CouchDB. I haven't used MongoDB for anything of decent scale, but I must say I was shocked to read in the OP that they store huge documents like the ones from the Movie example in MongoDB. In CouchDB you can use views to sort of recursively find all of the other documents that your current document has a document ID for. This takes advantage of CouchDB's excellent indexing. I'm not trying to start a CouchDB vs MongoDB war here, but again, I'll just say I'm surprised at the types of documents the OP was storing in MongoDB.


What I still don't understand about MongoDB is where it actually shines compared to Couch. The performance advantage would have to be quite big to offset the loss in flexibility as a general purpose DB. I'm also not trying to start a war but I'd like to get a picture about why Mongo seems to be used more often than Couch.


> What I still don't understand about MongoDB is where it actually shines compared to Couch

Marketing. They shipped with unacknowledged writes for a long time and it made them look really good in write benchmarks. Couch was actually trying to keep your data safe. But it didn't look fast enough, so those who didn't read the fine print on page 195 of the manual, where it tells you how to enable safe data storage for MongoDB, jumped on the bandwagon.

Oh and mugs, always the mugs. I have 3 I think.


My one and only reason to use Mongo over Couch is geo indexes. As far as I can tell this doesn't exist natively in Couch. I'm also not sure how GeoCouch fits in with this.


Cloudant will soon offer geo-spatial queries. It's in beta now: https://cloudant.com/product/cloudant-features/geospatial/.


As the original article said, I think where MongoDB shines is as a glorified, souped-up caching tier, competing directly with Redis, Couchbase, and similar. It's not really a good general-purpose DB.

> I'd like to get a picture about why Mongo seems to be used more often than Couch.

Very good marketing from 10gen on the one hand. On the other, CouchDB is older (and we techies love the new hotness), and the CouchDB/Couchbase split confused a lot of people. Having your original founder found a different and incompatible project with almost the same name but very different goals would cause almost any project to stutter.


> CouchDB/Couchbase split confused a lot of people

Yes, that really didn't inspire confidence in the longevity of the project


Interesting question, I can point out a few interesting differences I know of. Take note, I have more experience with Couch and its ilk than MongoDB, but I know some of Mongo's feature set.

tl;dr: You'd probably see the most difference with how a) the data is distributed and replicated and b) how you query data.

CouchDB (as of the 1.5 release) offers master-master (including offline) replication. It does not offer sharding. Cloudant's BigCouch does implement a dynamo-like distribution system for data that is slated for CouchDB 2.x iirc. Mongo on the other hand does support sharding via mongos, and you can build replica sets within each shard. It does not as far as I know support master-master. This is probably the biggest data-distribution difference between the two.

MongoDB supports a more SQL-like ad hoc querying system, so you could query for drink recipes with 3 or fewer ingredients that have vodka in them, for instance. You'd still need indexes on the data you are querying for performance.

CouchDB queries are facilitated via javascript or erlang map reduce views, which serve as indexes you craft. An additional 'secondary-index' like query facility is to use a lucene plugin and define searchable data. Cloudant has this baked into their offering, and their employees maintain the plugin on github (https://github.com/rnewson/couchdb-lucene)

MongoDB has the ability to do things like append a value to a document array. In Couch, you'd likely read the entire document, append to the array in your app, and put the document back on Couch. It does have an update functionality that can sometimes isolate things more than this, but I haven't seen it used as much. Mongo can also do things like increment counters, while Couch cannot (though CouchBase can).

There's a host of other differences. Mongo has a much broader API, while CouchDB takes a more simple http verb like approach (get, put and delete see the heaviest use). Depending on your situation, one might be a better fit, or you might simply grok one more than the other.

As far as why Mongo gets used more often, I think the closer-to-SQL ad-hoc queries made more sense to people transitioning from stores like MySQL. The CouchDB view/map-reduce stuff is a bit more of a mindset shift (see the View Collation wiki entry for an example of this at http://wiki.apache.org/couchdb/View_collation). CouchDB was also taken hits for being slower than Mongo, but I suspect it was the map-reduce stuff that really steered some folks the other way.


Querying, and querying immediately after insertion. If you want queries right after insertion (which require views), this can be slow in couch. Also, if you want to query but don't want to keep a view around, you have to add one, wait for it to populate (causing a performance hit while it builds, plus while it is up), then remove it.

If you're doing primarily insertion with querying via id, and using views in which stale data is ok, then couch is far superior to mongo. But that's not a use case everyone has.


Also, CouchDB has better safety. Its append-only files allow you to make hot backups and safely pull the plug on your server if need be, without worrying about corrupting data.

Plus change feeds and peer to peer replication are first class citizens in the CouchDB. Once you start having large number of clients needing realtime updates, having to periodically poll for data updates can get very expensive.


Offline capable peer to peer replication was the main reason I chose CouchDB - we needed something that would realistically run on clients, even mobile devices. NoSQL we mainly chose because we needed schema-less data (the whole system relies on ad-hoc design updates). It's basically an information system IDE with rapid application development.


I immediately wondered why Diaspora didn't try CouchDB, since replication seems to be one of the key features they were after.


In Diaspora as it exists now, replication - really, federation - is between pods. There's a protocol for transferring data between pods that is deliberately database-agnostic:

https://wiki.diasporafoundation.org/Federation_protocol_over...

So CouchDB's replication doesn't really help.

If the day comes that any single pod is big enough to need replication between clustered machines within it, then CouchDB should certainly be a contender for storing its data.


It looks like the CouchDB vs. MongoDB debate in the document store world is the equivalent of the PostgreSQL vs. MySQL debate in the relational world.


Not really. They handle querying and aggregation much differently.

For people coming from SQL Server/MySQL/PostgreSQL, the functionality differences between the NoSQL flavors are something they don't expect, and often don't explore. There are a number of heavily used NoSQL solutions because they're focused on specific use areas.


db.find({"field":"value"},{"field":1,"someotherfield":2})

Finds all documents with field having value, returning only field and someotherfield. That part is similar to the map portion of a CouchDB/Couchbase view. No reduce portion though.

If field is what the index is built off of, it should be similar performance wise to a view. Just like views have to be created beforehand, so do mongo indices.

The difference is the find of a mongo document will happen much more quickly after insertion than the find of a couch value by view. Views require rebuild in couch which is not instantaneous.


If I understand couch correctly, it will run all map/reduce functions on a DB after insertion, thus updating all views right away - except if a view has never been queried, in which case it would happen at the first query. I don't quite understand how mongo could do a better job there - do you mean because mongo's indices are less complex than couch views, so the updates after insertion are quicker? I guess if that's the case it would perform better in insertion heavy cases, but then again I could just not use many map/reduce operations in couch and thus reduce the insertion overhead.


Almost right!

For various complicated reasons, CouchDB updates views on read, not on write. So you write some data, then you query a view; CouchDB notices the view is stale, recalculates everything, and then gives you the updated data. That can be a problem if your view is quite heavy, because every time you write, the next read will be slow.

However! You can query with "stale=ok" (which means "just give me the old data, and don't kick off a view update"), and then update your views manually (eg, cron job that hits your view every so many minutes, or if you want to be smarter, a very lightweight daemon that monitors the _changes feed and hits your view every X updates, or whenever some key document is touched, or whatever).


From my tests with couch, the view isn't populated immediately after a document has been inserted, and may take some time. I think I tried this doing insert bulk, wait for view, insert 1, query, but I'd have to double check.


You and Lazare are right, I just checked the documentation and Couch indeed updates on first view query after a write.


Cloudant (based on CouchDB) automatically triggers map-reduce and auto-compacts your database for you. This is my second post about Cloudant - note that I am employed by Cloudant. :)


Make it a product, and I'd be more interested. A number of companies have their own hosting, so the hosting part is not only unneeded, but is also usually a non-starter.


I must have read a dozen (conservative estimate) articles now all called "Why you should never use MongoDB ever" - or permutation thereof. Each and every one of them ought to have been called "I knew fuckall about MongoDB and started writing my software as if it was a full ACID compliant RDBMS and it bit me."

There are essentially two points that always come up:

1. Oh my God it's not relational!

Well, you could argue that if you move from a type of software that is specifically called RELATIONAL Database Management System to one that isn't, one of the things you may lose is relation handling. Do your homework and deal with it.

2. Oh my God it doesn't have transactions!

This is, arguably, slightly less obvious, and in combination with #1 can cause issues. There are practices to work around it, but it is hardly to be considered a surprise.

I keep stumbling on these stories - but still these are the two major issues that are raised. I'm starting to get a bit puzzled by the fact that these things are still considered surprises.

In either case, I'm happily using MongoDB. It has its fair share of quirks and limitations, but it also has its advantages. Learn about the advantages and disadvantages, and try to avoid locking too large parts of your code to the storage backend and you'll be fine.

FWIW, I think the real benefit of MongoDB is flexibility with respect to schema and datamodel changes. It fits very, very well with a development process based on refactoring and minor redesigns when new requirements are defined. I much prefer that over the "guess everything three years in advance" model, and MongoDB has served us well in that respect.


> I must have read a dozen (conservative estimate) articles now all called "Why you should never use MongoDB ever"

Strange statistical oddity, if you ask me, right? How many "don't use PostgreSQL" or "don't use Cassandra" or "don't use SQLite" articles have you seen? Not as many. It is just very odd, isn't it...

So either everyone is crazy or maybe there is something to it. I lean towards the latter here.

> 1. Oh my God it's not relational! ... > 2. Oh my God it doesn't have transactions!

Maybe those, you forget about:

3. Claim "webscale" performance while having a database wide write lock.

4. Until 2 years ago shipped with unacknowledged writes as a default. Talk about craziness. A _data_base product shipping with unacknowledged send-and-pray protocol as a default option. Are you surprised people criticize MongoDB? Because I am not at all. Sorry but after that I cannot let them within 100 feet of my data. They are cool guys perhaps and maybe having beers with them would be fun, but trusting them with data? -- sorry, can't do.


> A _data_base product shipping with unacknowledged send-and-pray protocol as a default option.

MongoDB had a default initial fire-and-forget mentality, but that was on purpose for their initial use cases. Just because someone else uses the tool for a different purpose doesn't mean the software is to blame.

Also, if you're complaining about the default settings and you were running this in production, RTFM.


> MongoDB had a default initial fire-and-forget mentality, but that was on purpose for their initial use cases

Yes, I call that deceitful marketing. It wasn't an accidental bug or an "oops". I don't know how someone can be considered honest or trusted with data when they ship a d_a_t_a_base product with those defaults. Call it random storage for 'gossakes, that would be ok, anything but "database".

> Also, if you're complaining about the default settings and you were running this in production, RTFM

Yes, and I also don't expect to read the fine print on the last page of a manual to enable the brakes when I buy a car. I expect cars to have brakes enabled by default, even if it somehow makes them not go as fast in benchmark tests.


It still boils down to RTFM and don't trust marketeers, right?


Mostly, "don't trust marketeers with you data", which I don't.


Mongo is absolutely TERRIBLE for schema changes. It is a terrible fit for refactorings and minor redesigns. (I have implemented and watched several mongo migrations and refactorings)

Because you have a schema, but mongo doesn't model it, you're left to your own devices to implement the migration. If you have real data, and a normal legacy situation, you can't assume all data will necessarily follow the "schema" you think you have - after all, it's implicit. But that means that writing the migration can be quite tricky. There are no validations, no foreign key checks, no constraints you can use to validate your migration did what you think it did. You'll need to do all that in your own code. This is short for: you're going to be lazy and not check it quite as well as you would have otherwise, and the checks you do implement might be buggy.

Furthermore, if a particular entity does fail a migration... what then? In postgres, which supports transactional DDL, you can roll back schema changes - so even if the last entity failed to migrate because your assumptions were wrong - and even if the validation had to be in code, not in the database - you can revert to the initial situation. In mongo? Uh-oh; you're in trouble. You'd better be working on a copy of your production database; but if you are, that means that your main database needs to be offline or in read-only mode so that writes aren't lost. Does mongo have a read-only flag? By contrast, postgres (and other SQL databases) are transactional and support snapshots - you can do the safe migration with rollback support all while online, for as long as there aren't any conflicts; and when there are, it's detected, and you have a range of options from locking to retries to avoid the conflicts.
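
For illustration, a minimal sketch of what transactional DDL buys you in Postgres (hypothetical migration):

    BEGIN;
    ALTER TABLE shows ADD COLUMN network text;
    UPDATE shows SET network = 'unknown';
    -- run whatever validation you like here; if anything looks wrong:
    ROLLBACK;  -- both the schema change and the data change revert
    -- otherwise: COMMIT;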

In practice, I can't imagine a worse tool for schema and datamodel changes. If your schema change is trivial, it's not so bad; but then, if you're only renaming a property that has no external references to it or adding a property or whatever - sql is trivial too.

Mongodb for schema changes is sort of like writing an automated refactoring for a dynamic-language codebase that's too large to manually inspect, without unit tests or a VCS. You won't necessarily know what goes wrong or even if anything goes wrong; you won't get system support for guaranteeing at least minimal consistency; and if something goes wrong you'll have a corrupted half-way state.


I'm not sure what these two strawmen have to do with the article. Perhaps you've been reading another article?


Given that it's relatively new tech, who is qualified to write about it? Presumably by the end the authors did know a bit. Don't you think such articles, assuming they are objective, might be useful to others who are thinking of dipping their toes in the water?


SQL is actually a ridiculously elegant language for expressing data and relationships. It is NOT a good general-purpose language. So I tend to always favor SQL for relational data, which is actually most data.

Queues, caches, etc. = NoSQL solutions. They tend to have many more features around performance to handle the needs of these problems, but not much in terms of relational data.

If you study relational databases and what they do, you will quickly find the insane amount of work done by the optimizer and the data joiner. That work is not trivial to replicate even on a specific problem, and ridiculously hard to generalize.

And so this article's assertion that mongodb is an excellent caching engine, but a poor data store is very accurate in my eyes.


No. SQL is actually pretty third-rate at expressing data and relationships. My preferred way of expressing data and relationships is the programming language I am writing in.

The problem with SQL is that it is not an API, it's a DSL. Which usually means source-code-in-source-code, string concatenation/injection attacks, and crappy type translations ('I want to store a double, what column type should I use? FLOAT? NUMERIC(16,8)?'). Even as a DSL it's pretty low-brow: just look at how vastly different the syntax is between insert and update, or 'IS NULL'.
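
To make the "source-code-in-source-code" point concrete, here's a rough sketch in Node-flavoured pseudocode (the query() API and the pg-style $1 placeholder are assumptions, not any particular driver):

    var name = request.query.name;   // attacker-controlled input
    // SQL assembled by string concatenation: the classic injection vector
    db.query("SELECT * FROM users WHERE name = '" + name + "'");
    // parameter binding keeps the statement static and ships the value separately
    db.query("SELECT * FROM users WHERE name = $1", [name]);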

For all those who love SQL, consider having to address your filesystem with it. Directories are tables, foreign-keyed to their parent, files are rows. There's a good reason why this nightmare isn't real: APIs are preferred over DSLs for this use case. And so too for databases, because they are the same abstraction.

Don't get me wrong, I love relational algebra and the Codd model, but SQL just ain't it. SQL has survived because of its one and only strength: being cross-platform. And like all cross-platform technologies, such as Java bytecode and JavaScript, its rightful place is as a compilation target for saner, richer, more expressive technologies. This is why I always use an ORM and have vowed to never, ever, write a single line of SQL again.


I like your comparison of SQL to JavaScript. However, personally I love SQL and always use an ORM. My vow is to never have a line of SQL in my application source code. This is perfectly doable with SQLAlchemy, though not with crappy ORM such as ActiveRecord.

Indeed, I blame ActiveRecord for making NoSQL popular. When your ORM doesn't create foreign keys for you, it is a slippery slope to blatant denormalization and eventually NoSQL.

EDIT: The other party to blame would be MySQL with its painfully slow "must-make-a-copy-of-everything" ALTER TABLE.


I like ORMs but in my experience the best approach is to use a hybrid of ORM, views and sprocs. Ideally each sproc will return the results of querying a view or at a minimum the identical columns, then the views become 1st class entities in your ORM like anything else (except for updatable views which I shy away from).

So personally I vow never to write an insert, update or delete again, but I am certainly happy to write queries and tune them if necessary.

The one thing that trumps nosql / denormalisation in my opinion is materialised views. Materialised views are a thing of beauty that allow for the design integrity of normalized data and the performance of denormalised data. It seems most people don't use them / understand them because they use b-grade free database engines.

You never stop hearing people complain about nulls, types/precedence, and joins in SQL, but seriously, it isn't that hard to learn. These are the main things that people complain about and regurgitate endlessly, so a little effort would bring a big reward.


How about a hybrid SQL-builder / data-grouper solution?

Not limited to ORM methods - get the full power of SQL instead. Not string concatenation - get the full power of the language to build queries. Also, the ability to get join results in either flat or grouped form.

For example https://github.com/doxout/anydb-sql (shameless plug)


Nice. This is exactly what I mean when I talk about ORMs. See how everything's nicer when it's an API?


I wrote a bayesian document classifier for one project I was working on - in SQL. Training the system took one INSERT and a small word-split function. Classifying the documents took one SELECT. Even in F#, I couldn't have written a more elegant or more performant solution. In a procedural language it would have been a mess of loops and roundtrips. Good SQL is almost a pure description of your desired result, with none of the "this is how you should do it" cruft.

I don't have anything against ORMS - they're almost mandatory due to O-R-impedance mismatch - but too often I see them used instead of operations that should rightly be server-side. And none of US have injection problems, because we're binding our parameters, right? ;-)


This is a reasonable comment, but nosql databases do nothing to address it. Nor do ORM libraries.


What rubbish. NoSQL is only good for queues and caches? Who on earth uses a database for this?

NoSQL works well when you are modelling your data in ways that fit their particular use cases. Cassandra is great with CQRS, Riak for key/value, MongoDB for documents.


As others have pointed out, this article can basically be summarized as, "don't use MongoDB for data that is largely relational in nature."

Mongo (like most document stores) is good for data that is naturally nested and siloed. An example would be an app where each user account is storing and accessing only their own data. E.g. something like a todo list, or a note-taking app, would be examples where Mongo may be beneficial to use.

A distributed social network, I would have assumed, would be the antithesis of the intended use-case for a document store. I would have to imagine a distributed social network's data would be almost entirely relational. This is what relational databases are for.


> As others have pointed out, this article can basically be summarized as, "don't use MongoDB for data that is largely relational in nature."

Disclaimer: I'm a founder of RethinkDB[1] (which is a distributed document database that does support joins).

The fact that traditional databases use the term "relational" has probably caused more confusion than anything else in software. In this context "relational" doesn't mean "has relationships". The term is just a reference to mathematical relations[2]. This is an important distinction because almost all data has relationships, whether it's hierarchical data, graph data, or more traditional relational data.

To me it's pretty clear that ten years from now, every general purpose document database left standing will support efficient joins. It helps to frame the debate from this perspective.

[1] www.rethinkdb.com [2] http://en.wikipedia.org/wiki/Relation_(mathematics)


Totally agree. I was more using "relational" to mean "cross-relational". I.e. Consider plotting your data on a 2-dimensional space, connecting your "related" data with lines. If your data looks like a spiderweb, probably some graph-type database is most appropriate. If your data resembles an inverted funnel (hierarchical) more than a spiderweb, then a document-store probably is more appropriate. More traditional relational databases are probably more appropriate somewhere in between (which is probably why they're still the most popular type of database being used).

Of course, I can't think of any real-world scenario where your data wouldn't resemble a bit of both. Even very hierarchical data usually has some cross-relationships between un-nested documents, which is why it's still awesome to have a document-store database that supports join-type relationships.


It's funny how people, after all that "NoSQL everywhere, for everything" hype, start discovering that maybe those guys in the 70s were onto something when they invented relational databases, and were not just too stupid to come up with a key-value store. Some data is relational, and answering "why don't you use the latest fashionable NoSQL" with "because our data is relational" is a perfectly fine answer.


Document stores != Key-value stores. Well, I guess they are similar but I prefer to separate DBs like MongoDB and CouchDB from other key-value databases like Riak and Redis.


I think your summary is only half. The other half is, I think, "Think real hard about whether your data is relational or not."


Indeed, the article makes the point that most interesting data is relational, or at any rate contains valuable relations. Discarding efficient relationship management may be a mistake.


"An example would be an app where each user account is storing and accessing only their own data. E.g. something like a todo list, or a note-taking app, would be examples where Mongo may be beneficial to use."

Until you want some analytics.


Many people already build data warehouses for analytics purposes; you don't want to be running reports against your live database if you don't have to. Why add extra load?


> Until you want some analytics.

I can actually respond to this specifically, as we recently had a project that needed us to build some decently-sized and complex analytics into their app. I spent about a month researching how most analytics solutions are structured and work, and became very familiar with the codebase for FnordMetric, which is one such open-source analytics solution.

You wouldn't initially think it (I certainly didn't), but Mongo is actually a great use-case for analytics data. Here's why...

Most analytics platforms don't query live-data and build reports on the fly. It's terribly inefficient and doesn't scale. If something like Google Analytics did this, it'd take forever for your Analytics dashboard to load, especially at their scale.

What most analytics platforms do is know beforehand what data you want to aggregate and at what granularity, perform the calculations (such as incrementing a counter) as events come in, and then store the results in a separate analytics database/table. In fact, there are several presentations and articles about doing things like this with Mongo:

http://blog.mongohq.com/first-steps-of-an-analytics-platform...

http://www.10gen.com/presentations/mongodb-analytics

http://blog.tommoor.com/post/24059620728/realtime-analytics-...

And then, this is an interesting article that discusses the difference between processing data into buckets on the way in, and creating an analytics platform that does more ad-hoc processing on the way out:

http://devsmash.com/blog/mongodb-ad-hoc-analytics-aggregatio...

Let's take something as simple as aggregate pageviews for example (for simplicity's sake, we'll say you want total pageviews for your app, not per-page). Normally you'd think, simple, I'll just store my pageview events, and then when I want to view pageviews, I'll issue a `COUNT` command on the database. Even this gets terribly slow, for a couple reasons:

* You may just have a ton of pageview event entries to query.

* Each pageview has a datetime-stamp, and you have to query not just one `COUNT` query for a given time-range; rather, your analytics dashboard needs to show a graph of counts over time, e.g. pageviews per day for the last week, or pageviews per week for the last year or pageviews per hour for the past day, etc. Each of these would require several distinct COUNT queries (or one more-complex GROUP query), which is even slower, especially for large datasets.

So generally, analytics platforms will have different aggregate buckets for pageviews in the database, which each keep a different granular tally. For example, I'd have a bucket for each day, which keeps a tally of pageviews that day, and a bucket for each week, which tallies pageviews for that week, etc. When a pageview comes in, they'll increment each bucket - which is a really fast process with Mongo, since it has an `$inc` update operator that, combined with upserts, can bump the counters in a bucket with one really fast query.
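
A rough sketch of that bucket pattern in the mongo shell (collection and field names are made up), with one upserted counter document per granularity:

    var day  = ISODate("2013-11-11T00:00:00Z");
    var hour = ISODate("2013-11-11T14:00:00Z");
    db.pageviews_daily.update( { page: "/home", day: day },
                               { $inc: { views: 1 } }, { upsert: true } );
    db.pageviews_hourly.update({ page: "/home", hour: hour },
                               { $inc: { views: 1 } }, { upsert: true } );
    // the dashboard then reads the pre-aggregated buckets instead of counting raw events
    db.pageviews_daily.find({ page: "/home" }).sort({ day: 1 });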

So why is Mongo pretty good for analytics? Because 1) each time-interval bucket is a silo of data for that time-interval, and 2) usually analytics are for patterns and aggregate data, so they don't normally require extremely high reliability (i.e. it's usually okay if an event is dropped here or there).

Of course neither #1 or #2 above are always the case, so this doesn't always apply, but my point was just that Mongo is actually a better fit for analytics than you might imagine.


I haven't done the kind of analytics you're talking about, but it sounds like the implementation is basically a round robin database.


Thanks for the great references.

I really needed some good resources on doing analytics in MongoDB.


But then you can take it out of Mongo into something made for analytics. This is a challenge I'm currently facing, but I feel the flexibility Mongo has offered in letting us iterate on our data collection is paying off in the end.


I'm using CouchDB hopefully for the right reasons. Each user is storing and accessing only their own data. I need that data to be easily stored offline in localstorage in the browser (sqlite/indexedDb not being supported in all browsers), and similar key/value stores for iOS/Android apps. On top of that I need synchronization when the user does come online. This is the type of app you'd want to use on the go as well as on your home computer, so easy synchronization is very important, which the CouchDB changes feed provides.


I haven't used CouchDB yet, but I have a good friend who's an amazing developer, and he swears by CouchDB, mainly for the reasons you mentioned. So, I don't have any context for your app, but it sounds like you picked a well-suited database to me.


That sounds like a good use for CouchDB; I'm doing something similar. CouchDB shines at that stuff (and as a bonus, avoids some of the issues the OP was having with MongoDB. CouchDB views aren't magic, but they're powerful and functional; more than capable of doing some basic joins).


Or maybe something like this: A Graph Database

http://en.wikipedia.org/wiki/Graph_database


Yep, the "graph databases are too niche to be put into production" bit urked me- Neo4j et al are in plenty of large production systems. OTOH maybe, due to the distributed nature of the project, they didn't want to distribute a less-known database?


I guess in 2010 Neo4j did not have as much exposure as it has today. Still, I concur that the author should not brush graph databases aside for something like a social network - they seem a better solution than an RDBMS.


I suspect there isn't actually a lot of need for graph operations in a social network. At least, not in implementing features for the users. A distinctive thing about social networks is that although the users form a network, they are primarily social - they're interacting with their friends.

They will end up interacting with friends of friends via their friends (eg having a flamewar with your cousin's neighbour in the comments of your cousin's post about potatoes), but not with friends of friends of friends or any degree of separation further out. The queries needed are overwhelmingly local, and a boring old relational database will handle them fine.

Where a graph database might shine is in analytics over the whole network, looking for trends, hubs, clusters, etc., although I'm not convinced it would be any better than a relational database which supports recursive queries (as PostgreSQL does). However, this is exactly the sort of privacy-busting awfulness that Diaspora was built to escape from!


I think the distinction is captured in "largely". The author seems to be saying that unless you need only the absolute most minimal relational queries, don't use Mongo. That's more extreme than what I realized (and I can't tell if you're agreeing or not).


> An example would be an app where each user account is storing and accessing only their own data. E.g. something like a todo list, or a note-taking app, would be examples where Mongo may be beneficial to use.

You don't need MongoDB to store todo-list data. My opinion is that on some platforms, like Node.js, where ORMs and RDBMS drivers are not mature, it's quicker to stuff your app with MongoDB rather than a relational database, because they both use JavaScript and JSON data structures. But does MongoDB scale more easily than a MySQL database? Is it even easier to manage? I don't think so.


Relational just means tabular in the context of relational databases. For storing large-scale social networks I would think of specialized graph databases before anything else.


Linkbait title aside, it's actually a helpful example for directing a database novice on when to not use a document store. I could have used this post a few times in the past few years.


Agreed. Despite the unfortunate title, this is an informative, well written and entertaining article that I might refer to in the future. It would be better if there was a followup on when it would in fact be appropriate to introduce a document store to a project.


I agree that the OP is lengthy, and putting together this well-illustrated post is no easy feat. However, I don't think the OP should be the one to write about when you should use a document store.

Maybe I'm too annoyed by the poorly chosen title. Or that I read that entire post and was thinking where's the punchline? On one hand, I credit the author for thinking things through. On the other, the fact that she unequivocally attributes this issue to MongoDB shows that she currently lacks the domain knowledge to consider appropriate use cases. It's not a MongoDB problem, it's a problem inherent to this data structure, and someone more well-versed in this topic would not conflate the issue...just as a decent IT person would not blame "Windoze" for the fact that she can't get good Wifi reception in the office.

OK, to be even more petty...I think what really aggravates me is how the OP says she's not a database expert -- which is a good disclosure, but self-evident -- but attempts to assert authority by saying "I build web applications...I build a lot of web applications"...Uh, OK, so what you're saying is that it's possible to be an experienced web developer and yet be a novice at data design?

If that was the angle of the OP, I'd give it five stars. Such sentiment cannot be overstated.


Well you're right of course that web developers (and business analysts, and politicians, etc.) can absolutely get by for a staggeringly long time with novice-level abilities. That problem is only getting worse as the tools get better. Luckily I don't have to judge the OP on that basis since that's what markets are for.

And maybe someone else, who has tackled enough difficult problems over time to evolve a nuanced and technically informed opinion of various data modeling and management options, should write the response I mentioned. I'd argue there are plenty of examples of that material available already.

The OP, on the other hand, would be writing from the perspective of a professional user who might choose a tool off the shelf at the recommendation of a colleague, and whack it against the problem du jour to see if it works or not. This is a common enough approach that there is at least a chance that a followup would have some value. I can't really expect everyone who makes a living writing web applications to understand CS fundamentals, any more than I would expect it from chemical engineers or physicians. It is nice to be able to point representative members of that audience to an article that resonates with them, and not have to try to translate my opinions into similar language (with or without cat gifs).

Edit: I actually think Journeyman would be a more appropriate term than novice.


Aren't there things other than 'experts' and 'novices'?

It is possible to be an experienced web developer without being an expert at databases, for some reasonable definition of 'expert', sure. I think so anyway. Do you find that aggravating?

Whether it's possible to be an experienced web developer while being a novice at either 'databases' or 'data design' (are those the same thing? you said the second, OP said the first) is open to debate I suppose, but is not implied by the OP.


Even though their data doesn't fit well in a document store, this article smacks so much of "we grabbed the hottest new database on hacker news and threw it at our problem", that any beneficial parts of the article get lost.

The few things that stuck out at me:

* "Some folks say graph databases are more natural, but I’m not going to cover those here, since graph databases are too niche to be put into production." - So you did absolutely no research

* "What could possibly go wrong?" - the one line above the image saying those green boxes are the same gets lost. Give the image a caption, or better yet, use "Friends: User" to indicate type

* "Constructing an activity stream now requires us to 1) retrieve the stream document, and then 2) retrieve all the user documents to fill in names and avatars." - Yep, and since users are indexed by their ids, this is extremely easy.

* "What happens if that step 2 background job fails partway through?" - Write concerns. Or in addition to research, did you not read the mongo documents (write concern has been there at least since 2.2)

Finally, why not post the schemas they used? They make it seem like there are joins all over the place, when I mainly see, look at some document, retrieve users that match an array. Pretty simple mongo stuff, and extremely fast since user ids are indexed. Even though graph databases are better suited for this data, without seeing their schemas, I can't really tell why it didn't work for them.

I keep thinking "is it too hard to do sequential asynchronous operations in your code?".


Really, in some places it hurts

* We stored each show as a document in MongoDB containing all of its nested information, including cast member*

I've seen this in people using MongoDB who bought the BS that because "it's a document store" there should be no links between documents.

People leave their brain at the door, swallow "best practices" without questioning and when it bites them then suddenly it's the fault of technology.

" or using references and doing joins in your application code (double ugh), when you have links between documents"

1) MongoDB offers MapReduce so you can join things inside the DB. 2) What's the problem to have links between documents? Really? Looks like another case of "best practice BS" to me


Links in mongo aren't really links though; it's up to the application to handle the "joins", which really means making an extra query for every linked item. It's like SQL joins except without any of the supporting tools or optimizations that exist in an RDBMS.


Yes, it is manual

But you can query for a list of ids for example, using the 'in' operator and a list. http://docs.mongodb.org/manual/reference/method/db.collectio...
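
Something like this sketch (names invented), which is the two-query "application-side join" being discussed:

    // 1) fetch the activity stream document
    var stream = db.streams.findOne({ _id: streamId });
    // 2) fetch every referenced user in a single $in query, not one query per user
    var users = db.users.find({ _id: { $in: stream.user_ids } }).toArray();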


Isn't this done client-side? Without joins in the DB engine itself, locality is much worse, along with lost opportunities for optimization, leading to much worse performance.


Yes, you have to build the list of IDs to pass to the $in operator and then send out a second query but grandparent post said you had to make an extra query for each linked item which is incorrect.


At mongo training we were told that map/reduce did not offer good performance and to avoid it for online use. You must use the "aggregation framework" instead.


> What's the problem to have links between documents? Really? Looks like another case of "best practice BS" to me

I think the main problem is that it becomes difficult to maintain consistency, due to Mongo's lack of transactions.


Do NOT use MongoDB unless you understand it and how your data will be queried. Joins by ID, like the author mentions, are not a bad thing. If you aren't sure how you are going to query your data, then go with SQL.

With a schemaless store like Mongo, I've found you actually have to think a LOT more about how you will be retrieving your information before you write any code.

SQL can save your ass because it is so flexible. You can have a shitty schema and make it work in the short term until you fix the problem.

I wrote many interactive social apps (fantasy game apps) on Facebook and it worked incredibly well and this was before MongoDB added a lot of things like the aggregation framework.

The speed of development with MongoDB is remarkable. The replica sets are awesome and admin is cake.

It sounds like the author chose it without thinking about their data and querying upfront. I can understand the frustration but it wasn't MongoDB's fault.

This is a big deal for MongoDB: https://jira.mongodb.org/browse/SERVER-142.

Let's say you have comments embedded on a document and you want to query a collection for matches based on a filter. If you do that, you'll get all of the embedded comments back for each match and then have to filter on the client. IMO, when the feature above is added, MongoDB will become more usable for more use cases that web developers see.
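
For what it's worth, a sketch of that limitation (names invented); the $elemMatch projection that exists today only trims the array down to the first matching element, which is why SERVER-142 matters:

    // matching on an embedded array still returns each whole document,
    // with every comment in it - filtering happens client-side
    db.posts.find({ "comments.author": "alice" });
    // $elemMatch projection trims the array, but only to the FIRST match
    db.posts.find({ "comments.author": "alice" },
                  { title: 1, comments: { $elemMatch: { author: "alice" } } });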


I've seen a fair number of articles over the last couple of years comparing the strengths and weaknesses of relational/document-store/graph databases. What I've never seen adequately addressed is why that tradeoff even has to exist. Is there some fundamental axiom like the CAP theorem explaining why a database like MongoDB couldn't implement foreign keys and indexing, or why an SQL couldn't implement document storage to go along with its relational goodness?

In fact, as far as I can tell (never having used it), Postgres's Hstore appears to offer the same advantages as a document store, without sacrificing the ability to add relations when necessary. Where's the downside?


> why an SQL couldn't implement document storage to go along with its relational goodness? (…) Postgres's Hstore appears to offer the same advantages as a document store, without sacrificing the ability to add relations when necessary. Where's the downside?

PostgreSQL can store arbitrary unstructured documents just fine: hstore, json, … Each comes with the possibility to actually index arbitrary fields within the documents using a BTREE index on an expression, and arbitrary documents wholesale using a GIST index.

Besides the need to know a thing or two on query optimization, the only downside I can think of is that ORMs are usually broken (Ruby's Sequel is a notable exception). But this isn't a problem with Postgres itself; it's a problem with ORMs (and training, admittedly).


Typically as your data model complexity ("relatedness") increases, it's more difficult to scale. I'm not sure about anything like CAP, but I do know that in graph-database land we have to remind ourselves that general graph partitioning is NP-Hard, and that our solutions will need to be domain-specific.


But it's not web scale :)

http://www.youtube.com/watch?v=URJeuxI7kHo


>> Some folks say graph databases are more natural, but I’m not going to cover those here, since graph databases are too niche to be put into production.

Is this really true? It sounds like both relational DBs and document DBs are a poor choice for the social network problem posed. I've actually dealt with this exact problem at my last job when we started on Mongo, went to Postgres, and ultimately realized we traded one set of problems for another.

I'd love to see a response blog post from a Graph DB expert that can break down the problem so that laymen like myself can understand the implementation.


At my current employer, we're working on a product that relies heavily on a graph DB (Titan, in this case). Performance characteristics vary dramatically based on the type of query you're trying to run, so you have to be careful about how you use it. There are certain types of things you might do in a relational DB with no worries but that would perform horribly in Titan. The converse is also true, of course. For example, a query along the lines of "give me a list of friends of friends of person X" is very fast indeed on a graph database, whereas a query like "give me a random person" tends to perform horribly. But we've been able to get impressively fast, real-time performance on graphs with millions of vertices and tens of millions of edges. They're still niche products compared to NoSQL systems like Mongo, Redis, etc. But I don't see any reason to think that Titan or Neo4J aren't production ready.

Here's a good intro to Titan and how it works: http://www.slideshare.net/knowfrominfo/titan-big-graph-data-...


I would look at Neo4j. I originally came across it when vetting Grails (it has a Grails plug-in) and it seems to be one of the heavy contenders in terms of a production-ready graph DB. People (this article's author included) seem to say that production-ready graph DBs don't exist. Maybe these projects are still trying to gain traction? I expect some stable builds will be out there soon if they aren't already...

http://www.neo4j.org/


My experience with Neo4j (this year) was abysmal. The take-away I had was: it's only good for very small graphs.

Generally, I'd spend some time writing a script to load data into it, start loading data, respond to it crashing a few hours later, increase the memory available to the process, start up again, and respond to it crashing a few hours later. I was never able to get any reasonably-sized graph[0] working reliably well without using an egregious amount of memory, and knowing that I would continue to face memory issues, I gave up on Neo4j and found another way to solve my problem.

It may be that I simply was not competent at setting it up properly, but no other data store I've worked with has been as hard to get stable over a moderately sized data set. I spoke with some other people who had worked with Neo4j at the time, and they expressed the same issues - they couldn't make it work for any reasonably-sized dataset and had to find another solution.

[0] Not big, mind you, just reasonably-sized. E.g. 4 million nodes, with each node having an average of 5 edges and 2-4 properties.


Hm, I assume you reached out to the mailing list and what not? I know a number of installations with numbers well above that. Were you using the batch insertion API?


No, I'm sure there are some great running instances out there - but I was put off by the difficulty of getting it reliably running without being an expert in its configuration. Additionally, the fact that I'd have to spend at least $12k/year to have only 3 nodes in a cluster, knowing we'd need a lot more than that as time went on sealed the deal.

We found that we could do everything we needed with secondary processing against our document store at runtime for so much less without adding another layer of complexity to the architecture.

Edit: forgot to mention - no, we weren't using batch-insertion in all cases; IIRC, we had issues with duplication and had to do check-if-exists -> create-if-not as we were reading from raw data sources that were heavy with duplicates.


Many heavy duty production customers of Neo4j run with just a 3 node cluster, no need to scale out as with other NoSQL datastores. And actually they replaced larger clusters with a small Neo4j one.

I would love to learn about your Neo4j setup, and the issues in detail, I want to make it easier for people in your circumstances in the future to get quickly up and running with Neo4j in a reliable manner. If you're willing to help out, please drop me an email at michael at neotechnology dot com.


And I remember from flipping through a book on graph DB engines that some can be mounted as an extra layer on top of relational stores, so there is always that backdoor back into it.


Yeah, FlockDB (https://github.com/twitter/flockdb) comes to mind. I think Titan (https://github.com/thinkaurelius/titan) should / will be able to handle this too.


I can attest that Neo4j is production-ready- I know they're being used at companies like Adobe and Cisco, and we were happy with it at Scholrly.


More than 30 of the Global 2000 now use Neo4j in production for a wide range of use cases, many of them surprising, even to us! (And we invented the property graph.)

A partial list of customers can be found below:

www.neotechnology.com/customers

The "too niche" comment might have been true a few years ago. I won't speak for all graph databases, since many are clearly very new and haven't had much time to mature yet. But Neo4j has been in 24x7 production for 10 years, and while the product has of course evolved significantly since then, it'd built on a very solid foundation.

Most of the companies moving to graph databases--speaking for Neo4j, which is what I know about--are doing so because of either a) their RDBMSs not being able to handle the scope & scale of their connected query requirements, and/or b) the immense convenience and speed that comes from modeling domains that are a graph (social, network & data center management, fraud, portfolios, identity, etc.) as a graph, not as tables.

For kicks, you can find a number of customer talks here, from the four (soon five) GraphConnect conferences that were held this year in major cities around the world:

http://watch.neo4j.org/

If you're in London, the last one will be held next week:

www.graphconnect.com

You'll find a summary below of some of the technology behind, with some customer examples.

www.neotechnology.com/neo4j-scales-for-the-enterprise/

One of the world's largest postal delivery services does all of their real-time package routing with Neo4j. Several customers have more than half of the Facebook social graph running 24x7 on a web application with millions of members, running on a Neo4j cluster. Railroads are building routing systems on Neo4j. Some of the world's largest customers are using them for HR and data governance, alternate-path routing, etc. etc.

The best way to really understand why graph databases are amazing is to try. Check out the latest v2.0 M06 beta milestone of Neo4j (www.neo4j.org) which includes a brand-new query environment. I've seen connected queries ("shortest path", "find all dependencies", etc.) that are four lines in the Cypher query language and 50-100 lines in SQL. I've seen queries go from minutes to milliseconds. It's convenient and fast. Glad to see you exploring graphs!


> Is this really true?

Facebook's TAO is a giant graph database built on top of MySQL [1]. I'd say it's pretty production-ready, because Facebook's social graph probably has at least hundreds of vertices.

[1] https://www.facebook.com/notes/facebook-engineering/tao-the-...


There's a difference between ready-for-production and ready-for-production-if-you-have-the-entire-team-of-developers-that-wrote-it-on-hand-all-the-time.


See http://blog.neo4j.org/2013/11/why-graph-databases-are-best-t... on how to use Neo4j for the mentioned Diaspora cases (Neo4j was actually proposed back in 2010 to the team). Comments are very welcome.


Exactly what I thought. Mongo has its purpose. But it's a tree. If your data is a graph with many nodes, it's going to take some elbow grease. Don't use mongo in that case, use something that is built for that, like Neo4j ...


What's wrong with symlinks on a transactional filesystem?


Millions of small files is the worst workload for pretty much every filesystem. Data locality and fragmentation can end up becoming real problems too.


I hate link bait like this.

The real title should be "Why you should never use a tool without investigating its intended use case".


But the point is that there is no use case. Relational databases and normalisation didn't arise because a load of neckbeards wanted bad performance and extra complexity.

The point of the article is that the world is relational, and because Mongo isn't, it'll bite you in the ass eventually. Sure, that's a specialisation of what you said, but still a useful one, as it allows you to immediately know you shouldn't use Mongo (unless your data is all truly non-relational, and you know you'll never integrate it with any relational data, which, without a crystal ball, you can't know, so don't use it).


There is a use case, but internet hype has gotten everyone wanting to use Mongo when there's no real reason to. Postgres scales nearly as well as Mongo while being a lot more flexible. That said, Mongo has some real benefits for non-relational computing (see mapreduce) that could make some of the abstraction headaches and lack of data model flexibility worth it for very large data sets.

But I sort of agree; Mongo tends to be overused by startups who are trying to solve a scalability / performance problem before they have one. In the process they often end up running into data model limitations because stuff moves fast early on and you can't foresee what you'll need in a year.


As soon as you have users, you'll want to handle relationships between users, whether that's outward-facing or for internal analytics. All products have users by definition. Therefore...


Yes. There seem to be a lot of people with quite poor reading comprehension commenting here. The case that the article makes is something like:

1. Document stores are no good for data with non-strictly-hierarchical structure.

2. All interesting data has some non-strictly-hierarchical structure.

The first point is common knowledge nowadays. It's really the second point that is interesting. Moreover, interesting and correct.


Can anyone explain what are some actual real-life good uses for MongoDB?


I was on a team that built a web app for primary school standardized testing. The amount of data presented and collected per student per test is large and perfect for a document store. MapReduce operations allow the app to quickly produce cacheable reports across cross-sections based on requested criteria.

Even the tests themselves are composed of multiple parts that randomize for each student, and lend themselves to the document structure that MongoDB provides. Individualized tests could be assembled from components based on student criteria and stored uniquely for a user as of that time, a thing which would be unnecessarily complex within a relational system.

Could this all have been done with a relational database? Yes, I suppose, but I cringe at the complexity of relating test questions with test answers with users with other data elements ad infinitum using JOINs on both read and write. And this doesn't even touch the topics of sharding and replication, which Mongo made easy in comparison to MySQL or MSSQL.

Choosing MongoDB was the correct decision for this dataset and application. I don't advocate it for every app, but for this one, it was the appropriate fit.


How would you go getting out data that answered a question like "give me the average score for all maths questions of female students aged 6-7"?


The aggregation framework is ideal for answering these kinds of questions. Mongo has a bunch of aggregation routines that are useful for producing reports on demand, but not on the fly. The trade-off is possible because we know that the collected data (test answers) won't change after it is finalized, and the output of any individual report can be cached virtually indefinitely. (See http://docs.mongodb.org/manual/core/aggregation/)
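
As a rough sketch (collection and field names are mine, not the real schema), the parent's question could look something like:

    db.answers.aggregate([
      { $match: { subject: "maths",
                  "student.gender": "F",
                  "student.age": { $gte: 6, $lte: 7 } } },
      { $group: { _id: null, avgScore: { $avg: "$score" } } }
    ]);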

Also keep in mind that unlike something like a web analytics package that gives you the option to filter and sort your data on any combination of criteria imaginable (for no good reason), the questions that academics/educators tend to need the answers to are generally the same for every new set of tests.

In other words, it's not necessary to enable every possible combination, filter, and sort of output, but merely (ha!) to optimize for the specific results that we know we will need (with a nod toward those results we might expect people to want in the future), and to codify the formulae that will produce those results.

Working not with a school but with a testing research company (think "College Board" vs. "Smallville School District") leads you to produce reports that are significantly more detailed and statistically more valuable than "average score" questions like this, all of which is possible within MongoDB. Though obviously this treads a little more into work product than would be comfortable to expound upon here. ;)


Sounds reasonable, but not ideal. Probably some sort of ETL into a data warehouse would be needed for ad-hoc analysis.


That's what the aggregation framework is great for. I don't have the code handy, but the free MongoDB for Developers course covers almost this exact use case.


Something like Imgur might be a good use-case for MongoDB. There's basically no relations between images, so each image can easily be thought of as a lone document.

That said, even if somebody was building something like Imgur, I would still advise that they start with a SQL database. SQL is very well understood, and you will have no problem finding developers that have deep experience in your SQL engine of choice.

More importantly, by the time you hit the point where you need a NoSQL solution to handle scaling issues, you will have achieved product-market fit, and can make a sane technology decision based on your vastly greater understanding of the business needs.


> by the time you hit the point where you need a NoSQL solution to handle scaling issues

See, people keep saying that NoSQL databases give you a performance boost over traditional relational solutions (MySQL and Postgres), but exactly where does this performance boost come from? I can understand the appeal of in-memory databases or using caching (Memcached) to supplement the relational solution, but it seems like the vast majority of Mongo's performance benefits come from eschewing ACID guarantees rather than document databases being inherently faster.


Imgur is set up to have almost the same structure and features as reddit. Users have images, there are sections duplicating the subreddit structure, images have comments, comments have votes and voters. That's a lot of related info that isn't strictly hierarchical, and I believe you'd run into the same problem described in the article - the need to manually do joins and associate types of data in your application code to ensure consistency and lack of duplication.


Except there are relationships in imgur. For example, when it groups images from different subreddits or creates albums.


Seconded. I see a lot of "But there are use cases!" and, so far at least, not even a little bit of "Here's a use case..."


Check my comment history, I've given several use cases. Basically, Mongo is nice if you don't need a lot of relational stuff but do have lots of arbitrary data to store. A good example is time series data where the format changes over time -- often it's not a good idea to go back and convert old data (sometimes it's not even possible). Mongo makes it really easy to support multiple schemas if your business requires it, rather than having to maintain arbitrary numbers of different tables.
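
For example (field names invented), two generations of a reading can live side by side in one collection, and queries can opt into the newer fields:

    db.readings.insert({ sensor: "a1", ts: new Date(), temp_c: 21.4 });              // v1 format
    db.readings.insert({ sensor: "a1", ts: new Date(), temp_c: 21.5, rh: 0.43 });    // v2 adds humidity
    db.readings.find({ rh: { $exists: true } });   // only documents that have the new field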


Sure. But why use Mongo for that, instead of a PostgreSQL table with a JSON column for the data, and perhaps a denormalized version identifier so your application code knows what to do with the format of a given row's data field? I can see a speed argument, but I can't see how that militates for pure Mongo, instead of Mongo as a cache in front of something that provides a reliably (not "eventually") consistent backing store.


Alright, so I am not the only one who's been very curious. Please, someone, write up a real use case where MongoDB is used as the only/main data store, and not as a persistent cache in front of a relational database?

Thanks.


TL;DR: Mongo works for me ('us'), and when I get some time I'll write a post on what we're doing with it, and why it works for us.

I am ('we are') using Mongo for a public transit planner in South Africa. It's not yet production-ready, but beta testing is going well.

Let me paint the picture before I go on to justify our use of Mongo.

In South Africa there are trains, buses, minibuses and other services (metered taxis, shuttles). Trains run on stop-by-stop schedules, all of the bus services run on a normal departure-based schedule, and minibuses are completely dynamic. In order to implement a well-organised integrated planner, you have to view all of these as one 'type' of service. Then there's how pricing is calculated for each of the services. There are many different ways in which pricing is calculated (distance-based, zone-based, pricing matrix, fixed minimum with variable charge, etc.), then there's ticketing, discounts etc. That too we needed to represent in a simple structure.

Now, my SQL is pretty good - I don't frown upon indices or joins - but what I can say is that in my initial implementation of the whole idea, I faced a number of problems:

(1) what level of normalisation/denormalisation is necessary? (i.e. what should I join, what should I keep in the same table)

(2) I'm essentially using a graph, except that nodes aren't always connected, so how do I traverse the graph when it's actually 'broken'? (excuse me if I get the terms wrong, I'm actually a technical accountant, programming is my second love)

(3) I'm working with location data, I can't expect to use the Haversine formula or equivalents, how can I index both locations, and routes? How do I even store routes? (blob of serialised arrays?)

(4) How do I reduce development time, to reduce the amount of time I spend refactoring schemas/code when I want to implement some new 'shiny' feature?

These were my main concerns, as they were the problems that I had with MySQL. PostgreSQL would have done a good job for (1) and (3), and maybe (2), but I was still concerned with (4), as working with PHP/MySQL isn't the friendliest of things. Doing 'in()' queries is one example, as I have to convert an array to a string before using an in() function.

Mongo initially appealed to me because it was marketed as 'schemaless', but even someone with little knowledge as I knew that it should be taken with a grain of salt. The benefit here was that I could store different services with different attributes in one collection. If a service is a minibus, I add all the fields that I need for the minibus, and omit the ones for a train for example. Similarly with the pricing structures.

At first it was difficult grasping the 'store everything in one document' model, but I got the hang of it, and now my schema is 'frozen', so (1) has been solved. I don't use joins or any simulation thereof. Because I store what I need in one document, I don't need to go back to Mongo to find joining data.

(2) A graph database wouldn't work well for me, because even though services link together, there are instances where the commuter will have to walk to join another service. How do I do that? I initially created manual walking links in MySQL but that was naive and stupid (trying to avoid Haversine). Obviously PostgreSQL distance-based queries would also work here. Another thing against a graph database is that my project doesn't just rely on traversing graphs all day, there are other things which I need to do, like analytics.

(3) To be honest, even though I'm confident with my SQL knowledge, all the PostgreS/PostGIS functions felt a bit intimidating at first. I can't afford a $250k/year DBA, so I have to know what I'm doing on the database as well as the client/server. I find learning how to use Mongo to be easy, even though people say their query 'language' is (insert bad word), I find it quite user-friendly. Mongo's geospatial support, and GeoJSON as of 2.4, made things like storing a route, and running typical queries on it, very easy.

One other thing that often gets discounted is the transparency of data in Mongo. Yes, they should use field compression and other space-saving techniques, but until then, I see the benefit in being able to do a db.collection.find() and look at the whole document without having to join any tables (it's just a bit quicker, I guess). Which brings me to (4): the most fun thing that I had to implement was scheduling, bearing in mind that there are scheduled services, and variable/random ones. Let's say I have a separate table for schedules, and I want to find:

- x services,

- that start/pass at [lat,lon],

- which are operating during ab:cd AM,

- which have the next schedule/estimate within m and n.

Sure, you could do it with a JOIN, but why do that when you can do it all from a single document?
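
To give a flavour of what I mean - this is only a rough sketch with made-up field names, assuming a 2dsphere index (available since 2.4 for GeoJSON):

    db.services.ensureIndex({ route: "2dsphere" });
    db.services.find({
      route: { $geoIntersects: { $geometry: { type: "Point",
                                              coordinates: [28.0473, -26.2041] } } },
      "schedule.first_departure": { $lte: "07:30" },
      "schedule.last_departure":  { $gte: "07:30" }
    });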

Lastly, experienced programmers tend to take for granted the benefit of simplifying certain things for novices/beginners. The reason why JSON is taking over, besides that it's a more compact and readable expression than XML, is that it's also easy to work with. Why should I worry about converting associative arrays to strings in order to do an in()? Instead of saying something like "in(array)" directly? Even though I was forward thinking in my schema design, there were a few changes along the road when I realised that something wasn't working. Making a change in the schema was quick, and I didn't have to spend a lot of time making sure that my data is still fine.

Please note that I didn't talk about 'web scale, speed' or all those other things. Under my current hardware I would need to cover 3 countries' transit systems before I need to shard. I am running a single-node replica so I can enable backups.

That's just my view, I wrote this in pieces, so I might appear to be all over the place. I'll write a thought-out post detailing why Mongo is currently working for us/me.


We have a CMS application that supports creating custom web forms, each of which has a different set of fields holding different types of data: email addresses, multi-select radio buttons, text areas, etc. Some forms only get submitted once or twice, others are submitted many thousands of times. To store this in a normal relational database you either need many tables, or you need to normalize your data (probably Entity-Attribute-Value (EAV) style). We didn't want hundreds of tables, and designing a good EAV system can be tough (Magento, anyone?), so we looked at other options.

We've settled currently on using MongoDb, with one collection to hold the (versioned) definition of each form's fields (data type, order, validation rules, etc), and another collection to hold all of the submissions (this is a laaarge collection). There _is_ a "relation" between the form definition and the submissions, but because you always query for submissions based on the form (by form_id), you don't really need to do "JOINs" (you just query out the form definition once, before the submissions). Also, because the forms are versioned, and each submission is associated with a particular version of a form, there is no need to retroactively update the de-normalized schema of past submissions (although this does limit your ability to query the submissions if a form is updated frequently - or drastically). It's not perfect, but this use case for MongoDb has been working well for us so far.
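
In rough terms (names simplified), reading a form's submissions looks like this - the definition is fetched once, then the submissions are queried by form id and version, so no join is ever needed:

    var form = db.form_definitions.findOne({ _id: formId, version: 3 });
    var subs = db.form_submissions.find({ form_id: formId, form_version: 3 })
                                  .sort({ created_at: -1 })
                                  .limit(50);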

My answer to this prompt was starting to get long, so I actually wrote it up in more detail on my blog (the first update in months!). Included there are some other drawbacks and tradeoffs. Check it out here if you are interested:

http://chilipepperdesign.com/2013/11/11/versioned-data-model...

I would love to hear how other folks have solved similar issues like this? Or if anyone sees a way to improve on our current solution? Feel free to respond here or on the blog post. Cheers


Reminds me of the NYT "Stuffy" app, which was built in a similar way: http://open.blogs.nytimes.com/2010/05/25/building-a-better-s...


A good simple use-case for a document database (could be MongoDB, but not necessarily) is configuration and system "schema" type data. For example, storing all of a user's settings and preferences into a document keyed by the user's Id.


We use it for event storage in the event sourced parts of our app.

For the rest of our data, we're currently migrating off of Mongo to Postgres due to an experience similar to the OP's.


This is ridiculous linkbait bullshit.

Anyone who dismisses document stores entirely has lost all my respect. It wasn't the right solution for your problem, but it might be the right solution for many others.


> but it might be the right solution for many others.

The author made the example of the movie database and explained why it was a good idea when they started, and why it didn't work out. Can you point out an example of data you would store in a document database, which is not purely for caching purposes?


Collecting structured log data like monitors or exception traces or user analytics. Lots of documents, no fixed schema, they're all self-contained with no relations. Map reduce makes query parallelism crazy magic.

A content management system. Some stuff may want data from across relations (who owns this thing, and what is their email), but that's pretty infrequent and having nice flexible-schema documents that contain all relevant information that's being CRUD'ed simplifies things hugely - particularly in MVCC systems like Couch that put stuff like multi-master/offline-online sync and conflict resolution in the set of core expectations.

Edit: That said, Postgres is also MVCC, and hstore makes schema an option the same way that relations and transactions were already, so I think it could do pretty well. I haven't gotten the chance to play with it in recent history, unfortunately.


> Map reduce makes query parallelism crazy magic.

Isn't that only true if you have lots of shards? Otherwise, you have one process doing the mapping.


> Some stuff may want data from across relations (who owns this thing, and what is their email), but that's pretty infrequent

That might be a shaky assumption. Speaking as someone who works on a CMS, content usually has an author, and people accessing that content might be interested in them.


Yeah, but in most of those cases, it's as easy as getting the author based on a key from his content.

It's only when you want joins (e.g. give me all of the titles of all the content and their author's information at the same time) that things get hairy.

Agreed it's not always going to be true for many CMSes. I meant it as a particular CMS, not the general class of CMSes but didn't make that clear at all.


>Can you point out an example of data you would store in a document database, which is not purely for caching purposes?

I've been a pretty vocal critic of document databases in the past[1] (indeed, I get a little bit of a chuckle recalling the prevailing HN wisdom a couple of years ago and comparing it to now), however I recently had a project where added data was immutable and additive and non-relational: MongoDB was the perfect choice, and provided a zero friction, easy to deploy and scale solution.

[1] - http://dennisforbes.ca/index.php/2010/03/24/the-impact-of-ss... -- this went seriously against the prevailing sentiment at the time, and there was this strong "only relics use SQL" sentiment, including here on HN.


> I recently had a project where added data was immutable and additive and non-relational: MongoDB was the perfect choice

Technically that sounds a lot like the TV database example in the OP. MongoDB was the perfect choice until a feature was required that required a relation.


Agreed. Also, this looks way more like a case where the author mis-structured his data for his intended use case, and is blaming the tool instead of the skill level used to implement it. Nesting deeper than one level in a document is rarely going to result in sufficient query capability with respect to the nested items. Even MySQL can't nest rows inside other rows, which is what he seems to have wanted. Maybe he chose MongoDB because he wasn't ready to think around the architecting issues that an SQL-based database would require, which happen to be, although not immediately obvious, similar to those in Mongo.


Despite being a programmer, I believe Sarah is a woman.


The more reason this article should be taken with a grain of salt.


Seriously, dude? It's attitudes like that which makes women hesitant to become coders and computer engineers. It doesn't matter what's between a person's legs; just that a person can code, enjoys doing it, and knows what they're talking about.


Cut the guy some slack, he obviously hates women because his mother named him "Pear Juice". Either that, or he's yet another insecure man-child hiding behind a pseudonym.


For years men were dominant in computer science and engineering, and I really see no reason why this should change. Only in recent times, with this whole third wave of post-feminism, have certain groups of females decided it is their task to overthrow male supremacy in said fields. They are way too emotional for this profession, and this results in drama and absolute shit code in production.

I am not saying the author of the submitted article has fallen victim to the described wrongdoings; I am only saying that, at all costs, unknown territory should be approached with extreme care. You don't know what is subliminally hidden until you realize it. Too late.


This attitude is both hateful and harmful to our profession. It is not ok and I wish more of our peers would step up to tell you that this is not acceptable.

You are also making wildly inaccurate statements to justify your abuse, and I hate to think that other readers might accept them uncritically. We have been actively pushing women out of computer science for the past 30 years (and doing a fine job of excluding and ignoring the contributions of other minority groups in the process). Suggesting that men "were" dominant misrepresents the direction of this change and is willful ignorance of history at best.

I know replying to trolls is not particularly effective, and other users have called this out for being hateful, but I don't want to see us accept either the premises or the tone presented here.


> They are way too emotional for this profession and this results in drama and absolute shit code in production.

So that is where all of the shit code in production (which is the majority) comes from? It's really insidious, because the commits are made using male names. This must have been going on far longer than I imagined, because I've dealt with really old legacy code that is shit.

Since you made a provable statement, I'm sure we will soon see a tremendous number of papers documenting this coming gynepocalypse of bad code.


Get out of my profession, you repugnant sexist asshole. This kind of rhetoric is not acceptable.


> third wave of post-feminism

Hint: /r/TheRedPill is going to make your life worse, not better.


Oh my god, I did not need to know that existed. I need brain bleach now.


Yeah, stay away from that stuff... can mess you up.


how does it feel to be a piece of shit?


Fuck you.

...Was that emotional enough? Or not emotional enough? It's so hard to tell.


I don't understand why your comment is being buried. It's absolutely idiotic and backwards in substance, sure, but sweeping it under the rug doesn't help anyone.


While your namesake is sweet and delicious, your opinions are questionable.


They say never to jump into an argument late... but here goes...

There are a lot of people arguing in favor of polyglot persistence. The arguments sound pretty appealing - hammer, nail, etc. - on the face of it.

But as you dig deeper into reality, you start to realize that polyglot persistence isn't always practically a great idea.

The primary issue to me boils down to safety. This is your data, often the only thing of value in a company (outside the staff/team). Losing it cannot happen. Taking it from there, we stumble across the deeper issue that all databases are extremely difficult to run. DBAs don't get paid $250k/year for nothing. These systems are complex, have deep and far-reaching impacts, and often take years to master.

Given that perspective, I think it then makes the decision to use a single database technology for all primary storage needs totally practical and in fact the only rational choice possible.


I'm going to reiterate what others have said - this is an area where a good graph database would blow all the others out of the water. I am currently using neo4j for a web app and find it to be extremely good in terms of performance. There is really only one downside to using a graph database - they are not as horizontally scalable as you might want, and they need a fair bit of resources. But in terms of querying, they would be unparalleled in this particular use case.

They are also not in their infancy - they are in use in many places where you wouldn't expect them and which aren't discussed. One big area is network management - at least one major telecom uses a particular graph DB to manage nodes in real time.
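
To give a feel for the querying side, here's a rough sketch of the kind of query a graph database makes natural for this sort of social data - a user's activity stream pulled straight from relationships, no join tables. It uses the official neo4j Python driver; the labels, relationship types, and connection details are all made up.

    from neo4j import GraphDatabase

    driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

    query = """
        MATCH (me:User {name: $name})-[:FOLLOWS]->(friend:User)-[:POSTED]->(post:Post)
        RETURN friend.name AS author, post.title AS title
        ORDER BY post.created_at DESC
        LIMIT 20
    """

    with driver.session() as session:
        for record in session.run(query, name="alice"):
            print(record["author"], record["title"])

    driver.close()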


> they are not really scalable horizontally as you might want

Seems like this would be a huge drawback for a project whose entire raison d'etre is horizontal scaling.


This is a well-known and well-documented downside of Mongo. Frankly, the analysis in your article is jeopardized by your first line stating, "I am not a database designer". Mongo has downsides that are well known, but there are also very good reasons to use MongoDB. Although it's a lengthy article with good examples, it states nothing more than an obvious caveat of Mongo that is well known and documented.


My advice to the OP: Re-jigger this article and retitle it: "The Data-Design of Social Networks". That would be a worthwhile read and I appreciate the detail that the OP goes into.

One of the subheads should be: "Why we picked the wrong data store and how we recovered from it"

And not to be snarky about it, but an alternative title is: Why Diaspora failed: because a Ruby on Rails programmer read an Etsy blog and thought they understood databases


OP should have been using a graph database. Ranting about MongoDB because it doesn't support what it's not designed to support is a bit silly. A RDBMS would have been just as poor of a choice here.


Facebook seems to be doing quite fine by combining SQL and Memcached.
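
For readers unfamiliar with that combination, the usual pattern is look-aside caching: check memcached first, fall back to SQL, then populate the cache. A rough sketch (obviously not Facebook's actual stack; sqlite3 and pymemcache stand in here, and the table/keys are made up):

    import json
    import sqlite3
    from pymemcache.client.base import Client

    cache = Client(("localhost", 11211))
    db = sqlite3.connect("app.db")

    def get_user(user_id):
        key = "user:%d" % user_id
        cached = cache.get(key)
        if cached is not None:
            return json.loads(cached)          # cache hit: no SQL at all

        row = db.execute("SELECT id, name, email FROM users WHERE id = ?",
                         (user_id,)).fetchone()
        if row is None:
            return None
        user = {"id": row[0], "name": row[1], "email": row[2]}
        cache.set(key, json.dumps(user), expire=300)   # cache for 5 minutes
        return user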


They probably don't use relational databases the way you do for smaller projects that don't need to scale to the millions.


Though the title sounds like link bait, this is actually an eye-opening article for a database layman like me. Very clearly written.

Now, what is MongoDB a good fit for? Most web applications are like the example the author gives: complex and full of inter-relationships. Can someone shed some light?


Like the article says, it can be suitable as a caching layer in front of a DB, especially for web apps that deal in ephemeral JSON documents.
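
One way that ephemeral-JSON-cache idea plays out in practice is a TTL index, which makes MongoDB expire cached documents on its own. A small sketch (collection and field names are made up):

    import datetime
    from pymongo import MongoClient

    cache = MongoClient()["cache"]["api_responses"]

    # Documents are deleted automatically about an hour after created_at.
    cache.create_index("created_at", expireAfterSeconds=3600)

    cache.insert_one({
        "key": "/feeds/user/42",
        "payload": {"items": [], "next": None},
        "created_at": datetime.datetime.utcnow(),
    })

    hit = cache.find_one({"key": "/feeds/user/42"})

Nothing relational needs to live there; if a cached document disappears, the application just rebuilds it from the primary database.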
