Why You Should Never Use MongoDB (sarahmei.com)
568 points by hyperpape on Nov 11, 2013 | 337 comments



> Seven-table joins. Ugh.

What? That's what relational databases are for. And seven is nothing. Properly indexed, that's probably super-super-fast.

This is the equivalent of a C programmer saying "dereferencing a pointer, ugh". Or a PHP programmer saying "associative arrays, ugh".

I think this attitude comes from a similar place as JavaScript-hate. A lot of people have to write JavaScript, but aren't good at JavaScript, so they don't take time to learn the language, and then when it doesn't do what they expect or fit their preconceived notions, they blame it for being a crappy language, when it's really just their own lack of investment.

Likewise, I'm amazed at people who hate relational databases or joins because they never bothered to learn SQL, how indexes work, or how joins work. They discover that their badly-written query is slow and CPU-hogging, and then blame relational databases, when it's really just their own lack of experience.

Joins are good, people. They're the whole point of relational databases. But they're like pointers -- very powerful, but you need to use them properly.
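
To make that concrete, here's a rough sketch (made-up tables and columns, Postgres-flavoured) of the kind of join chain that stays fast when the join columns are indexed:

    -- index the foreign-key columns the joins walk through
    CREATE INDEX idx_orders_customer_id   ON orders (customer_id);
    CREATE INDEX idx_order_items_order_id ON order_items (order_id);

    -- each join step is then an index lookup, not a table scan
    SELECT c.name, o.placed_at, i.sku
    FROM customers c
    JOIN orders o      ON o.customer_id = c.id
    JOIN order_items i ON i.order_id    = o.id
    WHERE c.id = 42;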

(Their only negative is that they don't scale beyond a single database server, but given database server capabilities these days, for most products you'll be very lucky to ever run into this limitation.)


People hate joins because at some point they get in the way of scaling, and getting past that is a huge pain.

Or at least, that's where the original join-hate comes from.

In reality of course, most of us don't have that problem, never had and never will, and it's just being parroted as an excuse for not bothering to understand RDBMSes.

Relational database design is a highly undervalued skill outside the enterprise IT world. Many of the best programmers I've worked with couldn't design a proper database if their lives depended on it.


People hate joins because at some point they get in the way of scaling...

No, in fact, they don't.

Poor relational modeling gets in the way of scaling, and that can be geometrically exacerbated by JOINs. A JOIN, in and of itself, is neither good nor bad. It's just a tool, and like all tools, how you use it is what makes it "good" or "bad" — just like you can build a house or bash in a skull with a hammer.


In most relational database implementations, joins stop scaling after 10-50 million rows or so, assuming an online transactional site.

A time series data warehouse could go into the billions of rows with scalable joins using partitioning and bitmap indices ... but that is also only applicable in the unlikely case you could afford Oracle at $60-90k/CPU list price.

Also, most databases that aren't Oracle don't have high performance materialized views to "preprocess" joins at upsert time, so people resort to demoralized tables and their own custom approach to materializing those views.

Then even denormalized tables begin to stop scaling at around 250 million to 500 million rows. So people resort to sharding managed in a custom way.

I haven't even begun to describe the scalability impact of millions of users on the LRU buffer cache used in most RDBMSes - that is usually resolved through an in-memory cache (Memcached, Redis) whose coherency is also managed in a custom manner. Or you could spend $$$ for Coherence, Gigaspaces, Gemfire, etc., but that's also unlikely in most web companies.

At the end of all this, even if you bought a cache, you wonder why you're using an RDBMS at all since you're so constrained in your administrative approaches. Cue NoSQL.

Of course, in practice many devs ignore all of this history and "design by resume", assuming their new social-mobile-chat-photo-conbobulator will be at Facebook scale tomorrow.


This is not true at all. I've worked on several databases with billions of rows in several tables. A good solution for improving your query performance is to use a multi-column index: http://www.postgresql.org/docs/9.3/static/indexes-multicolum...
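
For instance, a minimal sketch (hypothetical table and columns) of a multi-column index and the kind of query it serves:

    CREATE INDEX idx_events_user_created ON events (user_id, created_at);

    -- the index covers both the equality filter and the range/sort
    SELECT *
    FROM events
    WHERE user_id = 123
      AND created_at >= DATE '2013-01-01'
    ORDER BY created_at;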


What part isn't true?

I'll restate my narrative: Single-instance, normalized, unpartitioned databases run into scaling problems in the several-hundred-million-row range, especially when under heavy concurrent load.

But once you start moving to multi-instance, partitioned databases, you start to lose the benefits of the relational model as most databases have to restrict how you accomplish things -- e.g. joins are severely restricted.


Oracle will handle anything you throw at it, assuming you have the $$$. eBay uses it for 2 petabytes of data:

http://www.dba-oracle.com/oracle_news/news_ebay_massive_orac...


That's a link discussing an analytic database from SEVEN YEARS ago. eBay has moved on.

Please understand what I am saying:

- Traditional database architectures have limitations on what you can express in SQL for highly available and scalable online transaction processing once you introduce partitioning and clustering.

- Oracle has probably the best support for partitioning and clustering out of all RDBMS, but even that has limits in the billions of rows

- Many companies do not use Oracle for business reasons (licensing/sales/pricing practices)

What I am not saying:

- Oracle sucks (it's the most feature-complete and robust RDBMS out there);

- Oracle is not used (Amazon, Yahoo, eBay, etc. all use Oracle in various contexts);

- Oracle does not scale (it does, though it requires you, the SQL developer, to intimately know the database physical design at a certain point of scale, which defeats much of why SQL exists to begin with)


I routinely deal with joins on a 100 million row table and they work just fine. Other than that, I also use a 10 billion row table for searches. This is in Oracle.


I've also used databases with more than 100 million rows in a single table and received realtime query performance in multi-table joins. And this is using SQLite! No expensive DB licenses - but it was using a high-end SAN, since we actually ran thousands of these multi-million-row databases in parallel on the same server.


"demoralized tables" :)


For me, it was a database schema with 38 joins (and 2 additional queries) to effectively get the data to display a single page. For that use case, mirroring the data on save to MongoDB was a no-brainer... with geospatial queries out of the box, and a few other indexing features, it made a lot of sense.

I wouldn't even think to use MongoDB for certain use cases... but for others, it's a great fit. I think that Cassandra, Riak, Couch, Redis and RethinkDB all have serious advantages and disadvantages relative to each other and to SQL.

I do find that MongoDB is a very natural fit for node.js development, but am not shy about using the right tool for a job.

Another thing that tends to irk me is when people use SQL in place of an MQ server.


> For that use case, mirroring the data on save to MongoDB was a no-brainer

I think you just confirmed the OP's point -- MongoDB makes a good cache, not a good primary store. I'm guessing you didn't do updates into that MongoDB store, and always treated the SQL source as "authoritative" when it became necessary. Am I right?


I no longer work at the company in question, but the plan was to displace SQL for the records that were being used in MongoDB, for mongo to become the authority. NOTE: this was for a classified ads site for cars. Financial and account transactions and data would remain in the dbms, but vehicle/listing records would have become mongo-authoritative.

The transition was difficult because of the sheer number of systems that imported/updated listing records in the database... there wasn't yet 100% certainty that all records were tagged on update properly, so that they could be re-exported to mongo... each day, all records were tagged... it took about 24 minutes to replicate the active listings (about 50K records), and we're not talking "Big Data" here, but performance was much better doing search/display queries against MongoDB.


> In reality of course, most of us don't have that problem, never had and never will

Maybe you never had any problems, but I don't believe "most of us" can say the same. For me at least, I've encountered problems derived from join abuse in almost every job I've had.


That's funny, because I've mostly encountered problems with people who prefer to nest SQL queries inside a succession of loops in their code, rather than learn how to use SQL properly.


Yeah, me too. But having said that, I've also seen problems with mongodb and they're much, much, much, much harder to solve.


7 isn't necessarily nothing. Each join is O(log(n)), so I believe you're stuck with O(log(n)^7) as a worst case, although in practice it will probably not be so bad since one of the joins will probably limit the result set significantly.

The other problem is that with 7 joins, that's 7! permutations of possible orders in which the database can perform the join. That's a lot of combinations, and often you can run into the optimizer picking a poor plan. Sometimes it picks a good plan initially, and then as your data set changes it can choose a different, suboptimal plan. This leads to unpredictable performance.
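
At least it's easy to check which plan the optimizer actually chose; a quick sketch in Postgres syntax, with made-up tables:

    -- shows the chosen join order and estimated vs. actual row counts
    EXPLAIN ANALYZE
    SELECT *
    FROM a
    JOIN b ON b.a_id = a.id
    JOIN c ON c.b_id = b.id
    WHERE a.id = 42;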

I think that in practice, you're best off sticking with only a few joins...


If you're regularly doing 7 joins, it's a good sign of an over-normalized database.


Nonsense. It very much depends on the problem domain.


> A lot of people have to write JavaScript, but aren't good at JavaScript, [...] they blame it for being a crappy language, when it's really just their own lack of investment.

I think it's pretty much an accepted fact that JS has its problems. Even Brendan Eich has been quoted as admitting it.

(Note: I am a JS developer myself)


This is true, but the "wtf js is such a fucked up language" meme is outsized compared to the actual problems of javascript. Having worked full time in python for a couple years I could easily show you just as many weird python semantics that will inevitably bite you[1]. I think the grandparent's point has merit, that people expect to invest in their primary language for a project, but when circumstances dictate that they need to use a bit of javascript they find it annoying.

[1] What does this program do?

    print object() > object()


  >>> print(object() > object())
  Traceback (most recent call last):
    File "<stdin>", line 1, in <module>
  TypeError: unorderable types: object() > object()
Yet another reason to upgrade to Python 3! :)


About your code snippet, it prints a "random" boolean value

You're creating two objects at essentially random memory addresses; since they define no comparison methods, CPython 2 falls back to comparing those addresses, so the result is effectively arbitrary (True or False).


That's nothing compared to the "Perl is Satan!!1" meme that us Perl programmers usually have to put up with ;)


Actually, Perl is many different Satans, depending on the particular stylistic quirks of the programmer in question.


TMTOWTDI: There's more than one way to damn it!


Not so much with the usual coding standards (see "Best Practices" book etc).


IMHO, it's well deserved. Start by not having those $, @, % sigils on variables and then we'll talk again about how many more daemons you need to impale.


Sigils are what make Perl stand out, in syntax and behaviour, from (most) other languages, so it would be silly to get rid of them!

i.e. it's a differentiator to what is or isn't Perl.


That one is somewhat reasonable (and relatively obscure code you'd probably never write).

A more realistic example: inner classes can't see class variables from their enclosing classes. (Why enclose classes? - builder pattern)


Every language has its quirks. Python is not perfect either. But JS shows clear signs of bad design decisions, such as the behavior of the == operator.


What == does is pretty simple and easy to understand. If you have a hard time with it, use ===. Problem solved.


The problem is that JavaScript does not exist in isolation, there are other languages that use this operator. If you're familiar with any other C-derived language, the way == acts in JavaScript is very unexpected.

Yes, you can learn to deal with it, but that doesn't mean it wasn't a bad design choice. If both forms of equality are required to be operators, == and === should have been swapped. Too late to do it now, woulda, coulda, shoulda, but it certainly is, IMO, a "bad design" smell in the language, and hardly the only one that still bites people.

Another example: the way 'this' scoping works is similarly busted in that while the rules for it are reasonably straightforward in isolation, it is different enough compared to other languages that share the same basic keywords and syntax that it should have been called something else.

To be fair, I don't think much of this has to do with Brendan Eich being a bad PL designer as much as it has to do with the odd history of JavaScript that is still represented in the name of the language ("Take this client language you made which has no connection to Java, and make it look kinda like Java, please!").


I agree with you to a point. Joins are your friend. But trying to pull out all of the information about a graph of 'objects' using a single query with multiple one-to-many and many-to-many joins is just as foolish in SQL as in Mongo.


Do you have any resources you would recommend to understand or at least give an overview of indexes?

I learned basic SQL once upon a time and understand the relational algebra side of things, but only truly picked up the finer details and specific engines in a piecemeal manner, as needed in various projects.


http://use-the-index-luke.com/

This is a really good resource for understanding how the queries you do relate to the actual actions that the database engine takes.


http://use-the-index-luke.com/ is a great online book (free) targeted at programmers and developers. It's practically required reading in my opinion.



Star, constellation, snowflake, flat... Developers (not the author) would benefit from a database introductory course even if they are not using databases. I think Stanford did one that was open to everyone.


This article ends up agreeing with you at the end, by the way.


When program errors pass silently, that is a legitimate problem in the toolchain.


There is a good reason that relational databases have long been the default data store for new apps: they are fundamentally a hedge on how you query your data. A fully-normalized database is a space-efficient representation of some data model which can be queried reasonably efficiently from any angle with a bit of careful indexing.

Of course, relational databases, being a hedge, are not optimal for anything. For any particular data access pattern there's probably an easy way to make it faster using x or y NoSQL data store. However, as the article points out, before you decide to go that route you'd better be pretty certain that you know exactly how you are going to use your data, today and for all time. You also should probably have some pretty serious known scalability requirements; otherwise it could be premature optimization.

Neither of these things is true for a startup, so I'd say startups should definitely stay away from Mongo unless they really know what they are doing. Being ignorant of SQL and attracted by the "flexibility" of schema-less data stores powered by javascript is definitely the wrong reason to look at Mongo.


I actually started to laugh as I was reading because I knew what problems they were going to run into. I was basically drawing up my schema for a MongoDB app (yes, you still need a schema) when I started scratching my head and started reading through the MongoDB guides. I quickly realised that I should use a relational store, and my problems were solved quickly with Postgres.

The title of this article should honestly be changed, as it does not do MongoDB justice; there are a lot of uses for it, but relational data is not one. Regarding the TV example, this is a classic relational solution and I enjoyed this exact example in a PyCon tutorial, SQL for Python Developers - http://www.youtube.com/watch?feature=player_embedded&v=Thd8y...

I see a lot of people thinking MVP ---> schemaless to save time -----> mongodb, but you will always need a schema unless you are just dumping a list of stuff. I would like to say that another cool solution is an RDF data store; I have been using Fuseki with SPARQL.


I use SPARQL a lot, although not Fuseki. I really like the flexibility it gives in the schema (a flexible schema, not less schema). On top of that, one query language can be used on radically different implementations, i.e. I don't need to change my data model or queries to try different storage models.

We also use BerkeleyDB JE + Lucene for indexing, as well as a number of existing relational databases. Yet, considering the youth of the SPARQL ecosystem (version 1.1 of the standard has only been out since the beginning of this year), there is some fantastic performance possible for both hard and easy queries. I think it will be a bit like Java: not pretty, but fast enough and extremely robust in the long run. With a similar marketing pitch, "Query Once, Store Anywhere".

I also evaluated MongoDB, and I understand the value of a document store. I just don't think that MongoDB is a good document store; imho it's just a slow /dev/null.


> Being ignorant of SQL and attracted by the "flexibility" of schema-less data stores powered by javascript is definitely the wrong reason to look at Mongo.

It's usually the only reason. And 10gen were good at marketing it.


I used it for exactly one production app and it was a huge success. The reason I used it was because the data we needed to represent was actually a document, in this case a representation of fillable form fields in a pdf document. The basic structure was that documents had sections and sections had fields and fields had values, types, formatters, options, etc.

Initially, trying to come up with a schema in SQL was somewhat painful, as what I was really looking for was an object store. Switching to mongo gave me a way to do a very clean, simple solution that worked quite well for the problem at hand (representing PDF forms). That said, we also played it very safe and used mongo for only the document portion, with every other part of the system being in an SQL database. But for the documents, mongo worked really well as a basic object store without the complexity of something like Neo4j.


Of course, a better choice now would be to use PostgreSQL's new JSON support. Postgres has also had XML document types for a long time, though I'm not sure of their indexing story.

If you don't need indexing into the document, you can easily just store it as serialized bytea data. I've done this quite frequently and it works wonderfully.
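
For what it's worth, a rough sketch of what the JSON route can look like in Postgres 9.3 (hypothetical table, untested):

    CREATE TABLE documents (
      id   serial PRIMARY KEY,
      body json NOT NULL
    );

    -- expression index on a field inside the document
    CREATE INDEX idx_documents_title ON documents ((body ->> 'title'));

    SELECT body FROM documents WHERE body ->> 'title' = 'Babylon 5';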


Sorry but PostgreSQL's JSON query syntax is insanely complicated compared to MongoDB.

And that is a big deal for a lot of developers.


I just took a look, and if I'm getting this right, it looks about as simple as it gets:

SELECT json_data FROM people WHERE (json_data->>'age')::int > 15

vs

db.people.find( { age: { $gt: 10 } } )

Personally, I prefer the postgres syntax. It's much clearer. I also don't buy your claims below about performance. Can you provide a real benchmark? Are you running with the safeties off meaning you lose data?


What is the syntax on finding an array value matching some key? E.g. given a user with field "favFoods":[String], how do I determine that pizza is in there?


select ... where "pizza" in person -> favFoods


Sorry I'm stupid, I was thinking Python

select ... where person -> favFood in ("pizza")
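
A hedged, untested sketch of how this might actually look with Postgres 9.3's json functions (made-up column names):

    -- expand the favFoods array to one row per element, then filter
    SELECT p.json_data
    FROM people p,
         json_array_elements(p.json_data -> 'favFoods') AS food
    WHERE food::text = '"pizza"';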


Your ORM will give you whatever syntax is natural, eventually.


For all the talk about how "MongoDB totally has SOME use cases", I've never before heard of a use case where it would be unambiguously better to use a document store. Thanks for explaining that so well.


I've used it well in the past as well.

MongoDB was 5-10x faster than PostgreSQL, Cassandra etc.

If your domain model is structured like a document then MongoDB is a pretty great fit.


Even if it is better, why Mongo? Why not Riak?


In my opinion, Datalog (via Datomic) strikes a good balance between schema flexibility and queryability. It's my preferred way of working with data now, after using Mongo for a year (I was also initially attracted by the flexibility of schema-less data stores).


I think a lot of people (at least in the Node community) love the interfaces provided for it, and specifically Mongoose, which allows you to enforce a schema and thus model relational data.

I'll admit that I originally got into it because I didn't really know SQL, and I'm still not very talented with it, but for example... joins in Mongoose? Say there is a comment... this comment has an author. If that author is of type ObjectId, when I run a query I can do this:

model.comment.find({_id: someId}).populate('author').exec(function (err, result) { /* author will be populated with that author's data instead of just the ID */ })


I use mongodb only when I have to write an app in Node. Not because it's the better solution, but because mongoose is the only mature library on npm that deals with data. The rest are beta at best, and the drivers for RDBMSes are not mature enough.


The mysql driver on npm is indeed mature. Been using it without issue for a couple of years now in a large app. https://npmjs.org/package/mysql


I had thought the mysql drivers were all more mature - or is it that you lean more toward PostgreSQL?


Putting it another way:

Document stores are supposedly "more agile", but by conflating queries, the logical model and the physical structure, they are actually less agile. You've mixed the three things together and ossified around a single model of the domain. When the required view or model changes, you have to write workarounds.


Spot on. MongoDB is a step backwards in abstraction. The RDBMS geeks had this figured out in the 70s. Hence the "relational MODEL" vs "document STORE" naming: a transparent step backwards in generalization.

MongoDB is just locking you into a specific materialization of an ill-specified data structure. PostgreSQL's team hacked out JSON support in about a year or so, since they are working at a higher level of abstraction and could insert the "MongoDB model" at the proper place in their system. Now if you really need to store "documents" in your database and query lazily-defined fields, you can do that for those edge cases (and let's face it, those are edge cases) and use proper relational modeling for the rest of your model.


I once tried to insert a screw using a hammer. I'll be writing my article "Why You Should Never Use A Hammer" shortly.

And here's the crux of the problem with this article, and of so many articles like it:

"When you’re picking a data store"

"a", as in singular. There is no rule in building software that says you have to use 1 tool to do everything.


>I once tried to insert a screw using a hammer. I'll be writing my article "Why You Should Never Use A Hammer" shortly.

More like: "I once tried to insert a screw with a fish-shaped, peanut butter and jelly covered, see-through tv set".


In his defense, the Kickstarter campaign for the TV was impressive.


I always find comparisons to tools disingenuous because people take simple tools (a hammer) and compare them to complex software tools that if you misunderstand can ruin your company.

Your database isn't a hammer. It's closer to a 19th-century industrial machine with hundreds of buttons and levers that will cut your hand off if you use it incorrectly.


I think this is the first post on HN I wish I had a downvote button for, just for the reason you list. There is a reason there are different flavors of databases, and MongoDB most definitely would not be my choice for representing graph like relationships.

It's also scary that it has 217 points because it bashes Mongo.


I think you are missing the point of the article. If you read down to the Epilogue it explains how the "perfect" application still didn't work with MongoDB once the clients started asking for more features.

My read was that even when you think you don't have "graph like relationships" in your data, you actually do.

The original author did say this, but I would like to add: if you don't have "graph like relationships", then your data is pretty trivial and any data store will do.


From another comment I made, on why I don't think this is a good article even granting the proposed thesis of "mongo doesn't work for graph-like relationships":

Even though their data doesn't fit well in a document store, this article smacks so much of "we grabbed the hottest new database on hacker news and threw it at our problem", that any beneficial parts of the article get lost.

The few things that stuck out at me:

* "Some folks say graph databases are more natural, but I’m not going to cover those here, since graph databases are too niche to be put into production." - So you did absolutely no research

* "What could possibly go wrong?" - the one line above the image saying those green boxes are the same gets lost. Give the image a caption, or better yet, use "Friends: User" to indicate type

* "Constructing an activity stream now requires us to 1) retrieve the stream document, and then 2) retrieve all the user documents to fill in names and avatars." - Yep, and since users are indexed by their ids, this is extremely easy.

* "What happens if that step 2 background job fails partway through?" - Write concerns. Or in addition to research, did you not read the mongo documents (write concern has been there at least since 2.2)

Finally, why not post the schemas they used? They make it seem like there are joins all over the place, when what I mainly see is: look at some document, retrieve the users that match an array. Pretty simple mongo stuff, and extremely fast since user ids are indexed. Even though graph databases are better suited for this data, without seeing their schemas I can't really tell why it didn't work for them.

I keep thinking "is it too hard to do sequential asynchronous operations in your code?".


I'm pretty ignorant of MongoDB so I'm genuinely interested in your response: How would you solve the problem in the epilogue, namely "a chronological listing of all of the episodes of all the different shows that actor had ever been in"?

Did Sarah model the data poorly ("We stored each show as a document in MongoDB containing all of its nested information, including cast members").

Or is there an easy way to extract that information that Sarah just doesn't know about yet?

Keep in mind the constraints in the article, for example: some shows have 20,000+ episodes, actors show up in 100s of shows, and "We had no way to tell, aside from comparing the names, whether they were the same person".

The last part seems like a really straightforward relational critique to me. If you don't break the actors out into unique entities then you can't compare them across shows. But if you do break them out into unique entities, then how do you present the show information without doing joins?


  > Did Sarah model the data poorly ("We stored each show as a 
  > document in MongoDB containing all of its nested 
  > information, including cast members").
Yes, they modeled the data poorly.

In this example, we have a TV Show, which is modeled as an entity (document). This TV Show has a list of cast members, each one modeled by a nested object.

In a relational database, this type of relationship would be modeled by having a TV_SHOWS table, a CAST_MEMBERS table with a foreign key to the TV_SHOWS table, and a CASCADE DELETE relationship to ensure that if a TV_SHOW is deleted, the related CAST_MEMBER records are also deleted.
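
A rough sketch of that modeling (hypothetical column names):

    CREATE TABLE tv_shows (
      id    serial PRIMARY KEY,
      title text NOT NULL
    );

    CREATE TABLE cast_members (
      id         serial  PRIMARY KEY,
      tv_show_id integer NOT NULL REFERENCES tv_shows (id) ON DELETE CASCADE,
      name       text    NOT NULL
    );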

This is obviously too strong a relationship between CAST_MEMBERS and TV_SHOWS. (In OO we'd call this a "component" relationship, that is, we're saying that a tv show is composed of cast members, and if we destroy the tv show we destroy the cast members as well.)

They should have modeled CAST_MEMBERS as true entities, by making them documents in their own collection, and storing a list of Cast Member IDs in each TV Show.

  > But if you do break them out into unique entities, then
  > how do you present the show information without doing
  > joins?
You must join, albeit in MongoDB you do this in the application layer, not the database, so:

1. Query the cast members collection to find the cast member id.

2. Query the tv shows collection to find all tv shows with the cast member id in the cast members set.

Those of us who cut our teeth on relational databases have trouble seeing past "two trips to the database" in the above strategy, and that's probably why there's an urge to embed documents rather than to query two collections sequentially. Resist this urge, as it's as bad as the urge to denormalize, i.e. there'd better be a damn good reason to do it.


> This is obviously too strong a relationship between CAST_MEMBERS and TV_SHOWS.

... huh?

> They should have modeled CAST_MEMBERS as true entities, by making them documents in their own collection, and storing a list of Cast Member IDs in each TV Show.

So instead of a one-to-many relationship, they should use a one-to-many relationship expressed in a different notation?


MongoDB doesn't forbid you from having entities and relations. It just doesn't support them in the same way that SQL databases do. Ditto for CouchDB, etc.

You end up having to do some joins yourself still, but this is often appropriate. Imagine that the "actor" entity contains a complete bio, including family history with relationship to other actors, links to wikipedia & fan sites, etc. When you're displaying the page for episode #202 of "Everyone Loves MongoDB", you don't want to retrieve all that data for all the actors. You're not going to display it all on the episode page anyway. Instead, you just need an ID (to href an a and src an img) and probably a small amount of denormalized stuff (name, for the img alt ...). Since that's what you need, that's what you store.

There's a limit to how far you can denormalize schemas before it is no longer helpful. The author explores this limit, and finds that MongoDB doesn't make the limit go away.


You're basically saying: don't use mongo. It's trivial to emulate a blob of data in a relational database; just use a... blob of data. Or any of the many, many other options at your fingertips. Conversely, manually implementing efficient joins is a total hassle and it'll probably end up slow and brittle. At the very least you'll need indexes, and that means an (implied) schema.

So in the normal mongo usecase for storing (as opposed to caching) data with relations, let me see if I can summarize:

- you can have relations, it's just that mongo won't help you deal with them: you need to implement them yourself.

- you can have (actually need) a schema, it's just mongo won't help you deal with that; you'll need to implement that yourself. Have lots of fun with schema-changes, especially because...

- Since you're changing decoupled entities, you need to keep them in sync. You can (and probably should) use transactions, but mongo won't help you with that. You also probably want foreign keys, but mongo won't help you with that either. Migrations on mongo are a special kind of terrifying.

But hey, on the upside, it can store structured blobs, and it's probably hardly any slower than your filesystem, which could do that too.


You could absolutely do the same thing with Postgres (or SQL Server) and computed indexes over JSON (or XML) blobs. Of course, then you'd have exactly the same schema migration issues.

My point was more that a lot of the time, if you structure your data right (and get the right balance of denormalization) you don't need joins very much and so the lack of them isn't really a big disadvantage.


> Keep in mind the constraints in the article, for example: some shows have 20,000+ episodes, actors show up in 100s of shows, and "We had no way to tell, aside from comparing the names, whether they were the same person".

As others have pointed out, it requires two trips to the database. Given their architecture (distributed nodes), network latency is minimal, so this is essentially two calls to the database.

show { _id, title }

actor { _id, appearedIn : [id] }

db.show.find({"title" : "awesomeshow"}, {"_id" : 1})

db.actor.find({"appearedIn" : showId})

Each actor is unique in the database, when you query, you get back unique actors. I'm not sure why they're scared of joins (or multiple queries in mongo).

The question you ask yourself is not whether you're joining, but how often you're joining. If you're not joining often on actors and shows, document databases can work better, since you represent the show and all its episodes without having to join.


Another "issue" occurs to me. It seems likely that the data coming in about TV shows, especially old ones with decades of episodes, would be a bit "dirty". This sort of thing just slides right into a document store, but a relational one would have some problems with that. How do we know e.g. that "Bryan Cranston", "Bryan Lee Cranston", and "Brian Cranston" are the same (or different) actors? Of course these things can be fixed with enough manual (or, even better, user) intervention, but the time and place for that are after you've got the data in the system, not before.


> How do we know e.g. that "Bryan Cranston", "Bryan Lee Cranston", and "Brian Cranston" are the same (or different) actors?

In the USA, the various professional creative guilds enforce uniqueness.

Your general musing is right, but the problem of source-data quality is generally considered to be distinct from the design of schemata.


Yeah, the comment on graph databases seemed a bit too flippant.


I often upvote articles because I'm interested in the discussion. It does not always indicate agreement.


Well said sir. I only skimmed the article, but afaict the author still has not discovered graph stores, an appropriate way to store social graphs.

I remember downloading Diaspora back in the day. The idea behind it was great. But the code looked quite awful and insecure.


From the article:

> Some folks say graph databases are more natural, but I’m not going to cover those here, since graph databases are too niche to be put into production.

Have you used a graph database to good effect? Which one, and for what?

I have a friend who as a learning exercise wrote a toy search engine implementing PageRank — inherently a graph problem. We paired on setting up Neo4j, the only open-source graph database we could find with a working Python API, but found it fiddly and hard to get help. She then switched to SQL (Postgres, I think) and reported faster progress.

Facebook themselves use MySQL[1], so between that and my own first/second-hand experience, I'd call it far from obvious that a graph database is the most appropriate way to store social information. If you're going to criticize the OP for not considering them, it would be nice to offer some justification.

1. https://www.facebook.com/notes/facebook-engineering/mysql-an...


>>Have you used a graph database to good effect? Which one, and for what?<<

I played around with several. But the project never got off the ground due to layoffs that killed projects.

I know lots of people who have implemented graph stores with great success. One example:

http://www.bbc.co.uk/blogs/bbcinternet/2012/04/sports_dynami...

Another is a multibillion-dollar retailer (not sure if it's public, so I'll leave the name out) that uses Stardog to good effect. LOTS more out there.

>>We paired on setting up Neo4j, the only open-source graph database we could find with a working Python API, but found it fiddly and hard to get help.<<

The Graph Stores do seem to play better with Java. Neo's getting a lot of ink these days but they are far from the only game in town.

>>Facebook themselves use MySQL[1], so between that and my own first/second-hand experience, I'd call it far from obvious that a graph database is the most appropriate way to store social information.<<

They aren't using MySQL the traditional way. They undoubtedly would have made different choices had they started when Diaspora did. And they also use TAO, a homegrown graph store of sorts, FYI:

http://dl.acm.org/citation.cfm?id=2213957

It is sitting on top of MySQL at some level, as this is where objects are stored as "source of ultimate truth".

If I recall the quote correctly from when I invited a couple of FB DBAs (pre-Mark Callaghan) to speak at a meetup: "I don't think there's a single join in the Facebook codebase". That might have been a slight exaggeration, but MySQL at Facebook is not there because their recent needs are for a relational db.


Thanks, this is why I read all the bad comments on HN: in hopes of seeing a very informative one like this :)

I'll just point out that this:

> The Graph Stores do seem to play better with Java.

was likely a dealbreaker for Diaspora, since they were a small team without, I'd assume, Java experience. Also the nature of the project virtually requires an open-source database so Stardog would've been out. With SQL you have not one but several free and open-source implementations that are battle-tested and work well with just about any programming language out there. That makes SQL a better choice for many projects even if a graph store would map more neatly onto their problem domain.


True. I think an (even more) ambitious attempt could have attracted core developers to the project, which could have solved all the technical hurdles. That said, I was pretty excited by the idea, and I hope something new along those lines gains momentum one day.


As I recall, FB mostly use MySQL as a glorified K/V store. So I'm not sure if this is a win for relational algebra.


Reddit does that as well with PostgreSQL. It surely doesn't show a win for NoSQL if two of the biggest sites on the internet would rather use traditional SQL RDBMSes as KV stores.


In my experience, MySQL works better as a K/V store than Mongo under load - another point against Mongo for very simple data.


It's more like using a very good screwdriver instead of a Swiss Army knife that does an OK job at everything.

Yes, there is no rule that one tool has to work for everything, but there is a rule in Agile that you should push off making assumptions about the future as far as possible, because you will never know less than you do right now.


I actually liked the article and thought it was interesting. But the title is complete clickbait. It does not even say that "you should never use mongodb"; it points out some situations where MongoDB is a good match. I know a title like "Think carefully about whether MongoDB applies to your case" is not attractive, but it is less sensationalist.


I know very little about MongoDB, or NoSQL in general, but I'm very interested in it. Are there any good sites/articles I should start looking at to see where it would be the right tool?


The difference is that many people are trying to insert a screw with this particular hammer today.


I don't know much about MongoDB, but I've been using a lot of CouchDB for my current project. Am I correctly assuming that MongoDB has no equivalent for CouchDB views? Because if it had, all these scenarios shouldn't be a problem.

Here's how relational lookups are efficiently solved in CouchDB:

- You create a view that maps all relational keys contained in a document, keyed by the document's id.

- Say you have a bunch of documents to look up, since you need to display a list of them. You first query the relational view with all the ids at once and you get back a list of relational keys. Then you query the '_all' view with those relational keys at once and you get a collection of all related documents - all pretty quickly, since you never need to scan over anything (CouchDB basically enforces this by having almost no features that will require a scan).

- If you have multiple levels of relations (analogous to multiple joins in RDBMSes), just extract the keys from the above document collection and repeat the first two steps, updating the final collection. You therefore need two view lookups per relational level.

All this can be done in RDBMs with less code, but what I like about Couch is how it forces me to create the correct indexes and therefore be fast.

However, if my assumption about MongoDB is correct, I have to ask why so many people seem to be using it - it would obviously only be suitable in edge cases.


Spot on about CouchDB. I haven't used MongoDB for anything of decent scale, but I must say I was shocked to read in the OP that they store huge documents like the ones from the Movie example in MongoDB. In CouchDB you can use views to sort of recursively find all of the other documents that your current document has a document ID for. This takes advantage of CouchDB's excellent indexing. I'm not trying to start a CouchDB vs MongoDB war here, but again, I'll just say I'm surprised at the types of documents the OP was storing in MongoDB.


What I still don't understand about MongoDB is where it actually shines compared to Couch. The performance advantage would have to be quite big to offset the loss in flexibility as a general purpose DB. I'm also not trying to start a war but I'd like to get a picture about why Mongo seems to be used more often than Couch.


> What I still don't understand about MongoDB is where it actually shines compared to Couch

Marketing. They shipped with unacknowledged writes for a long time and it made them look really good in write benchmarks. Couch was actually trying to keep your data safe. But it didn't look fast enough, so those who didn't read the fine print on page 195 of the manual, where it tells you how to enable safe data storage for MongoDB, jumped on the bandwagon.

Oh and mugs, always the mugs. I have 3 I think.


My one and only reason to use Mongo over Couch is geo indexes. As far as I can tell this doesn't exist natively in Couch. I'm also not sure how GeoCouch fits in with this.


Cloudant will soon offer geo-spatial queries. It's in beta now: https://cloudant.com/product/cloudant-features/geospatial/.


As the original article said, I think where MongoDB shines is as a glorified, souped-up caching tier, competing directly with Redis, Couchbase, and similar. It's not really a good general-purpose DB.

> I'd like to get a picture about why Mongo seems to be used more often than Couch.

Very good marketing from 10gen on the one hand. On the other, CouchDB is older (and we techies love the new hotness), and the CouchDB/Couchbase split confused a lot of people. Having your original founder found a different and incompatible project with almost the same name but very different goals would cause almost any project to stutter.


> CouchDB/Couchbase split confused a lot of people

Yes, that really didn't inspire confidence in the longevity of the project


Interesting question, I can point out a few interesting differences I know of. Take note, I have more experience with Couch and its ilk than MongoDB, but I know some of Mongo's feature set.

tl;dr: You'd probably see the most difference with how a) the data is distributed and replicated and b) how you query data.

CouchDB (as of the 1.5 release) offers master-master (including offline) replication. It does not offer sharding. Cloudant's BigCouch does implement a dynamo-like distribution system for data that is slated for CouchDB 2.x iirc. Mongo on the other hand does support sharding via mongos, and you can build replica sets within each shard. It does not as far as I know support master-master. This is probably the biggest data-distribution difference between the two.

MongoDB supports a more SQL-like ad hoc querying system, so you could query for drink recipes with 3 or fewer ingredients that have vodka in them, for instance. You'd still need indexes on the data you are querying for performance.

CouchDB queries are facilitated via javascript or erlang map reduce views, which serve as indexes you craft. An additional 'secondary-index' like query facility is to use a lucene plugin and define searchable data. Cloudant has this baked into their offering, and their employees maintain the plugin on github (https://github.com/rnewson/couchdb-lucene)

MongoDB has the ability to do things like append a value to a document array. In Couch, you'd likely read the entire document, append to the array in your app, and put the document back on Couch. It does have an update functionality that can sometimes isolate things more than this, but I haven't seen it used as much. Mongo can also do things like increment counters, while Couch cannot (though CouchBase can).

There's a host of other differences. Mongo has a much broader API, while CouchDB takes a more simple http verb like approach (get, put and delete see the heaviest use). Depending on your situation, one might be a better fit, or you might simply grok one more than the other.

As far as why Mongo gets used more often, I think the closer-to-SQL ad-hoc queries made more sense to people transitioning from stores like MySQL. The CouchDB view/map-reduce stuff is a bit more of a mindset shift (see the View Collation wiki entry for an example of this at http://wiki.apache.org/couchdb/View_collation). CouchDB was also taken hits for being slower than Mongo, but I suspect it was the map-reduce stuff that really steered some folks the other way.


Querying, and querying immediately after insertion. If you want queries right after insertion (which require views), this can be slow in couch. Also, if you want to query but don't want to keep a view around, you have to add one, wait for it to populate (causing a performance hit while it builds, plus while it is up), then remove it.

If you're doing primarily insertion with querying via id, and using views in which stale data is ok, then couch is far superior to mongo. But that's not a use case everyone has.


Also, CouchDB has better safety. Its append-only files allow you to make hot backups and safely pull the plug on your server if need be, without worrying about corrupting data.

Plus change feeds and peer to peer replication are first class citizens in the CouchDB. Once you start having large number of clients needing realtime updates, having to periodically poll for data updates can get very expensive.


Offline capable peer to peer replication was the main reason I chose CouchDB - we needed something that would realistically run on clients, even mobile devices. NoSQL we mainly chose because we needed schema-less data (the whole system relies on ad-hoc design updates). It's basically an information system IDE with rapid application development.


I immediately wondered why Diaspora didn't try CouchDB, since replication seems to be one of the key features they were after.


In Diaspora as it exists now, replication - really, federation - is between pods. There's a protocol for transferring data between pods that is deliberately database-agnostic:

https://wiki.diasporafoundation.org/Federation_protocol_over...

So CouchDB's replication doesn't really help.

If the day comes that any single pod is big enough to need replication between clustered machines within it, then CouchDB should certainly be a contender for storing its data.


It looks like the CouchDB vs. MongoDB debate in the document store world is the equivalent of the PostgreSQL vs. MySQL debate in the relational world.


Not really. They handle querying and aggregation much differently.

For people coming from SQL Server/MySQL/PostgreSQL, the functionality differences between the NoSQL flavors are something they don't expect, and often don't explore. There are a number of heavily used NoSQL solutions because they're focused on specific use areas.


db.find({"field":"value"},{"field":1,"someotherfield":2})

Finds all documents with field having value, returning only field and someotherfield. That part is similar to the map portion of a CouchDB/Couchbase view. No reduce portion though.

If field is what the index is built off of, it should be similar performance wise to a view. Just like views have to be created beforehand, so do mongo indices.

The difference is the find of a mongo document will happen much more quickly after insertion than the find of a couch value by view. Views require rebuild in couch which is not instantaneous.


If I understand couch correctly, it will run all map/reduce functions on a DB after insertion, thus updating all views right away - except if a view has never been queried, in which case it would happen at the first query. I don't quite understand how mongo could do a better job there - do you mean because mongo's indices are less complex than couch views, so the updates after insertion are quicker? I guess if that's the case it would perform better in insertion heavy cases, but then again I could just not use many map/reduce operations in couch and thus reduce the insertion overhead.


Almost right!

For various complicated reasons, CouchDB updates views on read, not on write. So you write some data, then you query a view; CouchDB notices the view is stale, recalculates everything, and then gives you the updated data. That can be a problem if your view is quite heavy, because every time you write, the next read will be slow.

However! You can query with "stale=ok" (which means "just give me the old data, and don't kick off a view update"), and then update your views manually (eg, cron job that hits your view every so many minutes, or if you want to be smarter, a very lightweight daemon that monitors the _changes feed and hits your view every X updates, or whenever some key document is touched, or whatever).


From my tests with couch, the view isn't populated immediately after a document has been inserted, and may take some time. I think I tried this doing insert bulk, wait for view, insert 1, query, but I'd have to double check.


You and Lazare are right, I just checked the documentation and Couch indeed updates on first view query after a write.


Cloudant (based on CouchDB) automatically triggers map-reduce and auto-compacts your database for you. This is my second post about Cloudant - note that I am employed by Cloudant. :)


Make it a product, and I'd be more interested. A number of companies have their own hosting, so the hosting part is not only unneeded, but is also usually a non-starter.


I must have read a dozen (conservative estimate) articles now all called "Why you should never use MongoDB ever" - or permutation thereof. Each and every one of them ought to have been called "I knew fuckall about MongoDB and started writing my software as if it was a full ACID compliant RDBMS and it bit me."

There are essentially two points that always come up:

1. Oh my God it's not relational!

Well, you could argue that if you move from a type of software that is specifically called RELATIONAL Database Management System to one that isn't, one of the things you may lose is relation handling. Do your homework and deal with it.

2. Oh my God it doesn't have transactions!

This is, arguably, slightly less obvious, and in combination with #1 can cause issues. There are practices to work around it, but it is hardly to be considered a surprise.

I keep stumbling on these stories - but still these are the two major issues that are raised. I'm starting to get a bit puzzled by the fact that these things are still considered surprises.

In either case, I'm happily using MongoDB. It has its fair share of quirks and limitations, but it also has its advantages. Learn about the advantages and disadvantages, and try to avoid locking too large parts of your code to the storage backend and you'll be fine.

FWIW, I think the real benefit of MongoDB is flexibility with respect to schema and datamodel changes. It fits very, very well with a development process based on refactoring and minor redesigns when new requirements are defined. I much prefer that over the "guess everything three years in advance" model, and MongoDB has served us well in that respect.


> I must have read a dozen (conservative estimate) articles now all called "Why you should never use MongoDB ever"

Strange statistical oddity, if you ask me, right? How many "don't use PostgreSQL" or "don't use Cassandra" or "don't use SQLite" articles have you seen? Not as many. It is just very odd, isn't it...

So either everyone is crazy or maybe there is something to it. I lean towards the latter here.

> 1. Oh my God it's not relational! ... > 2. Oh my God it doesn't have transactions!

Maybe those, you forget about:

3. Claim "webscale" performance while having a database wide write lock.

4. Until 2 years ago shipped with unacknowledged writes as a default. Talk about craziness. A _data_base product shipping with unacknowledged send-and-pray protocol as a default option. Are you surprised people criticize MongoDB? Because I am not at all. Sorry but after that I cannot let them within 100 feet of my data. They are cool guys perhaps and maybe having beers with them would be fun, but trusting them with data? -- sorry, can't do.


> A _data_base product shipping with unacknowledged send-and-pray protocol as a default option.

MongoDB had a default initial fire-and-forget mentality, but that was on purpose for their initial use cases. Just because someone else uses the tool for a different purpose doesn't mean the software is to blame.

Also, if you're complaining about the default settings and you were running this in production, RTFM.


> MongoDB had a default initial fire-and-forget mentality, but that was on purpose for their initial use cases

Yes, I call that deceitful marketing. It wasn't an accidental bug or an "oops". I don't know how someone can be considered honest or trusted with data when they ship a d_a_t_a_base product with those defaults. Call it random storage for 'gossakes, that would be ok, anything but "database".

> Also, if you're complaining about the default settings and you were running this in production, RTFM

Yes, and I also don't expect to read the fine print on the last page of a manual to enable the brakes when I buy a car. I expect cars to have brakes enabled by default, even if it somehow makes them not go as fast in benchmark tests.


It still boils down to RTFM and don't trust marketeers, right?


Mostly, "don't trust marketeers with you data", which I don't.


Mongo is absolutely TERRIBLE for schema changes. It is a terrible fit for refactorings and minor redesigns. (I have implemented and watched several mongo migrations and refactorings)

Because you have a schema, but mongo doesn't model it, you're left to your own devices to implement the migration. If you have real data, and a normal legacy situation, you can't assume all data will necessarily follow the "schema" you think you have - after all, it's implicit. But that means that writing the migration can be quite tricky. There are no validations, no foreign key checks, no constraints you can use to validate your migration did what you think it did. You'll need to do all that in your own code. This is short for: you're going to be lazy and not check it quite as well as you would have otherwise, and the checks you do implement might be buggy.

Furthermore, if a particular entity does fail a migration... what then? In postgres, which supports transactional DDL, you can roll back schema changes - so even if the last entity failed to migrate because your assumptions were wrong - and even if the validation had to be in code, not in the database - you can revert to the initial situation. In mongo? Uh-oh; you're in trouble. You'd better be working on a copy of your production database; but if you are, that means that your main database needs to be offline or in read-only mode so that writes aren't lost. Does mongo have a read-only flag? By contrast, postgres (and other SQL databases) are transactional and support snapshots - you can do the safe migration with rollback support all while online, for as long as there aren't any conflicts; and when there are, it's detected, and you have a range of options from locking to retries to avoid the conflicts.
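
For illustration, a minimal sketch of what transactional DDL buys you in Postgres (hypothetical migration):

    BEGIN;
    ALTER TABLE shows ADD COLUMN network text;
    UPDATE shows SET network = 'unknown';
    -- run whatever validation you like here; if anything looks wrong:
    ROLLBACK;  -- both the schema change and the data change revert
    -- otherwise: COMMIT;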

In practice, I can't imagine a worse tool for schema and datamodel changes. If your schema change is trivial, it's not so bad; but then, if you're only renaming a property that has no external references to it or adding a property or whatever - sql is trivial too.

Mongodb for schema changes is sort of like writing an automated refactoring for a dynamic-language codebase that's too large to manually inspect, without unit tests or a VCS. You won't necessarily know what goes wrong or even if anything goes wrong; you won't get system support for guaranteeing at least minimal consistency; and if something goes wrong you'll have a corrupted half-way state.


I'm not sure what these two strawmen have to do with the article. Perhaps you've been reading another article?


Given that it's relatively new tech, who is qualified to write about it? Presumably by the end the authors did know a bit. Don't you think such articles, assuming they are objective, might be useful to others who are thinking of dipping their toes in the water?


SQL is actually a ridiculously elegant language for expressing data and relationships. It is NOT a good general-purpose language. So I tend to always favor SQL for relational data, which is actually most data.

Queues, caches, etc. = NoSQL solutions. They tend to have many more features around performance to handle the needs of these problems, but not much in terms of relational data.

If you study relational databases and what they do, you will quickly find the insane amount of work done by the optimizer and the data joiner. That work is not trivial to replicate even on a specific problem, and ridiculously hard to generalize.

And so this article's assertion that mongodb is an excellent caching engine, but a poor data store is very accurate in my eyes.


No. SQL is actually pretty third-rate at expressing data and relationships. My preferred way of expressing data and relationships is the programming language I am writing in.

The problem with SQL is that it is not an API, it's a DSL. Which usually means source-code-in-source-code, string concatenation/injection attacks, and crappy type translations ('I want to store a double, what column type should I use? FLOAT? NUMERIC(16,8)?'). Even as a DSL it's pretty low-brow: just look at how vastly different the syntax is between insert and update, or 'IS NULL'.
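
To make the "source-code-in-source-code" point concrete, here's a rough sketch in Node-flavoured pseudocode (the query() API and the pg-style $1 placeholder are assumptions, not any particular driver):

    var name = request.query.name;   // attacker-controlled input
    // SQL assembled by string concatenation: the classic injection vector
    db.query("SELECT * FROM users WHERE name = '" + name + "'");
    // parameter binding keeps the statement static and ships the value separately
    db.query("SELECT * FROM users WHERE name = $1", [name]);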

For all those who love SQL, consider having to address your filesystem with it. Directories are tables, foreign-keyed to their parent, files are rows. There's a good reason why this nightmare isn't real: APIs are preferred over DSLs for this use case. And so too for databases, because they are the same abstraction.

Don't get me wrong, I love relational algebra and the Codd model, but SQL just ain't it. SQL has survived because of its one and only strength: being cross-platform. And like all cross-platform technologies, such as Java bytecode and JavaScript, its rightful place is as a compilation target for saner, richer, more expressive technologies. This is why I always use an ORM and have vowed to never, ever, write a single line of SQL again.


I like your comparison of SQL to JavaScript. However, personally I love SQL and always use an ORM. My vow is to never have a line of SQL in my application source code. This is perfectly doable with SQLAlchemy, though not with crappy ORM such as ActiveRecord.

Indeed, I blame ActiveRecord for making NoSQL popular. When your ORM doesn't create foreign keys for you, it is a slippery slope to blatant denormalization and eventually NoSQL.

EDIT: The other party to blame would be MySQL with its painfully slow "must-make-a-copy-of-everything" ALTER TABLE.


I like ORMs but in my experience the best approach is to use a hybrid of ORM, views and sprocs. Ideally each sproc will return the results of querying a view or at a minimum the identical columns, then the views become 1st class entities in your ORM like anything else (except for updatable views which I shy away from).

So personally I vow never to write an insert, update or delete again, but I am certainly happy to write queries and tune them if necessary.

The one thing that trumps nosql / denormalisation in my opinion is materialised views. Materialised views are a thing of beauty that allow for the design integrity of normalized data and the performance of denormalised data. It seems most people don't use them / understand them because they use b-grade free database engines.

You never stop hearing people complain about nulls, types/precedence, and joins in SQL, but seriously, it isn't that hard to learn. These are the main things that people complain about and regurgitate endlessly, so a little effort would bring a big reward.


How about a hybrid SQL-builder / data-grouper solution?

Not limited to ORM methods - get the full power of SQL instead. Not string concatenation - get the full power of the language to build queries. Also, the ability to get join results in either flat or grouped form.

For example https://github.com/doxout/anydb-sql (shameless plug)


Nice. This is exactly what I mean when I talk about ORMs. See how everything's nicer when it's an API?


I wrote a bayesian document classifier for one project I was working on - in SQL. Training the system took one INSERT and a small word-split function. Classifying the documents took one SELECT. Even in F#, I couldn't have written a more elegant or more performant solution. In a procedural language it would have been a mess of loops and roundtrips. Good SQL is almost a pure description of your desired result, with none of the "this is how you should do it" cruft.

I don't have anything against ORMS - they're almost mandatory due to O-R-impedance mismatch - but too often I see them used instead of operations that should rightly be server-side. And none of US have injection problems, because we're binding our parameters, right? ;-)


This is a reasonable comment, but nosql databases do nothing to address it. Nor do ORM libraries.


What rubbish. NoSQL is only good for queues and caches? Who on earth uses a database for this?

NoSQL works well when you are modelling your data in ways that fit their particular use cases. Cassandra is great with CQRS, Riak for key/value, MongoDB for documents.


As others have pointed out, this article can basically be summarized as, "don't use MongoDB for data that is largely relational in nature."

Mongo (like most document stores) is good for data that is naturally nested and siloed. An example would be an app where each user account is storing and accessing only their own data. E.g. something like a todo list, or a note-taking app, would be examples where Mongo may be beneficial to use.

A distributed social network, I would have assumed, would be the antithesis of the intended use-case for a document store. I would have to imagine a distributed social network's data would be almost entirely relational. This is what relational databases are for.


> As others have pointed out, this article can basically be summarized as, "don't use MongoDB for data that is largely relational in nature."

Disclaimer: I'm a founder of RethinkDB[1] (which is a distributed document database that does support joins).

The fact that traditional databases use the term "relational" has probably caused more confusion than anything else in software. In this context "relational" doesn't mean "has relationships". The term is just a reference to mathematical relations[2]. This is an important distinction because almost all data has relationships, whether it's hierarchical data, graph data, or more traditional relational data.

To me it's pretty clear that ten years from now, every general purpose document database left standing will support efficient joins. It helps to frame the debate from this perspective.

[1] www.rethinkdb.com [2] http://en.wikipedia.org/wiki/Relation_(mathematics)


Totally agree. I was more using "relational" to mean "cross-relational". I.e. Consider plotting your data on a 2-dimensional space, connecting your "related" data with lines. If your data looks like a spiderweb, probably some graph-type database is most appropriate. If your data resembles an inverted funnel (hierarchical) more than a spiderweb, then a document-store probably is more appropriate. More traditional relational databases are probably more appropriate somewhere in between (which is probably why they're still the most popular type of database being used).

Of course, I can't think of any real-world scenario where your data wouldn't resemble a bit of both. Even very hierarchical data usually has some cross-relationships between un-nested documents, which is why it's still awesome to have a document-store database that supports join-type relationships.


It's funny how people, after all that "NoSQL everywhere, for everything" hype, start discovering that maybe those guys in the 70s were onto something when they invented relational databases, and were not just too stupid to come up with a key-value store. Some data is relational, and answering "why don't you use the latest fashionable NoSQL" with "because our data is relational" is a perfectly fine answer.


Document stores != Key-value stores. Well, I guess they are similar but I prefer to separate DBs like MongoDB and CouchDB from other key-value databases like Riak and Redis.


I think your summary is only half. The other half is, I think, "Think real hard about whether your data is relational or not."


Indeed, the article makes the point that most interesting data is relational, or at any rate contains valuable relations. Discarding efficient relationship management may be a mistake.


"An example would be an app where each user account is storing and accessing only their own data. E.g. something like a todo list, or a note-taking app, would be examples where Mongo may be beneficial to use."

Until you want some analytics.


Many people already build data warehouses for analytics purposes; you don't want to be running reports against your live database if you don't have to. Why add extra load?


> Until you want some analytics.

I can actually respond to this specifically, as we recently had a project that needed us to build some decently-sized and complex analytics into their app. I spent about a month researching how most analytics solutions are structured and work, and became very familiar with the codebase for FnordMetric, which is one such open-source analytics solution.

You wouldn't initially think it (I certainly didn't), but Mongo is actually a great use-case for analytics data. Here's why...

Most analytics platforms don't query live-data and build reports on the fly. It's terribly inefficient and doesn't scale. If something like Google Analytics did this, it'd take forever for your Analytics dashboard to load, especially at their scale.

What most analytics platforms do is know beforehand what data you want to aggregate and at what granularity, perform the calculations (such as incrementing a counter) as events come in, and then store the results in a separate analytics database/table. In fact, there are several presentations and articles about doing things like this with Mongo:

http://blog.mongohq.com/first-steps-of-an-analytics-platform...

http://www.10gen.com/presentations/mongodb-analytics

http://blog.tommoor.com/post/24059620728/realtime-analytics-...

And then, this is an interesting article that discusses the difference between processing data into buckets on the way in, and creating an analytics platform that does more ad-hoc processing on the way out:

http://devsmash.com/blog/mongodb-ad-hoc-analytics-aggregatio...

Let's take something as simple as aggregate pageviews for example (for simplicity's sake, we'll say you want total pageviews for your app, not per-page). Normally you'd think, simple, I'll just store my pageview events, and then when I want to view pageviews, I'll issue a `COUNT` command on the database. Even this gets terribly slow, for a couple reasons:

* You may just have a ton of pageview event entries to query.

* Each pageview has a datetime-stamp, and you have to query not just one `COUNT` query for a given time-range; rather, your analytics dashboard needs to show a graph of counts over time, e.g. pageviews per day for the last week, or pageviews per week for the last year or pageviews per hour for the past day, etc. Each of these would require several distinct COUNT queries (or one more-complex GROUP query), which is even slower, especially for large datasets.

So generally, analytics platforms will have different aggregate buckets for pageviews in the database, which each keep a different granular tally. For example, I'd have a bucket for each day, which keeps a tally of pageviews that day, and a bucket for each week, which tallies pageviews for that week, etc. When a pageview comes in, they'll increment each bucket - which is a really fast process with Mongo, since it has an `$inc` update operator that, combined with upserts, can bump the counters in a bucket with one really fast query.
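
A rough sketch of that bucket pattern in the mongo shell (collection and field names are made up), with one upserted counter document per granularity:

    var day  = ISODate("2013-11-11T00:00:00Z");
    var hour = ISODate("2013-11-11T14:00:00Z");
    db.pageviews_daily.update( { page: "/home", day: day },
                               { $inc: { views: 1 } }, { upsert: true } );
    db.pageviews_hourly.update({ page: "/home", hour: hour },
                               { $inc: { views: 1 } }, { upsert: true } );
    // the dashboard then reads the pre-aggregated buckets instead of counting raw events
    db.pageviews_daily.find({ page: "/home" }).sort({ day: 1 });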

So why is Mongo pretty good for analytics? Because 1) each time-interval bucket is a silo of data for that time-interval, and 2) usually analytics are for patterns and aggregate data, so they don't normally require extremely high reliability (i.e. it's usually okay if an event is dropped here or there).

Of course neither #1 or #2 above are always the case, so this doesn't always apply, but my point was just that Mongo is actually a better fit for analytics than you might imagine.


I haven't done the kind of analytics you're talking about, but it sounds like the implementation is basically a round robin database.


Thanks for the great references.

I really needed some good resources on doing analytics in MongoDB.


But then you can take it out of Mongo into something made for analytics. This is a challenge I'm currently facing, but I feel the flexibility Mongo has offered in letting us iterate on our data collection is paying off in the end.


I'm using CouchDB hopefully for the right reasons. Each user is storing and accessing only their own data. I need that data to be easily stored offline in localstorage in the browser (sqlite/indexedDb not being supported in all browsers), and similar key/value stores for iOS/Android apps. On top of that I need synchronization when the user does come online. This is the type of app you'd want to use on the go as well as on your home computer, so easy synchronization is very important, which the CouchDB changes feed provides.


I haven't used CouchDB yet, but I have a good friend who's an amazing developer, and he swears by CouchDB, mainly for the reasons you mentioned. So, I don't have any context for your app, but it sounds like you picked a well-suited database to me.


That sounds like a good use for CouchDB; I'm doing something similar. CouchDB shines at that stuff (and as a bonus, avoids some of the issues the OP was having with MongoDB. CouchDB views aren't magic, but they're powerful and functional; more than capable of doing some basic joins).


Or maybe something like this: A Graph Database

http://en.wikipedia.org/wiki/Graph_database


Yep, the "graph databases are too niche to be put into production" bit urked me- Neo4j et al are in plenty of large production systems. OTOH maybe, due to the distributed nature of the project, they didn't want to distribute a less-known database?


I guess in 2010 Neo4j did not have as much exposure as it has today. Still, I concur that the author should not brush graph databases aside for something like a social network - they seem a better solution than an RDBMS.


I suspect there isn't actually a lot of need for graph operations in a social network. At least, not in implementing features for the users. A distinctive thing about social networks is that although the users form a network, they are primarily social - they're interacting with their friends.

They will end up interacting with friends of friends via their friends (eg having a flamewar with your cousin's neighbour in the comments of your cousin's post about potatoes), but not with friends of friends of friends or any degree of separation further out. The queries needed are overwhelmingly local, and a boring old relational database will handle them fine.

Where a graph database might shine is in analytics over the whole network, looking for trends, hubs, clusters, etc., although I'm not convinced it would be any better than a relational database which supports recursive queries (as PostgreSQL does). However, this is exactly the sort of privacy-busting awfulness that Diaspora was built to escape from!


I think the distinction is captured in "largely". The author seems to be saying that unless you need only the absolute most minimal relational queries, don't use Mongo. That's more extreme than what I realized (and I can't tell if you're agreeing or not).


> An example would be an app where each user account is storing and accessing only their own data. E.g. something like a todo list, or a note-taking app, would be examples where Mongo may be beneficial to use.

You don't need MongoDB to store todo-list data. My opinion is that on some platforms, like Node.js, where ORMs and RDBMS drivers are not mature, it's quicker to stuff your app with MongoDB rather than a relational database, because they both use JavaScript and JSON data structures. But does MongoDB scale more easily than a MySQL database? Is it even easier to manage? I don't think so.


Relational just means tabular in the context of relational databases. For storing large-scale social networks I would think of specialized graph databases before anything else.


Linkbait title aside, it's actually a helpful example for directing a database novice on when to not use a document store. I could have used this post a few times in the past few years.


Agreed. Despite the unfortunate title, this is an informative, well written and entertaining article that I might refer to in the future. It would be better if there was a followup on when it would in fact be appropriate to introduce a document store to a project.


I agree that the OP is lengthy, and putting together this well-illustrated post is no easy feat. However, I don't think the OP should be the one to write about when you should use a document store.

Maybe I'm too annoyed by the poorly chosen title. Or that I read that entire post and was thinking where's the punchline? On one hand, I credit the author for thinking things through. On the other, the fact that she unequivocally attributes this issue to MongoDB shows that she currently lacks the domain knowledge to consider appropriate use cases. It's not a MongoDB problem, it's a problem inherent to this data structure, and someone more well-versed in this topic would not conflate the issue...just as a decent IT person would not blame "Windoze" for the fact that she can't get good Wifi reception in the office.

OK, to be even more petty...I think what really aggravates me is how the OP says she's not a database expert -- which is a good disclosure, but self-evident -- but attempts to assert authority by saying "I build web applications...I build a lot of web applications"...Uh, OK, so what you're saying is that it's possible to be an experienced web developer and yet be a novice at data design?

If that was the angle of the OP, I'd give it five stars. Such sentiment cannot be overstated.


Well you're right of course that web developers (and business analysts, and politicians, etc.) can absolutely get by for a staggeringly long time with novice-level abilities. That problem is only getting worse as the tools get better. Luckily I don't have to judge the OP on that basis since that's what markets are for.

And maybe someone else, who has tackled enough difficult problems over time to evolve a nuanced and technically informed opinion of various data modeling and management options, should write the response I mentioned. I'd argue there are plenty of examples of that material available already.

The OP, on the other hand, would be writing from the perspective of a professional user who might choose a tool off the shelf at the recommendation of a colleague, and whack it against the problem du jour to see if it works or not. This is a common enough approach that there is at least a chance that a followup would have some value. I can't really expect everyone who makes a living writing web applications to understand CS fundamentals, any more than I would expect it from chemical engineers or physicians. It is nice to be able to point representative members of that audience to an article that resonates with them, and not have to try to translate my opinions into similar language (with or without cat gifs).

Edit: I actually think Journeyman would be a more appropriate term than novice.


Aren't there things other than 'experts' and 'novices'?

It is possible to be an experienced web developer without being an expert at databases, for some reasonable definition of 'expert', sure. I think so anyway. Do you find that aggravating?

Whether it's possible to be an experienced web developer while being a novice at either 'databases' or 'data design' (are those the same thing? you said the second, OP said the first) is open to debate I suppose, but is not implied by the OP.


Even though their data doesn't fit well in a document store, this article smacks so much of "we grabbed the hottest new database on hacker news and threw it at our problem", that any beneficial parts of the article get lost.

The few things that stuck out at me:

* "Some folks say graph databases are more natural, but I’m not going to cover those here, since graph databases are too niche to be put into production." - So you did absolutely no research

* "What could possibly go wrong?" - the one line above the image saying those green boxes are the same gets lost. Give the image a caption, or better yet, use "Friends: User" to indicate type

* "Constructing an activity stream now requires us to 1) retrieve the stream document, and then 2) retrieve all the user documents to fill in names and avatars." - Yep, and since users are indexed by their ids, this is extremely easy.

* "What happens if that step 2 background job fails partway through?" - Write concerns. Or in addition to research, did you not read the mongo documents (write concern has been there at least since 2.2)

Finally, why not post the schemas they used? They make it seem like there are joins all over the place, when I mainly see, look at some document, retrieve users that match an array. Pretty simple mongo stuff, and extremely fast since user ids are indexed. Even though graph databases are better suited for this data, without seeing their schemas, I can't really tell why it didn't work for them.

I keep thinking "is it too hard to do sequential asynchronous operations in your code?".


Really, in some places it hurts

* We stored each show as a document in MongoDB containing all of its nested information, including cast member*

I've seen this in people using MongoDB who bought the BS that because "it's a document store" there should be no links between documents.

People leave their brain at the door, swallow "best practices" without questioning and when it bites them then suddenly it's the fault of technology.

" or using references and doing joins in your application code (double ugh), when you have links between documents"

1) MongoDB offers MapReduce so you can join things inside the DB. 2) What's the problem to have links between documents? Really? Looks like another case of "best practice BS" to me


Links in mongo aren't really links though; it's up to the application to handle the "joins", which really means making an extra query for every linked item. It's like SQL joins except without any of the supporting tools or optimizations that exist in an RDBMS.


Yes, it is manual

But you can query for a list of ids for example, using the 'in' operator and a list. http://docs.mongodb.org/manual/reference/method/db.collectio...
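
Something like this sketch (names invented), which is the two-query "application-side join" being discussed:

    // 1) fetch the activity stream document
    var stream = db.streams.findOne({ _id: streamId });
    // 2) fetch every referenced user in a single $in query, not one query per user
    var users = db.users.find({ _id: { $in: stream.user_ids } }).toArray();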


Isn't this done client-side? Without joins in the DB engine itself, locality is much worse, along with lost opportunities for optimization, leading to much worse performance.


Yes, you have to build the list of IDs to pass to the $in operator and then send out a second query but grandparent post said you had to make an extra query for each linked item which is incorrect.


At mongo training we were told that map/reduce did not offer good performance and to avoid it for online use. You must use the "aggregation framework" instead.


> What's the problem to have links between documents? Really? Looks like another case of "best practice BS" to me

I think the main problem is that it becomes difficult to maintain consistency, due to Mongo's lack of transactions.


Do NOT use MongoDB unless you understand it and how your data will be queried. Joins by ID, like the author mentions, are not a bad thing. If you aren't sure how you are going to query your data, then go with SQL.

With a schemaless store like Mongo, I've found you actually have to think a LOT more about how you will be retrieving your information before you write any code.

SQL can save your ass because it is so flexible. You can have a shitty schema and make it work in the short term until you fix the problem.

I wrote many interactive social apps (fantasy game apps) on Facebook and it worked incredibly well and this was before MongoDB added a lot of things like the aggregation framework.

The speed of development with MongoDB is remarkable. The replica sets are awesome and admin is cake.

It sounds like the author chose it without thinking about their data and querying upfront. I can understand the frustration but it wasn't MongoDB's fault.

This is a big deal for MongoDB: https://jira.mongodb.org/browse/SERVER-142.

Let's say you have comments embedded on a document and you want to query a collection for matches based on a filter. If you do that, you'll get all of the embedded comments back for each match and then have to filter on the client. IMO, when the feature above is added, MongoDB will become more usable for more use cases that web developers see.
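
For what it's worth, a sketch of that limitation (names invented); the $elemMatch projection that exists today only trims the array down to the first matching element, which is why SERVER-142 matters:

    // matching on an embedded array still returns each whole document,
    // with every comment in it - filtering happens client-side
    db.posts.find({ "comments.author": "alice" });
    // $elemMatch projection trims the array, but only to the FIRST match
    db.posts.find({ "comments.author": "alice" },
                  { title: 1, comments: { $elemMatch: { author: "alice" } } });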


I've seen a fair number of articles over the last couple of years comparing the strengths and weaknesses of relational/document-store/graph databases. What I've never seen adequately addressed is why that tradeoff even has to exist. Is there some fundamental axiom like the CAP theorem explaining why a database like MongoDB couldn't implement foreign keys and indexing, or why an SQL couldn't implement document storage to go along with its relational goodness?

In fact, as far as I can tell (never having used it), Postgres's Hstore appears to offer the same advantages as a document store, without sacrificing the ability to add relations when necessary. Where's the downside?


> why an SQL couldn't implement document storage to go along with its relational goodness? (…) Postgres's Hstore appears to offer the same advantages as a document store, without sacrificing the ability to add relations when necessary. Where's the downside?

PostgreSQL can store arbitrary unstructured documents just fine: hstore, json, … Each comes with the possibility to actually index arbitrary fields within the documents using a BTREE index on an expression, and arbitrary documents wholesale using a GIST index.

Besides the need to know a thing or two on query optimization, the only downside I can think of is that ORMs are usually broken (Ruby's Sequel is a notable exception). But this isn't a problem with Postgres itself; it's a problem with ORMs (and training, admittedly).


Typically as your data model complexity ("relatedness") increases, it's more difficult to scale. I'm not sure about anything like CAP, but I do know that in graph-database land we have to remind ourselves that general graph partitioning is NP-Hard, and that our solutions will need to be domain-specific.


But it's not web scale :)

http://www.youtube.com/watch?v=URJeuxI7kHo


>> Some folks say graph databases are more natural, but I’m not going to cover those here, since graph databases are too niche to be put into production.

Is this really true? It sounds like both relational DBs and document DBs are a poor choice for the social network problem posed. I've actually dealt with this exact problem at my last job when we started on Mongo, went to Postgres, and ultimately realized we traded one set of problems for another.

I'd love to see a response blog post from a Graph DB expert that can break down the problem so that laymen like myself can understand the implementation.


At my current employer, we're working on a product that relies heavily on a graph DB (Titan, in this case). Performance characteristics vary dramatically based on the type of query you're trying to run, so you have to be careful about how you use it. There are certain types of things you might do in a relational DB with no worries but that would perform horribly in Titan. The converse is also true, of course. For example, a query along the lines of "give me a list of friends of friends of person X" is very fast indeed on a graph database, whereas a query like "give me a random person" tends to perform horribly. But we've been able to get impressively fast, real-time performance on graphs with millions of vertices and tens of millions of edges. They're still niche products compared to NoSQL systems like Mongo, Redis, etc. But I don't see any reason to think that Titan or Neo4J aren't production ready.

Here's a good intro to Titan and how it works: http://www.slideshare.net/knowfrominfo/titan-big-graph-data-...


I would look at Neo4j. I originally came across it when vetting Grails (it has a Grails plug-in) and it seems to be one of the heavy contenders in terms of a production-ready graph DB. People (this article's author included) seem to say that production-ready graph DBs don't exist. Maybe these projects are still trying to gain traction? I expect some stable builds will be out there soon if they aren't already...

http://www.neo4j.org/


My experience with Neo4j (this year) was abysmal. The take-away I had was: it's only good for very small graphs.

Generally, I'd spend some time writing a script to load data into it, start loading data, respond to it crashing a few hours later, increase the memory available to the process, start up again, and respond to it crashing a few hours later. I was never able to get any reasonably-sized graph[0] working reliably well without using an egregious amount of memory, and knowing that I would continue to face memory issues, I gave up on Neo4j and found another way to solve my problem.

It may be that I simply was not competent at setting it up properly, but no other data store I've worked with has been as hard to get stable over a moderately sized data set. I spoke with some other people who had worked with Neo4j at the time, and they expressed the same issues - they couldn't make it work for any reasonably-sized dataset and had to find another solution.

[0] Not big, mind you, just reasonably-sized. E.g. 4 million nodes, with each node having an average of 5 edges and 2-4 properties.


Hm, I assume you reached out to the mailing list and what not? I know a number of installations with numbers well above that. Were you using the batch insertion API?


No, I'm sure there are some great running instances out there - but I was put off by the difficulty of getting it reliably running without being an expert in its configuration. Additionally, the fact that I'd have to spend at least $12k/year to have only 3 nodes in a cluster, knowing we'd need a lot more than that as time went on sealed the deal.

We found that we could do everything we needed with secondary processing against our document store at runtime for so much less without adding another layer of complexity to the architecture.

Edit: forgot to mention - no, we weren't using batch-insertion in all cases; IIRC, we had issues with duplication and had to do check-if-exists -> create-if-not as we were reading from raw data sources that were heavy with duplicates.


Many heavy duty production customers of Neo4j run with just a 3 node cluster, no need to scale out as with other NoSQL datastores. And actually they replaced larger clusters with a small Neo4j one.

I would love to learn about your Neo4j setup, and the issues in detail, I want to make it easier for people in your circumstances in the future to get quickly up and running with Neo4j in a reliable manner. If you're willing to help out, please drop me an email at michael at neotechnology dot com.


And I remember from flipping through a book on graph DB engines that some can be mounted as an extra layer on top of relational stores, so there is always that backdoor back into it.


Yeah, FlockDB (https://github.com/twitter/flockdb) comes to mind. I think Titan (https://github.com/thinkaurelius/titan) should / will be able to handle this too.


I can attest that Neo4j is production-ready- I know they're being used at companies like Adobe and Cisco, and we were happy with it at Scholrly.


More than 30 of the Global 2000 now use Neo4j in production for a wide range of use cases, many of them surprising, even to us! (And we invented the property graph.)

A partial list of customers can be found below:

www.neotechnology.com/customers

The "too niche" comment might have been true a few years ago. I won't speak for all graph databases, since many are clearly very new and haven't had much time to mature yet. But Neo4j has been in 24x7 production for 10 years, and while the product has of course evolved significantly since then, it'd built on a very solid foundation.

Most of the companies moving to graph databases--speaking for Neo4j, which is what I know about--are doing so because of either a) their RDBMSs not being able to handle the scope & scale of their connected query requirements, and/or b) the immense convenience and speed that comes from modeling domains that are a graph (social, network & data center management, fraud, portfolios, identity, etc.) as a graph, not as tables.

For kicks, you can find a number of customer talks here, from the four (soon five) GraphConnect conferences that were held this year in major cities around the world:

http://watch.neo4j.org/

If you're in London, the last one will be held next week:

www.graphconnect.com

You'll find a summary below of some of the technology behind, with some customer examples.

www.neotechnology.com/neo4j-scales-for-the-enterprise/

One of the world's largest postal delivery services does all of their real-time package routing with Neo4j. Several customers have more than half of the Facebook social graph running 24x7 on a web application with millions of members, running on a Neo4j cluster. Railroads are building routing systems on Neo4j. Some of the world's largest customers are using them for HR and data governance, alternate-path routing, etc. etc.

The best way to really understand why graph databases are amazing is to try. Check out the latest v2.0 M06 beta milestone of Neo4j (www.neo4j.org) which includes a brand-new query environment. I've seen connected queries ("shortest path", "find all dependencies", etc.) that are four lines in the Cypher query language and 50-100 lines in SQL. I've seen queries go from minutes to milliseconds. It's convenient and fast. Glad to see you exploring graphs!


> Is this really true?

Facebook's TAO is a giant graph database built on top of MySQL [1]. I'd say it's pretty production-ready, because Facebook's social graph probably has at least hundreds of vertices.

[1] https://www.facebook.com/notes/facebook-engineering/tao-the-...


There's a difference between ready-for-production and ready-for-production-if-you-have-the-entire-team-of-developers-that-wrote-it-on-hand-all-the-time.


See http://blog.neo4j.org/2013/11/why-graph-databases-are-best-t... on how to use Neo4j for the mentioned Diaspora cases (Neo4j was actually proposed back in 2010 to the team). Comments are very welcome.


Exactly what I thought. Mongo has its purpose. But it's a tree. If your data is a graph with many nodes, it's going to take some elbow grease. Don't use mongo in that case, use something that is built for that, like Neo4j ...


What's wrong with symlinks on a transactional filesystem?


Millions of small files is the worst workload for pretty much every filesystem. Data locality and fragmentation can end up becoming real problems too.


I hate link bait like this.

The real title should be "Why you should never use a tool without investigating its intended use case".


But the point is that there is no use case. Relational databases and normalisation didn't arise because a load of neckbeards wanted bad performance and extra complexity.

The point of the article is that the world is relational, and because Mongo isn't, it'll bite you in the ass eventually. Sure, that's a specialisation of what you said, but still a useful one, as it allows you to immediately know you shouldn't use Mongo (unless your data is all truly non-relational, and you know you'll never integrate it with any relational data, which, without a crystal ball, you can't know, so don't use it).


There is a use case, but internet hype has gotten everyone wanting to use Mongo when there's no real reason to. Postgres scales nearly as well as Mongo while being a lot more flexible. That said, Mongo has some real benefits for non-relational computing (see mapreduce) that could make some of the abstraction headaches and lack of data model flexibility worth it for very large data sets.

But I sort of agree; Mongo tends to be overused by startups who are trying to solve a scalability / performance problem before they have one. In the process they often end up running into data model limitations because stuff moves fast early on and you can't foresee what you'll need in a year.


As soon as you have users, you'll want to handle relationships between users, whether that's outward-facing or for internal analytics. All products have users by definition. Therefore...


Yes. There seem to be a lot of people with quite poor reading comprehension commenting here. The case that the article makes is something like:

1. Document stores are no good for data with non-strictly-hierarchical structure.

2. All interesting data has some non-strictly-hierarchical structure.

The first point is common knowledge nowadays. It's really the second point that is interesting. Moreover, interesting and correct.


Can anyone explain what are some actual real-life good uses for MongoDB?


I was on a team that built a web app for primary school standardized testing. The amount of data presented and collected per student per test is large and perfect for a document store. MapReduce operations allow the app to quickly produce cacheable reports across cross-sections based on requested criteria.

Even the tests themselves are composed of multiple parts that randomize for each student, and lend themselves to the document structure that MongoDB provides. Individualized tests could be assembled from components based on student criteria and stored uniquely for a user as of that time, a thing which would be unnecessarily complex within a relational system.

Could this all have been done with a relational database? Yes, I suppose, but I cringe at the complexity of relating test questions with test answers with users with other data elements ad infinitum using JOINs on both read and write. And this doesn't even touch the topics of sharding and replication, which Mongo made easy in comparison to MySQL or MSSQL.

Choosing MongoDB was the correct decision for this dataset and application. I don't advocate it for every app, but for this one, it was the appropriate fit.


How would you go getting out data that answered a question like "give me the average score for all maths questions of female students aged 6-7"?


The aggregation framework is ideal for answering these kinds of questions. Mongo has a bunch of aggregation routines that are useful for producing reports on demand, but not on the fly. The trade-off is possible because we know that the collected data (test answers) won't change after it is finalized, and the output of any individual report can be cached virtually indefinitely. (See http://docs.mongodb.org/manual/core/aggregation/)
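
As a rough sketch (collection and field names are mine, not the real schema), the parent's question could look something like:

    db.answers.aggregate([
      { $match: { subject: "maths",
                  "student.gender": "F",
                  "student.age": { $gte: 6, $lte: 7 } } },
      { $group: { _id: null, avgScore: { $avg: "$score" } } }
    ]);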

Also keep in mind that unlike something like a web analytics package that gives you the option to filter and sort your data on any combination of criteria imaginable (for no good reason), the questions that academics/educators tend to need the answers to are generally the same for every new set of tests.

In other words, it's not necessary to enable every possible combination, filter, and sort of output, but merely (ha!) to optimize for the specific results that we know we will need (with a nod toward those results we might expect people to want in the future), and to codify the formulae that will produce those results.

Working not with a school but with a testing research company (think "College Board" vs. "Smallville School District") leads you to produce reports that are significantly more detailed and statistically more valuable than "average score" questions like this, all of which is possible within MongoDB. Though obviously this treads a little more into work product than would be comfortable to expound upon here. ;)


Sounds reasonable, but not ideal. Probably some sort of ETL into a data warehouse would be needed for ad-hoc analysis.


That's what the aggregation framework is great for. I don't have the code handy, but the free MongoDB for Developers course covers almost this exact use case.


Something like Imgur might be a good use-case for MongoDB. There's basically no relations between images, so each image can easily be thought of as a lone document.

That said, even if somebody was building something like Imgur, I would still advise that they start with a SQL database. SQL is very well understood, and you will have no problem finding developers that have deep experience in your SQL engine of choice.

More importantly, by the time you hit the point where you need a NoSQL solution to handle scaling issues, you will have achieved product-market fit, and can make a sane technology decision based on your vastly greater understanding of the business needs.


> by the time you hit the point where you need a NoSQL solution to handle scaling issues

See, people keep saying that NoSQL databases give you a performance boost over traditional relational solutions (MySQL and Postgres), but exactly where does this performance boost come from? I can understand the appeal of in-memory databases or using caching (Memcached) to supplement the relational solution, but it seems like the vast majority of Mongo's performance benefits come from eschewing ACID guarantees rather than document databases being inherently faster.


Imgur is set up to have almost the same structure and features as reddit. Users have images, there are sections duplicating the subreddit structure, images have comments, comments have votes and voters. That's a lot of related info that isn't strictly hierarchical, and I believe you'd run into the same problem described in the article - the need to manually do joins and associate types of data in your application code to ensure consistency and lack of duplication.


Except there are relationships in imgur. For example, when it groups images from different subreddits or creates albums.


Seconded. I see a lot of "But there are use cases!" and, so far at least, not even a little bit of "Here's a use case..."


Check my comment history, I've given several use cases. Basically, Mongo is nice if you don't need a lot of relational stuff but do have lots of arbitrary data to store. A good example is time series data where the format changes over time -- often it's not a good idea to go back and convert old data (sometimes it's not even possible). Mongo makes it really easy to support multiple schemas if your business requires it, rather than having to maintain arbitrary numbers of different tables.
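
For example (field names invented), two generations of a reading can live side by side in one collection, and queries can opt into the newer fields:

    db.readings.insert({ sensor: "a1", ts: new Date(), temp_c: 21.4 });              // v1 format
    db.readings.insert({ sensor: "a1", ts: new Date(), temp_c: 21.5, rh: 0.43 });    // v2 adds humidity
    db.readings.find({ rh: { $exists: true } });   // only documents that have the new field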


Sure. But why use Mongo for that, instead of a PostgreSQL table with a JSON column for the data, and perhaps a denormalized version identifier so your application code knows what to do with the format of a given row's data field? I can see a speed argument, but I can't see how that militates for pure Mongo, instead of Mongo as a cache in front of something that provides a reliably (not "eventually") consistent backing store.


Alright, so I am not the only one who's been very curious. Please, someone, write up a real use case where MongoDB is used as the only/main data store, and not as a persistent cache in front of a relational database?

Thanks.


TL;DR: Mongo works for me ('us'), and when I get some time I'll write a post on what we're doing with it, and why it works for us.

I am ('we are') using Mongo for a public transit planner in South Africa. It's not yet production-ready, but beta testing is going well.

Let me paint the picture before I go on to justify our use of Mongo.

In South Africa there are trains, buses, minibuses and other services (metered taxis, shuttles). Trains run on stop-by-stop schedules, all of the bus services run on a normal departure-based schedule, and minibuses are completely dynamic. In order to implement a well-organised integrated planner, you have to view all of these as one 'type' of service. Then there's how pricing is calculated for each of the services. There are many different ways in which pricing is calculated (distance-based, zone-based, pricing matrix, fixed minimum with variable charge, etc.), then there's ticketing, discounts etc. That too we needed to represent in a simple structure.

Now, my SQL is pretty good - I don't frown upon indices or joins - but what I can say is that in my initial implementation of the whole idea, I faced a number of problems:

(1) what level of normalisation/denormalisation is necessary? (i.e. what should I join, what should I keep in the same table)

(2) I'm essentially using a graph, except that nodes aren't always connected, so how do I traverse the graph when it's actually 'broken'? (excuse me if I get the terms wrong, I'm actually a technical accountant, programming is my second love)

(3) I'm working with location data, I can't expect to use the Haversine formula or equivalents, how can I index both locations, and routes? How do I even store routes? (blob of serialised arrays?)

(4) How do I reduce development time, to reduce the amount of time I spend refactoring schemas/code when I want to implement some new 'shiny' feature?

These were my main concerns, as they were the problems that I had with MySQL. PostgreSQL would have done a good job for (1) and (3), and maybe (2), but I was still concerned with (4), as working with PHP/MySQL isn't the friendliest of things. Doing 'in()' queries is one example, as I have to convert an array to a string before using an in() function.

Mongo initially appealed to me because it was marketed as 'schemaless', but even someone with little knowledge as I knew that it should be taken with a grain of salt. The benefit here was that I could store different services with different attributes in one collection. If a service is a minibus, I add all the fields that I need for the minibus, and omit the ones for a train for example. Similarly with the pricing structures.

At first it was difficult grasping the 'store everything in one document' model, but I got the hang of it, and now my schema is 'frozen', so (1) has been solved. I don't use joins or any simulation thereof. Because I store what I need in one document, I don't need to go back to Mongo to find joining data.

(2) A graph database wouldn't work well for me, because even though services link together, there are instances where the commuter will have to walk to join another service. How do I do that? I initially created manual walking links in MySQL but that was naive and stupid (trying to avoid Haversine). Obviously PostgreSQL distance-based queries would also work here. Another thing against a graph database is that my project doesn't just rely on traversing graphs all day, there are other things which I need to do, like analytics.

(3) To be honest, even though I'm confident with my SQL knowledge, all the PostgreS/PostGIS functions felt a bit intimidating at first. I can't afford a $250k/year DBA, so I have to know what I'm doing on the database as well as the client/server. I find learning how to use Mongo to be easy, even though people say their query 'language' is (insert bad word), I find it quite user-friendly. Mongo's geospatial support, and GeoJSON as of 2.4, made things like storing a route, and running typical queries on it, very easy.

One other thing that often gets discounted is the transparency of data in Mongo. Yes, they should use field compression and other space-saving techniques, but until then, I see the benefit in being able to do a db.collection.find() and look at the whole document without having to join any tables (it's just a bit quicker, I guess). Which brings me to (4): the most fun thing that I had to implement was scheduling, bearing in mind that there are scheduled services, and variable/random ones. Let's say I have a separate table for schedules, and I want to find:

- x services,

- that start/pass at [lat,lon],

- which are operating during ab:cd AM,

- which have the next schedule/estimate within m and n.

Sure, you could do it with a JOIN, but why do that when you can do it all from a single document?
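
To give a flavour of what I mean - this is only a rough sketch with made-up field names, assuming a 2dsphere index (available since 2.4 for GeoJSON):

    db.services.ensureIndex({ route: "2dsphere" });
    db.services.find({
      route: { $geoIntersects: { $geometry: { type: "Point",
                                              coordinates: [28.0473, -26.2041] } } },
      "schedule.first_departure": { $lte: "07:30" },
      "schedule.last_departure":  { $gte: "07:30" }
    });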

Lastly, experienced programmers tend to take for granted the benefit of simplifying certain things for novices/beginners. The reason why JSON is taking over, besides that it's a more compact and readable expression than XML, is that it's also easy to work with. Why should I worry about converting associative arrays to strings in order to do an in()? Instead of saying something like "in(array)" directly? Even though I was forward thinking in my schema design, there were a few changes along the road when I realised that something wasn't working. Making a change in the schema was quick, and I didn't have to spend a lot of time making sure that my data is still fine.

Please note that I didn't talk about 'web scale, speed' or all those other things. Under my current hardware I would need to cover 3 countries' transit systems before I need to shard. I am running a single-node replica so I can enable backups.

That's just my view, I wrote this in pieces, so I might appear to be all over the place. I'll write a thought-out post detailing why Mongo is currently working for us/me.


We have a CMS application that supports creating custom web forms, each of which has a different set of fields holding different types of data: email addresses, multi-select radio buttons, text areas, etc. Some forms only get submitted once or twice, others are submitted many thousands of times. To store this in a normal relational database you either need many tables, or you need to normalize your data (probably Entity-Attribute-Value (EAV) style). We didn't want hundreds of tables, and designing a good EAV system can be tough (Magento, anyone?), so we looked at other options.

We've settled currently on using MongoDb, with one collection to hold the (versioned) definition of each form's fields (data type, order, validation rules, etc), and another collection to hold all of the submissions (this is a laaarge collection). There _is_ a "relation" between the form definition and the submissions, but because you always query for submissions based on the form (by form_id), you don't really need to do "JOINs" (you just query out the form definition once, before the submissions). Also, because the forms are versioned, and each submission is associated with a particular version of a form, there is no need to retroactively update the de-normalized schema of past submissions (although this does limit your ability to query the submissions if a form is updated frequently - or drastically). It's not perfect, but this use case for MongoDb has been working well for us so far.
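
In rough terms (names simplified), reading a form's submissions looks like this - the definition is fetched once, then the submissions are queried by form id and version, so no join is ever needed:

    var form = db.form_definitions.findOne({ _id: formId, version: 3 });
    var subs = db.form_submissions.find({ form_id: formId, form_version: 3 })
                                  .sort({ created_at: -1 })
                                  .limit(50);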

My answer to this prompt was starting to get long, so I actually wrote it up in more detail on my blog (the first update in months!). Included there are some other drawbacks and tradeoffs. Check it out here if you are interested:

http://chilipepperdesign.com/2013/11/11/versioned-data-model...

I would love to hear how other folks have solved similar issues like this? Or if anyone sees a way to improve on our current solution? Feel free to respond here or on the blog post. Cheers


Reminds me of the NYT "Stuffy" app, which was built in a similar way: http://open.blogs.nytimes.com/2010/05/25/building-a-better-s...


A good simple use-case for a document database (could be MongoDB, but not necessarily) is configuration and system "schema" type data. For example, storing all of a user's settings and preferences into a document keyed by the user's Id.


We use it for event storage in the event sourced parts of our app.

For the rest of our data, we're currently migrating off of Mongo to Postgres due to an experience similar to the OP's.


This is ridiculous linkbait bullshit.

Anyone who dismisses document stores entirely has lost all my respect. It wasn't the right solution for your problem, but it might be the right solution for many others.


> but it might be the right solution for many others.

The author made the example of the movie database and explained why it was a good idea when they started, and why it didn't work out. Can you point out an example of data you would store in a document database, which is not purely for caching purposes?


Collecting structured log data like monitors or exception traces or user analytics. Lots of documents, no fixed schema, they're all self-contained with no relations. Map reduce makes query parallelism crazy magic.

A content management system. Some stuff may want data from across relations (who owns this thing, and what is their email), but that's pretty infrequent and having nice flexible-schema documents that contain all relevant information that's being CRUD'ed simplifies things hugely - particularly in MVCC systems like Couch that put stuff like multi-master/offline-online sync and conflict resolution in the set of core expectations.

Edit: That said, Postgres is also MVCC, and hstore makes schema an option the same way that relations and transactions were already, so I think it could do pretty well. I haven't gotten the chance to play with it in recent history, unfortunately.


> Map reduce makes query parallelism crazy magic.

Isn't that only true if you have lots of shards? Otherwise, you have one process doing the mapping.


> Some stuff may want data from across relations (who owns this thing, and what is their email), but that's pretty infrequent

That might be a shaky assumption. Speaking as someone who works on a CMS, content usually has an author, and people accessing that content might be interested in them.


Yeah, but in most of those cases, it's as easy as getting the author based on a key from his content.

It's only when you want joins (e.g. give me all of the titles of all the content and their author's information at the same time) that things get hairy.

Agreed it's not always going to be true for many CMSes. I meant it as a particular CMS, not the general class of CMSes but didn't make that clear at all.


>Can you point out an example of data you would store in a document database, which is not purely for caching purposes?

I've been a pretty vocal critic of document databases in the past[1] (indeed, I get a little bit of a chuckle recalling the prevailing HN wisdom a couple of years ago and comparing it to now), however I recently had a project where added data was immutable and additive and non-relational: MongoDB was the perfect choice, and provided a zero friction, easy to deploy and scale solution.

[1] - http://dennisforbes.ca/index.php/2010/03/24/the-impact-of-ss... -- this went seriously against the prevailing sentiment at the time, and there was this strong "only relics use SQL" sentiment, including here on HN.


> I recently had a project where added data was immutable and additive and non-relational: MongoDB was the perfect choice

Technically that sounds a lot like the TV database example in the OP. MongoDB was the perfect choice until a feature was required that required a relation.


Agreed. Also, this looks way more like a case where the author mis-structured his data for his intended use case, and is blaming the tool instead of the skill level used to implement it. Nesting deeper than one level in a document is rarely going to result in sufficient query capability with respect to the nested items. Even MySQL can't nest rows inside other rows, which is what he seems to have wanted. Maybe he chose MongoDB because he wasn't ready to think around the architecting issues that an SQL-based database would require, which happen to be, although not immediately obvious, similar to those in Mongo.


Despite being a programmer, I believe Sarah is a woman.


The more reason this article should be taken with a grain of salt.


Seriously, dude? It's attitudes like that which makes women hesitant to become coders and computer engineers. It doesn't matter what's between a person's legs; just that a person can code, enjoys doing it, and knows what they're talking about.


Cut the guy some slack, he obviously hates women because his mother named him "Pear Juice". Either that, or he's yet another insecure man-child hiding behind a pseudonym.


For years men were dominant in computer science and engineering, and I really see no reason why this should change. Only in recent times, with this whole third wave of post-feminism, have certain groups of females decided it is their task to overthrow male supremacy in said fields. They are way too emotional for this profession, and this results in drama and absolute shit code in production.

I am not saying the author of the submitted article has fallen victim to the described wrongdoings; I am only saying that, at all costs, unknown territory should be approached with extreme care. You don't know what is subliminally hidden until you realize it. Too late.


This attitude is both hateful and harmful to our profession. It is not ok and I wish more of our peers would step up to tell you that this is not acceptable.

You are also making wildly inaccurate statements to justify your abuse, and I hate to think that other readers might accept them uncritically. We have been actively pushing women out of computer science for the past 30 years (and doing a fine job of excluding and ignoring the contributions of other minority groups in the process). Suggesting that men "were" dominant misrepresents the direction of this change and is willful ignorance of history at best.

I know replying to trolls is not particularly effective, and other users have called this out for being hateful, but I don't want to see us accept either the premises or the tone presented here.


> They are way too emotional for this profession and this results in drama and absolute shit code in production.

So that is where all of the shit code in production (which is the majority) comes from? It's really insidious, because the commits are made using male names. This must have been going on far longer than I imagined, because I've dealt with really old legacy code that is shit.

Since you made a provable statement, I'm sure we will soon see a tremendous number of papers documenting this coming gynepocalypse of bad code.


Get out of my profession, you repugnant sexist asshole. This kind of rhetoric is not acceptable.


> third wave of post-feminism

Hint: /r/TheRedPill is going to make your life worse, not better.


Oh my god, I did not need to know that existed. I need brain bleach now.


Yeah, stay away from that stuff... can mess you up.


how does it feel to be a piece of shit?


Fuck you.

...Was that emotional enough? Or not emotional enough? It's so hard to tell.


I don't understand why your comment is being buried. It's absolutely idiotic and backwards in substance, sure, but sweeping it under the rug doesn't help anyone.


While your namesake is sweet and delicious, your opinions are questionable.


They say never to jump into an argument late... but here goes...

There are a lot of people arguing in favor of polyglot persistence. The arguments sound pretty appealing - hammer, nail, etc. - on the face of it.

But as you dig deeper into reality, you start to realize that polyglot persistence isn't always practically a great idea.

The primary issue to me boils down to safety. This is your data, often the only thing of value in a company (outside the staff/team). Losing it cannot happen. Taking it from there, we stumble across the deeper issue that all databases are extremely difficult to run. DBAs don't get paid $250k/year for nothing. These systems are complex, have deep and far-reaching impacts, and often take years to master.

Given that perspective, I think it then makes the decision to use a single database technology for all primary storage needs totally practical and in fact the only rational choice possible.


I'm going to reiterate what others have said - this is an area where a good graph database would blow all the others out of the water. I am currently using neo4j for a web app and find it to be extremely good in terms of performance. There is really only one downside to using a graph database - they are not as horizontally scalable as you might want, and they need a fair bit of resources. But in terms of querying, they would be unparalleled in this particular use case.

They are also not in their infancy - they are in use in many places where you wouldn't expect them and which aren't discussed. One big area is network management - at least one major telecom uses a particular graph DB to manage nodes in real time.
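
To give a feel for the querying side, here's a rough sketch of the kind of query a graph database makes natural for this sort of social data - a user's activity stream pulled straight from relationships, no join tables. It uses the official neo4j Python driver; the labels, relationship types, and connection details are all made up.

    from neo4j import GraphDatabase

    driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

    query = """
        MATCH (me:User {name: $name})-[:FOLLOWS]->(friend:User)-[:POSTED]->(post:Post)
        RETURN friend.name AS author, post.title AS title
        ORDER BY post.created_at DESC
        LIMIT 20
    """

    with driver.session() as session:
        for record in session.run(query, name="alice"):
            print(record["author"], record["title"])

    driver.close()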


> they are not really scalable horizontally as you might want

Seems like this would be a huge drawback for a project whose entire raison d'etre is horizontal scaling.


This is a well-known and well-documented downside of Mongo. Frankly, the analysis in your article is jeopardized by your first line stating, "I am not a database designer". Mongo has downsides that are well known, but there are also very good reasons to use MongoDB. Although it's a lengthy article with good examples, it states nothing more than an obvious caveat of Mongo that is well known and documented.


My advice to the OP: Re-jigger this article and retitle it: "The Data-Design of Social Networks". That would be a worthwhile read and I appreciate the detail that the OP goes into.

One of the subheads should be: "Why we picked the wrong data store and how we recovered from it"

And not to be snarky about it, but an alternative title is: Why Diaspora failed: because a Ruby on Rails programmer read an Etsy blog and thought they understood databases


OP should have been using a graph database. Ranting about MongoDB because it doesn't support what it's not designed to support is a bit silly. A RDBMS would have been just as poor of a choice here.


Facebook seems to be doing quite fine by combining SQL and Memcached.
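
For readers unfamiliar with that combination, the usual pattern is look-aside caching: check memcached first, fall back to SQL, then populate the cache. A rough sketch (obviously not Facebook's actual stack; sqlite3 and pymemcache stand in here, and the table/keys are made up):

    import json
    import sqlite3
    from pymemcache.client.base import Client

    cache = Client(("localhost", 11211))
    db = sqlite3.connect("app.db")

    def get_user(user_id):
        key = "user:%d" % user_id
        cached = cache.get(key)
        if cached is not None:
            return json.loads(cached)          # cache hit: no SQL at all

        row = db.execute("SELECT id, name, email FROM users WHERE id = ?",
                         (user_id,)).fetchone()
        if row is None:
            return None
        user = {"id": row[0], "name": row[1], "email": row[2]}
        cache.set(key, json.dumps(user), expire=300)   # cache for 5 minutes
        return user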


They probably don't use relational databases the way you do for smaller projects that don't need to scale to the millions.


Though the title sounds like link bait, this is actually an eye-opening article for a database layman like me. Very clearly written.

Now, what is MongoDB a good fit for? Most web applications are like the example the author gives: complex and full of inter-relationships. Can someone shed some light?


Like the article says, it can be suitable as a caching layer in front of a DB, especially for web apps that deal in ephemeral JSON documents.
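
One way that ephemeral-JSON-cache idea plays out in practice is a TTL index, which makes MongoDB expire cached documents on its own. A small sketch (collection and field names are made up):

    import datetime
    from pymongo import MongoClient

    cache = MongoClient()["cache"]["api_responses"]

    # Documents are deleted automatically about an hour after created_at.
    cache.create_index("created_at", expireAfterSeconds=3600)

    cache.insert_one({
        "key": "/feeds/user/42",
        "payload": {"items": [], "next": None},
        "created_at": datetime.datetime.utcnow(),
    })

    hit = cache.find_one({"key": "/feeds/user/42"})

Nothing relational needs to live there; if a cached document disappears, the application just rebuilds it from the primary database.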
