

A Year of MongoDB - j4mie
https://speakerdeck.com/mitsuhiko/a-year-of-mongodb

======
neya
This is the problem with most of the people who go with MongoDB. Obviously, this
person is very technical, so I am not flaming or accusing him, but this is my
view of the rest of them, who pick MongoDB without having a clue as to why
(hipsters), or when they should use a NoSQL db and when they shouldn't.

I do not hesitate to admit that I was a hipster some time back too. I chose
MongoDB for many of my projects and it went well, until they reached some kind
of moderate scale and I realized that going with a NoSQL db was a terrible
choice (sometimes I'd have to duplicate data because there were no joins,
etc.). That's when you start to realize that NoSQL is not a silver bullet. It
is designed to satisfy very specific use cases. Relational databases are
really good enough for 99% of the use cases out there.
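The duplication the parent describes can be sketched in a few lines. This is an illustrative example with made-up field names, using plain Python dicts to stand in for MongoDB documents: without joins, the author's details get copied into every post, and one logical change then touches many documents.

```python
# Normalized (relational) shape: one author row, posts reference it by id.
authors = {1: {"id": 1, "name": "alice", "email": "alice@example.com"}}
posts_normalized = [
    {"id": 10, "author_id": 1, "title": "Hello"},
    {"id": 11, "author_id": 1, "title": "World"},
]

# Denormalized (document) shape: author fields copied into each post so a
# single read returns everything, but renaming the author now means
# rewriting every post that embeds her.
posts_denormalized = [
    {"id": 10, "author": {"name": "alice", "email": "alice@example.com"}, "title": "Hello"},
    {"id": 11, "author": {"name": "alice", "email": "alice@example.com"}, "title": "World"},
]

def rename_author(posts, old_name, new_name):
    """The write amplification: one logical change touches N documents."""
    touched = 0
    for p in posts:
        if p["author"]["name"] == old_name:
            p["author"]["name"] = new_name
            touched += 1
    return touched
```

With a relational schema the same rename is a single-row UPDATE; here it scales with the number of embedding documents.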

Unless you are COMPLETELY unable to design your schema in a relational
database, you SHOULD NOT simply opt for a NoSQL database. The claimed NoSQL
performance benefits will easily be wiped out by a terribly designed schema if
you use the wrong db for the wrong scenario. Trust me, MySQL has attracted so
much negativity because of these hipsters, but even something as plainly
relational as MySQL scales really, reaaallllly well. In fact, many top
companies still use MySQL in production to this day, for a reason.[1]

Next time you launch your start-up, spend some time carefully evaluating your
db design decisions, as the wrong db for the wrong use-case could easily
become the most expensive mistake of your startup.

[1] [http://www.quora.com/Quora-Infrastructure/Why-does-Quora-use-MySQL-as-the-data-store-instead-of-NoSQLs-such-as-Cassandra-MongoDB-or-CouchDB](http://www.quora.com/Quora-Infrastructure/Why-does-Quora-use-MySQL-as-the-data-store-instead-of-NoSQLs-such-as-Cassandra-MongoDB-or-CouchDB)

~~~
eranation
All I can say is this: if the saying "Always plan to throw away your MVP" is
true, then I can't see any storage solution other than MongoDB (or a similar
schema-less document storage DB) for MVPs. The speed of development and
flexibility are simply worth it. Yes, it is hard to refactor a live product
and move it from MongoDB to MySQL / Postgres, but it has been done before, and
you only do that if you get traction, so it's a good problem to have.

I would start with MongoDB, get a grip on what on earth the product is doing,
finalize the schema on the fly based on A/B tests, customer feedback and
analytics, only once the schema is finalized move it to a SQL database if
needed.

If you are not using a good ORM + DB migration system, then MongoDB makes
perfect sense when you are quickly iterating through ideas and trying to find
a product / market fit. You really have no clue what your data schema is going
to look like in the end, so why constrain it at the start? The vast majority
of startups sadly won't get even near the phase where they hit scale-related
performance issues, so choosing MongoDB for prototyping your business
absolutely makes sense to me.

~~~
cobbzilla
You make a VERY big "if" in your first sentence, one that (admittedly
anecdotally) I've very rarely seen hold true in tech companies. Much more
often, the MVP becomes the product, and all those shortcuts and poor design
decisions come back to kill your productivity when it becomes necessary to
refactor foundational tech/designs that have metastasized throughout the
codebase.

I am curious if this is other folks' experience as well, or do you actually
throw out the MVP and start all over at some point? If so, at what point do
you make the break?

~~~
mrgreenfur
In real life, you are totally right. The initial test that succeeds
becomes/continues to be the real product. I agree with the parent poster that
Mongo is awesome for testing, if not awesome for scale. I think the point here
is that tech co-founders/leads need to make it clear that this is a debt that
will need to be paid if things take off.

------
cpleppert
I'm a little blown away by the total lack of technical understanding when it
comes to MongoDB. This person is obviously very technical, so why would he
have chosen MongoDB in the first place? It isn't as if MongoDB's technical
shortcomings are a secret. The description of how MongoDB does sharding and
distributed queries should have immediately raised red flags for anyone with
even a modicum of CAP understanding. If it sounds complex, a total hack, and
impossible to manage, it probably is.

For gosh sakes, even the 10gen guys admit MongoDB lost data a year(!) ago. [1]
If a database lost data in a single server configuration, why would you trust
it as a cluster?

1: <http://www.dbms2.com/2011/04/04/the-mongodb-story/>

~~~
bsg75
Because developers without significant database / data store experience are
choosing where to put their data based on what's "easy to use".

Yes, RDBMS and similar approaches are difficult to work with [1] and introduce
"impedance mismatch". Yes, joins can be slow. Yes, scaling can be difficult
and/or expensive. But choosing where to put data without understanding (or
accepting) _why_ the above is difficult is only asking history to repeat
itself.

[1] "Database guy here". I know databases and SQL. I would not consider
choosing what language to implement a web application tier on, because I don't
have enough relevant experience. I focus on what I have battle experience in.

~~~
ig1
"Easy to use" is a perfectly legitimate criteria to optimize for. For most
startups performance/scale is a total non-issue, most of the time even
something like berkleydb would do. At most early stage startups developer time
is the single biggest bottleneck.

If you make a non-optimal choice upfront you can always migrate to another
database later on (database migrations are painful, but in practice they're
something you have to do sooner or later in any case).

(obviously this applies to the scaling argument; if you need transactions you
should pick a database which supports them)

------
dmytton
A few comments on the problems:

* CPU bottleneck. mongod is not usually CPU-bound, except when building indexes on existing data (which shouldn't really happen in production). The issue he's talking about is contention between the web server (or workers) and the mongos. This isn't anything unexpected. It's recommended to put the mongos on the application server; you then scale this by adding CPUs initially, and later by adding multiple application nodes behind a load balancer. Or perhaps by splitting the mongos onto dedicated nodes.

* Virtualisation: VMs are notorious for having variable performance because you have the overhead of the hypervisor but more importantly, are sharing resources with others. We run performance critical apps on dedicated servers and reserve VMs for tools or things which aren't high throughput e.g. a MongoDB arbiter.

* EBS: This has had known performance issues for years. It's fine as a basic file store but should never be used for databases. Provisioned IOPS (PIOPS) are the way around this, but local instance storage is also an option.

* No transactions: MongoDB has never had them. This is known.

* Schemaless != no schema design. It makes it easy to play around but you still need to think through things carefully. See [http://blog.serverdensity.com/mongodb-schema-design-pitfalls...](http://blog.serverdensity.com/mongodb-schema-design-pitfalls/)

* No joins. Again, it has never had joins. This is known.

~~~
jeffdavis
"[EBS is] fine as a basic file store but should never be used for databases."

Doesn't Heroku use EBS for all of their postgres databases?

It may be the case that postgres works better on EBS than mongo does. Postgres
has a traditional write-ahead log that minimizes (and spreads out) block
writes and hides latencies. Mongo does not.

~~~
jbellis
WAL only helps so much, since you also need seeks for reads. Cassandra has a
WAL + log-structured storage + no read-before-update design, so it basically
eliminates seeks on writes entirely, and EBS is still ass for workloads that
don't fit in cache. Which, if you're bothering to use C*, is almost all of
them.

~~~
jeffdavis
That's ideal, but many kinds of updates require some kind of read.

I agree in general though.

------
ot
Completely off-topic (well, that's my nickname), but I'm seeing more and more
_beautiful_ slide decks on HN, from a purely aesthetic point of view. This
deck has beautiful fonts and a beautiful color scheme, and it is nicely
designed.

My question is: how are they made? Keynote, PowerPoint, HTML...? Are they made
with the help of a graphic designer? They look completely outside the reach of
the average technical developer. Or do they use a pre-made theme?

~~~
ims
Zach Holman has written about this:
<http://zachholman.com/posts/slide-design-for-developers/>

In fact, this slide deck was extremely reminiscent of his style, down to the
font and some other details. I wouldn't be surprised if the person who
designed this deck was influenced by some of Holman's previous presentations.

~~~
FraaJad
Eh, Mitsuhiko is known to have a good eye for design. See his Flask website,
etc.

------
nasalgoat
As I clicked forward in the slideshow, I kept expecting to find something to
disagree with but it never happened. Mostly I nodded to myself.

I have a huge investment in MongoDB at this point, both financially and in
equipment - over 200 machines dedicated in various separate clusters - and a
pivot to another datastore at this point would be a significant re-engineering
effort.

All the people talking about changing databases after the MVP have clearly
never had to deal with a typical hockey stick growth profile and having to
allocate engineering resources based on need - either making the product
better, or wasting time changing your database and losing traction.

Anyway, I'm hoping posts like this can dissuade people from choosing MongoDB
for anything destined for high throughput Enterprise-level production
environments.

I just wish there were something I could easily drop in to replace MongoDB,
but none of the available options quite fit the same document-store model
while making better use of available resources and providing much better
performance.

------
dccoolgai
Who is still surprised by this? I feel that after 2-3 years of the litany of
stories and cases like this, it should shock absolutely no one anymore.

~~~
Jaigus
He is a relatively popular programmer so people will still comment on his
presentation (even if it is a little redundant at this point). Unsung database
veterans like Tony Marston have been saying things like this for years but few
people took them seriously.

~~~
sherr
Yes, his web site is definitely not cool, but thank you for linking to it.
I'll have to brew a decent coffee or three over the next week and take some
time to read some of his stuff because it looks pretty good. Discounting the
animated spiders and mention of COBOL!

------
realrocker
In all of these discussions about MongoDB on HN, I always wonder why people
aren't talking about the 1% of use cases that it is good for. Maybe it is too
obvious for most folks, but anyway I will just regurgitate the use cases the
MongoDB folks have documented officially, here:
<http://docs.mongodb.org/manual/use-cases/>. They have literally like three
major use cases.

------
jonny_eh
Sounds like rethinkDB would be the answer to our prayers. How close is it to
being "ready"?

~~~
cmircea
Seems RethinkDB is still relational. Meh.

If you're using C#, I'd recommend RavenDB.

~~~
nissimk
RethinkDB sounds like it's document oriented from their site: "RethinkDB is
built to store JSON documents, and scale to multiple machines with very little
effort. It has a pleasant query language that supports really useful queries
like table joins and group by, and is easy to setup and learn."

Also not sure why relational == Meh.

RethinkDB does sound promising, though not ready for primetime. The following
(critical) features are still in the development pipeline: secondary indices
and a db backup tool.

~~~
mglukhovsky
Definitely, we're leaving the production-ready tag off until we've built in
some of these features.

Secondary index support is ready to go, we'll be releasing v1.5 with it in a
few days.

~~~
nissimk
I read your site and your system sounds very promising as I said in my
previous post. It seems like you've identified some of the key problems with
earlier noSQL implementations and that you're trying to solve them. Schema
free was always a feature I could understand but "no joins" always sounded
like an anti-feature to me. The promise of consistency and automatic sharding
/ replication is very nice but I'll remain skeptical until I hear about some
production implementations.

------
jsemrau
Well my 2 cents to the discussion.

We really tried to make Mongo work over the last 1.5 years. We decided to go
with Mongo mostly because of JSON and geospatial indexing.

We came to realize that for our business case it is not the right tool. That
said, and I have worked with databases for 20+ years now, I don't see what
problem Mongo solves.

------
serichsen
I can confirm that the map-reduce thing does not live up to any performance
expectations.

~~~
bjt
I assume your experience was on version < 2.4?

Now that 2.4 is out I'm interested in seeing how much the switch to
multithreaded V8 has improved that. <http://docs.mongodb.org/manual/release-
notes/2.4-javascript/>

~~~
lebski88
I've run quite a few m/r queries against 2.4.2. The performance is still poor
enough that I'd say it's only usable for occasional ad hoc work.

More seriously though, running queries against a master node will regularly
(maybe one in two queries) cause it to fail over and elect a new primary.
You'd think you should be able to run the queries against a slave, which you
sort of can. Unless the result set is bigger than the maximum document size
(16 MB), in which case you're limited to the master, as you can't write to a
collection on a slave.

We're using Mongo fairly successfully but it has a lot of issues, particularly
around administration tasks. Map reduce work gets done in Hadoop.

------
jwilliams
My first thoughts when I read this:

1. It's easy to screw up a relational database. I've seen more than a few
mature relational databases... Most of them have plenty of sins, near-
crippling performance snafus and other horrible legacy. _Any_ database that is
big enough and growing enough is a beast to manage.

2. From the slides, I think this guy took "Schema-Less" as a cue to stuff
completely arbitrary data into MongoDB. No wonder his indexes went crazy. You
still need to think about the data you're storing & the relationships. You
need a design, whatever database you use.

3. Any relational database I've seen at scale has a lot of flattening. I've
seen intra-day transaction dbs that are completely flattened. If your next
port of call is a highly normalised relational database, you're going to hit
another wall fast enough.

4. Two-phase commit. Seriously. Forget it. I've spent half my career in
financial institutions. Quick, fail-fast processing with a reconciliation
process is _by far_ the most common approach. 2PC actually slows you down and
introduces another component that gets in the way. It's used very sparingly
(and even then, it usually causes a world of pain).

------
ericcholis
I think that some people misunderstand the primary purpose of Schema-less
design. It's not about typing, it's about document flexibility. It's about
getting rid of EAV tables (read: Magento) and storing document-specific
information. Typing obviously comes into play, but is only half the topic.

If you are building a system where the schema is the same for all records,
then you really shouldn't be using Schema-less design.
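The EAV-to-document point above can be made concrete. This is a sketch with invented data (not Magento's actual table layout): an RDBMS without flexible columns stores per-entity attributes as (entity, attribute, value) rows, while a document store keeps them in one record.

```python
# EAV rows as an RDBMS like Magento stores them: one row per attribute.
eav_rows = [
    # (entity_id, attribute, value)
    (42, "name",  "Blue T-Shirt"),
    (42, "color", "blue"),
    (42, "size",  "M"),
]

def eav_to_document(entity_id, rows):
    """Fold the EAV triples for one entity into a single document,
    which is roughly what a document store keeps natively."""
    doc = {"_id": entity_id}
    for eid, attr, value in rows:
        if eid == entity_id:
            doc[attr] = value
    return doc
```

Reading one product from EAV means gathering N rows (typically via joins or pivots); the document shape makes it a single fetch, which is the flexibility the parent is describing.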

~~~
tracker1
In a few projects I've worked on in the past year and a half, I've used Entity
Framework (C#) and added a Data NVarChar(MAX) field to each table, then added
a base class with a UseData(Action<JObject>) method that passes in a Json.NET
JObject to manipulate. Adding extra properties that don't need to be indexed,
and handling default values, then becomes fairly easy with getters/setters. I
also have a TempData table that is pretty basic, where the core data is
JObject-based.

It's not the fastest option, but it has worked pretty well for me. It works
out well for holding temporary values, or other values that don't need to be
indexed, or where the shape can change dramatically. I tend to store
transaction details (credit card vs. PayPal, etc.) in JObjects, since the
shape can differ, with a key that can pull the right properties out via
.ToObject<ConcreteInstance>()
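For readers outside the .NET world, here is a rough Python analog of the pattern described above, using sqlite3 and the json module purely for illustration (the original used SQL Server, Entity Framework, and Json.NET; table and function names here are made up): indexed fields stay as relational columns, and the variably-shaped extras go into a JSON text column.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE payment (id INTEGER PRIMARY KEY, amount REAL, data TEXT)")

def save_payment(amount, extra):
    """Indexed columns stay relational; the variable shape goes into JSON."""
    conn.execute("INSERT INTO payment (amount, data) VALUES (?, ?)",
                 (amount, json.dumps(extra)))

def load_extra(payment_id):
    """Deserialize the JSON blob back into a dict (the JObject analog)."""
    row = conn.execute("SELECT data FROM payment WHERE id = ?",
                       (payment_id,)).fetchone()
    return json.loads(row[0])

# Different payment providers have different shapes; the JSON column absorbs that.
save_payment(9.99, {"provider": "paypal", "payer_email": "a@example.com"})
save_payment(5.00, {"provider": "card", "last4": "4242"})
```

The trade-off is the same as in the parent's C# version: you can't index or query inside the blob with plain SQL, so only non-queried attributes belong there.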

In other cases, I've mirrored data to Mongo, so display versions of records
can be pulled up denormalized from a single record/authority (the source
records are across 30+ joins, and fairly expensive with a 50:1 view:edit
ratio).

I will say that using MongoDB with NodeJS has to be the most seamless
combination of tools I have ever worked with. I've written a few services
based on this combination and love the development output. Fortunately my
needs have been limited enough that I have not hit too many walls. Most of the
issues I have experienced relate to geo-indexes combined with other data, the
limits on multi-key indexes, and the lack of secondary indexes.

I think more people need to consider how their data is shaped, and used and go
from there.

~~~
junto
I'd love to see a blog post on your Entity Framework JObject implementation!

------
drorweiss
After 6 intensive months with MongoDB to build my MVP, I just love it.

For sure, it's not perfect. The lack of joins is a shame, but it can be quite
easily solved outside the database.
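"Solving joins outside the database" usually means fetching both sides and stitching them together in application code. A minimal sketch, with plain lists of dicts standing in for two collections' query results (field names are invented):

```python
def app_side_join(posts, authors, key="author_id"):
    """A client-side equivalent of SELECT ... JOIN: build a lookup dict
    so the whole join is O(N + M) instead of O(N * M)."""
    by_id = {a["id"]: a for a in authors}
    return [dict(p, author=by_id[p[key]]) for p in posts]

authors = [{"id": 1, "name": "alice"}, {"id": 2, "name": "bob"}]
posts = [{"title": "Hi", "author_id": 2}]

joined = app_side_join(posts, authors)
```

This works fine at small scale; the pain the rest of the thread describes starts when the joined sets are large, since all rows must travel to the client first.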

But the ease of use and speed of development are such a HUGE advantage. Not
having to break my schema into normalized relations and define it in the DB
saved me literally days of work.

I can imagine that a year from now, when our product is more mature, we'll be
leaving MongoDB for another SQL or NoSQL database. Doesn't matter. The benefit
that MongoDB gives us now justifies the costs that may be incurred years from
now.

~~~
nateweiss
Totally agree, and not just because of your last name.

------
astral303
These guys are lucky they didn't try Cassandra. That's really Mongo's problem:
it's too close to a regular SQL solution. You have a sharded NoSQL data store
that performs in-store filtering and sorting? You can run aggregation queries?
Compound indexes? Amazing! Tell me more.

Moral of the story is unless you can justify a NoSQL datastore for your
particular solution and you can live without joins, stick with a regular SQL
db.

~~~
dindresto
"These guys are lucky they didn't try Cassandra." Could you explain that
further? I'm currently using Cassandra for a new project, so this sentence
caught my eyes.

~~~
astral303
Just like with Mongo or any NoSQL/non-traditional solution, you have to
understand how the trade-offs and capabilities of the database relate to what
you're using the database for. You also have to design your data storage with
these tradeoffs in mind.

For example, joins. There are no joins in Mongo or Cassandra and anything
working around joins is simply not going to be as fast as a traditional
database's join. If you need to do joins all the time, you will be in pain. So
the answer is to deduplicate your data, such that joins are not necessary for
frequent operations.

In particular, with Cassandra, while it's great at many things, such as write
speed and write availability, you have to be very careful with your data
design to get the results that you need. And you have to be cognizant about
the querying that you need to do.

Cassandra has really weak in-store aggregation and filtering, as in there is
almost no in-store aggregation and there is no filtering other than by a
prefix of a column or a key (a prefixed subset). So if your column names are
made of composite parts A:B:C, you can scan for A:* or A:B:* (or A:[some
value of B to some other value of B]), but you can't do *:B:* or *:B:C.
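The prefix-only restriction can be sketched in a few lines. This is an illustrative model, not the Cassandra API: composite column names are represented as "A:B:C" strings, and the only cheap operation is a leading-prefix scan; filtering on a middle component forces a full scan on the client side.

```python
columns = ["us:2013:jan", "us:2013:feb", "uk:2013:jan", "uk:2012:dec"]

def prefix_scan(cols, prefix):
    """The cheap operation Cassandra supports: A:* or A:B:* style scans."""
    return [c for c in cols if c.startswith(prefix)]

def middle_component_scan(cols, index, value):
    """The *:B:* query: only possible by examining every column."""
    return [c for c in cols if c.split(":")[index] == value]

prefix_scan(columns, "us:")                # cheap: leading-prefix match
middle_component_scan(columns, 1, "2013")  # expensive: full scan, done client-side
```

In Cassandra the first shape maps onto a slice query over a sorted column family, while the second has no server-side equivalent, which is exactly why the data has to be laid out around the queries up front.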

The advanced trick is to use ordered rows, which are strongly discouraged
(because you can shoot yourself in the foot with a key-distribution hotspot)
but give you another axis of prefixed-subset filtering. Only one more axis,
though.

Sorting? Cassandra doesn't sort. The Cassandra project leadership thinks that
sorting should be done in the client. If you want to filter a subset of keys
of the shape A:B:C, e.g. get all keys with a certain value of A and sort by
B:C, you have to do the sorting yourself. If you want to do a top-N report,
you have to retrieve all that data to your client and then sort.
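The client-side top-N described above might look like the following sketch, using Python's heapq on rows already fetched from the store (the tuples here are made-up stand-ins for fetched (key, count) pairs):

```python
import heapq

# Rows as they might arrive from the store: (key, count) pairs, unsorted.
rows = [("pageA", 120), ("pageB", 990), ("pageC", 45), ("pageD", 990)]

def top_n(rows, n):
    """Top-N by count, done in the client: O(M log N) with a heap,
    but every candidate row still has to cross the network first."""
    return heapq.nlargest(n, rows, key=lambda r: r[1])

top_n(rows, 2)
```

The heap keeps the client-side CPU cost reasonable, but as the parent notes, the real cost is shipping all M candidate rows to the client before any of this can run.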

The only sorting in Cassandra is the hierarchical column (and optionally key)
ordering. So if you want to have quick top-N reporting functionality on values
A and B from an A:B data tuple, you end up maintaining two indices (i.e.
precomputing query results). One such index has columns that start with A and
another starts with B.

But then the indexing support is particularly weak. Secondary indexing is only
done on values, so if you want to index portions of your keys, that's not
natively supported. Also, only in Cassandra 1.2 is indexing finally "write-
only," instead of "read-then-write." (Write-only performance is much faster.)

There are no triggers, so you can't write custom indices where you atomically
perform "read-then-write" operations to maintain an index. Instead, you have
to write all such custom indexing logic yourself and take a hit for
transmitting all the indexing mutations over the network. This hurts
particularly badly when you have a cluster distributed across geographical
regions (i.e. a slow/expensive link).

Cassandra does have the ability to count the number of columns, but only in
one row (with only the same prefixed-subset filtering available). Counting
columns across multiple rows is not available, even if those rows are
co-located on the same node.

Map-reduce is available, but it is not suitable for frequent queries (not
meant to be run quickly, just like map-reduce in Mongo is not something you
want to be hitting very frequently).

So, of course, whether these are issues for you depends entirely on your data
design. There are many things that Cassandra does well and certain data shapes
for which it is just diesel. It's quite ops-friendly, rolling full-uptime
upgrades are reliable and are a key priority for the Cassandra team.

So Cassandra is even more specialized in terms of its uses than Mongo. If the
original author of the presentation tried to use Cassandra for the same kind
of data he used for Mongo, he probably would've written an even more scathing
article.

~~~
pkolaczk
Cassandra and many other NoSQL databases are designed primarily for OLTP
workloads, not OLAP. OLTP is almost exclusively "find me something by primary
key" (see the TPC-C benchmark used for RDBMSes). Sorting huge amounts of data,
top-N queries, skyline queries, aggregation, joining huge data sets, and
complex filtering belong to the analytics world, not OLTP. Unless your whole
database is very tiny, let's say 10 MB, those operations are pretty expensive
even in RDBMSes. That's why those features are deliberately not included in
Cassandra. By contrast, MongoDB took a different route: it includes some of
those features and then seriously underdelivers on many of them.

------
thelarry
Not totally related to the article, but every technology has its use. I get
really annoyed when people ask me why I don't use Ruby or MongoDB at my job. I
hate this movement towards "blog technologists" who basically read something
in a blog, maybe (probably not) try it out themselves on something small, and
assume it is the best for everyone, and that if you don't use it you are
stupid.

------
pxer80
What version of Mongo and pymongo is he using? The connection client looks old
(not using MongoClient), and the compound-index selection problem (or the lack
of selection with $and) doesn't exist in 2.4.1. I'm curious as to how many of
these issues have been resolved...

------
jkldotio
I just submitted a related article on Hyperdex which, although it's not
Python, has a very good Python interface.

<https://news.ycombinator.com/item?id=5686973>

~~~
lucian1900
Hyperdex looks quite interesting indeed, but after getting burned with new
products claiming too much, I'm perhaps overly cautious. Also, it doesn't
appear to be able to change schemas after creation, which is a significant
issue.

------
babl
Is there a video somewhere of the actual talk?

------
gbog
The question I think is fundamental, but that I didn't see asked or answered,
is this: if we had a data store with full performance, full scalability, etc.,
would we design it as a relational database or not? Put differently: is the
debate about NoSQL an optimisation issue or a design issue? (Keep in mind pg's
article about what a language would look like in a hundred years.)

------
serichsen
Slide 56 states: "Schema vs. Schema-less is just a different version of
dynamic typing vs. static typing."

Wrong.

If anything, it is a different version of weak typing vs. strong typing. That
is totally orthogonal.

------
danielrhodes
The problem with most of these databases is that they are trying to do too
much.

Either the database is great for ad-hoc queries and flexibility or it's great
for performance and scale.

------
malkia
Why are the URLs not working in the SpeakerDeck document? Captain Obvious
here, but isn't that the purpose of the Web? It's the WEB... make those links
work!

~~~
ricardobeat
I think this is a fair comment. SpeakerDeck focuses on preserving the
presentation layout with absolute fidelity, and does that by using static
images. It's a trade-off. It's certainly possible to implement some kind of
link detection; I hope they are working on it or plan to.

------
Finster
I feel like there is a lot of context missing for some of these slides. Will
there be video available of the talk?

------
paradox95
Sounds like 90% of your problems could have been solved if you'd spent a day
or two researching what you were about to build your entire company around.

Sounds like the "We Fail" slide might be the most accurate one there.

