
MongoDB queries don’t always return all matching documents - dan_ahmadi
https://engineering.meteor.com/mongodb-queries-dont-always-return-all-matching-documents-654b6594a827#.s3ko3vfnx
======
im_down_w_otp
Said it before, will say it again... "MongoDB is the core piece of
architectural rot in every single teetering and broken data platform I've
worked with."

The fundamental problem is that MongoDB provides almost no stable semantics to
build something deterministic and reliable on top of it.

That said. It is really, really easy to use.

~~~
eloff
As a guy who works on ACID database internals, I'm appalled that people use
MongoDB. You want a document store? Use Postgres. Why on earth would you use a
database that makes so little in the way of guarantees about what results you
get from it? I think most people have really low load and concurrency, so
things seem to work. When things get busier you're in for a world of pain.
Look, I get that it's easy to use and easy to get started with, but you're
going to pay for all of that later.

~~~
spriggan3
> Why on earth would you use a database that makes so little in the way of
> guarantees about what results you get from it?

Because some people can't stand having to work with SQL, migrations, schemas
and constraints, it's as simple as that. (That's not my opinion, that's just
the rationale behind MongoDB.) Even if you use Postgres with the JSON column
type, you still need to write SQL queries and schemas.
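For what it's worth, the "Postgres as document store" route really does still mean writing SQL. A toy sketch (using SQLite's JSON1 functions as a stand-in for Postgres's jsonb operators; the table and field names are made up):

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE docs (id INTEGER PRIMARY KEY, body TEXT)")
conn.executemany(
    "INSERT INTO docs (body) VALUES (?)",
    [(json.dumps({"name": "ada", "age": 36}),),
     (json.dumps({"name": "bob", "age": 17}),)],
)
# json_extract is SQLite's analogue of Postgres's body->>'age':
# even with a JSON column, the query is still plain SQL.
adults = conn.execute(
    "SELECT json_extract(body, '$.name') FROM docs "
    "WHERE json_extract(body, '$.age') >= 18"
).fetchall()
print(adults)  # [('ada',)]
```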

In the context of analytics, it might make sense, I'm not a big data analyst,
but I've seen MongoDB used to centralize logs.

~~~
Lazare
> Because some people can't stand having to work with SQL,migrations,schema
> and constraints

The thing is, if you actually try and write an app using MongoDB, you will
rapidly find that you:

1) Have migrations (except they're going to be some scary ad hoc nodejs
script that loops through your document store and modifies fields on the fly).

2) Have schemas (except they'll be implicit and undocumented)

3) Have constraints (except they'll be hidden inside your app logic, and
violating them will cause data corruption).

The biggest lie about NoSQL databases is that they're schemaless. If you're
EVER going to read the data back and do _anything_ with it, it has a schema.
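To make the point concrete, here's the kind of implicit schema a "schemaless" app carries around, written out as the validation it was always relying on (field names are hypothetical):

```python
# The document a typical "schemaless" app writes...
doc = {"email": "a@example.com", "signup_ts": 1462000000, "plan": "free"}

# ...and the schema it was implicitly relying on all along, written
# down as the checks otherwise scattered through the app logic.
def validate_user(d):
    assert isinstance(d.get("email"), str) and "@" in d["email"]
    assert isinstance(d.get("signup_ts"), int)
    assert d.get("plan") in {"free", "pro"}
    return d

validate_user(doc)  # passes; a malformed doc raises AssertionError instead
```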

~~~
red_hare
This.

My company used mongo for years before we got our shit together.

Schemas were always implicit (until we got our shit together and started
defining and enforcing them with Python Schematics).

Migrations were crazy scripts you run in prod or hacks you stick into your
code to "transition".

And yes, surprise constraints left and right causing awful anti-patterns.
One-character key names to save disk. Hashed values for indexed keys to save
memory. Awkward structuring to improve query performance.

The worst part is, we now have tons of important data in these databases and
almost no one understands the legacy crazy app logic that makes them tick.

~~~
xapata
> used mongo for years before we got our shit together.

That's actually a legit use case. Use MongoDB while you get your shit
together. I use global variables while I'm noodling around in code. Eventually
I refactor.

~~~
arctor
The problem is, in a great portion of real world projects "eventually" never
comes and there's just no time for any major refactoring or replacing
technologies since you are too busy implementing the feature that was needed
two weeks ago.

~~~
derefr
I've often dreamed of a specific type of software built and released as
"prototypeware", where any app created using it will have certain built-in
scaling limits—and going past them will _irrevocably_ force the app into a
read-only mode. It would warn anyone monitoring it well in advance of hitting
such a limit, of course. But there'd be no way to just slide the limit upward
or otherwise tarry. It'd force the migration to something better just as if it
were a Big Customer with Enterprise Compliance Demands.

If an _enforceable_ mechanism like that existed, I'd be a lot more confident
in mocking things up. Stick SQLite in for the database, munge HTML and
Javascript together, whatever—it's literally going to slap away the hand of
anyone who tries to use it on a production workload, so why not?

(Going further, it'd be interesting to create some sort of quagmire of a
software license, specifically for prototypeware, such that you'd be forced to
rewrite all the prototype code instead of reusing even a hair of it in
production. Maybe something like reassigning the IP to a trust, with the trust
having an obligation to sue anyone and everyone who tries to create derivative
works of the code they've been handed?)

~~~
erez
This will not work. The whole "prototype" idea assumes that once you grow out
of the "prototype" phase you have the time, money, manpower, etc. to rewrite
the whole thing based on solid, powerful technology and tools. That is, more
often than not, not the case.

The first problem is that every tool has demands, especially the limited ones,
and you end up writing your application around those limits and demands, using
platform-specific code that will have to be discarded and re-written come the
migration.

The second problem is that these tools dictate design, and once you try
migrating, you still have an application designed around the prototype tools,
which make a lot of concessions and have design flaws because of that.

Finally, I've never understood the need for learning a specific tool, platform
or language for "rapid prototyping". Use the tools you will use eventually,
it's not that building something in, say, Java from scratch will take an order
of magnitude more time and effort than building it on Node.js, despite all the
hype, especially if you're a Java shop.

~~~
derefr
> it's not that building something in, say, Java from scratch will take an
> order of magnitude more time and effort than building it on Node.js, despite
> all the hype, especially if you're a Java shop.

I think we're picturing different things here. You're picturing having
software engineers make the prototype, and then having the same engineers do
the final implementation. Meanwhile, I'm picturing two different teams, with
different competencies—one who knows a prototyping toolchain backward and
forward and is extremely productive in it, and the other who knows a
solidly-architected platform just as well.

The classical pipeline in the animation industry is to have two separate
"teams" of artists. One team does _concept illustration_ and _storyboarding_,
and the other does _keyframe animation_ and _in-betweening_. The first of the
two teams is essentially a team of prototypers. Their output is a product
which stands on its own for internal evaluation purposes—but which isn't
commercially viable "in production." (Nobody really wants to watch 1FPS
sketches.) So, after the storyboarding is complete, the whole product is
redone by the actual _animators_ into the more familiar product of 24FPS
tweened vector-lines or CGI model-joint movements.

The more familiar case of this for web development is where the "prototype" is
a PSD file. Professional capital-D Designers are usually Photoshop
experts—they're very productive in it, and can mock up something that can be
evaluated for being "what the customer wants" quickly, with rapid iteration if
it's not right. Once they've got the customer's sign-off, their output
product—their _prototype_—can be tossed over to development staff to "make it
work." (There are also an increasing number of interaction-design prototyping
apps targeting the same set of designers, under the theory that they'll be
able to become productive in quickly iterating the "feeling" of an app with a
customer in the way they're already doing with the "look" of the app. I
haven't met a designer that uses one of these professionally, but I think
that's mostly because there aren't any of these yet well-known enough to be
taught in art schools.)

But when it comes to _workflow_ and _use-case_ design, we don't really see the
equivalent pipeline. Looking through the lens of separated "prototyper" and
"engineer" roles, there are clearly tons of software-development tools that
were _intended_ to be used purely by "prototypers": Rails' view scaffolding,
for example. But since this role _isn't_ separate, these things get used _by
engineers_, and sneered at, since, as you said, it's no more effort—when
you're already an engineer—to just engineer the thing right from the
beginning.

Interestingly, all of the true examples of workflow prototyping I can think of
come from the specific domain of game development—but even there, nobody seems
to realize that prototyping is the goal of these tools, and tries to misuse
them as "production" tools. RPG Maker, seen as a tool for making a commercial
RPG, is total crap. RPG Maker, seen as a tool for _prototyping_ an RPG, is an
_excellent_ tool. Its output is effectively a _sketch_, a _cartoon_ in the
classical sense:

> The concept [of a cartoon] originated in the Middle Ages and first described
> a preparatory drawing for a piece of art, such as a painting, fresco,
> tapestry, or stained glass window.

A cartoon is a prototype used to communicate intent. Yes, you (as the producer
of the finished piece) can cartoon together with a client to iterate on a
proposal. But much more interestingly, a client can learn to cartoon on their
own—and then, in place of a long design document, they can submit their
cartoon to you. An RPG Maker game project is the best possible thing I could
hope to receive as a design proposal from a client asking for me to make an
RPG. It forces all the same _decisions_ to be made that making the actual
commercial game does—and thus embeds the answers to those decisions in the
product—but it doesn't require the same skillset to create that the commercial
game does, so the client can do it themselves. The prototyping tool, here, is
doing the "iterating on a design together" job of the designer for them.

We do have one common prototyping tool in the software world—Excel. A complex
Excel spreadsheet is a cartoon of a business process, that nearly anyone can
make. We as engineers might hate them, because people generally have no sense
of project organization when making them—but every project to convert an Excel
"app" will take far less time than one that involves collecting the business
requirements yourself. The decisions have already been made, and codified,
into the spreadsheet. You don't have to sit there forcing the client to make
them. The process of cartooning has forced them to do it themselves.

---

To summarize: software prototyping tools aren't _for_ engineers—if you have an
engineer's mindset, you'll prototype at the speed of sound engineering
practice, so prototype tools won't be any _help_ to you; and you'll be more
familiar with the production-quality tools anyway, so you'll be _more_
productive in those than with the prototyping toolset.

But software prototyping tools definitely have uses: they can help designers
to iterate on a "functional mock-up" to capture a client's intent; or they can
even help clients to create those same mock-ups on their own. This is why
"prototypeware" makes sense as software—but also why it should be
self-limiting from being used in production. The prototype app wasn't created
by someone with an engineering mindset—so there's no way it could end up
well-engineered. Its purpose is to serve as a cartoon, a communication to an
engineer; not to function in production on its own.

(Mind you, prototypeware _could_ be made to function as an MVP in closed-alpha
test scenarios, in the same way that the MVPs of many startups are actually
backed by manual human action in their early stages. The point there is to
test the correctness of the _codified business process_, rather than to
support a production workload.)

------
lossolo
I've just migrated one project from mongo to postgresql and I advise you to
do the same. It was my mistake to use mongo; I found a memory leak in cursors
on the first day I used the db, which I reported and they fixed. That was
2015. If you have a lot of relations in your data don't use mongo, it's just
hype. You will end up with collections without relations and then do joins in
your code instead of having the db do it for you.
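The "joins in your code" anti-pattern looks roughly like this (toy in-memory lists standing in for collections; in SQL this whole thing is one JOIN ... GROUP BY):

```python
# Two "collections" with a relation the document store won't join for us.
users = [{"_id": 1, "name": "ada"}, {"_id": 2, "name": "bob"}]
orders = [{"user_id": 1, "total": 30},
          {"user_id": 1, "total": 12},
          {"user_id": 2, "total": 5}]

# The hand-rolled join: index one side, then loop over the other.
name_by_id = {u["_id"]: u["name"] for u in users}
totals = {}
for o in orders:
    name = name_by_id[o["user_id"]]
    totals[name] = totals.get(name, 0) + o["total"]

print(totals)  # {'ada': 42, 'bob': 5}
```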

~~~
Joeboy
> don't use mongo, it's just hype

I'm kind of curious as to where this hype is. I've almost never heard anybody
say anything positive about mongodb. All I ever see is people saying it's
terrible / hilarious for various reasons.

~~~
throwaway420
Like with any online community, Hacker News can be kind of an echo chamber
where groupthink reigns and alternative points of view aren't encouraged.
MongoDB hype has died down here, but there are still some people that are
fans.

There are some things MongoDB does fairly well:

* MongoDB is really easy to use

* Document databases can be great and flexible solutions for some kinds of projects

* Documentation is fairly good so learning the basics isn't too hard even if you know nothing about it

* scales fairly well at the initial stages

* arguably quicker to get a project off the ground with than traditional RDBMSs, which might be the most important consideration for any startup even if a complete rewrite would eventually need to take place

That being said, I've used MongoDB significantly before and it wouldn't be my
first choice for most types of new project: PostgreSQL probably would be.
~~~
chaostheory
About the only thing I agree with is how great their docs are.

* Mongo is only easy to learn. Beyond simple demos, it gets harder and harder to use as projects evolve, i.e. you have to do a lot of the work yourself. IMO this is a common problem with NoSQL datastores that isn't exclusive to Mongo

* "Document databases can be great and flexible solutions for some kinds of projects": Postgresql has been able to work directly with JSON for some time now. There are also other document datastores that are more reliable than Mongo

* Scaling with Mongo is difficult, specifically the crazy setup. Even if you set it up properly, the results don't tend to match the marketing [https://aphyr.com/posts/322-jepsen-mongodb-stale-reads](https://aphyr.com/posts/322-jepsen-mongodb-stale-reads)

* "arguably quicker to get a project off the ground with than traditional RDBMSs": unless you're using Meteor, I'm also going to disagree here. Most frameworks target a relational database by default. Developing by convention tends to get you off the ground much faster than using something more specialized and niche

~~~
rdtsc
But they do make great mugs. Who here doesn't have at least a couple of
MongoDB mugs? I don't use MongoDB and still have a bunch from random
conferences over the last 3-4 years.

------
hardwaresofton
If you're currently using MongoDB in your stack and are finding yourselves
outgrowing it or worried that an issue like this might pop up, you owe it to
yourself to check out RethinkDB:

[https://rethinkdb.com/](https://rethinkdb.com/)

It's quite possibly the best document store out right now. Many others in this
thread have said good things about it, but give it a try and you'll see.

Here's a technical comparison of RethinkDB and Mongo:
[https://rethinkdb.com/docs/comparison-tables/](https://rethinkdb.com/docs/comparison-tables/)

Here's the aphyr review of RethinkDB (based on 2.2.3):
[https://aphyr.com/posts/330-jepsen-rethinkdb-2-2-3-reconfiguration](https://aphyr.com/posts/330-jepsen-rethinkdb-2-2-3-reconfiguration)

~~~
brightball
How does it compare to Couchbase? That seems to be lighting the world on fire
in that space lately.

~~~
hardwaresofton
I'm not sure if lack of overbearing marketing speak counts for something, but
RethinkDB definitely has that going for it.

I'm not an expert on couchbase (and neither on RethinkDB, to be frank, though
I am a huge fan), but here's what RethinkDB has going for it:

- Changefeeds - easily open a persistent connection to the server and get
updates when the results of an almost arbitrary query change.

- Joins

- Expressive query language that is pretty functionally minded, really shines
in their clojure/haskell drivers

- Excellent client libraries, well maintained

- Geospatial queries/objects

- Amazing admin interface (it has been amazing for a long time, too, not a
recent change)

- First class consideration of replication & sharding (it is not a bolt-on in
any way, shape or form)

- API-driven cluster configuration

- API-driven permissions management (this is relatively new)

- Excellent, easy to follow documentation

There are more things, but this is just what I can think of off the top of my
head.

The team at RethinkDB is also just great -- I've met them in person and gotten
help from them and they're straight shooters.

They've also got this great project coming up called Horizon:
[https://www.youtube.com/watch?v=Sb1lH5mvYmU](https://www.youtube.com/watch?v=Sb1lH5mvYmU)

Also they have a video up with a member of the team building a realtime game
with React Native:
[https://www.youtube.com/watch?v=xRK0SYSgVF0](https://www.youtube.com/watch?v=xRK0SYSgVF0)

Maybe someone who is very familiar with couchbase can help make a list... I'll
start it off:

- Custom query language

- First class consideration for scale -- replication and sharding

~~~
tomgreen000
I'm ex-Couchbase, so I can probably give a reasonably informed but independent
view on this.

Firstly, regarding the marketing, it may not have been to many people's tastes
- but it definitely worked, and achieved a lot of what was set out in terms of
raising the awareness of what was a decent product that wasn't as well known
as its competitors. There may be cases where people avoid it because they
don't like the marketing, but the reality, having seen its effect, is they are
in the minority, and would probably serve themselves better by assessing
products based on technology rather than spiel.

Now, on the actual technology!

What Couchbase has historically been good at is highly scalable key-value
access, at very high performance and low latency. Performance is comparable to
Redis, but CB has much more mature sharding, clustering and HA: e.g. fully
online growing/shrinking of the cluster, protection from node failures,
rack/zone failures and data center failures. Redis may be a good fit for
single-machine caching situations, and also has its own advantages in terms of
its data structures support, etc.

Quality of SDKs is pretty subjective, but I'd say the 2.x rewrite of the
Couchbase SDKs makes them very solid. The Java SDK in particular is extremely
good, both in performance and in providing native RxJava interfaces.

In terms of query interface, there's geospatial and a new freetext capability
on the way.

Couchbase chose to go down the route of a SQL-based interface as their main
query language. This seems to be a bit love/hate with developers, with some
delighted and some perplexed. Maybe for devs it should really be about
higher-level interfaces anyway, since things like Spring are increasingly
important?

The native interface being SQL based is usually very popular with the BI /
Reporting side of things.

Changefeeds (continuous query?) are a feature not in Couchbase which I would
very much like to see in the future. One thing I would say is that it's
something you have to be very careful in the design of to ensure scalability
and performance. Consistency is something which would obviously need thought
as well.

------
lath
A lot of MongoDB bashing on HN here. We use it and I love it. Of course we
have a dataset suited perfectly for Mongo - large documents with little
relational data. We paid $0 and quickly and easily configured a 3 node HA
cluster that is easy to maintain and performs great.

Remember, not all software needs to scale to millions of users so something
affordable and easy to install, use, and maintain makes a lot of sense. Long
story short, use the best tool for the job.

~~~
ahi
This has also been my experience. Millions of large documents on a single
(beefy) node with a single user, and it's been fine. Although, the sysadmins
had previously left me with flat-file XML on shared storage, so the bar was
pretty low.

------
danbmil99
Oh, the fud of it.

The behavior is well documented here
[https://jira.mongodb.org/browse/SERVER-14766](https://jira.mongodb.org/browse/SERVER-14766)

and in the linked issues. Seasoned users of mongodb know to structure their
queries to avoid depending on a cursor if the collection may be concurrently
updated by another process.

The usual pattern is to re-query the db in cases where your cursor may have
gone stale. This tends to be habit due to the 10-minute cursor timeout
default.
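Roughly, the re-query pattern amounts to paging by _id with a fresh query per batch instead of holding one long-lived cursor. A sketch (a plain list stands in for the collection; with pymongo each batch would be something like coll.find({"_id": {"$gt": last_id}}).sort("_id").limit(n)):

```python
# A plain list stands in for the collection; documents have unique _ids.
collection = [{"_id": i, "v": i * i} for i in range(10)]

def find_gt(last_id, limit):
    """One fresh 'query' per call: everything after last_id, in _id order."""
    batch = sorted((d for d in collection if d["_id"] > last_id),
                   key=lambda d: d["_id"])
    return batch[:limit]

# Walk the whole collection in batches; a dropped or stale cursor
# never matters because each batch is a brand-new query.
seen, last_id = [], -1
while True:
    batch = find_gt(last_id, 3)
    if not batch:
        break
    seen.extend(d["_id"] for d in batch)
    last_id = batch[-1]["_id"]

print(seen)  # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
```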

MongoDB may not be perfect, but like any tool, if you know its limitations it
can be extremely useful, and it certainly is way more approachable for
programmers who do not have the luxury of learning all the voodoo and lore
that surrounds SQL-based relational DBs.

Look for some rational discussion at the bottom of this mongo hatefest!

~~~
lars_francke
I wouldn't call a JIRA ticket good documentation.

While I agree that it's good to know the limitations of the tools you chose
those limitations should be clearly spelled out in the documentation.

I don't think most programmers have the luxury of learning all the voodoo and
lore that surrounds MongoDB from JIRA tickets and blog posts.

~~~
danbmil99
> I don't think most programmers have the luxury of learning all the voodoo
> and lore that surrounds MongoDB from JIRA tickets and blog posts.

That's how I learned everything I know about most FOSS products I have
encountered - through the code pages and social media surrounding the project.

Pretty much everything about the mongodb hate derives from their marketing and
sales. The truth is, they've obviously stumbled onto something the market
wants, otherwise they would never have become so successful.

For me, as a long-time programmer with no database experience, the mental
mapping of JSON constructs as both data and query language was far easier for
me to absorb than the relational model, which didn't fit the paradigms that I
was used to.

At my present gig, we've used Mongo DB for two years, scaling up to quite a
large production setup. Like any technology it has strengths and weaknesses,
but it has not been the utter failure that readers of Hacker News would be led
to expect. We adopted it knowing quite a bit about its history, and it has
turned out to be an excellent choice that has held up over time.

Periodically we've considered switching to postgres, and we may do so for part
of our stack. But for the core jobs of data collection and batch processing
data with fluid schema, I'm pretty sure we will stick with mongodb for the
duration.

It's just a tool, folks.

------
ahachete
Strongly biased comment here, but I hope it's useful.

Have you tried ToroDB
([https://github.com/torodb/torodb](https://github.com/torodb/torodb))? It
still has a lot of room for improvement, but it basically gives you what
MongoDB does (even the same API at the wire level) while transforming data
into a relational form. Completely automatically, no need to design the
schema. It uses Postgres, but it is far better than JSONB alone, as it maps
data to relational tables and offers a MongoDB-compatible API.

Needless to say, queries and cursors run under REPEATABLE READ isolation mode,
which means that the problem stated by OP will never happen here. Problem
solved.

Please give it a try and contribute to its development, even just with
providing feedback.

P.S. ToroDB developer here :)

~~~
nimrody
How does ToroDB handle sharding across multiple instances?

~~~
ahachete
Right now ToroDB handles sharding at the backend (RDBMS) level, with those
dbs that support it. There's currently a Greenplum-based backend in the works,
which obviously handles sharding by itself. Also CitusDB is on the roadmap.

At a later release, we also plan to natively support MongoDB's sharding
protocol.

------
cachemiss
My general feeling is that MongoDb was designed by people who hadn't designed
a database before, and marketed to people who didn't know how to use one.

Its marketing was pretty silly about all the various things it would do, when
it didn't even have a reliable storage engine.

Its defaults at launch would consider a write stored when it was buffered for
send on the client, which is nuts. There's lots of ways to solve the problems
that people use MongoDB for, without all of the issues it brings.

~~~
zamalek
I really agree with your sentiments; that first paragraph is a great quote. I
grew quite averse to MongoDB after researching it. While I never found this
specific caveat, I found other very worrying decisions.

> reliable storage engine

By "reliable" I assume you mean "consistent?" While MongoDB claims that it's
CP (which it's not, as per the article) there's nothing wrong with
inconsistent databases (AP, e.g. CouchDB). Mathematically there is no reason
for MongoDB to behave like this. It's fundamentally broken; it's neither AP
nor CP.

~~~
cachemiss
I actually mean reliable. It's probably different now, but at launch, the
defaults were fsync'ing every 30 seconds or so. It would literally just apply
the change to a memory-mapped buffer and fsync it once in a while.

They did that so they could look good in benchmarks, and it's why they
recommended so strongly that your data completely fit in RAM or else things
would fall apart (pro tip: any system that recommends that has a poorly
designed storage engine).

They also screwed up the consistency side of things as well.

------
vegabook
I have moved from Mongo to Cassandra in a financial time series context, and
it's what I should have done straight from the get-go. I don't see Cassandra as
that much more difficult to setup than Mongo, certainly no harder than
Postgres IMHO, even in a cluster, and what you get leaves _everything_ else in
the dust if you can wrap your mind around its key-key-value store engine. It
brings enormous benefits to a huge class of queries that are common in
timeseries, logs, chats etc, and with it, no-single-point-of-failure
robustness, and real-deal scalability. I literally saw a 20x performance
improvement on range queries. Cannot recommend it more (and no, I have no
affiliation to Datastax).

~~~
pixelmonkey
Genuinely curious: when you say "it brings enormous benefits to a huge class
of queries that are common in timeseries", what are you referring to, exactly?

I run Cassandra in production and I love its operational simplicity, scale-out
design, and write performance. But I think its support for time series is
perhaps over-hyped. To me, it seems the only queries you can run in Cassandra
are a key lookup (partition key row get) and a column slice (partition key row
get filtered by an ordered range of columns). This allows for a certain time
series use case e.g. where each row represents exactly one series, and where
the only thing you want to do with a series is to get its raw values. But it
doesn't allow for many of the things I personally think of when I think about
"time series queries", e.g. resampling, aggregates, rollups, and the like.

~~~
vegabook
I am referring to anything that resembles a range query, ie, where you require
a bunch of contiguous information queried on a single key. Think "give me all
of this person's chat entries from x time to y time", or indeed "give me all
this topic's comment entries from x time to y time" (but not both - only one
of the above would be efficiently stored - you decide which it would be).

Cassandra, as you know, forces a certain amount of "low level awareness"
requirement on the programmer because to tap into its uniqueness, you need to
know how you will query stuff, so that Cassandra will ensure that the most
common range queries are contiguously stored in rows. All other databases hide
the on-disk storage order from you in an abstraction, and you can find
atomisation causing inefficiency. Cassandra forces you to think about it, and
in return, guarantees contiguous storage order on disk along one of your keys
so that along that key, retrieval is lightning fast as it requires only one
pass.
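A toy model of that key-key-value layout, if it helps: one partition key choosing the row, one clustering key ordering it, and a range read becomes a single contiguous slice (pure-Python stand-in, not actual Cassandra internals):

```python
import bisect
from collections import defaultdict

# partition key -> rows kept sorted by clustering key (an int timestamp),
# mimicking how Cassandra keeps a partition contiguous on disk.
table = defaultdict(list)

def insert(user, ts, msg):
    bisect.insort(table[user], (ts, msg))

def range_query(user, t0, t1):
    """All rows for `user` with t0 <= ts <= t1: one contiguous slice."""
    rows = table[user]
    lo = bisect.bisect_left(rows, (t0,))
    hi = bisect.bisect_left(rows, (t1 + 1,))
    return rows[lo:hi]  # no seeks along the clustering key

insert("ada", 3, "hi")
insert("ada", 1, "hello")
insert("ada", 7, "bye")
print(range_query("ada", 1, 3))  # [(1, 'hello'), (3, 'hi')]
```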

Basically, both spinning disks and _also_ SSDs are, in essence, 1d media (ie,
they have a lot in common with tape) in the sense that along one dimension you
can read stuff massively fast, but as soon as you need to seek (ie start using
dimension 2), even on an SSD, your performance dramatically declines.
Cassandra forces you to think about your queries so that they will be
"aligned" along the most efficient direction on disk.

Now agreed that if your queries cannot be aligned along said direction, then
Cassandra drops to being no better than all the others, and penalises you with
some complexity. That includes some examples of aggregates, resampling etc
(though I would argue that the order of magnitude contiguous read still helps
these). Some of this can be mitigated with denormalisation ie: storing stuff
more than once, in transposed or sub-sampled orders, something that relational
DB purists will hate, with some justification (potential for inconsistency).

FWIW Riak TS sounds promising, with automatic "blob" style storage etc and
resampling capabilities, which might take Cassandra on quite explicitly and in
a higher-level, more convenient way. I am about to evaluate it because I agree
with you that the resampling capability in particular could be better
supported in Cassandra, though ultimately, both databases will still be
limited by the underlying D1 v D2 "contiguous v seek" capabilities of the
storage so I'm not expecting miracles from Riak.

By the way, I'm not even touching on Cassandra's scale-out ease. More perf
needed? Literally just add boxes though it would be unfair not to comment on
the cost of this, which is Cassandra's node-level consistency tradeoffs for
very recently added data, and which is, if I recall correctly, why Facebook
went to Hbase. You can force consistency at the query level, but performance
can suffer.

~~~
_halgari
After some truly horrific experiences with Riak K/V, especially combined with
Riak Solr, I won't touch anything from Basho with a ten-foot pole. Not sure
what's going on over there, but the reality of Riak in production was miles
away from what Basho's sales claimed was possible. And yes, we even spent
about 4 months working with their tech support. It almost seems that "It's
based on Erlang thus it scales" was the entirety of their design work.

I've also worked with Cassandra and have nothing but good things to say about
it; it did what we asked of it right out of the box. Datastax was really
helpful as well.

--

And I have no affiliation with either Basho or Datastax, just really happy
with one product and completely blown away by how poor the performance of the
other was.

------
jsemrau
Weird to see that Mongo is still around. We started to use it on a project
~4 years ago. Easy install, but that's where the problems started. Overall a
terrible experience: low performance, a mess of a syntax, unreadable
documentation.

They seem to still have an outstanding marketing team.

------
paradox95
Should an infrastructure company be advertising the fact that it didn't
research the technology it chose to use to build its own infrastructure?

All these people saying Mongo is garbage are likely neckbeard sysadmins.
Unless you're hiring database admins and sysadmins, Postgres (unless managed -
then you have a different set of scaling problems) or any other traditional
SQL store is not a viable alternative. This author uses Bigtable as a point of
comparison. Stay tuned for his next blog post comparing IIS to Cloudflare.

Almost every blog post titled "why we're moving from Mongo to X" or "Top 10
reason to avoid Mongo" could have been prevented with a little bit of
research. People have spent their entire lives working in the SQL world, so
throw something new at them and they reject it like the plague. Postgres is
only good now because they had to add some of these features in order to
compete with Mongo. Postgres has been around since 1996 and you're only now
using it? Tell me more about how awesome it is.

~~~
glasser
My goal in writing this post was not to convince people to use or not use
MongoDB, but to document an edge case that may affect people who happen to use
it for whatever reason, which as far as I could tell was inadequately
documented elsewhere.

~~~
paradox95
Only the first line was directed at you - and it was more in jest. Everything
else was directed more at the other commenters and Mongo detractors in
general.

------
ruw1090
While I love to hate on MongoDB as much as the next guy, this behavior is
consistent with read-committed isolation. You'd have to be using Serializable
isolation in an RDBMS to avoid this anomaly.

~~~
teraflop
I think this is incorrect, but it's not as simple as the other replies are
making it out to be.

Under read-committed isolation, _within a single operation_, you must not be
able to see inconsistent data. So if you do "SELECT *" on a table while
rows are being updated, you're guaranteed to always see either the old value
or the new value. But if you do two separate statements, "SELECT * WHERE
value='new'" and "SELECT * WHERE value='old'", in the same transaction,
you may not see the row because its value could have changed. Serializable
isolation prevents this case, typically by holding locks until the transaction
commits.

It gets messy because the ANSI SQL isolation levels are of course defined in
terms of SQL statements, which don't map perfectly to the operations that a
MongoDB client can do. Mongo apparently treats an "index scan" as a sequence
of many individual operations, not as a single read. So you could argue that
it _technically_ obeys read-committed isolation, but it definitely violates
the spirit.
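
Concretely, the two-statement case can be sketched with a toy model in plain
Python (the table, row, and values here are invented for illustration; the
dict stands in for the committed data that each statement reads):

```python
# Toy model of the read-committed anomaly: each statement sees only
# committed data, but a concurrent transaction can commit *between*
# two statements of the same transaction.
committed = {"row1": "old"}  # one table, one committed row

def select_where(value):
    """One statement: reads a consistent view of committed data."""
    return [row for row, v in committed.items() if v == value]

first = select_where("new")   # row1 is still 'old', so no match
committed["row1"] = "new"     # a concurrent transaction commits here
second = select_where("old")  # row1 is now 'new', so no match either

# The row was committed the whole time, yet neither statement saw it.
# Serializable isolation would forbid this interleaving.
print(first, second)  # [] []
```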

------
twunde
The real problem with Mongo is that it's so enjoyable to start a project with
that it's easy to look for ways to keep using it even when Mongo's problems
start surfacing. I'll never forget how many problems my team ended up facing
with Mongo: missing inserts, slow queries with only a few hundred records,
document size limits. All while Mongo was being paraded as web scale in talks.

------
wzy
Does Meteor support a proper database system yet, à la MySQL or Postgres?

~~~
dan_ahmadi
Yes - with Apollo/GraphQL (currently available as a technical preview):
[http://docs.apollostack.com/apollo-
client/meteor.html](http://docs.apollostack.com/apollo-client/meteor.html)

I recommend you check out the Apollo Meteor Starter Kit:
[https://github.com/apollostack/meteor-starter-
kit](https://github.com/apollostack/meteor-starter-kit)

~~~
wzy
Notice how I referenced two proper RDBMSs in my question? And how you
proceeded to introduce another flavour-of-the-month?

~~~
sotojuan
I haven't looked at Apollo, but GP should've explained that GraphQL is not a
database and can be hooked up to any backend, so with it I'm guessing you can
use any kind of database in Apollo/Meteor apps. Still, kind of weird.

Another reason why I never bothered with Meteor.

~~~
wzy
I remember when Meteor was the JavaScript flavour-of-the-month and everyone
was saying it will kill Rails. I wanted to believe so I looked into Meteor,
then I saw its dependence on MongoDB... Nope!

~~~
sotojuan
If you study the backend JavaScript ecosystem you'll see no Rails-like all-in-
one framework has ever succeeded. They're just not part of the culture. The
only Rails-like thing in the JS ecosystem that's popular and great is Ember
but that's only frontend.

------
aavotins
MongoDB reminds me of an old saying that if you have a problem and you use a
regex to solve it, you end up with two problems.

I have personally used MongoDB in production twice, for fairly busy and
heavily loaded projects, and both times I ended up being the person who
encouraged migrating away from MongoDB to a SQL-based storage solution. Even
at my current job there's still evidence that MongoDB was used for our
product before it eventually got migrated to PostgreSQL.

Most of the time I've thought that I chose the wrong tool for the right job,
which may be true, but it still leaves a lot of thought about the correct
application. Right now I have MongoDB anxiety - as soon as I start thinking
about maybe using it (with an emphasis on maybe), I remember all the troubles
I went through and just forget it.

It is certainly not a bad product, but it's a niche product in my opinion.
Maybe I just haven't found the niche.

~~~
MoOmer
I literally brought up the regex joke in a meeting yesterday. A data warehouse
was built on top of Mongo, and I get to help clean up the mess.

------
jtchang
This single issue alone would make me not want to use MongoDB. I'm sure there
are design considerations behind it, but I'd rather use something that has
sane semantics around these edge cases.

------
Animats
Not when they're changing rapidly, anyway. Well, that's relaxed consistency
for you.

Does this guy have so many containers running that the status info can't be
kept in RAM? I have a status table in MySQL that's kept by the MEMORY engine;
it's thus in RAM. It doesn't have to survive reboots.

------
fiatjaf
CouchDB is simple and reliable. You can understand it from day one. I can't
imagine why it isn't more widely used.

~~~
skeoh
I really want an excuse to build something with CouchDB and PouchDB
([https://pouchdb.com/](https://pouchdb.com/)). Can you expand on your
experiences with it?

~~~
mikekchar
I'm not the original poster, but I can give you some of my limited experience
with CouchDB from an application I inherited. The original idea for the
project still seems like a good idea to me. Basically they wanted to record
events that came into the system and store them in a write only ledger. Then
they wanted to version every change so that you have an audit trail. Finally
they wanted to be able to create views of that ledger to create the kind of
data that they would work with on a day to day basis. For this, CouchDB seems
like a perfect fit.

Unfortunately, it didn't work out as well as one might hope because the people
who implemented the idea didn't seem to be able to resist using the DB the way
they would use a relational db. Instead of maintaining the concept of a write
only ledger, they started to use it as a data store for things that were
ephemeral. Also, instead of replicating the db and using a view to create a
new db optimized for certain queries, they wrote a huge number of views in
the main db. Finally, they organised the views by relation rather than by
use, so you would have 60-80 views in the same design document, all of which
would have to be reindexed if one of them changed.

The result was something with very poor performance and where the storage for
the indexes was more than an order of magnitude more than the storage for the
documents themselves.

CouchDB is also not super speedy at the best of times. There is a lot of
latency involved in serializing the documents and farming them out to view
servers, etc. So it takes a good 10 minutes to process a million documents,
but you will find that your CPU is chugging along at 30-40% utilisation.

Having said all that, one of the things I want to try (but have only done some
preliminary trials with) is to keep the concept of the write only ledger, but
to replicate the db into several views of the data (some with severely
restricted content). Then instead of building something like a rails
application to farm out the data, make "couch applications" where you serve
the HTML and JS directly from attachments on documents in the DB. In fact,
I've written a React application to allow users to interact with portions of
the data and it was quite simple. Then you can write a really small
coordinating application to allow users to navigate to the parts of the system
(really single page apps) that they want to use.

Again, the nice thing about this is that you have a write only data store with
versioning and the ability to audit history. You have views that allow you to
interact with a small subset of the overall data. You can easily write single
page applications where deployment is as easy as pushing a document to the DB.
Replication is relatively cheap and you can move expensive view creation to
restricted versions of the DB. You can stick the whole thing behind a load
balancer and scale it as cheaply as setting up a new replication (again just
another document in your DB).

But, I will warn you. Don't use it like you would a relational DB, or else you
will be in for a world of hurt. Especially you will see comments in this
thread about migrations. If you are migrating your data, by definition you do
not have a write-only-with-versioning application. Your application will have
to deal with multiple versions of data or else you will not have the ability
to audit history. If you do not care about this, then possibly there are
better solutions than this.

~~~
fiatjaf

        one of the things I want to try (but have only done some
        preliminary trials with) is to keep the concept of the 
        write only ledger, but to replicate the db into several 
        views of the data (some with severely restricted content). 
    

How is that different from the current concept of CouchDB views? You meant to
replicate the DB to various different places and use the same CouchDB views
from there? Or you meant something like a replication-view, in which some
calculations are done with the documents in the source database and the target
database receives the result of those calculations as their primary documents?

~~~
mikekchar
Yes, the latter. One of the main problems I've seen is that indexing things
that you will never query is both expensive in time and space. Also it's
amazing how many views tend to have exactly the same data, only sorted
differently. And the reason to sort it differently is because you only want to
work on a subset of the data, but you can only restrict the query in a
contiguous section of keys.

An example of this might be that you have a large number of daily reports.
They all need different aspects of the data, but you end up writing views that
sort by date and then collate the result in the server. So you end up
maintaining an index for data that you will never query again and you are
doing lots of extra processing merging the data after the query. Much better
to replicate one day's worth of data to a new db every evening (possibly
setting up a continuous replication to keep it up to date) and then add views
on _that_ db to do what you want. Like I said a full replication of a million
documents takes about 10 minutes, so it's a reasonable thing to do.
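
For what it's worth, CouchDB's filtered replication can express that nightly
copy - here is a sketch of a replication document one might POST to the
`_replicate` endpoint, where the database names, the `reports/one_day` filter
function, and the `day` parameter are all hypothetical:

```json
{
  "source": "events",
  "target": "events_2016_10_21",
  "create_target": true,
  "filter": "reports/one_day",
  "query_params": { "day": "2016-10-21" }
}
```

The filter would be an ordinary JavaScript function stored under `filters` in
the `reports` design document, and adding `"continuous": true` would keep the
copy up to date, as described above.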

~~~
fiatjaf
I like the concept of views. I think it is the best part of CouchDB: saving
data and only later defining views (which should be very fast). Once I wrote
an app that stored massively enormous documents with lots of data, and the
views were used to turn that data into queryable information later.

I like what you suggested very much, because what CouchDB can currently do
with views is very limited compared to what a powerful views implementation
could do, and yours is a good suggestion for how to do it better.

Thank you. I'll give this a lot of thought when I resume writing my
[https://github.com/fiatjaf/summadb/](https://github.com/fiatjaf/summadb/)

------
avital
I believe this is solved by Mongo's "snapshot" method on cursors:
[https://docs.mongodb.com/v3.0/faq/developers/#faq-
developers...](https://docs.mongodb.com/v3.0/faq/developers/#faq-developers-
isolate-cursors)

~~~
glasser
If I understand correctly, this method says "only scan the built-in _id
index, not any other index". That means you will not hit this index-specific
bad behavior, but also that you won't get the performance characteristics of
using an index.

------
rjurney
Mongo is hilarious. Ease of use is so important that we just don't give much
of a shit that it has all these gaping holes and flaws in it.

------
shruubi
Seriously, who looks at MongoDB and thinks "this is a sane way of doing
things"?

To be fair, I've never been much of a fan of the whole NoSQL solution, so I
may be biased, but what real benefits do you gain from using NoSQL over
anything else?

~~~
cortesoft
Web scale!

------
d3ckard
I worked with MongoDB quite a lot in the context of Rails applications. While
it has performance issues and can generally become a pain because of its lack
of relational features, it also allows for really fast prototyping (and I
believe that Mongoid is much nicer to work with than ActiveRecord).

When you're developing MVPs and working with ever-changing designs and
features, the ability to cut out the whole migration step comes in really
handy. I would, however, recommend that anybody keep a migration plan for the
moment the product stabilizes. If you don't, you end up in a world of pain.

------
hendzen
Actually, if this lack of index update isolation is correct, you can get a
matching document zero, one, or _multiple times_!

------
doubleorseven
Mongo, in one word: sucks. Couchbase, does not.

~~~
bioinformatics
I use RethinkDB in most of my production things. I recommend it.

~~~
ubercore
Agreed, I've had nothing but positive experiences with it.

~~~
bioinformatics
I started with Mongo too, had some performance issues, and started using
RethinkDB when they first released it. It gets better with every update.

------
spullara
It literally returns wrong answers for queries. I can't believe anyone in
this thread is defending it.

------
jitix
What storage engine are you using? I wonder if the same issue occurs in the
WiredTiger MVCC engine.

~~~
lossolo
He wrote in the comments that this issue occurs with both mmap and WiredTiger.

~~~
tinix
There are like 5+ storage engines available for Mongo - probably more; those
are just the ones I'm aware of - plus various forks, like TokuMX and Percona,
etc. This is all FUD.

------
alkonaut
So it's a bit weak in the design department, offers somewhat less rigid
semantics than one might hope, and from the start it was a technology that
was almost a reaction to the rigid, enterprise-y stuff of old.

Mongo reminds me a wee bit of JS...

------
xchaotic
Unless you want to code every RDBMS and enterprise feature in the application
layer, don't use Mongo; use Postgres or MarkLogic. The latter is 'NoSQL', but
it is ACID compliant and uses MVCC, so what queries return is predictable.

------
Osiris
I hear a lot about MongoDB's reliability issues. How do CouchDB or other
document store databases compare in terms of reliability and consistency?

~~~
rdtsc
CouchDB is rock solid. Used it for 5 years now. Never got corrupted data. Has
master-to-master replications. Really shines in sometimes offline operation
mode (with re-sync on reconnect).

I use that extensively to build custom replication cluster topologies
(overlapping rings, star, hierarchy), etc.

Has HTTP interface so easy to build clients for.

Transactions are per document only, so you have to design your application to
accommodate that. Raw single-document write speed is not as fast as Mongo or
Postgres. But I noticed that in a concurrent environment, with multiple
connections writing, it scaled pretty well.

Moreover, CouchDB 2.0 will have built-in clustering from code donated by
Cloudant. And it will also have a query language similar to MongoDB's
(instead of having to use JavaScript / Python / other map-reduce functions).

------
clentaminator
An interesting read into the development of a project that started using
MongoDB and switched to PostgreSQL after eight months in production:
[http://www.sarahmei.com/blog/2013/11/11/why-you-should-
never...](http://www.sarahmei.com/blog/2013/11/11/why-you-should-never-use-
mongodb/)

------
partycoder
This use-case is not something that you would use MongoDB for. Try Zookeeper.

This being said, I would feel embarrassed to post this on behalf of the
engineering department of a company.

This post is just a very illustrated way of saying "we have no idea about what
we are doing and our services are completely unreliable".

This is so bad that it is more of an HR problem than an engineering problem.

~~~
teraflop
Did you miss the part about how they're running a hosting platform that stores
details about the status of containers for all their customers?

Zookeeper is fine for things like service discovery that deal with a bounded
amount of data. You _don't_ want to use it for something where the amount of
data depends on, say, how many containers your customers decide to start.
Every ZK server keeps all of its data on the Java heap, so if your data gets
too big, _pow_. How big is too big? Don't worry, you'll find out the hard way
sooner or later!

Plus, there's no sharding -- every write operation has to be acknowledged by a
majority of nodes in your cluster. So for write-heavy workloads (which is what
I would expect a service status dashboard to experience) your cluster actually
gets _slower_ if you try to add more machines.

~~~
partycoder
Zookeeper slows down when you add nodes since the quorum/consensus set is
larger. You can mitigate some of this with non-voting nodes (observer nodes),
but only up to a certain extent. So yes, a single Zookeeper cluster won't
scale horizontally.

But that doesn't limit the amount of independent clusters you can have.

The reason I suggested Zookeeper is because it offers you ephemeral nodes,
which is convenient to mark stuff as unavailable.

------
tinix
Y'all know other storage engines exist, right?

I searched the comments for "percona" and found nothing...

Figures.

Meanwhile, [https://github.com/percona/percona-server-
mongodb/pull/17](https://github.com/percona/percona-server-mongodb/pull/17)

------
bbcbasic
Ahhh, the Trough of Disillusionment!

[1] [https://setandbma.wordpress.com/2012/05/28/technology-
adopti...](https://setandbma.wordpress.com/2012/05/28/technology-adoption-
shift/)

------
xenadu02
Use of MongoDB at PlanGrid is probably the single worst technical decision the
company ever made.

We've migrated our largest collections to Postgres tables and our happiness
with that decision increases by the day.

------
vs2370
I am pretty excited about CockroachDB. It's still in beta, so not suggested
for production use yet, but it's being designed pretty carefully and by a
great team. Check them out: cockroachlabs.com

------
mouzogu
Is MongoDB really that bad?

I am someone just getting into Meteor, and it seems like moving away from
MongoDB would make Meteor trickier to learn.

Is it difficult to switch to an alternative? Thanks

~~~
nevi-me
It's not; go ahead and use it, learn and gain experience. It's not a
replacement for SQL databases. It doesn't have joins, and the biggest issue
academics and sysadmins have is that it's not fully ACID compliant - so no
transactions, for example.

If I was writing this 2 years ago, I would say horizontal scaling is much
easier: add a node to your replica set, watch it catch up, and continue.

Have data stored in an array 4 levels deep? Mongo will find it for you. It's
only difficult to switch to an alternative to the extent that you've
convoluted your schema in an unfriendly way. Most migration entails
normalising your data into different SQL tables and exporting it. Not the
rocket science people make it out to be.

I use SQL at work - Oracle, SAS, a bit of MySQL, and sometimes Postgres; I'm
a consultant. I have tried some NoSQL DBs but always come back to Mongo for
personal projects.

I've done a few prototypes for clients using Mongo, but those are almost
always for geospatial support.

~~~
mouzogu
Thanks

------
wvenable
I wonder how much data they are storing, and in what pattern, that they
actually need a NoSQL database. I'm curious why someone would make that
choice.

------
acarrera
If you were inserting changes to the status, you'd have much better data and
never run into such issues.

------
geoPointInSpace
I'm prototyping in meteor using MongoDB and Compute Engine.

I have two VM instances in google cloud platform. One is a web app and the
other is a MongoDB instance. They are in the same network. The connection I
use is their internal IP.

Can other people eavesdrop on the traffic between my two instances?

------
apeace
TL;DR During updates, Mongo moves a record from one position in the index to
another position. It does this in-place without acquiring a lock. Thus during
a read query, the index scan can miss the record being updated, even if the
record matched the query before the update began.
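
A toy model of that sequence in Python (the "index" is just a doc-to-key map,
and the scanner remembers the last key visited, the way a B-tree range scan
does; the document names and keys are invented):

```python
import math

def index_scan(index, interleave_after, update):
    """Visit documents in index-key order; `update` mutates the index
    after `interleave_after` documents have been returned, simulating
    a concurrent write landing mid-scan with no lock held."""
    seen = []
    last_key = -math.inf
    while True:
        # the scanner only asks for keys strictly after the last one seen
        pending = [(k, d) for d, k in index.items() if k > last_key]
        if not pending:
            return seen
        last_key, doc = min(pending)
        seen.append(doc)
        if len(seen) == interleave_after:
            update(index)

index = {"a": 1, "b": 2, "c": 3}  # every document matches the query

def move_c_backwards(idx):
    # the update moves c's index entry behind the scan cursor
    idx["c"] = 0

result = index_scan(index, interleave_after=2, update=move_c_backwards)
print(result)  # ['a', 'b'] -- c matched before and after, but was missed
```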

~~~
nameless912
_mongo developer 1_

Man, it's just taking too damn long to run this update. What should we do?

 _mongo developer 2_

Uh....remove all the locks?

 _mongo developer 1_

Oh, yeah, that makes sense, let's do that.

 _mongo developer 3_

Hey, what are you guys up to?

 _mongo developer 2_

Making it web-scale!

 _mongo developer 3_

Good job, keep it up!

~~~
xyience
Seriously. I was looking at DB usage statistics recently and was appalled
that MongoDB is still so popular. I thought it was done - nail in the coffin
- when
[https://www.youtube.com/watch?v=b2F-DItXtZs](https://www.youtube.com/watch?v=b2F-DItXtZs)
came out 6 years ago. I haven't followed it much since then, apart from the
occasional post like this whose content is just "you thought it was bad
already? Haha, it's worse."

------
opless
But it's web scale! </sarcasm>

------
wizardhat
TLDR: He was reading the database while another process was writing to it.

Why all the Mongo hate? I'm sure this would happen with other databases.

~~~
Amezarak
No, this does not happen with any relational database I've worked with.

------
throoooowaway
But is your database web scale? MongoDB is a web scale database.

~~~
bbcbasic
You've seen THAT YouTube video then!

------
rgo
Every time I hear arguments for going back to relational databases, I
remember all the scalability problems I lived through for 15 years in
relational hell before switching to Mongo.

The thing about relational databases is that they do everything for you. You
just lay the schema out (with ancient E-R tools, maybe), load your relational
data, write the queries and indexes, and that's it.

The problem was scalability, or any tough performance situation really. That's
when you realized RDBMSs were huge lock-ins, in the sense that they would
require an enormous amount of time to figure out how to optimize queries and
db parameters so that they could do that magic outer join for you. I remember
queries that would take 10x more time to finish just by changing the order of
tables in a FROM. I recall spending days trying different Oracle hints just to
see if that would make any difference. And the SQL way, with PK constraints
and things like triggers, just made matters worse by claiming the database
was actually responsible for maintaining data consistency. SQL, with its
natural-ish language syntax, was designed so that businessmen could query the
database directly about their business, but somehow that became a programming
interface, and finally things like ORMs were invented that actually
translated code into English so that a query compiler could translate that
back into code. Insane!

Mongo, like most NoSQL, forces you to denormalize and handle data consistency
in your code, moving data logic into solid models that are tested and
versioned from day one. That's the way it's supposed to be done; it sorta
screams "take control over your data, goddammit". So yes, there's a long way
to go with Mongo or any generalist NoSQL database, but an RDBMS seems a step
backwards even if your data is purely relational.

~~~
wvenable
I've been in the opposite situation and I couldn't disagree more. But I will
say this, it's always possible to take an RDBMS model and de-normalize it and
use it like a NoSQL database (like reddit does, for example) but it's not
possible to go the other way.

~~~
lloyd-christmas
> but it's not possible to go the other way

Why not? We do exactly that. We prototype in mongo and then migrate to
postgres when we're comfortable with where the app is headed.

~~~
wvenable
I don't mean it's possible to use a different technology; I mean within the
same technology (postgres, for example) you can use it both as a normalized
relational database and/or as a de-normalized document store.

------
TimPrice
The article is interesting, but the title is FUD. Besides, none of this is
unexpected:

> How does MongoDB ensure consistency?

> Applications can optionally read from secondary replicas, where data is
> eventually consistent by default. Reads from secondaries can be useful in
> scenarios where it is acceptable for data to be slightly out of date, such
> as some reporting applications.

[https://www.mongodb.com/faq](https://www.mongodb.com/faq)

~~~
glasser
This is not related to reading from secondaries. This issue can occur in
single node systems.

~~~
TimPrice
Yes, as I said, all this is NOT unexpected from them.

