
Why MongoDB is a bad choice for storing our scraped data - reinhardt
http://blog.scrapinghub.com/2013/05/13/mongo-bad-for-scraped-data/
======
JulianMorrison
I really don't understand why people use MongoDB.

It seems like it's an elegant technological metaphor (let's use mmap, the OS is
our cache and we can overwrite in place in RAM) that in practise turns out to
be a terrible idea. Overwrite/mmap cannot be made reliable, requires blocking
write-locks, wastes disk, and causes problems shuffling data around as it
grows. Add other bad decisions (keys aren't interned, seriously?) and it's just
a terrible limping monster.

Abandon it, walk away.

~~~
aroman
Because when you're not operating at significant scale, or have certain
specific use cases, it's a fantastically elegant solution and one that's very
quick and easy to set up.

I have used mongodb for a number of smaller projects, and I have had an
excellent experience. It's not "a terrible idea in practice". It might be a
terrible fit for what _you want_ , but that doesn't mean it's bad technology.

~~~
timr
_"when you're not operating at significant scale, or have certain specific use
cases, it's a fantastically elegant solution and one that's very quick and
easy to set up."_

When you're not operating at significant scale, you can use a relational
database. They're easy and fast to set up, have nice write-safety guarantees,
are more flexible than a key-value store, and will scale well beyond anything
that mongo has ever achieved. You can even use them as a key-value store! The
downside, of course, is that you have to have a tiny bit of knowledge about set
theory, and that's a deal breaker for most "developers" today.
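
To make the "use them as a key-value store" remark concrete, here is a minimal sketch using SQLite from the Python standard library. The table name `kv` and the helpers `open_kv`, `put` and `get` are made up for illustration; nothing in the thread prescribes this layout.

```python
import sqlite3


def open_kv(path=":memory:"):
    """Open a SQLite database holding a single key-value table."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS kv (key TEXT PRIMARY KEY, value TEXT NOT NULL)"
    )
    return conn


def put(conn, key, value):
    # `with conn` commits on success and rolls back if an exception is raised.
    with conn:
        conn.execute(
            "INSERT OR REPLACE INTO kv (key, value) VALUES (?, ?)", (key, value)
        )


def get(conn, key, default=None):
    row = conn.execute("SELECT value FROM kv WHERE key = ?", (key,)).fetchone()
    return row[0] if row else default
```

Each `put` runs in its own transaction, which is the kind of write-safety guarantee the comment is referring to.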

The whole point of the GP was that Mongo _isn't_ elegant or easy...it's just
naive and short-sighted, and the architectural mistakes within it are
fundamental and probably unfixable (at least, not without killing the speed
advantages they claim). The real reason that people use mongo is that most
webapp devs don't have a very good understanding of how computers work, and
want everything to look like Javascript, because that's all they really know.

~~~
lucisferre
You lost me when you quoted the word "developers".

Edit: And this is downvoted for calling out the fact that people on HN can't
discuss a freakin' database without hurling insults.

~~~
timr
I quoted "developers", because we need a term to distinguish people who know
basic computer science from people who know just enough to install software
and piece together APIs. The latter group tends not to realize that things
like overwriting your working set in memory and global write-locking lead
inevitably to consistency and throughput issues.

The primary problem in software today is that we've confused the ability to
build something with actually knowing anything of value.

~~~
lucisferre
In the real world though, both groups still need to use what works in
practice. It's entirely possible for MongoDB to work sufficiently well for a
certain group of people in a reasonably cost effective way. Exaggerating its
problems (as bad as they are) doesn't add weight to your argument. For
example, global write-locking will not _inevitably_ lead to consistency or
throughput issues unless the write frequency is high enough to cause them.
Lots of data problems aren't "big data", they're barely a medium.
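
A back-of-the-envelope way to see when a global write lock starts to matter, assuming every write holds the lock for a fixed time and nothing else contends for it (a deliberately crude model, not a description of MongoDB internals; both function names are invented here):

```python
def lock_utilization(writes_per_second, seconds_per_write):
    """Fraction of each second during which the single write lock is held."""
    return writes_per_second * seconds_per_write


def max_write_rate(seconds_per_write):
    """Upper bound on writes per second when all writes are serialized."""
    return 1.0 / seconds_per_write
```

Under this model, a workload whose utilization stays well below 1 never queues behind the lock, which is the "barely a medium" case; one that approaches 1 is where throughput problems begin.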

I'll agree MongoDB is not a terribly well engineered database however I don't
agree that SQL is always the best alternative in these simpler scenarios.
There are lots of things wrong with using SQL to solve every data problem. I
also don't agree that knowing SQL is equivalent to understanding "set theory".
I've known plenty of DBAs who don't know the first thing about set theory and
really just know just enough to install the software and piece together the
APIs. The fact that one has chosen SQL doesn't make them good at working with
data any more than choosing MongoDB makes someone bad, or implies they don't
understand "set theory".

What concerns me mostly though isn't the MongoDB issue but that we can't
discuss the issue in a professional way, without elitism and disdain dripping
through. You seem to have confused "computer science" with "anything of
value". I happen to believe there are more things worth knowing that are of
value to practical software development than just the computer science (not to
minimize that of course).

~~~
ownagefool
SQL isn't the alternative, it's the standard and noSQL databases are supposed
to offer extra value to cause you to migrate. According to these articles,
MongoDB doesn't offer any real additional value, thus you shouldn't use it.

It's not about elitism, it's about making good decisions. That said, given all
the hype with companies that hire, protesting loudly might not be the best
short term personal decision. Meh.

~~~
threeseed
These articles are just one side of the picture which gets heavily upvoted on
HN.

And of course it is about elitism. Listen to yourself. "It's about making good
decisions".

I mean, who are you to judge from the outside what technology a company should
use for a specific use case?

~~~
ownagefool
I wasn't specifically talking about the articles, just making a general point
that I'm happy to stand behind. I'll happily repeat it: unless you have a
specific use case, RDBMS is the default and it should be so. Now if you have a
specific use case, fair enough, but the aforementioned examples don't honestly
seem to be valid reasons to throw out the advantages of RDBMS.

------
leothekim
I'm interested in hearing what the author's new storage system is. What would
be compelling is to hear if the same hardware and storage with the new storage
system performed better than mongo with some semblance of concrete metrics.
There are a lot of complaints here about mongo -- all of them not new -- but
no hard numbers.

------
craigching
Whenever I see "You don't need MongoDB, use an SQL database" and the flaming
back and forth that follows, I never see my key problem mentioned:

MongoDB makes it easy to scale out (replica sets and sharding), where is the
"easy to setup replicated and sharded open source SQL database?"

I mean, I know that Postgres has replication (via Slony? honestly, it's been
awhile since I looked at their solutions) but I don't recall it being as dead
simple to set up.

For me, setting up replication needs to be easy because we redistribute the
store as part of our product and we need scalability (both replication for
redundancy and sharding for scaling).

So I'm honestly asking here, where is the easy to use sharded and replicated
open source SQL store that I've been missing?

~~~
nasalgoat
That "easy to scale out" is a misnomer. Replica sets and sharding work in the
technical sense, but the implementation isn't anywhere near what I would
qualify as production ready.

For example, today my entire production MongoDB database was running 3x slower
because a single replica in one shard was down, and their buggy PHP driver
kept trying to talk to it despite it being marked down. I really enjoyed
waking up at 2am to deal with that.

It relates back to the "easy to use" nature of their marketing. It really is
super easy to use and develop on, but the minute you need to do anything
important or serious, it breaks down.

You aren't doing yourself any favors going with it except as a
proof-of-concept.

~~~
craigching
But Mongo DB being buggy isn't a reason to need to use an SQL database vs. a
NoSQL store. An SQL database could be buggy as well (I still use Postgres and
comparing anything to that quality-wise is just going to bring sorrow for the
thing you compare it to ;) ).

FWIW, it's been spotless for us so far. Our needs aren't web scale, but
they're big enough to need scaling features.

~~~
nasalgoat
The main issue with MongoDB is that it's so easy to use and seems like it
scales, but soon you're invested in it to the point of refactoring being a
serious engineering effort, and you're stuck with something that doesn't
actually offer real scaling features.

So, it's less "nosql vs. sql" and more just "don't use mongodb".

------
eknkc
It's like people started complaining about MongoDB just for the sake of it. I
guess it's the new trend?

\- Ordered data and skip / limit: These would run just fine on any database
system. Given that you have appropriate indexes. It does not matter if there
are a trillion items total, as long as you are seeking over an index and the
result set is in reasonable size.

\- Restrictions: A lot of software has restrictions. Filesystems have file name
limitations. RDBMSs have table / column name limitations. It's a fact of life.
Why is this a con for MongoDB?

\- Impossible to keep working set in memory: It is a fair argument that
MongoDB has shitty memory management because it just delegates the
responsibility to OS. However, this is a concern with any DBMS. Also, given
that there are appropriate indexes, you don't need to keep the entire database
in memory. This comes back to the indexing problem.

\- No transactions / lack of schema / no joins...: I don't remember mongoDB
claiming to have such features. My car can't fly. I'm not complaining. (Well,
sometimes)

\- Locking: Fair point. Better I/O performance might come in handy (like an SSD)
or eventually sharding.

\- Poor space efficiency: Fair point about fragmentation and field names.
Compression can be achieved on the filesystem level. There was an article
about that a couple of days ago. I'm not sure about performance though.

\- Too many databases: This should not be a big issue. Mongo does not go ahead
and allocate a couple of gigabytes for each db, it uses incremental file sizes.

\- Silent failures: Yep.. There it fails miserably. Recent versions are better
though.
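
The field-name part of the space-efficiency point in the list above can be illustrated with a rough measurement. This sketch uses JSON as a stand-in for BSON (an assumption, since the two encodings differ in detail) and compares storing every record with its field names against storing the names once. The function name `key_savings` is invented for this example.

```python
import json


def key_savings(docs):
    """Fraction of encoded size saved by storing field names once instead of per record.

    Assumes every document has the same set of fields as the first one.
    """
    fields = sorted(docs[0])
    # Document style: every record repeats every field name.
    as_documents = sum(len(json.dumps(d)) for d in docs)
    # Row style: one header with the names, then only values per record.
    as_rows = len(json.dumps(fields)) + sum(
        len(json.dumps([d[f] for f in fields])) for d in docs
    )
    return 1 - as_rows / as_documents
```

With short values and long field names, most of each stored record is field names, which is why shortening names was a commonly suggested workaround.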

~~~
kalleboo
Why are you taking this personally? They're just listing reasons why it's not
a good fit for them. Useful information to others who are trying to pick a
database for similar applications.

~~~
eknkc
I'm sorry if it looks like I'm attacking the criticism. Nope, I would not use
MongoDB ever again, after a year and a half with it. I have my reasons for
this decision.

I just don't like people bashing something without valid reasons. It might
just be a perfect solution for similar applications, this is not a good way to
evaluate.

~~~
Jaigus
>I just don't like people bashing something without valid reasons

That really doesn't seem to be the case here. Like you, the article's
author (and several others here) have had issues with it for their particular
use-case, and the reasons are clearly listed in a well organized,
paragraph-by-paragraph summary in the article. Others here who've had a similar
experience at least stated they had issues with it as well, even if they
didn't go into much detail about it.

And speaking of the lack of _valid reasons_ , to be fair, many relatively new
technologies like these often get significant _praise/hype_ without many valid
reasons as well, other than [X]startup/company is using it, so it should be
able to work for me, or it must be an awesome technology to use.

------
dkhenry
you lost me here

""" Ordered data

Some data (e.g. crawl logs) needs to be returned in the order it was written.
Retrieving data in order requires sorting which is impractical when the number
of records gets large. """

It requires _indexing_ and is quite feasible, as I do it every day with stock
ticker logs (also required to be retrieved incrementally).
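
A minimal sketch of index-backed, incremental retrieval in write order, using SQLite from the Python standard library rather than MongoDB (a substitution made purely to keep the example self-contained). The table `log` and the helpers `open_log`, `append` and `read_after` are hypothetical names.

```python
import sqlite3


def open_log(path=":memory:"):
    conn = sqlite3.connect(path)
    # INTEGER PRIMARY KEY is SQLite's indexed rowid; AUTOINCREMENT keeps ids
    # strictly increasing, so id order matches write order.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS log ("
        "id INTEGER PRIMARY KEY AUTOINCREMENT, line TEXT NOT NULL)"
    )
    return conn


def append(conn, line):
    with conn:
        conn.execute("INSERT INTO log (line) VALUES (?)", (line,))


def read_after(conn, last_id=0, limit=100):
    """Return up to `limit` (id, line) rows written after `last_id`, in write order."""
    return conn.execute(
        "SELECT id, line FROM log WHERE id > ? ORDER BY id LIMIT ?",
        (last_id, limit),
    ).fetchall()
```

The caller passes the last id it saw to fetch the next batch, so each read seeks into the index instead of sorting or skipping over earlier rows.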

There are a few other flags that make me wonder about the exact limitations
you found, but I will be anticipating your follow up post to see what your fix
was since some of those issues are very common.

~~~
leothekim
Also, mongo has natural ordering which would do what the author wants without
sorting.

~~~
berito
No, mongo natural order is just the order on disk. It is not always in reverse
insertion order.

------
KaiserPro
Now the interesting thing about this post is this, I can see why they wanted
to use mongodb, and I can see why it bit them in the arse.

What interests me is why they would want to keep _everything_ in the database?
I'd assume that they need to aggregate and curate the scraped data. After the
initial scrape the majority of actions surely are going to be on the metadata
of the scraped content? (where is said data, when was it scraped, how big,
relationship to other data, etc) This data is much smaller and can be stored
in a relational database, as it's proper structured data with relationships.

This allows the nasty unstructured data to be kept on a plain boring
filesystem. After all filesystems are exceptionally mature, universal,
multilevel key-value stores.
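
The split described above (raw content on a filesystem, metadata in a relational database) might look roughly like this. It is a sketch under assumptions: the `pages` table, its columns, the content-hash directory layout and the helper names are all invented for illustration.

```python
import hashlib
import sqlite3
import time
from pathlib import Path


def open_store(root, db_path=":memory:"):
    root = Path(root)
    root.mkdir(parents=True, exist_ok=True)
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pages ("
        "url TEXT PRIMARY KEY, path TEXT NOT NULL, "
        "size INTEGER NOT NULL, scraped_at REAL NOT NULL)"
    )
    return root, conn


def save_page(root, conn, url, body):
    """Write the raw bytes to a content-addressed file, then record the metadata."""
    digest = hashlib.sha256(body).hexdigest()
    path = root / digest[:2] / digest
    path.parent.mkdir(exist_ok=True)
    # File first, metadata second: the database never points at a missing file.
    path.write_bytes(body)
    with conn:
        conn.execute(
            "INSERT OR REPLACE INTO pages VALUES (?, ?, ?, ?)",
            (url, str(path), len(body), time.time()),
        )
    return path


def load_page(conn, url):
    row = conn.execute("SELECT path FROM pages WHERE url = ?", (url,)).fetchone()
    return Path(row[0]).read_bytes() if row else None
```

Queries about where, when and how big only touch the small `pages` table; the large unstructured bodies are read from disk only when actually needed.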

Now people will say that filesystems don't scale, well that's not really true.
ext4/ntfs on a single system won't scale, but something like
lustre/gluster(although not as neat)/gpfs scales linearly with the amount of
nodes you apply to it.

~~~
shane42
This approach (scraped data on FS + metadata in DB) works well for storing
scraped data. It was the first thing I prototyped when we started the project
to move away from MongoDB. We've worked on similar designs in the past where
the data is in S3, it's a common pattern.

We'd need to code the searching, filtering, paginating, (distributed?) job
management ourselves while being careful to keep the DB & metadata consistent.
It works best if each file is a reasonable 'chunk' of data (not too big, not
tiny). None of this is a problem, and it scales very well as you said.

In the end, we went with HBase for crawl data in the new system. Of course,
you can look at this as files on a filesystem (HDFS or others) :) It does a
lot of what we would otherwise have to code ourselves and it's a good fit for
applications we want to build on that data in future (e.g. storing other crawl
datastructures, processing with hadoop). I'll provide more details on that in
the next post.

------
wiremine
Quote from the original author in the comments. TLDR: it was human error that
got us into this situation:

"The lack of joins & transactions of course did factor into the original
decision. My point (which perhaps could be clearer) was that MongoDB ended up
being used outside of the area in which we originally intended to use it.
There was some reluctance to add another technology when we could get by with
what we had for what was (initially) only a small use. Additionally, some
limitations were not always well understood by web developers (who were new to
mongo and enthusiastic to try it). I see this as our mistake. With hindsight,
it’s clear we should have introduced an RDBMS immediately and kept MongoDB for
managing the crawl data."

~~~
jeffdavis
Databases almost always grow outside of their original scope, much more so
than applications. And there is a good reason: data is more valuable when it's
combined with other data.

So I think it's reasonable to be cautious about using a system that can't
effectively grow outside of its initial special purpose.

------
latchkey
I use Mongo for storing a fairly large amount of scraped data and it works
great. Some of the data I store is results from bike races that I want to
display on my website in a better way than it is displayed elsewhere. The
columns change which makes Mongo a great fit, but the data is pretty static.

The real issue here is that it feels like the author has just 'discovered'
these problems as if Mongo was hiding them all along and after a long time
using the system he just found them. The reality is that all of the things he
brings up are well documented. It is fascinating to me how people pick a
buzzword database and don't bother to think about how their application might
run poorly on it over time.

~~~
calpaterson
Every time there is a mongodb retrospective or experience report posted to HN,
the top comment is one along the lines of "Well, these issues are all well
documented."

Firstly, the fact that some drawback is well documented does not excuse the
fact that it is a drawback.

Second, while some drawbacks are documented some implications of these
drawbacks are nuanced and only become obvious with experience. A good example
of this is the implications of "schemaless" databases (more accurately:
databases that do not check data against a schema). Not having to migrate
tables is a boon for lots of development. It's also a giant pain if it turns
out that bugs cause data integrity issues.
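
To illustrate what "do not check data against a schema" leaves to the application, here is a small sketch of a check done in application code before a write. The `EXPECTED` shape and the `validate` helper are hypothetical and not part of any MongoDB driver.

```python
# Hypothetical shape of a scraped-page record, used only for this example.
EXPECTED = {"url": str, "status": int, "body": str}


def validate(doc, schema=EXPECTED):
    """Return a list of problems; an empty list means the document fits the schema."""
    problems = [f"missing field: {name}" for name in schema if name not in doc]
    problems += [
        f"wrong type for {name}: expected {kind.__name__}, "
        f"got {type(doc[name]).__name__}"
        for name, kind in schema.items()
        if name in doc and not isinstance(doc[name], kind)
    ]
    problems += [f"unexpected field: {name}" for name in doc if name not in schema]
    return problems
```

Without a check like this somewhere, a bug that writes `"200"` instead of `200` is stored silently and only surfaces later, when something reads the data.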

Third, this experience report is really useful since poorly structured scrape
data is one of the areas that I would have considered to be ideal for mongodb.

Most people don't have perfect foresight. I don't fault the author on his lack
of omniscience with respect to how mongodb would turn out for them. His
original reasoning (given in paragraph 1, sentence 1) does not seem stupid.

------
nosespray
People aren't complaining just to complain. When you can't even ctrl-c out of
the shell there is a huge issue. It's a known bug, around since v1.8, 'minor'
priority. Yeah, thanks for trapping me in the shell.

~~~
mratzloff
Apparently this is the ticket being referred to:

[https://jira.mongodb.org/browse/SERVER-2986?page=com.atlassi...](https://jira.mongodb.org/browse/SERVER-2986?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel)

------
thehodge
I read the whole post waiting to see what they ended up using as we are having
similar issues, only to find that it's another post I have to wait for..

~~~
pablohoffman
We went with HBase. Cassandra would have been suitable too, but we already use
Hadoop for data processing so it was a natural choice within the
infrastructure ecosystem. We will write a followup about that.

~~~
monstrado
Clouderan here! Glad to hear you guys went with HBase, I'm looking forward to
your follow up post. Will you detail your key design / architectural setup?

Did you guys roll your own HBase environment or did you go with the CDH? If
you're using the CDH version and have any questions, feel free to shoot an
email to cdh-user.

~~~
pablohoffman
We are using CDH4.2 and have had a very positive experience so far.

Cloudera has in fact been an inspiration for us to follow, you guys have
really struck the right balance between open source and commercial support. We
follow the same philosophy with Scrapy (an open source web crawling
framework), as you do with Hadoop and its ecosystem.

~~~
monstrado
That's really awesome to hear, thanks for your kind words. I'm looking forward
to the follow up blog, depending on your key design you may be able to take
advantage of Impala for ad-hoc queries using SQL.

------
nramirezuy
Why do you people hate MongoDB? It is a great way to start with NoSQL, and
great for small new projects.

