

Mongodb IS web scale: hadoop-mongodb - rjurney
http://datasyndrome.com/post/14631249157/mongodb-is-web-scale-hadoop-mongodb

======
jbellis
My comment from TFA. Context: Paul Querna correctly pointed out that Cassandra
has had Hadoop integration since 0.6, and the author replied that Cassandra
was complicated.

If your objection to Cassandra is "it's complicated," you have no business
running Hadoop. :) How to set up a Cassandra cluster in under two minutes:
<http://www.screenr.com/5G6>

If on the other hand you simply made statements like "Mongo is the first NoSQL
to nail painless Hadoop and Pig integration" without doing any research, then
you should probably edit your blog post.

~~~
rjurney
I don't have anything against Cassandra. I will take a look at your link, and
see how easy it is to integrate with Pig. I would be pretty excited to have
another painless option available. The fact that Cassandra works with Whirr is
very, very cool. However:

Cassandra's documentation: <http://wiki.apache.org/cassandra/GettingStarted>

"Cassandra is an advanced topic, and while work is always underway to make
things easier, it can still be daunting to get up and running for the first
time."

Change your docs, or demonstrate how to push data to Cassandra in a one-liner,
and I will happily update my post. Shadow puppet docs do not count.

Your statement about Hadoop being complex illustrates EXACTLY the problem I'm
trying to solve. 'Big data' usability. ;) Amazon EMR against records in S3
with Pig is not hard. Publishing data from S3 via EMR to Mongo in Heroku...
that is not hard either. Wow, suddenly 'big data' is open to anyone using
Heroku. That is a big deal.

------
cbsmith
Gee. I kind of thought HBase & Cassandra had Hadoop integration pretty much
down...

~~~
squarecog
The HBase integration with Pig is pretty good (disclaimer: I wrote a bunch of
it, and use it on a daily basis). The only thing is that you need to create
the table and set up column families yourself. The Mongo driver Russell demos
automatically creates the table, which may or may not be a good thing. Also, he
didn't actually say anything about scalability except for linkbaiting in his
title :).

~~~
rjurney
HBase integration is good. But having to deal with column families, etc. rules
it out for me in terms of solving the usability problem. I just want to push
records and retrieve them as JSON. This is the most common use case when
publishing data from Hadoop to a NoSQL store. I think this could be fixed? Can
column families be inferred? I am highlighting Mongo's superior usability here
to set an example for others.
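The workflow rjurney describes, push records out of Hadoop and retrieve them
as JSON by key, can be sketched in plain Python. This is only an illustration
of the pattern: a dict stands in for the key/value store, and the record shape
is made up.

```python
import json

# A dict stands in for the NoSQL store (e.g. Mongo); illustrative only.
store = {}

def publish(record):
    """Push one Hadoop output record, keyed by its id, serialized as JSON."""
    store[record["id"]] = json.dumps(record)

def fetch(key):
    """Retrieve a record as JSON; no schema or column families to set up."""
    return json.loads(store[key])

publish({"id": "user42", "visits": 17, "last_seen": "2011-12-24"})
print(fetch("user42")["visits"])  # 17
```

The point of the sketch is what is absent: no table creation, no column-family
design, just serialize and push.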

~~~
squarecog
I would argue that any time you put "just" and "terabytes" next to each other,
you are heading for big problems to go with your big insights :). Schema-less
is great... until you can't find stuff and your data is full of
inconsistencies.

~~~
rjurney
I've operated this way in practice, at scale, and it works fine. You're
rebuilding your entire store and swapping it out frequently, so data
consistency isn't a problem. The key is to have a painless pipeline setup, so
that one person can do the entire thing... thus negating the need for
contracts between parts of the stack.
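A minimal sketch of that rebuild-and-swap workflow, in plain Python with a
dict standing in for the store and made-up record shapes: each batch run builds
a fresh store from scratch and then replaces the live one in a single step, so
stale or inconsistent records never accumulate.

```python
import json

# Current serving store (a dict stands in for Mongo; illustrative only).
live_store = {"user1": '{"visits": 3}'}

def rebuild(records):
    """Build a brand-new store from the latest Hadoop batch output."""
    return {r["id"]: json.dumps(r) for r in records}

def swap(new_store):
    """Point reads at the freshly built store in one step; the old store
    is discarded wholesale, so consistency drift never accumulates."""
    global live_store
    live_store = new_store

batch = [{"id": "user1", "visits": 4}, {"id": "user2", "visits": 1}]
swap(rebuild(batch))
```

Because the whole store is replaced on every run, there is no incremental
update path that could leave the store half-migrated.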

------
kevinpet
Sure it is. I'll grant you that MongoDB is web scale. Now could someone tell
me what web scale means? The whole point of that xtranormal piece was that "web
scale" is a meaningless marketing term. You can't _argue_ that something
qualifies for a meaningless marketing term.

Cliff notes: the article didn't define web scale, therefore I didn't read the
meaningless article.

------
gsteph22
fml

~~~
rjurney
Hey, whatever you think about Mongo... it can probably work well as a read-
only key/value store, which is all that is needed to publish data from Hadoop,
because you've already batch-processed it into presentation form.

