

Ask HN: What is the best way to dump mongo data into Hadoop? - misiti3780

I have a large amount of data in a MongoDB database and I want to use Apache Hadoop (not Mongo map-reduce) to analyze it. Does anyone have suggestions/tutorials/etc. on the best way to do this (i.e. export the Mongo data to HDFS)?
======
dangoldin
I was at a meetup where the Foursquare data science team spoke about this
problem. If I recall correctly, their solution was to have jobs that would
take the data from Mongo and store it in flat files that would then be used by
the Hadoop jobs. They found that the performance gained was worth the
additional storage costs. They have a pretty well-defined Hadoop process,
though, so they were able to optimize for it. If you plan on having a variety
of Hadoop jobs it may not make as much sense.

Note that this information may be outdated so just treat it as a data point.
I'm sure others will have better ideas.
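A minimal sketch of that flat-file step: assuming documents have already been exported as JSON lines (e.g. via `mongoexport`), flatten each nested document into a tab-separated row that Hadoop streaming or Hive can consume. The flattening scheme, field names, and sample document here are illustrative assumptions, not Foursquare's actual pipeline.

```python
import json

def flatten(doc, prefix=""):
    """Recursively flatten a nested Mongo-style document into dotted keys."""
    flat = {}
    for key, value in doc.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, name + "."))
        else:
            flat[name] = value
    return flat

def to_tsv_line(doc, columns):
    """Render one document as a tab-separated line for a Hadoop-friendly flat file."""
    flat = flatten(doc)
    return "\t".join(str(flat.get(col, "")) for col in columns)

# Hypothetical document, as one line of `mongoexport` JSON output might look.
line = '{"_id": "abc123", "user": {"name": "alice", "checkins": 42}}'
doc = json.loads(line)
print(to_tsv_line(doc, ["_id", "user.name", "user.checkins"]))
```

The resulting TSV files can be copied into HDFS (e.g. with `hdfs dfs -put`) and read by plain Hadoop jobs without touching Mongo at query time, which is where the storage-for-performance trade-off comes from.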

~~~
misiti3780
I was looking at this presentation, but can't really make sense of the two
slides on "BSON Data ..."

[http://engineering.foursquare.com/2012/06/22/our-hadoop-stac...](http://engineering.foursquare.com/2012/06/22/our-hadoop-stack-at-foursquare/)

------
taligent
There is this:
[http://api.mongodb.org/hadoop/MongoDB%2BHadoop+Connector.htm...](http://api.mongodb.org/hadoop/MongoDB%2BHadoop+Connector.html)

And a generic connector as well:
[http://blog.mongodb.org/post/29127828146/introducing-mongo-c...](http://blog.mongodb.org/post/29127828146/introducing-mongo-connector)
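For reference, the MongoDB+Hadoop connector is configured through job properties that point `InputFormat`/`OutputFormat` at Mongo URIs. A sketch of the relevant config, with placeholder host, database, and collection names (check the connector docs for your version's exact keys):

```xml
<!-- Hypothetical mongo-hadoop job configuration fragment.
     The URIs below are placeholders, not real endpoints. -->
<property>
  <name>mongo.input.uri</name>
  <value>mongodb://localhost:27017/mydb.mycollection</value>
</property>
<property>
  <name>mongo.output.uri</name>
  <value>mongodb://localhost:27017/mydb.results</value>
</property>
```

With this, the connector reads splits directly from Mongo instead of requiring an export to HDFS first, which trades the flat-file approach's read performance for simplicity.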

------
heretohelp
Should be trivial since document stores are less relational and the data
should be relatively isolated.

You really just need to learn the subject matter; there is no magic wand for
loading data from one into the other.

You understand one, then you understand the other, then you understand how to
port and grapple with the data.

Just start reading.

