I have a large amount of data in a MongoDB database and I want to use Apache Hadoop (not Mongo's built-in map-reduce) to analyze it. Does anyone have suggestions/tutorials/etc. on the best way to do this (i.e., exporting the Mongo data to HDFS)?
I was at a meetup where the Foursquare data science team spoke about this problem. If I recall correctly, their solution was to have jobs that pulled the data out of Mongo and stored it in flat files, which the Hadoop jobs then consumed. They found the performance gain was worth the additional storage cost. They had a pretty well-defined Hadoop process, though, so they were able to optimize for it; if you plan on running a wide variety of Hadoop jobs, it may not make as much sense. A minimal sketch of what such an export job might look like is below.
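This is just a sketch of the general flat-file approach, not Foursquare's actual pipeline: it assumes pymongo, a hypothetical `foursquare.checkins` collection, and a local MongoDB instance. It dumps each document as one JSON line so Hadoop's `TextInputFormat` can split the file.

```python
from bson import json_util
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
collection = client["foursquare"]["checkins"]  # hypothetical db/collection

with open("checkins.json", "w") as out:
    for doc in collection.find():
        # json_util handles BSON types (ObjectId, dates, etc.) that
        # plain json.dumps would choke on
        out.write(json_util.dumps(doc) + "\n")

# Then push the flat file into HDFS for the Hadoop jobs to consume, e.g.:
#   hadoop fs -put checkins.json /data/checkins/
```

For large collections you'd want to shard the export (e.g., one file per `_id` range) so multiple mappers can read in parallel, but the one-document-per-line format is the key idea.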
Note that this information may be outdated, so just treat it as a data point. I'm sure others will have better ideas.