
MapReduce for C: Run Native Code in Hadoop - rey12rey
http://google-opensource.blogspot.com/2015/02/mapreduce-for-c-run-native-code-in.html?m=0
======
choppaface
Hmmm, this is _definitely not_ some sort of vanilla C interface for Hadoop.
MR4C is rather a ~40ksloc C++/Java library capable of performing somewhat
high-level operations on datasets. It's closer to a MapReduce app or a tool
like Pig than a simple C/C++ wrapper. In particular, MR4C has its own data
model:

* [https://github.com/google/mr4c/tree/master/tutorial/UserGuid...](https://github.com/google/mr4c/tree/master/tutorial/UserGuide)

... and nontrivial logging, traceback debugging, and tempfile handling... i.e.,
infrastructure for _writing a library in MR4C_

* [https://github.com/google/mr4c/blob/master/native/src/cpp/im...](https://github.com/google/mr4c/blob/master/native/src/cpp/impl/util/StackUtil.cpp)

* [https://github.com/google/mr4c/blob/master/native/src/cpp/im...](https://github.com/google/mr4c/blob/master/native/src/cpp/impl/util/MR4CTempFiles.cpp)

* [https://github.com/google/mr4c/blob/master/native/src/cpp/im...](https://github.com/google/mr4c/blob/master/native/src/cpp/impl/util/MR4CLogging.cpp)

... and a 3600-line C++ package for extracting crops from satellite images
based on geospatial coordinates (which is likely more relevant to Skybox than
to Hadoop users)

* [https://github.com/google/mr4c/tree/master/geospatial](https://github.com/google/mr4c/tree/master/geospatial)

... and singletons all over the place! (Kenton Varda would not be happy)

* [https://github.com/google/mr4c/blob/master/native/src/cpp/ap...](https://github.com/google/mr4c/blob/master/native/src/cpp/api/algorithm/AlgorithmAutoRegister.h#L34)

Props to the Skybox devs for getting this out. All of this code almost
certainly got thrown away as part of Skybox's transition to the Google
stack.

I haven't yet found any benchmarks or illustrations of I/O savings.

------
shmerl
Hadoop also allows using different languages with Pipes
([https://wiki.apache.org/hadoop/C++WordCount](https://wiki.apache.org/hadoop/C++WordCount)),
though it's not really a fully native approach anyway. And that method has
various limitations compared with the Java APIs.

Google really should have open sourced their original MapReduce framework,
which was written in C++. That would have prevented Hadoop from becoming
mostly Java-based.

~~~
oxplot
It's highly likely that their map-reduce framework is tightly integrated with
and dependent on their own infrastructure (e.g. GFS, BigTable, etc).

~~~
shmerl
That may be, but they could have open sourced those too ;) For example, HPCC
([https://en.wikipedia.org/wiki/HPCC](https://en.wikipedia.org/wiki/HPCC)) was
open sourced by LexisNexis along with its distributed filesystem and so on.
They kind of realized that such frameworks only benefit from being open.

------
a1k0n
I'm really confused. It appears to have no HDFS support? So how do you get
data in and out of the cluster? I guess the InputFormat just takes care of
that magically?

Also the code isn't even close to Google style. I guess that's not so
surprising, come to think of it.

This is really pretty interesting. I've wanted to be able to do this for a
while, and marshaling everything through JNI is, I guess, the easiest way to
do it.

~~~
quadrature
I'm equally confused; they mention that data sources can be specified through
HDFS URIs.

I can tell you one thing though, JNI is never the answer.

