

Best language to learn for Big Data - akbarpasha

I have been watching the trends in big data. I have read The Numerati and Super Crunchers, and I am now reading The Singularity Is Near. I am a programmer with about 12 years of experience.

I am just confused about which language to learn for writing programs that work on big data. I tried Erlang (not many support libraries), looked at R, peeked at Haskell, and, heck, even thought about going back to Java!

Is there any specific language you would recommend that would enable me to process and analyze huge datasets and write good applications on top of them?
======
mindcrime
Java's not actually a bad choice, considering some of the tools that are
available: Hadoop[1], HBase[2], etc. Depending on what you are trying to
accomplish of course, you could also include Lucene[3], Solr[4], Mahout[5] and
Weka[6] in that list.

Of course cool Java libraries don't mean you have to work in Java. Scala[7]
seems to be gaining quite a bit of momentum, as does Clojure[8].

From your reading list, it sounds like you may be interested in predictive
analysis, machine learning, etc. If so, Python[9] is pretty popular in those
circles and has some nice libraries available as well[10]. And you can do neat
things like use a Hadoop cluster for map/reduce processing, even if you have a
client written in Python (or whatever)[11].
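
To make the map/reduce-from-Python idea concrete, here is a minimal word-count sketch in the style of a streaming job: a mapper emits (key, value) pairs, the framework sorts them by key, and a reducer sums each group. The names `mapper`, `reducer` and `run_locally` are hypothetical, and `run_locally` only imitates the sort step on one machine. On a real cluster the two phases would typically be separate scripts reading stdin and writing tab-separated lines to stdout.

```python
from itertools import groupby
from operator import itemgetter


def mapper(lines):
    """Map phase: emit (word, 1) for every whitespace-separated word."""
    for line in lines:
        for word in line.split():
            yield word.lower(), 1


def reducer(sorted_pairs):
    """Reduce phase: sum the counts per word. Input must be sorted by word."""
    for word, group in groupby(sorted_pairs, key=itemgetter(0)):
        yield word, sum(count for _, count in group)


def run_locally(lines):
    """Hypothetical local stand-in for the sort a cluster does between phases."""
    return dict(reducer(sorted(mapper(lines))))


if __name__ == "__main__":
    sample = ["big data is big", "data is data"]
    for word, total in sorted(run_locally(sample).items()):
        print(f"{word}\t{total}")
```

Because the reducer only assumes its input arrives sorted by key, the same logic should carry over when a framework does the sorting across many machines.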

Of course Map/Reduce[12] isn't the only distributed processing model out
there. You might want to learn MPI[13] then explore the MPI bindings that are
available for many (most?) languages, and look at other models of distributed
/ parallel computing.

Anyway, to summarize, I'd say that if you already know one or more of Java,
Python, C, C++ or Scala, you're in good shape. If not, some combination of the
above should fit the bill. And yes, R might be handy as well.

Not to take anything away from Erlang, Haskell, Forth, Factor, Scheme, Lisp,
OCaml, Perl, or any other language of course. I'm just not as familiar with
those so I can't say much about them.

1\. <http://hadoop.apache.org/core/>

2\. <http://hbase.apache.org/>

3\. <http://lucene.apache.org/java/docs/>

4\. <http://lucene.apache.org/solr/>

5\. <http://mahout.apache.org/>

6\. <http://www.cs.waikato.ac.nz/ml/weka/>

7\. <http://www.scala-lang.org/>

8\. <http://clojure.org/>

9\. <http://www.python.org/>

10\. <http://www.google.com/search?q=python+machine+learning>

11\. <http://www.michael-noll.com/wiki/Writing_An_Hadoop_MapReduce_Program_In_Python>

12\. <http://en.wikipedia.org/wiki/MapReduce>

13\. <http://en.wikipedia.org/wiki/Message_Passing_Interface>

~~~
evgen
This is a good overview, but I would definitely bump Clojure higher in the
implicit rankings of your list. If you know Java and Lisp does not frighten
you, then it is probably a better choice for big data analysis than Scala. The
Incanter libraries give you a very nice R-like environment for numerical
analysis and Cascalog is a pretty nifty system for interacting with Hadoop.

~~~
mindcrime
You're probably quite right. I didn't give Clojure a stronger endorsement only
because I don't know it very well yet myself... I'm just starting with it. I
wasn't aware of either Incanter or Cascalog, for example, until you mentioned
them.

------
beagle3
APL is an option. You'll have to fork out for Dyalog (or APLX or whatever else
is out there) if you want to use it; there is no free IDE or anything available.

J is also an option. It's APL's cousin; it's free, and can handle lots of data
with ease and elegance.

Both languages take time to learn, and are probably not similar to anything
else you've used, but are very rewarding -- even if you end up not using them,
your style in any other language will improve.

Other alternatives are Matlab, Numpy and R (which you've already mentioned).
Numpy is great if you feel good with Python.

Basically, the magic elements are "64-bit" and "memory mapping". These two
together, if used properly, will make it look like any data set fits in
memory. Actually indexing it so that things don't slow down too much is up to
you. Don't skimp on physical memory.
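
A rough sketch of the memory-mapping idea, using only Python's standard `mmap` and `struct` modules. The fixed-width record layout (one little-endian 64-bit integer per record) and the helper names `write_records`, `read_record` and `demo` are assumptions made for illustration. The file is mapped rather than read, so the OS pages in only the parts that are actually touched.

```python
import mmap
import os
import struct
import tempfile

# Assumed layout: one little-endian signed 64-bit integer per record.
RECORD = struct.Struct("<q")


def write_records(path, values):
    """Write each value as a fixed-width binary record."""
    with open(path, "wb") as f:
        for value in values:
            f.write(RECORD.pack(value))


def read_record(mm, index):
    """Fetch record `index` straight from the mapping, without loading the file."""
    (value,) = RECORD.unpack_from(mm, index * RECORD.size)
    return value


def demo():
    """Build a small sample file, map it read-only, and read a few records."""
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        write_records(path, range(100_000))
        with open(path, "rb") as f:
            with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
                count = len(mm) // RECORD.size
                return (count, read_record(mm, 0),
                        read_record(mm, 12_345), read_record(mm, count - 1))
    finally:
        os.remove(path)


if __name__ == "__main__":
    print(demo())
```

Fixed-width records make the offset of record `i` a simple multiplication, which is the most basic form of the indexing mentioned above. The 64-bit part matters because a 32-bit process cannot address a mapping larger than a few gigabytes.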

------
starkfist
Java if you want to use all the existing tools like Hadoop.

C++ if you are writing it from scratch.

It's useful to know a scripting language but eventually you will hit
performance limits and will need to use Java or C++.

------
ig1
C++ tends to be the workhorse in this area, but it depends on what you want to
do and what the likely bottlenecks are going to be (CPU, IO, memory, etc.)

