

Tools for the data science craftsman: R, Python, Clojure and Julia - nextos
http://blog.redowlanalytics.com/post/67465127385/tools-for-the-data-science-craftsman

======
stewbrew
A rather useless listing. I don't see the point of including three dynamically
typed languages with similar usage scenarios. Until Julia is battle-tested,
I'd suggest also including C++ or Java (Scala) in the toolbox. You should
also be aware that you can use C++ & Java from R, R from Julia or Clojure, etc.
Being able to use some Java library is thus no argument exclusively in favour
of Clojure.

~~~
nextos
I know the listing is quite useless. But I wanted to hear people's opinions on
which language to use for data analysis in 2014.

I'm quite fluent in R. I love some parts of the language, the breadth of
packages available, and ggplot. But it's really slow and not general purpose,
so I always preprocess data in Ruby.

Clojure is a fantastic language, but its data analysis story is thin. The
matrix libraries are still being worked out. Incanter, which contains most
basic stats routines, seems abandoned. Yet, some people like the Flightcaster
guys use it, so there might be ways to work around this.

Julia looks great, but I'm worried by the lack of emphasis on functional
programming. Having a multi-paradigm language is great. Some things are
difficult to implement outside an imperative mindset, e.g., a Gibbs sampler.
But functional should be the preferred option if possible. Maybe this is a
wrong impression of mine.
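
To make the Gibbs sampler point concrete, here is a minimal sketch in Python of the usual imperative formulation. The target (a standard bivariate normal with correlation rho), the function name and all parameter values are assumptions chosen for illustration, not something from the article. Each draw depends on the value just produced for the other variable, which is why the loop updates state in place.

```python
import math
import random


def gibbs_bivariate_normal(rho, n_samples, burn_in=500, seed=0):
    """Sample a standard bivariate normal with correlation rho (illustrative)."""
    rng = random.Random(seed)
    cond_sd = math.sqrt(1.0 - rho * rho)  # std. dev. of each full conditional
    x, y = 0.0, 0.0                       # arbitrary starting point
    samples = []
    for i in range(burn_in + n_samples):
        # Full conditionals: x | y ~ N(rho * y, 1 - rho^2), and symmetrically.
        x = rng.gauss(rho * y, cond_sd)
        y = rng.gauss(rho * x, cond_sd)
        if i >= burn_in:
            samples.append((x, y))
    return samples


samples = gibbs_bivariate_normal(rho=0.8, n_samples=20000)
n = len(samples)
mean_x = sum(x for x, _ in samples) / n
mean_y = sum(y for _, y in samples) / n
# Both marginals have unit variance, so the sample covariance estimates rho.
est_rho = sum((x - mean_x) * (y - mean_y) for x, y in samples) / n
print(round(est_rho, 2))
```

The same chain can be written as a fold over the iteration index, but the state-threading then has to be made explicit, which is roughly the awkwardness being described.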

Python has all the libraries, but I've never really liked the language, and
it's still quite slow.

~~~
gajomi
I think I can understand what you mean about worrying about the lack of
emphasis on functional programming in Julia, in the sense that it certainly
isn't emphasized explicitly, but I wonder if your concern is more specific
than that? In Julia, functions are first-class objects and can dispatch on
subtypes of Function. There are several immutable types available. There seems
to be a convention (at least within the standard library) of adding a ! suffix
to indicate that a function has side effects. You have projects like JuMP
([https://jump.readthedocs.org/en/latest/jump.html](https://jump.readthedocs.org/en/latest/jump.html)),
which explicitly revolve around declarative programming DSLs. All these things
together suggest to me that Julia is supportive of functional programming. But
I am only slowly wading into Julia, so I am mostly speaking from a theoretical
perspective. Perhaps you can clue me in?

~~~
nextos
Well, my comment stems from a very superficial overview of the language and
some libraries. Perhaps I'm wrong. But I haven't seen any heavy usage of
higher-order functions or lazy data structures, and this worries me.

I don't just want a fast MATLAB replacement. I want a fully-fledged multi-
paradigm programming language.

~~~
StefanKarpinski
You're right that there isn't a particularly strong emphasis on lazy data
structures or higher-order programming in Julia's base library. They are,
however, both well supported and higher-order functions in particular are used
in many places; Julia uses Ruby-style do blocks all over the place, and a lot
of macros desugar to functional constructs. However, Julia's primary paradigm
is multiple dispatch. For more details on that, you can check out this
presentation I gave at Strange Loop this year:

[http://nbviewer.ipython.org/gist/StefanKarpinski/b8fe9dbb36c...](http://nbviewer.ipython.org/gist/StefanKarpinski/b8fe9dbb36c1427b9f22)
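
For readers who have not met multiple dispatch, here is a rough sketch of the idea in Python. It is not how Julia implements it: the `multimethod` decorator, the registry and the `collide` example are all hypothetical names made up for illustration, and this toy version matches only exact argument types, whereas Julia also considers subtypes and picks the most specific method.

```python
# Hypothetical mini-dispatcher for illustration only; not a real library API.
_registry = {}


def multimethod(*types):
    """Register an implementation for an exact tuple of argument types."""
    def register(func):
        _registry[(func.__name__, types)] = func

        def dispatcher(*args):
            # Choose the implementation from the runtime types of ALL arguments,
            # not just the first one as in single-dispatch OO method calls.
            key = (func.__name__, tuple(type(a) for a in args))
            impl = _registry.get(key)
            if impl is None:
                raise TypeError(f"no method {func.__name__} for {key[1]}")
            return impl(*args)

        return dispatcher
    return register


class Asteroid:
    pass


class Ship:
    pass


@multimethod(Asteroid, Ship)
def collide(a, b):
    return "asteroid hits ship"


@multimethod(Ship, Asteroid)
def collide(a, b):
    return "ship hits asteroid"


@multimethod(Ship, Ship)
def collide(a, b):
    return "ships bump"


print(collide(Ship(), Asteroid()))  # ship hits asteroid
```

The point of the sketch is only that the method is selected by the combination of argument types; swapping the argument order selects a different implementation.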

Functional style isn't something to embrace for its own sake. Lazy data
structures, in particular, while elegant, require a very sophisticated
compiler in order to have even remotely decent performance. Even with a
sufficiently clever compiler, the performance characteristics of the resulting
code tend to be extremely opaque and difficult to reason about. That said,
sometimes it's a nice way to express things and it's quite possible to
implement lazy data structures in Julia: see, for example, the Lazy [1] and
LazySequences [2] packages.

[1] [https://github.com/one-more-minute/Lazy.jl](https://github.com/one-more-minute/Lazy.jl)

[2] [https://github.com/dcjones/LazySequences.jl](https://github.com/dcjones/LazySequences.jl)

------
MojoJolo
I just became a Data Scientist this year, and it enables me to dive into
different datasets, graphs, and algorithms. I've been using roughly three of
the languages mentioned.

I'm new to R and have just been creating simple plots. I like how it
illustrates my data from the CSV file. I've never tried matplotlib, but my
colleagues prefer it to R.

Python is my go-to language for fast scripting and programming. I also like
how easily a REST API can be built in Python using Flask.

On the other hand, I don't use Clojure or Julia. I think I've heard of Julia
before, but I'm not really familiar with it. Instead of Clojure, I prefer
Scala. Not because of performance or anything; it's just the language that I
learned first. I think they serve the same purpose as presented in the
article. Just like Clojure, the "JVM goodness" is also present in Scala. I
like that it can use Apache stuff like OpenNLP, IO, etc.

I've been combining Python, Scala, and R in my work: Python for the REST API,
Scala for the algorithms or heavy computing, and R for data visualization.

------
gengstrand
About 8 months ago, I blogged about using Hadoop and R for open source data
science
([http://www.dynamicalsoftware.com/analytics/oss](http://www.dynamicalsoftware.com/analytics/oss)).
I found R to be very useful, especially if you want to graph large data sets.

Since then, I have been playing with Clojure for data science.
What I like about Clojure is that the programs look like transformations of
data sets with map, reduce, into, filter, merge-with, join, difference, union.
You get more than just lists. You have maps, vectors, and sets too. The code
looks like you are loading a lot of data into memory, but with lazy sequences
you are really processing these collections one element at a time. The
so-called threading macros (nothing to do with multi-threading) keep the
number of parentheses down.
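
The "one element at a time" behaviour can be sketched in Python, where `map` and `filter` return lazy iterators. This is an analogy rather than Clojure code; the `source` generator and the `pulled` list are made-up names that exist only to show which elements actually get produced.

```python
from itertools import count, islice

pulled = []  # records which source elements were actually produced


def source():
    """An unbounded lazy source of integers (illustrative)."""
    for n in count(1):
        pulled.append(n)
        yield n


# Nothing runs yet: map and filter only build lazy iterators in Python 3.
squares = map(lambda n: n * n, source())
even_squares = filter(lambda sq: sq % 2 == 0, squares)
assert pulled == []

# Only now are elements pulled through the pipeline, one at a time.
first_three = list(islice(even_squares, 3))
total = sum(first_three)

print(first_three)  # [4, 16, 36]
print(pulled)       # [1, 2, 3, 4, 5, 6]
```

Although the source is unbounded, only the six elements needed to find three even squares are ever generated, so the pipeline reads like a whole-collection transformation without materialising the collection.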

------
juliangamble
If your query is in Clojure, there are some great data science tools, in
addition to those mentioned, that leverage Hadoop. In particular, Parkour:

[https://github.com/damballa/parkour/](https://github.com/damballa/parkour/)
(Hadoop MapReduce in idiomatic Clojure)

If your query is in R, then you can convert it to a PMML query and then use
Cascading to run it on Hadoop:

[http://blog.cloudera.com/blog/2013/11/how-to-use-cascading-p...](http://blog.cloudera.com/blog/2013/11/how-to-use-cascading-pattern-with-r-and-cdh/#)

------
graycat
Linear Algebra

Linear programming

Advanced calculus

Nonlinear programming

Probability

Stochastic processes

Time series analysis

Fourier theory

Basic mathematical statistics

Multivariate statistics

Analysis of variance and experimental design

Analysis of categorical data

~~~
gtani
Yes! (more detail: [http://www.cs.ubc.ca/~murphyk/MLbook/pml-toc-22may12.pdf](http://www.cs.ubc.ca/~murphyk/MLbook/pml-toc-22may12.pdf))

~~~
graycat
Wow! Yup, he made it, just over 1000 pages!

From the table of contents, the only thing I could find missing was the
kitchen sink!

His prerequisites are:

"multivariate calculus, probability, linear algebra, and computer
programming".

He does say that there is code on-line for nearly everything in the book, code
for Matlab/Octave.

