
Matlab, R, and Julia: Languages for data analysis - bugsbunnyak
http://strata.oreilly.com/2012/10/matlab-r-julia-languages-for-data-analysis.html
======
lsiebert
You know what will be popular? Whatever runs reasonably fast and helps you
import and clean data quickly from a variety of sources.

Because the analysis is often the quickest part of being a data scientist.
Coursera, as I recall, apparently cleans its data, and also lets you easily
import it.

In real life, data is messy, and messed up. You looking at birthdays from some
website? Expect a spike for whatever the default is... but that doesn't mean
you can eliminate that data completely, because some people were presumably
born Jan 1st.

You looking at birth years? I recall dealing with them in SAS... remember if
it's four digit that you check for births occurring in the current and past
century.
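
A minimal sketch of the two checks above, in Python rather than SAS. The function names, the sample records, the 0.5 threshold, and the 120-year age cap are all made up for illustration; real data would need thresholds tuned to the source.

```python
from collections import Counter
from datetime import date


def flag_default_spikes(birthdays, threshold=0.05):
    """Return (month, day) pairs whose share of all records exceeds threshold.

    A spike only marks a date as a likely form default. It does not prove
    that any single record on that date is wrong, so flag rather than drop.
    """
    counts = Counter((d.month, d.day) for d in birthdays)
    total = len(birthdays)
    return {md for md, n in counts.items() if n / total > threshold}


def plausible_birth_year(year, current_year, max_age=120):
    """Accept four-digit years in the current or past century, within max_age."""
    return current_year - max_age <= year <= current_year


# Hypothetical records: six of ten share the Jan 1 default.
records = [date(1980, 1, 1)] * 6 + [
    date(1975, 3, 14),
    date(1999, 7, 2),
    date(2004, 11, 30),
    date(1968, 5, 9),
]
spikes = flag_default_spikes(records, threshold=0.5)
```

Here `spikes` holds the suspicious month/day pairs, which can then be reviewed instead of being deleted outright.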

And hey... do you have two or more elements of data for an individual? 2% to
5% will probably be missing some element, and some will have wrong data. A zip
code off by one, an address not in the city you are looking to geocode for,
whatever. If you are lucky, it will be obvious stuff like that.

The life of the data scientist is mostly cleaning, formatting, and
transferring data, with the occasional sweeeet analysis. Of course your
analysis will probably give you nothing useful, because despite several
thousand usable records, it's not clear if any element has a significant
effect on the dependent variable you are looking at. If you are smart, maybe
you can finagle an analysis based on a non parametric distribution or logistic
regression.

Oh, and often the speed of your analysis running is inversely correlated with
how easy it is to code and enter your data. There is a reason people use SAS,
and it's not because of its amazing IDE.

------
mark_l_watson
I wouldn't be surprised if Octave (open source version of Matlab) becomes
very popular because a lot of Coursera classes use it for homework
assignments.

I thought that Octave was an ugly little language at first, now I really like
it - a great tool for doing linear algebra, data visualization, machine
learning, neural networks, etc.

~~~
pav3l
Many use Matlab because of its amazing IDE, a great collection of toolboxes
and remarkable speed, not so much because of its language features. Last I
checked Octave was still missing all of that. If you can afford it, Matlab is
usually well worth paying for. If not, other alternatives (Python, R) are much
better in my opinion.

~~~
Gravityloss
When I used Matlab daily, I never used the IDE. The command line and editor
were good.

Matlab is awesome above all else because the design is coherent. Both the
syntax and the standard libraries.

It is extremely quick to whip up anything and then turn that into a script and
then into a software with functions (since functions can return many
variables, and they also have zero overhead, you don't need any includes or
requires, you just call them). Type conversions are practically never a
problem, since they are sane and automatic. None of this 1+1.5 giving syntax
error. Real booleans. Data input and output libraries just simply work like
you would expect them to. ( A=imread('/home/gravityloss/abc.png') creates a
width x height x 3 matrix with all the rgb values. No requires, includes,
plugins, hunting and compiling libraries.). You don't need libraries to do a
huge amount of stuff, but if you need them for something experimental, they
work extremely easily.

You also rarely need stuff like loops since mass operations on data are
native. If you as a newbie create a custom function for a scalar, there's good
chance it will work for vectors or n-matrices automatically. This reduces the
amount of error-prone housekeeping code for indices and lengths immensely.
It's also much much faster than some looping in another scripting language. As
a result, the code is often very readable as well.

There's help which actually returns something sensible when you type help, you
can type help help or help command or search this or that, the help texts are
actually very thoughtful and helpful too and not at all like Linux man
pages... I could go on for hours on features that don't really exist anywhere
else, even though everything's been in plain sight for decades in Matlab.

Julia's an awesome thing though, I hope it gets more traction...

~~~
apl
Virtually none of that is in fact unique to MATLAB or even a strength in the
first place.

    
    
      > Matlab is awesome above all else because the design is
      > coherent. Both the syntax and the standard libraries.
    

It certainly is coherent, and also consistent, but only in the weakest and
least interesting sense. Namely, everything's about equally messy. Namespaces
are non-existent in the standard library and clumsily realised otherwise. OOP
remains rudimentary and feels as tacked-on as it happens to be. The one-
function-per-file system ruins everyone's day. No standard arguments (and
checking _nargin_ loses its appeal rather fast), and shitty inlined pseudo-
lambdas.

    
    
      > It is extremely quick to whip up anything and then turn
      > that into a script and then into a software with
      > functions
    

No different from R or Python, and most of the time a genuine weakness; it's a
key reason for scientific/engineering code being as ad hoc and convoluted as
it is.

    
    
      > If you as a newbie create a custom function for a
      > scalar, there's good chance it will work for vectors or
      > n-matrices automatically.
    

Vectorising functions in Python (that is, _NumPy_ ) is about as
straightforward.
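
As a rough illustration of that point (a sketch, not a benchmark; the function names are invented here), a NumPy function written with a scalar in mind usually accepts arrays unchanged as long as it sticks to arithmetic operators. Branching on the value is the usual exception, where something like `np.where` is needed:

```python
import numpy as np


def celsius_to_fahrenheit(c):
    # Only arithmetic operators, so scalars and arrays both work.
    return c * 9.0 / 5.0 + 32.0


def clip_negative(x):
    # A plain `if x > 0` would fail on arrays; np.where selects elementwise.
    return np.where(x > 0, x, 0.0)


scalar_result = celsius_to_fahrenheit(100.0)
array_result = celsius_to_fahrenheit(np.array([0.0, 37.0, 100.0]))
matrix_result = celsius_to_fahrenheit(np.zeros((2, 3)))
clipped = clip_negative(np.array([-1.0, 2.0]))
```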

    
    
      > It's also much much faster than some looping in another
      > scripting language.
    

Nah, MATLAB is now fairly good at unrolling and optimising such loops. Don't
worry too much about vectorising every single bit of your algorithm.

    
    
      > There's help which actually returns something sensible
      > when you type help
    

That's also true in the cases of both Python and R. Long story short: Most of
your perceived advantages aren't unique, and that comes on top of MATLAB's
exorbitant pricing schemes and _extremely_ dubious language design. Trust me,
if you think MATLAB's a particularly well-designed language for anything other
than linear algebra, you owe it to yourself to check out alternatives and
other languages.

~~~
Gravityloss
True, Matlab has its limitations, but those are partly unavoidable. If you
want to build a large object oriented program, you often use something more
heavyweight anyway. But that heavyweight language (or framework) is usually
not so quick to build something in anymore, because your heavyweight
structures are just in the way in the earlier phase.

I tried Python and Numpy, and the vectors, matrices and all that felt just
tacked on and the syntax was much more complex compared to Matlab. Maybe it's
changed since. Also in Scilab the type conversions and function overhead are a
nuisance. Every time you edit a script or function, you have to specifically
reload it before running it. Makes rapid prototyping about three fold as time
consuming. Would it be hard to make the software notice I actually edited
something?

Many people actually want to solve problems, and they just end up creating a
program as a side product. They do not set out to study libraries and do not
want to actually write any code that is not directly related to the problem
they are solving.

It's why Matlab is able to charge the price. It sometimes saves time. Some of
the users are not primarily software developers but are quite educated and
intelligent and their salary is not small.

~~~
marshallp
So you're implying anyone who is a software developer is not educated or
intelligent, nor has a large salary. I think a lot of people would beg to
differ.

In reality, matlab only exists because of inertia. It's the same reason why
microsoft windows is still around. There's no substance behind it.

~~~
tomrod
He wasn't implying that. You've inverted his statement. Many economists and
engineers I know care less about how they code and care more about getting a
solution to the model at hand--this seems to be the poster's point. The
implication is exactly what he exposited, whereas the logical inverse is what
you've mistakenly deduced as the implication.

~~~
marshallp
His last paragraph is not clearly worded then. It comes off as a thinly veiled
insult to software developers.

~~~
gjm11
Perhaps it comes off that way _to you_ but I'd be willing to bet it doesn't to
_the great majority of readers_ , because it simply isn't saying what you say
it's saying.

~~~
marshallp
OK, but he's worded it wrong, it gives an impression of snarkiness. Maybe
he's saying that matlab users can't program well but are still
intelligent/well paid (but that doesn't really make sense since numpy is
equally easy and an intelligent/educated person wouldn't find programming
hard). Anyway, maybe I misread it and you're right.

edit: great, I was getting points before, and then you come along with your
italics.

~~~
tomrod
It's all about the illiquid karma ;). No worries.

------
pav3l
Here is a nice 4-year old still active discussion on pros and cons of
different data analysis technologies:
[http://brenocon.com/blog/2009/02/comparison-of-data-
analysis...](http://brenocon.com/blog/2009/02/comparison-of-data-analysis-
packages-r-matlab-scipy-excel-sas-spss-stata/)

~~~
tomrod
HellMcFly--you're hellbanned so I couldn't reply directly. What is MPlus?

~~~
pav3l
Mplus is quite popular in social sciences. From what I understand its main
functionality is fitting Latent Variable Models and structural equation
modelling. I've never used it myself, but it can in fact do things for which
it is hard to find R packages at this point.

------
travisoliphant
I think the article tagline would be better "Domain Specific Languages for
data analysis". Fortunately, the article does mention Python which is critical
because new people might not recognize just _how_ prevalent Python is for
solving data analysis problems after reading this. The great work of the SciPy
community has enabled Python to be used for _all_ of the things that Matlab,
R, and Julia can do. In addition, Python can integrate easily with these
languages, so if you are a data analyst you need to learn Python.

~~~
xaa
> The great work of the SciPy community has enabled Python to be used for all
> of the things that Matlab, R, and Julia can do.

As much as I hate R and love Python, this is not entirely true (unless you
count rpy2 as part of "Python"). R has many more statistical models and better
plotting capability compared with Python. It also has a lot of domain-specific
packages (for example, Bioconductor) that are not available in Python.

~~~
dbecker
Though Python doesn't have the library support that R has, it far exceeds
what's available in Julia (and, depending on what you are looking for, in
Matlab as well)

~~~
tomrod
Sure. But Julia is brand new and already supports C/fortran libraries.

~~~
malkarouri
.. which are supported by all the environments in question

------
scottfr
Personally I'm in love with R's data.frame. It allows very concise, robust and
elegant manipulation and subsetting of a data set.

I wish every language would have such a built-in object type, I definitely
feel its loss when I manipulate data in other languages such as Javascript or
Mathematica.

~~~
dj_axl
> Personally I'm in love with R's data.frame. It allows very concise, robust
> and elegant manipulation and subsetting of a data set.

The performance is terrible though. For data of more than ~10,000 observations
SQL is much better performance wise, is more robust, and is as good at
subsetting. Although it's maybe not as elegant for everyone's definition of
elegant.

~~~
minimax
What dataframe operations do you find to be slow? Usually I'm able to get huge
performance wins by rewriting my slow R code in a loop free way (*apply and
friends).

------
tikhonj
I wonder if there is room for some smaller languages optimized specifically
for data analysis. In particular, I wonder how a carefully designed non-
Turing-complete language would fare.

That would be a really cool project to work on: design a minimal language for
expressing most types of data analysis at a higher level. If the language is
sufficiently small and simple, I could see some very powerful tooling being
possible for it.

Perhaps it might make sense to go even more specific: have a small language
designed not just for data analysis but for analysis in a very specific
vertical (say finance or bioinformatics). It would be awesome to let people
express their ideas in terms of the domain and not worry about low-level
details like loops.

~~~
pav3l
It seems like a good idea, but I wonder how actually useful highly specialized
programming languages would be. Why?

1) Most data analysis tasks boil down to roughly the same things: accessing
the data source --> data cleaning -->simple transformations -->
(optional)stats/fitting/ML/specialized procedures-->pretty pictures and
reporting.

2) Not everyone wants programming to be the main component of their job.

People who can take advantage of the flexibility that programming offers can
usually take advantage of existing technologies. People who don't enjoy coding
will always look for off-the-shelf solutions that have pretty GUIs with magic
buttons that solve all their problems. I just don't think there is a huge
market in between to be filled... in the domains that I've been exposed to
anyway.

~~~
tel
I disagree with your supposition. I think highly specialized languages exist
and are highly useful to non-programming communities. I think there is plenty
of proof of their usefulness and room for growth.

For instance, consider illustrator products or d3? Both of these are
specialized ("deep") tools for creating pictures that I've used extensively in
the "pretty pictures and reporting" stage you outlined.

Also of serious note are BUGS[1], JAGS[2] and (recently) Stan[3] as small
semi-declarative languages for MCMC model building, fitting, and checking.

SQL is an obvious example of a component of the "simple transformations" step.

[1] BUGS <http://www.mrc-bsu.cam.ac.uk/bugs/> [2] JAGS <http://mcmc-
jags.sourceforge.net/> [3] Stan <http://mc-stan.org/>

------
dbecker
When introducing python, the author writes "Despite the obvious advantages of
MATLAB, R, and Julia, it’s also always worth considering what a general-
purpose language can bring to the table."

Even with thousands of hours of experience in Matlab, R and Python... I'm not
sure what "obvious advantage" Matlab and R share over Python.

~~~
tomrod
For me, depends on how new a person is to the language. Numpy certainly has a
learning curve after spending a long time in Matlab.

~~~
dbecker
I hope the "obvious advantage" the author is speaking of isn't "it takes a
while to learn numpy if you are used to Matlab."

That would be a pretty weak argument in my opinion.

~~~
tomrod
I agree :D

------
lorenzfx
python fanboy here: "[python is] not as tuned to numerics as MATLAB": if you
build numpy with ATLAS there is, in my experience, hardly ever any noticeable
speed difference between numpy and MATLAB

~~~
aleyan
" Python a compelling alternative: not as tuned to numerics as MATLAB, or to
stats as R, or as fast or elegant as Julia "

The part about python not being as fast as Julia jumped out at me. Wes McKinney's
benchmarks show that python is faster than Julia for numerics:
<http://wesmckinney.com/blog/?p=475>

EDIT: should not have said "python faster than Julia". They are comparable
because the slow bits get done in BLAS anyway.

~~~
StefanKarpinski
A couple of nits...

Cython is actually what is faster than Julia in Wes' comparison, not Python.
Cython looks kinda, sorta like Python, but it is actually a static language
with C-like types (but quite different syntax for those types), no
polymorphism, and, afaict, ill-defined semantics. The best answer I seem to
get about Cython's semantics is that Cython's semantics are whatever it does.
I'm not alone in this complaint – Travis Oliphant expressed a similar concern
at this year's SciPy (in this panel
[<http://www.youtube.com/watch?v=7i2vhoQY-K4>], if I recall correctly), which
is part of his motivation to work on Numba [<https://github.com/numba/numba>].

If you look at the comments on Wes' post, when I used the dot(x,y) function,
which ships with Julia and uses a BLAS to compute the inner product just like
the fastest "Python" version does, Julia is equally fast. That stands to
reason – they're both just calling a BLAS.

Finally, that blog post is months old – since then Julia passed the milestone
of being no slower than 2x C++ on its microbenchmarks suite
[<http://julialang.org/>]. That's not a guarantee that all code is that fast,
but most things we see can be pretty easily tweaked to get there
(counterintuitively for those coming from Matlab, Python or R, usually by
_devectorizing_ the code rather than vectorizing it). And of course, there's a
lot of room for improving Julia's performance, the compiler is still quite
young and there are many optimizations that we haven't implemented. Basically,
there's nothing but work standing in the way of reaching C or Fortran's speed
across the board.

~~~
mitmatt
I just ran Wes's benchmarks (not the BLAS call versions) on my machine with a
Julia I built on 10/13 (17c3c13), and the timings have indeed improved. For
the details, see this gist: <https://gist.github.com/3901139> (including the
comment I posted on it).

The highlight is

numpy: (x * y).sum() => 41.1 ms

julia: inner(x,y) => 37.4 ms

julia: x*y => 19.5 ms

cython: inner(x,y) => 13.8 ms

The numpy and Julia versions are much easier to write and run.
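
For anyone who wants to see what those variants compute, here is a minimal Python sketch of the same inner product three ways. The array size and seed are arbitrary, `inner_loop` is a hypothetical stand-in for the hand-written `inner(x,y)`, and no timings are claimed since they depend on the machine and on the BLAS that NumPy was built against.

```python
import numpy as np


def inner_loop(x, y):
    # Explicit loop, the shape of code that Cython or Julia would compile.
    total = 0.0
    for a, b in zip(x, y):
        total += a * b
    return total


rng = np.random.default_rng(0)
x = rng.standard_normal(1000)
y = rng.standard_normal(1000)

elementwise = (x * y).sum()   # allocates a temporary array, then reduces
blas = np.dot(x, y)           # for float arrays this usually calls into BLAS
looped = inner_loop(x, y)     # interpreted loop, slow in plain Python
```

All three give the same number up to floating-point rounding; the difference is where the loop runs.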

Disclaimers: I've never written or built cython code before just now, and I
think Julia is the coolest.

EDIT: whoops, missed the most important one (inner() written in pure Julia).
Added it. Any thoughts on why inner() in Julia isn't faster?

~~~
StefanKarpinski
Nice. Thanks for running those. The reason inner isn't faster is probably that
we do bounds checks on every array access. This is surprisingly inexpensive on
modern hardware but it still takes some time. We're working on a couple of
things to address this: generating code so that llvm can more easily hoist
bounds checks out of loops, and allowing turning bounds checking off entirely
for blocks of Julia code.

------
myspy
I have to create figures with Matlab and that's a pain in the ass. Changing
XTickLabels kills another part of the figure, and in general it's very hard
to do a little more with figures.

But the basic data analysis is fine. The IDE has awful code completion and
lacks more refinement in the editor.

~~~
keypusher
Python has a wrapper for matlab-style graphs called matplotlib, if you are
interested in something else.

------
StefanKarpinski
This is a really excellent and well-balanced article. Very much captures the
pluses and minuses of these various systems for data analysis.

------
rcthompson
One of my bioinformatics courses "required" MATLAB because the class project
was based on a simulation framework called the COBRA Toolbox which was
developed in MATLAB[1]. I didn't know who to ask about obtaining a MATLAB
license, so instead I just got it to work in Octave and used that. I was
pleasantly surprised at how little I had to tweak before the framework just
worked in Octave, given that as far as I know everyone in the lab that
develops the framework just uses MATLAB.

[1] <http://opencobra.sourceforge.net/openCOBRA/Welcome.html>

------
prakashk
Perl was mentioned in the article, but PDL (Perl Data Language) wasn't.

<https://metacpan.org/module/PDL>

 _PDL is the Perl Data Language, a perl extension that [...] includes fully
vectorized, multidimensional array handling, plus several paths for device-
independent graphics output._

 _PDL is fast, comparable and often outperforming IDL and MATLAB in real world
applications. PDL allows large N-dimensional data sets such as large images,
spectra, etc to be stored efficiently and manipulated quickly._

For integration with R, there are Statistics::R
(<https://metacpan.org/module/Statistics::R>) and Statistics::useR
(<https://metacpan.org/module/Statistics::useR>)

------
elchief
Everyone loves to shit all over Java, but Mahout, RapidMiner, Weka, Hive,
HBase are all written in it.

~~~
lsiebert
I've used Weka, and RapidMiner once. As I recall, RapidMiner seemed to be
general purpose, but lots of posts were for using it for data mining stock
data to build a model.

I think it would be interesting to see breakdowns of different software, and
where they are used. Often times it seems to me that people just use the tools
their peers and co-workers use, and people tend to learn to like whatever they
use most.

------
tvladeck
Thought I'd ask since I'm learning Clojure - are there experiences worth
sharing re: using Incanter in these types of settings?

~~~
paulbunn
I'm also learning Clojure and have been playing with Incanter. Seems quite a
decent statistical library/environment. I had a few issues with lazy
evaluation with the dynamic charting functions, but I think that has more to
do with my inexperience with Clojure than a problem with Incanter. Also, I'm
not too sure how active the project is?

------
agentq
no love for J?

~~~
zem
J is more of a general-purpose array language than one specialised for
numerical work.

------
zem
surprising omission at the end - any mention of scipy should at least include
a pointer to sage as well.

