
How Python became the language of choice for data science - davmre
http://blog.mikiobraun.de/2013/11/how-python-became-the-language-of-choice-for-data-science.html
======
simonster
I started out with MATLAB, then played around with Python enough to realize
that it wasn't enough better than MATLAB to be worth switching. (My impression
was that Python was _almost_ flexible enough as a language to make writing
code that uses NumPy/SciPy feel natural, but it didn't quite achieve that.) I
finally switched to Julia, which is both faster and (IMO) more enjoyable to
write than either MATLAB or Python, although the current package ecosystem is
pretty small. I think the main difficulty in creating a new programming
language for numerical computing is that you need knowledgeable people to do
it, and most of those people are more interested in analyzing their data than
building a programming language (or core packages for a programming language)
to do it better.

My impression on two small bits of this article:

 _Python is also somewhat restrictive with what you can say on a single line.
In Matlab you would often load some data, start editing the functions and
build your data analysis step by step, while in Python you tend to have files
which you start from the command line (or at least that’s how I tend to do
it)._

I actually find that this kind of analysis is more conveniently done in
IPython (or IJulia) Notebook than in MATLAB.

 _I still have dreams of a plotting library where the type of visualization is
decoupled from the data (such that you can say “plot this matrix as a scatter
plot, or as an image”)._

This is the design goal of both R's ggplot2 and Julia's Gadfly.jl.

~~~
jmduke
Do you have any recommendations wrt getting started with Julia? It's a
language I've wanted to explore a bit, but I haven't been able to find any
good community resources or jumping-off points.

~~~
fphilipe
This "Learn X in Y minutes" snippet [1] showing a bunch of examples in Julia
is the best starting point. I always return to it, as it's usually much faster
to find what you're looking for there than by searching the docs.

1:
[http://learnxinyminutes.com/docs/julia/](http://learnxinyminutes.com/docs/julia/)

~~~
astrieanna
I wrote Learn Julia in Y minutes. Is there anything unclear or that you'd find
useful to have included/changed?

------
dvdt
I'm confused. Is Python really so far ahead of the competition as to be
considered the "language of choice for data science"?

My impression (and I am not a statistician/data scientist in my day job, so I
would very much like to hear opposing perspectives) is that the R ecosystem is
far more mature and widely adopted than scikit-learn for things like
regression, classification, clustering etc.

The author also cites expensive MATLAB licenses as a driving force behind
python adoption, but here too I'm skeptical. As a grad student, I get MATLAB
for free. But I switched to R/pandas for data analysis because R has a native
data structure for working with multidimensional datasets (i.e. data.frame).

To illustrate the utility of this, let's say you asked developers all over the
US for their zipcode and salary and recorded the results in salary.poll.data.
Here's an interesting question: what is the mean salary in each zipcode? In R,
all sorts of libraries make this computation concise and highly readable.
Using the excellent data.table package, you would do `salary.poll.data[,
list(mean.salary.by.zip.code=mean(salary)), by=zipcode]`.

No such libraries exist in popular usage for MATLAB. You'd have to roll your
own, or more likely, write a lot of crufty loops and conditional statements.
(Or use higher order functions with map/reduce/filter, which, by the way, you
would have to implement yourself).
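For comparison, the same aggregation in pandas is about as concise; here's a
minimal sketch with made-up data (the poll itself is hypothetical):

```python
import pandas as pd

# Hypothetical stand-in for the salary poll described above.
salary_poll_data = pd.DataFrame({
    "zipcode": ["94103", "94103", "10001"],
    "salary": [120000, 130000, 95000],
})

# Mean salary per zipcode, analogous to the data.table one-liner.
mean_salary_by_zipcode = salary_poll_data.groupby("zipcode")["salary"].mean()
print(mean_salary_by_zipcode)
```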

For me at least, just having the right data structures for working with data
makes R/pandas a clear winner for doing statistical analysis of data.

~~~
davmre
My impression is that R and MATLAB (and Julia) certainly still have their
advantages. But Python with pandas/scikit-learn/matplotlib is _almost_ as
good as R at data munging and exploration, and with numpy/scipy/Cython, as
good or better than MATLAB at complicated matrix calculations. Meanwhile,
Python has its own unique features, like iPython notebooks. So it's at least
competitive with R and MATLAB at a feature level, if maybe not for every
possible use case.

The thing that draws me towards Python is that it's a well designed, _general
purpose_ programming language. The syntax is sensible, the language allows
real OO and FP abstractions, you have easy access to basic data structures
like lists and hashmaps, and there's a huge ecosystem of third-party libraries
to build on. Things that are stupidly difficult in MATLAB and R, like
file/network IO, string processing, or building a GUI or web interface, are
straightforward in Python. If you've ever tried to run MATLAB on a cluster,
it's an absolute nightmare. You would never, ever think about building a
production system in MATLAB or R. But these things are easy in Python. There's
something that's just really nice about having access to first-class analytics
tools in the same language that you're building your systems in.

~~~
xixi77
_The syntax is sensible, the language allows real OO and FP abstractions, you
have easy access to basic data structures like lists and hashmaps, and
there's a huge ecosystem of third-party libraries to build on_

Other than the last one, all of these are available in Matlab and R as well;
and all three have a huge set of libraries -- the difference is about the
focus of those libraries imo. Python does have more general-purpose libraries,
just as R has more statistical packages.

I have certainly run a lot of Matlab on a cluster; in fact, I'd say safely
about 70% of the code I see running on clusters around here is Matlab. I've
also seen a
few production systems in (mostly) R -- I actually suspect that at least on
Windows, deploying an R system may be easier, since you just install R and
then use internal functionality to get packages you need; python (with all
necessary extensions) seems relatively tricky to get running.

All that being said, I do agree that if you are building a production system
where most of the code is related to interfacing with other systems, or GUIs,
etc., and the data analysis is a small and non-interactive part (i.e. no data
exploration), Python is a very reasonable choice if you can keep the whole
project in it.

------
knowtheory
Given that the author indicates that even he's an exception to his supposed
rule, I think I'd like to see a more comprehensive argument before buying this
particular line.

Python is definitely a major player in data processing (heck, I'd even just
be interested in how he's defining data science), but it's definitely not the
only game in town by a long shot.

~~~
chilldream
I recently saw this (which goes into far more depth):
[http://www.talyarkoni.org/blog/2013/11/18/the-
homogenization...](http://www.talyarkoni.org/blog/2013/11/18/the-
homogenization-of-scientific-computing-or-why-python-is-steadily-eating-other-
languages-lunch/)

Summary: "Constantly switching languages is a chore, and while Python isn't
the best at everything, it's the best at some things and good enough at almost
everything else."

~~~
jschulenklopper
Recently indeed, that link is the first reference in the OP's article :-)

------
washedup
The first line: "Nowadays Python is probably the programming language of
choice (besides R) ..."

Personally I am not ready to hand the title "language of choice" to Python,
although it is trending that way. We should give R credit where credit is due.
There is still a huge population of R users and code. Python has advantages in
that the syntax is easier to understand, and so is the structure (object
oriented versus scripting). What Python lacks is a simple setup. I think that
once Python becomes more accessible to everyone (as far as downloading Python,
setting up directories, packages, libraries, etc.), it will see huge leaps in
usage.

~~~
gms7777
I've heard good things about the anaconda distribution as far as setup and
everything goes for scientific computation purposes. Comes with most packages
you'd need:
[https://store.continuum.io/cshop/anaconda/](https://store.continuum.io/cshop/anaconda/)

I don't actually have it myself, but it does seem to be trying to solve the
problem you point out.

~~~
washedup
That's great! If this can match RStudio as an IDE, then Python is heading in
the right direction. Thanks for showing me this.

~~~
gallamine
Anaconda is a fabulous way of installing a majority of Python tools you'd need
for data science. It also includes Spyder - a Matlab-like IDE. Before you go
back to RStudio, you should at least check out the IPython Notebooks style
workflow. It grows on you. PyCharm is another IDE I've been looking at - it has
a free community edition too.

------
drakaal
People are always telling me why I should jump from Python to Node. This is
basically my argument against doing so.

Python, Matlab, Fortran, Cobol, will be around for a VERY long time because so
many of the smartest people THINK in these languages. The number and quality
of people who think in a language is more important than the number who
develop in it.

I don't yet think in Python. It is not where I learned programming. I am more
of a Lisp thinker, but for practical applications Python is a better choice.

I don't trust people who think in JavaScript. Or rather I don't like to bet on
them.

~~~
codygman
Well, for some practical applications Lisp/Racket could be a better choice.
If we categorize Lisps/Schemes/Haskells as "not for real world use" too
easily, we won't enjoy nearly as much innovation.

I also believe it's undeniable that parts of these languages would be a
godsend in some more "real world" languages.

There are two sides to the argument, however I'd like to caution against
dismissing languages as "not for real world use" too quickly since it's a
trend I've seen.

------
randomsearch
I've been using Python for what I guess you could consider as pre-processing
for data science.

One thing people might want to consider before investing time in Python is
that I found it to be quite memory inefficient: data structures take up a lot
of space, and the garbage collection didn't seem to be as effective as in
other languages (I spent some time studying/improving GC in JVMs). So if you're
dealing with large amounts of data and/or complex data structures, I wonder if
Matlab might be more appropriate (AFAIK R is also not very good at memory
management yet).

~~~
kyzyl
As a rule of thumb, if you run out of memory in python, you will also run out
of memory in matlab. I've done a lot of work in both and found that while
python's memory performance may not be ideal, at least when you run into
trouble in python you have options. With matlab I found that when I got that
dreaded "Out of memory" message at the prompt, there was little I could do. The
internals are completely opaque, shipping my code to C is a pain in the ass,
and there are very few language constructs to help you control how you use
memory.

In most cases running out of memory in matlab meant either making the problem
smaller or running it on a beefier machine. I think this is the reason why you
see a lot of labs at universities with machines that have 96GB of memory, even
though their datasets seem to be much smaller.

FWIW, as far as processing lots of data is concerned, python is not without
issues. If you do it naively you will run out of memory _really_ quickly. But
by picking your tools correctly you can go a long way. Use Pandas and/or
Sparse arrays whenever possible. Learn how numpy broadcasting operations
contribute to memory explosions. Take a gander at the source of that sklearn
method you're using, since it's often quite obvious that the particular
implementation will choke.
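To illustrate the broadcasting point: a one-liner that broadcasts a column
vector against a row vector silently materializes the full pairwise matrix,
which is where memory explodes (small sizes here; scale them up mentally):

```python
import numpy as np

a = np.arange(1_000, dtype=np.float64)  # 8 KB
b = np.arange(1_000, dtype=np.float64)  # 8 KB

# Broadcasting a column against a row materializes the full
# 1_000 x 1_000 intermediate: 8 MB from two 8 KB inputs. With
# 1_000_000-element inputs the same line would need 8 TB.
pairwise_diff = a[:, None] - b[None, :]
print(pairwise_diff.shape, pairwise_diff.nbytes)
```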

I've found that these days I try my best to avoid loading datasets into
memory. This is second nature for people who work with 'big data', but it's an
m.o. that takes some getting used to. That is, blocking and/or streaming your
data, and appropriately subdividing your problem for distributed computation.
It's worth mentioning that this problem with python is under active research
and development. The guys at continuum developed IOpro to deal with the issue
of memory efficiency when loading data, and to make streaming data from flat
files/S3/mongodb/whatever easier and more stable. Also, their (very young)
project called Blaze is meant to be a drop-in replacement for numpy, but is
designed for efficiency and specifically for dealing with out-of-core
computation. We'll see...
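The blocking/streaming idea above can be sketched with pandas' chunked CSV
reader; the file contents and column name here are hypothetical:

```python
import io
import pandas as pd

# Stand-in for a large on-disk CSV; in practice you'd pass a file path.
csv_data = io.StringIO("value\n" + "\n".join(str(i) for i in range(10)))

# Read in fixed-size chunks instead of loading the whole file, keeping
# only the running totals needed for the final answer in memory.
total, count = 0.0, 0
for chunk in pd.read_csv(csv_data, chunksize=4):
    total += chunk["value"].sum()
    count += len(chunk)

mean_value = total / count
print(mean_value)
```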

~~~
gallamine
I also recently learned that pandas can talk to a PyTables HDF5 data store
on disk. This should be fairly seamless and can help reduce some memory issues
with larger datasets. Yves Hilpisch did a nice talk on "Performance Python"
that goes into this:
[http://hilpisch.com/YH_Performance_Python_Slides.html](http://hilpisch.com/YH_Performance_Python_Slides.html)

~~~
kyzyl
Yes, I almost exclusively use pytables+HDF5 when I'm permitted. As I commented
on some thread earlier this week, it strikes a great balance between
simplicity, performance and flexibility.

------
joellarsson
It's not.

R vs some Python alternatives in Google Trends:
[http://bit.ly/1fpOR57](http://bit.ly/1fpOR57)

Pandas is looking good, but then we have Julia, Matlab and probably some
people still using SAS.

edit:

Comparing statistical packages by: Kaggle.com usage, Job posts, Activity on
blogs, mailing lists, Stack overflow, Google Scholar and a few more..

[http://r4stats.com/articles/popularity/](http://r4stats.com/articles/popularity/)

~~~
SwellJoe
I must be missing something...I see the Python options having a lot more
volume than R. What am I supposed to be seeing in this graph that means Python
isn't a (if not _the_ ) leading language in this space?

~~~
joellarsson
I guess there are better ways of comparing. I like the survey data and
kaggle.com data from my other link.

------
diminish
Octave, R, Julia, Python.. we have so many choices now.

~~~
rubidium
I think that's the real benefit. In the past 5 years, there's been an
explosion of available tools, which is good for everybody. I looked at Python
a while back, and it wasn't there yet. Now it seems to have arrived.

------
xfax
Any sufficiently fast language that allows the user to focus on the problem
rather than the syntax or semantics will do better than its peers. I for one
am looking forward to Julia maturing.

------
lightoverhead
I consider myself a data scientist in the bioinformatics field. I have to
deal with data at the several-hundred-GB scale every day. IMO, the best
toolkit so far is the combination of Perl and R, because these tools have the
richest packages/modules; you can do almost everything with them. As for
Python, I don't think it can deal with the data I have as efficiently as Perl.

~~~
Demiurge
PDL or Perl?

~~~
lightoverhead
PDL is the Perl package for numerical computing. Unless you're doing a lot of
mathematical computation, PDL is rarely needed in my daily routine.

------
bayesianhorse
Both Octave/Matlab and R are more "convenient" or elegant in certain precise
cases, like linear algebra, but Python provides a much more coherent
experience due to syntax and the type system.

Julia... well... has its merits, but the ecosystem isn't comparable. Julia
does not have Django, Guido or PyCon.

~~~
RivieraKid
I think it's just a matter of time until it replaces Python in data science.
(And I believe it's also the best designed general purpose dynamic language
out there.)

~~~
bayesianhorse
It would be a matter of quite a long time, then. Considering that it took the
numpy/pandas combination several years to almost overtake Matlab and R, and
that Python is a language and interpreter with two decades of history and
development, Julia should take five to ten years to become a serious
competitor, and the inertia of the Python community will be hard to break.

~~~
RivieraKid
Perhaps, it's hard to tell... Julia can be almost as fast as C, so it can
potentially replace many uses of C/C++ for scientific computing.

~~~
bayesianhorse
Julia can be almost as fast as C and still not have Django, Guido or
PyCon...

------
mathgenius
The lead developer on Elefant got poached by a high frequency trading shop in
2006. He was "discovered" showing his wares at pycon that year (vectorized
operations faster than anything else out there).

Sigh.

------
elchief
Hadoop? Java

HBase? Java

Hive? Java

RapidMiner? Java

Cassandra? Java

Neo4J? Java

Python 4 the win.

~~~
freyrs3
Python excels in the exploratory data analysis side of things, not so much on
the computation side. Thus we get tools like Pandas, NumPy and Scikit-Learn.

------
michaelochurch
Python, to me, is the all-around B+ language. It has good (but not great)
performance, language semantics, and runtime internals. As history showed
us, being good at a lot of things enabled the growth of a large community
(without corporate sponsorship) and an immense library ecosystem. So "all-
around B+" isn't a knock against it. If anything, it's to be admired.

Unfortunately, this means that most data science findings, when transmuted
into permanent production programs, don't stay in Python. Often Java or C++
are used instead.

Personally, that's why I think Clojure's got a real shot at being "the
language of choice for data science" in 2023. It has the power of Lisp, it
makes a lot of data-frame manipulations really easy, and because it sits on
top of the JVM, it can be "productionized" pretty easily (you might have to
write a couple functions in Java, for performance).

~~~
RivieraKid
After spending 50+ hours programming in both Julia and Clojure, I have to
disagree. Julia is more readable, faster, great for both imperative and
functional programming and generally I found it easier to get things done.

------
dschiptsov
Partly because it became the language of choice for teaching CS basics
instead of Scheme at good schools like MIT.

