
Python, Machine Learning, and Language Wars. A Highly Subjective Point of View - Lofkin
http://sebastianraschka.com/Articles/2015_why_python.html
======
idunning
As someone who almost exclusively uses Julia for their day-to-day work (and
side projects), I think most of the author's thoughts about Julia are correct.
I think the language is great, and using it makes my life better. There are
some packages that are actually better than any of their equivalents in other
languages, in my opinion.

On the other hand, I've also got a higher tolerance for things not being
perfect, I can figure things out for myself (and luckily have the time do so),
and I'm willing to code it up if it doesn't already exist (to a point).
Naturally, that is not true for most people, and that's fine.

The author isn't willing to take the risk that Julia won't "survive", which is
fair. It's definitely not complete yet, but it's getting there. I am confident
that it will survive (and thrive) though, and continue growing the not-
insubstantial community. I have a feeling the author will find their way to
Julia-land eventually, in a couple of years or so.

~~~
rasbt
Thanks for the comment (I am the author of this article).

> I have a feeling the author will find their way to Julia-land eventually, in
> a couple of years or so.

I have a strong feeling that this will eventually happen :). In an ideal, less
busy world, I would love to use Julia alongside Python to explore and battle-
test it further, or even develop useful packages, libraries, and functions for
it. The truth is, I am currently more focused on the scientific problem
solving and simply lack the time to do that :(. When I say that Python works
for me, I mean that I am currently happy since it can do everything I need;
however, this doesn't mean that Julia couldn't do certain things better ;).

Anyways, I really like your comment. I am wondering if you would be okay with
me including it in an "Other people's experiences and opinions" section at the
bottom. I think this would be extremely helpful for people who are new to the
"data science field" -- my article is strongly biased towards Python, as you
noticed :P

~~~
dnautics
I think that the popularity of Julia is exploding (but I'm biased -- I'm
writing an... interesting library for Julia right now).

"There is really nothing wrong with R"

I think there is one thing wrong with R: its name. It's pretty much impossible
to quickly google for help on it.

------
leni536
And there is nothing wrong with C++. For linear algebra I use the Armadillo
library, and it's really a nice wrapper around LAPACK and BLAS (and fast!).
For some reason scientists are somewhat afraid of C++, and for some reason you
"have to" prototype in an "easier" language. Sure, you can't use C++ as a
calculator the way you can an interpreted language, but I see people getting
stuck with their computations in the prototyping language and never bringing
them to a faster platform.

Point being: C++ is not hard for scientific calculations.

~~~
rasbt
I agree with you. However, note that many people who use Python for writing
scientific code make use of C/C++ in one way or another (aside from NumPy,
SciPy, and Theano). For example, many people write the "most intensive"
computations in C/C++/Cython if they call those functions frequently --
Python becomes a wrapper. One example that pops into my mind is khmer
([https://github.com/dib-lab/khmer](https://github.com/dib-lab/khmer)).
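As a rough illustration of why the "most intensive" parts get pushed down to C: the same reduction is far faster when its loop runs in compiled code (here via NumPy, standing in for a hand-written C extension) than in the interpreter. A toy sketch, not code from khmer:

```python
import timeit
import numpy as np

data = np.random.rand(1_000_000)

def python_sum(xs):
    # The loop body executes in the interpreter, one opcode at a time.
    total = 0.0
    for x in xs:
        total += x
    return total

# data.sum() dispatches to a single C loop inside NumPy -- Python is
# just the thin wrapper around it, as described above.
t_py = timeit.timeit(lambda: python_sum(data), number=3)
t_np = timeit.timeit(lambda: data.sum(), number=3)
print(f"pure Python: {t_py:.3f}s, NumPy (C loop): {t_np:.3f}s")
```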

~~~
srean
This is a common refrain: drop down to C, C++, or Fortran for the computation-
intensive parts. It works, but only to a degree. The inefficiencies lie in the
vectorization semantics of the host language(s), which lead to extra copies
and extra levels of indirection. So this dual-language mode of operation
typically does not approach the performance one could have obtained by
disposing of the baggage entirely. Usually, in the quest for better speed, the
host language ends up handling little more than I/O, as one moves
progressively larger portions of the application into the C, C++, or Fortran
part of the code. A reason I like Julia is that I can largely avoid this
dual-language annoyance and still enjoy the succinctness of pithy vectorized
expressions, using
[https://github.com/lindahua/Devectorize.jl](https://github.com/lindahua/Devectorize.jl)
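The extra copies from vectorization semantics are easy to demonstrate in NumPy (used here purely as an example of a vectorizing host language): each step of a chained vectorized expression materializes a full-size temporary array, which in-place operations can avoid. A minimal sketch:

```python
import numpy as np

n = 1_000_000
a, b, c = np.random.rand(n), np.random.rand(n), np.random.rand(n)

# One-liner: a * b allocates a hidden full-size temporary, then
# "+ c" allocates a second array for the final result.
out1 = a * b + c

# Devectorized-in-spirit version: one explicit output buffer,
# no hidden temporaries.
out2 = np.multiply(a, b)   # allocate the result buffer once
out2 += c                  # accumulate c in place

assert np.allclose(out1, out2)
```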

~~~
nxb
Compared to the GPU, which is often 50x to 100x faster, CPU-side C/C++
performance is terrible. The goal has shifted, and continues to shift, in that
direction.

C/C++ is no longer best for speed -- not even close.

Now, all that matters is which language has the libraries that make it easiest
to get custom code onto the GPU. Python and Lua seem to be winning there, by
far.

~~~
leni536
>Now, all that matters is which library makes it easiest to get custom code
onto the GPU. Python and Lua seem to be winning there, by far.

This is interesting. How is it possible that Python and Lua have more
efficient wrappers around GPU libraries? There are many GPU libraries for
C/C++ too; Armadillo can use NVBLAS as a backend, for example. I'm not sure I
get your point about C/C++ being slow.

~~~
nxb
It's not about wrapping. The real power is in the cross-compilation of
expressions and entire complex data pipelines, from a simple-as-possible
high-level language into GPU code. That's the power at the core of, e.g.,
Theano.
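To make the idea concrete, here is a toy sketch (pure Python, and emphatically not Theano's actual API) of what "cross-compilation of expressions" means: the user composes a symbolic graph, and the framework compiles the whole graph as one unit -- in Theano's case into optimized C/CUDA rather than the Python lambda emitted here:

```python
class Node:
    # Operator overloading builds a graph instead of computing values.
    def __add__(self, other): return Op('+', self, other)
    def __mul__(self, other): return Op('*', self, other)

class Var(Node):
    def __init__(self, name): self.name = name
    def expr(self): return self.name
    def inputs(self): return {self.name}

class Op(Node):
    def __init__(self, sym, left, right):
        self.sym, self.left, self.right = sym, left, right
    def expr(self):
        return f"({self.left.expr()} {self.sym} {self.right.expr()})"
    def inputs(self):
        return self.left.inputs() | self.right.inputs()

def compile_graph(node):
    # A real framework optimizes the graph here and emits a fused GPU
    # kernel; this toy just generates one Python function covering the
    # entire pipeline in a single step.
    args = ', '.join(sorted(node.inputs()))
    return eval(f"lambda {args}: {node.expr()}")

x, y = Var('x'), Var('y')
f = compile_graph(x * y + x)   # whole expression compiled at once
print(f(x=2, y=3))             # 8
```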

~~~
srean
Compiling high-level instructions to different hardware backends is hardly an
exclusive feature of Python and Lua libraries. Google would swamp you with
hits if you were to search for it.
~~~
nxb
Show me one C/C++ library that competes with Theano or Torch7?

Google / Facebook and many other huge companies are using Theano and Torch7 in
production, at scale. The ML industry has been continuously moving in this
direction for years now.

On these optimized ML systems, only a tiny fraction of CPU time is spent
outside of the GPU. The goal in many of these companies is to migrate all
tasks that can be done on GPUs to GPUs, as soon as possible. It's far faster
and more cost efficient.

~~~
novocaine
Do you have some sources demonstrating that Google and Facebook are using them
in production at scale? My impression was that presently these were more for
research and prototyping.

I would have thought that if you were going to run prod systems on the GPU,
you would actually write CUDA (C++) or similar to avoid the inefficiency of
the abstraction layer.

(Also, this comment is bordering on the uncivil.)

------
rm999
I switched from mostly using R to Python about a year ago for gluing together
my data pipeline (from data source all the way to production models and
frontends/visualizations). It hasn't really impacted what I'm capable of doing
or my productivity, except for the standard extra googling that comes in the
first couple of years of using any language.

The main reason I went for Python is purely practical: it's a language people
outside my team will respect and deal with. It makes it easier for me to
collaborate in many different ways: share tools with other teams, transfer
ownership of my code, get help when I need it, etc. Data science at some
companies has the reputation of "hack something together and throw it over the
wall for someone else to deal with". In my experience, R only furthers this
reputation. Which is too bad, because it's really great at what it does.

~~~
drauh
Yeah, there is a large community of Python users in scientific computing. It's
great.

I like well-established languages with a large user base.

So I was dismayed by Big Data Genomics' ADAM project's choice of Scala, which
has almost no uptake in the genomics/bioinformatics community.

They chose it because they run on Spark. But Spark has an excellent Python
binding.

~~~
justthistime_
Python's days have come and gone.

Computation has grown more complicated. The field needs real computer
scientists and a real language that supports real development, not scientists
who have learned just enough Python to automate running some 20-year-old
Fortran code.

~~~
Lofkin
So Julia then :)

------
misiti3780
Octave/Matlab are "great", but good luck trying to integrate them into a
production web application. Since you can't really do that, avoid using them
unless you are fine with implementing the same algorithm twice. Matlab
licenses cost money, too, and the toolboxes cost additional money.

R is useful because there are a lot of resources, as it has been around for so
long and is used by a large portion of the stats community. It also has a lot
of useful libraries that have not been ported to other languages yet
(ggmap!!!). But you still run into the same problem: you cannot integrate R
into a production web application.

I am pretty sure Hadoop streaming does not support R, Octave, or Matlab
either.

~~~
RA_Fisher
I'd like to kindly challenge the notion that you can't integrate R into a web
application. I've started using R to power jobs that are used by a large web
application. The R packages httr and RCurl make it pretty easy to make HTTP
requests (enabling me to send things to a web server to be consumed into a
database and run by back-end code). It's also possible to prepare data in R
and then send it to a store like S3 with a system("s3cmd sync some-data
s3://some-data") call. I've also been using Python a good bit lately. I don't
see either as having a universal advantage for a data pipeline.

~~~
haddr
I was once quite surprised that R was missing any kind of RESTful service
package for exposing R functions as a REST service. What I was looking for was
basically some way to invoke a couple of R functions from the web application.
In the end I managed to do it with some JRI (part of rJava) bindings.

Maybe it's a good idea to implement such a thing.

------
geomark
I just completed the Coursera data science track, which took me from a
complete R newbie to being at least somewhat proficient. Having previously
used Python for quite a bit of web programming, I disliked R at first except
for its power in statistical programming. But I've since discovered a number
of great R packages that make it a pleasure to use for things I would normally
turn to Python for. For example, I recently discovered the rvest package for
web scraping.

Data visualizations with R seem vastly superior, unless I am missing something
with Python (highly likely). And putting up a slick statistics app is easy
with Shiny or RStudio Presenter. But R can't really scale to a large
production app, isn't that right?

So I feel I need to keep working with both Python and R.

Added: That's a nice list, Lofkin. Thanks. Also, in the article he says that
Python syntax feels more natural, which I also felt. But then I started to use
things like the magrittr and dplyr packages in R, which give you nice things
like pipes, and that feeling starts to ebb.

~~~
Lofkin
For stats plotting in python:
[https://github.com/mwaskom/seaborn](https://github.com/mwaskom/seaborn)
[https://github.com/yhat/ggplot](https://github.com/yhat/ggplot)

For stats plotting and web apps in python:
[https://github.com/bokeh/bokeh](https://github.com/bokeh/bokeh)

For calling r libraries in python:
[https://pypi.python.org/pypi/rpy2](https://pypi.python.org/pypi/rpy2)

For out of core datasets in python:
[https://github.com/blaze/dask](https://github.com/blaze/dask)
[https://github.com/blaze/blaze](https://github.com/blaze/blaze)

~~~
rasbt
Nice collection, let me add one more item to this list:

Seaborn: statistical data visualization:
[http://stanford.edu/~mwaskom/software/seaborn/](http://stanford.edu/~mwaskom/software/seaborn/)

------
a_bonobo
>I think it [Perl] is still quite common in the bioinformatics field though!?

That's true - many day-to-day tasks in bioinformatics are more or less plain-
text parsing [1], and Perl excels at parsing text and quickly applying regular
expressions. "My" generation of bioinformaticians doing data cleanup and
analysis (ages 20-30) uses Python, sometimes because plotting is nicer, the
language is easier to get into, it's more commonly taught in universities, or
for other reasons -- people older than that normally use Perl.

Both BioPython and BioPerl are extremely useful.

[1] Relevant quote from Robert Edgar: "Biology = strcomp()", from
[https://robertedgar.wordpress.com/2010/05/04/an-unemployed-g...](https://robertedgar.wordpress.com/2010/05/04/an-unemployed-gentleman-scholar/)

~~~
rasbt
Thanks for the insights! Also here, this comment would make an interesting
addition to a "Feedback" section at the end of the article to give people a
broader view on this topic. May I have your permission to post your comment
below the article?

~~~
a_bonobo
Of course you have my permission :)

------
sampo
Andrew Ng said in the Coursera Machine Learning class that, in his experience,
students implement the course homework faster in Octave/Matlab than in Python.

But yes, the point of that course is to implement and play around with small
numerical algorithms, whereas the linked blog is about someone who mainly
calls existing machine learning libraries from Python.

Ref.
[https://news.ycombinator.com/item?id=4485877](https://news.ycombinator.com/item?id=4485877)

~~~
jacobolus
Interesting. In my own experience trying to implement the same image
processing algorithms in Matlab vs. numpy, the work took about the same amount
of effort any time arrays were limited to 1-2 dimensions, all the code was
simple numerical stuff, and it wasn’t necessary to break the code up into
multiple functions.

The Matlab one-file-per-function thing, the lack of namespaces, and general
lack of code structuring primitives makes it much less pleasant than Python
for programs bigger than about 100 LOC though.

Dealing with higher-dimensional arrays, more sophisticated plotting, data
munging, string processing, interfaces with external systems, etc., all left
me banging my head in Matlab, whereas Python makes it all a breeze.

Numpy’s broadcasting feature is also super nice, compared to wrapping
everything in bsxfun calls in Matlab.

I wonder how much the @ operator in Python 3.5 will help students. Hopefully
numpy can deprecate and phase out their "Matrix" object, and end the confusion
about the meaning of basic operators.
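For readers who haven't met the two features mentioned above, a small sketch (assuming NumPy and Python >= 3.5 for the @ operator):

```python
import numpy as np

A = np.arange(6.0).reshape(2, 3)

# Broadcasting: subtracting a (3,)-shaped row of column means from a
# (2, 3) matrix "just works" -- no bsxfun(@minus, A, mean(A)) wrapper.
centered = A - A.mean(axis=0)

# PEP 465's @ operator: matrix multiplication on plain ndarrays, so
# elementwise * and matrix @ stay unambiguous without numpy.matrix.
B = A @ A.T     # (2, 3) @ (3, 2) -> (2, 2)

print(centered)
print(B)
```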

~~~
sampo
I would imagine that, for a beginner, the difference in Python between a list
and a numpy array can be confusing. And in numpy, column vectors and row
vectors are different things, whereas in Octave/Matlab "everything just works"
in the do-what-I-mean sense.

> _The Matlab one-file-per-function thing_

By the way, Octave does not have that limitation.

~~~
TheBlackCat13
> And in numpy, column vectors and row vectors are a different thing. Whereas
> in Octave/Matlab "everything just works" in the do-what-I-mean sense.

I think this is actually one of MATLAB's biggest flaws. Without a true 1D
array like numpy has, there is no way in MATLAB to tell the difference between
a 1D sequence of values, and a 2D sequence of values with only one value along
one of the dimensions.

This has led function developers to try to guess, but they guess
inconsistently. Some functions treat row and column vectors differently; some
treat them the same. Of those that treat them the same, some return them with
the same orientation, while others force a particular orientation. Some
operations ignore dimensions (length), others don't (for loops). Some maintain
dimensions (size), some don't ([:]).

So everything may seem to work, until your code that has been working fine for
years suddenly breaks, and you realize it is choking up because one of your
experiments has only one trial, or one of your experiments has multiple trials
each with one result, and some of the functions you are using start reacting
differently to this. Then you have to go through each function and figure out
on a case-by-case basis how it handles row and column vectors.

Or worse yet, it seems to run fine, but is silently doing the wrong thing.
Which you probably would never know, because most MATLAB code isn't unit-
tested.
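The distinction being made here is concrete in NumPy, where a true 1D array is a different object from either 2D orientation -- a sketch:

```python
import numpy as np

v   = np.array([1.0, 2.0, 3.0])   # true 1D sequence: shape (3,)
row = v.reshape(1, 3)             # explicit row vector: shape (1, 3)
col = v.reshape(3, 1)             # explicit column vector: shape (3, 1)

print(v.shape, row.shape, col.shape)

# MATLAB has no analogue of v: every vector is secretly 2D, so library
# functions must guess whether a 1xN or Nx1 input "means" a sequence.
# In NumPy the caller's intent is explicit, and it changes the result:
print((v * v).shape)       # (3,)   plain elementwise product
print((col * row).shape)   # (3, 3) broadcasting gives an outer product
```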

------
zzleeper
Quite interesting post. I feel that a lot of the numerical Pythonistas are in
the same spot:

They tolerate most languages, but find R's syntax a bit unnatural, find Matlab
lacking when trying to go beyond pure matrix stuff, and are waiting to see if
Julia picks up (which, from what I can tell, it seems to be doing).

~~~
GFK_of_xmaspast
One of the hats I wear is 'does numerics in python'. R is fine when you're
playing to its strengths, matlab is an abomination, and I have absolutely no
interest in julia.

~~~
sgt101
Why no interest? If it is something that could make you more productive and
make your life better, then shouldn't you take an interest? Or do you mean "I
have looked closely at Julia and it's no good because..."? If so, the
"because" bit will be of interest to the Julia community. (It's at 0.4 now, so
there's lots of distance to go in its development before it becomes stable.)

~~~
GFK_of_xmaspast
I took a look at Julia a couple of years ago. As far as I could see, the only
reason to consider it over Python is the speed improvement, and if I'm in a
situation where Python isn't fast enough, I've already got C, C++, and Java to
reach for. (Also, at the time, the library support was lacking; I just
checked, and the graphs package, which I needed at the time, is still pretty
minimal.)

------
Adam_O
From the perspective of a student, most of the good online analytics/data
analysis/stats courses use R, so it is hard to get away from it while learning
the material. Once you get the base concepts down, switching to Python
shouldn't be hard. I think most people still prefer ggplot2 for visualization,
though. Whenever I use R I feel like a statistician; I can feel that 'cold
rigor' emanating from the language. But in the end I think it is advantageous
to wield both languages.

Also I really see Jupyter as a new standard for communication. Your narrative
and supporting code all in one place, ready for sharing.

~~~
rasbt
Yes, I think you are right. Out of curiosity, when I browsed Coursera's course
catalog, most data-science-related material seemed to be taught in Matlab or R
(however, there are exceptions, e.g., Klein's Linear Algebra class in Python).
Personally, I think that instructors shouldn't enforce a language requirement.
I believe that for big platforms such as Coursera, it shouldn't be too hard to
run the respective interpreter to check the code/answers uploaded by students.

~~~
digitalzombie
Most classes have to teach both the subject and how to program. Programming is
becoming an essential skill, so they have to choose a language to teach.

Also, the classes that chose R were, in my experience, non-CS classes; the
professors are from other disciplines. They just want a tool that solves their
needs quickly. An example is Princeton's stats class: the professor is a
humanities major. The class gave us tons of data, and we had to do ANOVA and
such, and we needed a computer to crunch numbers that couldn't be done by
hand. So he chose R, which he uses a lot.

------
Lofkin
Personally, I'm tempted to make the switch to Julia, but slow higher-order
functions, high churn in the core data infrastructure, and the lack of PyMC3
are keeping me on PyData for a bit longer. I have Numba to hold me over.

------
thanatropism
One thing missing here: Matlab syntax is actually very close to modern
Fortran. At least twice I've written Fortran code (for Monte Carlo
simulations, in different contexts) by rewriting Matlab code: adding types and
general verbosity, fixing the syntax of do-loops, etc.

~~~
jordigh
They've been trying to Javaify Matlab syntax for close to two decades now.
They're moving towards making everything an object, like in Java, and they're
getting pretty close to that.

~~~
sgt101
One of the lessons of Julia for me is that "everything is an object" is a
problem, not a boon. I think organizing code in modules, with sets of related
types and functions manipulating those types, allows more natural and modular
decompositions for reuse.

------
DrNuke
I love the hacking approach in the post: a tool is only a tool to do something
valuable, not the goal itself. The Python ecosystem is the right tool at the
right time nowadays, because of the data science explosion and the need to
interact very quickly with non-specialists.

------
dafrankenstein2
.NET's F# is also good... though maybe not a better alternative.

------
JuliaLang
Julia love!

