
The Most Popular Language for Machine Learning - jfpuget
https://www.ibm.com/developerworks/community/blogs/jfp/entry/What_Language_Is_Best_For_Machine_Learning_And_Data_Science?lang=en
======
madenine
Ugh, indeed.com data.

Was not a great place to find data science jobs during my last job hunt, but
that might just be personal experience.

Beyond that, wouldn't there be some bias - in that the jobs that don't get
good candidates are going to have to be re-posted more often to maintain their
visibility, leading to an over representation of undesirable roles in the
data?
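
To make the re-posting concern concrete, here is a minimal sketch with
made-up numbers. The role counts and repost rates are purely illustrative
assumptions, not Indeed data, and `posting_share` is a hypothetical helper:

```python
# Hypothetical numbers, chosen only to illustrate the re-posting bias.
def posting_share(roles):
    """roles maps a label to (number of distinct roles, postings per role)."""
    postings = {label: n * reposts for label, (n, reposts) in roles.items()}
    total = sum(postings.values())
    return {label: count / total for label, count in postings.items()}


roles = {
    "desirable": (80, 1),    # assumed: filled quickly, posted once
    "undesirable": (20, 4),  # assumed: hard to fill, re-posted four times
}

shares = posting_share(roles)
print(shares)  # share of postings, not share of distinct roles
```

With these inputs the undesirable roles are 20% of the openings but half of
the postings, which is the kind of over-representation described above.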

~~~
WhitneyLand
What did you find to be better than indeed? I would have thought it a safe bet
since it aggregates from many sources.

~~~
crypto5
I guess Indeed doesn't represent how many people write ML code in large
companies like FB, Google, MSFT, IBM. I think another good metric would be
the number of LoC in ML projects on GitHub.

~~~
ma2rten
Maybe also stackoverflow.

------
minimaxir
Every time a "what's the best language for Machine Learning/Data Science?"
thread pops up, it always devolves into a flame war between "real data
analysts use R" and "Python has thousands more libraries!". (most recent
example 16 days ago:
[https://news.ycombinator.com/item?id=13110230](https://news.ycombinator.com/item?id=13110230)
)

My response is always use-the-most-appropriate-tool-for-the-job-dammit and
don't pigeon-hole yourself into one language, since each language has its
pros and cons. I am very tempted to write an HN autocomment bot at this point.
(in Python instead of R, of course, since that's the _most appropriate tool
for this job_ )

~~~
lacampbell
Are R and Python really your only choices? Basic competency in machine
learning is on my todo list, and I don't relish the thought of using either
language.

~~~
joshvm
There is a pro for Python in that it makes machine learning _really easy_ , or
at least incredibly accessible. You can write and test classifiers in about 5
lines of Python using scikit-learn. The second point is that virtually all the
latest deep learning packages come with Python frontends by default nowadays.
For stats you could also use SPSS.
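
As a rough illustration of the "about 5 lines" claim, here is a minimal
sketch. The iris dataset, logistic regression, and the 75/25 split are
illustrative choices, not taken from the thread, and it assumes a
scikit-learn version that ships `sklearn.model_selection` (0.18 or later):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Load a small bundled dataset, hold out a test set, fit, and score.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
accuracy = clf.score(X_test, y_test)
print(f"held-out accuracy: {accuracy:.2f}")
```

Swapping in another estimator usually means changing only the line that
constructs `clf`, since scikit-learn estimators share the `fit`/`score`
interface.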

The other advantage of Python is that as a scripting language it's very
powerful for data wrangling and pre-processing, without needing all the
boilerplate that e.g. C++ would require.

~~~
suresk
I have played around with scikit-learn and love how simple and easy it is to
work with, but the story for scaling it doesn't seem super straightforward -
is this something anyone here has experience with?

I built a recommendation system in Spark earlier this year that used terabytes
of input and would run it on a 40 node EMR cluster so it took less than half
an hour. It wasn't trivial to make it run in a clustered environment, but it
wasn't very hard either.

~~~
pwang
Out of curiosity, were you using spark-scala or pyspark?

~~~
suresk
I was using scala

------
c3534l
You seem to get significantly different results when you ask people directly
what they're using:

[http://www.kdnuggets.com/2016/06/r-python-top-analytics-
data...](http://www.kdnuggets.com/2016/06/r-python-top-analytics-data-mining-
data-science-software.html)

I don't really like the idea of looking at search correlations to infer
popularity in a given field. People who use R might have a higher level of
education, resulting in search results that are narrower and more focused
than Python users, or simply be more likely to call it "AI" or "statistical
learning" or something of that nature. Or it may be that people learn a
language or tool because it is useful in a field, whereas people who use a
more popular tool might tangentially search for a given combination, even
though they're not really in that field or doing any real kind of ML work.

The KDnuggets survey is self-selected, which is inherently an unsound
method, but it's not like the Google search results are really a random
sample, either.

~~~
jfpuget
I fully agree that search correlations are a poor basis. Fortunately, this is not the case
here. This is not about any search. This is about actual keywords occurring in
actual job offers. I thought I explained it clearly, sorry if you missed it.

~~~
c3534l
Oh, damn. I wish hacker news let you delete comments.

~~~
grzm
If the comment is still within the time it can be edited, you can, unless it
has a response. In that case, if you want the comment effectively gone, you
can edit the comment to remove the content. As a courtesy to the reader, you
can leave a "[self-deleted]" or some such to indicate what happened.

------
aabajian
Half of machine learning is data wrangling. Python is so easy to use and
elegant, that it _feels good_ when you're using it. When was the last time you
enjoyed data manipulation in R/Java/C/etc?

~~~
frugalmail
Anything for PRODUCTION data processing that doesn't have strong/static typing
is a crappy experience.

That being said, for ad hoc work, with Scala you get the best of both worlds.

Python is certainly approachable for early programmers, and R for
mathematicians/business folks. Sadly there are more early-access libs here, but
all the popular stuff is available in the languages above.

C++ is really efficient, but it's a bit unforgiving.

~~~
GFK_of_xmaspast
Why are you doing data wrangling in production?

------
stevehiehn
After a year of experiments I realized that machine learning and big data
pipelines are inseparable. So at first I was thinking R/Python is the
greatest. And it might be until you need to do more than a few isolated
models. At that point I reverted to building the pipeline parts with Spring
Java + in-memory data grids because there are so many options.

~~~
hcarvalhoalves
There's certainly a tension between quickly prototyping something in R/Python
w/ limited data vs. making a ML system that scales once proven useful.

I believe the majority of data science jobs today involve doing only the
first (to gather one-off insights) and dropping the ball on the second,
since it involves a lot more software engineering, and those jobs are
currently being filled by people without this skill.

I foresee this being a source of frustration in the next years for companies
that fell for the data science hype, once they figure out it takes significant
investment and commitment to build intelligence into their systems, or even
curate high-quality data to do it right in the first place.

~~~
altstar
I cannot agree more. Took us a year to figure out the whole process.
Especially updating and maintaining the models in production can also be a
handful.

------
Palomides
spoiler: "Python is still the leader, but C++ is now second, then Java, and C
at fourth place. R is only at the fifth rank."

~~~
vonnik
It's important to distinguish between the API language and the core language
used for large-scale computation. Every single deep learning library exhibits
that division. The API language(s) indicate the communities the lib seeks to
serve. The core languages are lower-level, faster, and help optimize on the
hardware. The core languages are always C/C++/Cuda C. The API languages tend
to be Python, Java, Scala, R. Conflating the API languages with the computing
cores is comparing apples to oranges.
[http://imgur.com/a/Z6fGr](http://imgur.com/a/Z6fGr)

~~~
Meditato
The higher level languages are also used to munge the data into lib-readable
format. Python and R's standard libs are especially good at putting out
various tabular and database formats in preparation for ML input
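
On the Python side, a minimal sketch of that kind of munging using only the
standard library (`csv`, `sqlite3`). The data, column names, and the helper
functions `load_rows` and `to_features` are made up for illustration:

```python
import csv
import io
import sqlite3

# Hypothetical raw input; in practice this would come from a file.
RAW = """sepal_length,sepal_width,species
5.1,3.5,setosa
7.0,3.2,versicolor
6.3,3.3,virginica
"""


def load_rows(text):
    """Parse CSV text into (float, float, str) tuples."""
    reader = csv.DictReader(io.StringIO(text))
    return [
        (float(r["sepal_length"]), float(r["sepal_width"]), r["species"])
        for r in reader
    ]


def to_features(rows):
    """Store rows in an in-memory SQLite table, then read back X and y lists."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE flowers (sepal_length REAL, sepal_width REAL, species TEXT)"
    )
    conn.executemany("INSERT INTO flowers VALUES (?, ?, ?)", rows)
    cur = conn.execute(
        "SELECT sepal_length, sepal_width, species FROM flowers "
        "ORDER BY sepal_length"
    )
    result = cur.fetchall()
    conn.close()
    X = [[length, width] for length, width, _ in result]
    y = [species for _, _, species in result]
    return X, y


X, y = to_features(load_rows(RAW))
print(X, y)
```

The resulting nested lists are the shape most ML libraries accept as feature
and label inputs.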

------
matheweis
python is that nice mix between scripting and programming, as well as having
all sorts of ugly hackish modules that make it easy to do otherwise complex
tasks with all sorts of data.

Since the hard part about ML is more about manipulating the data into
something manageable, this makes python well suited for the task.

If TensorFlow for .NET had come out sooner, I might have jumped on that as C#
is a nice balance of performance vs ease of manipulating data, but now all my
code snippets are Python so I'd need a large project before the benefits of
changing over push me that way...

------
rubyfan
I wonder how much of the job details are actual vs. aspirational. For example,
in many of our R&D job listings we list Python and Hadoop as desired skills,
when in reality those are emerging needs rather than the day to day work which
is mostly SAS.

------
Dowwie
If Rust is a replacement for C/C++, why isn't it being adopted for machine
learning?

~~~
eva1984
Nobody uses it, period.

IMHO, ML people are pretty pragmatic, possibly because ML itself is a mixed
field with people from different backgrounds, so they don't hold religious
beliefs about programming languages the way some folks with a pure CS
background do.

~~~
staticassertion
For the record, I've used rust for machine learning, and it was great. It
isn't for "religious beliefs" - both myself, an engineer, and a data
scientist on my team, had great results using rust.

Rust is entirely practical and well suited for the task. And yes, people use
rust.

~~~
eva1984
That sounds great.

What I am objecting to here are those annoying cool kids trying to sell their
flashy green solutions, like 'A new machine learning library written in Rust',
to other people because 'yeah, Rust is a better programming language than C++'.

That is what I called 'religious belief': those people don't really care
what problems they solve, they just naively assume they are better because
the language they are using is different.

~~~
staticassertion
Maybe they're not annoying cool kids but people doing work in a new domain and
trying to share the benefits.

I don't think people are saying that their solutions are better just because
they're in rust. Maybe somewhere someone is, but the majority of what I see
doesn't really sound like that.

------
lowglow
We use python for all of our local and back-end ML processing pipeline. It's
actually the reason our OS/Architecture intermediary was written in Python as
well.

------
sprobertson
Lua is ranked surprisingly low. When I was getting started it seemed easiest
to find good code examples in Torch (e.g. neuralstyle, char-rnn).

~~~
vonnik
Torch is a great library, and Lua a fine language, but they are competing
against ecosystems built on the world's largest languages.

------
smaili
For those who use R, how hard is it to search for material on the web? I
imagine in some cases the search engine may treat the R as a typo?

~~~
thom
That isn't an issue, Google's pretty smart. The actual documentation is
terrible though, and you do not have the depth of StackOverflow posts to make
up for R's inscrutable and sometimes non-existent errors.

------
eva1984
TL;DR: it is Python. And everyone already knows it.

~~~
daveguy
And Python 2.7 to be specific. All the big framework releases are still 2.7
first. You have to wait for ML researchers to be bothered with 3.x.

~~~
jfpuget
Not sure where you get this from. The most popular machine learning framework,
scikit-learn, runs fine in 3.5. I am also using xgboost with Python 3.5 (yes,
xgboost is a major open source framework for machine learning, just look at
what framework is used by most Kaggle competition winners). TensorFlow, mxnet also
support 3.5. I would have agreed with you a year ago, but there has been a
major shift in use from 2.7 to 3.5 in 2016.

edited for typo.

------
ralphc
JavaScript is mentioned in the charts but as of this writing has no hits in
the discussion. Is anyone using JavaScript or Node for ML?

~~~
sprobertson
As far as I can tell there are no great options with all the necessary
features - extensible API and GPU support especially. My team is Node based
but went with Lua for the ML parts because of this.

------
markovbling
Super easy to call R scripts from python - can use rpy2 to send dataframes
from R to pandas or can just run an R script that outputs a csv to a folder
and then read that in python...

Legit 3 lines

    
    
      import rpy2.robjects as robjects
      # look up R's source() function, then use it to run the script
      r_source = robjects.r['source']
      r_source('myscript.R')
      #
      print('r script finished running')
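
The CSV hand-off mentioned above (run an R script that writes a CSV, then
read it in Python) could look roughly like this. It is a sketch that assumes
the `Rscript` command-line tool is installed and on the PATH; the function
names and the file names in the usage comment are hypothetical:

```python
import csv
import subprocess


def run_r_script(script_path):
    """Run an R script via the Rscript CLI; raises if R exits with an error."""
    subprocess.run(["Rscript", script_path], check=True)


def read_csv_rows(csv_path):
    """Read the CSV written by the R script into a list of dicts."""
    with open(csv_path, newline="") as handle:
        return list(csv.DictReader(handle))


# Usage (hypothetical names; myscript.R is assumed to write output.csv):
#   run_r_script("myscript.R")
#   rows = read_csv_rows("output.csv")
```

This route avoids the rpy2 dependency, at the cost of every value arriving
as a string that has to be converted on the Python side.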

------
ibiza
I would like to see how Octave or Matlab would do in these results.

Before clicking on the link, I thought it would be Python, R, or Octave.

~~~
jfpuget
You can add them to the queries I used to see the results. Look for the
pointer in the article. The reason I didn't include Matlab is that I didn't
include any commercial product. To your point I could have included Octave
still. I just
did it, and Octave is at 0%: [https://www.indeed.com/jobtrends/q-python-
and-%28%22machine-...](https://www.indeed.com/jobtrends/q-python-
and-%28%22machine-learning%22-or-%22data-science%22%29-q-R-and-%28%22machine-
learning%22-or-%22data-science%22%29-q-Java-and-%28%22machine-
learning%22-or-%22data-science%22%29-q-Javascript-and-%28%22machine-
learning%22-or-%22data-science%22%29-q-C-and-%28%22machine-
learning%22-or-%22data-science%22%29-q-C++-and-%28%22machine-
learning%22-or-%22data-science%22%29-q-Julia-and-%28%22machine-
learning%22-or-%22data-science%22%29-q-scala-and-%28%22machine-
learning%22-or-%22data-science%22%29-q-Octave-and-%28%22machine-
learning%22-or-%22data-science%22%29.html)
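
For anyone who wants to add a language to the comparison, here is a sketch of
how such a link can be assembled. The URL pattern is inferred purely from the
link above, not from any documented Indeed API, the page may no longer be
served, and `jobtrends_url` is a hypothetical helper:

```python
from urllib.parse import quote

BASE = "https://www.indeed.com/jobtrends/"
# Assumed topic clause, reconstructed from the link above.
TOPIC = '("machine learning" or "data science")'


def jobtrends_url(languages, topic=TOPIC):
    """Build a comparison URL with one 'q-...' term per language."""
    terms = []
    for language in languages:
        query = f"{language} and {topic}".replace(" ", "-")
        # Keep '+' literal so that C++ appears as it does in the original link.
        terms.append("q-" + quote(query, safe="+"))
    return BASE + "-".join(terms) + ".html"


url = jobtrends_url(["python", "R", "Octave"])
print(url)
```

Each language contributes one `q-<language>-and-<topic>` term, with
parentheses and quotes percent-encoded and spaces turned into hyphens.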

------
chestervonwinch
C/C++ users: what do you use it for in ML applications? Exploratory analysis?
Converting research to production code? Do you use existing libraries, or do
application specifics (or other reasons) require that you write your own?

~~~
jhartmann
I use a tiny bit of Python, a little more Lua, and a TON of C++ in the machine
learning work I do. Things like opencv, fbthrift, folly, boost, fblualib, and
thpp make writing this sort of code in C++ very time efficient, and if you know
what you are doing it will end up performing much better than the
alternatives. I only use Python for some light scripting, data collating and
reformatting type tasks, and Lua because Torch is my neural network
framework of choice.

------
sirwitti
The positions of the graph legends are terrible - they overlap with parts of
the most interesting data points.

If the author or an editor of the article reads this: It would help a lot if
you could move the graph legends to the left. Thanks!

~~~
minimaxir
The source charts are from an interactive explorer on Indeed and not statically
generated: [https://www.indeed.com/jobtrends/q-Python-and-%22deep-
learni...](https://www.indeed.com/jobtrends/q-Python-and-%22deep-
learning%22-q-R-and-%22deep-learning%22-q-Java-and-%22deep-
learning%22-q-Javascript-and-%22deep-learning%22-q-C-and-%22deep-
learning%22-q-C++-and-%22deep-learning%22-q-Julia-and-%22deep-
learning%22-q-Scala-and-%22deep-learning%22-q-Lua-and-%22deep-
learning%22.html)

------
blazespin
I think it would have been more interesting to constrain the query to jobs
that require a PhD. There's probably a lot of grunt work that needs to be done
in other languages that isn't really specific to machine learning. E.g., Python
is fantastic for data wrangling, but PhDs with R probably earn 2x what
masters/BSc data wranglers do. But you probably need like 4x of the latter in
terms of staff.

For the person who downvoted me, I wasn't saying R is better than Python. I
was just saying that if you have just R, you're probably not doing data
wrangling..

What this post is completely failing to capture is the exceedingly high value
work in machine learning versus the typical work any skilled undergrad can do.

~~~
aub3bhat
Having spent the last 4 months interviewing at several companies for ML
roles, I can assure you that no one doing a PhD in Computer Science (Machine
Learning, Vision, Data Mining) uses R.

I also don't understand what you mean by "exceedingly high value work". For
both production as well as research (ICML/NIPS/CVPR) the languages used are in
most cases Python/C++/Lua.

Also Stats PhD (who I believe are the sole users of R) aren't typically hired
into Machine Learning roles.

R is okay for wrangling tabular data, applying statistical models for
hypothesis testing and generating pretty charts for papers. But it is not
suitable for state-of-the-art Audio, NLP, Vision or Reinforcement Learning.

~~~
minimaxir
> Also Stats PhD (who I believe are the sole users of R) aren't typically
> hired into Machine Learning roles.

FWIW, I used R for my undergrad, and still use R for personal projects
afterwards. (with modern R, dplyr/ggplot2 are an order of magnitude easier to
use, in my opinion, than the Python equivalents)

------
pilooch
In reality, most production code is C++ for machine learning backends.

------
deepnotderp
Guys for deep learning we all know it's cuda under the hood.

~~~
minimaxir
You still need a programming language to interface with CUDA for deep
learning.

~~~
frozenport
The article confuses users with developers.

~~~
jfpuget
Agreed. I welcome suggestions on how to distinguish both using keywords in job
offers.

~~~
frozenport
Carefully? Possibly in some non-automatic fashion?

~~~
jfpuget
I don't have the data. I use indeed.com trend queries. All we can do is to
select which keywords to use.

~~~
frozenport
That sounds like your problem and not mine. Right now the article reads like
"who is your ISP" and 30% of people responded "Netgear".

~~~
jfpuget
If you think job offers are written in such a dumb way that netgear could be
regarded as an ISP then I would agree with you. What evidence do you have of
this?

------
epynonymous
wonder why there's no mention of golang... perhaps the libraries are not good
enough? golang is great for distributed systems.

~~~
chewxy
Well, we're getting there. If you're interested in ML with Go, there is
Gorgonia:
[https://github.com/chewxy/gorgonia](https://github.com/chewxy/gorgonia)

I've been using it for several years to build pipelines

