The Most Popular Language for Machine Learning (ibm.com)
142 points by jfpuget on Dec 22, 2016 | 103 comments

Ugh, indeed.com data.

Was not a great place to find data science jobs during my last job hunt, but that might just be personal experience.

Beyond that, wouldn't there be some bias, in that the jobs that don't get good candidates are going to have to be re-posted more often to maintain their visibility, leading to an overrepresentation of undesirable roles in the data?
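
To put a rough number on that re-posting bias, here is a small worked example. The 20% share of hard-to-fill roles and the 3x re-posting rate are invented for illustration, and the model assumes every other role is posted exactly once.

```python
def share_of_postings(hard_to_fill_share, posts_per_hard_role):
    """Fraction of observed postings that come from hard-to-fill roles.

    Assumes each easy role is posted once and each hard-to-fill role is
    posted `posts_per_hard_role` times over the same period.
    """
    hard = hard_to_fill_share * posts_per_hard_role
    easy = 1.0 - hard_to_fill_share
    return hard / (hard + easy)


# Hypothetical inputs: 20% of roles are hard to fill and get posted 3 times.
observed = share_of_postings(0.20, 3)
print(f"{observed:.1%}")  # 42.9%
```

Under those assumptions, a fifth of the roles would account for roughly 43% of the postings that a keyword count sees.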

What did you find to be better than indeed? I would have thought it a safe bet since it aggregates from many sources.

I guess Indeed doesn't represent how many people write ML code in large companies like FB, Google, MSFT, IBM. I think another good metric would be the number of LoC in ML projects on GitHub.
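
A lines-of-code tally like that could be sketched as follows for a local checkout of a repository. The extension-to-language mapping and the function name `lines_by_language` are assumptions made for illustration, and counting non-blank lines is only one of several possible definitions of LoC.

```python
from collections import Counter
from pathlib import Path

# Assumed mapping from file extension to language; extend as needed.
LANGUAGES = {
    ".py": "Python",
    ".r": "R",
    ".c": "C",
    ".cpp": "C++",
    ".java": "Java",
    ".scala": "Scala",
}


def lines_by_language(repo_root):
    """Count non-blank lines per language for all files under `repo_root`."""
    totals = Counter()
    for path in Path(repo_root).rglob("*"):
        language = LANGUAGES.get(path.suffix.lower())
        if language is None or not path.is_file():
            continue
        with path.open(encoding="utf-8", errors="ignore") as handle:
            totals[language] += sum(1 for line in handle if line.strip())
    return totals
```

Calling `lines_by_language("path/to/checkout")` returns a `Counter` keyed by language name. It says nothing about which language does the heavy lifting at runtime.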

Maybe also stackoverflow.

Every time a "what's the best language for Machine Learning/Data Science?" thread pops up, it always devolves into a flame war between "real data analysts use R" and "Python has thousands more libraries!". (most recent example 16 days ago: https://news.ycombinator.com/item?id=13110230 )

My response is always use-the-most-appropriate-tool-for-the-job-dammit and don't pigeon-hole yourself into one language, since each language has its pros and cons. I am very tempted to write an HN autocomment bot at this point. (in Python instead of R, of course, since that's the most appropriate tool for this job)

That's a trade-off. Is the cost of learning and context switching between languages/frameworks less than the benefit of using a tool that is X% more appropriate?

Are R and Python really your only choices? Basic competency in machine learning is on my todo list, and I don't relish the thought of using either language.

There is a pro for Python in that it makes machine learning really easy, or at least incredibly accessible. You can write and test classifiers in about 5 lines of Python using scikit-learn. The second point is that virtually all the latest deep learning packages come with Python frontends by default nowadays. For stats you could also use SPSS.
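
To make the "about 5 lines" claim concrete, here is a minimal sketch, assuming scikit-learn 0.18 or later is installed. The iris dataset, the logistic regression model, and the 25% hold-out split are choices made for illustration, not anything specified in the comment above.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Load a small built-in dataset and hold out 25% of it for testing.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Fit a classifier on the training split and score it on the held-out split.
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
accuracy = clf.score(X_test, y_test)
print(f"held-out accuracy: {accuracy:.2f}")
```

Swapping in another estimator is usually a one-line change, since scikit-learn estimators share the same `fit` and `score` interface.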

The other advantage of Python is that as a scripting language it's very powerful for data wrangling and pre-processing, without needing all the boilerplate that e.g. C++ would require.

Not to join a flame war, but R makes it pretty easy to test multiple models on a single dataset as well. I have also noticed it does better stats and missing data handling out of the box.

I have played around with scikit-learn and love how simple and easy it is to work with, but the story for scaling it doesn't seem super straightforward - is this something anyone here has experience with?

I built a recommendation system in Spark earlier this year that used terabytes of input and would run it on a 40 node EMR cluster so it took less than half an hour. It wasn't trivial to make it run in a clustered environment, but it wasn't very hard either.

Out of curiosity, were you using spark-scala or pyspark?

I was using scala

If you consider SPSS as an alternative, you'll probably really have no use for R. I agree that Python is more approachable for people with a CS background (unless you're a fan of array processing languages), but R actually is a nice language for data-centric tasks.

Julia is another option. You can even call R and python code from a Julia REPL.

You generally want an interactive language, though, because there is an iterative cycle in prototyping models.

They are not the only ones, but python is definitely the one which is likely to get you the most productivity fastest.

R and Python are not your only options. Check out the article for some other languages people are using.

R and Python are probably the two with the most support/community materials around them - lots of tutorials, libraries, guides etc.

Which is the one main reason why PHP is still around. If you are starting a new language or a new project, it is better to have examples and guides available.

I prefer Python to "R" because it gives me better search results.

Given that the article mentions many library implementations in Java, that puts JVM languages on the table. While Scala, which the article mentions, may be a good one, that also puts Clojure on the table.

Just added Clojure to the query to see what it gives. You can do it too, simply follow the link I give in the article. Clojure is a bit less popular than Julia, hence not significant at this point. See https://www.indeed.com/jobtrends/q-python-and-%28%22machine-...

> don't relish the thought of using either language

Why not?

> only choices

Obviously not, according to the linked article. However, many people like Python and/or R. Perhaps you should find out why before dismissing their choices.

"Everytime a "what's the best language for Machine Learning/Data Science?" thread pops up,"

there will always be someone saying your statement as well.


Wow, and to think some fools think the most inappropriate tool should be used.

Given how often it comes up, you might be better off using Erlang/Elixir for concurrency's sake.

Writing ML in Elixir and being able to natively integrate the code into a Phoenix app would be great. With Elixir, I think the problem is the lack of libraries, especially ML libraries.

Elixir, I'm afraid, is the wrong tool for ML. Numeric computation is Erlang's weak spot.

You seem to get significantly different results when you ask people directly what they're using:


I don't really like the idea of looking at search correlations to infer popularity in a given field. People who use R might have a higher level of education, resulting in search results that are narrower and more focused than those of Python users, or simply be more likely to call it "AI" or "statistical learning" or something of that nature. Or it may be that people learn a language or tool because it is useful in a field, whereas people who use a more popular tool might tangentially search for a given combination, even though they're not really in that field or doing any real kind of ML work.

Although the KDNuggets survey is self-selected, which is inherently an unsound method, it's not like the Google search results are really a random sample, either.

I think the appearance of "Hadoop" and Spark in the list of languages may be a clue as to the utter bollocks being talked here.

It's a list of tools, not languages, as the title implies.

I fully agree on using search correlations. Fortunately, this is not the case here. This is not about any search. This is about actual keywords occurring in actual job offers. I thought I explained it clearly, sorry if you missed it.
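
For readers wondering what counting keywords in job offers might look like, here is a toy sketch. It is not how Indeed's job trends queries work internally (that isn't described here); the three postings are invented, and `mentions` and `keyword_share` are hypothetical helper names.

```python
import re

# Hypothetical postings, invented purely for illustration.
POSTINGS = [
    "Machine learning engineer: Python, C++ and TensorFlow required.",
    "Data scientist for our R&D team; SAS daily, Python desired.",
    "Statistician: R, ggplot2, and machine learning experience.",
]


def mentions(keyword, text):
    """True if `keyword` occurs as a standalone token (case-insensitive)."""
    # Reject matches glued to letters, digits, '+' or '#', so that "C" does
    # not match inside "C++" and "R" does not match inside "required".
    boundary = "[A-Za-z0-9+#]"
    pattern = f"(?<!{boundary}){re.escape(keyword)}(?!{boundary})"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None


def keyword_share(keyword, postings):
    """Fraction of postings that mention `keyword` at least once."""
    return sum(mentions(keyword, p) for p in postings) / len(postings)


for kw in ("Python", "R", "C++", "C"):
    print(kw, round(keyword_share(kw, POSTINGS), 2))
```

Note that "R" also matches the "R&D" posting here. One-letter language names make this kind of count fragile, so the query wording matters a great deal.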

What is described in job announcements and what the job actually consists of can often be quite different though. Job announcements are a combination of what the management think or wish will be needed in the future and what they think will attract applicants. This means that technologies that are perceived as trendy will appear more often than those actually used, especially if the latter is perceived as legacy tech. The result is that job announcements may not be such a good indicator of what actually is used out there.

There is also a tendency for jobs that require a deep understanding of some area (say video encoding, cryptography, statistics) not to be announced as such; instead, the programming languages and frameworks used in the surrounding system are announced as essential for the job. This means that if your core competency is in such areas, it can be hard to find the employers that need such skills, even if the skill is in demand.

Oh, damn. I wish hacker news let you delete comments.

If the comment is still within the time it can be edited, you can, unless it has a response. In that case, if you want the comment effectively gone, you can edit the comment to remove the content. As a courtesy to the reader, you can leave a "[self-deleted]" or some such to indicate what happened.

Half of machine learning is data wrangling. Python is so easy to use and elegant, that it feels good when you're using it. When was the last time you enjoyed data manipulation in R/Java/C/etc?

Anything for PRODUCTION data processing that doesn't have strong/static typing is a crappy experience.

That being said, for adhoc work, with Scala you get the best of both worlds.

Python is certainly approachable for early programmers, and R for mathematicians/business folks. Sadly there are more early access libs here, but all the popular stuff is available in the languages above.

C++ is really efficient, but it's a bit unforgiving.

Why are you doing data wrangling in production?

After a year of experiments I realized that machine learning and big data pipelines are inseparable. So at first I was thinking R/Python is the greatest. And it might be, until you need to do more than a few isolated models. At that point I reverted to building the pipeline parts with Spring Java + in-memory data grids because there are so many options.

There's certainly a tension between quickly prototyping something in R/Python w/ limited data vs. making a ML system that scales once proven useful.

I believe the majority of data science jobs today involve doing only the first (to gather one-off insights) and dropping the ball on the second, since it involves a lot more software engineering, and those jobs are currently being filled by people without this skill.

I foresee this being a source of frustration in the coming years for companies that fell for the data science hype, once they figure out it takes significant investment and commitment to build intelligence into their systems, or even to curate high-quality data to do it right in the first place.

I cannot agree more. Took us a year to figure out the whole process. Especially updating and maintaining the models in production can also be a handful.

I think you are spot on.

That's why we built deeplearning4j and datavec, fwiw. https://github.com/deeplearning4j

spoiler: "Python is still the leader, but C++ is now second, then Java, and C at fourth place. R is only at the fifth rank."

It's important to distinguish between the API language and the core language used for large-scale computation. Every single deep learning library exhibits that division. The API language(s) indicate the communities the lib seeks to serve. The core languages are lower-level, faster, and help optimize on the hardware. The core languages are always C/C++/CUDA C. The API languages tend to be Python, Java, Scala, R. Conflating the API languages with the computing cores is comparing apples to oranges. http://imgur.com/a/Z6fGr

The higher-level languages are also used to munge the data into a lib-readable format. Python's and R's standard libs are especially good at putting out various tabular and database formats in preparation for ML input.

You're absolutely right, and the data doesn't allow us to split between API and core implementation.

That's the ranking when they queried for deep learning specifically.

For machine learning it was: Python, Java, R, C++, C. I have to wonder if that difference in ordering is actually real.

TensorFlow is written in C++ so I'd expect it to impact the deep learning result.

Every serious lib was written in C/C++ under the hood long before TensorFlow came along.

True, at least to some extent. But I think TensorFlow treated C++ as a first class citizen and not as a tool to improve performance.

You wouldn't use sklearn as a C++ developer, for example. But you could totally use TensorFlow.

I'm guessing Python for research and prototyping with C++ for production-izing and scaling.

Some of these Python libs call out to C for heavy lifting, which can be very performant.

You lose quite a few optimization opportunities along the FFI boundaries.

python is that nice mix between scripting and programming, as well as having all sorts of ugly hackish modules that make it easy to do otherwise complex tasks with all sorts of data.

Since the hard part about ML is more about manipulating the data into something manageable, this makes python well suited for the task.

If TensorFlow for .NET had come out sooner, I might have jumped on that as C# is a nice balance of performance vs ease of manipulating data, but now all my code snippets are Python so I'd need a large project before the benefits of changing over push me that way...

I wonder how much of the job details are actual vs. aspirational. For example, in many of our R&D job listings we list Python and Hadoop as desired skills, when in reality those are emerging needs rather than the day to day work which is mostly SAS.

If Rust is a replacement for C/C++, why isn't it being adopted for machine learning?

I did a talk at RustConf about Rust for machine learning.

The short story is that rust is awesome for handling data - especially unclean data. But Python, or really anything with a REPL, is still ideal for the exploratory phase using ipython notebook and whatnot.

Rust would be suitable for building machine learning algorithms but it's still immature in that area. For now you could perform the data processing in rust and the ML in Python.

It looks like most of the frameworks for ML started before Rust's 1.0 stable so it's no surprise we're not seeing it in those arenas. Since GPUs and CUDA in particular play such a heavy role in ML I also wouldn't be surprised if some projects chose C/C++ to avoid maintaining multiple languages (CUDA uses C kernels compiled by a proprietary compiler).

From my understanding, most ML work isn't really done at the algorithm level but at the data level, as in manipulating data and experimenting with a variety of existing implementations. Since we have so much experience already available in optimizing C and there's not that much actual low level code to write, it makes sense to stick to C.

Nobody uses it, period.

IMHO, ML people are pretty pragmatic, possibly because ML itself is a mixed field with people from different backgrounds, so they don't hold religious beliefs about programming languages compared to some folks with a pure CS background.

For the record, I've used rust for machine learning, and it was great. It isn't for "religious beliefs" - both myself, an engineer, and a data scientist on my team, had great results using rust.

Rust is entirely practical and well suited for the task. And yes, people use rust.

That sounds great.

What I am objecting to here are those annoying cool kids who try to sell their flashy green solutions, like 'A new machine learning library written in Rust', to other people because 'yeah, Rust is a better programming language than C++'.

That is what I call 'religious belief': people who don't really care what problems they solve and just naively assume they are better because the language they are using is different.

Maybe they're not annoying cool kids but people doing work in a new domain and trying to share the benefits.

I don't think people are saying that their solutions are better just because they're in rust. Maybe somewhere someone is, but the majority of what I see doesn't really sound like that.

Thanks for opening up and being honest about this. If you let Rust into your heart, you will find that it can help machines learn wonderful, beautiful things. I will share my Book of Rust with you if you promise to read it. It opened my mind.

I did not include rust in the article, but it is easy to add it to the query. I just did it, and rust appears in 0% of job offers, see https://www.indeed.com/jobtrends/q-python-and-%28%22machine-...

In addition to the other responses, the ecosystems around Rust ML (e.g. scientific computing) are still very early and the Rust ML community itself is still fairly small. Take a peek at arewelearningyet.com

Mostly: lack of libraries.

The Rust ecosystem is still very immature.

Give it a few years.

We use python for all of our local and back-end ML processing pipeline. It's actually the reason our OS/Architecture intermediary was written in Python as well.

Lua is ranked surprisingly low. When I was getting started it seemed easiest to find good code examples in Torch (e.g. neuralstyle, char-rnn).

Torch is a great library, and Lua a fine language, but they are competing against ecosystems built on the world's largest languages.

For those who use R, how hard is it to search for material on the web? I imagine in some cases the search engine may treat the R as a typo?

That isn't an issue, Google's pretty smart. The actual documentation is terrible though, and you do not have the depth of StackOverflow posts to make up for R's inscrutable and sometimes non-existent errors.

Zero issue whatsoever.

It can be a nuisance. Searching e.g. for R go interoperability yields interesting results. Adding "statistics" usually gives reasonable results. But there is also rseek.org

I've had less and less trouble over the last couple of years. If I'm worried, I'll add 'cran' into the search string.

TL;DR: it is Python. And everyone already knows it.

And Python 2.7 to be specific. All the big framework releases are still 2.7 first. You have to wait for ML researchers to be bothered with 3.x.

Not sure where you get this from. The most popular machine learning framework, scikit-learn, runs fine in 3.5. I am also using xgboost with Python 3.5 (yes, xgboost is a major open source library for machine learning, just look at what framework is used by most Kaggle competition winners). TensorFlow and mxnet also support 3.5. I would have agreed with you a year ago, but there has been a major shift in use from 2.7 to 3.5 in 2016.

edited for typo.

Not really... I actually found some Python 3.x-only projects recently, like this one:


Javascript is mentioned in the charts but as of this writing has no hits on the discussion. Is anyone using Javascript or Node for ML?

As far as I can tell there are no great options with all the necessary features - extensible API and GPU support especially. My team is Node based but went with Lua for the ML parts because of this.

Super easy to call R scripts from python - can use rpy2 to send dataframes from R to pandas or can just run an R script that outputs a csv to a folder and then read that in python...

Legit 3 lines

  import rpy2.robjects as robjects
  robjects.r['source']('my_script.R')  # call R's source(); 'my_script.R' is a placeholder path
  print('r script finished running')
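
The other route mentioned above, running an R script that writes a CSV and then reading it from Python, could look roughly like this using only the standard library. It is a sketch under assumptions: the `Rscript` executable is on the PATH, and the helper names `run_r_script` and `read_csv_rows` are made up for illustration.

```python
import csv
import subprocess
from pathlib import Path


def run_r_script(script_path, rscript="Rscript"):
    """Run an R script as a child process.

    Raises subprocess.CalledProcessError if R exits with a non-zero status.
    """
    subprocess.run([rscript, str(script_path)], check=True)


def read_csv_rows(csv_path):
    """Load the CSV written by the R script as a list of dicts keyed by column."""
    with Path(csv_path).open(newline="") as handle:
        return list(csv.DictReader(handle))
```

Typical use would be `run_r_script("analysis.R")` followed by `read_csv_rows("output/results.csv")`, where both file names are placeholders. All values come back as strings, so numeric columns need converting, or the file can be loaded with pandas instead.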

I would like to see how Octave or Matlab would do in these results.

Before clicking on the link, I thought it would be Python, R, or Octave.

You can add them to the queries I used to see the results. Look for pointers in the article. The reason I didn't include Matlab was that I didn't include any commercial product. To your point, I could still have included Octave. I just did it, and Octave is at 0%: https://www.indeed.com/jobtrends/q-python-and-%28%22machine-...

C/C++ users: what do you use it for in ML applications? Exploratory analysis? Converting research to production code? Do you use existing libraries, or do application specifics (or other reasons) require that you write your own?

I use a tiny bit of Python, a little more Lua, and a TON of C++ in the machine learning work I do. Things like opencv, fbthrift, folly, boost, fblualib, and thpp make writing this sort of code in C++ very time efficient, and if you know what you are doing it will end up performing much better than the alternatives. I only use Python for some light scripting, data collating, and reformatting type tasks, and Lua because Torch is my neural network framework of choice.

Several prominent open source projects are written primarily in C++. TensorFlow is a good example.

The positions of the graph legends are terrible - they overlap with parts of the most interesting data points.

If the author or an editor of the article reads this: It would help a lot if you could move the graph legends to the left. Thanks!

The source charts are from an interactive explorer on Indeed and not statically generated: https://www.indeed.com/jobtrends/q-Python-and-%22deep-learni...

I think it would have been more interesting to constrain the query to jobs that require a PhD. There's probably a lot of grunt work that needs to be done in other languages that isn't really specific to machine learning. E.g., Python is fantastic for data wrangling, but PhDs with R probably earn 2x what masters/BSc data wranglers do. But you probably need like 4x of the latter in terms of staff.

For the person who downvoted me, I wasn't saying R is better than Python. I was just saying that if you have just R, you're probably not doing data wrangling.

What this post is completely failing to capture is the exceedingly high value work in machine learning versus the typical work any skilled undergrad can do.

Having spent the last 4 months interviewing at several companies for ML roles, I can assure you that no one doing a PhD in Computer Science (Machine Learning, Vision, Data Mining) uses R.

I also don't understand what you mean by "exceedingly high value work". For both production as well as research (ICML/NIPS/CVPR) the languages used are in most cases Python/C++/Lua.

Also Stats PhD (who I believe are the sole users of R) aren't typically hired into Machine Learning roles.

R is okay for wrangling tabular data, applying statistical models for hypothesis testing, and generating pretty charts for papers. But it is not suitable for state-of-the-art audio, NLP, vision, or reinforcement learning.

> Also Stats PhD (who I believe are the sole users of R) aren't typically hired into Machine Learning roles.

FWIW, I used R for my undergrad, and still use R for personal projects afterwards. (with modern R, dplyr/ggplot2 are an order of magnitude easier to use, in my opinion, than the Python equivalents)

Lua isn't missed, read the article again. It is just that Lua appears in exactly 0% of the job offers on indeed.com.

Similar experience. The only people using R were technologically illiterate, mining existing datasets, mostly in life sciences.

This is very different than trying to recognize street signs.

In reality, most production code is C++ for machine learning backends.

Guys for deep learning we all know it's cuda under the hood.

You still need a programming language to interface with CUDA for deep learning.

The article confuses users with developers.

Agreed. I welcome suggestions on how to distinguish the two using keywords in job offers.

Carefully? Possibly in some non-automatic fashion?

I don't have the data. I use indeed.com trend queries. All we can do is to select which keywords to use.

That sounds like your problem and not mine. Right now the article reads like "who is your ISP" and 30% of people responded "Netgear".

If you think job offers are written in such a dumb way that netgear could be regarded as an ISP then I would agree with you. What evidence do you have of this?

wonder why there's no mention of golang... perhaps the libraries are not good enough? golang is great for distributed systems.

Well, we're getting there. If you're interested in ML with Go, there is Gorgonia: https://github.com/chewxy/gorgonia

I've been using it for several years to build pipelines

I didn't include it indeed, but you can try yourself. Just modify the query of my article (look for pointers).
