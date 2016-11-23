
The Most Popular Language for Machine Learning (ibm.com)
79 points by jfpuget 5 hours ago | 47 comments





Ugh, indeed.com data.

Was not a great place to find data science jobs during my last job hunt, but that might just be personal experience.

Beyond that, wouldn't there be some bias - in that the jobs that don't get good candidates are going to have to be re-posted more often to maintain their visibility, leading to an over representation of undesirable roles in the data?
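
To make the re-posting concern concrete, here is a minimal sketch with invented numbers (the role ids, languages, and repost counts are hypothetical, not Indeed data). It compares language shares counted per posting with shares counted per unique role.

```python
from collections import Counter

# Hypothetical data: (role_id, language, times_posted).
# The hard-to-fill role is assumed to be re-posted more often.
ROLES = [
    ("role-1", "Python", 1),
    ("role-2", "Python", 1),
    ("role-3", "Python", 1),
    ("role-4", "Java", 4),
]


def share_by_posting(roles):
    """Language share when every posting and re-posting counts once."""
    counts = Counter()
    for _role_id, language, times_posted in roles:
        counts[language] += times_posted
    total = sum(counts.values())
    return {language: n / total for language, n in counts.items()}


def share_by_role(roles):
    """Language share when each unique role counts once."""
    counts = Counter(language for _role_id, language, _times in roles)
    total = sum(counts.values())
    return {language: n / total for language, n in counts.items()}


if __name__ == "__main__":
    print("per posting:", share_by_posting(ROLES))
    print("per role:   ", share_by_role(ROLES))
```

With these made-up numbers Java is a quarter of the roles but more than half of the postings. Whether Indeed's trend data deduplicates re-postings is not something this thread establishes.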


What did you find to be better than indeed? I would have thought it a safe bet since it aggregates from many sources.


I guess Indeed doesn't represent how many people write ML code in large companies like FB, Google, MSFT, IBM. I think another good metric would be the number of LoC in ML projects on GitHub.
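
A rough sketch of how such a line count could work on a local checkout of a repository. The extension-to-language map is an assumption and far from complete, and raw line counts include comments and blank lines.

```python
import os
from collections import Counter

# Assumed mapping from file extension to language; extend as needed.
EXTENSIONS = {
    ".py": "Python",
    ".r": "R",
    ".cpp": "C++",
    ".cc": "C++",
    ".c": "C",
    ".java": "Java",
    ".lua": "Lua",
}


def count_lines_by_language(root):
    """Walk `root` and sum raw line counts per language."""
    totals = Counter()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            # Lowercase the extension so "model.R" and "model.r" both count as R.
            extension = os.path.splitext(name)[1].lower()
            language = EXTENSIONS.get(extension)
            if language is None:
                continue
            try:
                with open(os.path.join(dirpath, name), "rb") as handle:
                    totals[language] += sum(1 for _line in handle)
            except OSError:
                continue  # unreadable file or dangling symlink
    return totals
```

Calling `count_lines_by_language(".")` returns a `Counter` keyed by language name. Header files such as `.h` are left out on purpose because they are ambiguous between C and C++.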


Every time a "what's the best language for Machine Learning/Data Science?" thread pops up, it always devolves into a flame war between "real data analysts use R" and "Python has thousands more libraries!". (most recent example 16 days ago: https://news.ycombinator.com/item?id=13110230 )

My response is always use-the-most-appropriate-tool-for-the-job-dammit and don't pigeon-hole yourself into one language, since each language has its pros and cons. I am very tempted to write an HN autocomment bot at this point. (in Python instead of R, of course, since that's the most appropriate tool for this job)


That's a trade-off. Is the cost of learning and context switching between languages/frameworks less than the benefit of using a tool that is X% more appropriate?


Are R and Python really your only choices? Basic competency in machine learning is on my todo list, and I don't relish the thought of using either language.


There is a pro for Python in that it makes machine learning really easy, or at least incredibly accessible. You can write and test classifiers in about 5 lines of Python using scikit-learn. The second point is that virtually all the latest deep learning packages come with Python frontends by default nowadays. For stats you could also use SPSS.

The other advantage of Python is that as a scripting language it's very powerful for data wrangling and pre-processing, without needing all the boilerplate that e.g. C++ would require.
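
As a small illustration of the low-boilerplate wrangling described above, here is a sketch using only the standard library. The column names and rows are invented for the example.

```python
import csv
import io
from collections import defaultdict
from statistics import mean

# Invented sample data standing in for a messy CSV export.
RAW = """label, score
spam, 0.9
ham,0.2
spam ,0.7
ham, n/a
"""


def mean_score_by_label(text):
    """Group rows by label and average the numeric scores, skipping bad rows."""
    reader = csv.DictReader(io.StringIO(text), skipinitialspace=True)
    scores = defaultdict(list)
    for row in reader:
        try:
            score = float(row["score"])
        except ValueError:
            continue  # non-numeric score such as "n/a"
        scores[row["label"].strip()].append(score)
    return {label: mean(values) for label, values in scores.items()}


if __name__ == "__main__":
    print(mean_score_by_label(RAW))
```

Parsing, cleaning stray whitespace, dropping the unparseable row, and aggregating all fit in one short function, which is the kind of task that tends to need noticeably more scaffolding in C++.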


I have played around with scikit-learn and love how simple and easy it is to work with, but the story for scaling it doesn't seem super straightforward - is this something anyone here has experience with?

I built a recommendation system in Spark earlier this year that used terabytes of input and would run it on a 40 node EMR cluster so it took less than half an hour. It wasn't trivial to make it run in a clustered environment, but it wasn't very hard either.


> don't relish the thought of using either language

Why not?

> only choices

Obviously not, according to the linked article. However, many people like Python and/or R. Perhaps you should find out why before dismissing their choices.


Julia is another option. You can even call R and python code from a Julia REPL.

You generally want an interactive language, though, because there is an iterative cycle in prototyping models.


R and Python are not your only options. Check out the article for some other languages people are using.

R and Python are probably the two with the most support/community materials around them - lots of tutorials, libraries, guides etc.


Given how often it comes up, you might be better off using Erlang/Elixir for concurrency's sake.


Python is that nice mix between scripting and programming, as well as having all sorts of ugly hackish modules that make it easy to do otherwise complex tasks with all sorts of data.

Since the hard part about ML is more about manipulating the data into something manageable, this makes Python well suited for the task.

If TensorFlow for .NET had come out sooner, I might have jumped on that, as C# is a nice balance of performance vs ease of manipulating data, but now all my code snippets are Python, so I'd need a large project before the benefits of changing over push me that way...


After a year of experiments I realized that machine learning and big data pipelines are inseparable. So at first I was thinking R/Python is the greatest. And it might be, until you need to do more than a few isolated models. At that point I reverted to building the pipeline parts with Spring Java + in-memory data grids because there are so many options.


spoiler: "Python is still the leader, but C++ is now second, then Java, and C at fourth place. R is only at the fifth rank."


It's important to distinguish between the API language and the core language used for large-scale computation. Every single deep learning library exhibits that division. The API language(s) indicate the communities the lib seeks to serve. The core languages are lower-level, faster, and help optimize on the hardware. The core languages are always C/C++/CUDA C. The API languages tend to be Python, Java, Scala, R. Conflating the API languages with the computing cores is comparing apples to oranges. http://imgur.com/a/Z6fGr


The higher-level languages are also used to munge the data into a lib-readable format. Python and R's standard libs are especially good at putting out various tabular and database formats in preparation for ML input.


You're absolutely right, and the data doesn't allow a split between API and core implementation.


That's the ranking when they queried for deep learning specifically.

For machine learning it was: Python, Java, R, C++, C. I have to wonder if that difference in ordering is actually real.


I'm guessing Python for research and prototyping with C++ for production-izing and scaling.


Some of these Python libs call out to C for heavy lifting, which can be very performant.
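
A quick way to see this effect, assuming the CPython interpreter (where the built-in `sum` is implemented in C), is to compare it with an equivalent hand-written Python loop. Both return the same result; timings vary by machine, so none are quoted here.

```python
import timeit


def python_sum(values):
    """Sum implemented as an explicit Python-level loop."""
    total = 0
    for value in values:
        total += value
    return total


if __name__ == "__main__":
    data = list(range(200_000))
    assert python_sum(data) == sum(data)

    # Each call is repeated 10 times; lower is faster.
    loop_time = timeit.timeit(lambda: python_sum(data), number=10)
    builtin_time = timeit.timeit(lambda: sum(data), number=10)
    print(f"python loop: {loop_time:.3f}s, built-in sum: {builtin_time:.3f}s")
```

Numeric libraries apply the same idea on a larger scale: the Python layer describes the work and compiled code does the inner loops.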


You seem to get significantly different results when you ask people directly what they're using:

http://www.kdnuggets.com/2016/06/r-python-top-analytics-data...

I don't really like the idea of looking at search correlations to infer popularity in a given field. People who use R might have a higher level of education, resulting in search results that are narrower and more focused than those of Python users, or simply be more likely to call it "AI" or "statistical learning" or something of that nature. Or it may be that people learn a language or tool because it is useful in a field, whereas people who use a more popular tool might tangentially search for a given combination, even though they're not really in that field or doing any real kind of ML work.

The KDnuggets survey is self-selected, which is inherently an unsound method, but it's not like the Google search results are really a random sample, either.


I think the appearance of "Hadoop" and Spark in the list of languages may be a clue as to the utter bollocks being talked here.


It's a list of tools, not languages, as the title implies.


I fully agree on using search correlations. Fortunately, this is not the case here. This is not about any search. This is about actual keywords occurring in actual job offers. I thought I explained it clearly, sorry if you missed it.
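
For readers unsure of the distinction, counting keywords in job offers could look roughly like the sketch below. The postings are invented and this is not the article's actual query; it only shows the idea, including the care that short names such as "R" and "C" need so they do not match inside other words.

```python
import re
from collections import Counter

# Invented postings for illustration only.
POSTINGS = [
    "Machine learning engineer. Python and C++ required.",
    "Data scientist: R or Python, machine learning experience.",
    "Backend developer, Java. No ML needed.",
    "Research role in machine learning, C and CUDA.",
]

LANGUAGES = ["Python", "R", "C++", "C", "Java"]


def language_pattern(name):
    """Match `name` as a standalone token, so "C" does not match "C++" or "CUDA"."""
    return re.compile(r"(?<![\w+#])" + re.escape(name) + r"(?![\w+#])")


def language_share(postings, topic="machine learning"):
    """Fraction of postings mentioning `topic` that also mention each language."""
    relevant = [p for p in postings if topic in p.lower()]
    if not relevant:
        return {name: 0.0 for name in LANGUAGES}
    counts = Counter()
    for posting in relevant:
        for name in LANGUAGES:
            if language_pattern(name).search(posting):
                counts[name] += 1
    return {name: counts[name] / len(relevant) for name in LANGUAGES}


if __name__ == "__main__":
    for name, share in language_share(POSTINGS).items():
        print(f"{name}: {share:.0%}")
```

The match is case-sensitive on purpose, so the "R" in "Research" or the "r" in "required" is not counted as the R language.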


Oh, damn. I wish Hacker News let you delete comments.


If the comment is still within the time it can be edited, you can, unless it has a response. In that case, if you want the comment effectively gone, you can edit the comment to remove the content. As a courtesy to the reader, you can leave a "[self-deleted]" or some such to indicate what happened.


For those who use R, how hard is it to search for material on the web? I imagine in some cases the search engine may treat the R as a typo?


I've had less and less trouble over the last couple of years. If I'm worried, I'll add 'cran' into the search string.


Zero issue whatsoever.


I would like to see how Octave or Matlab would do in these results.

Before clicking on the link, I thought it would be Python, R, or Octave.


We use Python for all of our local and back-end ML processing pipeline. It's actually the reason our OS/Architecture intermediary was written in Python as well.


If Rust is a replacement for C/C++, why isn't it being adopted for machine learning?


It looks like most of the frameworks for ML started before Rust's 1.0 stable so it's no surprise we're not seeing it in those arenas. Since GPUs and CUDA in particular play such a heavy role in ML I also wouldn't be surprised if some projects chose C/C++ to avoid maintaining multiple languages (CUDA uses C kernels compiled by a proprietary compiler).

From my understanding, most ML work isn't really done at the algorithm level but at the data level, as in manipulating data and experimenting with a variety of existing implementations. Since we have so much experience already available in optimizing C and there's not that much actual low level code to write, it makes sense to stick to C.


Nobody uses it, period.

IMHO, ML people are pretty pragmatic, possibly because ML itself is a mixed field with people from different backgrounds, so they don't hold religious beliefs about programming languages compared to some folks with a pure CS background.


I think it would have been more interesting to constrain the query to postings that require a PhD. There's probably a lot of grunt work that needs to be done in other languages that aren't really specific to machine learning. E.g., Python is fantastic for data wrangling, but PhDs with R probably earn 2x what masters/BSc data wranglers do. But you probably need like 4x of the latter in terms of staff.

For the person who downvoted me, I wasn't saying R is better than Python. I was just saying that if you have just R, you're probably not doing data wrangling.

What this post is completely failing to capture is the exceedingly high value work in machine learning versus the typical work any skilled undergrad can do.


Having spent the last 4 months interviewing at several companies for ML roles, I can assure you that no one doing a PhD in Computer Science (Machine Learning, Vision, Data Mining) uses R.

I also don't understand what you mean by "exceedingly high value work". For both production and research (ICML/NIPS/CVPR), the languages used are in most cases Python/C++/Lua.

Also, Stats PhDs (who I believe are the sole users of R) aren't typically hired into Machine Learning roles.

R is okay for wrangling tabular data, applying statistical models for hypothesis testing, and generating pretty charts for papers. But it is not suitable for state-of-the-art Audio, NLP, Vision or Reinforcement Learning.


> Also, Stats PhDs (who I believe are the sole users of R) aren't typically hired into Machine Learning roles.

FWIW, I used R for my undergrad, and still use R for personal projects afterwards. (with modern R, dplyr/ggplot2 are an order of magnitude easier to use, in my opinion, than the Python equivalents)


Lua isn't missed; read the article again. It is just that Lua appears in exactly 0% of the job offers on indeed.com.


TL;DR: it is Python. And everyone already knows it.


C/C++ users: what do you use it for in ML applications? Exploratory analysis? Converting research to production code? Do you use existing libraries, or do application specifics (or other reasons) require that you write your own?


I use a tiny bit of Python, a little more Lua, and a TON of C++ in the machine learning work I do. Things like opencv, fbthrift, folly, boost, fblualib, and thpp make writing this sort of code in C++ very time efficient, and if you know what you are doing it will end up performing much better than the alternatives. I only use Python for some light scripting, data collating and reformatting type tasks, and Lua due to using Torch as my neural network framework of choice.


Several prominent open source projects are written primarily in C++. TensorFlow is a good example.


The positions of the graph legends are terrible - they overlap with parts of the most interesting data points.

If the author or an editor of the article reads this: It would help a lot if you could move the graph legends to the left. Thanks!


The source charts are from an interactive explorer on Indeed and not statically generated: https://www.indeed.com/jobtrends/q-Python-and-%22deep-learni...


Guys, for deep learning we all know it's CUDA under the hood.


You still need a programming language to interface with CUDA for deep learning.



