Hacker News
Python, Machine Learning, and Language Wars (2015) (sebastianraschka.com)
178 points by dekhtiar on Mar 22, 2016 | 57 comments

I'm surprised the article doesn't mention Anaconda, which is Python with all the things he lists pre-installed for you. I've been a fan for some time now: https://www.continuum.io/why-anaconda

Another data point on Anaconda from Microsoft's Azure ML team: we've been using Anaconda for the past few years and have been very happy with it. One key feature we like is that we can deploy a nearly identical suite of packages on Windows, Linux, and MacOS without having to worry about mismatched versions, installation/update issues, etc.

Case in point: the Azure ML Jupyter notebooks run on a farm of Linux/Docker machines.

Once a model is built, it can be deployed for production/scale to the Azure ML backend which runs entirely on a farm of Windows machines.

Both environments have matching Anaconda distros running underneath, which makes running on two different OS's practically a non-issue. In fact, 95% of our users probably have no idea their notebooks & production site run on two different OS's (in large part thanks to Python, Linux, Docker, Anaconda, ... and lots of other great open source software).

I totally echo this. In my opinion, if Python were the Linux kernel, Anaconda would be Ubuntu: i.e., it ensures that you are actually up and running within minutes, not struggling in missing-header-files or dependency-management hell.

Of course, there's time and place for eschewing Anaconda, especially if you are trying to ensure your installation and dependencies are minimal. That said, for most data scientists who are workaday *NIX users at best, Anaconda is such a productivity boost.

EDIT: Actually, my analogy needs further qualification. You _can_ get into dependency/missing-header-file hell in Ubuntu, but it usually doesn't happen in the first few hours or days. Both Ubuntu and Anaconda have a great "first 5 minutes to 5 days" experience, and that goes a long way in ensuring the adoption of the underlying technology.

I also think that Anaconda is great. However, I hope that in the future we could install numpy, matplotlib, jupyter, etc. just using pip.

I did that earlier today:

    pip install jupyter
    pip install numpy
    pip install scipy
    pip install scikit-learn
    pip install matplotlib

The only real problem I had was with OpenCV, which requires a manual make-based installation if you want the contrib package. The other hiccup was that scikit-learn required a manual pip installation of scipy first.

I would always be worried with installing numpy/scipy via pip that I wouldn't be linking to BLAS/LAPACK correctly.

What do you mean? I have always installed via pip with zero errors on all three platforms.

The worry is not that it will throw an error, it's that it will silently link against lower-performance math libraries and make your code inexplicably slow.
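One quick sanity check here: NumPy can report which BLAS/LAPACK it was built against via `numpy.show_config()`, so you don't have to guess whether pip linked you to an optimized library:

```python
import numpy as np

# Print the BLAS/LAPACK configuration NumPy was built with.
# An optimized install mentions something like "openblas" or "mkl";
# a plain reference BLAS is dramatically slower for linear algebra,
# without ever raising an error.
np.show_config()
```

This only diagnoses the linkage; timing a large `np.dot` against a known-good install is the more direct test.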

The reason you can't is that the C libraries they leverage must be installed before installing numpy and scipy.

For example, you can't pip-install PyYAML on Ubuntu unless python-dev is installed. I am not sure if a wheel would fix it, but I don't think so.

Wheel is a binary distribution format, so the files would already be compiled and python-dev would therefore no longer be needed.

Is wheel not the default? Or, to put it simply :-) how would I avoid this issue in the future?

I haven't written anything worth publishing to PyPI, so I don't know exactly, but it looks like it is up to the developers [1]

Anyway, I just noticed that PyPI only supports binary packages for Windows and Mac OS X. Still, you could generate wheels of the packages you use with something like this:

    pip wheel -r requirements.txt
You can then install them with pip install <file>, or (unfortunately I forgot the option, perhaps it was -i) you can use an option to point at a directory containing wheels and pip install the main package. It should use all dependencies in that directory as well.

[1] http://pythonwheels.com/
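For what it's worth, the directory-pointing option the comment above half-remembers is pip's `--find-links` (usually paired with `--no-index`); a sketch of the full round trip:

```shell
# Build wheels for everything in requirements.txt into ./wheelhouse
pip wheel -r requirements.txt -w wheelhouse

# Later, or on another machine, install using only those wheels;
# dependencies are resolved from the same directory.
pip install --no-index --find-links=wheelhouse -r requirements.txt
```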

The issue with wheels on Linux, and this is very much a Linux problem, is that the pre-built nature of libraries in wheels doesn't play nicely with the raw chaos of Linux's package-management + distro + kernel ecosystem. I raised the question of FreeBSD- and Solaris-based wheels at a PyCon in a face-to-face discussion with someone more knowledgeable, and the answer was: "In theory that should work like Windows & OSX; no one has done the hard work yet."

So yeah Linux is not the most friendly environment for Python Wheels.

conda supports pip installation within the context of an environment (much more gracefully, in fact, than does virtualenv). So you're not giving anything up by using conda in this regard.

I don't see the desire to have pip as the baseline. For me, the conda packaging is much more informative and placing everything you need for multiplatform support into an /info directory with a meta.yaml is a lot more effective than going through the steps of PyPI. conda also makes uploading and hosting on anaconda.org extremely easy.

Normally there is the whole "gee, I don't want to learn another package manager" -- but conda / anaconda.org is extremely worth it. It really is a major engineering step forward from the existing package deployment strategies in Python.

I even configure my travis.yml CI scripts to download Miniconda, create a conda environment from a requirements.txt, and then build and test my code via conda on the continuous integration VM itself.
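A minimal sketch of that kind of CI bootstrap, assuming a Linux worker; the Miniconda URL, paths, and environment name here are illustrative assumptions, not the commenter's actual configuration:

```shell
# Fetch and install Miniconda non-interactively (illustrative URL/paths)
wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh -O miniconda.sh
bash miniconda.sh -b -p "$HOME/miniconda"
export PATH="$HOME/miniconda/bin:$PATH"

# Create an environment from the project's requirements and run the tests
conda create -y -n testenv --file requirements.txt
source activate testenv
python -m pytest
```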

The only worry is how strongly tied conda and anaconda.org are to the future of Continuum. Given how much Continuum speaks of open-source work, one would hope that these projects essentially live independently (or that forks of them would) but you never know. I do admit that is a major downside.

I like Anaconda (and I recommend it as the easiest installation for data sci), but on OS X it is easy to install Python and relevant numerical packages with Homebrew and pip.

Currently, pip install of numpy (and scipy) does not pick up MKL, which you would want if you do lots of linear algebra and FFTs.

I have anaconda installed at work/windows and home/linux. One minor annoyance is that it doesn't play well (at all?) with virtualenv. They worked so hard to get a huge suite of off the shelf third party libraries integrated, it's curious that they would come up with a replacement for virtualenv.

I'm halfway hoping that I don't know what I'm talking about.

Right, I know about that. What I'm pointing out is that anaconda doesn't support http://docs.python-guide.org/en/latest/dev/virtualenvs/

It's nice that they have an env tool, but why didn't they just use the existing virtualenv?

Because using theirs means not having to recompile numpy (for example) with each environment.

Why doesn't anaconda's env work for you?

Not saying it doesn't work, just wondering why the duplication of functionality. Which you answered, thanks.

conda env also installs other dependencies for code. Really, the whole point of it all is that conda replaces pip and virtualenv.

Conda manages more than just Python packages; it handles other dependencies as well. For example, here's the list of packages in a new environment I created with:

  $ conda create -n hn python
  $ source activate hn
  $ conda list -c

setuptools and wheel are the only Python packages installed. The others are non-Python packages that conda is managing for my environment.

For all the talk of vendoring in package manager tools these days, I really, really like the way conda manages this stuff.

Buddha about Language Wars:

Then the Buddha gave advice of extreme importance to the group of Brahmins: 'It is not proper for a wise man who maintains (lit. protects) truth to come to the conclusion: "This alone is Truth, and everything else is false".' Asked by the young Brahmin to explain the idea of maintaining or protecting truth, the Buddha said: 'A man has a faith. If he says, "This is my faith", so far he maintains truth. But by that he cannot proceed to the absolute conclusion: "This alone is Truth, and everything else is false".' In other words, a man may believe what he likes, and he may say 'I believe this'. So far he respects truth. But because of his belief or faith, he should not say that what he believes is alone the Truth, and everything else is false.

(from "What the Buddha Taught", a really great book!)

> 'I believe this'. So far he respects truth

What if what he believes is objectively wrong?

He only talks about wise men who protect truth, so the truth being true is a given I'd say.

EDIT: Of course he's talking to people who swear by the truth they protect. Instead of telling them they're wrong, telling them others might be right is far more likely to get them to consider his point of view -- that other "truths" are just as equal.

I too am a Python-preferring machine learning engineer, but my reasons are almost precisely the opposite of the post's.

1. No matter how much you ever think, as a scientist, that you "only do an analysis one time" it is false 99.99999% of the time. You will always want to run it multiple times. Other people will want help modifying and running variations of it. Employers will need you, the scientist, to "productionize" it and make it suitable for automated deployment, probably cross-platform.

2. Your analysis will have to adapt to changing data inputs, which means you invariably have to create a (well-designed, unit-tested, and best-practices compliant) tool kit for custom data cleaning, pre-processing, database I/O, file system I/O, and visualization.

3. You will inevitably need to be concerned with raw-metal performance, but generally in isolated pockets of your code, so you'll need a language like Python that supports targeted performance optimization with tools like Cython.

4. Code is read (especially by newbies who need your help) much more than it is written, so you need a language that is easy to explain and reason about, with very few syntactical tricks and complicated conceptual nuances.

Overall, Python suits this niche very well. It is a full-service object-oriented language with a huge and well-maintained standard library. The third-party tools for machine learning and general numeric computing are by far the best in the open source world (apart from a handful of boutique R libraries, which you can use via rpy2 anyway), and Python is a simple language that is easy to teach and explain but also supports lots of targeted optimization in the CPython layer.

From the very first line of code you write, when you still naively believe "I will only run this once and I just need to crank it out," you need to be obsessed with writing well-designed, extensible, unit-tested code that is only a short distance from already being "production ready" -- and Python is a great language choice for this.
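Point 3 above usually starts with measurement rather than with Cython; a stdlib-only sketch of locating the "isolated pocket" first (the function here is a made-up hotspot, not anyone's real code):

```python
import cProfile
import io
import pstats

def hot_loop(n):
    # A deliberately slow pure-Python inner loop, the kind of isolated
    # hotspot you would later rewrite in Cython or vectorize with NumPy.
    total = 0
    for i in range(n):
        total += i * i
    return total

profiler = cProfile.Profile()
profiler.enable()
hot_loop(100000)
profiler.disable()

stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(3)
print(stream.getvalue())  # the report shows where the time actually goes
```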

As someone who's switched from Ruby to Python (for now, because the latter is far easier to teach, IMO) and has also put significant time into learning R because of how strong ggplot2 is... I was really surprised at the lack of Google results for "switching from python to r", or similarly phrased queries looking for guides on how to go from Python to R. In fact, that particular query brings up more results for R -> Python than the other way around (e.g. "Python Displacing R as The Programming Language For Data").

Talk of R is so ubiquitous in academia (and in the wild, ggplot2 tends to wow nearly on the same level as D3) that I had just assumed there was a fair number of developers who had tried jumping into R... but there aren't. I think minimaxir's guides are the most comprehensive (and nearly the only) how-to-do-R-as-written-by-an-outsider things I've seen on the web [1]. By and large, though, the common scenario is the author's: "Well, I guess it's no big secret that I was an R person once"

That said, one of the things I've appreciated about R is how it "just works"...I usually go through Homebrew, but RStudio works just as well. I can see why that's a huge appeal for both beginners and people who want to do computation but not necessarily become developers.

Also, I used to hate how `<-` was used for assignment... but now that's one of the things I miss most about using R. I've grown up with single-equals-sign assignment in every other language I've learned, but after having to teach some programming, I've seen that the difference between `==` and `=` is a common and often hugely stumping error for beginners. Not only that, they have trouble remembering how assignment even works, even for basic variable assignment. I've come to realize that I've programmed so long that I immediately recognize the pattern, but that can't possibly be the case for novices, who, if they've taken general math classes, have never seen the equals sign used that way. The `<-` operator makes a lot more sense... though I would never have thought that if I hadn't read Hadley Wickham's style guide [2]

[1] http://minimaxir.com/2015/02/ggplot-tutorial/

[2] http://adv-r.had.co.nz/Style.html
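The stumbling block described above reproduces in a few lines of Python; reading `=` as "becomes" rather than "equals" makes the last assignment stop looking like a false equation:

```python
x = 0           # assignment: the name x becomes 0
print(x == 1)   # comparison: prints False, and x is unchanged
x = x + 1       # not an equation to solve: "x becomes x + 1"
print(x)        # prints 1
```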

As a predominantly Python-focused engineer, I've spent considerable time teaching myself R and there is a lot to like about R. For boutique statistical libraries especially. For instance, the enjoyment of using PySTAN is nowhere near as high as simply using STAN directly from R.

However, when you dig into the R internals, and you learn about its generic function model of OO, and about the mangled history of S3 and S4 classes, it becomes very frustrating.

R mostly "just works" if you stick to the libraries. But if you want to really understand e.g. polymorphic dispatching and how you can design your own tools to use it, it's a deal-breaker pain in the ass in R. It just simply is not suited for real computer science situations when you need to design the software, rather than just making scripts that treat libraries as APIs.

Since the times when you need "just scripting" are about 0.00001% of real-world cases, it unfortunately means that as nice as R is, it's just not a good enough tool to standardize into a real-world workflow. You're way better off using Python, even if you have to give up easy access to certain libraries, re-write your own implementations, or kludge them on with tools like rpy2.

>Since the times when you need "just scripting" are about 0.00001% of real-world cases

I would argue the opposite is true.

Most of the time it's an analyst building regression trees in the marketing department of a retailer, or similar situations.

That is in fact an exact situation I've been in, where it had previously been organized as a bunch of ad hoc scripts, and it caused huge problems.

It is exactly all of these fractured cases where you're running analyses in some bespoke corner of a business that it is most critical that the analysis tool is productionized and unified.

Plenty of businesses don't do it this way, but their reliance on manual scripts, manual environment and dependency management, manually tracking which versions of code produced which results, and manually calibrating on historic test data all adds up to incredible problems that eventually require rewrites focused on a systematic architecture.

Vendors are standardizing on R, so I don't think the situation is as severe as you make it out to be.

Vendors tend to standardize on crap. How many vending machines do you see with salad or tofu instead of Doritos or soda?

Big multinational IT vendors are even worse than the vending machine example. They cater to situations where a firm has some internal power struggle over IT and the struggle is won by one side who gets the certified approval of a statusy external vendor.

For instance, most big, showy technology re-orgs around Hadoop are unequivocally bullshit. Most firms do not face data problems that are appropriate for Hadoop, not even when their primary enterprise data store approaches the raw size at which Hadoop becomes plausible, because the types of analytical workflows still tend to involve much smaller data.

These firms hardly adopt Hadoop for pragmatic engineering reasons, and engineers may even resist the re-org and offer compelling evidence that it is a waste of money, yet are ignored.

Vendor offerings are much more about catering to the CTO or some other manager level employees, and focusing on b.s. on-paper "milestones" that the product can help them achieve. Many times, merely just the installation and integration of these tools can result in a multi-hundred-thousand dollar bonus, or more, for people high up the IT food chain -- even before anyone has seen whether it will bear any fruit or even come close to being cost effective.

Vendors will do this with anything they can, so it's not surprising they would do it with R technology and they also absolutely do it with Scala, Python, Spark, etc. Big vendors peddling all of that truly are the hotel lobby snack machines of the software world.

I'm not saying it's bad to have R as part of a professional computing environment. R is a great tool. I'm saying that for most business situations, people who tend to make the assumptions necessary to justify using R are often very wrong, and the clunky and inconvenient support for professional computer science and software design in R bites them, making Python a much more pragmatic choice.

Whatever vendors do is pretty irrelevant to this.

I guess we have to agree to disagree.

R is indeed quite weird when you get to the internals. What's the problem you have with S3, though? It has its limitations, but since I first saw it, I've thought it was by far the single most convenient and intuitive OO system out there... But yes, you can't use it for everything. S4 is a bit ugly but still kinda workable -- the real problems for me start when you need to deal with environments and expressions and such, but is it really so often necessary to go there when you are designing your own tools (which I guess is what you're doing)?

As for the real-world percentage, it varies depending on the job -- R is primarily designed for the people who spend most of their time looking and thinking about the data, rather than the code.

It sounds a lot like Perl 5 back in the day: let's fuse sed, grep, awk, and bash into one thing. It's got a great deal of whipupitude; you can make amazing stuff in an afternoon.

The reason is that it's more common for a statistician to encounter a problem that needs switching to Python (either because there are Python tools for it, or because of a need to interact with non-statisticians who are using Python), than it is for a software engineer to get into statistical research.

It is quite telling that a lot of people here mention packages such as ggplot2 as the advantage of R -- ggplot2 is really quite awesome, but I can see it being similarly powerful had it been done for Python (maybe with a bit more unwieldy syntax, but that stuff is really subjective).

R really shines when you are doing actual data analysis -- things like the data.frame object, first-class missing values, proper attention paid to inference (something I quite often see missing from stuff written in Python or even Matlab), model syntax, estimation objects with actually useful pretty-printed summary, huge variety of statistical plots, everything designed to be convenient to use in the REPL, etc. -- and since these things are expected, you will usually find them in CRAN packages as well. Python (and Matlab) are IMO much less consistent in that regard.
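To make the first-class-missing-values point concrete: base Python only has float NaN, which silently propagates through aggregates; this stdlib-only sketch shows the manual filtering that R's `na.rm=TRUE` does for you:

```python
import math

data = [1.0, float("nan"), 3.0]

# NaN contaminates the aggregate without raising anything:
print(sum(data))  # prints nan

# The rough analogue of R's mean(x, na.rm=TRUE) is explicit filtering:
clean = [x for x in data if not math.isnan(x)]
print(sum(clean) / len(clean))  # prints 2.0
```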

But if your task is primarily to implement a particular type of analysis, put it in production to be easily repeated in the future, and go to next problem, R indeed offers little benefit over Python unless you rely on a specific package (which, btw, you should only do if you understand exactly what it does, because most R packages are written by statisticians, and as a result have very little fool-proofing).

You know, I remember when I was trying my first language other than BASIC (VB6, perhaps? or maybe 1995 era JS?) and it bugged me that "x = y" wasn't the same as "y = x". Remembering it as "LET x = y" was helpful.

People have been upset about that since at least the 1950s. A lot of people who make programming languages use the := for the assignment operator instead.

It helps when you pronounce the assignment operator as "becomes" rather than "equals", too.

Heh, that brings me back to when I was a kid, thinking, "x = x + 1? No it doesn't."

I've also seen this confuse folks who are just learning their first language, since = is a common assignment operator in typical "first languages" these days.

> it bugged me that "x = y" wasn't the same as "y = x"

it is in Prolog

There are a lot of options when it comes to machine learning frameworks, open source and commercial alike. Many of these frameworks are designed to solve a specific problem, and some are optimized to run on certain types of hardware. In my opinion, selecting a machine learning framework depends on your company's technology stack, because it makes a lot of sense to leverage existing systems rather than developing everything on an entirely new infrastructure and language.

why pick just one language? with the polyglot Beaker Notebook, you can work with many languages, even in the same notebook, and your data is automatically translated between them.

each of these languages has its strong point. there is always some library you want to use in some other language. or you want to collaborate with someone. or next year you change your mind and Julia is finally good enough.


Knowing other languages is great, but there is very real overhead in learning them. Python is mature enough that you have mature libraries available for pretty much everything.

believe it or not, there are some people who know R and face overhead to learn Python. and they love ggplot2.

and frankly, Python has no libraries for interactive visualization in your browser because only JavaScript runs there.



"Bokeh is a Python interactive visualization library that targets modern web browsers for presentation."

(Of course, it works by generating JavaScript)

right, and in order to really customize what bokeh does you need to write JS in strings in your python code, so this just proves my point.

http://bokeh.pydata.org/en/latest/docs/user_guide/interactio... see the CustomJS function.

No you don't. It now does python to JS compilation. And that's if you want client side callbacks.

> Python has no libraries for interactive visualization in your browser because only JavaScript runs there

Isn't that like saying that C has no visualization libraries on Intel processors because only assembly runs there?

I really like Beaker. However Jupyter also caters to a wide range of languages, albeit one language per notebook. So the main differentiator for Beaker is that it can auto translate data between languages. However there is a giant catch to that: it is done by serialising the data which intrinsically means it is extremely expensive for any significant sized data set. As a result, in most cases I end up writing my data out to a file in one language, then reading it in again in another, even when I'm using Beaker, because the auto-translation is just too inefficient to do through Beaker's process.
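The file-based workaround described above can be as simple as a flat-file round trip between cells; this is an illustrative stdlib sketch, not Beaker's actual translation mechanism:

```python
import csv
import os
import tempfile

# One language's cell writes the data set out to a shared file...
path = os.path.join(tempfile.mkdtemp(), "handoff.csv")
rows = [["id", "value"], ["1", "3.14"], ["2", "2.72"]]
with open(path, "w", newline="") as f:
    csv.writer(f).writerows(rows)

# ...and another language's cell reads it back, bypassing the
# notebook's expensive per-object serialization entirely.
with open(path, newline="") as f:
    assert list(csv.reader(f)) == rows
```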

However both Beaker and Jupyter lag behind RStudio in raw usability and convenience, especially for plotting. There doesn't seem to be any tool in this vein that "has it all" right now.

thanks. indeed autotranslation currently only works for small data. fixing it with shared memory and hdf5 is on the agenda for later this year: https://github.com/twosigma/beaker-notebook/wiki/Roadmap

I would really like to hear what usability and convenience advantages RStudio has for you...

Reflecting on it, what I think I find most attractive about RStudio is two things:

1) the IDE like environment. Although I do a lot of work "literate document" style, there's a very fluid relationship between the one-off / throw away code that ends up in those documents and the code that ends up becoming reusable. RStudio puts the interactive environment into my hands but gives me tabs with full documents to work with so that I can take those ad hoc bits of code and seamlessly move them into reusable scripts, or even fully structured libraries and program code. Conversely, any piece of code in those scripts or libraries is trivial to throw back into my interactive console to use ad hoc as well. This fluid workflow perfectly matches my development style: starts with experiments to figure out how to do things which is very ad hoc, gradually matures and eventually ends up being a static solid thing that I want to capture as a reusable component. With Beaker and Jupyter I feel like I am always stuck in "throw away code" mode, and there's a high friction to move code between the two paradigms.

2) The plot display seems to work a lot better. With both Beaker and jupyter I find the inline plots are too static - always the wrong size or the wrong shape, etc. In RStudio I can drag and resize that window independently.

I feel bad because I'm not telling you solutions and I don't really have any - but these are the things I love about RStudio. Of course, the downside of RStudio is ... R.

don't feel bad, it's very useful to hear what people need. feedback guides development. so thank you very much.

i hear you on 1 -- agreed & some help is on the agenda: https://github.com/twosigma/beaker-notebook/issues/3688. and then editing regular flat files for inclusion via the backend's native import (even into non-beaker programs).

on 2, yes many plotting APIs have that problem. beaker's native plotting library produces a thumb so you can drag to resize: https://pub.beakernotebook.com/#/publications/568ddaed-b648-...

    I know what you want to ask next: "Okay, what about turning my model into a nice and shiny web application? I bet this is something that you can't do in R!" Sorry, but you lose this bet; have a look at Shiny by RStudio, a web application framework for R.
True, but it is a dumpster fire.

Can we get "(2015)" on the title?
