Python, Machine Learning, and Language Wars. A Highly Subjective Point of View (sebastianraschka.com)
167 points by Lofkin on Aug 24, 2015 | 93 comments



As someone who almost exclusively uses Julia for their day-to-day work (and side projects), I think most of the author's thoughts about Julia are correct. I think the language is great, and using it makes my life better. There are some packages that are actually better than any of their equivalents in other languages, in my opinion.

On the other hand, I've also got a higher tolerance for things not being perfect, I can figure things out for myself (and luckily have the time to do so), and I'm willing to code it up if it doesn't already exist (to a point). Naturally, that is not true for most people, and that's fine.

The author isn't willing to take the risk that Julia won't "survive", which is fair. It's definitely not complete yet, but it's getting there. I am confident that it will survive (and thrive) though, and continue growing the not-insubstantial community. I have a feeling the author will find their way to Julia-land eventually, in a couple of years or so.


As a fellow Julian, I agree. To reiterate what I said during my JuliaCon talk, I never truly understood what Rubyists meant when they said "optimised for happiness"; Ruby never had that effect on me and I remained a happy Pythonista. However, Julia makes me feel this way and got me "hacking" again after having been in a bit of a slump.

That being said, we need to realise how young Julia actually is. Last Sunday marked six years since the first commit and the language has only been publicly available for three years. Julia is as old as Python was in 1994, or as old as Ruby was in 1998. Things are rough around the edges, but getting better. Things are missing in order for you to get productive quickly, but it is improving rapidly.

I do tell people around me that Julia is most likely the language that I love the most; I tell them about the features, my involvement, etc. But when they say "That sounds awesome! Should I use it?", I hesitate and say, well, it depends. Are you already very productive in what you are currently using? Is speed an issue for you? If not, don't feel rushed. If you want to be a "pioneer" you certainly can be, and we will welcome you. But consider your own situation before jumping ship: what do you need to be productive in your day-to-day job? However, keep an eye on Julia, because I am convinced that the cost of adoption for you will continue to go down, and if there is any language that has the chance to become "just right" for Machine Learning over the next couple of years, it is most likely going to be Julia.


Thanks for the comment (I am the author of this article).

> I have a feeling the author will find their way to Julia-land eventually, in a couple of years or so.

I have a strong feeling that this will eventually happen :). In an ideal, less busy world, I would love to use Julia alongside Python to explore and battle-test it further, or even to develop useful packages, libraries, and functions for it. The truth is, I am currently lacking the time to do that :(. Python works for me, and I am currently more focused on the scientific problem solving itself. When I say that Python works for me, I mean that I am happy because it can do everything I need; this doesn't mean that Julia couldn't do certain things better ;).

Anyways, I really like your comment. I am wondering if you would be okay with it if I include it in an "Other people's experiences and opinions" section at the bottom. I think this would be extremely helpful for people who are new to the "data science field" -- my article is strongly biased towards Python as you noticed :P


I think that the popularity of Julia is exploding (but I'm biased - I am writing an... interesting library for Julia right now).

"There is really nothing wrong with R"

I think there is one thing wrong with R - its name. Pretty much impossible to quickly google for help on it.


Sure, go ahead!


I've used Python professionally for 8 years, and it's my favorite language. I have used numpy and scikit-learn a little bit. That said, I've really enjoyed learning Julia recently. It's been easy to learn and it really does perform well (read: it's fast). In fact, I think learning Julia has been about as much work as learning something like numba would be, and gives similar (some say slightly better?) performance.


I quickly glanced at your bio. Julia is your day-to-day work? Count yourself exceedingly lucky!


Could you name some of the packages you think are better than any of their equivalents?


And there is nothing wrong with C++. For linear algebra I use the Armadillo library and it's really a nice wrapper around LAPACK and BLAS (and fast!). For some reason scientists are somewhat afraid of C++. For some reason you "have to" prototype in an "easier" language. Sure, unlike interpreted languages, you can't use C++ as a calculator, but I see people getting stuck with their computations in the prototyping language and never bringing them to a faster platform.

Point being: C++ is not hard for scientific calculations.


I agree with you. However, note that many people who are using Python for writing scientific code make use of C/C++ in one way or the other (aside from NumPy, SciPy, and Theano). For example, many people write the "most intensive" computations down in C/C++/Cython if they call those functions frequently -- Python becomes a wrapper. One example that pops into my mind is khmer (https://github.com/dib-lab/khmer)


This is a common refrain: drop down to C, C++, Fortran for the computation-intensive parts. It works, but only to a degree. The inefficiencies lie in the vectorization semantics of the host language(s), which lead to extra copies and extra levels of indirection. So this dual-language mode of operation typically does not approach what one could have obtained had one disposed of the baggage entirely, except for I/O. Usually, in the quest for better speed, that is what remains as one moves progressively larger portions of the application into the C, C++, Fortran part of the code. A reason I like Julia is that I can largely avoid this dual-language annoyance, and enjoy the succinctness of pithy vectorized expressions using https://github.com/lindahua/Devectorize.jl
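To make the "extra copies" point concrete, here is a minimal NumPy sketch (NumPy rather than Julia, and not how Devectorize.jl works internally). It assumes NumPy is installed; the function names `fused_naive` and `fused_inplace` are made up for illustration. The vectorized expression allocates a temporary array for the intermediate product, while the second version reuses a preallocated buffer.

```python
import numpy as np

def fused_naive(a, b, c):
    # a * b allocates a temporary array; adding c allocates the result array.
    return a * b + c

def fused_inplace(a, b, c, out):
    # Write every intermediate into the caller's buffer, so nothing new is allocated.
    np.multiply(a, b, out=out)
    np.add(out, c, out=out)
    return out

a = np.arange(5, dtype=np.float64)
b = np.full(5, 2.0)
c = np.ones(5)
buf = np.empty_like(a)

result_naive = fused_naive(a, b, c)
result_inplace = fused_inplace(a, b, c, buf)
```

Both versions compute the same values; the difference is only in how many arrays get allocated along the way, which is the overhead the comment above is describing.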


The speed performance of C/C++ is terrible, compared to GPU speed, which is often 50x to 100x faster. The goal has shifted and continues to shift in that direction.

C/C++ is no longer best for speed, not even close.

Now, all that matters is which language has the libraries that make it easiest to get custom code onto the GPU. Python and Lua seem to be winning there, by far.


>Now, all that matters is which library makes it easiest to get custom code onto the GPU. Python and Lua seem to be winning there, by far.

This is interesting. How is it possible that Python and Lua have more efficient wrappers around GPU libraries? Also, there are many GPU libraries for C/C++ too. Armadillo can use NVBLAS as a backend too. I'm not sure I get your point about C/C++ being slow.


It's not about wrapping. The real power is in the cross-compilation of expressions and entire complex data pipelines, from a simple-as-possible high-level language into GPU language. That's the power at the core of, e.g. Theano.


Compiling high-level instructions to different hardware backends is hardly an exclusive feature of Python and Lua libraries. Google would swamp you with hits if you were to search.


Show me one C/C++ library that competes with Theano or Torch7?

Google / Facebook and many other huge companies are using Theano and Torch7 in production, at scale. The ML industry has been continuously moving in this direction for years now.

On these optimized ML systems, only a tiny fraction of CPU time is spent outside of the GPU. The goal in many of these companies is to migrate all tasks that can be done on GPUs to GPUs, as soon as possible. It's far faster and more cost efficient.


Do you have some sources demonstrating that Google and Facebook are using them in scaled production? My impression was that presently these were more for research and prototyping.

I would have thought that if you were going to run prod systems on the GPU you would actually write CUDA (C++) or similar to avoid the inefficiency of the abstraction layer.

(also, this comment is bordering on the uncivil).


> show me ...

I can only bring the horse to the water (or Google as it's sometimes called these days) :)

>Google/Facebook and many other huge companies ... You are totally wrong.

That totally settles it then, thank you. Who am I to argue? Surely there are no ML jobs that spend time outside of the GPU.


Did you just compare a language with a piece of hardware? I hope you realize:

(i) how nonsensical such a comparison is

(ii) there are many algorithms for which a GPU offers no speedup at all; in fact, the data transfer can actually hurt. There are instances where using the CPU's SIMD instructions makes more sense than using a GPU.


You don't have to use the vectorization though. I prefer to work with Cython and just use pointers. You don't get to write vector + vector --- you have to actually write the loop --- but that's not such a big deal imo.
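As a rough illustration of "writing the loop" instead of `vector + vector`, here is a sketch in plain Python rather than Cython (no pointers or type declarations, so it shows only the shape of the code, not the speed). The function name `add_vectors` is hypothetical.

```python
def add_vectors(x, y):
    """Element-wise sum written as an explicit loop over indices."""
    if len(x) != len(y):
        raise ValueError("length mismatch")
    out = [0.0] * len(x)
    for i in range(len(x)):  # range(len(x)) covers indices 0 .. len(x) - 1
        out[i] = x[i] + y[i]
    return out

summed = add_vectors([1.0, 2.0, 3.0], [10.0, 20.0, 30.0])
```

The loop bounds are the part that is easy to get wrong by hand, which is the trade-off being discussed.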


I am lazy and super prone to making off-by-one errors, so that to me is the _biggest_deal_ TM :)


Numba for Python also has a no-copies vectorization mode.


I believe S (ancestor of R) started as C glue at Bell Labs. I've heard it said a few times that R is slow, but that doesn't really make much sense if your bottleneck routines are R calls to C++.


I see this a lot in my lab (days/weeks to run computations and analysis that could take minutes/hours), and I used to offer to help the postdocs/PhD students port their code from Matlab to C/C++, but I've mostly given up on offering unless they ask for help, or if it seems like fun. I also think it's because they mostly think they won't work on these things again (which is kind of weird to think about since they spent most of their life working to get to this point, but that's another conversation).

As an aside, I think Armadillo is pretty great, especially for going back and forth with other matrix libraries. It is also a nice wrapper around OpenBLAS, which made not running something on a cluster more bearable.


I saw you mention Armadillo a few times; it seems you are having a lot of fun with it. Another somewhat look-alike library is Eigen; it's pretty nice too. These are all modern takes on the original sin, Blitz++, which is actually more full-featured in what it can represent than Eigen, Armadillo, uBLAS (by Boost) and Blaze: it supports multidimensional arrays as opposed to just (2D) matrices. Where Blitz++ falls short of the competition is full use of SIMD instructions. G++ does a plenty good job of vectorization, but the king of the hill is still Intel's compiler.


I am not sure about Intel's compiler, but OpenBLAS is pretty much on par with Intel MKL in most benchmarks I have seen recently, e.g. [1][2].

The thing that I like with Armadillo is that you can just pick the BLAS implementation that is the fastest on your platform and run with it, while still being high-level compared to BLAS.

[1] http://gcdart.blogspot.co.uk/2013/06/fast-matrix-multiply-an...

[2] https://github.com/tmolteno/necpp/issues/18


> you can just pick the BLAS implementation that is the fastest on your platform and run with it

Just in case this is useful info, Eigen and Blaze work the same way; Blitz++ doesn't. Blaze seems to be doing the best in the benchmarks.

> OpenBLAS is pretty much on par with Intel MKL

I won't be surprised. I have seen ATLAS beat MKL on some BLAS functions on my runs. Sparse matrix multiply is particularly badly implemented in MKL (or at least was). One could beat their sparse multiply by using multiple instances of their own sparse matrix-vector multiply.

But the reason I mentioned ICC is that there is much more to SIMD than just linear algebraic operations. Non-linear operations that come up quite frequently in stats/ML are sin, cos, exp, log, tanh etc. on vectors. GCC/G++ does a decent job now (the best part is that it emits info on why it wasn't able to vectorize a particular loop, which lets you restructure the loops to aid the compiler), but ICC still rules.


> Just in case this is useful info, Eigen and Blaze works the same way, Blitz++ doesn't.

I was under the impression that Eigen only works with MKL?

> Enables the use of external BLAS level 2 and 3 routines (currently works with Intel MKL only)

Source: http://eigen.tuxfamily.org/dox/TopicUsingIntelMKL.html

> But the reason I mentioned ICC is that there is much more to SIMD than just linear algebraic operations.

Indeed. I compiled some stuff with ICC, but it did not give me a tangible improvement over the latest GCCs. But for those projects, linear operations were the vast majority.


It's a standard BLAS library API that you are linking to, so the vendor doesn't matter so much (in theory). In practice, the amount of annoyance that you have to go through so that it links correctly can vary a lot. Even MKL did not implement all of BLAS/LAPACK, and that would lead to some linking errors.

Re GCC, the same here. ICC is mighty expensive.


I know that BLAS is a standard API, I switch between BLAS implementations fairly often :). I suspected that Eigen relied on some non-BLAS functions in MKL, since they explicitly state that only MKL works.


> I switch between BLAS implementations fairly often

... that makes us co-sufferers then, or brothers in masochism :) if you will


Not so surprising that OpenBLAS beat MKL on an AMD processor. Although it's definitely keeping up.


For machine learning / neural nets in C++ there is also Caffe (for ConvNets) and CNTK (for almost any type of network).


Writing scientific code is hard both in C++ and python, but reading and manipulating data and visualizing it is way easier in python. It's also easier to install and use third party libraries.


It's unimaginable for me to do most of this stuff without a REPL. That's the show-stopper for me with C++.


Maybe there will be a working REPL for C++ in the future:

https://github.com/vgvassilev/cling


I switched from mostly using R to Python about a year ago for gluing together my data pipeline (from data source all the way to production models and frontends/visualizations). It hasn't really impacted what I'm capable of doing or my productivity, except the standard extra googling that comes in the first couple years I use any language.

The main reason I went for Python is purely practical: it's a language people outside my team will respect and deal with. It makes it easier for me to collaborate in many different ways: share tools with other teams, transfer ownership of my code, get help when I need it, etc. Data science at some companies has the reputation of "hack something together and throw it over the wall for someone else to deal with". In my experience R only furthers this reputation. Which is too bad, it's really great at what it does.


Yeah, there is a large community of Python users in scientific computing. It's great.

I like well-established languages with a large user base.

So, I was dismayed by Big Data Genomics' ADAM Project's choice of Scala, which has almost no uptake in the genomics/bioinformatics community.

They do it because they run over Spark. But Spark has an excellent Python binding.


Python's days have come and gone.

Computation has grown more complicated. They need real computer scientists and a real language that supports real development, not some scientists who learned just enough Python to automate running some 20-year-old Fortran code.


So Julia then :)


The part about sharing makes a lot of sense since Python use is so widespread. The throwing-over-the-wall effect isn't a language-specific issue; it's more of a work culture issue. Seems to me if you practice "literate programming" with R Markdown you can greatly improve the sharing aspect and reduce the throw-it-over-the-wall issue.


Totally agree about it being a cultural thing, have first hand experience at some of the usual suspects. I have a far more cynical label for it: deliver your turd (typically formed in MATLAB) for someone else to polish. It is surprising how common this is in some places and groups. I mention groups because when the group moves from one place to another it brings that turd polishing culture along.


I went through the very same process :). I really like your comment; you highlight something that I forgot to mention with this clarity: "It makes it easier for me to collaborate in many different ways: share tools with other teams, transfer ownership of my code, get help when I need it". Would you mind if I add it to an "other people's experiences" section at the bottom of the article?


Definitely feel free to add that!


"for gluing together my data pipeline"

Yep. This is exactly why I gravitated toward python and scikit-learn. So much of data science is just getting the data into the right format. Grab it from this file, and this database, and this web service, then get it formatted into this table structure, and then clean it with this filter, and then plug these holes this way and those other holes that way, and now you're finally ready for a random forest baseline.

Python is a really good language for merging and parsing data from lots of different sources. For many problems, a general-purpose language with very good data science libraries may actually be the better choice than a dedicated data science language/environment.
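The "grab it from this file and this database, then plug the holes" workflow can be sketched with nothing but the standard library. This is only an illustration under made-up assumptions: the CSV text, the `labels` table, the column names, and the mean-fill rule are all hypothetical stand-ins for real sources and real cleaning decisions.

```python
import csv
import io
import sqlite3

# Hypothetical input: a CSV export where one age value is missing.
CSV_TEXT = "id,age\n1,34\n2,\n3,51\n"

def load_ages(text):
    """Parse the CSV; empty age cells become None."""
    rows = csv.DictReader(io.StringIO(text))
    return {int(r["id"]): (int(r["age"]) if r["age"] else None) for r in rows}

def load_labels(conn):
    """Read id -> label pairs from the database."""
    return dict(conn.execute("SELECT id, label FROM labels"))

def build_table(ages, labels):
    """Join on id and fill missing ages with the mean of the known ones."""
    known = [a for a in ages.values() if a is not None]
    mean_age = sum(known) / len(known)
    return [
        (i, ages[i] if ages[i] is not None else mean_age, labels[i])
        for i in sorted(ages)
        if i in labels
    ]

# Hypothetical second source: an in-memory SQLite table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE labels (id INTEGER, label TEXT)")
conn.executemany("INSERT INTO labels VALUES (?, ?)", [(1, "a"), (2, "b"), (3, "a")])

table = build_table(load_ages(CSV_TEXT), load_labels(conn))
```

In practice pandas would usually replace the hand-written join and fill, but the steps are the same ones listed above.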


I've been interviewing candidates recently and 'I just prototype in matlab and throw it over the wall to the implementation team' is a big negative in my evaluations (we don't have a wall in this company and I will fight to keep it that way).


Octave/Matlab are "great" but good luck trying to integrate them into a production web application. Since you can't really do that, avoid using them unless you are fine with implementing the same algorithm twice. Matlab licenses also cost money, and the toolboxes cost additional money.

R is useful because there are a lot of resources, as it has been around for so long and is used by a large portion of the stats community. It also has a lot of useful libraries that have not been ported over to other languages yet (ggmap!!!). But you still run into the same problem that you cannot integrate R into a production web application.

I am pretty sure Hadoop streaming does not support R, Octave, or Matlab either.


I'd like to kindly challenge the notion that you can't integrate R into a web application. I've started using R to power jobs that are used by a large web application. The R packages httr or RCurl make it pretty easy to make HTTP requests (enabling me to send things to a web server to be consumed into a database and run by back-end code). It's also possible to prepare data in R and then send it to a space like S3 with a system("s3cmd sync some-data s3://some-data") call. I've also been using Python a good bit lately. I don't see either as having a universal advantage for a data pipeline.


I was once quite surprised that R was missing any kind of RESTful service package for exposing R functions as a REST service. What I was looking for was basically some way to invoke a couple of R functions from the web application. In the end I managed to do it with some JRI (part of rJava) bindings.

Maybe it's a good idea to implement such a thing.


I was just introduced to yhathq.com. It's a platform that allows you to host your R function on the web and exposes an API. Seems quite interesting, and I'm wondering if it helps solve R-based web app problems.


   Octave/Matlab are "great" but good luck trying to integrate them
   into a production web application
What problems are you facing with Octave? It has, in fact, been integrated into a couple of production web applications I know of:

https://octave.im/

http://octave-online.net/

https://www.rollapp.com/app/octave

I have promised a while ago to improve its Python integration so that Python and Octave can be in the same process (there are lots of advantages to that kind of tight integration instead of relying on parsing output through pipes). Perhaps that could help you?


wow - i have never seen this. thanks for the links! (i take back my octave comment). that would help me


R in general is not a good stream language. It is much better for batch processing (using "vectorisation"). Anyway it doesn't mean you can't stream process data with R! Spark or Flink provide some means for grouping streams into windows that fit R very well.
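The idea of grouping a stream into windows so that each window can be handed to batch-style (vectorised) code can be sketched in a few lines. This is a plain Python illustration, not the Spark or Flink API; the helper name `tumbling_windows` and the window size are assumptions made for the example.

```python
from itertools import islice

def tumbling_windows(stream, size):
    """Yield consecutive, non-overlapping lists of up to `size` items."""
    iterator = iter(stream)
    while True:
        window = list(islice(iterator, size))
        if not window:
            return
        yield window

# Each window is processed as one small batch; here the "batch job" is a mean.
window_means = [sum(w) / len(w) for w in tumbling_windows(range(1, 8), 3)]
```

The last window may be shorter than the others when the stream length is not a multiple of the window size.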

But I agree that integrating R with something else is unexplored terrain. It seems that R rather tends to be everything: from data acquisition to visualisation.


I just completed the Coursera data science track, which took me from a complete R newbie to being at least somewhat proficient. Having previously used Python for quite a bit of web programming, I disliked R at first except for its power in statistical programming. But I've since discovered a number of great R packages that make it a pleasure to use for things I would normally turn to Python for. For example, I recently discovered the rvest package for web scraping.

Data visualizations with R seem vastly superior, unless I am missing something with Python (highly likely). And putting up a slick statistics app is easy with shiny or RStudio Presenter. But R can't really scale to a large production app, isn't that right?

So I feel I need to keep working with both Python and R.

Added: That's a nice list, Lofkin. Thanks. Also, in the article he says that Python syntax feels more natural, which I also felt. But then I started to use things like the magrittr and dplyr packages in R, which give you nice things like pipes, and that feeling starts to ebb.


For stats plotting in python: https://github.com/mwaskom/seaborn https://github.com/yhat/ggplot

For stats plotting and web apps in python: https://github.com/bokeh/bokeh

For calling r libraries in python: https://pypi.python.org/pypi/rpy2

For out of core datasets in python: https://github.com/blaze/dask https://github.com/blaze/blaze


Nice collection, let me add one more item to this list:

Seaborn: statistical data visualization: http://stanford.edu/~mwaskom/software/seaborn/


You bring up a good point in favor of R: Hadley Wickham and the rest of the RStudio people.

Packages like {ggplot2,rvest,dplyr,devtools, etc.} are basically creating a sub-language for R.

I use both at the moment, but I echo the OP's ideas that R's target audience is statisticians, where Python's target audience is broader and includes statisticians and computer scientists. And Python's syntax is nicer to work with. That's why it's become the primary glue language.

That said, the overhead for learning Python as your first data science language is a bit problematic for me, as you basically have to learn Python followed by Python's data science tools (pandas, matplotlib, etc.), whereas with R, you're learning the language and the data science tools at the same time, even if they're a bit idiosyncratic.


>I think it [Perl] is still quite common in the bioinformatics field though!?

That's true - many day-to-day tasks in bioinformatics are more or less plain-text parsing [1], and Perl excels in parsing text and quickly using regular expressions. "My" generation of bioinformaticians doing data cleanup and analysis (20-30) uses Python, sometimes because plotting is nicer, the language is easier to get into, it's more commonly taught in universities, or other reasons - people older than that normally use Perl.
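As a small illustration of the plain-text parsing described above, here is a sketch that reads FASTA-style records with a regular expression. It is a toy parser for illustration only (Biopython and BioPerl handle the real format far more robustly); the sample records and the name `parse_fasta` are invented.

```python
import re

# Invented sample data in FASTA-like layout: a '>' header, then sequence lines.
FASTA = """>seq1 Homo sapiens
ACGTAC
GTTA
>seq2 Mus musculus
TTGACA
"""

# Header line: '>' then an identifier, then an optional free-text description.
HEADER = re.compile(r"^>(\S+)\s*(.*)$")

def parse_fasta(text):
    """Return {identifier: sequence}, joining wrapped sequence lines."""
    records = {}
    current = None
    for line in text.splitlines():
        match = HEADER.match(line)
        if match:
            current = match.group(1)
            records[current] = ""
        elif current is not None and line.strip():
            records[current] += line.strip()
    return records

records = parse_fasta(FASTA)
```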

Both BioPython and BioPerl are extremely useful.

[1] Relevant quote from Robert Edgar: "Biology = strcomp()" from https://robertedgar.wordpress.com/2010/05/04/an-unemployed-g...


Thanks for the insights! Also here, this comment would make an interesting addition to a "Feedback" section at the end of the article to give people a broader view on this topic. May I have your permission to post your comment below the article?


Of course you have my permission :)


(Minor typo corrections; "Biopython", not "BioPython" and "strcmp" not "strcomp". I co-founded Biopython. We chose to avoid CamelCase.)


Sorry about that, that's what happens in caffeine-less typing - I sadly can't fix it at this point anymore, edit is disabled now


Andrew Ng said in the Coursera Machine learning class that according to his experience, students implement the course homework faster in Octave/Matlab than in Python.

But yes, the point of that course is to implement and play around with small numerical algorithms, whereas the linked blog is about someone who mainly calls existing machine learning libraries from Python.

Ref. https://news.ycombinator.com/item?id=4485877


Interesting. In my own experience trying to implement the same image processing algorithms in Matlab vs. numpy, the work took about the same amount of effort any time arrays were limited to 1-2 dimensions, all the code was simple numerical stuff, and it wasn’t necessary to break the code up into multiple functions.

The Matlab one-file-per-function thing, the lack of namespaces, and general lack of code structuring primitives make it much less pleasant than Python for programs bigger than about 100 LOC though.

Dealing with higher-dimensional arrays, more sophisticated plotting, data munging, string processing, interfaces with external systems, etc. all left me banging my head in Matlab though, whereas Python makes it all a breeze.

Numpy’s broadcasting feature is also super nice, compared to wrapping everything in bsxfun calls in Matlab.
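For readers who have not seen broadcasting, a minimal sketch (assuming NumPy is installed; the array values are arbitrary): arrays of shape (2, 3) and (3,) combine directly, with the smaller one stretched across the rows, so no explicit expansion call is needed.

```python
import numpy as np

data = np.array([[1.0, 2.0, 3.0],
                 [4.0, 5.0, 6.0]])       # shape (2, 3)
col_means = data.mean(axis=0)            # shape (3,)

# The (3,) array is broadcast across both rows automatically.
centered = data - col_means

# Same idea along the other axis: keepdims=True gives shape (2, 1),
# which is broadcast across the three columns.
row_means = data.mean(axis=1, keepdims=True)
row_centered = data - row_means
```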

I wonder how much the @ operator in Python 3.5 will help students. Hopefully numpy can deprecate and phase out their "Matrix" object, and end the confusion about the meaning of basic operators.
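A short sketch of what the `@` operator changes, assuming Python 3.5 or later and a NumPy version that supports it: with plain ndarrays, `*` stays element-wise and `@` is the matrix product, so there is no need for a separate matrix type to get readable linear algebra.

```python
import numpy as np

A = np.array([[1, 2], [3, 4]])
B = np.array([[0, 1], [1, 0]])

elementwise = A * B   # element-by-element product
matmul = A @ B        # matrix product; same result as np.dot(A, B) for 2-D arrays
```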


I would imagine, for a beginner, in Python the difference between a list and a numpy array can be confusing. And in numpy, column vectors and row vectors are a different thing. Whereas in Octave/Matlab "everything just works" in the do-what-I-mean sense.
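The list-versus-array confusion mentioned above is easy to show (a minimal sketch, assuming NumPy is installed): the same operators mean different things for the two types.

```python
import numpy as np

as_list = [1, 2, 3]
as_array = np.array(as_list)

list_sum = as_list + as_list       # lists: concatenation
array_sum = as_array + as_array    # arrays: element-wise addition
list_times = as_list * 2           # lists: repetition
array_times = as_array * 2         # arrays: element-wise scaling
```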

> The Matlab one-file-per-function thing

By the way, Octave does not have that limitation.


> And in numpy, column vectors and row vectors are a different thing. Whereas in Octave/Matlab "everything just works" in the do-what-I-mean sense.

I think this is actually one of MATLAB's biggest flaws. Without a true 1D array like numpy has, there is no way in MATLAB to tell the difference between a 1D sequence of values, and a 2D sequence of values with only one value along one of the dimensions.
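The distinction being described can be seen directly in the shapes (a minimal sketch, assuming NumPy is installed): a true 1-D array has a single dimension, while row and column vectors are 2-D arrays that happen to have length 1 along one axis.

```python
import numpy as np

v = np.array([1, 2, 3])     # true 1-D array, shape (3,)
row = v.reshape(1, 3)       # 2-D row vector, shape (1, 3)
col = v.reshape(3, 1)       # 2-D column vector, shape (3, 1)

shapes = (v.shape, row.shape, col.shape)
dims = (v.ndim, row.ndim, col.ndim)

# Transposing a 1-D array leaves it unchanged; there is no orientation to flip.
transposed_shape = v.T.shape
```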

This has led function developers to try to guess. But they guess inconsistently. Some functions treat row and column vectors differently, some treat them the same. Of those that treat them the same, some return them with the same orientation, while others force a particular orientation. Some operations ignore dimensions (length), others don't (for loops). Some maintain dimensions (size), some don't ([:]).

So everything may seem to work, until your code that has been working fine for years suddenly breaks, and you realize it is choking up because one of your experiments has only one trial, or one of your experiments has multiple trials each with one result, and some of the functions you are using start reacting differently to this. Then you have to go through each function and figure out on a case-by-case basis how it handles row and column vectors.

Or worse yet, it seems to run fine, but is silently doing the wrong thing. Which you probably would never know, because most MATLAB code isn't unit-tested.


Just want to leave a comment about the "Matrix" object. I remember the first day I got started with NumPy. Coming from MATLAB, I really felt tempted to use it. However, I also saw a recommendation (can't remember if I read this in the NumPy documentation or in an online forum) that ndarrays are more widely supported. I never really bothered about `Matrix` then, and honestly, I have never ever seen anyone using it. Maybe you will find some examples on GitHub if you really search for it, but it is practically non-existent.


Interesting! It's hard for a single person to tell, because you can only start one way digging into machine learning. However, from a teacher's perspective, this is a useful observation. I also started with Matlab since it was the language that was used in my classes. However, I think this is also a little bit context dependent: For someone who has never programmed before, Matlab may be more intuitive. I think in an ideal world, you should let your students choose what language they want to use to solve the problem :).


   For someone who has never programmed before, Matlab may be more
   intuitive.
Yes, this is Matlab's target audience: programmers who will not call themselves "programmers". In recent versions, they have tried even harder to hide the code away from the user, by trying to make everything work by clicking on buttons. I have heard many Matlab users call themselves "not a programmer". They don't feel like writing software is what they're doing when they're using Matlab.


On balance, I've met people who claim they are intermediate programmers and use matlab only, and are freaked to shit when they see higher order functions in other languages, or even the idea of passing a "function", like, say, a pointer to a function, into another function, even though function pointers have existed since C.
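For anyone unsure what "passing a function into another function" looks like, here is a minimal Python sketch using only the standard library. The helper `apply_twice` is a made-up name for illustration; `map` and `functools.reduce` are the usual map and fold.

```python
from functools import reduce

def apply_twice(func, x):
    """Take a function as an argument and call it two times."""
    return func(func(x))

squares = list(map(lambda n: n * n, [1, 2, 3, 4]))    # map: transform each item
total = reduce(lambda acc, n: acc + n, squares, 0)    # fold: combine into one value
shifted = apply_twice(lambda n: n + 3, 10)
```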


They exist in Matlab too. They're called function handles. Matlab has functionalish functions like `arrayfun`, `cellfun` (akin to `map`) or `accumarray` (akin to `fold`), but the Matlab documentation prefers to teach loops, presumably because they're easier to understand, and usually faster in Matlab thanks to their JIT compiler.

I'm surprised that people are so surprised that you can pass function handles around, since there are some Matlab functions for solving ODEs or root-finding that are very commonly taught in intro courses that require function handles. I guess not everyone's first exposure to Matlab involves solving ODEs or root-finding.
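Root-finding shows why such solvers need a function as input: the algorithm has to evaluate an arbitrary function at points it chooses itself. Below is a plain bisection sketch in Python (not Matlab, and not any particular library's solver); the name `bisect_root` and the default tolerance are assumptions for the example.

```python
def bisect_root(f, lo, hi, tol=1e-10):
    """Find a root of f in [lo, hi] by bisection.

    f(lo) and f(hi) must not have the same sign.
    """
    f_lo = f(lo)
    if f_lo * f(hi) > 0:
        raise ValueError("f(lo) and f(hi) must have opposite signs")
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        f_mid = f(mid)
        if f_lo * f_mid <= 0:
            hi = mid               # sign change (or exact root) lies in [lo, mid]
        else:
            lo, f_lo = mid, f_mid  # sign change lies in [mid, hi]
    return (lo + hi) / 2.0

# The function to solve is passed in as an argument, like a Matlab function handle.
root = bisect_root(lambda x: x * x - 2.0, 0.0, 2.0)
```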


> function pointers have existed since C

Also at least in ALGOL 68, COBOL, and FORTRAN 77. Maybe earlier, too.


Matlab has a better interface, Octave doesn't.


Have you found the Octave GUI to be unusable?


Personally, yes, but I tried about 5 years ago.


It didn't exist 5 years ago, so you must have tried one of the 3rd party crappy attempts at making a GUI. Sadly, it's difficult for Octave to shake off its reputation.

Try the GUI again. Here is a web version of it that is almost identical to running it on your own desktop:

https://www.rollapp.com/app/octave

This web version is based on Octave 3.8.1, though. Octave 4.0 has a much better version of the GUI:

https://en.wikipedia.org/wiki/GNU_Octave#/media/File:Octave-...

If you have Windows, try our installer:

https://ftp.gnu.org/gnu/octave/windows/


This could also be because students are already familiar with doing numerical computations in MATLAB from courses they'd done before. It is conceivable that students that take a machine learning course have already done some numerical analysis or scientific programming, likely using MATLAB.


Quite interesting post. I feel that a lot of the numerical Pythonistas are in the same spot:

They tolerate most languages, but find R's syntax a bit unnatural and Matlab lacking when trying to go beyond pure matrix stuff, and are waiting to see if Julia picks up (which it seems to be, from what I can tell).


One of the hats I wear is 'does numerics in python'. R is fine when you're playing to its strengths, matlab is an abomination, and I have absolutely no interest in julia.


Why no interest? If it is something that could make you more productive and make your life better, then shouldn't you take an interest? Or do you mean "I have looked closely at Julia and it's no good because..."? If so, the "because" bit will be of interest to the Julia community (it's at 0.4 now, so there is a lot of distance to go in its development before it becomes stable).


I took a look at Julia a couple years ago. As far as I could see, the only reason to consider it over python is the speed improvements, and if I'm in a situation where python isn't fast enough, I've already got c, c++, and java to reach for. (Also at the time the library support was lacking (I just checked and the graphs package, which I needed at the time, is still pretty minimal)).


From the perspective of a student, most of the good online analytics/data analysis/stats courses use R, so it is hard to get away from it while learning the material. Once you get the base concepts down, switching to python shouldn't be hard. I think most people still prefer ggplot2 for visualization though. Whenever I use R I feel like a statistician, I can feel that 'cold rigor' emanating from the language. But in the end I think it is advantageous to wield both languages.

Also I really see Jupyter as a new standard for communication. Your narrative and supporting code all in one place, ready for sharing.


Yes, I think you are right. Out of curiosity, I browsed Coursera's course catalog, and most data science related material seems to be taught in Matlab or R (however, there are also others, e.g., Klein's Linear Algebra class in Python). Personally, I think that instructors shouldn't enforce a language requirement. I believe that for big platforms such as Coursera it shouldn't be too hard to run the respective interpreter to check the code/answer uploaded by a student.


Most classes have to teach both the subject and how to program. Programming is becoming an essential skill, so they have to choose a language to teach.

Also, the classes that chose R are, in my experience, non-CS classes whose professors come from other disciplines. They just want a tool that solves their needs quickly. An example is Princeton's stats class, where the professor is a humanities major. The class gave us tons of data, we had to do ANOVA and such, and we needed a computer to crunch numbers that can't be done by hand. So he chose R, which he uses a lot.


Jupyter is amazing. It is great for seeing the workflow, and it makes it easy to graph and find trends in the data, as well as mistakes in a data flow.


Personally I'm tempted to make the switch to Julia, but slow higher-order functions, high churn in the core data infrastructure and no PyMC3 are keeping me on pydata for a bit longer. I have numba to hold me over.


One thing missing here: Matlab syntax is actually very close to modern Fortran. At least twice I've written Fortran code (for Monte Carlo simulations; different contexts) by overwriting Matlab code adding types / general verbosity / fixing the syntax of do-loops / etc.


This is a very good point actually, I have done the same a few times. Of course this works well until you get to code heavily dependent on toolbox functions.

Which leads to another advantage of Matlab (at least as long as you don't have to pay for it) -- the documentation (including toolbox documentation) is way ahead of any competition.

Whoever came up with the R package documentation standard has done that language a great disservice :( IMO, the fact that every R package includes a huge PDF with an alphabetical listing of functions and data sets is actively harmful: without it, perhaps more package authors would at least feel compelled to write a 2-page readme.txt (just like various Matlab package writers do), and that would actually have been useful.


They've been trying to Javaify the Matlab syntax for close to two decades now. They're moving towards making everything an object like in Java. They're getting pretty close to that.


One of the lessons of Julia for me is that "everything is an object" is a problem, not a boon. I think organizing code in modules with sets of related types and functions manipulating those types allows more natural and modular decompositions for reuse.
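
A rough sketch of that style, written in Python rather than Julia so more readers here can run it: plain data types plus free functions that dispatch on type, instead of methods living inside classes. The `Circle`, `Rect` and `area` names are made up for illustration, and `functools.singledispatch` only dispatches on the first argument, so this is a loose approximation of Julia's multiple dispatch, not an equivalent.

```python
import math
from dataclasses import dataclass
from functools import singledispatch


# Types hold data only; they carry no behaviour of their own.
@dataclass(frozen=True)
class Circle:
    radius: float


@dataclass(frozen=True)
class Rect:
    width: float
    height: float


# A generic function that lives next to the types, not inside them.
@singledispatch
def area(shape):
    raise TypeError(f"no area() method for {type(shape).__name__}")


@area.register
def _(shape: Circle):
    return math.pi * shape.radius ** 2


@area.register
def _(shape: Rect):
    return shape.width * shape.height
```

Adding a new shape, or a new function over the existing shapes, means writing another `register` block or another generic function; neither requires touching the existing type definitions.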


I love the hacking approach in the post: a tool is only a tool to do something valuable and not the goal itself. The Python ecosystem is the right tool at the right time, nowadays, because of the data science explosion and the need to interact very quickly with non-specialists.


.NET's F# is also good, though maybe not a better alternative.


Julia love!



