What are the libraries like for Julia at this point? The big thing holding me to R is the breadth of libraries for a) various statistical methods and algorithms and b) plotting.
I'm kind of desperate to get away from R at this point, but reluctant to move to Python (not that I have anything against Python, but I would really like to add a high performance language to my toolkit rather than something in between).
It depends on what you need. DataFrames [1] is a pretty good replacement for R's data frames. Distributions [2] is pretty amazing. Gadfly.jl [3] is a great ggplot-style plotting library that integrates nicely with IJulia [4]. If you have questions about specific things you need to do and want pointers, post on the julia-users list [5] and you should get some good answers.
Out of curiosity, why are you so desperate to get away from R? Is it for the speed of a high performance language?
As you mention, the vast selection of libraries (and the continual development of those libraries) makes even the Python module universe look small when it comes to statistical methods, statistical algorithms/machine learning, and plotting.
Further, not only are there a lot of libraries, but many of them are bleeding edge, written by the same people who originally invented the given algorithm. Most similar libraries in other languages (like Python) are incomplete ports, often not written by the original authors. Not to mention, the documentation behind these algorithms is often excellent: besides the usual help files, there are "vignettes" (detailed instruction manuals) as well as an entire journal that has now become a place to publish papers detailing the usage and implementation of many of these algorithms.
I love a lot about R, and I totally acknowledge the benefits. I don't think I would ever stop using it for data exploration and ad hoc analysis.
But after trying really hard over the past 3 years, I'm sort of giving up on ever being able to use it in a productive, reliable fashion for building more complex software. It's a failing on my own part in many ways, but the loose typing and poor support for structuring code interact very badly with my personal style. When I'm working in R I literally spend 80% of my time fighting with the type system and debugging weird and wonderful features of R. I'm at the point where whenever I write an R function, the first 50 lines of the code are type checks to make sure the data coming in is what I expect. I came to this point after I started systematically tracking why I was wasting so much time, and nearly every time it came back to the type of my data being something other than what I had expected or assumed.
So I'm sort of figuring if I'm at the point where I'm writing manually statically typed code in R ... I should look around for a language that's just like R but has at least slightly stronger data types built in.
I agree completely on the type system. When first starting with it, I was thinking "wow this is great, I don't have to worry about types and things just work". But then weird things start to happen, and you realize it's due to strange type issues that are sometimes harder than they should be to track down. And then half your code ends up being something like as.numeric(as.character(as.vector(xyz)))
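For anyone who hasn't been bitten by this yet, here's a small, made-up sketch of the factor gotcha that the chained-conversion idiom above works around (values are purely illustrative):

```r
# Factors silently convert to their internal level codes, not their labels:
f <- factor(c(10, 20, 30))
as.numeric(f)                # 1 2 3  -- the level codes, not 10 20 30
as.numeric(as.character(f))  # 10 20 30 -- hence the double conversion

# Character vectors coerce with a warning, never an error:
as.numeric(c("1", "2", "oops"))  # 1 2 NA, plus a warning
```

None of those lines errors out, which is exactly why these bugs surface three function calls downstream instead of at the point of coercion.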
I cannot speak for zmmmmm, but I can think of two reasons: (i) performance, not that R can't be fast, but you have to pull out heavy machinery such as parallelization much earlier, and (ii) its inconsistent and confusing language semantics, and the silent and unexpected type promotions/demotions that make it very error-prone to develop and debug code in. To me, the one redeeming quality of R is ggplot and, OK, the availability of push-button libraries.
I am not so sure about the scope of the 'bleeding edge' part. It is popular among old-school statisticians. Another population (no pun intended) that R is popular in is one where people know neither statistics nor machine learning but just want to try out a laundry list of canned methods without needing to understand them: pretty plot goes up, yay, awesome; pretty plot goes down, OK, try the next algorithm (or the other way round). Here I think R is pretty unbeatable in its breadth.
It is not that popular among machine learners. Part of it is cultural: a typical machine learning person comes from a CS background, and R grates on them more.
It has been claimed that R is Lisp meets stats; I think that would be Julia now, and Lush back then: http://lush.sourceforge.net/. To quote Yann LeCun:
"Lush combines three languages in one: a very simple to use, loosely-typed interpreted language, a strongly-typed compiled language with the same syntax, and the C language, which can be freely mixed with the other languages within a single source file, and even within a single function."
You definitely have some good points. I agree on points (i) and (ii): these are shortfalls of the language and in many cases (especially when building production-level code) may be reasons to switch.
I disagree with your premise that R is not popular among machine learners, and with the implication that has for the "bleeding edge" comment. Perhaps it is less popular with ML people coming from a CS background, but it seems to be the most widespread language for ML (or "statistical learning") among statistics people. And that fact is not to be discounted: perhaps the preeminent book on machine learning (or at the very least one of the most popular), "The Elements of Statistical Learning", is written by statisticians (and in fact uses R exclusively!). The Journal of Statistical Software, which features papers detailing many ML libraries, has far larger coverage of R packages than of any other language: http://www.jstatsoft.org/ . I would suggest this is the evidence for the "bleeding edge" comment. Other languages' packages, say Python's ML libraries (scikit-learn, PyBrain, etc.), do not even come close to the breadth of capability in this space.
Coming from a statistics background, I would venture to ask: what's wrong with "push-button" libraries? And why would they only be useful to people who do not know or understand much about ML/statistics?
Say you have a dataset and you believe, say, a random forest would be well suited to predicting some response. Let's assume you have a very good understanding of random forests. Why would you not want a push-button library? Why would you WANT to recode the thing yourself? It's not as if your random forest will be any better or "more correct" than the one on CRAN or whatever language's repository you are using. I would argue it's more likely to have mistakes (coding something like this from scratch is no small task, and the ones on CRAN have often gone through many iterations, improvements, and reviews from experts in the field). And if you have some domain knowledge or an informed belief about how the random forest needs to be adapted to your particular problem (a different loss function, say), you can easily edit the R package's source code to do that; no need to rebuild the car just to give it a new paint job...
I guess everyone has left the building, but I'm leaving a reply in case you see it.
>I disagree with your premise that R is not popular among machine learners
I know what you are saying, and I agree, that is why I chose my words carefully.
I am not so sure about the *scope* of the 'bleeding edge' part
The word I wanted to highlight is "scope". The population that identifies with any of these labels (machine learner, statistician, data scientist, data modeler, actuarial scientist, prediction consultant, and all other variations) is huge. So whether R is popular or not depends on whom you want to include and whom you want to exclude.
There is quite a lot of variation here. In one subset, if you know the ins and outs of R you will be taken for a wizard; in another subset, just listing R/MATLAB as one of your major skills will actually work against you.
> what's wrong with "pushbutton" libraries?
Nothing at all, and I did not claim it's wrong either. In fact I said that is one of R's strengths. For people who aren't into the theory of statistics or ML, R can be heaven-sent, especially if all they want to do is try out a catalog of algorithms.
The analogy I use is the automobile industry. If your goal is to be a run-of-the-mill driver/chauffeur, R is your competitive advantage. If your goal is to be an automotive engineer (designing new models and algorithms), R is an inferior tool. In fact, if you want to be an F1 racer, you have to look elsewhere too. Like every tool it has its sweet spot; as long as you don't venture outside it, it's golden.
R is bad building material, but if you don't want to build something novel in the first place (in a statistical or machine learning sense) and just want to call a canned solution on your (medium-sized) data, then R is the boss. If you try to do anything meaty or clever, you have to be careful with R's gotchas. It helps commoditize data analysis and popularize techniques among its large base. If you want statisticians to be aware of some cool technique, you have to release an R package, because if it isn't on CRAN it does not exist.
This is one of the misconceptions about Python, that it's "slow". Python has several type annotation systems with JITing capability, and can easily achieve C-speeds (as well as interface easily to C libraries). We've given two tutorials at Supercomputing in the last two years, and the trend for high performance computing in Python is on the rise. Don't count it out!
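To make that concrete, here is a minimal sketch using Numba, one real example of the JIT systems alluded to above (the fallback decorator is my own addition so the snippet still runs where Numba isn't installed; it is not part of Numba's API):

```python
# Sketch: JIT-accelerating a tight numeric loop with Numba's @njit.
# If Numba is unavailable, fall back to a no-op decorator so the
# function still works, just at interpreter speed.
try:
    from numba import njit  # real Numba nopython-mode decorator
except ImportError:
    def njit(func):         # hypothetical stand-in: plain Python fallback
        return func

@njit
def sum_of_squares(n):
    # Exactly the kind of loop a JIT compiles to machine code
    # instead of interpreting bytecode one operation at a time.
    total = 0
    for i in range(n):
        total += i * i
    return total

print(sum_of_squares(10))  # 285
```

The first call pays a compilation cost; subsequent calls on large `n` run at roughly C speed, which is the effect the parent comment is describing.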