R: Lessons Learned, Directions for the Future [pdf] (auckland.ac.nz)
51 points by tosh 34 days ago | 25 comments

I feel like Julia is probably the thing that solves the technical / performance issues with R.

That said, R’s strength these days is about expressive power through the Tidyverse collection of libraries and DSLs.

If I were to pick an “R Next” project, it’d be to focus on a better, more expressive Tidyverse for Racket that plays even more nicely with relational databases and frameworks like Spark.

This is fine and all, but I think they're completely ignoring the elephant in the room. R is a crazy, random, whimsical language that puts PHP to shame.

I use Python instead of R unless I'm told to use R. I hate using R, even if there are better libraries written for it than the Python equivalents. I can't stand the terrible naming conventions (seriously, can't you at least be consistent with CORE FUNCTION names?) and the ridiculous number of data structures. There are vectors, lists, matrices, tables, data frames, S4 classes, environments, oh my... I've been programming in R for a couple of years now and it still takes me around 2-3 tries to figure out what's stored in a variable and how to access it. Do I need two ['s, a trailing comma inside the [], etc.

Debugging R basically seems to mean "use a hack to generate stack traces."

Maybe I'm just stupid, but I see _absolutely_ no reason to encourage use of R over Python. I love lisp and the ideas it espouses, but R seems to take the worst from that world.

You haven't gone deep enough: the interactivity is vastly better, and the package ecosystem for statistics and data science is generally much more complete and actively developed. scikit-learn is very good, but if you're not using that or doing dweeb learning, you're up the creek without a paddle.

Python doesn't even give you matrices as first class citizens; while I used a lot of Python before I used R, it still feels like they bolted lapack onto an unrelated scripting language and built things with it. More or less because that's what it is.

Personally I don't think the R language is anything special, good or bad: it's a typical sloppy interpreted language (though many of the difficulties described in the above 2010 document no longer exist). It's the package management system that makes it useful. It's not even a great package management system, especially when dumb kids use it like it's nodejs. But it's good enough to allow potentially crummy programmers (aka statisticians) to contribute meaningful and useful code to the ecosystem.

Why care if data frames are built into python or not? In R, I use tibbles anyway.

The terrible naming is an annoyance, but the ridiculous number of data structures isn't as bad as R's tendency to jump between them in a semi-random manner.

My example is if A is a matrix and b/c are variables then you don't know what the data type of A[b,c] is. I can tell you the types of A, b, c and that doesn't help; you need to know about the actual data stored in the variables to know if the return value is still a matrix or if R has thrown out the dimension information and jumped back to a vector (potentially transposing the result). You have to know about the drop=FALSE option and at that point the syntax of doing a complicated equation involving recursion and matrices falls apart.
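For anyone who hasn't been bitten by this, here is a rough pure-Python sketch of the behaviour being described. It is not R and not any library's API: `r_style_index` is a hypothetical helper that uses nested lists and 0-based indices to imitate how `A[b, c]` changes shape depending on how many rows or columns the index vectors happen to select.

```python
# Rough analogue of R's A[rows, cols]; hypothetical helper, 0-based indices.
def r_style_index(matrix, rows, cols, drop=True):
    # Select the requested rows and columns; this intermediate is always 2-D.
    sub = [[matrix[r][c] for c in cols] for r in rows]
    if not drop:
        return sub                      # like drop=FALSE: stays a matrix
    if len(rows) == 1:
        return sub[0]                   # one row: collapses to a flat vector
    if len(cols) == 1:
        return [row[0] for row in sub]  # one column: also flattened, orientation lost
    return sub

A = [[1, 2, 3],
     [4, 5, 6],
     [7, 8, 9]]

two_rows = r_style_index(A, [0, 1], [0, 1])        # still a 2 x 2 "matrix"
one_row = r_style_index(A, [0], [0, 1])            # flat list
one_col = r_style_index(A, [0, 1], [2])            # flat list, no longer a column
kept = r_style_index(A, [0], [0, 1], drop=False)   # 1 x 2 "matrix" preserved
```

The index arguments have the same type in every call; only their lengths differ, and that alone decides whether the caller gets a matrix or a flat vector back.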

The syntax is an embarrassment for working with matrices. I'd rather use a lisp-style (-> A (mmul v) (subset 1 k 1 j)), which isn't ideal but at least it doesn't have random options being set in the middle of it.

That single decision should be enough to disqualify R from being a well designed language for mathematical applications. The pig's breakfast that is the *apply() function family is a similar story.

The distinction between vectors, matrices, lists-of-lists and data frames is archaic too; the conceptual model should be a single 2-d data structure that supports additional operations under certain conditions. At least that particular decision made sense at the time R was designed.

So... I've been programming in R for over 20 years. I've been ready for an alternative for performance reasons for about half that time. Julia seems like a promising alternative that I wish I could use all the time instead of R, but it's just not there; I'd prefer Nim most of all but that's even less well-resourced in terms of libraries. Maybe a zero-cost abstracted offshoot of Rust will eventually come to have a role? Who knows.

R, like Python, has far outgrown its initial scope. I don't think it was initially envisioned to be used the way it is today. But both have been kept in use as costumes for C/C++.

One of the things I've noticed the most in the last 20 years, to your point, is that the language used to be a lot more straightforward and simpler, more predictable. Over the years a lot has been added in a sort of haphazard way, and as a result today you have this kind of Frankenstein language that isn't what it started with.

As for data structures, though, I don't really see R as being that different from other languages. Many of them are the same as in other languages, but just have different names (and I do wish they used similar terminology). Others have been taken up in other languages as people have come to appreciate their utility.

Being a wrapper for C/C++ can only go so far. Eventually you have to write in R (or Python) and the speed shows, if you have enough data to deal with.

The main reason to use R is the existing set of libraries and analysis pipelines some may already have in place. Redeveloping these in Python can be a hassle because there may not even be libraries that recreate certain functionality, which one then needs to rebuild on top of translating the pipeline. With that said, most common analysis needs are now handled in popular Python libraries.

Sometimes though, the knowledge encapsulated in an R package is not trivial to understand and reimplement. The concepts its methods rely on may involve math you're unfamiliar with (and use as a black box), so then you have to look at the package code and try translating R->Python, then attempt to refactor to something sane (and hope you don't skip any underlying logic). That or learn the theory of what was implemented so you can now implement it in Python (which may not be feasible with tight deadlines).

People who are domain-focused and not programmers per se seem to be less bothered by this. At school, Very Productive People are split into both camps, Python and R, and Julia is gaining ground.

Well, it is a document from 2010.

Machines are faster now. We have seen the hassle with Python 2 to 3 adoption (or non-adoption) and how hard it is to change a language. The model of using a slow but comfortable language for model specification and executing it via a C library is more widely accepted now. And last but not least: the Tidyverse really has momentum now.

Sure, Julia and Python are coming after R, but the ecosystem itself is far from done.

Plus, R's performance limitations are not that big of a deal. In my experience, the bottleneck is just a couple of lines of code that can easily be replaced by some lines of Rcpp. Much easier than switching to a whole new language and ecosystem. I was really excited when Julia was new but the cost of switching is just never going to be worth it for me personally.

I’ll be frank here. I’m an avid R user, I always hear about Julia but until anyone can show me something even remotely close to tidyverse in Julia, I’ll stick to my subset of R that gets me to 80/20 (and really it’s more like 98/2).

That seems to cover the core of the tidyverse, but not the long tail.

Cool I’ll check this out, thanks!

R is plenty fast. Commonly it's calling C++. When I run a multilevel model, it's C++ I'm waiting on, not R. The tidyverse + ggplot2 + statistical tools + interactive nature make working in R really productive.

Speed is not the only computational constraint - the tremendous memory required for very large data sets is often a severe limitation especially within R (and I actually enjoy R).

This is Julia's native, in-language approach (no delegating to C/C++): https://juliadb.org/

Note: this paper is from 2010 and he has been making similar statements since 2008 (https://www.stat.auckland.ac.nz/~ihaka/?Papers_and_Talks)

Given the timing, I'm interested in what he might think of Julia, which seems to have reached a similar conclusion: statisticians need a new tool.

I use R because of my stats major, and I am not fond of SAS. Also, most bleeding-edge statistics work is in R and nowhere else. There are tons of statisticians and other researchers publishing papers, packages, and code for their research and how to do it. R may be a one-trick pony, but it is very, very good at that one trick.

You can see via The R Journal (https://journal.r-project.org/archive/2018-2/) and read through what researchers have done and published via packages.

This is the best justification for R: its statistical lineage and adoption by researchers. There's plenty of great thinking manifested in these libraries and its community. This can't be overstated.

R feels a lot like Perl to me, idiosyncratic but useful. I switched from Perl to Python years ago, but like R, I remember it being a scrappy language. It was sorta like how I use 'vi' for admin work, while I use a full IDE for large code bases in C/C++. I can crank up R to do quick things I could do in Julia or Python, but I am in the situation where all my colleagues use R (and Matlab), since they are not programmers. If their prototypes work out, I code them in C++ to run on large clusters. C(++) is the only language I've worked in consistently for over 30 years. I have hope for Julia.

I use R mainly because of the RStudio guys. I totally understand many of the author's points but you can easily write your code to avoid storing massive amounts of data in memory.

edit: grammatical error fix

I think this reason is very valid. R has warts and issues, but RStudio and Tidyverse authors have shown that R is an expressive language for building coherent DSLs that users love.

R is quite good at developing DSLs (quite Lispy), but Julia is even Lispy-er with a more flexible syntax and unicode identifiers. R, definitely, has a great statistical lineage, but that is also unfortunately what limits it from being as expressive - especially regarding new data types. No one in R creates their own domain-specific data types (that are performant as well) whereas this is the norm for the Julia ecosystem.


2010 document.

Can we add 2010 to title?
