Introduction to R Programming (cecilialee.github.io)
307 points by cecilialee on Dec 7, 2017 | 136 comments

One really underappreciated aspect of R is that it's a lisp at heart. This enables the user (and enterprising package writer) to build really clean abstractions for the task at hand.

The tidyverse suite of Hadley Wickham is a great example of this, notably with the pipe operator %>% (similar to |> in F#), which is not part of the base language and yet could be very easily implemented. Julia's macros probably enable the same type of implementation, but I don't see how one would achieve it as easily in Python, for example. Non-standard evaluation is another example of R's lispiness in action [0].
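To give a flavor of how little machinery the pipe needs, here is a toy version. (This is only a sketch; magrittr's real %>% does much more via non-standard evaluation, e.g. the `.` placeholder.)

```r
# A toy pipe: apply the right-hand function to the left-hand value.
# Infix %ops% are left-associative, so chains read left to right.
`%>%` <- function(lhs, rhs) rhs(lhs)

c(1, 4, 9) %>% sqrt %>% sum   # 6
```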

Also, consider how easy it is to walk R's S-exp. Expressions in R can only be one of four things: an atomic value, a name, a call or a pairlist. Wickham's Advanced R has a great intro on this [1].
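A quick sketch of that uniformity, using only base R:

```r
e <- quote(f(x + 1, "two"))   # capture an unevaluated expression
is.call(e)                    # TRUE: the whole thing is a call
e[[1]]                        # the name `f`
e[[2]]                        # the call x + 1 -- calls nest like lists
e[[2]][[2]]                   # the name `x`
```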

I believe Wickham's amazing work with tidyverse (which really changes the way you code in R) is just the beginning of a rediscovery of R's inner lisp power, a kind of "R: the good parts" moment.

[0] http://adv-r.had.co.nz/Computing-on-the-language.html

[1] http://adv-r.had.co.nz/Expressions.html

Anyone with a programming background getting into R should absolutely go read _Advanced R_. I've been using R off and on for a while now but Advanced R was a real revelation. All of R's weird behavior finally made sense.

Edit: Also, there is a 2nd edition in the works (confusingly hosted at the same subdomain as a different version of Hadley Wickham's website): https://adv-r.hadley.nz/

The subdomain confusion will get resolved once the 2nd ed is a bit more mature so I can just redirect the 1st ed.

After reading Advanced R, I HIGHLY recommend learning Racket (a Lisp) and working through "How to Design Programs". It will take a while and it is very dense, but this is the best thing I have ever done to improve my programming skills.


Looks nice! Thanks!

Python is very hackable. Some time ago I answered a couple of questions about how to implement a "pipe" operator in Python on Stack Overflow:

* “Piping” output from one function to another using Python infix syntax[1]

* How can I create a chain pipeline? [2]

Often in Python it is not a matter of it being possible/impossible (to implement a different syntax); it has more to do with the culture of staying ultra-idiomatic. The Zen of Python says "Special cases aren't special enough to break the rules", and the community tends to avoid writing a DSL like the plague.

[1] https://stackoverflow.com/questions/33658355/piping-output-f...

[2] https://stackoverflow.com/questions/47474704/how-can-i-creat...

FYI, the pipe operator in R (and other languages that have it built in) is for calling any function with any parameter types. This is not just academic: in the tidyverse style it's both common and idiomatic to change object types in the middle of a pipeline (for example, from a data frame to a vector, or from JSON to data frame to an interactive Leaflet map).

While Python decorators and operators can get you surprisingly far, I just don't see them being in the same league as languages like Lisp and R that let you manipulate the AST really easily.

Agreed though that the culture of Python is the exact opposite of R (and Ruby, Perl, Lisp), and even if Python had all the same metaprogramming goodies as R you wouldn't see as widespread use.

Best thing I ever learned about R was its Scheme influence. I ended up learning Racket and it changed everything I have ever coded in R. Actually, I have made Racket my main general-purpose language of choice after going through "How to Design Programs."

Ross Ihaka actually suggested rewriting R in Common Lisp.


It's not just Lisp, it's an fexpr Lisp. Pretty cool (but also kind of maddening).

See also, this dead but fascinating project: https://github.com/crowding/vadr
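The fexpr-like behavior comes from arguments arriving as unevaluated promises, which a function can capture instead of forcing. A minimal base R sketch:

```r
# substitute() grabs the expression that was passed in, not its value
show_expr <- function(x) substitute(x)

show_expr(a + b)           # the unevaluated call a + b
deparse(show_expr(a + b))  # "a + b"
```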

I have seen the HN crowd hate R in much the same way it hates JS. Without getting into those details, I'd like to list a few reasons why I like R:

- RStudio is simply great. I know Python has got Jupyter notebook but RStudio makes a good IDE for anyone (even beginners).

- Python is considered a good beginners' language because it's easy to start doing magic without getting frustrated. But the same argument applies even more to R: for anyone who wants to begin with data analytics, R is a lot easier to start with, without having to figure out how to install a new package, load it, make a plot, or anything of that matter. Hence the drop-out rate is lower.

- Tidyverse. Undeniably, it's a better universe than Marvel's cinematic universe. Not a single day in my job goes by without using dplyr.

- While I've cited the tidyverse in general, ggplot2 in particular, by embracing the grammar of graphics, has set a very nice standard for visualization libraries, one that matplotlib (the go-to library of Python) doesn't come close to matching.

- Pandas is essentially a library built on NumPy to offer R-like data wrangling functions, hence I'd consider dplyr and R's built-in data manipulation functions superior.

There is no doubt that Python has its own advantages, like scikit-learn and web services, but R in no way deserves the hate.

Even millennial companies have found interest in R: https://medium.com/airbnb-engineering/using-r-packages-and-e...


You missed RShiny, which lets you simply create a web app (unlike in Python, where you start a Flask server and then write stuff on top of it).

I don't understand the Jupyter hype. Sure it's clever that it runs in a browser but it's less capable than the MathCAD I remember using in the 90s.

Indeed. I used Maple for the same thing.

I think the hype is due to the fact that the literate programming thing is a good idea but many people haven't seen it before and there aren't many tools for doing it. I just wish I could use a proper editor with Jupyter. Editing in the browser is horrible.

I believe emacs org-mode can be used for this kind of notebook development; however, it looked like a configuration nightmare, so I still haven't dived into it.

It's actually pretty easy to set up for general use. I do know and use emacs lisp, but I've not really used any at all for org-mode.

It does support "sessions", which allow persistence across the code throughout the document (you could even have multiple sessions), but the way it's done for Python is quite hacky. It uses an interactive Python shell, so you have to write code as if you're using the shell (double returns etc.). There is a better way using ob-ipython, but after spending a long time getting it to work at all, I found it not good enough. Using Jupyter kernels is the way to go, I think, but it would be a lot of work to get it working well with org-mode.

You should give it a go; it's not hard to configure, and it allows you to trivially use several languages in the same file, which is really practical in many cases. It also exports nicely to HTML and PDF (via LaTeX).

If I remember correctly, PyCharm does support Python notebooks. I've used it and it's not terrible.

Apparently PyCharm supports using emacs as an external editor. Interesting. Thanks for the hint.

The key attraction of Jupyter (from what I can tell) is one which is underappreciated in tech.

It provides an accessible, better workflow for common use cases than what most people were using before.

Sure, there are things out there that do a better job. Or are more powerful. But something which requires highly-custom config & training to be hyper productive on by definition means most people aren't using it that way. Same argument for Python as a popular language.

It's free and it Just Works. That's about it.

> It's free and it Just Works.

It's painfully clunky for those of us who remember something far slicker 20+ years ago!

Plot.ly now has Dash, which is RShiny for Python (it uses Flask) and appears comparable to RShiny in capability.

Bokeh is also similar, and quite customizable!

Is there any difference between using Jupyter Notebook (via the R kernel) and RStudio specifically for R programming? I already have Jupyter Notebook installed and I want to learn R, so do I need to install RStudio separately?

RStudio allows you to use R Notebooks, which have a number of advantages over Jupyter (I wrote a blog post on exactly this earlier this year: http://minimaxir.com/2017/06/r-notebooks/)

RStudio is a full-fledged IDE and you should definitely use it. It has movable panes for code, console, help files, history, plots, etc. There's nothing comparable in Python land.

Rodeo is RStudio for Python.

Since the acquisition of Yhat by Alteryx (Yhat created Rodeo) the Rodeo project seems dead. Another good alternative is Spyder which offers a similar type of IDE and one that is still being developed.

Didn't know about this one. Must be pretty new. Thanks.

Visual Studio tools for Python, PyCharm?

Disclaimer: I work for RStudio. I previously worked heavily with SciPy.

The difference kind of goes to the fundamental difference between R and Python. R's nature as a statistical programming language is something you have to install packages in python to achieve: numpy, matplotlib, etc.

What you gain with RStudio are environment inspection tools[1] built for the kind of vectors, data frames, etc. that you'd only get with `numpy` in Python land, and therefore PyCharm and VS don't know about (or would need a plugin to know about). Same goes for the plot viewer and `matplotlib`.

Beyond that, a sizeable portion of RStudio's runtime is written in R itself; you can actually write addins for the IDE using R, as opposed to PyCharm where you'd have to know Java or Kotlin, and I assume VS where you'd be required to use .NET.

It's always going to come down to "what is the best tool _for the job_?" Knowing people who use python for data science, they don't seem to indicate to me that they're particularly fond of PyCharm (which is what I'd use for Python if it's too big a project to effectively grok in VIM). They tend to use Jupyter notebooks (not even iPython!) because more important than static inspection and quality tools (which devs care about) is a richly-featured REPL that saves detailed history forever (which a researcher cares about).


Thanks for the clarification, but at least with Visual Studio tools for Python, some of that is also possible.


Extensions can be written in IronPython, http://ironpython.net/

The builtin repl supports IPython/Jupyter style, with inline plots, .NET and WPF integration.


Yes! These are the exact things why I love R! Although I love Python too.

I know many people think otherwise, but I hate R for many reasons. Here are some of them:

- You can use '=' and '<-' to assign values to variables and both do the same, except in a few edge-cases where you now spend one week finding the error

- It confuses and mixes functional programming and OOP, not only per entity but also between usages. Want to get a value from entity X? Use x$getValue(). Want to get a value from entity Y? Use getValue(y).

- The IDE crashes once an hour and does not detect file changes, which forces you to restart it manually.

- People say R is the best and optimized for data analytics, which is simply not true. It's a marketing lie spread by the creators. There is no data analytics task that you cannot do with the same ease in other programming languages.

Disclaimer: My big data profs forced me to use R even for tasks where R should not be used.

I've been a heavy R user for about 7 years, and I only slightly disagree with one of your points.

(In my opinion) R is best for traditional statistics, as opposed to AI, machine learning, predictive analytics, data science, data analysis or any other variant thereof.

If you're more concerned with Chi-squared tests than unit tests, or if you need to teach a mathematician or a biologist how to fit regression models and analyse residuals, goodness-of-fit statistics, p-values etc, then R is the best language for the job.

If you need to build a program (as opposed to just do a thing), or if you're more interested in accuracy than inference (as per most machine learning tasks), then Python with sklearn and pandas blows R out of the water.

> Python with sklearn and pandas blows R out of the water.

For some things yes, but for others the reverse is true. I'm also a heavy R and python user and find the two ecosystems extremely complementary. For building pipelines and web apps, python has an edge. For statistics, graphics, and data management, R is IMO superior. You can do everything in either language, but have to jump through hoops in some cases. Sometimes the best solution is use both!

For example, I run an internal web app for A/B testing using django and rpy2. Doing it all in python would have been sub-optimal because dataset management is so much simpler in R. Plots that were easy to do in ggplot2 were impossible to get right in matplotlib. The big drawback to this method is R's single-threaded architecture. Embedding R in a web server process is not easy (ask me!), and won't scale as well as a multi-threaded environment can.

All my data exploration and prototyping happens in R. Even basic report scripting can be done better in R than python because of the ease of data management. Consider a typical case of 1) run database query, 2) munge data around to produce a table, and 3) email or save to html. If you can't get exactly what you want from the database in one query and you have to do a lot of munging in step 2, then R is going to be more flexible than python. If I need to merge, aggregate, or recode variables, I would much rather use R. Doing all this with a list of lists "dataset" in python is convoluted at best, and recreating a lot of the functionality that base R gives you.

Do you not use pandas?

pandas is absolutely terrible compared to dplyr, data.table, or even base R for data manipulation. And while you would have been right about Python being better for machine learning a couple of years ago, these days basically every popular machine learning library in Python (TensorFlow, Keras, etc.) now has an API in R.

I also don't know why you are separating "traditional statistics", "predictive analytics", and "data analysis". They often are the exact same thing. In fact, it makes me wonder how much experience you have with statistics if you are under the impression that it is somehow different from data analysis "or any other variant thereof".

You are right on exactly one count: Python is superior for putting data analytics into production. And that isn't an insignificant advantage. A lot of data science today involves packaging an analysis into some larger program or product, and Python is absolutely better suited to that task.

But in virtually every other case (including lots of machine learning problems), R is either as good if not greatly superior to Python.

I did start my post with the words "in my opinion". I am not right or wrong about anything, and neither are you. We're mostly talking about syntax preferences here.

I'm separating out traditional statistics as an alias for statistical inference - make distributional assumptions, test them, estimate the effect of X on y and put a 95% confidence interval around it. That sort of stuff.

It's the stuff that absolutely does not matter if you're assessing the overall effectiveness of a classifier, and certainly isn't needed in a lot of data analysis tasks where all you need are variations of counts and percentages.

For the record, my academic background is maths and statistics. I've picked up any software development experience on the job.

> pandas is absolutely terrible compared to the dplyr, data.table, or even base R for data manipulation.

I would really like to hear a bit more about this, because this would greatly increase my motivation to learn more R. Specifically I've fiddled around with dplyr and it definitely feels more DSL-y but I didn't see a crazy benefit there. What are some of your favourite things about dplyr / data.table?

Took me a while to get back to you, but essentially dplyr is fantastic for readability and reproducibility. Reading through someone else's analysis, or even my own long after the fact, is orders of magnitude easier than base R, data.table, or pandas typically are.

data.table's advantage lies in its speed. It is by far the fastest of the three options. In just about every benchmark it either is significantly faster than pandas or at the very least is approximately equal.

Pandas is lauded by people who strictly use Python, and it really is fantastic considering how ridiculous data manipulation would be in Python without it. But it's also the only option a Python user really has, so they've become married to the idea that it is best.

Basically, if you are using Python, use pandas. If you have an option, go for data.table for speed, dplyr for clarity, or a mix of the two if desired.

What I really like about dplyr is how simple it is. It essentially provides an SQL like selection of verbs (select, mutate, summarise, arrange) and handles lots of things for you. As an example, these two statements are equivalent:

mydf$newvar <- with(mydf, oldvar1/oldvar2)

mydf <- dplyr::mutate(mydf, newvar=oldvar1/oldvar2)

You can then use the pipe operator %>% to funnel the results of one operator into the next.

The real advantage is that you can easily build up a chain of functions which can be read from left to right (rather than right to left, as in summary(coef(mylm))), plus the reduction in temporary variables.
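A piped chain on a built-in dataset, for illustration:

```r
library(dplyr)

mtcars %>%
  filter(cyl == 4) %>%               # keep 4-cylinder cars
  group_by(gear) %>%
  summarise(mean_mpg = mean(mpg)) %>%
  arrange(desc(mean_mpg))            # reads top to bottom, no temporaries
```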

Pandas, on the other hand looks like base R (which is fine, but not as nice as dplyr).

However, the niceness of pipes does all fall apart when you have an error in the middle and you need to start deleting things in order to debug.

So in pandas it's kinda similar:

> df["newvar"] = df["oldvar1"] / df["oldvar2"]

And instead of the pipe, we have chaining for which is super straightforward and readable:

> df["newvar"] = (df["oldvar1"] / df["oldvar2"]).abs().rank().astype(str).str[:4]

and for more complex or non-chainable functions we have .pipe, e.g. df.pipe(my_func) for some my_func that takes and returns a DataFrame,

which looks super similar to dplyr to me!

the data.table way:

mydt[, newvar := oldvar1/oldvar2]

I could not resist.

> People say R is the best and optimized for data-analytics which is simply not true. It's a marketing-lie spread by the creators. There is no data-analytics-task that you cannot do with the same ease in other programming languages.

Really? Maybe you worked with R before data.table, dplyr, and the tidyverse packages? I'm not that familiar with pandas in Python, but there is an incredible amount of productivity to be gained from knowing your way around a set of just around 5 packages in R that I never had when working with C++, Java, Perl, or Ruby.

Also, it could just be me, but I overcome the = and <- confusion by simply never using =.

I feel your pain! It took me a long time to get used to R. The only reason I tolerate it is I used SAS before that, so my point of comparison is an even more obtuse programming framework! Some general advice should you want to work with R some more:

- For assignment, always use '<-'. Read it as "set to". For example, "x <- runif(10)" means "set x to a vector of 10 uniform random numbers". When passing arguments in function calls, use '='.

- If the IDE gives you problems, try using the command line. R Studio or the R GUI app are not necessary. Simply type 'R' in a shell and you have an interactive read-line environment. Use the shell for exploratory work, then write code in your favorite editor and copy/paste after developing a series of commands you want to run.

- Use base R as much as possible; don't install a new package just for one function that you could write with base R functions, even if it's not elegant. Package bloat is one reason for inconsistencies in APIs. Some package developers will make you do x$getValue() and others getValue(x). But remember these are 3rd-party packages. You can do a lot using just base R and a few select packages that are well respected (ggplot2, dplyr, Hmisc, reshape).

That's funny, I absolutely never use '<-'. Mostly because it's 2 characters, and because it's inconsistent with most languages. Haven't ever run into any issues because of '='.

I like your base R point, although like you say, some packages are simply essential.

I complain about this every time a post on R programming comes up here, but my favorite thing to hate (out of many) about R is that there's no way to find out the directory of the current script. Imagine someone wanting to use relative paths to their data files so that they could version-control their scripts and run them unmodified on different machines! We wouldn't want to enable such abominations now, would we!

I think you need to reference the data files from the working directory, not the directory where the script currently is. The two aren't necessarily the same.

The current working directory can be found with getwd() and set with setwd().

If you set the working directory at the beginning of the script, paths to data files should be relative to that location.

Yes, but for example when running from within RStudio, or calling from other scripts, the two aren't the same. When calling from other scripts you can setwd() first, of course, but my point is that you can't sensibly rely in your script on the working directory and the script path being the same.

I've actually noticed this and was totally blown out of the water by it. I understand you can use getwd() and setwd() but I thought you could simply do relative paths (similar to other languages) but it doesn't always work and I haven't figured it out.

For example, if you are loading a data.frame from a csv, my.df <- as.data.frame(read.csv("file.csv")) seems to work if the R script is in the same directory as the .csv. This is what I tend to do in .Rmd code chunks (which is my primary R workflow). It also tends to work across platforms which is handy as who knows what box I'm going to be hacking away on. However, R's preference for absolute paths in general I find very strange as I'm always on different machines with, of course, different directory structures. Isn't everyone?

Regardless, R is funky but I think I like it in a sort of awkward 'first date not sure yet' kind of vibe. I'm a noob and novice programmer otherwise though so who knows.

Maybe I am misunderstanding your question, but isn't that just getwd()?

No, that gets you the working directory, which isn't always the same (like, when running from RStudio, getwd() returns the RStudio installation path IIRC).

If you run scripts non-interactively, you could try commandArgs? That should contain the file path. For RStudio, maybe the rstudioapi package has a function like that...
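Concretely, something like this works for Rscript invocations (it won't help in an interactive session, which is part of the complaint):

```r
# When run as `Rscript myscript.R`, the interpreter is passed --file=myscript.R
args <- commandArgs(trailingOnly = FALSE)
file_arg <- grep("^--file=", args, value = TRUE)
if (length(file_arg) == 1) {
  script_dir <- dirname(normalizePath(sub("^--file=", "", file_arg)))
}
```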

Well yes, there are several workarounds; to the point that there are packages that wrap up all methods and try to decide which one is the correct one in the given invocation. This is the problem with R - there are many things for which you need only a single line to do something very complicated, but there are also many things that are just a tiny bit different from the standard cases, and are absurdly complex. Everything is just slapped together, without thought for the overall picture or overarching design.

Google Stack Overflow for 'R get current script path' some time, and weep not only at how often this is asked and upvoted (i.e., how many people suffer from this), but also at the suggestions offered: how divergent they are, and how complicated. But this is just one example. R is death by a thousand cuts.

not gonna argue that : D Been working with R for 3+ years now and totally second the "there are also many things that are just a tiny bit different from the standard cases, and are absurdly complex".

I can gradually move on to python at work now, which so far has been much more pleasant. It always surprises me what you can end up doing in R though, but really shouldn't if you want to go to production : )

getwd() returns the working directory, which can be set with setwd(), even from within RStudio. I'm still not sure what the problem is.

Not sure if it solves your problem but `source(file, chdir = TRUE)` can be useful.


I believe that is what you are looking for.

Yes, and now I want to also make it work when not invoked from RStudio; and for various R version. So now I find myself wrapping all these options into a function, which I have to copy for every 10 line script. So then I make a package for it; or use the functions in someone else's package and add a dependency which I'm not sure will still work a year from now.

Or I could just use a sane language and go home in time for dinner.

(I mean I know about all the solutions and non-solutions; I've looked into this at least a dozen times over the last 5+ years. My point is that this shouldn't have been an issue in the first place.)

You are absolutely right, but then either your first post was misworded or I misunderstood the issue (most likely the latter), as there is a way to know the directory of the script.

100% agree with R is not a sane language.

Ah yes, now I see - I said 'there's no way to find the current script', which isn't true. So that's probably what the others in this thread are also objecting to :) I guess what I meant was 'there's no sane way', or 'look at how hard it is to do this tiny thing which anyone with a programming background would find so basic, they wouldn't even consider it might not exist'. So yeah, I did screw up on making my point there.

So you need to have a specific IDE installed for this to work?

Nope! Base R works great. Old-school vi to edit scripts, and R base installation to run them (or REPL around). Of course, the IDEs do offer a lot of support, and RStudio is great for making your R functions into packages that are easy to share.

That was in reference to the rstudioapi package for finding the path of the current file, which I've just checked needs a running RStudio session to work.

> - The ide crashes once an hour and does not detect file-changes which forces you to restart it manually.

There's more than one IDE for R [0], and strictly that's not a problem with R itself, but the people who built the IDE.

> - People say R is the best and optimized for data-analytics which is simply not true. It's a marketing-lie spread by the creators. There is no data-analytics-task that you cannot do with the same ease in other programming languages.

I think there's very little marketing behind R. It's predominantly a statistics package, but clearly a lot of statisticians are using it for data analysis. So I think it's people using it for data analysis who talk about it, not bloggers paid to write about it.

[0]: https://stackoverflow.com/questions/1097367/what-ides-are-av...

I've known many quants who use both R and python/numpy/pandas for complementary tasks. The R standard library was generally spoken about in positive terms, but for data massaging and manipulation beyond pure maths/stats analysis, a Python environment probably offers much more flexibility.

Note that I don’t claim expertise in the above, but a bunch of very talented people I’ve worked directly with, and who were very directly incentivized to be productive, used R.

Perhaps your profs were trying to help you learn R, including its limitations, when they were setting you tasks?

This is a really big deal. In the first edition of Python for data analysis, they suggest using mean imputation. In case you don't know, this will totally break your variance calculations and thus any statistical tests.

In the second edition, they suggest doing some interpolation. Meanwhile, in R land there are multiple ways (as always) to do proper multiple imputation, which gets you a much more accurate analysis and makes better use of all of the data (mice, Amelia, and mi are all good, and somewhat complementary).

That being said, I just thought of using PyTorch and a GAN to do multiple imputation, so maybe it's not impossible to do in Python. There is way, way less support for it though (but of course you could probably build in Numpy).
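The variance problem is easy to demonstrate in a few lines (a minimal sketch with made-up numbers):

```r
x <- c(1, 2, NA, 4, 10, NA)
x_imp <- ifelse(is.na(x), mean(x, na.rm = TRUE), x)  # mean imputation

var(x, na.rm = TRUE)  # 16.25, from the observed values only
var(x_imp)            # 9.75 -- imputed points sit on the mean, deflating spread
```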

I guess the big difference is that R comes with a numpy equivalent (matrix), a pandas equivalent (data.frame and base), and well-tested, numerically stable reference implementations of pretty much all widely used statistical models.

Like, I really don't understand why you wouldn't want to look at residuals, even if all you care about is prediction. Your predictions will be much more stable and accurate, and it can often inform you as to how to model things more appropriately.

Finally, R's formula interface is a thing of beauty. Honestly, why the hell do I need to generate a model matrix for regression/classification when I can get R to do it for me.
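For example, dummy coding and interaction terms fall out of the formula for free (toy data, just to show the mechanics):

```r
df <- data.frame(y = rnorm(6), f = gl(2, 3), x = 1:6)  # f is a 2-level factor

fit <- lm(y ~ f * x, data = df)    # main effects plus interaction, no manual coding
model.matrix(~ f * x, data = df)   # the design matrix R built behind the scenes:
                                   # intercept, f2 dummy, x, f2:x
```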

I will also say that R is a frustrating, domain-specific, really irritating, wonderful language. But then I'm a crazy person, I wrote a stockfighter client in R.

I agree that there are also some good parts with R.

But the argument "It's good because many people use it" is the one I hear most often in discussions about programming languages, especially older ones like R and Java.

Actually for data massaging and manipulation, R is absolutely superior to Python.

I do not really disagree with you, except for the '<-' bit, just map it to a keyboard shortcut, and move on :).

But I would give R a try with the tidyverse, it made me go from hating R to just not caring about it.

While libraries are extremely inconsistent, if you want to use cutting-edge statistical methods as a researcher, you pretty much have no other option. Finally, data wrangling is quite well developed in the R environment.

So long story short, after many years of hating R, now I just find it a handy tool to do my work despite it being old, inconsistent and sometimes annoying.

I don't get how a keyboard shortcut deals with the '<-' issue, which is the occasional and subtle difference in semantics from '='. Even if you can pick just one for your own work, it doesn't help with other people's code.

Tidyverse is new to me, I must check it out.

The RStudio shortcut for <- is Alt and -.

Also, Tidyverse and data.table are the main reasons for the sudden explosion of R's popularity. For me, I love the piping: I am an old-time bash user, and | becoming %>% in R suits the way I think.

Typing '<-' is a fairly trivial matter, I think, compared to the semantic issues raised at the start of this thread, and covered in the question and answers below - for example (from the chosen answer): "R's syntax contains many ambiguous cases that have to be resolved one way or another. The parser chooses to resolve the bits of the expression in different orders depending on whether = or <- was used."


Just use the '<-' all the time, except in function calls. If you can use '<-' with an easy key press you will not be tempted to use '='. And the problem is greatly reduced.

But yes, the problem is still there.
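The classic illustration of that semantic difference, for anyone following along:

```r
x <- 1:10
median(x = 1:20)   # 10.5, but x is untouched: '=' here just names an argument
x                  # still 1:10
median(x <- 1:20)  # 10.5, and x has been reassigned as a side effect
x                  # now 1:20
```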

Add to that:

- documentation is all in PDF format

- can only install packages from the interpreter

- testing libraries not feature complete

- weird namespacing

- poor test coverage in popular packages

- no mature webserver

The namespacing drives me nuts but it doesn't get mentioned very often in these kind of threads. How are people just ok with loading everything in the same namespace? You can use ::, but then that can have a ton of overhead.

Yeah, the namespacing thing is crazy. If it's any consolation, it was much, much worse before R 3.0. Originally, packages used to clobber each other's namespaces, which led to much hilariousness and non-deterministic bugs.

Now those hilarious bugs only happen at the REPL, which is a little better. If these kinds of bugs cause problems for you, I strongly recommend creating packages for your analyses/projects. It doesn't add that much complexity (with devtools, at least) and it does avoid a lot of these problems. Also, R packages require documentation, which is better than many other languages.
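The REPL-level masking looks like this (dplyr really does mask these two stats functions on load):

```r
# stats::filter is a time-series convolution filter, attached by default
filter(1:5, rep(1, 2))   # runs stats::filter

library(dplyr)           # message: 'filter', 'lag' masked from 'package:stats'
# 'filter' now resolves to dplyr::filter; the original needs stats::filter(...)
```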

I've attempted to address the documentation issue at https://rdrr.io

The test coverage issue is what originally prompted me to get involved -- I was evaluating different EM solvers and found a lot of crazy obvious bugs (parameters backwards or totally ignored). There's a lot of room for improvement on the quality front. MRAN and Tidyverse are thankfully making some headway.

can only install packages from the interpreter

Not completely sure this is what you mean, but you can certainly run install.packages("package") right in your code, doesn't have to be done interactively. Usually want to first check if it is already installed, like with require("package").

I think cshenton wants to install packages without running the R interpreter. Think of pip install, cpan, cargo...

> can only install packages from the interpreter

If the package has been downloaded, it can be installed with R CMD INSTALL

Nice list of R's WTFs. I've recently hit the issue that you access properties of "S4 classes" (what are those? "The S4 object system. R has three object oriented (OO) systems: [[S3]], [[S4]] and [[R5]].") using @ instead of $.
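A minimal sketch of the difference (the class and variable names here are made up for illustration):

```r
# S3: objects are typically lists; fields are reached with `$`
s3_person <- structure(list(name = "Ada"), class = "person")
s3_person$name    # "Ada"

# S4: classes are formally declared; slots are reached with `@`
setClass("Person", slots = c(name = "character"))
s4_person <- new("Person", name = "Ada")
s4_person@name    # "Ada"
```

Using `$` on an S4 object (or `@` on an S3 list) is an error, which is exactly the kind of inconsistency that trips people up.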

I think that users/library creators are also part of why working with R is such a pain. Giving them the option to overload operators was a major mistake. C++ programmers are more often engineers who have more concern for the code reader, and even they probably overuse it.

I don't know why you think its a marketing lie. I think R is hands down the best language for data analysis, and nothing else really comes close.

Maybe you could elaborate with why you feel this way?

"You can use '=' and '<-' to assign values to variables and both do the same, except in a few edge-cases where you now spend one week finding the error"

Can you tell us a bit about those edge-cases which can lead to hard finding bugs?

- There is also '->' which can be even more confusing (or helpful if you use pipes)

- Aren't those methods defined by each package/object? Or you mean '@'?

- RStudio (the most used IDE) is one of the reasons I use R so heavily. I've never encountered your problems. I can even git checkout another branch, and the new versions are loaded without issue.

- For a quick descriptive analysis or some tests I don't know something easier, compared to SQL or Python. But that's probably only personal preferences and/or knowledge of the language.

I do R code QA for a living now.

And I can see why some people like R. They are end users for whom the language was explicitly designed, so they like the ergonomics (to use the term Rustaceans are popularizing.)

The thing is, like Perl and LaTeX and other products you could think of, R was initially written by people with a good idea of the end uses and how to enable them, but not a good idea of how to reconcile those ergonomics with the need for a clean, parseable syntax.

So if you make too extensive a reliance on R, you wind up having to hire someone like me.

Every language has its warts but R is actually the least disliked language https://stackoverflow.blog/2017/10/31/disliked-programming-l... and http://blog.revolutionanalytics.com/2017/11/r-is-the-least-d...

I really enjoy R, and the more I learn programming the more I enjoy it. The best things I ever did were learning Racket, working through How to Design Programs, and picking up Hadley Wickham's tidyverse.

I moved from Python to R about six years ago. Before that I did most of my work on the command line. R's rise in popularity has been driven by the tidyverse libraries and data.table, the millions of dollars invested into R by many companies, and an amazing ecosystem.

> You can use '=' and '<-' to assign values to variables and both do the same, except in a few edge-cases where you now spend one week finding the error

It's just R's semantics. It is almost universally recommended to just use <- for style consistency and to avoid the edge cases. Use the RStudio shortcut `Alt` + `-`. The reason you spent a week is exactly why it is recommended that all users just use <-.

> It confuses and mixes functional programming and oop not only per entity but also between the usage of them. Want to get a value of entity X? use x.getValue(). Want to get a value of entity Y? Use Y.getValue(y).

R comes from S, and the creators of R were also inspired by Scheme. Personally, I learned the language Racket to become a better R programmer, and I pretty much live on the functional side of R. I actually like the fact that they added a more functional core to R, compared to S+. http://r.cs.purdue.edu/pub/ecoop12.pdf

> The ide crashes once an hour and does not detect file-changes which forces you to restart it manually.

Then use a different system. R does not equal RStudio. I have never experienced this, and I have worked on Linux, Mac and Windows 7-10. To me RStudio is the best example of an Electron app, and it's the only IDE whose built-in git feature I actually use. R Projects and R Notebooks are the best features of RStudio.

> People say R is the best and optimized for data-analytics which is simply not true. It's a marketing-lie spread by the creators. There is no data-analytics-task that you cannot do with the same ease in other programming languages.

So R is just as easy to use for data analysis and has best in class statistics? Also I think tidyverse is much easier than any other data analysis system I have ever seen, but I guess that is just my opinion.

> It's a marketing-lie spread by the creators.

What do Ross Ihaka and Robert Gentleman have to gain from marketing, and what lie have they ever told? This falls into conspiracy-theory territory.

There is a large community of R users and we like R. There are a ton of

Is there anything out there comparable to ggplot2 for high-quality plots?

Lightning (http://lightning-viz.org/) is a pretty cool interactive visualization server, with clients that work across multiple languages/environments. I've used it with R and Python and it's pretty slick.

I'm always a bit disappointed that yhat indiscriminately used 'ggplot' for the python package name. Using a variation on the name would have been more considerate.

I tried a few years ago and found that it didn't implement much and didn't work correctly. Since then I usually use something like rpy2 to run standard ggplot2 from Python.

Hopefully it has improved.

matplotlib makes fantastic plots, it's just not a very nice API.

Seaborn is quickly becoming my favourite and a bit more similar to ggplot in terms of scope.


tldr: you don't like R. I am no wiser than before I read this comment. So what do you use instead?

Regarding your 3rd point, I have never seen or heard anyone say R is the best, and I use it almost daily.

So you prefer Python, I assume?

Depends on the problem to solve. For calculation-heavy tasks I prefer Python. But these days many use R to create shiny-apps, which should be done in javascript instead.

That's one way to look at it. The other perspective is that with shiny 'many' are able to create interactive apps to display and explore data which would have required 5x as much time to make with js +/- d3

Shiny does use javascript; are you against the wrap?

A few weeks ago I had to do some data transformation (just a few thousand lines of data). Because I have some history with Excel, I started LibreOffice and wrote some formulas. After a few days I reached the point where LibreOffice required an hour and a half to recalculate the formulas.

That was the moment I asked a friend of mine who has some R experience to help me with the basics (yes, the syntax is kinda weird at the beginning). After 4 hours of learning by doing, we had the same result I had reached in a few days of work with LibreOffice, and it calculated everything in about 17 seconds. Yes, this time I knew exactly what I wanted, and R can do much more efficient transformations than you could ever manage with a spreadsheet calculator. Nevertheless I was quite happy with the result.

As I normally code with vim and tmux, I use R just like a (bash) script with the following shebang:

  #!/usr/bin/env Rscript
That way I can throw it into a watch myScript.R loop while I write it in vim in a different tmux pane. That might have some disadvantages compared to RStudio (e.g. you can't view graphics in a terminal), but as it fits very nicely into my normal workflow and performs very well, I am very happy with that solution.

Have you tried Nvim-R?

I love it.

You can send a line to the R console using <space>. I've assigned loads of keyboard shortcuts beginning with your local leader that will do things like str(), levels(), head(), tail(), sum() on the object under the cursor.

It works fine with plotting figures, and I think you can set it up with tmux, though I use vim's buffers.

Haven't seen any disadvantages compared to RStudio yet. I guess you could even do :!git add ... from vim.

[1] https://github.com/jalvesaq/Nvim-R

Just to settle the Vim vs Emacs debate in the context of R, I refer you to R FAQ 6.2: https://cran.r-project.org/doc/FAQ/R-FAQ.html#Should-I-run-R...

The book "R for Data Science" by Garrett Grolemund and Hadley Wickham (O'Reilly, 2017) [1] provides a comprehensive introduction to modern R and a set of packages known as the tidyverse. Highly recommended.

[1] http://r4ds.had.co.nz/

I second this. The tidyverse packages make R feel the way it's supposed to feel: tables make sense, string manipulation makes sense, it's all grown out of one consistent approach, kind of the opposite of base R.

Just a shame that R's approach to namespaces is so bad that importing the tidyverse leads to a few name clashes with Bioconductor...

Hadley Wickham also has an Advanced R book [0] which has some of the functional programming concepts that you can use in R

[0]: http://adv-r.had.co.nz/

I can't believe you start counting at 0 ;)

It's a common theme on HN, I copied it from other comments when I first started commenting here

It's also a common theme of those complaining about R that it starts with 1 rather than 0 like a "real" programming language: https://stackoverflow.com/questions/3135325/why-do-vector-in.... Now that you've outed yourself as an insidious traitor, Hadley will be by shortly to take back your copy of the book. At least, that's how I read the smiley.

Ah, thank you for the explanation. You'll probably get done for being a snitch.

Yes. This is the book that I referenced to!

To save others some of the head-banging sessions I've had with R:

R has an integer division operator, %/%. R gives you the ability to define your own infix operators, as long as you give them symbols that start and end with %. Here's the kicker--all such operators have a higher precedence than multiply and divide, which can lead to unexpected results.
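As a sketch of that precedence surprise (the %plus% operator below is hypothetical, defined only for this example):

```r
# %/% -- and every %op% -- binds tighter than * and /
2 * 6 %/% 4    # 2 * (6 %/% 4) = 2, not (2 * 6) %/% 4 = 3

# The same applies to operators you define yourself
`%plus%` <- function(a, b) a + b
2 * 3 %plus% 4    # parsed as 2 * (3 %plus% 4) = 14, not 10
```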

R as a programming language can be frustrating. It has scalar values; you just can't store one in a variable (it becomes a vector of length one). Some functions and operators will work with vectors of arbitrary length... but some require a vector of length one.

(Speaking of which, binary operations on vectors are applied to corresponding elements, BUT if one operand runs out first, R will start picking elements off from the beginning again, with a warning if the length of the longer one isn't a multiple of the length of the shorter one. This may be surprising.)
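Recycling in a nutshell (nothing beyond base R here):

```r
1:6 + 1:2    # c(2, 4, 4, 6, 6, 8): the shorter vector recycles as 1,2,1,2,1,2

1:5 + 1:2    # c(2, 4, 4, 6, 6), plus a warning because 5 is not a multiple of 2
```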

The wonky list notation takes time to get used to: foo[1] gives you a sublist; chances are you want foo[[1]].
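The difference in one snippet (the list `foo` is made up for illustration):

```r
foo <- list(a = 1:3, b = "x")

foo[1]      # a list of length one, still wrapped: list(a = 1:3)
foo[[1]]    # the element itself: the vector 1:3
foo$a       # equivalent to foo[[1]] for named lists
```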

Deciding which of the *apply() functions you want can be a pain. What passes for lambda expressions in R is clunky.

m:n gives you a vector of m, m + 1, ..., n... unless m > n, in which case it assumes you want m, m - 1, ..., n, so 1:0 won't give you an empty vector. This makes for clumsy special-case code.

> What passes for lambda expressions in R is clunky.

Is that really the case, though? It seems like `function (args) body` is about as simple as it gets, and just as simple as in many other languages.

Use seq_len(n) instead of 1:n to get an empty vector for n=0.

There's also seq_along(x) instead of 1:length(x).
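A quick sketch of why these two helpers exist:

```r
n <- 0
1:n             # c(1, 0): counts down, a classic off-by-one loop bug
seq_len(n)      # integer(0): the empty vector you actually wanted

x <- character(0)
1:length(x)     # c(1, 0): same trap in loop headers
seq_along(x)    # integer(0): safe replacement
```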

Man I dislike R for its syntax. It does a terrible disservice to people who start coding in R and then think that they "know programming" while they have missed most of the basic programming paradigms any "normal" programming language has.

I think R has a lot of similar ideology as PHP and well everyone has their own opinion about PHP.

Also, I found the tutorial seriously lacking. I mean, no data.frames, matrices, vectors, tables or factors? How to iterate over a data.frame might be the biggest thing a beginner needs to know before shooting themselves in the head. apply, lapply, sapply or vapply: which one do I need? Well, IMO apply is the best one to start with as it's the basis of them all. sapply is almost the same, but it just transforms the result into a vector or matrix.

Agree. It's amazing how such an ugly and inconsistent language can have so many great packages.

The inconsistency is something that's quite annoying though.

apply is NOT the basis for the other *apply functions; in fact, apply is the exception. There's apply and there's lapply. The rest are variations of lapply.
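A minimal sketch of the family, using base R only:

```r
xs <- 1:3

lapply(xs, function(x) x^2)              # always returns a list
sapply(xs, function(x) x^2)              # simplifies to a vector when it can
vapply(xs, function(x) x^2, numeric(1))  # like sapply, but the result type is checked

# apply is the odd one out: it sweeps over the margins of a matrix
m <- matrix(1:6, nrow = 2)
apply(m, 1, sum)    # row sums: 9 12
```

For programming (as opposed to interactive use), vapply is generally the safest of the three, since sapply's "simplify when possible" behaviour can silently change the result type.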

As for people mistakenly believing that they "know programming" I don't think this has anything to do with R's syntax. R is a programming language but it's also a system for interactive data analysis thus the syntax had to be adapted to that end.

I'll probably get downvoted for this, but let me tell you - Please don't use R in production. Please don't use R for any serious work.

Over the years, I've come to appreciate the fact that languages are just tools. You simply use the right tool for the job. If you let your personal bias or love/hate get in the way, it will cause you a lot of pain in the long run. By the same token, R is one of the most fucked up languages to work with if you use it simply because you assume it's good for all analytics-related projects. It's not.

In one of my previous companies, we had a hipster, always used everything that's on trend. Against all advice, he decided to use R for many of our internal and client facing projects.

For what would have taken a week in Rails, he'd write everything in R Shiny. Yes, he used a statistical programming language to write a web application and serve APIs(!). Performance was terrible. There were a lot of breakdowns. Development dragged on; even his own team members lost morale. I unfortunately had the ill luck of having to maintain some of his codebases, and those days were the worst of my life. Worse yet, he didn't have a formal software engineering background, so he loved the idea that you can code everything inside this black box called RStudio. Fuck tests, there were no tests written, because he didn't understand their importance. The projects he worked on lasted for nearly 1.5 years without completion. Almost every project had a cloud instance running an R server, and it also cost a LOT simply because it was eating a lot of memory. Even our Ruby projects didn't consume as much.

Eventually most of the projects failed and we lost a lot of customers. Many team members quit. All because of one singular mistake: choosing a language that's not right for the job. Eventually, one of our competitors came up with a working prototype in production using Python and Flask, with much better analytic capability at scale, in less than 3 months. Python can do a LOT of what R can do, plus things R cannot, and the code is much, much easier to read.

For example, string concatenation:


    "hello" + "world"

If you're really interested in data science and/or analytics, I sincerely urge you to start with Python and Pandas together rather than R. It is much, much more performant, easier to reason about, and much, much easier to maintain and scale. Please consider this heartfelt advice based on my mistakes rather than a rant. Thank you.

But if you know R, you can change the behavior of operators.

    > oldPlus <- `+`
    > `+` <- function(e1, e2) {
    +     if (is.character(e1) && is.character(e2))
    +       paste(e1,e2,sep="")
    +     else
    +       oldPlus(e1,e2)
    + }
    > "hello" + "world"
    [1] "helloworld"

I have started really enjoying R (with tidyverse) because it allows me to present complicated topics in a very simple manner. I can easily embed short R snippets and LaTeX equations in an Emacs Org mode document, and then export it as a very nice-looking easy-to-read HTML or PDF document with basically no effort other than coming up with the text itself.

It is incredibly liberating.

I'm working on my data science degree, and this is my method as well, though I'll admit I don't know much R yet so I'm using it very simply. I'll usually have a mix of python, R, octave, etc snippets.

I really love emacs org-mode.

As the other comments on this submission imply, if you’re learning R from scratch, start with tidyverse.

You can use base R, but when people talk about how much they hate R, it’s usually because of base R, not tools like dplyr/ggplot2. (I had learned R and used it in college, and nearly quit R entirely until dplyr was released)

And over the last summer, I started using forcats/lubridate, and I am kicking myself for wasting my time not using them sooner and using ugly hacks for the appropriate functionality instead.

R can be an annoying programming language, but for some reason I've found it easier to use for prototyping than even Python. I think it's because I can sloppily copy and paste between notepad and repl without much issue, whereas in Python I have to be concerned about the whitespace and things are a bit more verbose. I also get more out of the graphing capability of R, but that's probably because I don't understand Python's graphing well enough. Be that as it may, R just seems to have what I need to get things done as sloppily as I need. My workflow tends to be a combination of Python or Java spitting out numbers, and then using R to analyze and graph those numbers, all glued together with Bash scripts.

> For someone like me, who has only had some programming experience in Python, the syntax of R feels alienating initially. However, I believe it’s just a matter of time before adapting to the unique logicality of a new language.

I preferred R to Python right from the start. However, R is anything but logical, and its syntax is the least of its problems.

> And indeed, the grammar of R flows more naturally to me after having to practice for a while, and I began to grasp its kind of remarkable beauty, that has captivated the heart of countless statisticians throughout the years.

Wow, statisticians care about beauty? This is a shocking scientific discovery! (In the social sciences, but don't let this detract from your achievement.) What data do you use to support your theory?

I use R (or want to use it) whenever I find myself using excel or google-spreadsheet. If I was more fluent in R I would use it many more times. I found that using it instead of standard spreadsheet was much more robust. Spreadsheet have their role, however R is an amazing tool to have in your programming toolset.

It's clear there are a lot of strong opinions about R!

One kind of obscure problem I run into is R's embrace of a global namespace. Package developers sometimes assume people are using this namespace, and access it via globalenv(). This means that to use the package anywhere else, you basically have to patch their code.

(in contrast, I don't even think about problems like this occurring in python packages. Worst case scenario, can just use a subprocess )

R is great, haters gonna hate, but when you want to prototype a model, nothing flows like R + tidyverse + RStudio.

YAFL, yet another “fine” language.

The biggest problem with R: it is too slow.

R is free software; see the R site for the licensing terms. It runs on a wide variety of platforms including UNIX, Windows and MacOS.

There's no reason to use R unless you are unable to learn.

That's silly. If you're doing research using new statistical methods, they're almost certainly available on R first. And ggplot2 remains the best plotting library I've ever seen.

If you're unable to learn, you're basically not even human.

Perhaps you missed off a word at the end of the sentence? (Not that this makes it much better, but at least it would be comprehensible.)

... unable to learn Python?

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact