15 Page Tutorial for R (studytrails.com)
194 points by sndean on Aug 17, 2016 | 29 comments

This tutorial is much more basic and has far fewer practical statistical applications than the R tutorial posted last week (https://news.ycombinator.com/item?id=12264360), which itself is out-of-date relative to the R for Data Science book (http://r4ds.had.co.nz/).

I really am curious why anything "R" and "Tutorial" gets massively upvoted to the Top 3 of HN like clockwork nowadays. I might have to restart my R tutorial screencasts since there appears to be a demand. :P

Because it's exploding. Microsoft backing does wonders, for one. But it was exploding even without them. Every analytics vendor standardizes on R.

> I really am curious why anything "R" and "Tutorial" gets massively upvoted to the Top 3 of HN like clockwork nowadays.

Same thing I was wondering. Is it simply that R is gaining popularity among a more mainstream audience, while it hasn't changed that much recently?

Idiomatic R has changed enormously with the growth of magrittr (the %>% operator), dplyr and now broom. Code by expert users is unrecognisable from that of a few years ago.

R still looks pretty much the same to me when I look at the code on kaggle, stack exchange, etc. Can you share where you have seen this expert R use and/or what use cases?

Most modern R usage I've seen, even on Kaggle, has used magrittr/dplyr since it's orders of magnitude faster/easier than the base functions. (i.e. I almost quit R in favor of Python without those two)
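
To make the base-vs-dplyr comparison concrete, here is a rough sketch that computes the same grouped summary both ways. It assumes the dplyr package is installed and uses the built-in mtcars data purely as a stand-in; the variable names are illustrative, not from the notebook linked in this thread.

```r
library(dplyr)  # also makes the %>% pipe from magrittr available

# Base R: mean mpg per cylinder count via the formula interface
base_result <- aggregate(mpg ~ cyl, data = mtcars, FUN = mean)

# dplyr: the same summary written as a pipeline of verbs
dplyr_result <- mtcars %>%
  group_by(cyl) %>%
  summarise(mean_mpg = mean(mpg))

# Both approaches should agree on the group means
all.equal(base_result$mpg, dplyr_result$mean_mpg)
```

Whether the dplyr version is "easier" is a matter of taste, but the pipeline reads top to bottom in the order the steps happen.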

A quick example of my own notebook which uses magrittr/dplyr heavily to process Stack Overflow survey data: https://github.com/minimaxir/stack-overflow-survey/blob/mast...

Thanks for the example code, it actually has me wondering whether using the pipes has any impact on performance. They are much easier to read than the nested function calls that would be used instead.
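
For readers who haven't seen the two styles side by side, a small sketch of the readability point (it assumes the magrittr package is installed; the numbers are made up for illustration, and it says nothing about performance):

```r
library(magrittr)  # provides the %>% pipe

x <- c(4, 9, 16)

# Nested calls are read from the inside out
nested <- round(mean(sqrt(x)), 1)

# The same computation as a left-to-right pipeline
piped <- x %>% sqrt() %>% mean() %>% round(1)

identical(nested, piped)  # TRUE: only the notation differs
```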

However, I mean I just searched for an R script on kaggle and the first I found is this: https://www.kaggle.com/bpavlyshenko/grupo-bimbo-inventory-de...

From my experience that is a typical example (data.table is very common, but not so much dplyr and the pipes).

I looked through the recent/featured questions on stack overflow and cross validated and saw only "old school" R as well. For example: http://stats.stackexchange.com/questions/228800/

Indeed. 10-12 years ago there were probably no more than 50ish people in the world who knew R better than me. By 5 years ago the tide had heavily turned but I was still using it at least a few times a week.

If I had to use R as a daily driver now it'd be like learning the language all over again. I could write it using the old ways, and I do that on the few occasions I have to, but it's no longer idiomatic at all.

Yeah but that tutorial does not even go into dplyr or tidyr.

Indeed - I think it's unclear how you should begin to teach R now. There is a lot of legacy stuff that you may need to recognise if you want to understand the help documentation. But if you wanted to seriously work in R it may be better to skip straight to ggplot2, dplyr and broom without touching base R, the apply functions, and list in list in list hell.

That's pretty much what I did. I dip back down into the 'base' stuff when I need to, but in general I stick in the hadleyverse of tools. It's been great.

I would love R screencasts

R is essentially a LISP dialect. There is even a library that shows the syntax tree in LISP syntax:

  > library(codetools)
  > showTree(quote(1+2*3+4^5))
  (+ (+ 1 (* 2 3)) (^ 4 5))

See also the R-to-LISP compiler: http://dan.corlan.net/R_to_common_lisp_translator/
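
The "code is data" side of this can be poked at with base R alone. A minimal sketch (the expression is just an example; nothing here needs the codetools package):

```r
# quote() captures the expression without evaluating it
e <- quote(1 + 2 * 3 + 4^5)

class(e)               # "call"
as.character(e[[1]])   # "+": the first element of a call is the function
e[[3]]                 # 4^5: the remaining elements are the arguments
eval(e)                # 1031

# A call can be edited like a list before it is evaluated
e[[1]] <- as.name("-")
eval(e)                # (1 + 2 * 3) - 4^5, i.e. -1017
```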

Another great resource for learning R is Swirl. I can't recommend it enough.


  > a[1-3]
  [1] 1 3
  > a[[1-3]]
  Error in a[[1 - 3]] : attempt to select more than one element
The first line evaluates to a[-2]. I think what the author meant was a[1:3]. a[[-2]] gives an error, a[[2]] is valid R.

a[-2] is valid R syntax. A negative index means that element is omitted from the result, so a[-2] is a without the second element.
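
A quick way to check the indexing behaviour discussed here. The vector values are my own choice, not the tutorial's:

```r
a <- c(10, 20, 30)

a[1 - 3]   # 1 - 3 is -2, so this drops the second element: 10 30
a[-2]      # same result
a[1:3]     # the colon operator builds the index vector 1, 2, 3
a[[2]]     # [[ ]] extracts exactly one element: 20

# [[ ]] with a negative index on a length-3 vector is an error,
# caught here so the script keeps running
res <- tryCatch(a[[-2]], error = function(e) "error")
res
```

The exact wording of the error message varies between R versions, which is why the sketch only records that an error occurred.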

As an aspiring R-learner, I have a few quibbles in the first pages I've looked at:

re: the Assignment operator http://www.studytrails.com/R/Core/AssignmentOperator.jsp

> However, there is a difference between the two operators. = is only allowed at the top level i.e. if the complete expression is written at the prompt. so = is not allowed in control structures. Here's an example:

      > if (aa=0) {print ("test")}
      Error: unexpected '=' in "if (aa="
      > aa
      Error: object 'aa' not found
      > if (bb<-0) {print ("test")}
      > bb
      [1] 0
What in the LOLWTF...This is a bad example because even as an experienced programmer, I have no idea what is supposed to happen, or why this pattern would even be used -- and even then, I was surprised with the result. This is an overly complicated explanation even if it is correct. I prefer Hadley Wickham's style guide:


> Use <-, not =, for assignment.

(an assignment operator that isn't an equals sign is one of the things I miss most when switching away from R)
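
For anyone who wants to reproduce the tutorial's `if` example in a script rather than at the prompt, a hedged sketch: parse() is used so the syntax error can be caught instead of aborting the file, and aa/bb are the tutorial's variable names.

```r
# `=` directly inside an if() condition is rejected by the parser,
# so the failure is captured by parsing the text rather than running it
msg <- tryCatch(
  {
    parse(text = 'if (aa = 0) print("test")')
    "parsed fine"
  },
  error = function(e) "syntax error"
)
msg

# `<-` is accepted: bb is assigned 0, and because 0 is treated as FALSE
# the body never runs
if (bb <- 0) print("test")
bb
```

This only shows what the parser accepts; it is not an argument for writing assignments inside conditions.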

From the next chapter, Listing Objects: http://www.studytrails.com/R/Core/ListingObjects.jsp

> All entities in R are called objects. They can be arrays, numbers, strings, functions.

This may technically be the case, but any R tutorial that does not open up with what makes R different from other mainstream languages is doing the reader a major disservice. This Listing Objects chapter shows patterns that use R in ways that I haven't seen used in other R examples (beginner and advanced). Even if it's correct, what's the point in showing esoteric examples unless this tutorial is meant to teach R to someone interested in the design of languages?

Again, Wickham's Advanced R book handles this topic well (in fact, Advanced R is probably the best book you can read if you already know how to program -- it is incredibly accessible) in his early chapter on Data structures: http://adv-r.had.co.nz/Data-structures.html

> R’s base data structures can be organised by their dimensionality (1d, 2d, or nd) and whether they’re homogeneous (all contents must be of the same type) or heterogeneous (the contents can be of different types). This gives rise to the five data types most often used in data analysis...Almost all other objects are built upon these foundations. In the OO field guide you’ll see how more complicated objects are built of these simple pieces.

And this next sentence is the one thing I wished someone had printed out and stapled to my forehead before I started to learn R:

> Note that R has no 0-dimensional, or scalar types. Individual numbers or strings, which you might think would be scalars, are actually vectors of length one.

Maybe that's self-evident to other programmers, but even as someone who once programmed in MATLAB, I was stunningly ignorant of how every return value I interacted with was a vector, even a single simple string. In retrospect, the interactive shell alludes to this...but I didn't even bother looking up the details about the shell:

         > 2 + 2
         [1] 4
         > 'a'
         [1] "a"
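
The "no scalars" point can be checked directly in base R. A small sketch:

```r
x <- 2 + 2

is.vector(x)        # TRUE
length(x)           # 1
x[1]                # indexing a "scalar" works like any other vector
identical(x, c(4))  # TRUE: 4 and c(4) are the same object type

length("a")         # 1: one string, not the number of characters
nchar("a")          # 1: nchar() is what counts characters
length("abc")       # still 1
```

The `[1]` the shell prints in front of every result is the index of the first element shown on that line.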

R is wonderfully easy to start up with and produce visualizations with. But skipping over the language's fundamentals was incredibly painful for me. A few minutes skimming "Advanced R" would have easily saved me hours of confusion.

wrt assignments: = does assignments too but not eagerly. It's used for late bindings. This isn't a matter of style as your reference to adv-r would suggest but a different use case.

I wish people would spend more time reading the official language reference instead of all these half-baked tutorials.

I can see myself and other HN readers enjoying the design of the R Website[0]. But I believe that is not universal.

The HTML documentation is intimidating. Not that these are bad things. Science is complicated, and it needs complicated tools.

These tutorials (half-baked or not) allow a casual insight into complex subject matter.

[0]: https://cran.r-project.org/manuals.html

The only place I ever see the "if (x<-y$x)" style used is deep in base R source that probably hasn't been touched in a decade.

I prefer Python

Also, if you are using models and algorithms on the backend, R is painful compared to python.

If you are just doing interactive work, RStudio and jupyter with well-stocked support libraries are about equivalent to me.

If you want to do something not all that popular (eg. conjoint analysis) or something cutting edge (duplicating recent research results), R is pretty much the only way to go.

> If you want to do something not all that popular (eg. conjoint analysis) or something cutting edge (duplicating recent research results), R is pretty much the only way to go.

Why? Shouldn't you be able to do that in Python (or MATLAB if that's your thing)?

You sure can! Except you will likely have to roll your own, rather than the plug-and-play modules that seem to be released for R first (used to be MATLAB, but R seems to be the pretty new girl in class, getting all the attention).

Agree. While R's statistical stuff is much more polished, doing anything but running the regressions is a royal PITA.

You got it the other way around. Doing anything but model specification (i.e. regression) is royal PITA in python.

Regression in scikit-learn is like 2 lines. It's hard to understand where your pain point is.

That's my point. Specification of the model is not (usually) a problem. The other tasks are the problem.
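
For reference, this is roughly what "model specification" looks like on the R side. A minimal sketch using the built-in mtcars data; the choice of variables is only for illustration:

```r
# The formula mpg ~ wt specifies the model: mpg regressed on weight
fit <- lm(mpg ~ wt, data = mtcars)

coef(fit)                # intercept and slope
summary(fit)$r.squared   # share of variance explained
```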

That would be so because with R you get access to a myriad of statistical packages or because the functionality of numpy etc. is already included? IMHO the only real shortcoming is the lack of 64 bit integers, which is of course ridiculous.
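
The integer limitation is easy to see in base R. A short sketch:

```r
# R's native integer type is 32-bit
.Machine$integer.max   # 2147483647, i.e. 2^31 - 1

# Going past the limit yields NA (with a warning, suppressed here)
overflow <- suppressWarnings(.Machine$integer.max + 1L)
is.na(overflow)        # TRUE

# Doubles hold whole numbers exactly only up to 2^53
big <- 2^53
big == big + 1         # TRUE: the +1 is lost to rounding
```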
