

Try R — A new online course, for free - allennoren
http://oreilly.com/go/tryr/
Sponsored by O'Reilly, built by Code School (creators of Rails for Zombies), this is a free online course.
======
Homunculiheaded
For anyone interested in R without a background in statistics: I would highly
recommend learning the two in parallel (if not statistics first).

R is first and foremost a language for statistical computing. You really
aren't going to get much out of it without working on some interesting data/stats
problems. Plus for most hacker types I think being able to play with the
statistics you're learning about with R can be a great learning aid.

However not only is it beneficial to learn stats with R, it is imho dangerous
to learn R without some stats. There's already too much research being
published where 'p-value' means "the thing that t.test() outputs that I was
told needs to be in the paper".

Because R lets you play so freely with stats I find it a great tool to gain
greater intuition about certain mathematical principles, but there is a
temptation to let the tool do the work and the thinking for you.
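
A minimal sketch of that kind of play, assuming nothing beyond base R: simulate many experiments in which the null hypothesis is true by construction, then count how often t.test() still reports p < 0.05. The sample size (20), the number of repetitions (2000), and the variable names are illustrative choices, not something from the course.

```r
# Simulate 2000 two-sample experiments where both groups are drawn from the
# same normal distribution, so every "significant" result is a false positive.
set.seed(42)
p_values <- replicate(2000, t.test(rnorm(20), rnorm(20))$p.value)

# Fraction of experiments with p < 0.05; this should land close to 0.05.
false_positive_rate <- mean(p_values < 0.05)
print(false_positive_rate)
```

Reading the p-value as "how often a result at least this extreme shows up when there is no real effect" makes a rate near 5% the expected outcome here, which is a more useful intuition than "the number the paper needs".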

~~~
chernevik
How about not-beginners looking to refresh / deepen their intuitions?

I've recently been working with the Python toolset in this space -- pandas,
numpy, matplotlib -- and run smack dab into my rusty regression analysis. In
particular I need to better understand the distribution assumptions underlying
the error distributions and the variances around the coefficient and intercept
values.

Any suggestions for some deeper study / refresher?

~~~
pseut
Depends on the data sets you want to work with. For straight-up linear
regression, with a heavy emphasis on observational data appropriate for
microeconometrics, "Introductory Econometrics: A Modern Approach" by Jeff
Wooldridge is absolutely phenomenal (an old edition is fine). (This is usually
assigned for advanced undergraduate econ majors or non-advanced masters
students; I don't know what the equivalent would be for undergraduate stats
majors).

For more "intuition about working with data, especially if you're a visual
person," Howard Wainer's books are wonderful; one example is "Graphic
Discovery: A Trout in the Milk and Other Visual Adventures." They're non-
technical, short chapters, discussions of different data sets.

Bill Cleveland's "Visualizing Data" and "Elements of Graphing Data" cover the
same material -- graphing data -- at a more technical level. I don't know
whether Cleveland's books would help with the issues you asked about, but... they are
amazing books and if you're interested in the subject at all I can't recommend
them highly enough.

I don't have any free recommendations, unfortunately.

~~~
creamyhorror
Seconding Wooldridge. It's the only economics text I'm keeping from university
- I'm getting rid of all the rest. It really digs into the material and
highlights the pitfalls and incorrect assumptions in regression and
forecasting. I'm planning to consult it when I start working on analytics in
current/future projects.
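
For anyone who wants to poke at the regression assumptions raised in this subthread, here is a rough sketch in R (the same check can be done with numpy/pandas). It simulates data from a known line with independent, constant-variance normal errors, fits it with lm(), and recomputes the coefficient standard errors by hand from the residual variance times (X'X)^-1. The true intercept (1), slope (2), and noise level (3) are made-up values for illustration.

```r
set.seed(1)
n <- 200
x <- runif(n, 0, 10)
y <- 1 + 2 * x + rnorm(n, sd = 3)  # errors: independent, mean 0, constant variance

fit <- lm(y ~ x)

# Standard errors as reported by summary()
reported_se <- summary(fit)$coefficients[, "Std. Error"]

# The same quantities by hand: residual variance times (X'X)^-1
X <- model.matrix(fit)
sigma2 <- sum(residuals(fit)^2) / (n - ncol(X))
manual_se <- sqrt(diag(sigma2 * solve(t(X) %*% X)))

print(cbind(reported_se, manual_se))
```

The by-hand formula only describes the real sampling variability when the errors are independent with constant variance; plotting residuals(fit) against fitted(fit) is a quick visual check of that.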

------
seanlinehan
This is great. I've nearly completed a class at UC Berkeley which was almost
entirely in R and I can say with certainty that it is a marvelous language. It
is powerful, concise, and has an incredibly robust community. I've
experimented with many programming languages, but I have not used one which
allows you to experiment as rapidly as R.

I'm currently going through the Codeschool lessons to see if there is anything
that I may have missed in my class. So far so good!

Edit: The most important thing that I didn't see covered in the course was
RStudio. Considering that R is more of a scripting language than a programming
language, I've found that RStudio is instrumental in using the language to
its full capacity. While it's certainly possible, and in some cases optimal,
to use R from the command line, my experience is that the GUI features of
RStudio are incredibly powerful. The ability to browse data frames and have
graphs show up in the context of your work has been very useful for exploring
and understanding data. Otherwise, the course does a pretty decent job
introducing readers to the language and its data structures.

~~~
wallee
Revolution R is also really nice to develop in. I believe that you can get an
academic version for free.

------
stfu
I'm just going through this and while I love the concept, there is one
criticism from me. The course is very comfortable to walk through, but it
doesn't make use of the benefits of the online environment.

It is just "this is how it goes, now type it" kind of teaching. I am almost
done with the second session, and most likely have completely forgotten most
of the content from the first. If anybody else is going to try teaching stuff
in a similar way, please let me try to play/try out stuff as early as
possible. Even if the exercises are completely pointless, please make them a
bit harder than just exchanging a "+" for a "-". It feels so pointless
having such a great learning environment and not using it to make it feel less
of a brain-dump process.

~~~
wiradikusuma
Same here. And sometimes after being introduced to a concept or function, I
will wonder, "hmm, what if i do this.." but the embedded REPL doesn't allow
you to tinker.

------
thedaveoflife
For those learning R, this site: <http://www.twotorials.com/> which I found on
HN several months ago is fairly helpful as well.

~~~
Tactix47
Great link, thanks so much for sharing! Looks like a very complete
introduction to working with R.

------
hkmurakami
Thanks. I was sorely disappointed yesterday when I found out that Coursera
classes follow a strict schedule (and that I couldn't look at the material
right then) and that I wouldn't be able to try the Data Analysis with R course
on their site.

I'll definitely check this out :).

~~~
joshz
Videos are available on Roger Peng's Youtube.

[http://www.youtube.com/user/rdpeng/videos?flow=grid&view...](http://www.youtube.com/user/rdpeng/videos?flow=grid&view=1)

~~~
hkmurakami
You are my hero.

------
iaw
For those who want to go deep :

<http://www.burns-stat.com/pages/Tutor/R_inferno.pdf>

------
zmmmmm
I feel like it starts off a little bit on the wrong foot by introducing basic
types as scalar variables. In reality R has no scalar variables; everything is
a vector or list, and what looks like a scalar is really a length-one vector, e.g.:

    
    
        > is.vector("a")
        [1] TRUE
    

This might seem like nitpicking, but it leads to a world of confusion when
programmers used to languages with scalars start trying to use R that way. It
took me several months of confusion and weird bugs before it finally clicked
and I started understanding R better.
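
A few more lines that make the point concrete, as a sketch in base R (the variable name x is only for illustration):

```r
x <- 5
length(x)            # 1: a "scalar" is a vector of length one
identical(x, c(5))   # TRUE: wrapping a single value in c() changes nothing
x[1]                 # indexing works exactly as it does on a longer vector
is.vector(x)         # TRUE, same as the is.vector("a") check
```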

~~~
hadley
Can you give an example where the confusion between a scalar and a length one
vector is important? I'm trying to figure out how to better teach R to people
familiar with other languages and understanding your stumbling blocks would be
v. helpful.

~~~
seanlinehan
For a strong conceptual grasp of how the language works, I think it is
fundamental that students learning R (especially those with a history in other
programming languages) understand that there are no scalars in the language.
The main argument that I would make for this is that nearly all R functions
can operate on vectors with a length greater than one. By understanding that
when you send a "scalar" to a function you are actually sending a vector, I
believe it is much more conceptually clear that you can, and should, send
larger vectors to functions and can receive the expected results. This is in
comparison to most other programming languages where it would be necessary to
iterate over a list or array in order to operate on each individual element.

For example, if you told somebody familiar with, say, PHP to add 2 to each
element in a vector, they would likely break out the oh-so-familiar for loop
to iterate over each element and apply the transform. This is completely
suboptimal in R, as you could just do vector + 2 and receive the exact same
thing.
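
A small sketch of the two styles side by side; the vector v and its values are made up for the example:

```r
v <- c(1, 4, 9)

# Loop style, as one might write it coming from PHP or C
looped <- numeric(length(v))
for (i in seq_along(v)) {
  looped[i] <- v[i] + 2
}

# Vectorized style: the length-one vector 2 is recycled across v
vectorized <- v + 2

identical(looped, vectorized)   # TRUE
```

Both produce the same result; the vectorized form is shorter and lets R do the iteration internally.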

~~~
hadley
Ah, that makes sense. Thanks!

------
zkoch
I really like this, but one big complaint is that the auto-scroll after
completing a little task isn't correct. So each time after I "pass" a
particular section, I have to manually scroll down with my trackpad.

~~~
patched
Gregg here from Code School.. we actually tried making it autoscroll down at
first. It was too distracting and annoying. Scrolling down when you're ready
to continue just felt more natural.

~~~
a_bonobo
I found another rare, but annoying thing - on German keyboards the "~" is
hidden away on the combination "Alt Gr and +", when I try to type that into
the prompt under Firefox 17.0 nothing happens. Workaround is to copy-paste.
Under Chrome 23.0.1271.95 the same key-combination inserts single quotation
marks, weirdly enough, same workaround.

------
goldfeld
R doesn't seem to get much frontpage love on HN, or even if it does and I
haven't seen, what would people suggest is the technology for statistics going
forward? I really hoped it would be around Clojure (e.g. Incanter[1]) and not
Python, for entirely selfish reasons.

[1]: <http://incanter.org/>

~~~
tokipin
honestly i don't know of anything that can compare to Mathematica, besides its
$300 price tag

~~~
dagw
And that $300 price tag goes up to $1000 if you want a license that lets you
use Mathematica in any sort of commercial or professional context (it only
goes up to $500 if you only want to use it in an academic setting).

------
sonabinu
A nytimes.com article on R outlining its history and how it is moving from
academia to mainstream data analytics
[http://www.nytimes.com/2009/01/07/technology/business-comput...](http://www.nytimes.com/2009/01/07/technology/business-computing/07program.html?pagewanted=all&_r=0)

------
bernardom
Beautiful website.

I highly recommend supplementing a course like this (where you learn about the
language's ecosystem) with the R Cookbook from O'Reilly. It's been a lifesaver
for me, and helped me learn R over the course of a few months of needing it at
a new job.

Now I find that I need to learn something else for data munging- R is
_terrible_ at data manipulation and querying.

The querying bit is solvable with the incredibly useful sqldf package (hosted
on Google Code). The package allows you to use SQL syntax to query your
data.frames (by creating, populating, querying and deleting a temporary
database table, SQLite by default, in the background).

Example: I have a dataframe named dfrm with columns named "id" "height" "name"

If I want the heights of all people whose names start with D, I would need to
use:

> dfrm$height[which(substr(dfrm$name,1,1)=='D')]

Terse, but painful. Compare to:

> sqldf("select height from dfrm where name like 'D%'")

Much easier!

~~~
agentq
I actually find base R _excellent_ for data munging and manipulation, even
without using additional packages. Here is a reproducible example that very
easily accomplishes what you were trying to do (the first two statements just
set up a sample data frame)

    
    
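      # Reproducible sample data: 20 rows with random heights and names
      set.seed(123)
      dfrm <- data.frame(height=runif(20),
                         name=paste(sample(LETTERS[1:5],20,replace=TRUE),letters[1:20]))
      # Keep rows whose name starts with "D", returning only the height column
      subset(dfrm, grepl('^D',name), select=height)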
    

Basic R functions like subset, transform, with(in), reshape, aggregate,
{a,ma,ta,sa,va}pply, match, grep(l), by, split, table, etc. allow you to
accomplish just about any data frame munging you might want. Add on the plyr,
reshape2, data.table, xts/zoo packages and you're ready to tackle just about
anything.
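
To give a flavor of a few of those functions working together, here is a short base R sketch. The people data frame and its column names are hypothetical, invented only for this example.

```r
# Hypothetical sample data; names and values are made up for this example.
people <- data.frame(
  name   = c("Dana", "Dov", "Eli", "Eve", "Dee"),
  group  = c("a", "b", "a", "b", "a"),
  height = c(1.60, 1.82, 1.75, 1.68, 1.71)
)

# transform(): add a derived column
people <- transform(people, height_cm = height * 100)

# subset(): filter rows and pick columns in one call
d_names <- subset(people, grepl("^D", name), select = c(name, height_cm))

# aggregate(): mean height per group
by_group <- aggregate(height_cm ~ group, data = people, FUN = mean)

print(d_names)
print(by_group)
```

Each step takes a data frame and returns a data frame, which is what makes these calls easy to chain.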

I'm not a big fan of sqldf because imo R is not supposed to act like SQL.
Using sqldf in practice would require a lot of query string manipulation and
takes away from the nice functional features of R.

Nevertheless, it is very easy to write incomprehensible R code. The best way
to avoid this is to take one of the existing style guides (Google, Hadley
Wickham's) and adopt it seriously.

~~~
Myrmornis
One drawback with R is that in computations like this, several intermediate
data structures with one dimension equal in length to nrow(dfrm) are
allocated. Traversing an iterable of tuples is a simple way to think about it,
is efficient, and ties in with other technologies e.g. relational databases. R
is often people's first language (e.g. science graduates) and those people
would be better off learning how to iterate over tuples than learning the
obscure bestiary of data structure manipulators you point out.

------
id_ris
I've been using R extensively for the past 12 months and have achieved a high
level of comfort with the language. Now I find myself at a wall because of my
lack of math and statistics background. I've taken R as far as I can, or put
more properly, R has taken me as far as I can go without learning more math.

With that said, I have little reason to use R right now except for its
excellent plotting ability with ggplot2. Otherwise for data munging,
wrangling, connecting to databases, doing unit testing, etc - R is a giant
PITA. Better to stick to Python for that. And as I learn D3, I think I'll use
R even less for visualization.

Therefore R will only be valuable to me once I can harness its power for data
mining and machine learning, which is its killer feature, IMHO.

~~~
hadley
Would love to hear what you find most painful about data munging/wrangling and
unit testing. It's something that I've been trying to improve in R (e.g.
<http://vita.had.co.nz/papers/tidy-data.html> and
[http://journal.r-project.org/archive/2011-1/RJournal_2011-1_...](http://journal.r-project.org/archive/2011-1/RJournal_2011-1_Wickham.pdf))

~~~
tanyaM
Do you think the love it/hate it dichotomy over R for data 'munging' stems
from different ways of thinking about data? I'm slowly getting comfortable in
R since returning to work in a sort of freelance arrangement that makes me
highly motivated to use free or affordable tools. I started out, however, in
clinical epidemiology data analysis using MS Access and SAS. I still think of
data in terms of rectangular data sets, RDBMS and sql. I have a hard time with
vector and matrix related terminology. I think I'm going to end up using
reshape2 and data.table a lot since sqldf is noticeably slower even with my
small data sets (compared with web analytics, finance, etc). The problem with
sqldf and variable names containing a dot is a real drag as I try to adopt
good coding style. I am missing the clarity and familiarity of sql statements,
though, as I try to find my new workflow in R. I hope a more unified approach
to data munging emerges soon. BTW, I totally espouse the reproducible research
(RR) method of documenting study design, analysis, interpretation... I am
loving knitr and latex for RR so I can no longer imagine using different tools
for data munging and analysis.

~~~
tanyaM
Correction: I should have said 'literate programming' instead of 'reproducible
research', since I'm not in a position to follow all components of RR.

------
rjsamson
I'm constantly impressed by the high quality content that Gregg and the rest
of the Code School folks put out on a regular basis and Try R is no exception.
Really excellent work! Looking forward to getting through the rest of it.

------
taeric
So, I've started using R for some stuff I'm doing at work. I have to say that
I'm basically treating it as a non visual spreadsheet. Seems everything I've
used it for so far, I could have done with excel. Am I doing it wrong?

~~~
frankc
Not doing it wrong, but only using a subset of R. For instance, R has powerful
data manipulation ability that can get your data to use the subset of R that
does what Excel does. R also has a huge library of packages that go way beyond
what Excel can do, especially for statistics. Sure, you can do an ols
regression in Excel, but can you do a complicated machine learning model?

~~~
taeric
I should have been clearer. I've never been "good" at using Excel. So far,
I've really only done things that I would think you could do in Excel. Might
be a touch overkill, but I'm not sure. I am curious which data manipulation
items you are referring to. Any good pointers?

I am going through the Machine Learning for Hackers book, though. So far it
has been interesting. I guess I never realized that machine learning is
essentially statistics. (Or am I looking at that incorrectly, too?)

------
prashantganti
I am currently reading "The Art of R Programming" by Norman Matloff and it
is a good book for R beginners. Some familiarity with maths and stats basics is
obviously required though.

------
scrumper
Was keen on trying this out. Crashed on me on the variables bit on the first
page; just spun and spun. Happens a lot with these online interpreter things.

------
prezjordan
Oh man I really wish I had this at the beginning of the semester. I'm towards
the end of a grueling stats course - difficult, and not the best professor.
Each homework assignment I feel like I barely scrape by without really
learning. This is the first time I've ever felt this way about school.

~~~
aggronn
This has been a very common theme through my undergraduate stats education.

------
keithpeter
Section 2.1 contains the following instruction

    
    
       "Try creating a vector of numbers, like this:"
    

So I typed c(5, 9, 11) and got an error message.

They meant

    
    
       "Type the R code to create this vector:"
    

I shall work through the rest in a few days, nice environment.

------
merlinsbrain
I was very interested in the course syllabus for 'Statistics One' by Prof.
Andrew Conway. I missed the course on Coursera and now I'm unable to view the
course archive. Does anyone know where I can find the lectures? (Yes, I've
googled some.)

~~~
tomku
I took that course when it was running on Coursera, and I honestly can't
recommend it (in its current state, at least) to anyone looking to learn basic
statistics.

It covered a lot of material, but the quality and order of coverage was very
inconsistent. The first couple weeks were fine, but it felt really odd to jump
from correlations and scatterplots into regression, then come back to t-tests
and AOV afterwards. There were also some errors in the R code on the slides,
which led to a lot of confusion on the discussion forums during the class. As
a student, it didn't feel like the class's pedagogical approach was very good,
and I'm now finding myself using other resources to fill in the gaps.

If you'd like to hear more about those other resources I'd gladly post a list,
but they're mostly Python-centric. One that I can whole-heartedly recommend
even if you stick with Prof. Conway's class is the set of lectures from Roger
Peng's "Computing for Data Analysis" class on Coursera. The course itself
isn't available at the moment, but the videos are on his Youtube channel[1].
It teaches R from a programming perspective, and you'll find the content
invaluable once you start writing R code that's more complex than a couple
stats functions and a plot.

[1]:
[https://www.youtube.com/user/rdpeng/videos?flow=grid&vie...](https://www.youtube.com/user/rdpeng/videos?flow=grid&view=1)

~~~
merlinsbrain
Hi, thanks a ton for the detailed response. Luckily I don't really need a
Stats 101, so I don't think I'll mind him jumping around. If, of course, it
does get a bother I know which course is right for me. Till then I'm also
doing a bit of Thrun's Udacity Stats course on the side.

I would actually appreciate a list of resources in Python, that's what I like
using most! I have downloaded a copy of "Think Stats", but haven't gone
through it yet.

~~~
tomku
Sorry for the late response, I completely forgot about this post!

Looks like you're on the right track though, "Think Stats" and Udacity's stats
class were the main things I was going to recommend. I'd also recommend
checking out IPython's web notebook for inline charts and general awesomeness,
and the Pandas library for an R-style data frame built on top of NumPy. The
best resources for learning about IPython are probably screencasts, and the
author of Pandas has a book out named "Python for Data Analysis" that covers
IPython, NumPy, Pandas and some matplotlib.

------
nasir
I did the stats work for my master's thesis in R. It was a pain at the
beginning because I was not very good with stats, so learn the stats alongside.

------
nerfhammer
Looks like they emulated some features of readline in their javascript shell!
^A and ^E work (but no support for ^T, ^W, ^U...)

------
SeanA208
I really appreciate this! It's easy to follow and I was looking to learn the
basics of R, this seems like the simplest way.

------
65b
I got a random hang after typing hel( instead of help( also on sqrt( of a
vector.

Also paste works when you shouldn't be able to type.

------
xwowsersx
Just started this now. It's really awesome. Only wish it was a bit faster.

------
WinnyDaPoo
Shucks! It won't let me type '=' in Opera! :(

~~~
temp453463343
I'm not having this problem in Opera... I'm running version 12.11

~~~
WinnyDaPoo
Yup, Opera 12.11 on Debian Testing x86_64.

    
    
      _ and - result in -
      = and + result in +
    

This is so odd...

~~~
temp453463343
Oh, I'm on Windows... maybe a platform-dependent bug?

~~~
WinnyDaPoo
I'd suspect it could be. It wouldn't be a first for Opera.

------
jenni18071
chisq.test(c(14,25,1,8,25,4,7,15,23,6,0,5,26,8,6),simulate.p.value=TRUE)

------
jenni18071
chisq.test (arbre,simulate.p.value=TRUE)

------
tekniiq
numpy is the best!

------
frozenport
I do most of my stats work in Matlab, why should I use R?

~~~
temp453463343
Don't.

I've used both extensively (though not for statistics). R's syntax is a little
more C-like and consistent than MATLAB's - however the biggest difference is
documentation.

R's, like most open source documentation, is rather terse and often very
unsatisfactory. This gets especially apparent once you get into 3rd party
libraries and use things like Bioconductor. You’ll have no idea how things are
designed to be used, and without a guru at hand to walk you through you’ll be
in a world of pain. Googling for solutions is also very difficult (even using
something like RSeek). There are some archaic boards that sometimes have what
you need, but often you'll get stuck and not know what to do. What’s nice
about MATLAB is that all the libraries are made by a competent team of
engineers and they put in the money/resources to have good documentation. Even
the more abstract rarely used libraries have decent documentation. In R, if
you try using non standard libraries, you’re gunna get screwed.

The IDEs for R are also worse. RStudio is quite nice, but it's really bare-
bones compared to MATLAB's IDE. The one really neat thing about it is that you
can host it on a server and then remotely work on your work by just going to a
URL.

Also I think there are legacy issues in R (though MATLAB has those too). So
there are for instance matrices, dataframes and lists (which are list vectors,
but not at all). Why there are these three formats that fundamentally do the
same thing is beyond comprehension. (Maybe someone can give some insight)
Functions will randomly return one type or another. I always find myself
fighting to keep the types consistent and R keeps trying to mess with me. In
MATLAB everything is a matrix, so that makes things a lot easier

Fundamentally the issue is that MATLAB has a much larger user-base than R, so
you'll just have a much easier time working with it.

If R's documentation and community was on the same level as MATLAB's then I
would maybe consider recommending it. If you work in genetics and you need to
use something like Bioconductor, then R is a must I guess. For most other
libraries, fundamentally it's just some syntax differences.

The expression "You get what you pay for" is really pertinent here.

Note: I personally still use R for plotting, because I’m personally more
familiar with it. Otherwise I try not to touch it. Code organization for me
always gets messy, but I guess that’s cus I’m used to writing in OO languages.

~~~
jme3
This pretty much perfectly illustrates my comment below, pointing out that
these sorts of recommendations are entirely subjective and useless.

Many of your points are quite subjective. I could do the same thing with
Matlab. For instance, I find it mind boggling that anyone could get anything
done when you have to devote a separate file to every single function. That
seems incomprehensible to me. And yet, I realize that that's probably a mostly
subjective thing that you get used to.

Personally, I find R's documentation excellent. When people complain about it,
it's usually because they have mistaken it for a tutorial. It's not. It's
documentation.

Without any data, I seriously doubt your claim that Matlab has a much larger
user base. (There is considerably more activity in R on StackOverflow than in
Matlab.)

Your complaint about matrices, lists and data frames is similar. Data frames
exist for the same reason that there's a mean() function: a columnar data
structure that holds different data types in each column comes up so often
and is considered so useful that it is built in.
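
A brief sketch of that difference, using made-up id and name columns: a matrix is forced into a single type, while a data frame keeps one type per column.

```r
# A matrix holds a single type, so mixing numbers and strings coerces
# everything to character.
m <- cbind(id = c(1, 2), name = c("Ann", "Bob"))
typeof(m)        # "character"

# A data frame keeps one type per column.
df <- data.frame(id = c(1, 2), name = c("Ann", "Bob"))
sapply(df, class)

# Under the hood a data frame is a list of equal-length column vectors.
is.list(df)      # TRUE
```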

pandas in Python was developed in a way that went out of its way to
specifically _mimic_ these data structures because data frames are considered
such a vital aspect of R.

And keep in mind that these criticisms are all coming from someone who _also_
recommended against switching...!

~~~
temp453463343
You make good points, however I have to take issue with the documentation.

> When people complain about it, it's usually because they have mistaken it
> for a tutorial. It's not. It's documentation.

I don't really see the distinction. Documentation is supposed to explain to
you how to use the code. You can call it whatever you want. If it's through a
tutorial, then why not. R - and especially the non-standard packages you
download through CRAN - have very terse documentation that barely explains how
each function works on its own, much less how it works in the context of
the rest of the package. You can't just tell the user what goes into the black
box and what comes out and expect people to be able to use your software.

Sure, there are vignettes (I think that's the term), but they're really
inadequate b/c they only scratch the surface of how the package is meant to be
used.

Anyways, that's my 2 cents. I've spent soooo many hours fighting with R
documentation trying to figure out how to get what I needed done. Sometimes
months later I would find out there is a much better way to do something that
simply was not explained anywhere. I'm OK at R now, but I went through a lot
of pain to get to where I am now. I'd never wish it on anyone else.

My experience with MATLAB on the other hand has always been very pleasant. I
spent like 3 hours going over the tutorial on how to use it (much better than
R's "Introduction to R") and I hit the ground running. When I needed something,
a quick search through the help or online always turned up results.

~~~
pseut
From my memory, MATLAB's documentation not only discusses the implementation,
but also discusses the statistical/engineering methodology. It's overkill and
can be pretty annoying (paging back and forth between different parts of the
help can be somewhat time consuming) when you actually know the statistics but
just want to understand the implementation. Hence the distinction between
"documentation" and "a tutorial".

I don't know whether it's an explicit or implicit design choice or just a
happy accident, but I'm grateful that the R documentation doesn't try to hold
anyone's hand and guide them through data analysis beyond their training.

------
asdasdsdasdad
god loves a Try R.

------
just_saying
I shouldn't have to enable JavaScript to try R.

~~~
scrumper
Umm... You could download an R environment, but that's hardly a smaller
footprint than enabling JS. How else would you do it?

~~~
just_saying
Well, HTTP GET and POST still work. No one seems to mind sending and receiving
data that way when they're wrapped in AJAX calls.

