
R for Data Science - hadley
http://r4ds.had.co.nz
======
hadley
I'm the author, and I'm happy to answer any questions.

The book should be in print by (hopefully) the end of this year, or definitely
by Jan 2017. The content will not change significantly, but there will be
minor fixes and a lot of proofreading.

~~~
danso
Trivial, self-serving question: is there a library for generating the diagram
of table relationships shown here (13.2 nycflights13)?
[http://r4ds.had.co.nz/relational-data.html](http://r4ds.had.co.nz/relational-data.html)

And of course, thanks for another great book, it's helpful for learning R but
I'm always enlightened by how thoroughly you explain the general concepts
(e.g. Relational data and joins). Have heard a few people on faculty speak
enthusiastically about the book even as I hold out for more adoption of Python
:)
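
For anyone skimming the thread, the joins mentioned above can be sketched in a few lines. This is only an illustrative sketch in plain Python, not code from the book: the rows are made-up samples shaped like the nycflights13 `flights` and `airlines` tables, and `left_join` / `inner_join` are hypothetical helper names.

```python
# Toy rows shaped like the nycflights13 tables; the values are illustrative only.
flights = [
    {"flight": 1545, "carrier": "UA", "origin": "EWR"},
    {"flight": 1141, "carrier": "AA", "origin": "JFK"},
    {"flight": 725, "carrier": "B6", "origin": "JFK"},
    {"flight": 9999, "carrier": "ZZ", "origin": "LGA"},  # no matching airline
]
airlines = [
    {"carrier": "UA", "name": "United Air Lines Inc."},
    {"carrier": "AA", "name": "American Airlines Inc."},
    {"carrier": "B6", "name": "JetBlue Airways"},
]


def left_join(left, right, key):
    """Keep every row of `left`; unmatched rows get None for the right-hand columns."""
    lookup = {row[key]: row for row in right}  # assumes `key` is unique in `right`
    extra_cols = [col for col in right[0] if col != key] if right else []
    blank = {col: None for col in extra_cols}
    return [{**row, **lookup.get(row[key], blank)} for row in left]


def inner_join(left, right, key):
    """Keep only the rows of `left` whose key also appears in `right`."""
    lookup = {row[key]: row for row in right}  # assumes `key` is unique in `right`
    return [{**row, **lookup[row[key]]} for row in left if row[key] in lookup]


left = left_join(flights, airlines, "carrier")
inner = inner_join(flights, airlines, "carrier")
```

The left join keeps the unmatched flight and fills its missing `name` with `None`, while the inner join drops that row.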

~~~
hadley
No, I wish there was. Those were painstakingly drawn by hand.

~~~
phillc73
Possibly DiagrammeR?

[https://github.com/rich-iannone/DiagrammeR](https://github.com/rich-iannone/DiagrammeR)

------
minimaxir
R for Data Science is the canonical source for learning R and other real-world
R tools such as dplyr/tidyr/ggplot2, and one I've recommended on HN
submissions about R tutorials which simply go over primitive data types and
out-of-date packages. (It's one of the reasons I've postponed making R
tutorials myself, since the book would be better/more accurate in all
circumstances.)

------
Kevin_S
Seems interesting. Quick question:

Some background on myself first. I am a financial consultant (only 1 year
since graduating) and am planning to do a PhD in Accounting in the next 3
years. Currently working through the GMAT, but once that is complete, I will
find myself with 2 or so years to do things that will help prepare me for
research. One thing I have considered is taking a course/reading books on data
science and such to prepare me for the advanced stats/data analysis that will
go on during research. As someone with no coding experience, and with a solid
quant background (I was an economics undergrad), would this book be a good
starting point for getting experience with this stuff? And is R the
appropriate language to learn? I don't mind learning to code, but it is
intimidating.

Thanks!

~~~
vegabook
Personally I work in macro and fixed income market analysis (strategist), and
I can heartily recommend R as your first language. Indeed, coming from a CS
background, I first applied Python to many problems, and resisted R which was
not a "grown up" programming language, in my opinion (some would make the same
accusation of Python). However, I dipped my toe in the water one day because R
had a Bloomberg terminal add-in and Python did not (at the time), and after
about a month of uphill learning curve the eureka moments started
materializing thick and fast. I cannot recommend R enough, as a problem
exploration language. It just beats Python hands down when it comes to
grabbing some (usually dirty) data, mangling it around, cleaning it, and then
install.packages'ing a bunch of potentially useful libraries which allow you to
do everything you could possibly imagine to a small to medium sized data set.
And crucially, static graphing. Nothing else comes close for this use case.

Now...caveats. R is not a production programming language. If you find
yourself creating something truly useful for many users, that requires robust
programming language structures such as threading, proper memory management,
server-capability, or indeed, speed, R is going to become frustrating. Yes a
whole bunch of people will tell you "it's possible, I do it, etc", but that is
not its sweet spot. Also, if your data set is bigger than 2-3 gig or so,
you're going to start hitting R's memory management wall. It's slow. You'll
then be better off with Python, C, or indeed, Scala, or possibly, Apache
Spark. What these caveats have in common, however, is that they're
definitely second-order problems that arrive later in your career life cycle;
they take nothing away from R as an excellent mainstream data science tool for
people who have outgrown Excel, but are not full-fledged computer scientists,
and who want to get (lots of) stuff done.

(By the way, pre-empting comments: yes, pandas is great, but no, it's not quite
R.)

~~~
Kevin_S
Thanks for the replies everyone. I'll definitely save this link and pick up
the book when it comes out!

~~~
131012
If you want to attain a deeper understanding of coding, I suggest you try
another language once you get the working basics of R. Maybe some basic MOOC
on Python or C. You'll understand concepts of CS that are hard to grasp by
learning only R.

------
zzleeper
Looks really nice. I'm a heavy Python/Stata user, but I'm seriously thinking
about transitioning, given all the amazing work in the hadleyverse.

Also, RMarkdown looks incredibly well thought out.

~~~
hadley
_cough_ tidyverse _cough_

------
yodsanklai
I'm learning R for fun at the moment. I'm sure it's super useful for
statisticians but it's quite an intricate language! It's an unlikely mix of
different paradigms and features. Not something I'd recommend to a beginner
programmer, yet it seems that people love it (even non-programmers).

I looked at several tutorials and what has worked best for me so far are the
official manuals
[https://cran.r-project.org/manuals.html](https://cran.r-project.org/manuals.html)
(esp. the "R Language Definition" and "An Introduction to R").

Moreover, for the programming languages enthusiasts, the following article is
pretty interesting:

Evaluating the Design of the R Language (Morandat, Hill, Osvald, Vitek).

"R is a dynamic language for statistical computing that combines lazy
functional features and object-oriented programming. This rather unlikely
linguistic cocktail would probably never have been prepared by computer
scientists, yet the language has become surprisingly popular. With millions of
lines of R code available in repositories, we have an opportunity to evaluate
the fundamental choices underlying the R language design. Using a combination
of static and dynamic program analysis we can assess the impact and success of
different language features."

~~~
hadley
You might enjoy [http://adv-r.had.co.nz/](http://adv-r.had.co.nz/), which
discusses R from more of a programming language perspective (albeit a
programming language that is chiefly used for data analysis). There are a lot
of misunderstandings about R the language.

------
benhamner
Great book!

We also have almost 10,000 forkable & executable R examples on Kaggle
([https://www.kaggle.com/kernels](https://www.kaggle.com/kernels) - select R
from languages). Almost all of these use at least one of Hadley's libraries.

------
mooneater
hadley, I love your book, and I learn a lot from your preferences in R
packages. Now is there a general source for determining the "best" packages
for various tasks?

CRAN has task views, but they are long lists and don't clearly show popularity
or feature matrices. There are just so many options.

I'm thinking of something like
[https://djangopackages.org](https://djangopackages.org); for example, see
[https://djangopackages.org/grids/g/commenting/](https://djangopackages.org/grids/g/commenting/)

~~~
hadley
Not yet. It's a hard problem.

------
dreww2
Hadley, can you share a bit more about your plans for modelr and what need(s)
the package will be designed to solve? Congrats on your book btw, I've been
reading it for a few weeks and it's quite simply excellent.

~~~
hadley
I don't think modelr is going to change significantly in the future. It solved
a pressing problem (fitting models as part of a pipeline) so I could teach
modelling using the same interface as everything else in the book.

However, the modelling infrastructure in R is generally showing its age, and
thinking about how to make modelling easier is something that I will be
working on in the coming months.

~~~
dandermotj
I saw the vctrs package repo the other day on your GitHub. What's your plan
with that? I guess you're covering all of R's base data types (dplyr:data
frames, purrr:lists, forcats:factors, vctrs:vectors)? Also, do you plan on
developing further functional programming packages?

~~~
hadley
Yes. You can read the issues for what the package will start with.

No plans for more FP packages in the near future, although I do want to add
multicore and progress bars to purrr.

------
Rekushi
What is the equivalent book for Python and data science?

~~~
danso
There's a large number of such books, though none that are as authoritative
with respect to Python (this is a statement about the size of Python's
community vs. R, not necessarily about the authors):

- via Wes McKinney, creator of pandas (which makes Python about as close to R
as you can get): [https://www.amazon.com/Python-Data-Analysis-Wrangling-IPytho...](https://www.amazon.com/Python-Data-Analysis-Wrangling-IPython/dp/1449319793)

- [http://joelgrus.com/2015/04/26/data-science-from-scratch-fir...](http://joelgrus.com/2015/04/26/data-science-from-scratch-first-principles-with-python/)

There are a bunch of books specific to machine learning too, though I haven't
read them myself.

~~~
hadley
What would you recommend for visualisation?

~~~
danso
I actually have no idea about that. I don't think there's an equivalent to R's
base graphics, so that would seem to make matplotlib the closest thing to a
standard -- seaborn [0], which I've seen used a lot lately for more advanced
dataviz, lives atop it, but it's also relatively new.

People seem to have conflicted feelings about matplotlib, maybe because of its
origin in MATLAB? Not that MATLAB itself is bad, but I think the decision to
make matplotlib's API comfortable for MATLAB users seems to cause confusion for
contemporary users, even before the usual 2.x vs 3.x issues (matplotlib was
ported to 3.x a few years ago, but many users still write Python in the 2.x
style).

Anecdotally, I feel like I see advice like "Just use plotly" more than I see
recommendations to actually learn matplotlib. I actually gave up on matplotlib
until I stumbled upon this comprehensive tutorial, which covers the basics and
many elaborate use cases. If there's a book that does it better, I haven't
heard about it:

[http://www.labri.fr/perso/nrougier/teaching/matplotlib/](http://www.labri.fr/perso/nrougier/teaching/matplotlib/)

The matplotlib site itself is chock-full of well-documented examples, but some
of them seem to be significantly more verbose than they need to be. My
impression is that the library is stable/ubiquitous enough that there isn't a
big movement to overhaul things. Last time I looked at the API changes for
v2.0 [1] (1.5.3 is stable), most of the changes had to do with default styles
and stylesheets, which is non-trivial given the number of people who use
ggplot2 because it "just works".

[0]
[https://stanford.edu/~mwaskom/software/seaborn/](https://stanford.edu/~mwaskom/software/seaborn/)

[1]
[http://matplotlib.org/devdocs/users/dflt_style_changes.html](http://matplotlib.org/devdocs/users/dflt_style_changes.html)

~~~
sonabinu
What are your thoughts on Bokeh? I seem to always revert to R for
visualizations.

