
An opinionated view of the Tidyverse “dialect” of the R language - porker
https://github.com/matloff/TidyverseSkeptic
======
minimaxir
From personal experience, I learned R during college as a part of my stats
courses, before tidyverse even existed.

After doing a few personal projects using base R with complex manipulations, I
would have _quit using R entirely_ if it were not for tidyverse/dplyr.

I'm a _pragmatic_ software engineer: I prefer to get a data analysis job done
as quickly as possible, and IMO there isn't much merit in making things more
complicated by using base R just for the sake of base R. It's definitely not
_faster_, in either cognitive load or execution speed. In my data science work
there hasn't been a _single_ case where I've needed to fall back to base R, or
wanted to.

Pipes are being undersold here: the examples only show a single piped action,
whereas pipes shine when you have to do 5+ operations in a single manipulation
(I've seen base R code that does that by just having a _lot_ of redundant df
assignments and it's a pain to read).
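For instance, a minimal sketch of the difference (assuming dplyr is installed; the built-in mtcars data stands in for a real dataset):

```r
library(dplyr)

# One pipeline instead of a series of throwaway data-frame assignments:
result <- mtcars %>%
  filter(cyl > 4) %>%                              # keep 6- and 8-cylinder cars
  mutate(kpl = mpg * 0.425) %>%                    # derive a new column
  group_by(gear) %>%
  summarise(mean_kpl = mean(kpl), n = n()) %>%     # aggregate per group
  arrange(desc(mean_kpl))                          # sort the result
```

The base equivalent needs either deep nesting or an intermediate variable per step.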

~~~
phonebucket
With regards to the usefulness of piping with dplyr: the post does express its
admiration for data.table as an alternative to dplyr.

In data.table, chaining operations together is standard and as simple as a set
of brackets to contain the next operation.
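A minimal sketch of that chaining (assuming data.table is installed; mtcars stands in for real data):

```r
library(data.table)

dt <- as.data.table(mtcars)

# Each [] operates on the result of the previous one:
out <- dt[cyl > 4][, .(mean_mpg = mean(mpg)), by = gear][order(-mean_mpg)]
```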

Both are preferable to pandas though (joke not troll).

~~~
bsg75
There are nuances that to me make R data frames, especially with Magrittr,
simpler to work with over Pandas. Pandas indexing always seems to get in the
way.

I have been using Python a lot longer than R, but the Pandas syntax is taking
longer to internalize.

Maybe it's just me?

~~~
wodenokoto
No. Pandas is a powerful tool, but a poor api / user interface.

~~~
finebalance
Pandas grouping and slicing is, imo, significantly worse than R. And I so
despise those multiindexes.

------
roenxi
He is glossing over the major issue with base R, which is its tendency to
switch data types in a way that appears random to new users. As an undergrad I
spent nights literally on the verge of tears debugging R code where the types
had been mucked up by R's bizarre semantics. Claiming that data[foo] is the
same as filter(data, foo) is not correct - the [] and [[]] operators have a
lot of strange side effects depending on what argument is passed in. It is
like saying map() and a for loop are the same because everything a map can do
a for loop can do. That is true, but the single-mindedness of map is a major
selling point because you know it is only going to do certain specific things.
The issues here will disproportionately affect beginners who are likely to
accidentally pass in the wrong thing - it needs to fail with an error or in a
predictable way to best help them. Hadley's code fails sensibly.
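A small sketch of the kind of silent type switching meant here, using only base R:

```r
df <- data.frame(x = 1:3, y = letters[1:3])

# The same column, three bracket forms, two different result types:
class(df[, "x"])          # "integer"    -- single column drops to a vector
class(df["x"])            # "data.frame" -- list-style indexing keeps the frame
class(df[, c("x", "y")])  # "data.frame" -- two columns can't drop
```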

One of the _major_ innovations of the tibble package is that it puts the types
of each column into the headings when it prints the table out. Dr. Matloff is
grossly underestimating how painful it is for beginners to, eg, figure out
when they are dealing with a column of strings vs a column of factors. If a
beginner copies code to someone else for help, the other person can actually
tell what types are involved! It is like magic compared to base R.

Given how ad-hoc the R community is and the paucity of people with programmer
backgrounds, these minor usability improvements on controlling types are
important. And anybody who cares about efficiency is using the wrong language.

 _EDIT_ Also, regarding the comment that RStudio is bypassing an Open Source
project's leadership - that is one of the major features of open source. It is
usually a healthy sign that the project leadership is lost in the weeds. I recall a
situation with broad parallels in the history of GCC for example; it didn't
kill the project although I believe a bad codebase died.

~~~
nonbel
> "Dr. Matloff is grossly underestimating how painful it is for beginners to,
> eg, figure out when they are dealing with a column of strings vs a column of
> factors."

This is just sapply(df, class)... Is this not found in R 101 material that
beginners are likely to find?
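For reference, the one-liner in question (base R only; `stringsAsFactors` set explicitly so the result is the same across R versions):

```r
df <- data.frame(name  = c("a", "b"),
                 score = c(1.5, 2.5),
                 grp   = factor(c("x", "y")),
                 stringsAsFactors = FALSE)

# Named character vector: name "character", score "numeric", grp "factor"
sapply(df, class)
```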

~~~
ACow_Adonis
Pay no attention to the myriad of other apply functions hiding behind the
curtain and the difficulty in remembering what each of them does and why you
can't just use map, eh? :)

So to answer your question, no, it's not simply that.

~~~
Zelazny7
I use a simple device to keep these straight:

- (l)apply: list apply always returns a list

- (s)apply: simplify apply tries to return a simplified result

- (v)apply: verify apply checks that the return type conforms to a
user-supplied example

- (m)apply: multiple apply applies FUN to multiple vectors

- (r)apply: recursive apply is essentially a flatmap

- apply: no device here; only use on matrices, never data.frames
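The first three in action (base R only; plain doubles so the template check is uneventful):

```r
x <- list(a = c(1, 2, 3), b = c(4, 5, 6))

lapply(x, sum)              # always a list: list(a = 6, b = 15)
sapply(x, sum)              # simplified to a named vector: c(a = 6, b = 15)
vapply(x, sum, numeric(1))  # like sapply, but errors unless every result
                            # matches the numeric(1) template
```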

~~~
roenxi
After reading your explanation I still, as has been the case for years, don't
understand what sapply or rapply does, and vapply sounds weird.

That isn't going to change, because I'm just not going to use them or 'invest'
the time in finding out what some statistician-of-yore's interpretation of a
map is. Instead I'll stick to tidyverse map - returns a list. Or tidyverse
map_[int/chr/dbl/etc, etc] if I want a vector of [int/chr/dbl/etc, etc].

That and the data frame manipulation verbs covers the most useful 80% of cases
where *apply would otherwise be needed. If the base R team were implementing
functions that way in the base, stats professors wouldn't need to complain
about mass exoduses from base R.
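A minimal sketch of that contract (assuming purrr is installed):

```r
library(purrr)

x <- list(a = c(1, 2), b = c(3, 4))

map(x, sum)      # always a list
map_dbl(x, sum)  # a named double vector, or an error if any result isn't one
```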

~~~
Zelazny7
Yeah, that's a valid point. And it took me a couple years to internalize these
differences -- mostly by being burned on numerous occasions. Base R has a lot
of idiosyncrasies and they have to be memorized, unfortunately.

I develop R packages for my colleagues and so I stick to Base R whenever
possible. I don't want my packages depending on the tidyverse at all. But for
EDA, I am agnostic about what my colleagues do. They should use the tools that
stay out of the way and let them get their hands around the dataset
intuitively. For me that's Base R, for others it's data.table or tidyverse.

------
pgcudahy
In this article and others I've read, people complain about the tidyverse's
lack of performance, but I think that places too much emphasis on speed of
execution versus speed of development. As an academic, most of the R users I
know only code as a portion of their scientific projects. Besides data
analysis we're doing data collection, manuscript writing and grant writing.
The tidyverse's more English-like syntax (eg select() versus `[`) and
following a series of pipes rather than unnesting ten sets of brackets makes
it so much easier to come back to my code after a few weeks or months and pick
up where I left off, or to work with other people's code. My time is more
valuable than computer time so the tidyverse is my choice.

~~~
FranzFerdiNaN
This is my experience as a data engineer/analyst. I work in the healthcare
space and my analyses run on datasets that top out around half a million rows.
Yes, data.table is faster than dplyr/tibbles, but I don't care. Like you said,
most of my time is spent on expressing my thoughts in code, not running the
code. Tidyverse really simplifies that compared to base R.

It's good to know tools like data.table exist though; people shouldn't think
the tidyverse is the only way to do things.

------
ellisv
As an R user of some 10+ years, I think some of this is spot on and some of it
is just silly.

Pretty much everyone knows the Tidyverse is slow to run, but many of us find
it faster to write and very readable because it avoids one-time-use
assignments (i.e. in `a <- f(x); g(a)` the variable `a` is only used once).
Naming things is hard. Prof. Matloff's point is well taken though - you don't
_need_ the Tidyverse and it doesn't run faster. Fair criticism.

> The "star" of the Tidyverse, dplyr, consists of 263 functions. While a user
> initially need not use more than a small fraction of them, the high
> complexity is clear. Every time a user needs some variant of an operation,
> she must sift through those hundreds of functions for one suited to her
> current need.

> By contrast, if she knows base-R (not difficult), she can handle any
> situation.

Yet base R has some 1,346 functions. Another 455 functions in `stats`. I'm not
sure the quantity of functions has as much bearing on learning as the quality
and consistency. So this seems like a silly argument to me.

Mostly I think Prof. Matloff is forgetting that base R is kinda terrible and
no one else has done much about it aside from RStudio.

~~~
cwyers
> Mostly I think Prof. Matloff is forgetting that base R is kinda terrible and
> no one else has done much about it aside from RStudio.

I think he'd disagree with the contention that base R is kinda terrible. Like,
I'm pretty sure he's wrong. But this isn't something he's forgotten, it's
something he's never believed.

The point about the Tidyverse and RStudio is that RStudio doesn't make any of
its revenue from the Tidyverse. They distribute it free. There are no upsells
for any of it -- I can't go out and buy ggplot2 add ons, I can't buy dplyr
support contracts. There's no money to be made on the Tidyverse, and plenty of
resources thrown at it.

So what is RStudio selling? They sell an IDE, they sell servers to host things
built in R, they sell a package manager solution, they offer most of these
things as a SaaS offering as well (although rstudio.cloud is still in trial
and they haven't started charging yet).

In other words, RStudio offers tooling for people who write R. They don't
really care what kind of R you're writing, from a financial perspective. They
get paid the same if you're using RStudio to write stuff using data.table as
they get paid if you use dplyr. RStudio is making a bet that more people will
want to write R code if they make R a more pleasant language to work in. And
anecdotally, they're absolutely right -- I tried to learn R before the
Tidyverse, and I hated every moment of it. I like post-Tidyverse R (although I
got into it when people still called it the Hadleyverse). I know several other
people with similar experiences to mine.

~~~
cosinequanon
If you don't think base R is kinda terrible then I recommend you read R
Inferno. It's truly a crazy language with enough footguns for a centipede. A
lot of the attraction to the tidyverse is having sane defaults and consistent
APIs that help you do the right thing quickly.

------
kickout
Decent critiques in my opinion. I was trained on/learned base-R myself ~10
years ago and love the "tidyverse" (although the author is correct, data.table
is superior for big data and it's not particularly close). Can't imagine people
only knowing tidyr/dplyr/etc without knowing base-R; seems like those people
would get exposed quickly.

I enjoy base-R solutions where I can. I use data.table when my data sets are
medium-ish. But like most people, my data is small (people just call it big),
and thus the attraction to dplyr in particular. Those chained pipes are
addicting.

Always love seeing R discussion on HN, seems to be taboo though (not a 'real'
language and all that).

~~~
Bootvis
I don’t even understand the "not a real language" critique. R:

- is based on Scheme, a HN favourite;

- integrates very well with C++ through Rcpp, allowing you to do whatever you
want. 9/10 times someone already went through the trouble for you.

~~~
nerdponx
The R standard library is kind of weird and hard to use for "general purpose"
programming. It's very clearly a domain specific language.

Also, it's unbelievably slow for basic operations like looping, function
calls, and variable assignment. It's literally orders of magnitude slower than
Python (I've tested it). Unlike its fellow C-flavored Lisp, JavaScript, R
retains an extreme level of homoiconicity, which apparently makes the language
very difficult to compile or optimize. I don't know how e.g. SBCL does it, but
over the lifetime of the R language no one has managed to implement a
non-trivial subset of R that performs better than GNU R (FastR was never
finished to my knowledge). So you are basically relegated to writing code in
C, Fortran, or C++ if you need even decent performance. Otherwise you are
stuck using "vectorized" operations like lapply(), which are fine but make for
a jarring experience if you're coming from other languages.

So of course it's a "real" language in the sense that Bash is a real language.
But it's ultimately and fundamentally a domain-specific scripting language and
I don't know if there is a way around that.

~~~
vincent-toups
I can give you some sense of why one can compile SBCL and not R. SBCL and
other Lisps these days maintain some kind of conceptual barrier between
macroexpansion time and run time, so that you can _stage_ your activity
appropriately. Read the code, gather the macro definitions, expand the code
(repeat as necessary) until you have some base language without
metaprogramming in it.

Then you can optimize and compile. This is particularly true of Scheme, which
foregoes much of the dynamic quality of Common Lisp in favor of a much more
static view of the world. But it's still basically true in CL.

R doesn't have macros of that kind. It has a sort of weird laziness based on
arguments remaining unevaluated until needed. Metaprogramming in R is
typically done by intercepting those "quoted" arguments and then evaluating
them in modified contexts (this is exposed to the user more or less by letting
them insert scopes into the "stack").
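A tiny sketch of that "evaluate the quoted argument in a modified context" pattern, in base R (`eval_in` is just an illustrative name; this is the trick behind with(), subset(), and much of the tidyverse's non-standard evaluation):

```r
# Capture the argument unevaluated, then evaluate it with a data frame
# acting as the innermost scope.
eval_in <- function(data, expr) {
  eval(substitute(expr), envir = data, enclos = parent.frame())
}

df <- data.frame(x = 1:3, y = c(10, 20, 30))
eval_in(df, x + y)  # c(11, 22, 33) -- x and y resolved inside df
```

Because `expr` could be anything and the scope is built at run time, a compiler can precompute almost nothing here.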

Thus, there is no distinction between macroexpansion and execution time.
Hence, it's tough to write a compiler, which is basically just a program that
identifies invariants ahead of time and pre-computes them (eg, lexical scope
is so good for compilers because you can pre-compute variable access). Because
all sorts of shenanigans can be got up to by walking quoted expressions and
evaluating them in modified contexts, R is hard to compile.

This is, by the way, why the `with` statement was disallowed in JavaScript's
strict mode. It provided exactly the ability to insert a scope into the stack
used to look up variable bindings.

~~~
lispm
> This is particularly true of Scheme which foregoes much of the dynamic
> quality of Common Lisp

The idea of even the standard Common Lisp is that both are possible: a static
Common Lisp and a dynamic Common Lisp, even within the same application in
different sections of the program. Common Lisp allows hints to the compiler to
remove various features (like fully generic code being reduced to type
specific code), it allows a compiler to do various optimizations (inlining,
getting rid of late-binding, ...) - while at the same time other parts of the
code might be interpreted on the s-expression level.

~~~
vincent-toups
It's definitely the idea. I know it works well enough for a lot of people, but
for me, I'm happy to just give up on the dynamic behavior in favor of a
simpler universe.

The last thing I want to be thinking about is whether my compiler needs a
"hint" that something can be stack allocated, for instance. That is their
business, not mine.

~~~
lispm
Common Lisp as a language standard makes no commitment to the smartness of a
compiler. If a certain compiler can figure out things, great, but some
compilers might be dumb by design (for example to be fast for interactive
use). That's left to implementations.

------
throw20102010
I think he’s exaggerating the effect of some of these claims. People who “grew
up” learning R+Tidyverse will discriminate against non-Tidyverse users? Give
me a break. You are either an advanced or competent or novice R user. You
either understand how pipes (%>%) work or you don’t.

I doubt that anyone is going to be denied a job because they are an _amazing_
R programmer but they just don’t have the experience with a particular set of
packages, especially those that are as well implemented and easy to learn as
the Tidyverse.

I can totally imagine someone not getting a job because they are a crap
programmer, and then blaming it on something else. Or I can also imagine
someone saying that the Tidyverse packages suck, and then be denied a job
because of their attitude.

I could make a similar argument for most of the substantive claims in his
post.

Some of these examples (dplyr vs. data.table) are cherry picked. I have
several of my own examples where read_csv is way faster than read.csv, so
maybe, like all good programmers, we should be testing and profiling our code
and implementing the parts that make the most sense for our needs.

The bottom line is that the Tidyverse is a good set of packages and you are
free to use them or not. There isn’t some blood feud (like vi vs. emacs)
between users and non-users, we all get along just fine.

There are dozens of great tutorials on how to learn R that don’t use the
Tidyverse, and RStudio is under no obligation to offer a full course on every
possible way to learn R. Any R user is also a competent Google user.

It’s fine to be opinionated. But, a professor as respected as Norm Matloff
should be careful of how they say things, or they risk souring their students
on a set of packages that might be very useful in the future.

~~~
educationdata
"cherry picked"? There is no doubt that data.table is generally way faster
than dplyr, unless you cherry pick a few use cases.

The problem is that dplyr is much slower than data.table and RStudio is
promoting tidyverse too much to make the slower choice a default for many
users.

The article is 100% correct on this issue.

~~~
cwyers
Depending on the size of your data, you might not care that dplyr is slower
than data.table. If you're better at writing/composing dplyr, you can often
make up the speed difference between the two in terms of the savings in time
spent writing and reading code. And if your data is that large, there are
solutions like dbplyr out there to run dplyr code on various backends and
offload the computation outside of R.

~~~
minimaxir
In practice, if there's ever a case that there's "too much" data such that
dplyr starts to hang (e.g. millions of rows, hundreds of columns), you would
get better value by setting up a database first with the data. Which you can
then query with dbplyr!

~~~
educationdata
data.table can handle millions of rows easily, as long as the data can fit in
the memory.

------
sghi
As someone who uses and teaches R extensively (and loves the tidyverse): the
tidyverse is so much easier to teach to people without any programming
experience and with very little faith in their tech or maths skills (which
was me as well).

The tidyverse just 'made sense' to me when I started using R for the first
time a few years ago, and now I love using R and programming. On the other
hand, some of my ex-classmates learnt base R (because that's what we were
taught) and found it hard, didn't learn anything properly, and now still think
R or other programming languages are opaque and hard.

I'm not particularly fussed if Statistics Profs prefer data.table to dplyr or
base R to tidyr, I know what is easier to teach, understand and use for me and
a lot of other ecology/bio students and people.

~~~
hardboiled
I concur with your sentiments having cultivated data science teams from the
ground up with diverse educational backgrounds.

Programming in base R is more akin to assembly language and has accreted a
babel of inconsistencies that make it difficult to teach and learn. Learning
base R isolates you into a Galapagos island of academics who are either
ignorant of the needs of data workers or too elitist to engage with those not
in their priesthood.

Learning Tidyverse is a considerably better transition for learning other
languages, frameworks, and libraries.

Functional programming is closer to algebra than indexing into data structures
with magic numbers. I've found more success teaching functional pipelines of
data structures using the idioms in Tidyverse as a general framework for data
work than base R. Abstraction has a cost but for learning it is the
appropriate cost.

I sense that much of this `monopolistic` fear mongering is really about
feeling out of date.

------
mespe
I teach and consult on R and data science. I had the privilege of learning R
from one of R's core developers. My students often ask why I don't use the
Tidyverse. The answer is because I don't need to - I can do everything the
Tidyverse does and so much more in base R.

This article only briefly touches on what I think is the biggest issue with
the Tidyverse. The Tidyverse is incredibly limiting. The "tidy" workflow hides
many details of the language, which leads many users to think all data has to
be in a tidy "data.frame" (or now tibble) and organized just so. The functions
the Tidyverse authors have chosen to implement are seen as the limits of R's
capabilities. Every data problem has to be forced into the Tidyverse box (or
it is impossible). I cannot tell you how many Tidyverse scripts I have seen
where 90% of the operations are munging the data into the correct format for
the Tidyverse functions, when the original data can be handled with a single
base R function (e.g. lapply). Most Tidyverse users accept this as just the
way R is.

Most R users who only learn the Tidyverse never hit its limits, and for them
the Tidyverse is perfectly OK. Those that do either resign themselves to the
perceived limits of R, or (hopefully) start learning some base R. One friend
who took the latter path once exclaimed to me "Logical subsetting is
amazing!?!" - this is a foundational piece of the language. That she went 2+
years in the "Tidyverse" without even knowing it was an option was eye-opening
to me.
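For anyone who hasn't met it, a minimal base R sketch of logical subsetting:

```r
df <- data.frame(name = c("a", "b", "c"), score = c(90, 55, 72))

# A logical vector used as an index keeps exactly the TRUE positions:
df[df$score >= 70, ]      # keeps the rows for "a" and "c"
df$score[df$score >= 70]  # c(90, 72)
```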

To its credit, the Tidyverse is empowering - with very little programming
knowledge a beginner can do a lot of data science. However, the majority of
these people get stuck as "expert beginners". Some of these become fierce
advocates of the Tidyverse without truly understanding base R. Meanwhile, none
of the truly advanced R users I encounter use the Tidyverse.


~~~
FranzFerdiNaN
"Meanwhile, none of the truly advanced R users I encounter use the Tidyverse."

This is most likely because truly advanced R users have been using R since
long before the Tidyverse existed. A whole new generation of R users is being
brought up on the Tidyverse, so I'm curious to see how the situation will look
in 10 years' time.

~~~
mespe
This is not entirely true - many are of the same "generation" as me. To be
clear, I don't actually consider myself an advanced user. To me, many of these
advanced users are pushing against the limits of the language in a way you
don't see with run of the mill data science.

That said, I have heard my mentor say "I don't get the point of <insert
tidyverse function> - I did this 20 years ago."

~~~
KKPMW
> That said, I have heard my mentor say "I don't get the point of <insert
> tidyverse function> - I did this 20 years ago."

Tidyverse has a tendency to promote "new" things without any references to
what came before. If you listen to their talks they speak as if they invented
functional programming and the idea of pure and tiny functions working
together.

And it trickles down to the users. A lot of people who learned tidyverse
first, for example, praise the `purrr` package but have no idea that something
like `Map()` exists in base R.
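For instance, a base-only sketch of the kind of elementwise iteration purrr users reach for:

```r
xs <- c(1, 2, 3)
ys <- c(10, 20, 30)

Map(function(x, y) x + y, xs, ys)     # list(11, 22, 33)
mapply(function(x, y) x + y, xs, ys)  # simplified: c(11, 22, 33)
```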

------
inputcoffee
I was waiting for the critique... but I never quite saw it.

Imho, the data problems the Tidyverse is trying to solve are basically the
ones we face in a database: select, join, inner join and so forth. Show me all
the rows in this datatable where the 4th column is larger than the 6th column
and the number itself is odd. Something like that.

There might be other ways to do it, but you want your select, filter,
summarize, mutate etc functions to all work with each other, pipe to each
other and be compatible.

Maybe there is a better way to do all this -- I haven't seen it but I am not
an expert -- but you have to show that to me.

So, in base R, walk through a set of example of mutating, joining, filtering
and so forth, and show me how they are all easier. Then I'll say, wow there is
an alternative to this Tidyverse thing. But in lieu of that demo, this felt
more like an intro to a complaint than an actual complaint.

Edit: Also, it's funny that Wickham is (apparently) such a nice fellow that
people go out of their way to be nice to him in critiques.

~~~
barcadad
His critique is more about the impact of the full ecosystem effect of the
Tidyverse, not what you are referring to, which is just the dplyr semantics.
The Tidyverse demands that it's many related packages use tidy data principles
and lock users into that approach, which differs from base-R. Much of this
discussion is really just a debate about dplyr and magrittr rather than the
fragmentation that the broader tidyverse has brought on. All that said, I
agree with many commenters that the Tidyverse's improvements to speed of
development can more than offset the speed of execution issues, at least for
small-to-medium datasets.

~~~
RosanaAnaDana
I think anyone making the speed-of-development argument has seriously missed
the boat. Let's be real: tibbles suck. Once you get the hang of data.table
syntax for matrix operations, its superiority becomes abundantly clear.

I run a data science group at a large geospatial company and we develop day in
and day out in R and Python. We've purged the tidyverse as much as possible
from our code base. We've moved completely over to data.table, which makes the
vast majority of the tidyverse irrelevant.

~~~
hadley
Let’s keep the discussion civil please. People have legitimately different
needs, and just because a package isn’t well suited to your needs doesn’t mean
that it doesn’t help people with different backgrounds and goals.

------
thom
Wow, the idea that the Tidyverse is somehow burdened with an overabundance of
computer science or functional programming is just hilarious to this
programmer who has had to maintain, debug and optimise other people's R code.

~~~
jrumbut
I think this is what the author is getting at, that correct tidyverse usage
requires greater experience and knowledge and creates difficult to debug and
optimize code when used incorrectly (and is easy to use incorrectly).

I tend to agree. I've found tidyverse code has a write-only quality to it.
Since I'm going to see more of it I plan to dive into the inner workings of at
least dplyr and purrr.

That said it is hard to deny what Hadley Wickham has done for the R community
and I can't write off the idea that he sees a bigger picture I'm missing.

~~~
hardboiled
Tidyverse code is not write-only; it is designed to be mostly read-only where
the level of abstraction is at the level of the domain, which in this case is
data frames and data pipelines akin to the same constructs in relational
algebra/SQL or data processing (map/reduce).

Correct usage requires learning the right abstractions, but fortunately these
abstractions are shared across language communities and frameworks.

If you are doing any data processing work and you do not know about
foundational functional concepts, i.e. map/reduce, then I would argue the code
you write is _less readable_, overly focused on the idiosyncrasies of how to
do something rather than on how entities and data are related to each other -
their logic and structure.

The main optimization is to be correct and expressive of one's intention and
purpose. If you need performance use Julia.

~~~
thom
It’s very hard to write expressive, readable code that munges some horrible
tabular format into another arbitrary tabular format, to be fair. That said
it’s true that most R code doesn’t seem like map/reduce, or some other
pipeline of transformations. It’s usually more like someone cut and pasted a
long, hard REPL session into a notebook and has no intention of ever scrolling
back up.

------
phonebucket
I initially viewed this article with skepticism, rather liking the functional
style of the Tidyverse.

However, the claims the author makes are entirely valid. data.table is a
fantastic package, and easy to learn. The Tidyverse really has split the R
community into disparate sets of R users.

To my mind, one of R’s greatest strengths is its metaprogramming
capabilities. It is this that allows such broad paradigms to exist within the
same language. In that, the challenges R is presented with are similar to
those of languages like Lisp dialects and Scala.

I don’t have solutions, but the author has convinced me that a problem is
indeed there.

------
uptownfunk
I guess I am what you would call an R power user.

What’s really crazy is I never realized how fast and easy excel pivot tables
were to work with.

I know I know it sounds ridiculous.

But if you do a lot of data slicing and dicing, Excel can actually get your
cuts out way faster through pivot tables than writing R code.

So if you use R for almost everything, give Excel a try as well. And the nice
thing is this is even more powerful if you work with non data science folks,
because then you can say I just did this in excel (where excel skills should
be required by everyone in the company, and R for the data scientists) and
they’ll probably just leave you alone the next time they need something since
they will try to do it in excel first.

~~~
inputcoffee
Agree 100%, but only if it is a one time activity. If you have to
automatically pull files and do the operation several times, it is better to
go with R (or python, or awk sed or whatever).

------
atheriel
Most of these tidyverse vs. data.table arguments end up sounding a bit
irrelevant to me. They tend to focus on syntax, which is largely down to
preference, or performance, which is really only important once in a while.
And they seem oddly focused on the need to choose one or the other, when it
seems obvious that all are good choices with various tradeoffs.

I tried my hand at saying something more original about this debate recently:
[https://unconj.ca/blog/for-and-against-data-table.html](https://unconj.ca/blog/for-and-against-data-table.html)

On the other hand, it's nice to see some criticism of RStudio and Hadley, even
if it does stray into the conspiratorial. They've done a lot for the R
community, but they do have some obvious blind spots.

~~~
vharuck
It's not that RStudio has nefarious motives that cause harm, it's that they
have neutral and sometimes benevolent motives with unforeseen consequences.

------
jhbadger
At least in my case, although I do like some of the tidyverse (ggplot is
tidyverse after all, and probably my favorite plotting system in any
language), I find the "tibble" data structure and its disdain for row names
really annoying. Many, many, packages which I use expect a data frame with row
names and likewise output such. So I am constantly converting back and forth
between tibbles and data frames.
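A minimal sketch of that round trip (assuming the tibble package, whose rownames_to_column()/column_to_rownames() helpers do the conversion):

```r
library(tibble)

df <- mtcars[1:3, 1:2]  # a classic data frame with meaningful row names

tbl <- rownames_to_column(df, var = "car")   # row names become a "car" column
df2 <- column_to_rownames(tbl, var = "car")  # and back again
```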

~~~
Bootvis
The claim that ggplot2 is tidyverse can be disputed. It existed before and
works fine outside of it.

~~~
hadley
You are of course free to dispute what is in the tidyverse, but I pretty
strongly believe that it is part of the tidyverse.

~~~
RosanaAnaDana
Seeing as the tidyverse is pretty much your invention, it's well within your
purview to define what is in and what is out. I do think, respectfully, that
both the OP and the article's general sentiment have the right of it.

Data.table and data.table-esque notation represent such an improvement over
tibble/dplyr that, within my company, we're making a concerted effort to purge
all tidyverse packages from general use (less ggplot). When new developers
come on, if they are coming from tidyverse, their first task will be something
involving pipes and data.table. Tidyverse was fine in school. It doesn't pass
muster in production, at least not in our work.

Data.table syntax is simpler, easier to read, easier to teach, and orders of
magnitude faster. It plays nicer with other packages than the tidyverse (if it
fits into a DF, it almost always fits into a DT, and i've never met a tibble
that I didn't wish was a data.table), and since almost all of our datasets are
10's to 1000's of millions of lines long, the decision was really made for us.
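For readers unfamiliar with the bracket chaining mentioned above, a minimal sketch (assuming the data.table package and the built-in mtcars):

```r
library(data.table)

dt <- as.data.table(mtcars)

# Filter, aggregate, and sort in one chained expression:
# each [] operates on the result of the previous one
result <- dt[mpg > 20][, .(mean_hp = mean(hp)), by = cyl][order(cyl)]
```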

~~~
hooloovoo_zoo
"Tidyverse was fine in school. It doesn't pass muster in production...."

This is a bit disrespectful.

"Data.table syntax is simpler, easier to read, easier to teach..."

This is rather arbitrary, and I don't think it's the majority view of the
community, whatever the advantages of data.table.

"...orders of magnitude faster"

This is an exaggeration in most real cases, even according to the benchmarks
pointed to by data.table.[1]

[1] [https://h2oai.github.io/db-benchmark/](https://h2oai.github.io/db-benchmark/)

------
zhdc1
I am a fervent user of data.table and try to avoid dplyr whenever possible.
I have colleagues who are exactly the opposite. No one cares.

If you want efficient code, try not to use dplyr. If you want readable code,
data.table isn't the best answer. If you want speed, there are better
languages than R to choose from, even with Rcpp.

If you want to get something done, go with whatever you know best and will do
the job.

Avoid Stata.

------
stilley2
I recently tried to do some stuff in R with the tidyverse, and was not a fan.
I'm no expert, but the tidyverse's frequent use of non-standard evaluation
drove me crazy. It makes it so much more difficult to write functions that
wrap tidy functions. However, I never see people complain about this, so maybe
I'm doing something wrong or not grokking something...

~~~
jointpdf
There is a new operator, {{ or “curly curly”, aimed at dealing with this. It’s
part of the rlang (0.4) package and looks snazzy / easy to grasp.

Check it out:
[https://www.tidyverse.org/articles/2019/06/rlang-0-4-0/](https://www.tidyverse.org/articles/2019/06/rlang-0-4-0/)
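A sketch of what that looks like in practice (assuming dplyr with rlang >= 0.4; `mean_by` is a made-up wrapper name):

```r
library(dplyr)

# {{ }} forwards a bare column name into dplyr verbs,
# replacing the older enquo()/!! dance
mean_by <- function(data, group, var) {
  data %>%
    group_by({{ group }}) %>%
    summarise(avg = mean({{ var }}, na.rm = TRUE))
}

mean_by(mtcars, cyl, mpg)
```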

~~~
RosanaAnaDana
Yuck.

~~~
jointpdf
Have anything more constructive to say?

------
cwyers
A friend pointed out to me that the author put out a Python versus R
comparison a few weeks ago, where he makes the rather incredible claim that
the addition of some opinionated packages to the R ecosystem is more
disunifying than the 2.7-to-3.x split in Python:

[https://github.com/matloff/R-vs.-Python-for-Data-Science](https://github.com/matloff/R-vs.-Python-for-Data-Science)

------
zelda_1
When I was writing my book (www.anotherbookondatascience.com) I always tried
to avoid using the tidyverse, and got some criticism for it. But now I feel
supported.

------
firstinstinct
at the end of the comments of the post [http://varianceexplained.org/r/teach-tidyverse/](http://varianceexplained.org/r/teach-tidyverse/)

compare base R:

    
    
      var1 <- "pounds"
      var2 <- "wt"
      mtcars[[var1]] <- mtcars[[var2]]/1000
    

Now Hadley's solution from the comments: "You can now express that sort of
thing fairly elegantly with dplyr:"

    
    
      var1 <- sym("pounds")
      var2 <- sym("wt")
      mtcars <- mutate(mtcars, !!var1 := !!var2 / 1000)

Watch your step: !!, :=, and sym.

That's a lot of added complexity just to be able to use column names
programmatically, and to top it all off, the use of non-standard evaluation
makes things very complex once one goes a little beyond basic EDA.

~~~
roenxi
Well, applied strictly to this case, the tidyverse command would be:

mtcars %<>% mutate(pounds = wt / 1000)

Which is obviously simpler, and quite likely what a beginner is actually
trying to do.

The real complaint - "tidyverse doesn't let us use variables to name columns!"
- is fair enough; following quasiquotation is hard work. If you need that,
drop back to base R. However, base R is harder to teach because all the common
operations (select, filter, etc.) involve some combination of [] and [[]], and
it is relatively hard to figure out which operation a given piece of code is
trying to do.

Except for the lucky few who have brains wired to think in terms of arrays and
database relations, someone learning the language is unlikely to thank you for
that.

------
c3534l
The tidyverse was nearly the only thing I really liked about R. The syntax is
quirky, the language is full of cruft, and it's unpleasant to make full
programs in. But if I picked up a new tidyverse package, I knew instantly how
to use it.

------
ahurmazda
With respect, it's laughable to think that anyone working on big data with R
has _not_ heard of `data.table`. When your livelihood depends on it, you tend
to find out one way or another.

However, I'd claim most folks don't deal with big data or even medium data!
[Aside: I chuckled at the 1e5 columns on the plot.] Personally, at the
small-to-medium range, all languages are fast enough. It boils down to
ergonomics for me. I _detest_ the subscript notation. It slows me down, not
just productivity-wise but also in how I read and reason about my code. Not
surprisingly, I prefer `dplyr` over `data.table` (or base R or `pandas`).

------
dracodoc
One reason dplyr is promoted is that you can connect to different server
backends like Spark, SQL databases, etc., which is the enterprise direction
RStudio is aiming at. On the other hand, data.table is your friend when you
are working on your own machine. With more and more memory and CPU power
available, its benefits are actually increasing.

I have been using data.table from almost day one, and it deserves more
recognition. The syntax is a little hard in the beginning but can be grasped
with some effort. I often feel that my exploration code would look much more
tedious if written in dplyr style.
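The backend point can be sketched as follows; this assumes the dbplyr and RSQLite packages are installed, and uses an in-memory database as a stand-in for a real server:

```r
library(dplyr)

con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
DBI::dbWriteTable(con, "mtcars", mtcars)

# The same dplyr verbs, but translated to SQL and executed by the database
q <- tbl(con, "mtcars") %>%
  filter(hp > 100) %>%
  group_by(cyl) %>%
  summarise(mean_mpg = mean(mpg, na.rm = TRUE))

collected <- collect(q)   # brings the result back into R as a tibble
DBI::dbDisconnect(con)
```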

~~~
vasili111
If my university uses enterprise RStudio on a grid, is it better to stick with
the tidyverse rather than data.table?

~~~
Tarq0n
It doesn't make a difference. RStudio's products aren't integrated with their
libraries in particular.

------
antipaul
dplyr is an "advanced" topic?

It's, like, the easiest data manipulation library in any language.

~~~
cwyers
People who were big into R pre-tidyverse have a tendency to view the way they
manipulated data frames pre-tidyverse as the "basics," and the tidyverse as
the "advanced" method. I think that does anyone learning R for the first time
an incredible disservice, for the reason you state: dplyr is fantastic. But
the old guard is (rightly) afraid that if people start with dplyr, they'll
never get around to learning what base R provides for manipulating data
frames. Which is fine in my view; nobody should have to learn that.
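For concreteness, here is the same manipulation in both styles (a sketch on the built-in mtcars; assumes dplyr is installed):

```r
library(dplyr)

# Base R: subset the rows, then aggregate with a formula interface
base_res <- aggregate(mpg ~ cyl, data = mtcars[mtcars$hp > 100, ], FUN = mean)

# dplyr: the same thing as a left-to-right pipeline
tidy_res <- mtcars %>%
  filter(hp > 100) %>%
  group_by(cyl) %>%
  summarise(mpg = mean(mpg))
```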

------
bagrow
I think the same thing is happening in Python for data science due to Pandas.
It’s a great package, but many data scientists are only able to manipulate
data using Pandas, and know little “base” Python.

~~~
jphoward
That sounds incredible to me, as using pandas without falling into the traps
of chained indexing seems so much harder than 90% of Python.
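The trap in question, for the curious (a minimal sketch):

```python
import pandas as pd

df = pd.DataFrame({"group": ["a", "a", "b"], "value": [1, 2, 3]})

# Chained indexing: the assignment may hit a temporary copy, so the
# original frame is often left unchanged (hence SettingWithCopyWarning)
# df[df["group"] == "a"]["value"] = 0   # unreliable

# The reliable form: a single .loc call with row and column selectors together
df.loc[df["group"] == "a", "value"] = 0
```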

~~~
telchar
I've been planning to go from "can work it out with copious examples" to
"knows when to use apply, transform, etc. off the cuff" in Pandas - do you
have any suggestions for good resources?

I'd like to understand better why the data structures work the way they do and
thus have an intuition on what operations to use when. The O'Reilly Python
Data Science Handbook[0] seems like it might be useful here, but I'm not sure
if it is still up to date.

[0] [https://www.oreilly.com/library/view/python-data-science/978...](https://www.oreilly.com/library/view/python-data-science/9781491912126/)

~~~
squaresmile
I find the pandas documentation pretty good [0]. In fact, I find the
documentation and official tutorials of pandas, numpy, and matplotlib all
pretty good. Each package has its own idiosyncrasies, but I think the
documentation and code examples cover them well enough.

The O'Reilly book should not be that outdated, if at all. It also covers other
essential tools.

[0] [https://pandas.pydata.org/pandas-docs/stable/getting_started/index.html](https://pandas.pydata.org/pandas-docs/stable/getting_started/index.html)

------
firstinstinct
A related post is [http://varianceexplained.org/r/teach-tidyverse/](http://varianceexplained.org/r/teach-tidyverse/)

------
firstinstinct
What I see clearly is that The R Inferno is not a good option to begin with,
except when you are a fighter prepared to learn at any cost.

------
kgwgk
> Ironically, though consistency of interface is a goal, new versions of Tidy
> libraries are often incompatible with older ones, a very serious problem in
> the software engineering world.

I was (sadly) amused by a recent comment on HN linking to some tweets
celebrating that some R code still ran four years later.

~~~
tylermw
That was my comment, and the point wasn't that the R code still ran four years
later; the point was that R Core's focus on backwards compatibility (in
reference to R being hamstrung by its focus on "compatibility with S") isn't a
bad thing. You're never going to get the 2.x-to-3.x schism that Python had
with future versions of R, assuming R Core continues with that philosophy.

~~~
kgwgk
I completely agree with your point and I think saying that R is keeping
compatibility with S doesn’t make much sense. R is keeping compatibility with
R, as it should.

Maybe I lack context, but those tweets looked to me as if someone said “I can
run this four year old game in the latest release of Windows!” or “Three
versions of Office later I can still open my excel files!”

