
R 4.0 - KKPMW
https://stat.ethz.ch/pipermail/r-announce/2020/000653.html
======
zhdc1
This alone is reason to upgrade -> "R now uses a stringsAsFactors = FALSE
default, and hence by default no longer converts strings to factors in calls
to data.frame() and read.table()."
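
A minimal sketch of the difference, assuming nothing beyond base R. The arguments are passed explicitly so the result does not depend on which R version runs it, and the variable names are purely illustrative:

```r
# Explicit arguments so the result does not depend on the R version in use.
old_style <- data.frame(name = c("a", "b", "a"), stringsAsFactors = TRUE)
new_style <- data.frame(name = c("a", "b", "a"), stringsAsFactors = FALSE)

class(old_style$name)  # "factor": what the pre-4.0 default produced
class(new_style$name)  # "character": what the 4.0 default produces

# A classic trap with the old default: a factor converts to its integer
# level codes, not to the text it displays.
codes <- data.frame(x = c("10", "20", "10"), stringsAsFactors = TRUE)
as.numeric(codes$x)                # 1 2 1 (level codes)
as.numeric(as.character(codes$x))  # 10 20 10 (the values themselves)
```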

~~~
FiReaNG3L
One of the big reasons why I quit R 10 years ago and never looked back -
Python wasn't secretly converting anything, AND failing silently when it's not
the expected type.

~~~
kyllo
R's come a long way in the last decade. The tibble and data.table packages
both address this issue. data.table
([https://github.com/Rdatatable/data.table](https://github.com/Rdatatable/data.table))
is the more strongly-typed of the two, by default it fails loudly when it
encounters data that doesn't conform to the column type. It's also quite fast
--binds to C code that parallelizes with OpenMP. It has very terse and
expressive syntax, I find it so much more intuitive and easy to work with than
pandas.

If you're happy with Python, by all means keep using it. I use both languages.
Just suggesting that if you gave up on R that long ago, you might be
pleasantly surprised by how much better it's gotten since then.

~~~
Tarq0n
vctrs [0] is the latest effort by the RStudio developers to help people write
type-stable code. The R standard library has a lot of issues with silently
casting types, but the wonderful thing about it being so scheme-like is that
many of these things can be evolved through libraries.

[0] [https://github.com/r-lib/vctrs](https://github.com/r-lib/vctrs)

------
bransonf
As someone who writes R daily, I’m really excited for 4.0. That said, R still
leaves a lot to be desired. Changing the default for stringsAsFactors is
great, and I think it reflects a small shift in R from being an exclusively
stats-based language to something more general purpose. stringsAsFactors
exists because most statistical models need categorical variables encoded as
factors rather than as plain strings.

That said, R still is a stats language by design. In Python or JS for example,
you can concatenate strings with ‘a’ + ‘b’ but the + operator in R is
explicitly only for numeric types. R also has a horrible architecture for
memory management, leading to code that uses profound amounts of RAM. I face
this issue constantly as I work with very large datasets.
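
For anyone coming from Python or JS, here is a rough sketch of what string concatenation looks like in base R. The variable names are illustrative only:

```r
# "+" is not defined for character vectors, so this raises an error.
plus_result <- tryCatch("a" + "b", error = function(e) conditionMessage(e))
plus_result  # the captured error message

# paste0() is the base R way to concatenate strings.
joined <- paste0("a", "b")
joined  # "ab"

# paste() inserts a separator (a space by default) and is vectorised.
labels <- paste("id", 1:3, sep = "_")
labels  # "id_1" "id_2" "id_3"
```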

I have a love/hate relationship with R and despise using it in production. I’m
also not a fan of the divergence that Tidyverse has caused. Particularly the
expectation of Non-standard evaluation and the tendency for new R learners to
become dependent on these packages. Especially as it relates to
reproducibility and deploying code, these unnecessary dependencies suck.
Tidyverse is maturing and breaking changes are still too common for comfort.
There is no reason in my opinion to load stringr when a grep() will suffice.
Or, to subset with select when [[ works perfectly fine. Or to filter when
subsetting on a logical with which()... the list goes on. Tidyverse is
essentially reinventing the wheel in many places. The biggest problem is that
it doesn’t translate well to base-R in my experience with new programmers,
leading to this divergence.
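
To make the base-R equivalents concrete, here is a small sketch on a made-up data frame. The tidyverse functions named in the comments are only there for comparison and are not called:

```r
df <- data.frame(
  name  = c("alice", "bob", "carol"),
  score = c(90, 55, 72),
  stringsAsFactors = FALSE
)

# Pattern matching with grepl() rather than stringr::str_detect().
# (grep() would return the matching positions instead of a logical vector.)
has_o <- grepl("o", df$name)          # FALSE TRUE TRUE

# Extract one column as a vector with [[ rather than dplyr::pull().
scores <- df[["score"]]               # 90 55 72

# Keep rows by condition with which() rather than dplyr::filter().
passed <- df[which(df$score > 60), ]  # the rows for alice and carol
```

Note that `[[` returns a bare vector, whereas `select()` returns a data frame, so the two are not drop-in replacements in every case.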

That said, piping with %>% and modifying directly with %<>%(via magrittr) is a
pleasure that other languages I’ve worked with don’t manage as well.

And at the end of the day, I’m not going to rewrite implementations of all the
latest statistical methods already written in R, and this is its strong suit.
I’m increasingly using sophisticated spatial and spatiotemporal methods, and
these methods are solely implemented in R.

I understand that R gets a lot of flak from software developers and I
understand why. But, I also think it’s too often overlooked for its strong
suits.

~~~
jrumbut
Maybe stringsAsFactors was a mistake in the original design, but there is so
much code out there reliant on this behavior now, and since it was the default
you don't really know where the new behavior will bite you, short of looking
for calls to data.frame that don't set the parameter.

Plus, it's not such a bad feature when you know it's coming.

As far as the tidyverse goes, I get it now. However, it seems to discourage the
creation of a nice, well organized set of functions to limit the amount you
need to keep in your head at the same time, and a lot of R users are very
smart people capable of understanding very disorganized code. Instead of
functions you get copy/pasted incantations; in base R it's at least broken
down into steps, which is a start.

------
AuthorizedCust
I teach a data science graduate course to non-tech majors, and it's mostly on
R. I welcome these changes.

I teach base R first for a few weeks, then I teach the TidyVerse as I
introduce data science concepts, like text mining. I convey the TidyVerse as
like an overlay on R, improving shortcomings and adding great functionality,
syntactic sugar, etc.

This context switch is jarring to some students. It's great to see some of the
TidyVerse's strengths--part of the tibble--be moved back into base R.

Now pardon me: I need to go. I start teaching dplyr in 51 minutes!

~~~
phillc73
I must admit to not really understanding the TidyVerse attraction, or why I
should be doing "tidy data evaluation", rather than using base R.

If I want to use an advanced data manipulation library, I'd typically reach
for data.table. If I want to use a verb based approach, why not SQL rather
than dplyr.

I have tried dplyr and code I've written a few years ago still imports that
package (hopefully it still works), but I just didn't find it was particularly
useful or helpful compared to the alternatives.

~~~
bart_spoon
Tidyverse is more than dplyr. It also includes libraries like ggplot2, for
which there really is no peer. Also stringr, for string manipulation, tidyr,
which has a wide variety of very useful utility functions for working with
data. dplyr is simply the backbone that connects them all and allows you to
move between them without switching thought processes.

dplyr also offers functionality that many don't realize beyond its basic data
manipulation. tibbles allow for arbitrary data types for columns, meaning you
can have data frames containing your typical strings or floats, but you can
also have data frames with columns consisting of nested data frames, or fitted
models, or other atypical objects. Once you start working in this way, it can
really streamline a lot of complex analytical processes, make your code much
cleaner and easier to work with, and allows you to integrate them directly
into other tidyverse packages, like ggplot2.

> If I want to use an advanced data manipulation library, I'd typically reach
> for data.table

Data.table is powerful, but from a usability/readability perspective, most
people find it inferior to dplyr. And there are now packages for using the
dplyr API and having it run data.table on the backend, so I personally see
little to no use for using data.table by itself anymore.

> If I want to use a verb based approach, why not SQL rather than dplyr.

This seems like an overly simplistic reduction. There are bigger differences
between dplyr and SQL than both being a verb based approach. And even were
there not, dplyr is directly integrated within R. It is still much easier to
do your data manipulation directly in R and hand it back and forth between
dplyr and whatever other libraries you are using, than it is to do the same in
SQL (even utilizing SQL queries within R).

The beauty of the tidyverse is that it is unifying the most important aspects
of the data science process under a single approach and API philosophy. It
integrates in a way that is pretty unprecedented not just in R, but in any
programming language (where data analysis tasks are concerned).

~~~
phillc73
I don't want to say that the Tidyverse is inferior, because it's not and it
works for a lot of people. Hadley will also know much more about R and always
be a better programmer than me.

However, there are other tools which predate Tidyverse, which I think do a
stellar job at their core competency.

ggplot2 is great, but it was also around long before the Tidyverse concept.
But at the same time base R plot() can also be pretty powerful[1] and look
great.[2] As an alternative to ggplot2 I would also propose that Vega Lite[3]
could be a contender, with an excellent cross language ecosystem.

There are also libraries available for applying SQL on dataframes from
directly within R if that's what you want to do. sqldf[4] has been around for
a long time, and now there is also the new duckdf[5], which is a bit quicker.
Or one can use the DBI[6] library, which requires a bit more coding. Learning
SQL is also a great skill which has a lot of value outside R.

Tidyverse may be useful, helpful and convenient for a lot of people, but I
think we shouldn't lose sight of the wide R ecosystem which has provided a lot
of alternative packages for a long time, perhaps without the marketing and
profile of RStudio and the Tidyverse.

[1]
[http://karolis.koncevicius.lt/posts/r_base_plotting_without_...](http://karolis.koncevicius.lt/posts/r_base_plotting_without_wrappers/)

[2] [https://github.com/KKPMW/basetheme](https://github.com/KKPMW/basetheme)

[3]
[https://vegawidget.github.io/vlbuildr/](https://vegawidget.github.io/vlbuildr/)

[4]
[https://cran.r-project.org/web/packages/sqldf/index.html](https://cran.r-project.org/web/packages/sqldf/index.html)

[5] [https://github.com/phillc73/duckdf](https://github.com/phillc73/duckdf)

[6]
[https://cran.r-project.org/web/packages/DBI/index.html](https://cran.r-project.org/web/packages/DBI/index.html)

------
teruakohatu
Not a whole lot is new. Highlights (from my perspective) :

* matrices are now a subclass of arrays.

* better reference counting to save on memory.

* stringsAsFactors = FALSE is now default breaking a lot of packages (was this really worth changing?)

* a new way of defining raw strings: r"(...)"

* better default colors for plots. The new default color scheme is less saturated. Apart from looking better, in the old scheme something plotted in bright yellow on a white background was hard to read. Yellow is now more of a mustard.
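
Two of the items above are easy to try out. This is a small sketch that needs R 4.0.0 or later to parse; the path and variable names are made up for illustration:

```r
# Requires R >= 4.0.0.

# Raw strings: backslashes inside r"(...)" are taken literally.
win_path <- r"(C:\Users\me\data.csv)"
identical(win_path, "C:\\Users\\me\\data.csv")  # TRUE

# Matrix objects now also carry the class "array".
m <- matrix(1:4, nrow = 2)
class(m)              # "matrix" "array"
inherits(m, "array")  # TRUE
```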

~~~
_Wintermute
> stringsAsFactors = FALSE is now default breaking a lot of packages

I know this is popular on the surface, but it's going to break so many
"reproducible" analyses from manuscripts and scientific studies.

~~~
kachnuv_ocasek
Reproducibles should be tied to specific versions of software or even OS and
hardware. So no, this is not going to break correctly designed analyses. It's
only going to break those which were not reproducible in the first place.

~~~
_Wintermute
Hence the quotation marks, the state of software engineering in many
scientific fields isn't very good, you'll be lucky if people list which
package versions they used.

~~~
kachnuv_ocasek
Fair enough. I might have misunderstood your remark originally, though the
point stands, of course.

------
roenxi
The breaking changes are all quite severe, I hope there is real demand for the
features they are bringing in.

Even if they don't agree with the direction Wickham's Tidyverse is going it
showcased how flexible the R language was to being rewritten from the inside.
Hadley effectively pulled off a more significant language upgrade with fewer
breakages - the R core team could learn something from him here. Even their
sensible na.rm default for new functions is introducing more weird
inconsistencies to R.

~~~
rankam
Has the R core team publicly stated that they disagree with the direction
Wickham's Tidyverse is going? Genuinely asking as I love the Tidyverse, but
would be interested to hear arguments against it.

~~~
KKPMW
I am not sure why the original commenter said that tidyverse is more
backwards-compatible compared with R core. It used to introduce breaking
changes every 6 months or so.

Also they like it this way and promote it:
[https://twitter.com/hadleywickham/status/1175388442802479104](https://twitter.com/hadleywickham/status/1175388442802479104)

~~~
disgruntledphd2
They still do.

I like the tidyverse, but Hadley's struggles with lazy evaluation and
arguments has cost me lots and lots of time updating internal code at various
workplaces.

Don't get me wrong, the tidyverse is great, but if I was writing R code that I
expected to run without supervision for a long time, I'd avoid it as much as
possible.

~~~
baldfat
There are more than enough ways to keep it going. This has been addressed by
several tools.

Personally I have some tidy code that is 8 years old and it still is working.

~~~
mellavora
Yes, and I have plenty of it which does not.

------
mike_ivanov
Congratulations to the people behind the release. Thank you guys very much.

Also, please don't pay attention to the haters. There are people who hate R,
and those who wholeheartedly despise Python; those who loathe Java and plenty
of those who can't stand Lisp. Though, R for some reason tends to trigger the
most exemplary manifestations of this phenomenon.

------
mikevm
R is the quirkiest hard-to-remember programming language I have ever tried
learning.

------
bitcharmer
Kudos to all contributors. Learning R and using it was a lot of fun and a really
satisfying experience. Great option for someone like me, who has to do some
basic analysis of large data sets without a rich statistical background.

~~~
flashyfaffe2
Did you learn R for your own fun or just to upgrade your skill set and get a
job?

~~~
bitcharmer
In my case it was Q vs R. For what I was trying to achieve and with my level
of statistics background R was an obvious choice. The language/runtime AND
RStudio are entirely free. Give it a try!

~~~
ryndbfsrw
kdb+ is the only tool I've come across that solidly managed to outperform
data.table for data processing. I have huge respect for good Q programmers

~~~
bitcharmer
Likewise. A good kdb dev is a very rare and intelligent beast, hats off. I'm
just too stupid to be able to comprehend much of Kx's ideas and so R was just
much more appealing to me.

------
bsg75
I’m disappointed 64-bit integers are not mentioned.

That is an unusual omission in a current language, and the add on packages are
not always a substitute.

Is anyone here aware if that is planned to be a core addition?
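
For context, a short sketch of where the limits sit in base R today, assuming a standard build with 32-bit integers and IEEE doubles:

```r
.Machine$integer.max  # 2147483647, i.e. 2^31 - 1

# Integer overflow yields NA with a warning instead of wrapping around.
overflowed <- suppressWarnings(.Machine$integer.max + 1L)
is.na(overflowed)     # TRUE

# Doubles hold whole numbers exactly only up to 2^53.
big <- 2^53
big + 1 == big        # TRUE: the increment is lost
```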

------
jerzyt
My daughter is in grad school, learning programming for the first time in her
life. She's learning R and Python at the same time. She's finding R much more
intuitive.

~~~
rossdavidh
Speaking as a programmer, R is deeply unintuitive, but...that's for an
experienced programmer's intuition. I believe it is, if not intuitive, then at
least less unintuitive for someone with a background other than programming
(like, say, statistics).

The typical programmer's boggled response to R, somewhat like their response
to SQL or CSS, is that most of the languages they look at are descended from
C, and anything that is not looks weird. Rather like someone who knows only
Indo-European languages encountering a non-IE language for the first time.

Of course, any real-life language, whether R or CSS or SQL or C or anything
else, has plenty of actual defects to complain about. Like, for example, "="
meaning "change the thing on my left to be the same as what's on my right".
But if it's the same defects that you're used to, you don't see them as much.

~~~
jerzyt
I agree that if you have a background in a common programming language, R is
not intuitive. I know it from personal experience. However, for someone with
no programming background, R may be quite reasonable. Just judging by my
daughter's experience.

~~~
rossdavidh
Agreed.

------
rgovostes
R is the worst programming language I've touched. I usually hear in response,
"it may be quirky, but it has a lot of well-tested functions useful for data
analysis."

This confidence seems misplaced. Experienced developers using better languages
with better tooling create bugs all the time. How likely is it that R
packages, written by specialists (or grad students) whose main focus is often
not programming but some other discipline, in a language full of "quirks," are
going to be really reliable?

One quirk that got me the last time I touched R was that functions and
variables live in different namespaces, so you could assign 4 to foo and then
call it.

Fun fact, a lot of the R source code is machine translated from Fortran to C,
but it has been cleaned up a lot.
[https://github.com/wch/r-source/search?p=1&q=f2c&unscoped_q=...](https://github.com/wch/r-source/search?p=1&q=f2c&unscoped_q=f2c)

~~~
notafraudster
RE: functions/variables -- you don't quite have this right. Let's walk through
when this can occur and when this can't. First, this cannot occur if you're
talking about your local environment:

    
    
      z <- function() { print("hello world") }
      z <- 3
      z # Outputs 3
      z() # Errors because there is no function z()
    

What the actual quirk you're talking about has to do with attaching other
namespaces. So, the way R works is that by default it loads several packages.
For instance, the reason you can call `mean()` is that the "base" package is
attached when you open a blank session in R. It is absolutely possible to have
multiple symbols take the same label across namespaces. Here's an example:

    
    
      mean <- 4 # Creates variable called mean in the local namespace
      mean(c(5, 10, 20)) # Uses the mean function in the attached base namespace
      mean # Uses the mean variable in the local namespace
    

But this is not unique to variables versus functions, it also works with
functions:

    
    
      mean <- function(x) { sum(x) * 100 }
      mean(c(1, 2, 3)) # Outputs 600
      base::mean(c(1, 2, 3)) # Outputs 2
      rm(mean) # Drop local namespace function
      mean(c(1, 2, 3)) # Outputs 2
    

You can use search() to see the order of attached namespaces that will be
searched. In RStudio, you can also press the little dropdown arrow next to
"Global Environment" in the environment pane to see the order of attached
namespaces -- it'll search them in that order. Alternatively, you can be
defensive and always prefix all functions in any namespace with the namespace
name (equivalent to always using module.function style function calls in
python).

Finally, you can use the built-in function "find" to see the order in which R
will try to resolve a symbol, e.g.

    
    
      sum <- function() { print("i like sum coding in R!!!") }
      find("sum") # .GlobalEnv first, base second
    
    

I'm not sure this is a quirk. What are your options as a namespaced language?
1) Allow users to import multiple functions/variables with the same name
across different namespaces and resolve the conflict via some kind of
hierarchical order; 2) Don't allow users to import multiple
functions/variables with the same name, so you can never use overloading to
monkey patch; 3) Always require namespace prefixes at all times; 4) Make
function dispatch a blocking operation that asks REPL users which to use? I
dunno, it doesn't seem to me like R's approach is any less sane.

I guess one thing that is different about R's approach is that "built-ins"
have no special priority, they're all part of some namespaces that are
attached by default but otherwise exactly like third-party libraries (base,
stats, graphics, etc.)

In R, the only reserved words that cannot be overloaded are while, repeat,
next, if, function, for, and break. (Note: else does not appear here because
of a genuinely baffling quirk about how else is implemented in R)

~~~
playing_colours
Can an alias be an option to resolve conflict: “import mypackage.mean as
mymean”?

~~~
hdkrgr
No aliases (afaik), but convention is that you explicitly call mypackage::mean
in cases where names might even hint at being ambiguous.
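
There is no import-as syntax as far as I know, but since functions are ordinary objects, a plain assignment gets you something similar. A rough sketch, where `base_mean` is just a hypothetical alias name:

```r
# Explicit namespace prefix: unambiguous regardless of what is masked.
stats::median(c(1, 5, 9))     # 5

# A plain assignment acts as an alias for the namespaced function.
base_mean <- base::mean
mean <- function(x) "masked"  # local definition masks base::mean

mean(c(1, 2, 3))              # "masked"
base_mean(c(1, 2, 3))         # 2
rm(mean)                      # remove the mask again
```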

------
SteveSar
First off, R may be annoying to learn (for instance, just concatenating two
strings takes much more typing than Python, say) but once you get to data
analysis it’s a dream. (And certainly more Pythonic than Python’s
alternatives, which result in code cluttered with lots of periods and
references to packages and sub-packages everywhere). Also, the packages work
really well with each other. With Python it seemed like every time I’d upgrade
packages with pip, I’d end up with a new conflict I’d have to sort out. And,
frankly, having to rely on Anaconda to keep everything from breaking is not
exactly ideal. So props to CRAN!

Gotta say, though, one of the most annoying things about R is the name.
Entering “R” as a search term on job boards tends to lead to a lot of not-
very-helpful results.

Also, it’d help if the version release names had more order to them. So you’d
read some ridiculous phrase like “parachute trombone” and know that it must’ve
been released after “mouse parade”.

------
enricozb
Is it still the case that when you have an error in your script (and aren't
using RStudio) that you don't get the line number of the error, you just get a
traceback?

------
einpoklum
What impact does this have on people _learning_ R?

It's a new major version - does it require significant updates for training
materials? Are there a lot of outdated idioms now?

------
stilisstuk
I love R. But working with 50 million rows in RAM is a pain. Any way to do it
differently? I use data.table BTW, so stringsAsFactors is always FALSE

~~~
scottlocklin
Buy more ram?

If you're regularly running into memory performance issues despite using
data.table, and you don't feel like fiddling with mmap, consider looking into
an APL variant.

~~~
stilisstuk
Will look into mmap. Thx

------
rwmj
Please start release announcements by saying briefly what it is! _" R is a
programming language for statistical analysis"_ or whatever. Don't assume that
people are familiar with every possible piece of software already.

Also pull important stuff to the top of announcement. Were there security
issues that need you to upgrade? What are the major new features that would
encourage you to upgrade?

For example here's a release announcement I wrote recently:
[https://www.redhat.com/archives/libguestfs/2020-February/msg...](https://www.redhat.com/archives/libguestfs/2020-February/msg00287.html)

Free software doesn't usually have an advertising budget, so you have to educate
people on what your software does at every opportunity you get.

~~~
sigwinch28
This was posted to the r-announce mailing list.

I think it's reasonable to assume that this message was intended for
subscribers of that mailing list, and that the people who chose to subscribe
already know what R is.

~~~
rwmj
And yet the message appears here, on a general tech news site, and no doubt
many other places. Assuming only R-announce readers will see it is plainly
wrong.

------
CDLcdi
Looks good

------
DmitryOlshansky
I hope there was an R 2d2 release at some point...

~~~
bachmeier
There was, just not from the R team. My project for calling from R into D was
originally named rtod2 (playing off D version 2).

