
R is a joy if you treat it like Awk - dwrodri
http://dwrodri.blog/posts/oct-7-Rscript-snippets.html
======
iron0013
I use R a lot, and nothing in this post rings even the slightest bit true to
me. This guy's use case for R seems to be very unlike almost anyone I've ever
heard of. Towards the end of the article, when he's talking about how all he
really needs from R is the summary() and boxplot() functions, it really
becomes clear that he's never done anything more than dip the very tip of his
pinky toe into R.

There are lots of valid criticisms of R, but this article doesn't touch on
them. It's so off-base that it's the proverbial "not even wrong".

~~~
scottlocklin
It's definitely a n00b criticism; R is absolutely the shizznits for cleaning
and transforming data. You have to use data.table for larger data sets, but
(God Bless Wes) it can't really be beat for this; it does an amazeballs job
and has every doodad you can imagine for making your life easier.

Cleaning a broken csv or whatever: no, it is crap for that. You use awk/sed/tr
and all that for such problems.

If you're the type who doesn't want to deal with R, I guess you can use it
from the CLI. A couple of the R deploys I've done work like this.

The real problems with R are .... oh man .... so many. The R Inferno covers a
lot of them as a language/environment. Weak database connectivity is another
one. The thing which makes me batshit is the nodejsbro-ification of the
package management system, a.k.a. people chaining things together like Node
works; R's package manager isn't designed for this. But also the way code,
packaged and otherwise, simply rots between the many, many upgrades.

You could probably run and deploy scikit-learn/pandas-based code from 5 years
ago without much problem. In R, you have to make a build with the salted
package dependencies ... and for all I know, stuff it in Docker.

Anyway, unlike Python, R basically has every data transformation and
statistical tool under the sun. I guess this is the price we pay.

~~~
thom
Is R really that good for data cleaning and transformation? It's slow,
single-threaded (yes, even for a lot of real-world use cases with data.table),
and memory-hungry. People only ignore this because code is generally written
from the top of a notebook down to the bottom without ever being re-run.

~~~
maltelau
A popular counterpoint in the R community is that in many data cleaning tasks,
the bottleneck is human understanding / coding time, not computation time. In
other words, we'd rather spend 1 hour writing up a script that runs in 10
minutes and needs to be run a handful of times at most, than spend 6 hours
writing something that takes 10 seconds.

Edit: This of course goes hand-in-hand with the claim that it is easier/faster
to write R scripts. If you're not familiar with it, the tidyr and dplyr
packages in particular (part of the tidyverse) are fantastic in the verbs they
provide for thinking about data cleaning.

------
halfeatenpie
I use R on a daily level. A significant portion of our climate change
adaptation research code and decision planning systems (which are used in
various utilities around the world to support decision makers) are built using
R.

I understand what the author is stating, but I just feel like this comes from
inexperience with R and ignores the vast number of packages available within
it that are specifically targeted at data science. There are some valid issues
and criticisms of R, but I think this article only focuses on the application
of R in a single context. A significant portion of data science is about
cleaning the data, and an entire suite of packages known as the tidyverse
solves these problems (for me anyways) while also being very simple and easy
to understand. I mean, the tidyverse supports piping, which is exactly what
this article is saying to use.

Obviously your mileage may vary, but this post rubs me the wrong way.

~~~
throwaway66920
The tidyverse is excellent, and has its own stylistic choices that are
arguably quite good. A lot of other random packages also have their own
stylistic choices, and those are not good. So I suppose I would say that R
makes it really easy to write data pipelines well, but also really easy to
write them poorly, and doesn’t make it very obvious what the better choices
are.

~~~
scottlocklin
If only Hadley hadn't picked snake-case, which doesn't work so well out of the
box with ESS. Yah, I know there is a way of fixing this with a setq in ESS,
but I shouldn't have to.

~~~
thom
Always weird to meet Emacs people who don't like having to customise Emacs.
That said, ESS has some weird defaults, not sure I'd lay them at R's door tbh.

~~~
scottlocklin
ESS defaults predated Hadley's contributions by a decade, making his seemingly
unique adoption of snake case _extremely_ unsociable. I actually use _ to <-
a lot. It's enough of a pain in the ass, I basically stopped using his
packages. This was annoying, but ultimately the Hadleyverse became a bridge
too far for the kind of bread and butter data science I do, so it turned out
to be helpful to me personally.

------
goodside
For historical context, R was not initially intended to be a standalone
scripting language runnable via POSIX conventions and didn’t gain these
features until circa 2010. R is designed to be used interactively, like the S
language it was based on, a style later made familiar to students via MATLAB
and TI graphing calculators. Before Rscript it was common to hack together
shell scripts that cope with the expectation of human TTY input to run R in
production, which functioned well as a “you must be this tall to ride” sign
for people who might not appreciate how unreliable R code can be compared to a
traditional language.

------
dredmorbius
The precursor to R, S, was designed as a Unix tool, and pretty much implicitly
relied on awk as its data-cleaning preprocessor.

For those who come from the world of large enterprise statistical and data
reporting tools such as SAS, awk bears an exceedingly strong resemblance to
the SAS DATA step, whilst R effectively provides a host of analysis and
graphics tools that correspond to numerous other SAS procedures and products.

The hacks for pipelining R are cool and useful. Thanks.

------
bllguo
> In my experience, just about anything beats R when it comes to cleaning
> dirty data.

> I used to resent R, it was shoved upon me as a strange tool that promised to
> replace Python, but failed miserably.

Well, I'm glad the author found a way to make R work for their purposes, but
this just reeks of inexperience with R.

~~~
dwrodri
I am definitely inexperienced with R. The intention of this post was to
highlight a key point of friction I encountered throughout my introduction to
the language. Namely, I was introduced to R (by someone more senior than
myself) as "data science in a runtime/IDE", but there are quite finite
boundaries where the conveniences of R stop and other tools begin.

What motivated me to write this post was the lack of discussion of R in this
use case. I actually stumbled into using "Rscript -e" one liners while looking
to do basic stats in a Linux CLI.

That being said, I still stand by my point. Taking data from "out in the wild"
(log files, tarballs of images, unstructured text) and making use of it can be
frustrating in R because cleaning up edge cases, removing unwanted data, and
getting everything into the correct container/type often involves unintuitive
chaining of function calls. This is coming from the perspective of someone who
worked with Python/awk/sed prior to being exposed to R.

If you have good counterarguments, I'd be more than happy to hear them and
address them in an end note in the post.

~~~
mushufasa
Have you ever used the tidyverse ecosystem? It's a set of syntactically
compatible packages authored chiefly by one man, which have evolved into a
superset of R, if not an outright R2. (R's Scheme lineage makes it very
hackable in this way.)

I've been working in Python for my current project and I'm constantly longing
for R's syntax, specifically for cleaning data. It's so much better.

Just last night, I wanted to perform an anti-join to find discrepancies
between two data sets for debugging.

One line in R:

    
    
       anti_join(a_tibble, another_tibble, by = c("id_col1", "id_col2"))
    
    

Witchcraft in Pandas: (from stack overflow):

    
    
       # Method 1
       # Identify what values are in TableB and not in TableA
       key_diff = set(TableB.Key).difference(TableA.Key)
       where_diff = TableB.Key.isin(key_diff)
    
       # Slice TableB accordingly and append to TableA
       TableA.append(TableB[where_diff], ignore_index=True)
    
       # Method 2
       rows = []
       for i, row in TableB.iterrows():
           if row.Key not in TableA.Key.values:
               rows.append(row)
    
       pd.concat([TableA.T] + rows, axis=1).T
    

I actually fired up R and re-imported the .csv data just for this. Took 15
secs while my colleague was still stuck debugging his own weird for loops.
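For what it's worth, more recent pandas has a far less convoluted idiom than the Stack Overflow answers quoted above: a left merge with `indicator=True` gives an anti-join in two lines. A sketch with made-up toy tables (the `id_col1`/`id_col2` names just mirror the R call above and are purely illustrative):

```python
import pandas as pd

# Toy tables; column names mirror the R example above.
a = pd.DataFrame({"id_col1": [1, 1, 2], "id_col2": ["x", "y", "x"], "v": [10, 20, 30]})
b = pd.DataFrame({"id_col1": [1, 2], "id_col2": ["y", "x"]})

# Anti-join: keep rows of `a` that have no key match in `b`.
merged = a.merge(b, on=["id_col1", "id_col2"], how="left", indicator=True)
anti = merged[merged["_merge"] == "left_only"].drop(columns="_merge")
```

Still not one line like anti_join(), but a long way from iterrows().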

~~~
dmix
Nothing beats building stuff in software that just works with a simple small
interface while being powerful!

May I ask what you use R for? I like learning languages for fun and I've been
meaning to do some NBA analytics stuff. I'd love to have a REPL style
interface to just do one-off math and analytics or short scripts. I haven't
dug into the data science stuff yet but I'm disinterested in Python for some
reason (maybe because I used to write Ruby for a living).

~~~
mushufasa
I started using R because I needed a better tool for formal statistical
analysis (econometrics; I didn't want to pay for Stata, and R has much better
packages for variants of linear regression + panel/time-series data than
Python). Since then, I've used it for some random scripts, data
visualizations, and financial analysis (Josh Ulrich's packages + tidyquant).

R is a thoroughbred at doing data analysis from your laptop. It's bad at
living on a server and operating any sort of app.

For your use case (NBA analytics) I think it is a great tool and would
recommend R. In fact, here are a couple packages/tutorials specifically for
doing that: [http://asbcllc.com/nbastatR/](http://asbcllc.com/nbastatR/)
[https://blog.revolutionanalytics.com/2016/09/analyzing-
nba-b...](https://blog.revolutionanalytics.com/2016/09/analyzing-nba-
basketball-data-with-r.html)

------
tylermw
"R is only great if your data is already great."

Disagree. R has an incredible ecosystem for parsing, cleaning, and
manipulating data. Even ignoring the tidyverse, base R provides more than
enough functionality to clean and analyze data--you just need to spend some
time learning it. If you use the tidyverse, it's even easier. The only other
ecosystem that comes close to R is Julia, which was designed taking many of
the best parts of R into consideration.

~~~
wodenokoto
How would you go about parsing logs in R?

The problem with logs is that lines have a different number of columns
depending on what each line is logging.

What a given line is logging can usually be determined by a regex (e.g., it
ends with an IP address, starts with certain words, etc.).

I'm genuinely curious, as this is a use-case where I have a lot of difficulty
using python or R. I can see how grep + awk are strong as a preprocessor, as
they can scan through a larger than memory file, and select columns from rows
that match your criteria.
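One way to do this in a stream, whatever the language: classify each line with a regex as you read it, and only split the lines you keep, so the whole file never has to sit in memory. A Python sketch (the log lines, patterns, and bucket names here are all made up for illustration):

```python
import re

# Fake log: request lines end with an IPv4 address, error lines start with "ERROR".
lines = [
    "GET /index.html 200 192.168.0.1",
    "ERROR disk full on /dev/sda1",
    "GET /favicon.ico 404 10.0.0.7",
]

ipv4_at_end = re.compile(r"\d{1,3}(?:\.\d{1,3}){3}$")
requests, errors, other = [], [], []

# A real script would iterate over open("app.log") one line at a time.
for line in lines:
    if ipv4_at_end.search(line):
        requests.append(line.split())  # request lines get column-split
    elif line.startswith("ERROR"):
        errors.append(line)
    else:
        other.append(line)
```

grep + awk still win on raw speed, but the dispatch-by-regex shape is the same.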

~~~
tylermw
I was just doing this yesterday--the "base R" way would be to simply use
readLines() and grep through them the same way you would on the command line.
You're right, though: grep + awk are super useful, and parsing
larger-than-memory logs is where they really shine and where the shortcomings
of R become apparent.

However, if your data fits in memory and you have a non-trivial analysis to
perform, R is a great choice. I had a file with tabular data interspersed with
metadata at random points and it was straightforward to parse and store the
data in a custom data structure.

------
rcthompson
R has lots of packages available for data cleaning. They just aren't
necessarily included in base R. Most of the ones I'm aware of are in the
tidyverse. Even if you're dealing with a hand-written Excel file with multiple
tables in the same sheet and various information encoded in text/background
colors and such, there's the tidyxl package to help you deal with that.

------
wodenokoto
I haven't worked with logs, but I do find R a joy to work with in general;
the tidyverse especially is a joy to use, though slow and memory-hungry on
very large datasets.

I haven't found a good way to handle very inconsistently formatted csv files
in any language (a row only represents columns 3, 4 and 6 if it starts with a
comma; all other rows have all columns but are space-separated, and values
may contain commas, etc., etc.)
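For the specific format described above, the best any language seems to offer is a hand-rolled line dispatcher. A Python sketch against a made-up sample (it assumes the comma-prefixed rows never contain embedded commas themselves, which is exactly the kind of assumption that makes these files painful):

```python
# Made-up sample matching the description: rows starting with a comma carry only
# columns 3, 4 and 6; all other rows are space-separated with all six columns.
raw = """a b c d e f
,cc,dd,ff
g h i,j k l m"""

normalized = []
for line in raw.splitlines():
    if line.startswith(","):
        # Sparse row: only columns 3, 4 and 6 are present, comma-separated.
        c3, c4, c6 = line.lstrip(",").split(",")
        normalized.append(["", "", c3, c4, "", c6])
    else:
        # Space-separated row; values may legitimately contain commas.
        normalized.append(line.split(" "))
```

Every row comes out with six columns, ready for a real csv writer or data frame.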

------
oarabbus_
R isn't a joy, but it is powerful. This is the best blog post I've read on the
matter: [https://www.talyarkoni.org/blog/2012/06/08/r-the-master-
trol...](https://www.talyarkoni.org/blog/2012/06/08/r-the-master-troll-of-
statistical-languages/comment-page-1/)

------
CreRecombinase
If you want to treat R like awk, you should really check out littler, a super
useful package which provides an alternative to both Rscript and R CMD BATCH,
designed for writing one-liners:
[https://github.com/eddelbuettel/littler](https://github.com/eddelbuettel/littler)

------
larrydag
To be honest, I don't find Python to be very useful for data cleansing. Full
disclosure: I use R quite a bit and Python sparingly.

If you want to use R for data analysis, I suggest the tidyverse packages or a
product like OpenRefine.

If you just need to run summary statistics with an application then using
Python is just fine.

~~~
objektif
What part of it don't you like? I use Python for data manipulation on a daily
basis and I find it very useful.

~~~
psv1
The pandas syntax is far less intuitive than R dataframes, especially if
you're also using dplyr.

~~~
larrydag
Agree. Pandas seems to be a mixture of Python indexing syntax and R
dataframes.

------
chubot
I get what the author is saying -- I use R in shell scripts too. It's really
useful and composes well with other shell tools.

I also get what the commenters are saying, because R is a useful interactive
language too, and it's also pretty good for data cleaning. Although it's
significantly slower than Python, which is why I do all cleaning that cuts
down the data before loading it into R.

As an example, I generate some benchmarks with every release of Oil:

[https://www.oilshell.org/release/0.7.pre5/benchmarks.wwz/osh...](https://www.oilshell.org/release/0.7.pre5/benchmarks.wwz/osh-
parser/)

and the tables are manipulated with R, but running the benchmarks is done with
shell, and creating the HTML is done with Python:

[https://github.com/oilshell/oil/blob/master/benchmarks/repor...](https://github.com/oilshell/oil/blob/master/benchmarks/report.R)

(and yes this page is meant to be motivation to speed up my principled but
slow shell parser)

R and tidyverse are the best tools for manipulating tables by far. I wrote an
intro here:

 _What Is a Data Frame? (In Python, R, and SQL)_
[http://www.oilshell.org/blog/2018/11/30.html](http://www.oilshell.org/blog/2018/11/30.html)

------
just_myles
I used to use awk all the time whenever I wanted to evaluate specific portions
of a columnar file (with a short one-liner). Python comes close, but awk is
much faster. In conjunction with sed, I think that is a little much and imo
exceeds what I was doing. If I need to do any replacing or 'cleaning', I would
then use something like Python or SQL.

------
mikorym
So, this is not about the article: I've recently started using rpy2 as a
Python wrapper for R functions and libraries, and I'm finding that it is not
that bad.

There are some performance issues, but I am OK trading that off for the
convenience of using "try:" rather than R's "tryCatch". Having tryCatch as a
function rather than built into the syntax is unacceptable to me. But there
are some libraries in R that don't have elegant alternatives in Python, or
for which you have multiple options, or perhaps not the time to unlearn.

------
arminiusreturns
Here is how I like to use R, which accommodates both the author's more
introductory methods and full-fledged data crunching:

Like all my other data sources: as code blocks inside an emacs org-mode
notebook. If you are doing data science, you quickly find that it's the
management and combination of the various projects that becomes the most
daunting part (imho), and your data science notebook becomes the most
important piece of that organization. In that arena, for me it's pretty much
either Jupyter or Emacs org-mode.

------
crispyambulance
The regex stuff in stringr, and read_csv from readr are adequate for almost
anything if you're talking about ordinary delimited file import.

What does awk give you that these don't?

------
isostatic
Something to bypass while you reach for your Perl interpreter?

------
dima55
R is really overkill here. Simpler and nicer tools exist if you want to munge
data on the command line and/or plot it. For instance:

[https://github.com/dkogan/feedgnuplot/](https://github.com/dkogan/feedgnuplot/)
[https://github.com/dkogan/vnlog/](https://github.com/dkogan/vnlog/)

~~~
dwrodri
I love CLI tools like these. Great find!

------
xiaodai
"R is a joy if you don't use it."

Joke. I am a heavy R user.

------
safgasCVS
I’ve recently discovered R's pipe() function, which lets R consume the output
of a terminal command (link:
[https://youtu.be/RYhwZW6ofbI](https://youtu.be/RYhwZW6ofbI)). Quite useful
for reading in compressed files, and it’s made me learn a bit of awk to fix
text files on the fly.

Oh and there’s also Rio if you want to explore injecting R into your command
line workflow.

------
alexhutcheson
If you're interested in learning how to do data cleaning and restructuring in
R, I highly recommend the "Wrangle" chapters of Hadley Wickham's R for Data
Science book, which you can read online here: [https://r4ds.had.co.nz/wrangle-
intro.html](https://r4ds.had.co.nz/wrangle-intro.html)

------
jwilber
I don’t want to post something inflammatory or condescending, but it’s very
clear that the author does not have experience using R.

------
yiyus

        awk '/Recovery time:/{print $2}' output.log | boxplot

------
Myrmornis
dwrodri: Not sure if you're a macOS user, but iTerm2 distributes a script to
display images inline in the terminal. I've forgotten how to get R to output
PNG data directly to stdout, but it could be nice to do that and display the
images inline in the terminal.

[https://www.iterm2.com/documentation-
images.html](https://www.iterm2.com/documentation-images.html)

~~~
dwrodri
I do in fact use macOS, but I'm partial to kitty[1]. Sadly, OpenGL is getting
deprecated in macOS Catalina, so I will probably crawl back to iTerm2
eventually.

I'm quite familiar with asciinema[2] as well, which I've considered using for
creating animated examples. In general, I like to err on the side of caution,
and optimize my website for people with poor/metered connections.

Also, I didn't expect this to get this much attention! Mea culpa for not
putting more work into the post itself.

1 =
[https://sw.kovidgoyal.net/kitty/index.html](https://sw.kovidgoyal.net/kitty/index.html)
2 = [https://asciinema.org/](https://asciinema.org/)

~~~
Myrmornis
We might be miscommunicating! Or if not, sorry if I didn't understand what you
said in reply.

What I was trying to say is: how about extending your blog post with one line
to show people how to make the boxplot appear in their terminal as soon as
they run the command? (I think at the moment it's going to appear as a PDF
file named Rplots.pdf in the current directory, right?)

~~~
dwrodri
This is correct. A good cross platform tool for rendering images in many
modern terminal emulators is imgcat[1]. iTerm2 appears to use this under the
hood.

1 =
[https://github.com/eddieantonio/imgcat](https://github.com/eddieantonio/imgcat)

------
joker3
Rscript can also be used to run R scripts from the command line. Just put
`#!/usr/bin/env Rscript` on the first line.

------
dwrodri
OP Here: I’m aware of some CSS issues on mobile and will work to fix them
shortly.

UPDATE: fixed color palette issue, although it still looks quite bad.

~~~
thebelal
FWIW, `scan()` is going to be much faster than `as.numeric(readLines())` if
you are just reading doubles from standard input.

------
dfgdghdf
Lots of people don't realize this, but R is very similar to JavaScript. Shiny
apps offer a nice functional web framework too.

~~~
collyw
That gives me even less incentive to try it out.

------
gpvos
_> Half-assing something in Unix can produce great results as long as you
choose the right half of the ass._

That is a great quote!

------
enriquto
for that simple use case gnuplot is even better than R

------
hkt
Awk is one of the few tools that possesses the ability to make me beg for
death.

