I use R a lot, and nothing in this post rings even the slightest bit true to me. This guy's use case for R seems to be very unlike almost anyone I've ever heard of. Towards the end of the article, when he's talking about how all he really needs from R is the summary() and boxplot() functions, it really becomes clear that he's never done anything more than dip the very tip of his pinky toe into R.
There are lots of valid criticisms of R, but this article doesn't touch on them. It's so off-base that it's the proverbial "not even wrong".
It's definitely a n00b criticism; R is absolutely the shizznits for cleaning and transforming data. You have to use data.table for larger data sets, but (God bless Wes) it can't really be beat for this; it does an amazeballs job and has every doodad you can imagine for making your life easier.
Cleaning a broken csv or whatever: no, it is crap for that. You use awk/sed/tr and all that for such problems.
If you're the type who doesn't want to deal with R, I guess you can use it from the CLI. A couple of the R deploys I've done work like this.
The real problems with R are .... oh man .... so many. The R Inferno covers a lot of them as a language/environment. Weak database connectivity is another one. The thing which makes me batshit is the nodejsbro-ification of the package management system, aka people chaining together dependencies the way node works; R's package manager isn't designed for this. But also the way code, packaged and otherwise, simply rots between the many, many upgrades.
You could probably run and deploy scikit learn/pandas based code from 5 years ago without much problem. In R, you have to make a build with the salted package dependencies ... and for all I know stuff it in docker.
Anyway unlike python, it basically has every data transformation and statistical tool under the sun. I guess this is the price we pay.
In the little bits of R programming I've done, the biggest pain for me was debugging. If something failed, I had no idea what line of code the error happened on. Since some of the functions were rather long, step-wise debugging was laborious.
Yeah, I had this problem until I discovered options(error=recover) which drops you into a nice lispy stack. And ya, someone told me about it: I don't know where you'd read about such things. There's probably a dozen alternatives I also don't know where to read about.
Besides the aforementioned browser(), try() and tryCatch(), you should check out the packages assertive and assertr to guarantee the expected inputs, and testthat to create tests for your scripts. All awesome.
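For anyone else wondering where you'd even read about these, here's a minimal sketch of what the error-handling pieces look like in practice (the handler bodies are just placeholders):

    options(error = recover)   # on any uncaught error, drop into a frame browser

    # tryCatch() lets you intercept specific conditions yourself
    result <- tryCatch(
      log(-1),                 # raises a "NaNs produced" warning
      warning = function(w) {
        message("caught a warning: ", conditionMessage(w))
        NA
      },
      error = function(e) {
        message("caught an error: ", conditionMessage(e))
        NA
      }
    )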
Is R really that good for data cleaning and transformation? It's slow, single threaded (yes, even for a lot of real-world use-cases with data.table) and memory-hungry. People only ignore this because code is generally written from the top of a notebook down to the bottom without ever being re-run.
A popular counterpoint in the R community is that in many data cleaning tasks, the bottleneck is human understanding / coding time, not computation time. In other words, we'd rather spend 1 hour writing up a script that runs in 10 minutes and needs to be run a handful of times at most, than spend 6 hours writing something that takes 10 seconds.
Edit: This of course goes hand-in-hand with the claim that it is easier/faster to write R scripts. If you're not familiar with it, the tidyr and dplyr packages in particular (part of the tidyverse) are fantastic in the verbs they provide for thinking about data cleaning.
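If it helps to see what those verbs look like, here's a rough sketch with made-up names (raw_df and the column names are invented for illustration):

    library(dplyr)
    library(tidyr)

    cleaned <- raw_df %>%
      filter(!is.na(id)) %>%                                        # drop rows missing the key
      separate(timestamp, into = c("date", "time"), sep = " ") %>%  # tidyr: split one messy column
      group_by(id, date) %>%
      summarise(mean_value = mean(value, na.rm = TRUE))             # one row per id/date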
I have had this issue as well. Although to be fair, I would say this isn't the fault of R. Educators in the field of data science love notebooks because they can pair documentation, visualizations, and code all in one document. However, heavy reliance on notebooks produces a class of programmers who have very little clue how the code they are writing actually runs.
R has great built-in parallel tools (check, for example, the doSNOW and future frameworks);
the best packages for data manipulation are mostly written in C (for example data.table and a good part of the tidyverse);
and with frameworks like drake you can easily create a DAG out of your pipeline that can process complex iterations millions of times. Check out the Rcpp package, which makes interfacing C++ code with R a breeze.
But of course, if you're comparing R to a pure compiled language, you're out of luck.
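To give a flavour of the parallel side, here's a minimal sketch using the future framework mentioned above; big_df and slow_clean() are placeholders for whatever data and per-chunk cleaning step you have:

    library(future)
    library(future.apply)

    plan(multisession, workers = 4)          # four background R sessions

    chunks  <- split(big_df, big_df$group)   # one chunk per group
    results <- future_lapply(chunks, slow_clean)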
data.table is amazing. After getting over the learning curve I pretty much never touched dplyr again. pandas... lol. Hopefully https://github.com/h2oai/datatable lives up to its R counterpart.
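For anyone who hasn't seen it, the learning curve is mostly the DT[i, j, by] idiom; a rough sketch with invented column names:

    library(data.table)

    dt <- as.data.table(raw_df)              # raw_df, id, value, month are made up
    dt[!is.na(id),                           # i: filter rows
       .(mean_value = mean(value)),          # j: compute
       by = .(id, month)]                    # by: group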
managing python dependencies is no fun either, tbf
I am not sure there is a language where the median user's code looks more different from 2012 to now than R's. RStudio and the tidyverse have basically created an all-new "standard" library for the language that has gotten extremely wide adoption, and it tends to smooth over a lot of R's warts if you stay with code that has been written to fit with the style of code that the tidyverse encourages.
This sounds like something that in most other languages, would have become R v2 (non-backwards-compatible)? Why does R retrofit instead of enforce best practices?
The short answer is that "tidyverse" is a set of third-party packages that was developed by a different team than base R. It's not a modification of the language itself. This is possible because of certain features of R that make it particularly syntactically flexible (see, e.g., https://adv-r.hadley.nz/metaprogramming.html).
I know of exactly nobody who was happy with how the Perl 5 to Perl 6 transition went, or the Python 2 to Python 3 transition. I think not ever having a firm cutover like that was beneficial to R.
EDIT: And yes, as citrate05 says, it's just an additional set of libraries. There are no changes to the language itself.
The Python transition wasn't pretty, but I'm happy it was done. Better than if we were still on Python 2 with only a portion of the changes. There would be a lot of discontent if that were the case too.
There is definitely more that R offers than what I discuss here. In retrospect, I will be more restrained in my opinions when I have little experience in my pocket. That being said, it was absolutely my intention to present CLI usage that deviates from R's intended use. There is already plenty out there on R's intended use.
What's that adage that goes something like, "if you want to get an answer on the Internet, don't pose a question..."?
> and would have chosen my words more carefully if I'd known the author was here in the thread
Not a specific dig against you, but I find it useful to write (and say) anything with the assumption that the author I'm (hypothetically) addressing is a direct witness.
I'm relatively new to technical writing and the discussion from all of these comments (yours included) has been really helpful for guiding how I write future posts.
HN is (in my experience) a lot more cynical and straightforward than other places[1]. It's something I've learned to appreciate and also to take with a grain of salt.
This is incredibly gracious feedback taking, a lesson in how not to be defensive.
I agree with you where some here don't. I think there are often better tools for any single data cleaning task.
R's strength is being second best at an enormous range of tasks (and often being first to get new techniques) and packaging that with analysis and visualization.
I use R on a daily basis. A significant portion of our climate change adaptation research code and decision planning systems (which are used in various utilities around the world to support decision makers) are built using R.
I understand what the author is stating, but I just feel like this is from inexperience with R and ignoring the vast amount of packages available within it that are specifically targeted at data science. There are some valid issues and criticisms of R, but I think this article only focuses on the application of R in a single context. A significant portion of data science is about cleaning the data and an entire suite of packages known as tidyverse solves these problems (for me anyways) while also being very simple and easy to understand. I mean tidyverse supports piping, which is exactly what this article is saying to use.
Obviously your mileage may vary, but this post rubs me the wrong way.
The tidyverse is excellent, and has its own stylistic choices that are arguably quite good. A lot of other random packages also have their own stylistic choices, and they are not good. So I suppose I would say that R makes it really easy to write data pipelines well, but also really easy to write them poorly; and it doesn't make it very obvious what the better choices are.
If only Hadley hadn't picked snake-case, which doesn't work so well out of the box with ESS. Yah, I know there is a way of fixing this with a setq in ESS, but I shouldn't have to.
Always weird to meet Emacs people who don't like having to customise Emacs. That said, ESS has some weird defaults, not sure I'd lay them at R's door tbh.
ESS defaults predated Hadley's contributions by a decade, making his seemingly unique adoption of snake case extremely unsociable. I actually use the underscore shortcut for <- a lot. It's enough of a pain in the ass that I basically stopped using his packages. This was annoying, but ultimately the Hadleyverse became a bridge too far for the kind of bread-and-butter data science I do, so it turned out to be helpful to me personally.
I agree. But I'd also argue that it's based on how you're using the tools you're given (or what others have given you).
Pretty much the argument of using apply vs. for loops. Most of the time, apply is going to be significantly more computationally efficient. However, some people think of the problem in a different context than others, and if it works for them, then it works.
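For concreteness, a small sketch of the two styles side by side (and, to be fair, the truly fast option in R is usually the vectorised call):

    x <- runif(1e5)

    # for-loop version, with the result preallocated
    out1 <- numeric(length(x))
    for (i in seq_along(x)) {
      out1[i] <- sqrt(x[i])
    }

    # apply-family version: same result, arguably clearer intent
    out2 <- sapply(x, sqrt)

    # the vectorised version, which beats both
    out3 <- sqrt(x)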
For historical context, R was not initially intended to be a standalone scripting language runnable via POSIX conventions and didn’t gain these features until circa 2010. R is designed to be used interactively, like the S language it was based on, a style later made familiar to students via MATLAB and TI graphing calculators. Before Rscript it was common to hack together shell scripts that cope with the expectation of human TTY input to run R in production, which functioned well as a “you must be this tall to ride” sign for people who might not appreciate how unreliable R code can be compared to a traditional language.
The precursor to R, S, was designed as a Unix tool, and pretty much implicitly relied on awk as its data-cleaning preprocessor.
For those who come from the world of large enterprise statistical and data reporting tools such as SAS, awk bears an exceedingly strong resemblance to the SAS DATA step, whilst R effectively provides a host of analysis and graphics tools that correspond to numerous other SAS procedures and products.
The hacks for pipelining R are cool and useful. Thanks.
I am definitely inexperienced with R. The intention of this post was to highlight a key point of friction I encountered throughout my introduction to the language. Namely, I was introduced to R (by someone more senior than myself) as "data science in a runtime/IDE", but there are quite finite boundaries where the conveniences of R stop and other tools begin.
What motivated me to write this post was the lack of discussion of R in this use case. I actually stumbled into using "Rscript -e" one liners while looking to do basic stats in a Linux CLI.
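For the curious, the kind of one-liner I mean looks roughly like this (data.csv and the column choice are arbitrary):

    # five-number summary of a stream of numbers
    seq 1 100 | Rscript -e 'print(summary(scan("stdin", quiet = TRUE)))'

    # mean and sd of the third column of a CSV, skipping the header
    cut -d, -f3 data.csv | tail -n +2 | \
      Rscript -e 'x <- scan("stdin", quiet = TRUE); cat(mean(x), sd(x), "\n")'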
That being said, I still stand by my point. Taking data from "out in the wild" (log files, tarballs of images, unstructured text) and making use of it can be frustrating in R because cleaning up edge cases, removing unwanted data, and getting everything into the correct container/type often involves unintuitive chaining of function calls. This is coming from the perspective of someone who worked with Python/awk/sed prior to being exposed to R.
If you have good counterarguments, I'd be more than happy to hear them and address them in an end note in the post.
Have you ever used the tidyverse ecosystem? It's a set of syntactically compatible packages authored chiefly by one man, which have evolved into a superset of R, if not an outright R2. (R's Scheme lineage makes it very hackable in this way.)
I've been working in Python for my current project and constantly find myself longing for R's syntax, specifically for cleaning data. It's so much better.
Just last night, I wanted to perform an anti-join to find discrepancies between two data sets for debugging.
One line in R:
anti_join(a_tibble, another_tibble, by = c("id_col1", "id_col2"))
Witchcraft in pandas (from Stack Overflow):
Method 1
# Identify what values are in TableB and not in TableA
key_diff = set(TableB.Key).difference(TableA.Key)
where_diff = TableB.Key.isin(key_diff)
# Slice TableB accordingly and append to TableA
TableA.append(TableB[where_diff], ignore_index=True)
Method 2:
rows = []
for i, row in TableB.iterrows():
    if row.Key not in TableA.Key.values:
        rows.append(row)
pd.concat([TableA.T] + rows, axis=1).T
I actually fired up R and re-imported the .csv data just for this. Took 15 secs while my colleague was still stuck debugging his own weird for loops.
pandas is such a shitshow. Every time I use it, I'm in a world of pain googling the finicky syntax for selecting columns, aggregating, and filtering. I never touched R, but pandas is so terrible for me. Nowadays it's either raw numpy arrays, plain SQL, or pyspark...
Great example! My first thought is that anti-join could be the basis of a "csv_diff $1 $2" shell function.
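Something like this might work as a starting point; csv_diff is a hypothetical name, and it assumes both files share identically named key columns (anti_join defaults to joining on all common column names):

    csv_diff() {
      Rscript -e '
        suppressMessages(library(dplyr))
        args <- commandArgs(trailingOnly = TRUE)
        a <- read.csv(args[1]); b <- read.csv(args[2])
        print(anti_join(a, b))   # rows of the first file with no match in the second
      ' "$1" "$2"
    }

    # usage: csv_diff a.csv b.csv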
I have a hunch that there could be a really good follow-up post to this that takes these R hacks to the next level by extending it to work better with pre-structured text (where R really shines) and CSV files as arguments.
Nothing beats building stuff in software that just works with a simple small interface while being powerful!
May I ask what you use R for? I like learning languages for fun and I've been meaning to do some NBA analytics stuff. I'd love to have a REPL style interface to just do one-off math and analytics or short scripts. I haven't dug into the data science stuff yet but I'm disinterested in Python for some reason (maybe because I used to write Ruby for a living).
I started using R because I needed a better tool for formal statistical analysis. (Econometrics; didn't want to pay for Stata. Much better packages around variants of linear regression + panel/time series data than Python.) Since then, I've used it for some random scripts, data visualizations, and financial analysis (Josh Ulrich's packages + tidyquant).
R is a thoroughbred at doing data analysis from your laptop. It's bad at living on a server and operating any sort of app.
This is completely different functionality than the parent comment's code. It's only joining on one "column", not two, and it only returns the values in that "column", while the parent's code returns complete rows from the dataframe.
I’m coming at awk & sed from the other side having used R daily for a few years.
R is not made for cleaning up weird text files. Yes, of course it can be done, but that's like the joke that everything is within walking distance if you have enough time. I recently had to use R to fix a 50 GB CSV where 10 of the columns were long JSON strings, and I needed to turn that into a data frame. That experience alone made me buy a book on awk and sed.
I should have fleshed out my comment for sure, was in a bit of a rush!
First off, I found it interesting to frame R as being presented as an improvement over Python. It should be the other way around: Python for DS came second, and was supposed to improve over R.
Anyway, I'm not sure I can even begin to discuss your use cases, not having that much relevant experience there. I use R (and Python) for general analytics tasks and for building production models. In these more traditional DS environments I strongly believe R is far superior to Python for data munging and visualization. When I say this I am comparing data.table (and to a lesser extent, the tidyverse) to pandas. I don't even want to get started on everything I hate about pandas. So while we are both "cleaning data", you seem to be talking about a stage before someone like me would even be looking.
The use cases I'm focusing on in the article are definitely less "production-friendly" than what you describe. This article definitely caters more to showing R as a CLI tool than R as an ecosystem.
My language must have been ambiguous in the article. I intended to frame R as arriving second because that was my personal experience. My writing philosophy for blog posts is that my personal opinions and experiences should stand out from the technical detail. My reasoning is that even those who disagree with the personal content will be able to discern for themselves the value of the post.
Disagree. R has an incredible ecosystem for parsing, cleaning, and manipulating data. Even ignoring the tidyverse, base R provides more than enough functionality to clean and analyze data--you just need to spend some time learning it. If you use the tidyverse, it's even easier. The only other ecosystem that comes close to R is Julia, which was designed taking many of the best parts of R into consideration.
The problem with logs is that lines have different numbers of columns depending on what that line is logging.
What a given line is logging can usually be determined by a regex (e.g., ends with an IP address, starts with certain words, etc.).
I'm genuinely curious, as this is a use-case where I have a lot of difficulty using python or R. I can see how grep + awk are strong as a preprocessor, as they can scan through a larger than memory file, and select columns from rows that match your criteria.
I was just doing this yesterday--the "base R" way would be to simply use readLines() and grep through the lines the same way you would on the command line. You're right though: grep + awk are super useful, and parsing larger-than-memory logs is where they really shine and the shortcomings of R become apparent.
However, if your data fits in memory and you have a non-trivial analysis to perform, R is a great choice. I had a file with tabular data interspersed with metadata at random points and it was straightforward to parse and store the data in a custom data structure.
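A minimal sketch of that base-R approach, with the file name and regex invented for illustration (and assuming the matching lines all have the same number of fields):

    lines <- readLines("app.log")

    # keep only the lines that end with an IP address
    hits <- grep("[0-9]{1,3}(\\.[0-9]{1,3}){3}$", lines, value = TRUE)

    # split on whitespace and bind into a data frame
    df <- as.data.frame(do.call(rbind, strsplit(hits, "\\s+")),
                        stringsAsFactors = FALSE)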
Good point. The original draft included this as a value proposition because I generally find that bold opinions are a good starting point for developing valuable insights. Sometimes, that insight comes from those with dissenting opinions. I've since edited the post to emphasize more "R can be a great CLI tool" and less "cleaning data in R is bad".
R and the tidyverse are great at cleaning already nicely formatted tabular data. As someone else has already mentioned it quickly turns pretty ugly if you stray away from this.
Well... Julia. Of course, the tidyverse is big and I guess there isn't a one-to-one ratio of features between the tidyverse and Julia, but mostly it is true.
For example, in Julia you can work with the dataframes in a SQL/LINQ/dplyr fashion like this:
(sorry not sure how to format code properly on HN)
There's also a version of ggplot, although I'm not sure if it's a rewrite or just a wrapper around the R version.
Edit: Actually, this I believe is part of the DataFramesMeta.jl package... So the answer would be rather "most of the things are baked into Julia, and when they aren't, there's a package for that".
R has lots of packages available for data cleaning. They just aren't necessarily included in base R. Most of the ones I'm aware of are in the tidyverse. Even if you're dealing with a hand-written Excel file with multiple tables in the same sheet and various information encoded in text/background colors and such, there's the tidyxl package to help you deal with that.
I haven't worked with logs, but I do find R a joy to work with in general. The tidyverse especially is a joy to use, though slow and memory-hungry on very large datasets.
I haven't found a good way around very inconsistently formatted CSV files in any language (a row only represents columns 3, 4, and 6 if it starts with a comma; all other rows have all columns but are space-separated, and values may contain commas, etc., etc.).
If you want to treat R like awk, you should really check out the littler package, a super useful R package which provides an alternative to both Rscript and R CMD BATCH designed for writing one-liners https://github.com/eddelbuettel/littler
I get what the author is saying -- I use R in shell scripts too. It's really useful and composes well with other shell tools.
I also get what the commenters are saying, because R is a useful interactive language too, and it's also pretty good for data cleaning. Although it's significantly slower than Python, which is why I do all cleaning that cuts down the data before loading it into R.
As an example, I generate some benchmarks with every release of Oil:
I used to use awk all the time whenever I wanted to evaluate specific portions of a columnar file (with a short one-liner). Python comes close, but awk is much faster. In conjunction with sed, I think that is a little much and, IMO, exceeds what I was doing. If I need to do any replacing or "cleaning", I would then use something like Python or SQL.
So, this is not about the article. I've recently started using rpy2 as a python wrapper for R functions and libraries and I'm finding that it is not that bad.
There are some performance issues, but I am OK trading that off for the convenience of using "try:" rather than R's "tryCatch". Having tryCatch as a function rather than built into the syntax is unacceptable to me. But there are some libraries in R that don't have elegant alternatives in Python, or for which you have multiple options, or perhaps not the time to unlearn.
Here is how I like to use R, which can use the authors more introductory methods or full fledged data crunch:
Like I do all my other data sources: as code blocks inside an Emacs Org notebook. If you are doing data science, you quickly find that it's the management and combination of the various particular projects that becomes the most daunting (IMHO), and your data science notebook becomes the most important part of that organization. In that arena, for me it's pretty much either Jupyter or Emacs Org mode.
I've recently discovered the pipe() command, which allows R to consume the output of a terminal command (link: https://youtu.be/RYhwZW6ofbI). Quite useful for reading in compressed files, and it's made me learn a bit of awk to fix text files on the fly.
Oh, and there's also Rio if you want to explore injecting R into your command-line workflow.
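For anyone curious, the pipe() trick looks roughly like this; the file name and the awk filter are made up for illustration:

    # read a gzipped CSV, keeping only rows with the expected number of fields
    con <- pipe("zcat big_log.csv.gz | awk -F',' 'NF == 8'")
    df  <- read.csv(con, header = FALSE)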
If you're interested in learning how to do data cleaning and restructuring in R, I highly recommend the "Wrangle" chapters of Hadley Wickham's R for Data Science book, which you can read online here: https://r4ds.had.co.nz/wrangle-intro.html
dwodri: Not sure if you're a macOS user, but iTerm2 distributes a script to display images inline in the terminal. I've forgotten how to get R to output PNG data directly to stdout, but it could be nice to do that and display the images inline in the terminal.
I do in fact use macOS, but I'm more partial to kitty[1]. Sadly, OpenGL is getting deprecated in macOS Catalina, so I will probably crawl back to iTerm2 eventually.
I'm quite familiar with asciinema[2] as well, which I've considered using for creating animated examples. In general, I like to err on the side of caution, and optimize my website for people with poor/metered connections.
Also, I didn't expect this to get this much attention! Mea culpa for not putting more work into the post itself.
We might be miscommunicating! Or if not, sorry if I didn't understand what you said in reply.
What I was trying to say is: how about extending your blog post with one line to show people how to make the boxplot appear in their terminal as soon as they run the command? (I think at the moment it's going to appear as a PDF file named Rplots.pdf in the current directory, right?)
This is correct. A good cross-platform tool for rendering images in many modern terminal emulators is imgcat[1]. iTerm2 appears to use this under the hood.
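So, roughly, the extra lines would look like this (the paths are arbitrary; imgcat per the comment above, or iTerm2's bundled script):

    Rscript -e 'png("/tmp/boxplot.png"); boxplot(mpg ~ cyl, data = mtcars); dev.off()'
    imgcat /tmp/boxplot.png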