rOpenSci's peer review process for R packages ([more here](https://devguide.ropensci.org/softwarereviewintro.html)) is especially fantastic.
I do most data engineering in R (RMarkdown workbooks), and most software engineering in Python/Django. It took three separate, dedicated attempts to get warm with R (pre-tidyverse, showing my age); now I'm interrupting work on an RShiny app to write this comment. The ecosystem around the tidyverse helps immensely in converting my colleagues' workflows from Excel to R. Clarity and simplicity win over purity here (you may now light your pitchforks). And NSE still breaks my brain.
What also helps is that R is so focused on data and statistics. That gives its users a shared focus, which really helps when it comes to finding help. Python is famously second best at everything, but that also means its community is spread thinner over more subjects.
I've gotten some adoption of R Studio at two companies now. It's amazing for exploratory analysis and its cloud capabilities are wonderful.
1. Native data types - this is one of the things that Julia was designed to do very well. That is, native-like treatment for all data, without needing a C-family underbelly like Python does for its high-performance code.
2. Non-Standard Evaluation - Julia has Metaprogramming and Symbols which provide similar ideas in a different way. It uses abstract syntax trees and is very lisp-like in that way if you wanted to get into writing Macros and such.
3. Package Management - Julia has a best-in-class built-in package management system with versioning. Julia also has first-class support for documentation, so it's very easy for developers to write relevant documentation. As an R user before RStudio, package management was a pain; RStudio hides the manual work that used to be searching for, downloading, and unpacking packages. Packages usually work really well together, often automatically, so you can get really cool results where other languages would require a lot of coordination (like the Tidyverse).
4. Function paradigm - Julia is multi-paradigm and is conducive to functional, imperative, object-oriented, among others.
I'm a big Julia fan, after having gone R -> Python -> Julia. Not to make this totally one-sided: I still like R for plotting because it's more mature. RStudio also is very nice for dynamically interacting with datasets, but Juno comes pretty close there too.
In any case, I am looking forward to new Julia versions, which should address the delay in plotting (as far as I know).
but using Python throughout, or even Python with a sprinkling of Cython/C++/C for performance, allows for cleaner and faster engineering than using a special language for each niche.
I don't think R has a bigger problem with non-programmers being bad software engineers than Python has - there are plenty of people who know Python passably and are quite happy that they can be productive without being good software engineers (versus Java, where the intent of the language is biased towards everyone writing code to a minimum quality standard rather than everyone writing code with a focus on being productive). But you can find decent Python software engineers, and more recently you can find decent software engineers who also know enough of the niche in question to produce high-quality production code in that niche from the get-go, rather than throwing models over the data science/engineering wall that exists between two departments.
There are times when it may be better to use another language and there is nothing wrong with that. I default to Python and if I think there will be specific issues with it, then I can look at a more specialized language.
If you want to program with most of the tidyverse libraries, you are forced to implement a bunch of nonsense into your function to properly evaluate arguments within a function. Sure, NSE may be useful in some circumstances, but more often than not, it just increases the likelihood of introducing a bug.
Especially for new programmers, NSE is a huge leap and very confusing.
The old solution was exporting underscore-suffixed functions, e.g. `mutate_()`, that used standard evaluation. And this was fine. But then RStudio decided to deprecate these functions and force NSE on users. I’m not happy about that, and I often avoid using libraries like dplyr when writing functions so that I don’t have to deal with it.
As much as I like ggplot2, I find the rest of the tidyverse to be solving problems it invents (e.g. quosures to fix the problem of not permitting string arguments for dplyr verbs) and monopolising an open source ecosystem.
I agree that it is slow, and when things break, the heavy NSE use in dplyr really comes back to bite you.
People are usually pretty surprised when I take the stance that R is faster than Python for the things most people actually care about, which is data manipulation and model building. Python has its datatable library, which is approaching data.table's speed; however, it is very much a work in progress and does not have very useful features yet.
In particular, I suspect the benchmark sizing is forcing falling back from numpy's int64 to Python ints as categorical labels, which easily could explain a 10x or more differential.
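One way to check whether a benchmark has silently fallen back from machine integers to boxed Python ints is to inspect the array's dtype. This is a hypothetical sketch of the effect, not the benchmark in question:

```python
import numpy as np

# labels that fit in int64 stay as a packed machine-integer array;
# arithmetic on this runs in compiled loops
fast = np.array([0, 1, 2] * 1000, dtype=np.int64)

# the same labels stored with object dtype become an array of boxed
# Python ints, where every operation goes through the interpreter
slow = np.array([0, 1, 2**80] * 1000, dtype=object)

print(fast.dtype)  # int64
print(slow.dtype)  # object
```

On object-dtype arrays, elementwise operations can easily be an order of magnitude slower, which would be consistent with the differential suggested above.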
- working interactively (i.e. your code isn't part of a larger application)
- working with relatively small datasets that fit into memory
- don't need any deep learning libraries
then both R and Python can do a great job and choosing one over the other is simply a matter of preference. I might even lean slightly towards R because its data frames are a bit easier to use than pandas and RStudio's REPL is the best.
But if you need to deploy your code somewhere, or high performance, or the latest deep learning libraries, then Python absolutely crushes R. And it's not even close.
It also seems that the actual open-source ML-community (vs. Google: we want you to use our software to ensure you can't ever own your stuff) supports R just fine: https://mxnet.apache.org/api/r
Hopefully I’ll have time tomorrow to write a rebuttal for some of his arguments. Particularly, the preference for CRAN and code longevity strikes me as being shortsighted.
I've yet to see any argument for R that doesn't boil down to 'well, I know it better' or 'well, I prefer the syntax'.
R to data science is as Matlab is to engineering. It's a stopgap 'non programmer' language that thrived at a time when most academics didn't know any programming. Now school children learn programming. There is no use case for these languages anymore.
If you meant to type "machine learning" I'd probably agree, but R is much much better for small scale data exploration, visualization and modeling (i.e. 95% of DS) than Python. Pandas is an absolute horror show of an API compared to dplyr, and the best plotting libraries for Python are just copying features from R. Lack of a magrittr style infix operator, though seemingly minor, actually emerges as a real pain point once you become accustomed to using it. R is inferior to Python as a programming language, no doubt about it -- but most data scientists are not programmers. Which is the point of TFA.
This is the crux of the problem with R and why R is increasingly blacklisted at large orgs. It attracts non-programmers which may have been okay 5 years ago but is no longer acceptable.
With the exception of some engineering powerhouses hiring pure research PhDs to write R code, the trend established over the last 2 years is that fewer and fewer employers are hiring data scientists that aren't programmers. There are too many candidates who know data science and can also do data engineering and even generalist SE tasks. Non-programmer data scientists are not competitive in the industry anymore except that small top-end research niche that doesn't exist in most orgs.
Which brings us back to the fact that R was a successful niche language that allowed non-programmers to write models, but that's simply not enough anymore. Businesses want models that can be plugged into production pipelines, models that can scale without needing a dedicated team to re-implement them, and they want staff who do engineering in addition to whatever it is they specialise in.
Virtually all data scientists graduating today are programmers, and pretty good ones. Candidates who only know R can't compete against them.
> Lack of a magrittr style infix operator, though seemingly minor, actually emerges as a real pain point once you become accustomed to using it
So you'd agree that you fall into the 'I prefer the syntax' bucket then? I don't really see any arguments against Python in your comment. Funnily enough, it's trivial to implement a pipe style operator in Python and there's at least two popular libraries for that.
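To illustrate the "trivial to implement" claim, here is a minimal pipe sketch that abuses Python's reflected `|` operator (`__ror__`); the `Pipe` name is hypothetical, and libraries such as `pipe` offer more polished versions:

```python
class Pipe:
    """Wrap a function so `value | Pipe(fn)` applies fn to value."""

    def __init__(self, fn):
        self.fn = fn

    def __ror__(self, value):
        # called for `value | pipe_instance` when value doesn't define __or__
        return self.fn(value)


# read left to right, like magrittr's %>%
result = [3, 1, 2] | Pipe(sorted) | Pipe(lambda xs: xs[-1])
print(result)  # 3
```

The trade-off is that this is a convention bolted on per-project rather than an ecosystem-wide idiom, which is arguably the real gap versus magrittr.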
Eh, I call BS. Names and sources please. I know for a fact that R is used at all of FAANG and about a bazillion other "large orgs" too. I'm sure it's true that R is not used for customer-facing "web scale" products, but then again neither is any other language except for like two.
Being good at programming is a useful skill, but so is being good at statistics, and they are not interchangeable. "Productionizing a model" is not the only show in town when it comes to data analysis. Many programmers know shockingly little statistics. An equally large number of really strong statisticians prefer R, for good reasons. Orgs who simply refuse to hire those people do so at their peril.
People are usually pretty surprised when I take the stance that R is faster than Python for the things most people actually care about, which is data manipulation and model building.
If employers are hiring non-statisticians as data scientists then the problem is the employers.
Job applicants who only know R and have no grasp of SE are becoming less and less competitive. I don't expect there'll be any market for them in another 5 years.
The places where Excel is used are fairly appropriate. Way downstream, for simple tasks.
Visual programming platforms like Knime are the next step for these teams, and then onto something like RStudio as they complete the transition towards employing data science in their pipelines.
That's an interesting take, given that to me the magrittr operator seems to have been added to mimic the object oriented 'attribute' operator.
Of course the object-oriented variant makes it harder to extend the behaviour of a class after it has been defined (although strictly speaking that isn't impossible in Python); you'd need to add your methods up front, or extend the class.
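The "not impossible" part can be sketched as attaching a method after the class definition; all names here are hypothetical:

```python
class Point:
    """A plain class defined without any norm method."""

    def __init__(self, x, y):
        self.x, self.y = x, y


# extend behaviour after the fact: attach a new method to the class
def norm(self):
    return (self.x ** 2 + self.y ** 2) ** 0.5


Point.norm = norm

print(Point(3, 4).norm())  # 5.0
```

This works because Python methods are looked up on the class at call time, though monkey-patching like this is usually discouraged in favour of subclassing or composition.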
There are entire ecosystems of academic libraries built around Matlab that can’t all just be picked up and moved to Python. Anyone making this argument probably doesn’t realise just how ingrained Matlab is in STEM non-CS academic departments.
Example: my girlfriend's department writes a world-leading MRI analysis library in Matlab. They offer training courses on it (so departments around the world now know it) and it’s frequently used within academic papers (so there are now resources available on it). Why would they move to Python?
They can and they are. Python is increasingly displacing everything in the data industry and especially proprietary legacy platforms like Matlab. The number of things you can do in Matlab but not Python is converging on 0, while the inverse is not even worth trying to count.
Major universities are abandoning Matlab, Labview, SPSS, Minitab etc for Python, which is basically the end for them all. The next wave of CS/SE/DS/ML graduates had no exposure to Matlab. It'll linger in electrical engineering for a few more years but will suffer the same fate. In the end, proprietary platforms have no chance against FOSS.
> Example: my girlfriend's department writes a world-leading MRI analysis library in Matlab
Siemens is leading the MRI industry and the only place where they're still using Matlab is the legacy platforms that aren't yet listed for updates or aren't worth updating.
The actual leading stuff is done with the same ML tools as the rest of the industry, mostly Tensorflow. Siemens and GE both also have programs to engage and eventually acquire 3rd party ML platforms not a single one of which has anything to do with Labview or Matlab outside of occasionally interfacing with legacy components.
Just to add another point of anecdata.
I helm a large data science effort in the defense industry. We are actively moving away from MATLAB and to Python. It's easier for us to find Python coders, easier to train people to use Python, more maintainable for the restrictions we have on our networks, and cheaper.
Yep, NASA used it alongside Matlab for the Orion's Guidance and Navigation Control systems. I've never had the chance to use it though, it looks pretty interesting.
Python is the most popular programming language in the world and getting better.
I can imagine this will change in the long run, but right now there are many valid reasons why people use these tools.
I haven't done a lot with Julia, so I don't know if it's easier to teach Julia to novices than Python or R.
For teaching, I think that Julia indexing (for example: vector[2:end-1]) is easier to explain than numpy (vector[1:-1]). On the other hand, I like Python's plain-English operators and/or versus && / || in Julia.
Also, loops tend to be more readable than vectorised code in some circumstances (e.g. computing the Laplacian by finite differences). In Julia, loops and vectorised code are both quite efficient, while in Python and R, one has to vectorise the code.
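The Laplacian example can be made concrete in Python terms: the loop version reads like the maths but is slow in CPython, while the NumPy version must be rewritten with slices. A sketch, assuming a 2D grid and leaving the boundary at zero:

```python
import numpy as np


def laplacian_loop(u):
    # explicit loops over interior points: mirrors the stencil formula directly
    out = np.zeros_like(u)
    for i in range(1, u.shape[0] - 1):
        for j in range(1, u.shape[1] - 1):
            out[i, j] = (u[i - 1, j] + u[i + 1, j]
                         + u[i, j - 1] + u[i, j + 1]
                         - 4 * u[i, j])
    return out


def laplacian_vec(u):
    # same five-point stencil expressed as shifted array slices
    out = np.zeros_like(u)
    out[1:-1, 1:-1] = (u[:-2, 1:-1] + u[2:, 1:-1]
                       + u[1:-1, :-2] + u[1:-1, 2:]
                       - 4 * u[1:-1, 1:-1])
    return out


u = np.arange(25.0).reshape(5, 5)
assert np.allclose(laplacian_loop(u), laplacian_vec(u))
```

In Julia the loop version would be just as fast as the slicing one, which is the point being made above.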
The community is great. But small. For a lot of situations, I'd be hesitant to invest in Julia, because I don't know if the community will stay that way or if it fades away.
While in absolute numbers, we may be at 20% of R or Python communities, I am always curious to understand what people mean when they say the community is too small. What would be a signal that a particular community is big enough?
Essentially the base ecosystem of a language is driven by a core group of contributors, and the dedication and ability of that group matters more than most other factors. When doing scientific work and/or data science, I personally care more about the core quality and what the platform enables. Lately I’ve considered learning R, as it has a lot of well-done stats which simply aren’t available in Python, and aren’t ready yet in Julia. Last time I tried to calculate a confidence interval in Python for an obscure probability function, I ended up wanting to pull out my hair in frustration. There are libraries that kind of handle it in Python, but they are (were?) nigh impossible to modify or re-use for a generalized case. Much less getting a proper covariance matrix with enough documentation to know what to do with it. I used R examples to figure out the correct maths. R’s NSE seems appealing in allowing generalized re-use. I’ve had similar ability to re-use library features in Julia for solving problems outside a library's initial scope.
This also lets the type select the backend library: a standard float type will use BLAS for matrix ops; a GPU float type will use CUDA.
In the areas I work in (scientific computing and scientific machine learning), you can really only find the packages in Julia while R and Python's ecosystems are quite lacking. R has stats and Python has ML, but the rest of the scientific ecosystems there just aren't as complete.
They used to work with R. And the old generation of engineers used to work with Matlab. The old generation still does.
The new generation has been using actual programming languages, typically Python, since high school. They were the first wave of graduates in 2019 that specialised in a discipline and were also competent in software engineering.
The old generation is going to be driven out of the job market by the new in the span of 5 years as they saturate the senior tier of their respective fields. How do you compete for a job when all you know is R and your discipline, against someone who's a full fledged software engineer who knows your discipline and can put models directly into production use?
I know a fair bit of Python (mostly for ML/DL applications), bash, and just a smidge of HTML/CSS/JS (just enough to tweak a front end demo via R Shiny). I’m OCD enough that I make every effort to write clean and reproducible code and unit test it as I go (is this TDD?). I can implement some stats algorithms (e.g. EM algorithm, MCMC) from scratch with a pseudocode reference, but I rarely if ever have the occasion to do that for obvious reasons. I understand the concept of computational complexity, though I don’t have any figures memorized.
But I’ve never taken any CS course beyond Programming 101. I wouldn’t know how to navigate a complex production codebase. Embarrassingly, I know almost nothing about git. I’m 100% sure I’d get slaughtered in a whiteboard interview or similar. For that matter, I could easily get nailed on some holes in my core data science knowledge (cough SQL cough).
So, do I rebuild my foundation around software engineering, or just patch up the obvious holes? Grind towards a management position and let my inferior skills rot away?
I love love love what Hadley has done with dplyr for the most part, at least in spirit, though I think the implementation could have been done better so as not to be so clunky, esp wrt NSE. But I think he is just trying to work within the current R ecosystem.
Which makes me ask.. is it then time for R2? (Like a Python 3). Before you shoot me.. Do we need to save the good things we have innovated from within the R ecosystem over the years and consider doing things from scratch?
Is this what Julia tried to do? I haven’t gotten around to trying it yet.
That said, I think R is always going to be there and have its place.
Frankly, if they could just make an IDE like RStudio that ran Python, I’d probably be happy enough with that. I heard that with reticulate you can run both; curious to hear of others' experience with this.
You can make jupyter lab behave mostly like rstudio, but it can do a lot more, especially in terms of visualization and rich display of data objects.
I do like the point in Learn the user’s language, as friendlier error messages are something we should all strive for, although I’ve never had an issue with that particular problem in Python’s stack traces, and actually having types like Julia, or at least annotations via mypy, seems a better solution.
CRAN is a great point, and Python's packaging is in a sorry state, with a crazy number of approaches and undeclared dependencies. R does a great job here.
Functional programming section is ironic given the lack of functional patterns in the post. R has even fewer higher order functions than the python standard library.
It’s hard for me to see how R is better for production than Python, and the argument against pandas seems a bit of a strawman considering that numpy/scipy are quite stable and more central to the ecosystem than DataFrame. R is fantastic for data science and highly productive, until you need to do data munging or anything else that involves a general purpose language.
But for production usage, again it's a huge pain. It's difficult to keep version stability with developer machines since there's no standard lock file, and the CRAN servers often delete or silently update old versions of packages.
Packrat and Microsoft's MRAN really helps, but another curious issue is that it and other CRAN servers seem to have terrible stability - often going down for hours at a time (or worse).
Python's packaging is ~impossible to understand for new developers (and really absolutely needs improving), but in an organisation you just pick an approach suited for your use case.
I think I'm a good developer, yet I can't understand idiomatic Python. Python could use some verbosity for the sake of everyone else. If R slows you IQ-9000 people down, please, make it standard.
Those 90 seconds of googling constitute my entire knowledge of ggplot2.
To me, R always felt a bit quirky.
I don't think R is much better at functional programming than Python. I found R to be limiting in terms of general programming.
Also, we now have type checking. I'm positive you could combine type checking and clever type declarations to handle application state like in elm.
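The Elm-style idea above can be sketched with a tagged union of frozen dataclasses, so a type checker like mypy can verify every state is handled; all names here are hypothetical:

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass(frozen=True)
class Loading:
    """Request in flight; carries no data."""


@dataclass(frozen=True)
class Loaded:
    data: List[int]


@dataclass(frozen=True)
class Failed:
    message: str


# the application state is exactly one of these variants, Elm-style
State = Union[Loading, Loaded, Failed]


def render(state: State) -> str:
    # branch on the variant; a checker can flag unhandled cases
    if isinstance(state, Loading):
        return "loading..."
    if isinstance(state, Loaded):
        return f"{len(state.data)} rows"
    return f"error: {state.message}"


print(render(Loaded([1, 2])))  # 2 rows
```

Frozen dataclasses make the state immutable, so every transition produces a new value rather than mutating in place, which is the core of the Elm architecture.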
R is a functional programming language. Does Python treat functions as first-class citizens? Pass arguments by value? Store expression trees as data structures?
For me the main defining feature of functional programming is the ability to pass functions and the ability to avoid side effects.
If I remember correctly, I often had trouble with side effects in R.
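On the first-class-functions question: Python does treat functions as values you can pass around and build new functions from, so the side-effect-free style is available if you opt into it. A small sketch (the `compose` helper is hypothetical, not a stdlib function):

```python
from functools import reduce


def compose(*fns):
    # right-to-left composition built purely from first-class functions;
    # compose(f, g)(x) == f(g(x))
    return reduce(lambda f, g: lambda x: f(g(x)), fns)


double = lambda x: x * 2
inc = lambda x: x + 1

inc_of_double = compose(inc, double)
print(inc_of_double(10))  # 21
```

Nothing here mutates shared state: each call returns a fresh value, which is the discipline the comment is describing. What Python lacks is not the capability but enforcement; avoiding side effects remains a convention rather than a guarantee.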
> Native data science structures.
DataFrames are often easier to use than Pandas. However, in production workflows we're often using more datatypes than DataFrames, and R is weaker there. For example:
* Lists must be accessed with `[[ ]]`, instead of `[ ]`. I've seen many silent bugs slip through due to this.
* There are 3 competing implementations of classes. This results in classes being mystical and rarely understood.
* R is a Lisp 2. Variables and functions may share the same name. This leads to confusing errors.
* Catching specific types of errors can be awkward.
* Adding elements to a list iteratively is slow.
> Non-Standard Evaluation
This can be handy while quickly working in RStudio, but it's not easy to maintain. I've seen code that failed because it specified `f(!!variable)` instead of `f(!!!variable)`. I like R's formula notation, but I'm happy enough with sklearn's API that I don't miss it.
> The glory of CRAN
CRAN is not set up for production. It makes pinning versions very difficult. Many people resort to using MRAN, which is a Microsoft-supported snapshot of CRAN at a specific time, so a dev can just pretend they are installing software as if it were 6 months ago. I have seen MRAN go down multiple times. Not to mention, the owner of CRAN is notoriously prickly, and packages will not be accepted to CRAN unless the maintainer ensures their software runs on Solaris. Hadley Wickham has done so much for the community with `devtools` and his books. He gets a lot of praise, but it's not misplaced.
> Functional programming
Okay, this is actually pretty great. Hooray functional programming! Not totally related, but R has great polymorphic dispatch of functions, which really can't be undersold (the way documentation is automatically generated for this is *kisses fingers*).
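For comparison, Python's closest standard-library analogue to R's generic-function dispatch is `functools.singledispatch`, though it only dispatches on the type of the first argument (all function names below are hypothetical):

```python
from functools import singledispatch


@singledispatch
def describe(x):
    # fallback for types without a registered method
    return f"object of type {type(x).__name__}"


@describe.register
def _(x: int):
    return f"integer {x}"


@describe.register
def _(x: list):
    return f"list of {len(x)} elements"


print(describe(3))       # integer 3
print(describe([1, 2]))  # list of 2 elements
```

R's S4 system goes further with multiple dispatch, which Python has no standard equivalent for, so this covers only the S3-like case.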
Ultimately, R is a cool language. In interactive settings, I would rather work in RStudio than Jupyter any day. I like RMarkdown better than Notebooks for sharing analysis, too. If there is a specific Bayesian model necessary only available in R, that's fine, wrap it in a container. But the rest of the ETL and pipeline code feels easier to write and maintain in Python.
Dispatch in R is generally fine, but I see a great deal of UseMethod calls and switch statements on types in the libraries I've worked with. OTOH that's just users using tools badly, but OTOH R should enforce using a particular tool to solve problems. And R is particularly bad at enforcing anything, which is why we're left with S3, S4, and R6.
There's also the FFI issue across the board for Python/R, where functions are frequently barely-cleaned naked FFI calls that leave it a complete mystery what's going on under the hood. I think R is generally worse at it though; I've had memory leaks and sigterms that aren't visible in RStudio.
I do like the functional programming though. I had an excuse to use multi.argument.Compose from the functional library recently and it made me wish I had things like that to hand in all languages
Isn't that a Lisp-1, then? Maybe I've got them backwards. CL is a Lisp-2, and it's not unusable, so either #'readmacros are good enough or there's something else going on to balance out the ambiguity.
EDIT: I see what you're saying now, it's a Lisp-2. They can share the same name at the same time, not just 1 name referring to one value or the other.
Felt like a really low bar, to get real-world data forms imported, and find the <- dplyr functional bindings pretty much 'not there'.
I liked Shiny, because of the low barrier to deployment of a GUI inside the company leveraging R for the graphing. But integration is .. painful.
But it's good to know the interface got some eyeballs. I retract the accusation of abandonware.
This is a problem I struggled on for weeks in college. Eventually having to hack something together that relied on modifying the underlying data frame.
I've not returned to R since, as Python has always had better libraries and been easier to deploy.
You can find the man pages right in R console - look up `?lapply` for column-wise operations and `?apply` for row-wise.
When it comes to data.frame transformations you are typically better off using packages from `hadleyverse` - check https://github.com/hadley/reshape and https://tidyr.tidyverse.org/
Of course, what's important is not the technology used, but the problem solved. Fantastic that python works for you.
Yes, CRAN is magickal and wonderful, a package manager that works 90% of the time and doesn't make you want to eat your computer. Yes, Python's package management is borked with a borky borker.

But- Python, like most programming languages this side of Leboge, was designed _as a programming language_. Like, _for programming_. From the very start. So all you need to know, once you have it up and running, is _what you want to program_ and not H O W to program it.

The "H O W" is terribly, awfully important. Because in R, anything you want to do, you have to know the super secret mystical occult incantation that does it (and nothing else will do it). You can't just intuit syntax. Oh no. You _have_ to know _exactly_ what code to write. Otherwise- run, you fools!

Here's my latest and greatest. I wanted to automatically adjust the position of a legend in a plot so that it will not overlap the lines in the plot. How do you do that? Well, it turns out that the function legend() returns ... the position and dimensions of the legend's rectangle relative to the plot margins.

Why would it _do_ that? Why would a function called "legend" not return, oh, I don't know - a _bloody_ _legend_?

Well, because it's R, that's why.

That's not how most programming languages work, because that's not what most programmers expect, because that's not how most programming languages work. R is far out there in the oubliette of languages used by non-programmers, that work not like programmers would expect them to work because they're languages not for programmers. No shit non-programmers take to it. Because who else would? Well. Not programmers.

/first 2020 programming rant