Why I use R (shotwell.ca)
185 points by cardosof 9 months ago | 117 comments

Not mentioned are the fantastic user communities, especially the culture of inclusiveness and openness fostered by RStudio ([code of conduct](https://github.com/tidyverse/dev-day-2019/blob/master/CODE_O...)) and [rOpenSci](https://ropensci.org/community/). Basically the inverse of SnarkOverflow.

Especially rOpenSci's peer review process ([more here](https://devguide.ropensci.org/softwarereviewintro.html)) for R packages is fantastic.

I do most data engineering in R (RMarkdown workbooks), and most software engineering in Python/Django. It took three separate, dedicated attempts to get warm with R (pre-tidyverse, showing my age); now I'm interrupting work on an RShiny app to write this comment. The ecosystem around the tidyverse helps immensely to convert my colleagues' workflows from Excel to R. Clarity and simplicity win over purity here (you may now light your pitchforks). And NSE still breaks my brain.

I think the community that R users have managed to create is its strength. Python has a community of software developers, while R has a community of people who use programming to solve their problems. It's a fundamental difference in mindset, one which really shapes the community.

What also helps is that R is so focused on data and statistics. It gives a focus to the users that really helps when it comes to finding help. Python is famously second best at everything, but that also means its community is spread thinner over more subjects.

It's this same reason that running anything that depends on R in production is a PITA.

I run R in production and it's absolutely fine and wasn't harder than pretty much any other thing in software development.

I didn't suggest it was hard. I found CRAN repos to be insecure and unreliable.

I am not an R user (I primarily use Julia and Python), but can you expand on the insecure aspect of CRAN? Do you refer to (potentially) missing package signing (similar to [1])? I am not aware that Python or Julia support this either. Or is the software downloaded over ftp/http instead of https?

[1] https://wiki.debian.org/SecureApt

I too got started pre-tidyverse. I've done some minor analysis the last two years and been blown away by how easy it is to get up and running with that code. Way easier than it used to be. I actually bumbled my way through building a simple report building system pre-1.0.0. It was horrible in comparison.

I've gotten some adoption of R Studio at two companies now. It's amazing for exploratory analysis and its cloud capabilities are wonderful.

To give a quick comparison with Julia:

1. Native Data types - this is one of the things that Julia was designed to do very well. That is, native-like-treatment for all data without needing to have a C-family underbelly like Python does for its high performance code.

2. Non-Standard Evaluation - Julia has Metaprogramming[1] and Symbols[2] which provide similar ideas in a different way. It uses abstract syntax trees and is very lisp-like in that way if you wanted to get into writing Macros and such.

3. Package Management - Julia has a best-in-class built-in package management system with versioning. Julia also has first-class support for documentation, so it's very easy for developers to write relevant documentation. As an R user before RStudio, package management was a pain, but RStudio hides the manual work that used to be searching for, downloading, and unpacking packages. Packages usually work really well together, usually automatically, so you can often get really cool results[3] where other languages would require a lot of coordination (like Tidyverse).

4. Function paradigm - Julia is multi-paradigm and is conducive to functional, imperative, object-oriented, among others.

I'm a big Julia fan, after having gone R -> Python -> Julia. Not to make this totally in favor, I still like R for plotting because it's more mature. RStudio also is very nice for dynamically interacting with datasets, but Juno comes pretty close there too.

1: https://docs.julialang.org/en/v1/manual/metaprogramming/#

2: https://stackoverflow.com/questions/23480722/what-is-a-symbo...

3: https://www.youtube.com/watch?v=HAEgGFqbVkA

Actually, for plotting I prefer to use PyPlot in Julia, which is based on matplotlib, which is very mature and complete in my opinion. I tried to use other (more native) plotting packages like GR, Plots and Makie, but they did not provide all the plotting types I needed or were too rough around the edges.

In any case, I am looking forward to new Julia versions, which should address the delay in plotting (as far as I know).

why do you have to look forward to new versions of the base language if you have a problem with plotting?

Better static compilation and compile times which will reduce the time to first plot.

- Python as a text processing language has been less convenient than Perl for a long time

- Python+(Django, Flask) as a Web service language hasn't been as convenient as Ruby on Rails for some time

- Python+numpy as a numeric computation language hasn't had all the features of Matlab

- Python+pandas+matplotlib as a data science language hasn't had all the features of R

but using Python throughout, or even Python with a sprinkling of Cython/C++/C for performance, allows for cleaner and faster engineering than using a special language for each niche.

I don't think that R has a bigger problem with non-programmers being bad software engineers than Python has - there are plenty of people who know Python passably and are quite happy that they can be productive without being good software engineers (versus Java where the intent of the language is biased for everyone to write code with a minimum quality standard rather than everyone to write code with focus on being productive). But you can find decent Python software engineers, and more recently you can find decent software engineers who also know enough of the niche in question to produce high-quality production code in that niche from the get-go rather than throwing models over a data science - engineering wall that exists between two departments.

I think this is a very valid point. There are times when Python is not the absolute best choice for a given problem, but in most cases it can allow you to achieve the desired goal in a way that is easy for newcomers to the particular codebase to grok very quickly.

There are times when it may be better to use another language and there is nothing wrong with that. I default to Python and if I think there will be specific issues with it, then I can look at a more specialized language.

As an applied economist, I can assure you that R is wayyy better in terms of statistics than Python. Doing causal models in Python is a real pain because its data science community is mostly focused on machine learning. Even a robust linear model with IV or fixed effects (which is standard in causal models) is really hard to achieve in Python. Of course this argument is about the libraries and not the core language, but it's a strong argument to stay with R. This argument is also valid outside academia, as more and more data scientists try to address causality.

As a very frequent R user, I actually find non-standard evaluation to be more hassle than good in most situations.

If you want to program with most of the tidyverse libraries, you are forced to add a bunch of nonsense to your function just to evaluate its arguments properly. Sure, NSE may be useful in some circumstances, but more often than not, it just increases the likelihood of introducing a bug.

Especially for new programmers, NSE is a huge leap and very confusing.

The old solution was exporting underscore-suffixed functions, i.e. ‘mutate_()’, that used standard evaluation. And this was fine. But then RStudio decided to deprecate these functions and force NSE on users. I’m not happy about that, and I often avoid using libraries like dplyr when writing functions so that I don’t have to deal with it.

Agreed. Recent updates to rlang have made programming dplyr/ggplot2 functions a bit better, but it still feels super clunky. I use data.table for most things for programmability and speed reasons.

As much as I like ggplot2, I find the rest of the tidyverse to be solving problems it invents (e.g. quosures to fix the problem of not permitting string arguments for dplyr verbs) and monopolising an open source ecosystem.

I struggle to find the logic in the data.table interface, whereas dplyr is a joy to use, at least interactively.

I agree that it is slow, and when things break apart, the heavy NSE use in dplyr really comes back to bite you.

I steer clear of tidyverse as well. At some point I used to like ggplot but this is because I didn't know lattice yet. lattice is a lot like ggplot but it doesn't come with "strings attached" and the plots look a little nicer.

Same. The moment I felt that my brain was melting over NSE and dplyr was the moment I started to phase out most tidyverse stuff from my work. I've switched to data.table and plain R for most of my stuff now.

I actually use R mostly because of its data.table package. It is much faster and more concise than pandas, which is a nightmare to work with. Sure, you can get the job done in pandas, but you often have to wait ~10x longer for your commands to run and sometimes, I simply cannot use pandas at all because I run out of memory.


People are usually pretty surprised when I take the stance that R is faster than Python for the things most people actually care about, which is data manipulation and model building. Python has its datatable library, which is approaching data.table's speed; however, it is very much a work in progress and does not have very useful features yet.

FWIW it looks like pandas is slow/OOM-ing because the benchmarks solely use Categoricals, which aren't as heavily used by pandas users compared to R.

In particular, I suspect the benchmark sizing is forcing falling back from numpy's int64 to Python ints as categorical labels, which easily could explain a 10x or more differential.

If you're:

- working interactively (i.e. your code isn't part of a larger application)

- working with relatively small datasets that fit into memory

- don't need any deep learning libraries

then both R and Python can do a great job and choosing one over the other is simply a matter of preference. I might even lean slightly towards R because its data frames are a bit easier to use than pandas and RStudio's REPL is the best.

But if you need to deploy your code somewhere, or high performance, or the latest deep learning libraries, then Python absolutely crushes R. And it's not even close.

How are you defining 'high performance'? I think R's data.table is quicker on a number of metrics than comparable packages from Python.

Or by high performance do you mean a documented FFI and friends? That should be possible with R as well.

It also seems that the actual open-source ML-community (vs. Google: we want you to use our software to ensure you can't ever own your stuff) supports R just fine: https://mxnet.apache.org/api/r

Note that you can call Python from R, and vice versa too.

For the record, the last time I called python from R its memory usage ballooned to 10 times its normal size.

That's one of those things that make for a nice Medium article, but in practice work horribly or not at all.

This mostly strikes me as ”I know R better than I know Python” - which is fair for deciding what to use, but is obviously not an objective comparison.

Hopefully I’ll have time tomorrow to write a rebuttal for some of his arguments. In particular, the preference for CRAN and code longevity strikes me as being shortsighted.

R is objectively worse than Python for almost all data science tasks, and R is a huge PITA to productionise compared to Python.

I've yet to see any argument for R that doesn't boil down to 'well, I know it better' or 'well, I prefer the syntax'.

R to data science is as Matlab is to engineering. It's a stopgap 'non programmer' language that thrived at a time when most academics didn't know any programming. Now school children learn programming. There is no use case for these languages anymore.

> R is objectively worse than Python for almost all data science tasks

If you meant to type "machine learning" I'd probably agree, but R is much much better for small scale data exploration, visualization and modeling (i.e. 95% of DS) than Python. Pandas is an absolute horror show of an API compared to dplyr, and the best plotting libraries for Python are just copying features from R. Lack of a magrittr style infix operator, though seemingly minor, actually emerges as a real pain point once you become accustomed to using it. R is inferior to Python as a programming language, no doubt about it -- but most data scientists are not programmers. Which is the point of TFA.

> but most data scientists are not programmers

This is the crux of the problem with R and why R is increasingly blacklisted at large orgs. It attracts non-programmers which may have been okay 5 years ago but is no longer acceptable.

With the exception of some engineering powerhouses hiring pure research PhDs to write R code, the trend established over the last 2 years is that fewer and fewer employers are hiring data scientists that aren't programmers. There are too many candidates who know data science and can also do data engineering and even generalist SE tasks. Non-programmer data scientists are not competitive in the industry anymore except that small top-end research niche that doesn't exist in most orgs.

Which brings us back to the fact that R was a successful niche language that allowed non-programmers to write models, but that's simply not enough anymore. Businesses want models that can be plugged into production pipelines, models that can scale without needing a dedicated team to re-implement them, and they want staff who do engineering in addition to whatever it is they specialise in.

Virtually all data scientists graduating today are programmers, and pretty good ones. Candidates who only know R can't compete against them.

> Lack of a magrittr style infix operator, though seemingly minor, actually emerges as a real pain point once you become accustomed to using

So you'd agree that you fall into the 'I prefer the syntax' bucket then? I don't really see any arguments against Python in your comment. Funnily enough, it's trivial to implement a pipe style operator in Python and there's at least two popular libraries for that.
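For anyone wondering what "trivial" looks like here, a minimal sketch of a pipe-style operator in plain Python. The `Pipe` class and the three example steps are made up for illustration and are not taken from any of the libraries mentioned; it simply overloads `|` through `__ror__`.

```python
class Pipe:
    """Wrap a one-argument function so that `value | Pipe(fn)` calls fn(value)."""

    def __init__(self, fn):
        self.fn = fn

    def __ror__(self, value):
        # Python falls back to __ror__ when the left operand cannot handle `|` itself.
        return self.fn(value)


# Hypothetical steps, similar in spirit to a filter %>% mutate %>% summarise chain.
keep_even = Pipe(lambda xs: [x for x in xs if x % 2 == 0])
square = Pipe(lambda xs: [x * x for x in xs])
total = Pipe(sum)

result = [1, 2, 3, 4, 5, 6] | keep_even | square | total
print(result)  # 56
```

One caveat with this approach: a left operand that implements `|` for arbitrary objects on its own may intercept the operator before `__ror__` is tried, so treat this as a sketch rather than a drop-in replacement for magrittr.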

> R is increasingly blacklisted at large orgs.

Eh, I call BS. Names and sources please. I know for a fact that R is used at all of FAANG and about a bazillion other "large orgs" too. I'm sure it's true that R is not used for customer-facing "web scale" products, but then again neither is any other language except for like two.

Being good at programming is a useful skill, but so is being good at statistics, and they are not interchangeable. "Productionizing a model" is not the only show in town when it comes to data analysis. Many programmers know shockingly little statistics. An equally large number of really strong statisticians prefer R, for good reasons. Orgs who simply refuse to hire those people do so at their peril.

>It attracts non-programmers

If employers are hiring non-statisticians as data scientists then the problem is the employers.

Stats and programming aren't mutually exclusive. The current generation of DS graduates are strong in both.

Job applicants who only know R and have no grasp of SE are increasingly less and less competitive. I don't expect there'll be any market for them in another 5 years.

In Statistics they teach programming. For example, we studied one semester of C, one semester of OOP (C++), and one semester of SQL and relational database design. All of these were mandatory courses. On top of these there were also R, Matlab, Minitab, SPSS and SAS. All of my classmates knew programming. It is stupid to think that in this age a statistician won't be able to write a program. How are you going to analyse a country's population census as a statistician? Statistical packages do not always provide all you need. Sometimes you need to transform data. Sometimes you need to check/validate data. Sometimes you need to query from an X location. Sometimes you need to pipe through some process. A few people from our department wrote R packages for new statistical analyses that didn't exist yet.

And yet all these businesses run Excel in production. I'd rather implement R code (with localized variables and dumb algorithms, so there's something funny in it) than an Excel spreadsheet. But somehow there's a difference...

They really don't. I've been at such an org as described by OP. I've owned a system of production R/Excel and it was migrated to cloud + python + ETL over 4 years.

The places where Excel is used are fairly appropriate. Way downstream, for simple tasks.

A lot of hard science is still done on the back of Excel, only begrudgingly adopting a data science mindset as the instruments produce more and more data. Data science is more than just streaming data, data lakes and machine learning.

Visual programming platforms like Knime are the next step for these teams, and then onto something like RStudio as they complete the transition towards employing data science in their pipelines.

>Lack of a magrittr style infix operator, though seemingly minor, actually emerges as a real pain point once you become accustomed to using it.

That's an interesting take, given that to me the magrittr operator seems to have been added to mimic the object oriented 'attribute' operator.

Of course the object oriented variant makes it harder to extend the behaviour of a class after it has been defined (although strictly speaking that isn't impossible with python), you'd need to add your methods up front, or extend the class.

> There is no use case for these languages anymore.

There are entire ecosystems of academic libraries built around Matlab that can’t all just be picked up and moved to Python. This argument probably doesn’t realise just how ingrained Matlab is in STEM non-CS academic departments.

Example: my girlfriend's department writes a world-leading MRI analysis library in Matlab. They offer training courses on it (so departments around the world now know it) and it’s frequently used within academic papers (so there are now resources available on it). Why would they move to Python?

> There are entire ecosystems of academic libraries built around Matlab that can’t all just be picked up and moved to Python.

They can and they are. Python is increasingly displacing everything in the data industry and especially proprietary legacy platforms like Matlab. The number of things you can do in Matlab but not Python is converging on 0, while the inverse is not even worth trying to count.

Major universities are abandoning Matlab, Labview, SPSS, Minitab etc for Python, which is basically the end for them all. The next wave of CS/SE/DS/ML graduates had no exposure to Matlab. It'll linger in electrical engineering for a few more years but will suffer the same fate. In the end, proprietary platforms have no chance against FOSS.

> Example: my girlfriend's department writes a world-leading MRI analysis library in Matlab

Siemens is leading the MRI industry and the only place where they're still using Matlab is the legacy platforms that aren't yet listed for updates or aren't worth updating.

The actual leading stuff is done with the same ML tools as the rest of the industry, mostly Tensorflow. Siemens and GE both also have programs to engage and eventually acquire 3rd party ML platforms not a single one of which has anything to do with Labview or Matlab outside of occasionally interfacing with legacy components.


> Major universities are abandoning Matlab, Labview, SPSS, Minitab etc for Python

Just to add another point of anecdata.

I helm a large data science effort in the defense industry. We are actively moving away from MATLAB and to Python. It's easier for us to find Python coders, easier to train people to use Python, more maintainable for the restrictions we have on our networks, and cheaper.

Also, Simulink - there is no equivalent in Python and many industries use this.


Yep, NASA used it alongside Matlab for the Orion's Guidance and Navigation Control systems. I've never had the chance to use it though, it looks pretty interesting.

In python, your MRI analysis library could be trivially hooked up with other cloud data pipelines. Companies would require fewer training courses on average.

python is the most popular programming language in the world and getting better.

Because more and more people realize that using closed source, proprietary programming languages and libraries is not compatible with open, reproducible science.

Sure but in reality most if not all educational institutes that I'm aware of have Matlab licenses, it's what everyone in that particular field uses, and it's better to publish something with Matlab code than nothing at all which is I guess the alternative (it's a means to an end after all).

I can imagine this will change in the long run, but right now there are many valid reasons why people use these tools.

Then I guess Julia will be an even more powerful language to learn, as it combines the flexibility of Python with all the use cases of all major technical/scientific languages, while being efficient and fast.


I think Julia is better designed than R. But it lacks R's ecosystem, so Julia is a tougher sell.

I haven't done a lot with Julia, so I don't know if it's easier to teach Julia to novices than Python or R.

The interoperability of Julia and Python is really good (Julia calling Python and vice-versa). There is also the possibility to call R functions (but I do not use it personally). To some degree, one can leverage Python's ecosystem in Julia. Our group switched from Matlab to Julia and we are quite happy with the move.

For teaching, I think that Julia indexing (for example: vector[2:end-1]) is easier to explain than numpy (vector[1:-1]). On the other hand, I like python's plain English operators and/or versus && / || in Julia.
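To make the indexing comparison concrete, here is a small sketch using a plain Python list, which slices the same way a 1-D numpy array does for this case. The variable names are made up for illustration.

```python
vector = [10, 20, 30, 40, 50]

# Python/numpy style: 0-based, end-exclusive, negative indices count from the end.
interior = vector[1:-1]
print(interior)  # [20, 30, 40]

# Julia's vector[2:end-1] is 1-based and end-inclusive. Spelled out in Python's
# conventions, "second element through second-to-last" becomes:
n = len(vector)
interior_explicit = vector[2 - 1 : n - 1]
print(interior_explicit == interior)  # True
```

Both spellings pick out everything except the first and last element; the difference is only in which convention a novice has to keep in their head.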

Also, loops tend to be more readable than vectorised code in some circumstances (e.g. computing the Laplacian by finite difference). In Julia, loops and vectorised code are both quite efficient, while in Python and R, one has to vectorise the code.
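As a rough illustration of the readability point, here is the 1-D finite-difference Laplacian written both ways in plain Python. This is a sketch under my own assumptions (unit grid spacing by default, interior points only, hypothetical function names); the slice-based version mirrors the shape vectorised numpy code usually takes, i.e. `(u[:-2] - 2*u[1:-1] + u[2:]) / h**2`.

```python
def laplacian_loop(u, h=1.0):
    """Second-order finite-difference Laplacian at the interior points, as a loop."""
    out = []
    for i in range(1, len(u) - 1):
        out.append((u[i - 1] - 2.0 * u[i] + u[i + 1]) / h**2)
    return out


def laplacian_slices(u, h=1.0):
    """Same stencil expressed with three shifted slices instead of an index."""
    return [(left - 2.0 * mid + right) / h**2
            for left, mid, right in zip(u[:-2], u[1:-1], u[2:])]


u = [x * x for x in range(6)]  # samples of u(x) = x^2, whose second derivative is 2
print(laplacian_loop(u))    # [2.0, 2.0, 2.0, 2.0]
print(laplacian_slices(u))  # [2.0, 2.0, 2.0, 2.0]
```

In the loop the stencil `u[i-1] - 2u[i] + u[i+1]` is visible at a glance, whereas the sliced form asks the reader to line up three offsets mentally. Pure-Python loops are slow on large arrays, which is the reason vectorising is usually needed there.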

Julia’s ecosystem has been progressing since 1.0. The GLM.jl lib has become much better as has the data frame package. It’s more consistent than Python‘s data-science ecosystem since it’s not tying disparate C code together. But having strong types (actual, not mypy) helps make code more consistent. Still Julia’s ecosystem seems to be building more from R’s more solid academic approach.

Yes, Julia has a lot of benefits in some regards.

The community is great. But small. For a lot of situations, I'd be hesitant to invest in Julia, because I don't know if the community will stay that way or if it fades away.

Out of curiosity, how would you know that the community is large enough, or committed enough? For example, while Julia has been in development for almost 10 years, a lot of the community has now been around for 5 years. There's about 2,500 Julia packages, with the ability to call C, Fortran, R, Python, Java, etc. All the key community stats based on downloads, website views, videos show a healthy growth every year.

While in absolute numbers, we may be at 20% of R or Python communities, I am always curious to understand what people mean when they say the community is too small. What would be a signal that a particular community is big enough?

For me, as long as a core group appears to be active I’m fine with a community’s survivability. Julia’s data-science and plotting have continued to improve in terms of documentation and feature parity; both are critical in an immature ecosystem as they indicate an active core group of developers. Also, many libraries appear to be driven by academics creating cutting-edge libraries or developing "workhorse" libraries. One good example is Steven G Johnson’s involvement in Julia [1,2]; since he created the FFTW library and NLOpt, I’d put him in the category of ‘prolific data science contributor’. Or take the Julia GaussianProcesses.jl [3] library, which has a surprisingly thorough implementation along with academic research (and it’s citable!) for speeding up GP fitting. Pretty cool! Plus it’s pretty performant to, say, use Optim.jl to optimize the "hyper parameters" for a set of GP’s. That enables a lot more iterations of data exploration.

Essentially the base ecosystem of a language is driven by a core group of contributors, and the dedication and ability of that group matter more than most other factors. When doing scientific and/or data science work I personally care more about the core quality and what the platform enables. Lately I’ve considered learning R as it has a lot of well-done stats which simply aren’t available in Python, and aren’t ready yet in Julia. Last time I tried to calculate a confidence interval in Python for an obscure probability function I ended up wanting to pull out my hair in frustration. There are libraries that kind of handle it in Python but they are (were?) nigh impossible to modify or re-use for a generalized case. Much less getting a proper covariance matrix with enough documentation to know what to do with it. I used R examples to figure out the correct maths. R’s NSE seems appealing in allowing generalized re-use. I’ve had similar ability to re-use library features in Julia for solving problems outside that library’s initial scope.

1: https://en.wikipedia.org/wiki/Steven_G._Johnson

2: https://discourse.julialang.org/t/steven-johnson-as-a-juliac...

3: https://github.com/STOR-i/GaussianProcesses.jl

What's the difference between julia's types and mypy other than julia lacking a static checker?

Julia uses types as compiler hints. If you dispatch f(x) as f(1.0), in most cases it will lazily compile f to be float-optimized. When you run f(1) it will recompile it to be integer optimized.

This enables you to also select libraries: a standard float type will use blas for matrix ops; a gpu float type will use cuda.

Mypy types don't usually have any runtime enforcement.
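A small sketch of the Python side of that contrast, using only the standard library. The function names are made up. Note the assumption: `functools.singledispatch` only picks an implementation by the runtime type of the first argument; it does not compile specialised code the way the Julia description above does, so this is an analogy for the dispatch part only.

```python
from functools import singledispatch


def double(x: int) -> int:
    # The annotation is only a hint: nothing checks it when the function is called.
    return x * 2


print(double("ab"))  # runs without error and prints "abab"; a static checker would flag it


@singledispatch
def describe(x):
    # Fallback used when no more specific implementation is registered.
    return "generic"


@describe.register
def _(x: int):
    return "int-specialised"


@describe.register
def _(x: float):
    return "float-specialised"


print(describe(1), describe(1.0), describe("a"))
# int-specialised float-specialised generic
```

So in Python the annotations on `double` have no runtime effect at all, while dispatching on types has to be opted into explicitly.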

>But it lacks R's ecosystem, so Julia is a tougher sell.

In the areas I work in (scientific computing and scientific machine learning), you can really only find the packages in Julia while R and Python's ecosystems are quite lacking. R has stats and Python has ML, but the rest of the scientific ecosystems there just aren't as complete.

R definitely leads in molecular biology too. A lot of the Bioconductor tools have no equivalent in either Julia or Python.

I think these folks are working on it: https://biojulia.net/

Statisticians, economists, biologists, social scientists all learn and work with R. They publish new packages in R, not in Python. There is no trend at all that this is moving towards Python. Python is far, very far behind when it comes to state-of-the-art research in anything stat-related (besides machine learning I guess, but R is pushing hard to close that gap).

> Statisticians, economists, biologists, social scientists all learn and work with R

They used to work with R. And the old generation of engineers used to work with Matlab. The old generation still does.

The new generation has been using actual programming languages, typically Python, since high school. They were the first wave of graduates in 2019 that specialised in a discipline and were also competent in software engineering.

The old generation is going to be driven out of the job market by the new in the span of 5 years as they saturate the senior tier of their respective fields. How do you compete for a job when all you know is R and your discipline, against someone who's a full fledged software engineer who knows your discipline and can put models directly into production use?

This simply isn't true. There are more of them that know Python these days in addition to R, but as a recent graduate of a respected statistics graduate program, I can assure you that R is still the overwhelmingly preferred choice in the field, and also is in economics.

So assuming this is true, what plan of action do you recommend for an “old gen” data scientist who is strong in math, stats, ML theory, R, dataviz, ETL, research, etc., but who is not by a long shot a “full fledged software engineer”? I will soon be competing against this new crop of statistician/engineer superhybrids you speak of.

I know a fair bit of Python (mostly for ML/DL applications), bash, and just a smidge of HTML/CSS/JS (just enough to tweak a front end demo via R Shiny). I’m OCD enough that I make every effort to write clean and reproducible code and unit test it as I go (is this TDD?). I can implement some stats algorithms (e.g. EM algorithm, MCMC) from scratch with a pseudocode reference, but I rarely if ever have the occasion to do that for obvious reasons. I understand the concept of computational complexity, though I don’t have any figures memorized.

But I’ve never taken any CS course beyond Programming 101. I wouldn’t know how to navigate a complex production codebase. Embarrassingly, I know almost nothing about git. I’m 100% sure I’d get slaughtered in a whiteboard interview or similar. For that matter, I could easily get nailed on some holes in my core data science knowledge (cough SQL cough).

So, do I rebuild my foundation around software engineering, or just patch up the obvious holes? Grind towards a management position and let my inferior skills rot away?

Learning git is never a bad thing. But if you encounter a company expecting you to be a software engineer, run away. You're not that, you're a data scientist. You wouldn't expect a software engineer to be able to recreate some statistical proof from scratch, as you're testing the wrong set of skills.

Whenever someone makes a statement that one language is "objectively" better than another, it is followed only by opinions of the author.

I would say that for data manipulation and data visualization, R is objectively superior to Python. And for most statistical methods. It's pretty even for many machine learning algorithms. Python only really outclasses R in deep learning, in my opinion.

Insightful article on many fronts. I haven't had time to learn R but have been impressed by what it offers especially RStudio as an environment. I use Matlab from time-to-time and like having equivalent features that are of great help in initial data exploration and code experimentation. I haven't found anything fully equivalent in the Python world.

R was that language I learnt so many years ago. I love it actually, have so many fond memories. It’s hard to find something better especially for prototyping. I will admit productionalizing has almost drag and drop simplicity in python compared to R.

I love love love what Hadley has done with Dplyr for the most part, at least in spirit, though I think the implementation could have been done better so as not to be so clunky, esp wrt NSE. But I think he is just trying to work within the current R ecosystem.

Which makes me ask.. is it then time for R2? (Like a Python 3). Before you shoot me.. Do we need to save the good things we have innovated from within the R ecosystem over the years and consider doing things from scratch?

Is this what Julia tried to do? I haven’t gotten around to trying it yet.

That said, I think R is always going to be there and have its place.

Frankly if they could just make an IDE like RStudio that ran Python, I’d probably be happy enough with that. I heard with reticulate you can run both, curious to hear of others' experience with this..

`reticulate` is a brilliant package for running Python within R. I guess it's good as long as it's on a local machine. I've not tried productionizing code that has both R+Py. But it can work within Shiny too, helping bring Python's data science stack (esp. scikit-learn) within R.

You should try Jupyter lab. It has notebooks, text editors and more.

The whole problem with notebooks is that it's not a simple text file. On the other hand has anyone tried Rodeo for Python?

You can save notebooks to simple text files. You can write simple textfiles in jupyter lab and have a split view where you can execute it. You can have a notebook, console and graph view with the same kernel on the same page as split windows.

You can make jupyter lab behave mostly like rstudio, but it can do a lot more, especially in terms of visualization and rich display of data objects.

I wanted to like it but it kept crashing when I tried it a year or two back, I may give it another shot. I do recall it was closest to what I wanted..

If you need a multi-language notebook that plays nicely with version control then RMarkdown in RStudio is worth checking out

I've used it. It does have the feel of RStudio IDE if that's what you're getting at.

jupytext --sync

I like Jupyterlab, but there's a huge gap between it and something like RStudio.

The part on “Make code more concise“ struck me as an anti-pattern that should instead be a simple function composition. Or the functional programmer in me notes that the author re-invented the Maybe monad.
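For anyone who hasn't met the Maybe pattern mentioned above, here is a minimal Python sketch of the idea: each step runs only if the previous one produced a value, otherwise `None` is passed along. The helper names (`bind`, `parse_int`, `reciprocal`, `safe_reciprocal`) are hypothetical and are not taken from the article.

```python
from typing import Callable, Optional, TypeVar

T = TypeVar("T")
U = TypeVar("U")


def bind(value: Optional[T], func: Callable[[T], Optional[U]]) -> Optional[U]:
    """Apply func only if value is present; otherwise pass None along."""
    return None if value is None else func(value)


def parse_int(text: str) -> Optional[int]:
    """Hypothetical step: return an int, or None if text is not numeric."""
    try:
        return int(text)
    except ValueError:
        return None


def reciprocal(n: int) -> Optional[float]:
    """Hypothetical step: return 1/n, or None when n is zero."""
    return None if n == 0 else 1 / n


def safe_reciprocal(text: str) -> Optional[float]:
    """Chain the two steps; a failure in either one short-circuits to None."""
    return bind(bind(text, parse_int), reciprocal)
```

So `safe_reciprocal("4")` gives `0.25`, while `safe_reciprocal("0")` and `safe_reciprocal("abc")` both give `None` without any explicit if/else in the caller.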

I do like the point in Learn the user’s language, as friendlier error messages are something we should all strive for, although I’ve never had an issue with that particular problem in Python’s stacktraces, and actually having types like Julia or at least annotations via mypy seems a better solution.

CRAN is a great point, and Python's packaging is in a sorry state with a crazy number of approaches and undeclared dependencies. R does a great job here.

Functional programming section is ironic given the lack of functional patterns in the post. R has even fewer higher order functions than the python standard library.

It’s hard for me to see how R is better for production than Python, and the argument against pandas seems a bit strawman considering that numpy/scipy are quite stable and more central to the ecosystem than DataFrame. R is fantastic for data science and highly productive, until you need to do data munging or anything else that involves a general purpose language.

"CRAN is a great point, and pythons packaging is in a sorry state with a crazy number of approaches and undeclared dependencies. R does a great job here"

But for production usage, again it's a huge pain. It's difficult to keep version stability with developer machines since there's no standard lock file, and the CRAN servers often delete or silently update old versions of packages.

Packrat and Microsoft's MRAN really help, but another curious issue is that MRAN and other CRAN servers seem to have terrible stability - often going down for hours at a time (or worse).

Python's packaging is ~impossible to understand for new developers (and really absolutely needs improving), but in an organisation you just pick an approach suited for your use case.

Nix really shines here, has ~17000 R packages pinned and reproducibly built: https://hydra.nixos.org/eval/1560470#tabs-still-succeed

The problem with Python conciseness is that it requires cognition to write. That cognition needs to be multiplied tenfold for understanding.

I think I'm a good developer, I can't understand idiomatic Python. Python could use some verbosity for the sake of everyone else. If R slows you IQ 9000 people down, please, make it standard.

R is interesting in that it has one library which has zero real competitors to the point where it becomes a justification for using the entire language: ggplot2. It's almost like ggplot2 crosses over from a package to an application, and R is the user interface. Any examples like this in other languages? I can't think of any, maybe Python & scikit-learn 5 years ago.

Can you compare ggplot2 to matplotlib? 90 seconds of googling didn't seem to indicate to me that ggplot2 is particularly different, either in terms of its power or expressivity, than matplotlib.

Those 90 seconds of googling constitute my entire knowledge of ggplot2.

From people that have used both extensively (which does not include you or me, it would seem), I always hear that ggplot2 is unsurpassed. Admittedly, that isn't a satisfactory answer to your question.

Ruby and Rails?

R is the right tool in many cases but I've observed several cases where people have become overreliant on it and hacked together things that would be better written in bash, python, or other languages.

Is there any language for which the same could not be said?

One advantage of R is that it is easier or faster to teach enough R to non-programmer scientists such that they can do their own statistical stuff.

R is good for machine learning and for production. We have helped big orgs to incorporate this technology in their IT ecosystem. We used our open-source product called R Suite to manage deployment issues. https://github.com/WLOGSolutions/RSuite

Some good points. Certain things are easier in R and often you can find code snippets that perform certain statistical tasks "the right way" and they are more concise than they would be in Python.

To me, R always felt a bit quirky.

I don't think R is much better at functional programming than Python. I found R to be limiting in terms of general programming.

Also, we now have type checking. I'm positive you could combine type checking and clever type declarations to handle application state like in elm.

> I don't think R is much better at functional programming than Python.

R is a functional programming language. Does Python treat functions as first-class citizens? Pass arguments by value? Store expression trees as data structures?

First class functions and pass-by-value have been in basically every language that was designed in the last 30 years, and have been added to most of the popular ones that existed before that. Homoiconicity is less common, but is still present in some form in most recent languages that were originally designed to be interpreted or JIT-compiled (Python is no exception here).

Python can do all that. It doesn't force you to do it, and handling expression trees may be a bit complicated (using ast).

For me the main defining feature of functional programming is the ability to pass functions and the ability to avoid side effects.

If I remember correctly, I often had trouble with side effects in R.
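To make the "Python can do all that" claim above concrete, here is a small sketch using only the standard library: a function passed as an argument, and an expression parsed into a tree with `ast`, inspected, rewritten, and evaluated. The function names are made up for illustration, and this says nothing about how R does the equivalent.

```python
import ast


def apply_twice(func, value):
    """Functions are ordinary values: accept one as an argument and call it."""
    return func(func(value))


def variable_names(source):
    """Parse an expression into a tree and collect the variable names in it."""
    tree = ast.parse(source, mode="eval")
    return sorted({node.id for node in ast.walk(tree) if isinstance(node, ast.Name)})


def eval_with_add_as_mul(source, **variables):
    """Rewrite the expression tree so every + becomes *, then evaluate it."""
    tree = ast.parse(source, mode="eval")
    for node in ast.walk(tree):
        if isinstance(node, ast.BinOp) and isinstance(node.op, ast.Add):
            node.op = ast.Mult()
    code = compile(tree, "<expr>", "eval")
    # Builtins are emptied so only the supplied variables are visible.
    return eval(code, {"__builtins__": {}}, variables)
```

For example, `apply_twice(lambda n: n + 3, 1)` returns `7`, `variable_names("x + y * 2")` returns `['x', 'y']`, and `eval_with_add_as_mul("x + y", x=2, y=3)` returns `6`. It works, but as the comment says, it is noticeably more ceremony than quoting an expression in R.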

Could you elaborate on that? R is not a pure functional language (like, e.g., Haskell), so it's possible to produce side effects, or even "leak state", but it's hard to do it accidentally.

I'm happy to have read this article as it has a different perspective on R than I am used to. My general take is that Python really is a better suited language for production workflows, while R is superb at interactive workflows. I love R, but chiefly for RMarkdown, Shiny, ggplot and obscure stats packages rather than large machine learning codebases. Here's why:

> Native data science structures.

DataFrames are often easier to use than Pandas. However, in production workflows we're often using more datatypes than DataFrames, and R is weaker there. For example:

* Lists must be accessed with `[[ ]]`, instead of `[ ]`. I've seen many silent bugs slip through due to this.
* There are 3 competing implementations of classes. This results in classes being mystical and rarely understood.
* R is a Lisp 2. Variables and functions may share the same name. This leads to confusing errors.
* Catching specific types of errors can be awkward.
* Adding elements to a list iteratively is slow [1]

> Non-Standard Evaluation

This can be handy while quickly working in RStudio, but it's not easy to maintain. I've seen code that failed because it specified `f(!!variable)` instead of `f(!!!variable)`. I like R's formula notation, but I'm happy enough with sklearn's API that I don't miss it.

> The glory of CRAN

CRAN is not set up for production. It makes pinning versions very difficult [2]. Many people resort to using MRAN, which is a Microsoft supported snapshot of CRAN at a specific time, so a dev can just pretend they are installing software as if it were 6 months ago. I have seen MRAN go down multiple times [3]. Not to mention, the owner of CRAN is notoriously prickly [4] and packages will not be accepted to CRAN unless the maintainer ensures their software runs on Solaris [5]. Hadley Wickham has done so much for the community with `devtools` and his books. He gets a lot of praise, but it's not misplaced.

> Functional programming

Okay, this is actually pretty great. Hooray functional programming! Not totally related, but R has a great polymorphic dispatch of functions, which really can't be undersold (the way automatic documentation generates for this is *kisses fingers*)

Ultimately, R is a cool language. In interactive settings, I would rather work in RStudio than Jupyter any day. I like RMarkdown better than Notebooks for sharing analysis, too. If there is a specific Bayesian model necessary only available in R, that's fine, wrap it in a container. But the rest of the ETL and pipeline code feels easier to write and maintain in Python.

[1] https://stackoverflow.com/questions/17046336/here-we-go-agai...
[2] https://stackoverflow.com/questions/17082341/installing-olde...
[3] https://github.com/Microsoft/microsoft-r-open/issues/51
[4] https://www.reddit.com/r/rstats/comments/2t5oqp/dont_use_the...
[5] http://www.deanbodenham.com/learn/r-package-submission-to-cr...

>Not totally related, but R has a great polymorphic dispatch of functions,

Dispatch in R is generally fine, but I see a great deal of UseMethod calls and switch statements on types in the libraries I've worked with. On the one hand that is just users using tools badly, but on the other hand R should enforce using a particular tool to solve problems. And R is particularly bad at enforcing anything, which is why we're left with S3, S4, and R6.

There's also the FFI issue across the board for Python/R, where functions are frequently barely cleaned-up naked FFI calls that leave it a complete mystery what's going on under the hood. I think R is generally worse at it though, where I've had memory leaks and sigterms that aren't visible in RStudio.

I do like the functional programming though. I had an excuse to use multi.argument.Compose from the functional library recently and it made me wish I had things like that to hand in all languages
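For other languages, a compose helper in that spirit is only a few lines. Below is a rough Python sketch where the first function may take several arguments and each later function takes the previous result. This is an assumption about the general idea, not a port of the R function, and `compose` is a made-up name here.

```python
from functools import reduce


def compose(first, *rest):
    """Left-to-right composition.

    `first` may take any arguments; every function in `rest` takes the
    single value returned by the previous step.
    """
    def composed(*args, **kwargs):
        return reduce(lambda acc, func: func(acc), rest, first(*args, **kwargs))
    return composed


# Usage: add two numbers, double the sum, then format it.
add_double_format = compose(lambda a, b: a + b, lambda x: x * 2, str)
```

Calling `add_double_format(1, 2)` returns the string `"6"`.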

> R is a Lisp 2. Variables and functions may share the same name. This leads to confusing errors.

Isn't that a Lisp-1, then? Maybe I've got them backwards. CL is a Lisp-2, and it's not unusable, so either #'readmacros are good enough or there's something else going on to balance out the ambiguity.

EDIT: I see what you're saying now, it's a Lisp-2. They can share the same name at the same time, not just 1 name referring to one value or the other.
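For contrast, Python behaves like a Lisp-1: functions and variables live in the same namespace, so rebinding a name replaces the function. In R, a name in call position is looked up as a function, so a variable called `mean` does not hide `mean()`. A tiny sketch of the Python side (the names are just for illustration):

```python
def mean(values):
    """Arithmetic mean of a non-empty sequence."""
    return sum(values) / len(values)


original_mean = mean                    # keep a handle on the function object
mean = 2.5                              # rebinding the name replaces the function
callable_after_rebind = callable(mean)  # the name now holds a float
mean = original_mean                    # restore the function binding
```

While `mean` is bound to `2.5`, a call like `mean([1, 2])` would raise a `TypeError`, which is the kind of clash a Lisp-2 avoids.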

For those interested, there is a nice Vim plugin for R called Nvim-R.

I used to use R because of RStudio until JupyterLab came and flipped it all over; now I use only bash and Python for small data analysis as well.

I use R for the statistical packages, handling data frames, and plotting. For other types of work I would switch to python.

I tried R with JSON and R with Elastic Search and both interfaces were 'abandonware' with 'just use the HTTP paradigm and deal with the data frame afterward' outcomes.

Felt like a really low bar, to get real-world data forms imported, and find the <- dplyr functional bindings pretty much 'not there'

I liked Shiny, because of the low barrier to deployment of a GUI inside the company leveraging R for the graphing. But integration is .. painful.

Were you using jsonlite? Great library that can parse directly from a URL and transform into a data.frame depending on the JSON representation. I agree it is not clear from the get go which library should be used when you do something outside of base R. You kind of need to be in the know.

I think I was, yes. This is 2+ years ago. I haven't gone back because I got there with REST fetches and internalized re-framing of the JSON. I was just amazed how 'crude' it was, and that the dplyr bindings for ElasticSearch were non-functional (for me at least)

The elastic library on CRAN got a major update a few months ago which made it much more powerful and easier to use. Makes uploading easy too, especially if there's a direct relationship between the structure of your dataframe and the Elastic json. Uploading nested structures is still a bit of a pain though.

Thanks. Good to know, we may re-visit things. Elastic itself is a pain to manage, and with data growth we're re-examining what we get from it. For time series data, there is an element of "why?" about it: Postgres as a time series JSON blob store might work better for us. The sharding by time is easy to understand and it has more affinity to the underlying host OS. With Elastic, you have to understand a LOT of moving parts to make it work well (or be a Kubernetes expert, or pay somebody else). If you already have an investment in PG then it's possibly a better use of your time.

But it's good to know the interface got some eyeballs. I retract the accusation of abandonware.

How do I apply a function to every element of a data frame where that function takes as input the i,j indices of the element along with its value?

This is a problem I struggled on for weeks in college. Eventually having to hack something together that relied on modifying the underlying data frame.

I've not returned to R since, as Python has always had better libraries and been easier to deploy.

In R you'd typically want to operate over vectors (rows or columns, with columns being the faster option) rather than on individual values. This requires a bit of a mental shift when you come to R from a C/Cuda background or even python.

You can find the man pages right in R console - look up `?lapply` for column-wise operations and `?apply` for row-wise.

When it comes to data.frame transformations you are typically better off using packages from `hadleyverse` - check https://github.com/hadley/reshape and https://tidyr.tidyverse.org/

Of course, what's important is not the technology used, but the problem solved. Fantastic that python works for you.

You use one of the `apply` family of functions.

If you need i,j indices to solve your problem you should probably not be using a data.frame but a matrix type.

If he/she chose to use a data frame then the data is probably of multiple types. A matrix can only hold one type.
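Since the original poster ended up in Python anyway, here is a rough sketch of the "function of i, j and the value" idea on a mixed-type table. It uses a plain dict of columns as a stand-in for a data frame (an assumption, to avoid depending on any library), builds a new table instead of modifying the input, and uses 0-based indices, unlike R. `map_with_index` is a made-up name.

```python
def map_with_index(columns, func):
    """Return a new table where each cell is func(i, j, value).

    `columns` maps column name -> list of values; i is the row index and
    j is the column index. The input table is left untouched.
    """
    return {
        name: [func(i, j, value) for i, value in enumerate(values)]
        for j, (name, values) in enumerate(columns.items())
    }


# Usage: label every cell with its position.
table = {"name": ["a", "b"], "score": [1.5, 2.5]}
labelled = map_with_index(table, lambda i, j, v: f"{i},{j}:{v}")
```

Here `labelled` is `{"name": ["0,0:a", "1,0:b"], "score": ["0,1:1.5", "1,1:2.5"]}`. Working column by column like this is the same shape as the advice above about operating over columns rather than single cells.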

It is wrong to compare R with Python. R should be compared with NumPy + Jupyter.


How does R work for text? Easy to batch process words in data frames?

Here's a great book on this: https://www.tidytextmining.com/. Incidentally, the author is joining RStudio next month. However, in my experience sklearn's feature extraction methods [1] are more straightforward and NLP libraries like Pytorch/Tensorflow, Spacy, NLTK, Gensim and Snorkel are more geared toward Python as well.

[1] https://scikit-learn.org/stable/modules/feature_extraction.h...

Look, I use R (for plotting mainly) but it's a hideous abomination. It's like some kind of giant, horrible patchwork chimera put together by evil golems with the word "CONFUSION" carved on their foreheads.

Yes, CRAN is magickal and wonderful, a package manager that works 90% of the time and doesn't make you want to eat your computer. Yes, Python's package management is borked with a borky borker.

But- Python, like most programming languages this side of Leboge was designed _as a programming language_. Like, _for programming_. From the very start. So all you need to know, once you have it up and running is _what you want to program_ and not H O W to program it.

The "H O W" is terribly, awfully important. Because in R, anything you want to do, you have to know the super secret mystical occult incantation that does it (and nothing else will do it). You can't just intuit syntax. Oh no. You _have_ to know _exactly_ what code to write. Otherwise- run, you fools!

Here's my latest and greatest. I wanted to automatically adjust the position of a legend in a plot so that it will not overlap the lines in the plot. How do you do that? Well, it turns out that the function legend() returns ... the position and dimensions of the legend's rectangle relative to the plot margins.

Wait what?

Why would it _do_ that? Why would a function called "legend" not return, oh, I don't know - a _bloody_ _legend_?

Well, because it's R, that's why.

That's not how most programming languages work, because that's not what most programmers expect, because that's not how most programming languages work. R is far out there in the oubliette of languages used by non-programmers, that work not like programmers would expect them to work because they're languages not for programmers. No shit non-programmers take to it. Because who else would? Well. Not programmers.

/first 2020 programming rant
