As a software developer forced to work with data scientists who refuse to learn ...

civilized · on Oct 3, 2021

Of course, if you read the article, you find out that the problem had nothing to do with R. It was a misconfiguration of the underlying linear algebra libraries that R (and Python and everything else) relies on. The author even made a minimal reproducible example in a single C script, no dependencies on R whatsoever.

I hear a lot of "R is bad, Python is Enterprise Production Quality (TM)" blather at my work. It's always because the people involved don't understand computers, don't read documentation, don't debug, don't do root cause analysis, and want to quickly pass off responsibility for their laziness and incompetence. Meanwhile I and my team are happily chugging away, producing millions of dollars of reliable value for my company in R year after year.

Python lags far behind R in wide swaths of data science. Pandas is inferior to both dplyr and data.table, and R's modeling capabilities blow Python's out of the water in breadth and depth. You only use Python when you have to, e.g. for unstructured data and deep learning type stuff.

If your colleagues make you deal with their bad R code, that's too bad, but don't blame the language. It's designed to be easy to use, so a lot of bad coders use it. Go train your bad coders or hire better ones.

laichzeit0 · on Oct 3, 2021

I would completely concede that R has better libraries. However, getting stuff like online prediction into production is a real pain when the models are developed in R. And R is single threaded. There is no way to hide that detail.

civilized · on Oct 3, 2021

R isn't the best for production predictions for sure (it can work though). But it's not hard to translate well-designed R processing pipelines and models into other languages if you must. The problem is that R programmers often don't know how to write good code in any language.

Same issue as Excel, really. Easy to use, so you get a lot of users with very thin engineering skills.

The solution is for production engineers to understand just enough R to set standards for data scientist code that enable reliable translation of the models to the production language. As with JS, you can complain about the yucky parts, or you can accept that it's the best tool for some jobs and make an effort to work around the yucky parts, or use the tools of those who are doing that (e.g. tidyverse and Wickham).

If you want data scientists to produce production-ready results, you have to hold them to the standards of production engineering.
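One way to picture the "translate the model to the production language" step: the R side exports the fitted parameters in a plain format, and production re-implements only the prediction formula. The sketch below assumes a logistic regression whose coefficients were written out as JSON; the field names, feature names, and values are all hypothetical.

```python
import json
import math

# Hypothetical export from the R side: coefficients of a fitted logistic
# regression, serialized as JSON. All names and numbers here are made up.
EXPORTED_MODEL = json.loads("""
{
  "intercept": -1.5,
  "coefficients": {"age": 0.03, "income_k": 0.01}
}
""")


def predict_proba(model: dict, features: dict) -> float:
    """Apply the logistic regression formula to a single observation."""
    linear = model["intercept"]
    for name, weight in model["coefficients"].items():
        # Fail loudly on a missing feature instead of silently assuming 0.
        linear += weight * features[name]
    return 1.0 / (1.0 + math.exp(-linear))


p = predict_proba(EXPORTED_MODEL, {"age": 50, "income_k": 0})
```

This only works if the R code keeps preprocessing and model fitting cleanly separated, which is the kind of standard the comment argues production engineers should set.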

mellavora · on Oct 3, 2021

"Same issue as Excel, really. Easy to use, so you get a lot of users with very thin engineering skills."

Huh?

While I totally agree with your quote, I'd think it applied a lot more to python than to R. Especially given that python seems to be the dominant "first language for people to learn when they get into programming" because it is "easy".

civilized · on Oct 3, 2021

The proportion in R is higher because the community of software engineers working in R is a lot smaller. R coders are overwhelmingly data analysts, while Python coders have more diverse roles. People who use R are also much more likely to have learned R, and only R, from their university courses towards a data science-related degree, especially if that degree is in statistics.

dragonwriter · on Oct 3, 2021

R is a language people use when they get into statistics, not even thinking specifically of programming.

deng · on Oct 3, 2021

> And R is single threaded. There is no way to hide that detail.

Python isn't much better in this regard, thanks to the GIL.
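A rough way to see the GIL point for yourself, assuming a standard CPython build: run the same pure-Python CPU-bound function through a thread pool and a process pool and compare wall-clock time. The function names and workload sizes below are arbitrary choices for illustration.

```python
import time
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor


def cpu_bound(n: int) -> int:
    """Pure-Python CPU-bound work: sum of squares of 0..n-1."""
    return sum(i * i for i in range(n))


def run_with(executor_cls, sizes):
    """Map cpu_bound over sizes using the given executor class."""
    with executor_cls(max_workers=4) as pool:
        return list(pool.map(cpu_bound, sizes))


if __name__ == "__main__":
    sizes = [500_000] * 4
    for executor_cls in (ThreadPoolExecutor, ProcessPoolExecutor):
        start = time.perf_counter()
        run_with(executor_cls, sizes)
        elapsed = time.perf_counter() - start
        print(f"{executor_cls.__name__}: {elapsed:.2f}s")
```

On a multi-core machine the thread version usually gains little over running serially, because only one thread executes Python bytecode at a time, while the process version can use several cores at the cost of process startup and pickling. Actual timings depend on the machine.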

What I actually found most baffling when I delved into R is the fact that it doesn't support 64-bit integers (lack of proper native UTF-8 support coming a close second).
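To make the 64-bit integer complaint concrete: base R's integer type is 32-bit, so larger whole numbers typically end up stored as doubles, and a double represents integers exactly only up to 2^53. The sketch below uses Python floats (also IEEE 754 doubles) to show what that does to a large ID; the variable names are illustrative.

```python
# Largest value of a 32-bit signed integer, the range of base R's integer type.
INT32_MAX = 2**31 - 1

# A 64-bit-sized ID that no longer fits in 32 bits.
big_id = 2**53 + 1
assert big_id > INT32_MAX

# Storing it as a double (Python's float is an IEEE 754 double) loses the
# last bit: the nearest representable value is 2**53.
as_double = float(big_id)
round_tripped = int(as_double)

print(big_id)         # 9007199254740993
print(round_tripped)  # 9007199254740992
```

So two distinct database keys above 2^53 can silently collapse into the same value once they pass through a double.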

laichzeit0 · on Oct 3, 2021

> Python isn't much better in this regard, thanks to the GIL.

Take some standard ML model built with Caret or LME4 and try to serve predictions with Plumber in R. It’s significantly more painful than using sklearn + FastAPI. You either need to use future::promise (which still sucks because it’s forking new R runtimes) or forgo this and go K8s or something similar.

I don’t get the love for RStudio either. It crashes frequently for me, or locks up randomly. The debugging experience is abysmal compared to PyCharm. Getting reproducible R builds is a pain, slightly alleviated by Renv. But not really if you want separate dependencies for dev and production.

Python and R tooling are not comparable. You will have serious issues operationalising R. These are problems that most statisticians are simply not equipped to deal with, and that serious software engineers will hate about R.

civilized · on Oct 3, 2021

FWIW I have many years of full-time RStudio dev experience, and while I've definitely had a few hard-to-explain crashes, I'd characterize it as very reliable overall. When problems arise they tend to be due to community-contributed packages, especially packages that call out to C++. (My name is on the bug fix log for some major packages.)

Unintentional and unnecessary creation of huge, memory-hogging objects is a closely related footgun. Packages are often not built with large data in mind and make choices that scale terribly, such as storing multiple copies of the data in the model object, or creating enormous nonsparse matrices to represent the model term structure. It's a legacy of the academic statistics culture R grew out of. Most researchers test their fancy new method on a tiny dataset, write a paper, and call it a day.
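Some back-of-the-envelope arithmetic shows why a nonsparse matrix for model terms scales so badly. The sketch assumes a single categorical predictor one-hot encoded into one column per level, 8-byte doubles for values, and a compressed-column-style sparse layout with 4-byte indices; the function names and sizes are illustrative, not taken from any particular package.

```python
def dense_bytes(n_rows: int, n_cols: int, bytes_per_value: int = 8) -> int:
    """Memory for a dense matrix: every cell is stored, zeros included."""
    return n_rows * n_cols * bytes_per_value


def sparse_one_hot_bytes(
    n_rows: int,
    n_cols: int,
    bytes_per_value: int = 8,
    bytes_per_index: int = 4,
) -> int:
    """Memory for a one-hot matrix in a compressed-column-style layout.

    Assumes exactly one non-zero per row, stored as a value plus a row
    index, and one column pointer per column plus a final sentinel.
    """
    nonzeros = n_rows
    return (
        nonzeros * (bytes_per_value + bytes_per_index)
        + (n_cols + 1) * bytes_per_index
    )


# One million rows, a factor with one thousand levels.
dense = dense_bytes(1_000_000, 1_000)
sparse = sparse_one_hot_bytes(1_000_000, 1_000)
print(f"dense:  {dense / 1e9:.1f} GB")
print(f"sparse: {sparse / 1e6:.1f} MB")
```

Under these assumptions the dense form needs about 8 GB and the sparse form about 12 MB, before counting any extra copies of the data kept inside the model object.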

No argument about the debugging experience. I find it very slow, especially with large datasets, and try to avoid it. Not much experience with reproducible R builds but I wouldn't be surprised if it was a pain.

WhompingWindows · on Oct 3, 2021

Wow, tell us how you really feel. How much have you used R and Python? Maybe those data scientists would prefer if you didn't viscerally hate the main data/statistics language and didn't call it useless for things beyond a narrow use-case. It may lead to better outcomes if people hated things less and tried to understand the valid use-cases, for instance the reams and reams of statistics that can be done in R where Python may lag behind, since R is the lingua franca of statistics and research.

gnufx · on Oct 3, 2021

I've never seen anything for Python that allows you to take a linear algebra-based code and run it at maybe petascale with trivial modifications. There's an R example somewhere under https://pbdr.org/publications.html

otabdeveloper4 · on Oct 3, 2021

R is way more powerful and flexible for data science stuff. (Going from Python to R is almost like going from Excel to Python.)

mellavora · on Oct 3, 2021

With regard to good software engineering, there is a funny thing about python. Simply cutting and pasting code from one env to another can completely destroy the program if the paste messes with the indentation.
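For anyone who hasn't hit this, here is a minimal illustration of the failure mode being described: two functions that differ only in the indentation of one line, both valid Python, with different results. The function names are made up for the example.

```python
def sum_with_bonus_each(values):
    total = 0
    for v in values:
        total += v
        total += 1  # indented: part of the loop body, runs once per element
    return total


def sum_with_bonus_once(values):
    total = 0
    for v in values:
        total += v
    total += 1  # dedented one level: runs once, after the loop finishes
    return total


print(sum_with_bonus_each([1, 2, 3]))  # 9
print(sum_with_bonus_once([1, 2, 3]))  # 7
```

Neither version raises an error, which is the point: a paste that shifts one line by a level changes behavior without any syntax complaint.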

Now some people say this can be solved with a good IDE. Which might (or might not) be true if you can reliably identify, by manually reviewing the code, the ends of the functions, loops, etc which got munged in the paste.

But interestingly enough, jupyter notebooks (which seem to be the go-to tool these days) aren't IDEs, which makes it incredibly easy to fubar otherwise perfectly working code by pasting it from your local IDE into, let's say, an AWS Sagemaker instance, to pick one random example of a current widely used jupyter implementation. So even if the problem could be fixed by a good IDE, there is no guarantee that that IDE is (easily) accessible for production code.

I just have a hard time seeing how such a fundamental flaw in a language can lead to "good software engineering"

goerz · on Oct 4, 2021

So don’t mess up the indentation when you paste. Seriously, in my 15 years of using Python on a daily basis this hasn’t been a problem once.

kgwgk · on Oct 3, 2021

It could be worse. They could learn python and still prefer to use R!

tharne · on Oct 3, 2021

I don't know why you're getting downvoted. I was one of the data guys you mentioned who learned R first and resisted python. There are a lot of things about R that lead users to develop very bad habits. The only reason R caught on in the first place is because python did not have mature libraries for data analysis for a long time.

civilized · on Oct 3, 2021

All languages strike a tradeoff between flexibility and enforcing a regular structure. A lot of people seem to think their preferred language hits the perfect point on that tradeoff, and judge any language that makes a different choice. Python lovers judge R, Java users judge Python, C++ users judge Java, Rust users judge C++, Go users judge Rust, and everyone judges JavaScript.

A language that's more flexible than your favorite "encourages bad habits", while a language that's less flexible than yours is "bureaucratic".

tharne · on Oct 4, 2021

It's not the flexibility that encourages bad habits. Lisps are incredibly flexible, for instance, but do not generally encourage bad habits. R encourages bad habits because the language itself and its libraries are not very well-designed. The language is powerful and useful, but it's also a mess.

R encourages bad habits for the following reasons:

- R is made "for statisticians, by statisticians" so a lot of the example code out there is very poorly written

- The syntax is very inconsistent across libraries, and even within base R

- There are a lot of syntactical quirks that cause a lot of confusion for anyone who's learned another language, like using dots in function names, e.g. "read.csv". There's also the 1 indexing.
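The 1-indexing point is easiest to see side by side. In R, `x[2:3]` selects the second and third elements (1-based, both ends inclusive), while the same-looking Python slice is 0-based and excludes the end. The helper below is a hypothetical shim, covering ascending ranges only, to show the off-by-one translation.

```python
def r_slice(seq, start: int, end: int):
    """Emulate R's x[start:end] (1-based, inclusive) for ascending ranges."""
    if start < 1 or end < start:
        raise ValueError("only ascending 1-based ranges are supported here")
    # Shift the start down by one; the inclusive end becomes Python's
    # exclusive end without any adjustment.
    return seq[start - 1:end]


letters = ["a", "b", "c", "d"]

print(r_slice(letters, 2, 3))  # ['b', 'c']  what R's letters[2:3] selects
print(letters[2:3])            # ['c']       what the same digits mean in Python
```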

civilized · on Oct 5, 2021

> Lisps are incredibly flexible, for instance, but do not generally encourage bad habits

People often say "LISP is so powerful but nobody can understand anyone else's code". That's the dominant explanation for why LISP isn't more popular (along with the "oatmeal with toenail clippings mixed in" syntax, which most people don't find readable, regardless of the fervent beliefs of the LISP community to the contrary). The community stays small because of the unappealing syntax, and even within the community people find it hard to work together, because everyone has their own style, so the kind of coding and collaboration that produces generally useful libraries doesn't tend to happen. I would argue there's no meaningful distinction between "bad habits" and "habits that inhibit the development of generally useful software". In fact, that's the most useful definition of "bad habits" I can imagine.

Flexibility is a root cause of bad habits thus defined, because flexibility is what enables people to make bad choices. Language designers have long recognized this. It's why certain languages impose heavy restrictions on how you can structure your code, from Java's everything-is-a-class to Python's indentation-based scoping. They have a certain vision for what constitutes effective code and they know it won't happen unless they force everyone to follow it. In other words, they choose to reduce flexibility to prevent what they consider to be bad habits.

> R encourages bad habits for the following reasons:

Your reasons are just common complaints about R issues, not actual arguments that these issues encourage bad habits.

> - R is made "for statisticians, by statisticians" so a lot of the example code out there is very poorly written

At best this is an argument that widely publicized badly written R code encourages bad habits, not R itself.

> There's also the 1 indexing

There's a big difference between "language feature I don't like" and "language feature that encourages bad habits". A language that has different conventions than your favorite language is just different.

project2501a · on Oct 3, 2021

It is not so much R as the (relative) unwillingness to break compatibility or enforce global standards. Why don't all string functions accept UTF-8?
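As a language-neutral illustration of why this matters (shown in Python, not a claim about any specific R function): a string function that counts or cuts bytes instead of characters gives wrong answers as soon as the text contains a multi-byte UTF-8 character.

```python
text = "na\u00efve"  # "naive" with a diaeresis on the i; U+00EF is 2 bytes in UTF-8
encoded = text.encode("utf-8")

char_count = len(text)     # counts code points
byte_count = len(encoded)  # counts bytes

# Cutting after 3 bytes splits the 2-byte character in half; decoding with
# errors="replace" turns the dangling byte into U+FFFD.
truncated = encoded[:3].decode("utf-8", errors="replace")

print(char_count, byte_count)  # 5 6
print(repr(truncated))
```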