Personally I'm surprised R is still in active development when the main use case ...

peatmoss · on Dec 5, 2020

R vs. Python flamewars always strike me as a Budweiser vs. Miller kind of argument. Neither is really a “craft beer” of programming languages. Neither is super remarkable as a programming language. Both made a bunch of pragmatic tradeoffs to appeal to large audiences that share similar values; both are “average joe” beers.

Python has comparative advantages over R in production roles. R has a comparative advantage in statistical libraries, visualization, and meta programming. Neither is an exemplar for production deployment or meta programming (R is an exemplar for stats libraries, however).

canjobear · on Dec 5, 2020

Tidyverse absolutely has a hipster craft beer feel to it. I think it's great, but it's true.

ct0 · on Dec 5, 2020

There is nothing more hip than library(tidyverse) that I've found in python.

civilized · on Dec 5, 2020

I'm really into this package that lets you manipulate tabular data using dozens of different systems with the exact same code

Yeah, you've probably never heard of it

civilized · on Dec 5, 2020

Nah, it's not nicer. dplyr is way better than pandas. But there is no end to the supply of Python fanbois who only know Python and assume that whatever's in Python just has to be better

NewJazz · on Dec 5, 2020

I don't mind pandas so much, although dplyr is quite nice IMO (feels like natural language and declarative/SQL like, whereas pandas ends up with lots of procedural idioms).

ggplot is something that I don't think matplotlib is comparable to at all, though. I am so much faster at iterating on a visualization with R/ggplot than Python/matplotlib. Maybe it is my tooling, though. How about others who have used both? What are your experiences?

dm319 · on Dec 5, 2020

No, same here. I tried to recreate some covid rate graphs in python. The ggplot code did facetting and fitted a LOESS to the data. Nothing ground breaking, but it really hit the limits of what seaborn was able to do, and I wasn't able to tinker with it much further. It got to the point where to make it look good I needed to calculate all the curves manually.
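For reference, the basic version of "calculating the curves manually" is not much code. Below is a bare-bones sketch of a LOESS-style smoother in plain Python (standard library only): local linear fits with tricube weights and no robustness iterations. It is not what ggplot or seaborn do internally, and the function name `loess` and its `frac` parameter are just illustrative choices.

```python
import math


def loess(xs, ys, frac=0.5):
    """Smooth ys against xs with locally weighted linear regression.

    For each x0 in xs, fit a weighted straight line using tricube weights
    whose bandwidth is the distance to the k-th nearest point, where
    k = ceil(frac * n), then evaluate that line at x0.
    """
    n = len(xs)
    k = max(2, min(n, math.ceil(frac * n)))
    fitted = []
    for x0 in xs:
        # Bandwidth: distance from x0 to its k-th nearest neighbour.
        dists = sorted(abs(x - x0) for x in xs)
        h = dists[k - 1] or 1.0  # guard against a zero bandwidth
        w = [(1 - min(abs(x - x0) / h, 1.0) ** 3) ** 3 for x in xs]

        # Weighted least squares for y = a + b * x, centred on weighted means.
        sw = sum(w)
        mx = sum(wi * x for wi, x in zip(w, xs)) / sw
        my = sum(wi * y for wi, y in zip(w, ys)) / sw
        sxx = sum(wi * (x - mx) ** 2 for wi, x in zip(w, xs))
        sxy = sum(wi * (x - mx) * (y - my) for wi, x, y in zip(w, xs, ys))
        slope = sxy / sxx if sxx > 0 else 0.0
        fitted.append(my + slope * (x0 - mx))
    return fitted


days = list(range(10))
rates = [2 * d + 1 for d in days]  # made-up, perfectly linear data
smooth = loess(days, rates, frac=0.5)
```

On perfectly linear input like this the smoother just returns the line; on noisy data, a smaller `frac` follows the points more closely. Faceting would mean running this once per group and plotting each result in its own panel.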

t_serpico · on Dec 5, 2020

ggplot >> matplotlib and dplyr >> pandas. It's not even close imo.

hated · on Dec 5, 2020

Pandas is used in some top 10 banks for analytics. Its performance is abysmal at the scale used there. Nobody wants to invest resources in training analysts to write high performance code so here we are. I have never viewed SQL more highly after seeing the mess that analysts make when writing imperative code.

civilized · on Dec 5, 2020

No surprise there - pandas encourages ugly, inefficient code with its bloated, unintuitive API.

Once I was a lead on a new project and asked the intern to write some basic ETL code for data in some spreadsheets. I said she could write it in Python if she wanted, because "Python is good for ETL", right?

This intern was not dumb by any means, but she wrote code that took 5 minutes to do something that can be done in <1 second with the obvious dplyr approach.

Also, if your bank analysts pick up dplyr, they can use dbplyr to write SQL for them :)

peatmoss · on Dec 5, 2020

R’s meta programming facilities are head and shoulders above Python’s, which I think explains the brilliance of dplyr and dbplyr. But I feel like with R you have to scrape back a bunch of layers to get to the Schemey parts. I’ve always wondered what Hadley and Co would have done with dplyr and dbplyr had they had something like Racket at their disposal.

kgwgk · on Dec 5, 2020

Unfortunately R success killed xlisp-stat: http://homepage.divms.uiowa.edu/~luke/xls/xlsinfo/

Edit: or maybe it's not dead? I just found http://www.user2019.fr/static/pres/t246174.pdf

civilized · on Dec 5, 2020

I was offended the first time I encountered R's nonstandard evaluation, but it didn't take long to accept it. Now I wonder why anyone would want to write `mytable.column` a million times when it's obvious from context what `column` is referred to, and the computer can reliably figure it out for you with some simple scoping rules. It's a superior notation that facilitates focus on the real underlying problem, and data analysts love that.
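For anyone who hasn't seen the idea, here is a rough Python sketch of the scoping rule being described: a bare name in an expression resolves to a column first and falls back to outside variables otherwise. This is only an illustration, not how R or dplyr implement it; `Table` and `where` are hypothetical names invented for the example, and `eval` on strings is only acceptable here because the expression is trusted.

```python
class Table:
    """Toy column store: a dict of equal-length lists (hypothetical)."""

    def __init__(self, **columns):
        lengths = {len(values) for values in columns.values()}
        if len(lengths) > 1:
            raise ValueError("all columns must have the same length")
        self.columns = columns
        self.nrows = lengths.pop() if lengths else 0

    def where(self, expression, **extra):
        """Keep the rows for which `expression` is true.

        Names in the expression are looked up in the row's columns first,
        then in `extra`, which stands in for the caller's variables.
        """
        code = compile(expression, "<where>", "eval")
        keep = []
        for i in range(self.nrows):
            row = {name: values[i] for name, values in self.columns.items()}
            scope = {**extra, **row}  # columns shadow outside variables
            if eval(code, {"__builtins__": {}}, scope):
                keep.append(i)
        return Table(**{name: [values[i] for i in keep]
                        for name, values in self.columns.items()})


flights = Table(carrier=["AA", "UA", "AA", "DL"], delay=[5, 42, 61, 0])
late = flights.where("delay > threshold and carrier == 'AA'", threshold=30)
print(late.columns)  # {'carrier': ['AA'], 'delay': [61]}
```

Note that `delay` and `carrier` are never prefixed with the table name; the table supplies the context.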

em500 · on Dec 5, 2020

IMO they should just bite the bullet and learn proper SQL. I say this as a data scientist who learned SQL later than C, Matlab, R, Python/Pandas (though earlier than PySpark).

civilized · on Dec 5, 2020

I agree. SQL is nothing to be afraid of, and there's no happier place to be analyzing huge tabular datasets than in a modern columnar database
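To make that concrete, the kind of grouped summary that tends to turn into imperative loops is one statement in SQL. A minimal sketch using Python's built-in `sqlite3` module follows; note that SQLite is a row store, not a columnar database, and the `trades` table and its columns are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trades (desk TEXT, notional REAL)")
conn.executemany(
    "INSERT INTO trades VALUES (?, ?)",
    [("rates", 100.0), ("fx", 40.0), ("rates", 50.0), ("fx", 10.0)],
)

# One declarative statement: group, aggregate, order.
rows = conn.execute(
    """
    SELECT desk, COUNT(*) AS n, SUM(notional) AS total
    FROM trades
    GROUP BY desk
    ORDER BY total DESC
    """
).fetchall()
conn.close()

print(rows)  # [('rates', 2, 150.0), ('fx', 2, 50.0)]
```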

orhmeh09 · on Dec 5, 2020

R’s data.table package is faster at these things out of the box than any single instance of a database server I’ve encountered. This is frustrating because I’m trying to explain some systemic issues we suffer by not using a relational database, but it’s really hard to make my case when data.table is one install.packages away and a version upgrade from Postgres 9 to something a little faster is gatekept by bureaucracy. I’ve been trying for months!

civilized · on Dec 5, 2020

You need a columnar database for good performance. Try DuckDB to ease them into it, it's a columnar SQLite.

orhmeh09 · on Dec 6, 2020

Thanks, I’m checking it out, it seems pretty interesting to keep an eye on. Lots of properties that would be useful in our shared computing environment like not requiring root or Docker.

hated · on Dec 6, 2020

Might also be worth running a local instance of Postgres 13. Super easy to do on Windows without administrator rights.

smabie · on Dec 6, 2020

Pandas/Python is amazingly prevalent at trading firms. And every day, we bitch about the performance, we bitch about the stupid API, we bitch about the GIL, the lack of expressiveness. The list goes on and on. But for some braindead reason, we never switch to Julia. It's masochistic.

ivirshup · on Dec 6, 2020

I do think Julia is a far better language for numerics than python, but compared to DataFrames.jl, pandas can be quite fast. I know, "but it's easier to make it faster in Julia". Last I checked `sort(df, :col)` was significantly slower than `df[sortperm(df[:col])]`. Someone actually has to go through and make these libraries fast.

Second issue, in my field (bioinformatics) the script is still a pretty common unit of code. Without cached compilation being a simple flag, Julia often is slower.
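On the `sort` vs `sortperm` point above: the second form computes a permutation of row indices from one column and then reindexes the whole table with it. A rough Python equivalent with plain lists, just to show the idea. It says nothing about which form is faster in DataFrames.jl, and `sortperm` and `take` here are made-up helper names (the first mirrors the Julia function).

```python
def sortperm(values):
    """Return the indices that would sort `values` (stable, ascending)."""
    return sorted(range(len(values)), key=values.__getitem__)


def take(table, order):
    """Reorder every column of a dict-of-lists table by the index list `order`."""
    return {name: [column[i] for i in order] for name, column in table.items()}


df = {"col": [3, 1, 2], "label": ["c", "a", "b"]}
order = sortperm(df["col"])
sorted_df = take(df, order)

print(order)      # [1, 2, 0]
print(sorted_df)  # {'col': [1, 2, 3], 'label': ['a', 'b', 'c']}
```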

smabie · on Dec 6, 2020

Yeah, that's a good point. DataFrames.jl starts to really shine when the cookie-cutter pandas functions aren't adequate for what you need to do. DataFrames.jl can certainly be slower in some cases, but you should expect a consistent level of performance no matter what you do. This is a far cry from pandas, which tanks by large factors when you start calling Python code vs C code.

In regards to Julia's compilation problem, you can use https://github.com/JuliaLang/PackageCompiler.jl to precompile an image, allowing you to avoid paying the JIT performance penalty over and over again.

fithisux · on Dec 5, 2020

I use Python at work, but R is the uber weapon.

SubiculumCode · on Dec 5, 2020

Yeah. Pick one, learn it, and you'll be fine, no matter if you choose Python or R.

Personally, I prefer R for my use case which is longitudinal analysis of experimental data.

canjobear · on Dec 5, 2020

About 8 years ago I agreed with this point, but with the development of tidyverse, R has become far superior to Python for anything involving dataframes.

I teach classes involving data analysis, some in Python and some in R (different topics). The amount of time the Python students spend fighting pandas---looking up errors, trying to parse the docs, trying out new arcane indexing strategies---is obscene. On the other hand, the R students progress rapidly. I'd move everything to R if I could, but Python is still better for NLP pipelines.

iaw · on Dec 5, 2020

I know R because that's what we used at my first company. I would love to switch to Python/Pandas but I'm comfortable with R and it does everything I need it to with one exception over ten years of heavy use.

Python is wonderful but the cognitive load for switching in industry and academia without a clear cost benefit isn't worth it to most people I know in my shoes. I encourage new coders to learn Python but discounting R feels a bit asinine.

Hadley is still actively doing work for R, which has led to a graphing package that is substantially better than anything in Python (last I checked). I have no doubt that Python will steal it and implement it eventually (as they should), but R is still doing firsts that Python hasn't (note the native implementation of piping; they're late to the party on lambda functions, obviously).

Icathian · on Dec 5, 2020

I made the switch years ago and there is lots that python does better. I really, really wish for a perfect port of dplyr and ggplot2. Those are what I truly miss, everything else I'm pretty happy with.

rjmorris · on Dec 6, 2020

plotnine isn't a perfect port of ggplot2, but it's pretty close. https://plotnine.readthedocs.io/en/stable/

civilized · on Dec 6, 2020

It will never happen. Python doesn't trust programmers with the power to make packages like dplyr and ggplot

civilized · on Dec 5, 2020

R already has a better lambda than Python, simply by virtue of having first class functions. This is just a bit shorter notation for something that already existed.

zwaps · on Dec 5, 2020

I use Python whenever I can, but R has loads and loads of statistical libraries that Python doesn’t. It is not even close.

Emphere · on Dec 6, 2020

Yeah, basically this. I assume HN has a higher number of people who work in ML jobs in fields like finance etc. If you're working in any sort of social/public health research, then most new methods seem to be implemented as R packages. I'm thinking of things like new methods for propensity score, sequential trial designs etc. Also seems to be the preferred language on the Stats Stack Exchange posts.

free2OSS · on Dec 6, 2020

What kind of stat problems?

Also I used to love Python... Until I got a full time job and learned why static typing exists.

zwaps · on Dec 6, 2020

Any sort of statistical or econometric estimator is typically published as an R package.

So for example, I recently saw a paper with a quite complex estimator based on dynamic panels and network (or spatial) interdependence that could identify missing network ties. For that, an R package exists.

If you want to use it in Python, you'd have to replicate a whole estimation infrastructure yourself, starting by extending the basic models in statsmodels.

That example is quite typical in my opinion.

Like I said, I really like to code in Python and I don't like R all that much. But if someone says, "Why would you use R, Python is better", then we can confidently say the person does not know what R is actually used for.