Hacker News
Enhancing R: The Vision and Impact of Jan Vitek's MaintainR Initiative (r-consortium.org)
56 points by wodenokoto 33 days ago | 58 comments



I adore R.

It's the language I'm most proficient in. The Tidyverse is the most human-friendly way to do exploratory data analysis. data.table is blazing fast. RStudio is the best IDE I've ever seen, and it's so tightly coupled with the mostly amazing documentation that it's a pure delight. CRAN's quality control is second to none.

That being said, I prefer Python for my production use cases.

Why? What's missing in R, imho, is:

a) a decent interface to the web.

b) a decent way to use async processing / utilise the CPU fully.

I've found ways around both, but compare `Shiny` to `FastApi` + `Jinja2` and `future` to `asyncio`.

Shiny just feels clunky, no matter what you do. And R `future` (which saved my butt in 2019 when I was processing millions of geospatial data points for a project, so not throwing any shade) can be a total mindfuck and bug out (at least back then).
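To make the `future` vs `asyncio` comparison concrete, here is a minimal sketch of the kind of concurrent fan-out the commenter is pointing at, using only the Python standard library (the task names and delays are invented for illustration):

```python
import asyncio

async def fetch(name: str, delay: float) -> str:
    # Stand-in for an I/O-bound call (e.g. an HTTP request).
    await asyncio.sleep(delay)
    return f"{name}: done"

async def main() -> list:
    # Run three "requests" concurrently; total wall time is roughly
    # max(delay), not the sum - the appeal over sequential code.
    return await asyncio.gather(
        fetch("a", 0.1),
        fetch("b", 0.2),
        fetch("c", 0.1),
    )

results = asyncio.run(main())
print(results)  # ['a: done', 'b: done', 'c: done']
```

R's `future`/`promises` can express the same thing, but, as the comment says, the ergonomics are rougher.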

(Man, I miss working more in R, though.)


I too am a Tidyverse shill. Hadley Wickham truly did an amazing job designing the whole ecosystem and API. I personally haven't encountered another API that has given me the same feeling of mastery and empowerment - just the correct blend of expressiveness, cohesion and ease of use.

Of course this is partly attributed to R's great DSL capabilities and making documentation first class. But I've definitely seen terrible APIs in R too.

Wonder if anyone else has had a similar experience with another ecosystem? (Regarding API design)


On the async part, it's definitely worth checking out https://github.com/shikokuchuo/mirai, which also integrates nicely with Shiny/plumber.


I don't really understand the need for a compiler in R. All the main packages that have to do any heavy lifting have had their functions outsourced to Fortran or C decades ago.

R is also thriving in bioinformatics with Python trailing behind as an afterthought. I think the main reason why R is fading is simply why some other languages fade: fashion, and prestige of the new masters (read: professor X in field Y likes esoteric language Z, so his grad students write out libraries A/B/C to corner the Z library space, and professor X gets more clout than he ever would in a default language).


> R is also thriving in bioinformatics with Python trailing behind as an afterthought.

Maybe this is sub-field dependent. I'm a bioinformatician who hasn't touched R in about 5 years, and everything is now in Python.


It’s also training dependent.

It took a long time for Perl to go away from daily use in bioinformatics, largely because of how common it was twenty years ago.

In our lab, there is me, a (mid career) polyglot who mixes Python, Go, Java, and R daily. I use tab delimited text files to transfer data. I also grew up coding in C++ and like learning new languages.

We also have a (mid career) staff scientist who grew up in Perl, but switched 100% to R and an (early career) postdoc who has always used 100% R. For both of these people, if work can be done in R, it is. If it can’t be done in R, they figure out how to do it in R anyway (even if that is shelling out to another program).

We also have a (young) grad student that is 95% Python. They try to keep to the Python tooling, even though they are quite aware of the R ecosystem.

There is a generational shift in the field, and it is more apparent each year. I find it interesting that Python took over from Perl first and now it’s trying to take over from R.


oh wow, I made a sweeping generalization - apologies. In single cell, bulk RNA-seq, ATAC, ChIP, I'd say that there are more R packages for the analysis of these omics than Python packages.


Python is catching up on single-cell transcriptomics; see AnnData, Scanpy, et al.


Scanpy/AnnData has been dead in the water for a while now, and most people use Seurat due to its interoperability with many, many downstream extensions.


I have no dog in the Scanpy/Seurat argument, but AnnData is becoming very popular as a data format even outside of single-cell omics.


> has been dead in the water for a while now

Both are under active development and are used in several transcriptomics atlas projects, as far as I can tell.


Those atlases were established back when Scanpy and Seurat were relatively beta and were still fighting out the tool space.

Look at the packages now for integration, pseudotime, pseudobulk - R (and therefore Seurat) dominates heavily.


Disagree - Seurat had first mover advantage with single cell but sucked with larger datasets that Scanpy could handle till the big change in Seurat 5. The preference for either Seurat/Scanpy is incredibly lab specific. That said, Seurat is better documented for sure, but the ecosystem for both is incredibly rich and flourishing.


yeah, I also disagree with this. It's true Seurat is still heavily used for scRNA/scATAC, but I see most new models increasingly being written/tooled for Python and based on AnnData: Geneformer, scGPT, scVI etc. I wish there were better interoperability between the scverse stuff and Seurat, but Seurat went their own way from SCE/Bioconductor, so that's probably not going to happen.


To be fair to them, getting anything submitted to Bioconductor requires a ton of effort, and the payoff is often less concise code.


Same experience, about 13 years in the field; most tasks done in Python.


> All the main packages that have to do any heavy lifting have had their functions outsourced to fortran or C decades ago.

And apparently that's the problem, FTA:

> Fortran isn’t as popular as it used to be, and we encounter issues when compiling it with modern compilers like LLVM. Ensuring Fortran compiles across all desired architectures and operating systems has been a persistent challenge.

I don't compile any old C or Fortran code, so I can't say anything about whether and what kinds of problems arise from using modern compilers against modern targets.


It became painfully obvious during the development of LLVM Flang that it is very hard to define what the Fortran language is, and that it is tricky to write truly portable Fortran. See https://github.com/klausler/fortran-wringer-tests?tab=readme...


We use embedded R in production in a way some other companies would use Python, and I can say having a better compiler would definitely help.

Even if most people use R interactively, having contributors working on the compiler has many positive spillovers for the language.

Also note that the R code running behind the scenes of your scripts (powering the functions of your favourite packages) is quite a different language, using less dynamic features. This is where a better compiler would always be appreciated.


It depends on your lab/institution, and sub field, but I think Python is more common in bioinformatics than R nowadays. I prefer R - and personally developed many of the most widely used Bioinformatics packages in R, but have mostly shifted my lab to Python because of more extensive library options. Most of the big institutions - like large sequencing facilities are using Python. If you’re hiring teams of software engineers to make professional quality tools, it’s a lot easier to recruit for Python.


I've used R since it was in beta, as well as many other languages: python, perl, C/C++, Fortran, lisp, Julia, ... others I'm forgetting.

I think you're right about fads and the appeal of a new sexy language. No dispute on that point.

However, some of what constitutes fads are really more like a coincidental convergence of advantages. So for example, something gets picked up in field X because it's more convenient and all the people learned that in college, and then the same thing happens in another field, and then when libraries in the two fields interact, it's like multiplying the reasons. These network effects happen everywhere in tech. It's not necessarily good, but it happens.

With R in particular there's a long arc to reasons why it might fade. I've heard about R fading before and then it picked up again, so who knows, but it will probably fade and there are reasons why.

If what you're doing involves mathematical or computational fundamentals, all that wrapping around C and Fortran gets annoying really fast. Not everything involving heavy lifting has been coded in fortran or C already, and sometimes passing back and forth between those heavy lifting routines becomes a huge bottleneck.

R is slow as hell, and yes, you can write things in C or Fortran (or Rust etc.?), but that turns something that should be fairly straightforward into almost a library project of its own. It's just easier to be able to write all the underlying stuff and IO/API stuff in the same language and have it perform optimally. In fact, I'd probably rather just write it all in C or Fortran than write parts in one language and then wrap it in R — the R wrapping would mostly be to make it accessible to others (which is important, but there's the library bit).

R too has become horribly fragmented in my opinion. A lot of things like ggplot and tidyverse are great, but it's led to this kind of fracturing of syntax in what was already a kind of fuzzily defined syntax in some ways.

For what it's worth, I'm not sure I greatly prefer python. It's more general-purpose than R so has that advantage, but also has the same performance issues as R, and doesn't seem quite as well suited to statistical and numerical computing to me. Maybe it's just the object-heavy structure of python or something — maybe I prefer something closer to either lisp or C in the end — but python has never felt quite right to me. I'm looking forward to seeing what happens with Mojo, because that could be a real game changer, but am not holding my breath.

Julia is appealing to me and seems to check all the right boxes, has a lot of the fun I had with R early on, but I agree with some of the criticism it's gotten for library interdependence problems due to type flexibility and conversion-type issues. I also think error handling needs a lot of work. It's great when it works, but sometimes just becomes intractable to debug.

For me, statistical/mathematical computing has been exciting in the last several years in a way it hasn't been in a long time, but it also feels not quite "there" yet. I could still see a lot of currently used languages be superseded pretty dramatically by something new.


Thanks for that thoughtful comment. I particularly agree on the fragmentation of syntax: there are several syntax styles (plyr, S3 vs. S4, ...) and I always had a hard time figuring out the underlying data structures.

I think similar issues with syntax are what drove many students away from Perl, preferring Python with its one-way-to-do-it philosophy.


Agree that this discourse about “extending R’s useful life” by adding a compiler is mostly just silly. It’s equally valid to say the reason for building any of the (numerous) libraries for R is to “prolong R’s usefulness”.

I think you’ve proposed a real reason a language’s popularity changes. But let’s add a few more: First, as languages grow in use that in itself leads to more use (and vice versa). Second, users of languages use said languages as protection of their jobs, livelihoods, culture, and so on by simply not learning less popular languages or allowing the use of less commonly used languages. There is power in numbers.

IMO R’s problem has nothing to do with R’s limitations (which, where they exist, are easily worked around). To the contrary, if anything, R’s ability to enable users to do so much is more of a problem than its limitations. A company culture that assigns different languages to different organizational roles can clash with an R user who can succinctly work cross-functionally. Employees could be interested in preserving siloed roles, and employers could prefer limited scope for employees. Instead of one person building something and presenting it to users, you could have many employees serving in many roles, using the standard language for each role. R is basically a wrapper language; the ease with which third-party libraries can be built and installed for R, and R’s flexibility, are inherently “useful” but less dogmatic. Less dogmatic, but also less standard and less common.

… So I just don’t think the problem for R is that “it’s not deemed useful” and I actually think such an argument is disingenuous on the part of people who want to limit the power of R users. Granted, I think, particularly in large organizations, using the most popular languages in itself has valid justification. But the reason is to attract employees and to build siloed expertise; not to enhance “usefulness”.


Isn’t Python even more of a “do everything” language than R?


I very much think R is more of a “do everything” language because, put simply, R lets developers do a lot more in the language itself than Python does. It’s why the R implementation of polars looks basically the same as the Python one. R is like English in its ability to absorb other languages. Take a look at how many OOP systems there are. The best book on these subjects is https://adv-r.hadley.nz/metaprogramming.html.


Multiple times in the last 3-5 years, I have coded up wrappers to new/evolving C++ libraries using Rcpp for internal use by bioinformaticians.

Especially anything using multi-threaded algorithms or from the recent algorithms research world, I really get a kick out of hacking into a usable R module. It is almost like doing computer science again (as opposed to the real day job, which is data munging for reporting tools).


> professor X in field Y likes esoteric language Z, so his grad students write out libraries A/B/C to corner the Z library space

This description seems to fit Julia really well. What other languages are like this?


Interesting that they mention Unicode problems on Windows. I've run into these a couple of times, where data exported from Windows R had Unicode codepoints swapped and/or double-encoded.

The all-in-one ecosystem of R is nice, but text encoding is still a major pain point (e.g. people try to put emoji into RMD to translate into pdf via tinytex, and fail miserably, of course).


It's sad to see the downfall of R. I still use R a lot interactively, but Python takes at least 50% of the share.

R had a good run, and IMO they still have some packages that are just too good to see it languishing into the future, notably the package data.table. I have not come across a better library for data manipulation, in R or Python. The syntax is excellent and it is faster than most alternatives, especially the popular ones.

I think an important factor that contributed to R's downfall is the decreasing hype around data science, as well as the fact that the core base of R users comes not from the STEM fields but from the humanities and social sciences. I do believe that the R community of developers is dedicated and perhaps, in relative terms, more involved with their projects than a typical Python developer. But that's not enough; the sheer number of Python developers eclipses R's, and that is too hard to overcome.


I am seeing a downfall in industry, where R has a researchy image; when stuff gets serious, R scripts are rewritten into Python packages.

But in the academic world (I'm in bioinformatics)... there seems to be nothing but R, in my experience. I don't really like that, because we have Snakemake and a lot of ML stuff, all Python, and the R people have a barrier to getting started. I myself associate R with "just scripts and notebooks". The R people never seem to turn anything into a well-maintainable module. They make the notebook, use the built-in R functions, and then their work is done. It seems to be different in the Python world, where I see people writing modules that are "re-usable assets", which are then used in notebooks for data science. This is probably my industry bias, and perhaps Python-using academics also never make packages.

I guess also that there is no such thing as Poetry in R? I'm not entirely sure...


> I am seeing a downfall in industry

Exactly, it is an industry-driven change IMO. In fact, R has gained a lot of popularity in academia, especially in the social sciences, though perhaps Python has gained even more.

But in industry, R is falling hard. I also think that the growing popularity of cloud analytics platforms such as Azure Synapse (now Fabric) is a significant factor. Though SparkR is a decent R-native API to Spark, Python has so much support in those cloud analytics ecosystems that it's hard to keep doing things in R.


> and perhaps Python-using academics also never make packages.

Your guess is (largely) correct. For most users, the workflow is exactly as you described:

> They make the notebook, use the built-in R functions and then their work is done.


You know you can use Snakemake to run R scripts too.


The Poetry equivalent would be renv.

R is a highly functional language modeled after Scheme. In many ways it's more powerful than python, especially due to its metaprogramming capabilities. It's possible to write maintainable, readable, high-quality code in R (just look at the Tidyverse[0] libraries). The issue is that the user base is mostly scientists and statisticians, and they just don't.

The bioinformatics space seems especially dire with dumpster fires like Bioconductor.

[0] https://www.tidyverse.org/


This is so wrong and all FUD. At my company, R users write modular packages and deploy them to internal CRANs while Python users write scripts and email them, because it is way easier to write modular code in R: R favors easy on-ramps for new users while embracing good ideas like immutable semantics and functional programming. Further, R has targets, which is way more general than Snakemake. Even from a reproducibility perspective, R wins hands down. Getting an R package into Guix is trivial compared to a Python package. This means that in 10 years, more R code will still run than Python. It really saddens me to see the level of dishonest crap that goes unchallenged to help an inferior prototyping language kill the only lispy language with good numerical libraries.


To be fair, I was sharing my perspective, thanks for sharing yours. I guess we both offered some anecdata here.


I always thought the core users were statisticians, and the hype around data science and ML called for a new way of looking at statistics (accuracy of predictions over interpretation of models) and, with this, a new language of choice.

“R is for statistics, but if you want to do machine learning you have to use Python”

Which is ironic, given that Python's success as a data science tool is built on the back of a MATLAB clone.


Perhaps in the past, not that statisticians left, but I am referring to "users", not developers.


I have yet to find a replacement for R's dbplyr in Python. Read: Python-to-SQL code generation (dbplyr performs R-to-SQL). This is a very powerful package that alone can make me stay in R.
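For readers who haven't used dbplyr: its trick is recording ordinary function calls lazily and rendering them as SQL only when the query runs. A toy sketch of that idea in plain Python (the `Table` class and its methods are entirely hypothetical, nothing like a real library's API):

```python
class Table:
    """Toy lazy table: records verbs, renders SQL on demand (illustration only)."""

    def __init__(self, name):
        self.name = name
        self.conditions = []
        self.columns = ["*"]

    def filter(self, cond):
        # Record the condition instead of evaluating it, like a dplyr verb.
        self.conditions.append(cond)
        return self  # chainable, like a pipe

    def select(self, *cols):
        self.columns = list(cols)
        return self

    def to_sql(self):
        # Render the recorded verbs as a single SQL statement.
        sql = f"SELECT {', '.join(self.columns)} FROM {self.name}"
        if self.conditions:
            sql += " WHERE " + " AND ".join(self.conditions)
        return sql

query = Table("flights").filter("dep_delay > 60").select("carrier", "dep_delay")
print(query.to_sql())
# SELECT carrier, dep_delay FROM flights WHERE dep_delay > 60
```

dbplyr goes much further (it translates unevaluated R expressions, not strings, and targets many SQL dialects), but this is the general shape of the technique.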


Ibis would be the equivalent: a unified frontend for tabular data. It was created by Wes McKinney, the pandas creator, who now works at Posit (the old RStudio).

https://ibis-project.org/


There are several. I am tinkering with one called data algebra: https://pypi.org/project/data-algebra/.


I think one problem R might have is that the core language is developed in a fairly tight-knit and informal way. There's no equivalent to Python's PEP process, say. For a young language the informality might be an advantage, but for an older one the risk is that the core developers become a bit insulated from the community of users.


> the risk is that the core developers become a bit insulated from the community of users

I think I like that, actually. The core is quite conservative, meanwhile the packages that the average R community member uses break their own backwards compatibility every few months for silly reasons, like renaming an argument from `size` to `linewidth` in a function that was otherwise backwards compatible for a decade.


Great write-up, thanks for your decade of work!

R is still unbeatable for wrangling tabular data, visualising, and fitting complex models on tabular data (complex up to GAMs let’s say).


I can also see that R is having trouble simply bc a lot of people already know Python and don't want to learn "a new language".


It would be better to let it die and port the libraries over to Python instead.


People have tried to port core R data science packages like dplyr to Python many times now and all that's happened is Python has fallen further behind.

It's not just a matter of manpower and funding, which Python has more of. Python is actually a less capable language in some important ways. It's not that expressive, and has basically no affordances for domain-specific languages.

I'll be happy when we have a successor language for R, but it won't be 2024 Python.
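One concrete version of the "no affordances for DSLs" point: dplyr can capture an expression like `x > 3` unevaluated (non-standard evaluation) and translate it, while Python evaluates arguments eagerly, so ports have to fall back on lambdas or strings. A toy illustration, assuming nothing beyond the standard library (the helper names are made up):

```python
# In R, filter(df, x > 3) receives the *expression* x > 3 and can inspect
# or translate it. In Python, the expression is evaluated before the callee
# ever sees it, so libraries resort to workarounds:

rows = [{"x": 1}, {"x": 4}, {"x": 7}]

# Workaround 1: pass a callable (the style pandas supports via .loc/.pipe).
def filter_rows(data, predicate):
    return [row for row in data if predicate(row)]

by_lambda = filter_rows(rows, lambda r: r["x"] > 3)

# Workaround 2: pass a string and evaluate it later
# (roughly the pandas DataFrame.query approach).
def filter_query(data, expr):
    return [row for row in data if eval(expr, {}, dict(row))]

by_string = filter_query(rows, "x > 3")

print(by_lambda)  # [{'x': 4}, {'x': 7}]
print(by_string)  # [{'x': 4}, {'x': 7}]
```

Both workarounds are noisier than the R original, which is part of why dplyr ports never quite feel like dplyr.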


Could not agree more

> I'll be happy when we have a successor language for R, but it won't be 2024 Python.

When it comes to statistics, and especially having to program stats analysis instead of just calling libraries, R is a much better language than Python or Julia. When it comes to other things, they are better than R. I use all 3 daily.


And with Quarto being available, why wouldn't you switch to whatever language is best for the current task? I'm throwing ObservableJS in there now too with the Quarto compatibility. You can do your data cleanup in R/Python, then use native web libraries for displaying pretty plots in the browser.


It is clear that R is not doing as well as Python. It is not clear that it is not doing well overall. It is still the best for statistical analysis and interactive usage (sklearn, statsmodels, etc. are not good enough). Maybe there is a healthy community that can thrive. For the time being, it seems that they are trying to eat into the SAS user base.


The R code I run daily has at least 10 stats libraries that don't exist in Python, an important new one dropped a few days ago.

And as someone who uses both languages daily, Python is much better at some things, R in others. R is much better in coding stats analysis.

Competition is great, may both thrive.


And I don't think younger statisticians, those that will have 30+ years in the career tank, are now favouring Python over R (a few, Julia perhaps). So I don't imagine new stats functionality dropping in R first is going to change any time even close to soon.


yes, that's what I see. R is so dominant in stats, and the momentum away from it around me is more towards Julia than Python.


I’m curious and new to R. What are those R libraries you’re referring to?


most are specialised in my field. For the more general ones, I cannot work without dynlm and plm. Just last week I needed to compare the empirical distribution of something to a known reference, with either the empirical CDF or a QQ plot, and searched Python, Julia and R. The R libraries were by far the best. More subjectively, I like data.table better than any alternative in any language for the type of work I do.


There is nothing like data.table in Julia, Python or JavaScript (if you want to stick to high level programming languages). It's the best combo you can get for speed + syntax.


Syntactically, no. But functionally Polars is new in the space.


Yes, Polars does extremely well in popular data manipulation benchmarks.



