Why do we use R rather than Excel? (shkspr.mobi)
241 points by edent 6 months ago | hide | past | favorite | 251 comments

If you need to do something once, use a GUI (Excel).

If you need to do something ten times, use hotkeys and shortcuts.

If you need to do something a hundred times, write a script (R).

I usually use the command line as the example for why writing code and scripts are better than the more intuitive and lower-learning-curve GUIs.

If I want to move a file from one folder to another then I just drag it across. Easy.

If I want to move a thousand files from one folder to another, I will benefit from learning `CTRL-A` or shift-clicking (slightly more obscure than the 'intuitive' drag each file across individually or drag a large box around them all to select them).

If I want to move a thousand files beginning with 'UTR-77' and ending with '.csv' then I would benefit from learning `mv UTR-77*.csv $folder`, but I could still do it manually if I didn't know that was an option.

If I want to move a thousand files beginning with 'UTR-77' to another folder at a moment's notice, or at 1am on a Thursday, then the only real option I have is scripting.
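
A minimal sketch of that scripted move (Python for illustration; the paths and filenames here are invented for the demo):

```python
# Sketch of the scripted move; paths and filenames are invented for the demo.
import shutil
from pathlib import Path

src = Path("/tmp/demo_inbox")
dest = Path("/tmp/demo_archive")
src.mkdir(parents=True, exist_ok=True)
dest.mkdir(parents=True, exist_ok=True)

# Pretend some files arrived.
for name in ["UTR-77a.csv", "UTR-77b.csv", "unrelated.txt"]:
    (src / name).touch()

# Move everything matching the pattern; unrelated files stay put.
for f in src.glob("UTR-77*.csv"):
    shutil.move(str(f), str(dest / f.name))

# For the "Thursday 1am" case, point a cron entry at this script, e.g.:
#   0 1 * * 4  /usr/bin/python3 /path/to/move_utr.py
```

(In cron's day-of-week field, 4 means Thursday.)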

I almost feel like before people learn the 'basic' stuff as outlined at the beginning of the article, they should be shown some 'magic' that is only really possible with scripting so that it's clear from the outset why you wouldn't 'just use excel'.

I often write a script even if I only have to do something twice. In this case, the point is not to save time, it's to be secure in the knowledge that I did the thing in exactly the same way both times, because it was the same script that did it both times.

(And then it saves time anyway, because it turns out I had to do the thing 20 more times after all.)

It’s not foolproof, but the heuristic that your stakeholder will ask you for the same thing next week (or immediately ask you to do the same thing but slightly differently) has held pretty true for me.

yes, I've got a little heuristic I use to tell the beginner analysts at work when to think about automating their work or putting it into a script.

if you're doing something once and once only, maybe it's a candidate for the manual or gui way.

if you think you might do it twice, it's almost certainly time to automate or start programming it.

Anything encountered in business that you encounter more than 1 time is likely to be encountered N times more.

The good old "allow it never, once or n times" :-)

This is spot on! I think sometimes if you do the same thing 3 times it is already better to script it, unless you are sure that no further modifications will take place. Scripting ensures reproducibility, and it is invaluable to be able to follow your logic later, when you have already forgotten how you came up with that chart.

this is obviously laziness, the first of the three virtues http://threevirtues.com/

Exactly. A little {insert favorite language} snippet is immediately inspectable, also weeks or months or years in the future.

You underestimate my ability to write code that I myself can’t make sense of years (or sometimes even months or weeks) in the future.

Same for me – but that's usually an indicator that the spreadsheet situation would be even worse. We just suck, apparently, but we suck less with code :-)

Also, if you need to do something (even once) and have to know how you did it three months later, write a script.

I usually save my command history to version control every time I git commit. It helps a lot for less frequently visited projects.

You can set your HISTFILE per project. That’s what I do in vscode (terminal env), so I have a history per project.

Other than that, I usually document some ops/quick read me about how to use or prepare the project

Yes, the lack of reproducibility is really Excel’s biggest weakness in data analysis. Is this 5th workbook that is a 4th generation derivation from the original data statistically valid? Can you go back to each step and verify it was valid? A script that does all processing and analysis starting from raw data can be validated and audited.


> I want to move a thousand files beginning with 'UTR-77' and ending with '.csv'

1. Use a file manager with stable sorting [1] (I use Thunar which does this, but I suspect lots of file managers keep the sort order stable).

2. Sort by type.

3. Sort by name.

4. Select the first file named UTR-77.

5. Scroll to the last file, and Shift+Click it.

6. Cut then paste to your desired directory.

[1] - https://en.wikipedia.org/wiki/Sorting_algorithm#Stability

This feels pretty squarely in the realm of "but you could still do it manually". The CLI method still seems cleaner and faster, with less likelihood of error since you don't select the files manually.

I had the same initial reaction, and I think conversations like this often include desperate justification of GUI workflows driven by an inferiority complex.

But the parent comment was responding to a portion of its parent that specifically said "there's no option other than scripting". It sounds like he's responding to that, not making some general claim about the GUI being a better option.

How is not selecting it manually but implementing code that could select the wrong files any better?

> 5. Scroll to the last file, and Shift+Click it.

With thousands of files, this won't be particularly fast or easy.

If you know how to use a scrollbar, it is fast and easy. Grab the thumb and drag it down. If you overshoot or have trouble landing on the last file, use the page up/down areas of the scrollbar or its up/down arrow to fine tune the scroll position. Should only take a few seconds.

Another way is to hold down the Ctrl key and use the PgDn/PgUp keys to scroll. If you hold down Ctrl+PgDn you can whip through thousands of files quickly. Because you have the Ctrl key down, it won't affect your selection of the first file.

If all you know is two finger touchpad scrolling, then it will be very tedious.

Unfortunately, from observing a number of people - and apps that hide the scrollbars - it seems to me that two finger scrolling has largely taken over from scrollbar scrolling or the scroll keys.

There are many circumstances in which it's not.

- Remotely accessing a system (terminal or GUI).
- An overloaded system (GUI response is ... inconsistent).
- Remotely accessing a system from a touch-based device. An increasingly common scenario.
- Walking someone through a process (text is unambiguous).
- Repeated operations (something that has to be done multiple times, in multiple directories, on an ongoing basis, on a scheduled basis, reliably, consistently, provably, with documentation and debuggability).

Additionally, the scrollbar seems to be increasingly unpopular. I'm on record as not being happy about this.

https://ello.co/dredmorbius/post/0hgfswmoti3fi5zgftjecq (HN discussion: https://news.ycombinator.com/item?id=21356511)

I enjoyed your flippin' scrollbar rant! I couldn't have put it better myself.

Thanks :)

rstudio is an excellent gui for R. So you can already start with R.

So is https://exploratory.io - I almost always start with it and export the R script once I have some intuitive confidence in my data.

> I would benefit from learning `mv UTR-77*.csv $folder`

In Far Manager (freeware, open source) and Total Commander (commercial, not too expensive, trial available), the numpad `+` key opens an “expand selection” popup, where you can type “utr-77*.csv”, and it will select just those files.

Unlike the command line, you can inspect what had been selected before moving these files. You can also use insert/numpad +/numpad - keys to modify the set of files going to be moved/copied/deleted/zipped/etc.

Most file systems and file managers don’t have undo support. It can be important to review what is going to happen before actually moving any files.

This is a great comment and the line of reasoning for work at different scales makes so much sense.

Not sure I agree with this. Most statistical analysis is a 'do once' thing, but using Excel would not be a good choice. Reproducibility and immutability of the source data are two good reasons to write a script.

And then there is the fun tasks, which I have always considered to pit the CLI guys against the GUI: Copy all cat pictures to different directory. (No metadata or names)

But why not use the scripting languages built in to Excel?


I found that "Automation"[1] by xkcd (R Munroe) closely captures the lifecycle of my "clever" scripts.

I have a few of them glancing at me from the corner of my desktop and I hope to do something "over the summer".

[1] https://xkcd.com/1319/

> If I want to move a file from one folder to another then I just drag it across. Easy.

Even this is easier with scripting, especially if you're already used to thinking in wildcards and tab-completion. Scrolling, hunting for files, dragging, clicking: these are inherently clumsier and slower steps, optimized for new-user intuitiveness and simplicity over efficiency. The upfront investment of making your brain think in CLI is fairly high, but once you've done it, there's vanishingly little reason to bother with file browsers. I don't think I've used one in a decade, even with (eg) Nautilus's ability to match wildcards with ctrl+s.

For those who haven't used it, ctrl-r in the command line is the best thing in the world. It does a reverse-search over previous commands. Set your max history size to unlimited, and your CLI will remember /all/ commands you ever type, making it very easy to find long, strange incantations that you figured out once, so long as you can remember some small part of the command.

For example, how did I do that magical ffmpeg thing last time? Just hit ctrl-r, type ffmpeg, and keep hitting ctrl-r to find previous examples until I find what I'm after.

For more complex commands I use often, I sometimes comment the end of my line with a keyword I can search on.

Does this work on Mac OS?

I’ve got a bunch of saved notes with various incantations that would become redundant if so!

> I’ve got a bunch of saved notes with various incantations that would become redundant if so!

Throw them in scripts! Another advantage of using a CLI is that it provides a friction-free path from ad hoc commands to saved commands to messy scripts to properly supported tools. I've ridden this gradient multiple times professionally, especially now that I work with a bunch of PhDs who are less comfortable around OSes than I am. It's a good feeling to have such a clean pipeline from "this command/series of commands feels awkward to me" to a reviewed, checked-in tool that becomes a critical part of the team's workflow. The best part is that there's value at every incremental step, so you don't need to invest any organizational effort, which is especially useful in a chaotic execution environment.

Seems like it should!


Good to double check how to turn off the history size limit before throwing away the post it notes entirely.

Big big reason: R doesn't alter data without being asked to, Excel does so, silently, sneakily. Anything important should not be done in Excel.

https://www.theverge.com/2020/8/6/21355674/human-genes-renam... https://stackoverflow.com/questions/165042/stop-excel-from-a...

While that is a very real problem, it isn't totally persuasive. It is just as easy to create traps in R (I lost days of my youth to R's challenging-to-defend jungle of an almost-type-system). It is even harder to debug than Excel too - I would expect fewer bugs are found in R scripts because fewer people are up to the challenge of trying.

People shouldn't do important work in Excel. If it is important, people should be involved who have invested the time in learning something more powerful. Indeed, we could ask they aspire all the way to good practice and store their data in a database and their code in git. But there needs to be a process to verify model correctness no matter what tool is being used and bugs will exist in R as well as in Excel.

I agree that R isn't ideal, and a proper programming language is preferable to both. Which one of course will depend on the domain, but proper control over float precision, fixed-point arithmetics, bignum and debuggability isn't good in R either. It's just that R is still far above Excel.

I mostly agree, but at least Excel is more easily inspectable by a human, which makes a certain class of errors (specifically, silly logic bugs) less likely to get through.

Not really. Logic in Excel is based on stuff like SUMIF( ...) which is totally unreadable without lots of context about the cell or range it refers to. If someone fumbled the fill command and some cell is missing the formula? You'll never know and the result will be automatically and silently zero. References to that result will just use the zero as input. Mixed fixed/nonfixed references like A$3? Unreadable and hard to understand. References per se? Unreadable because undescriptive, maybe there will be a column or field label, maybe there won't. Also, formulas are always hidden by default, you'll just get a sea of numbers.

Calling Excel easily inspectable is laughably wrong imho. Just the opposite.

Data tables and keeping one tab per table solves all of those.

Sounds more like an issue with the skill of average excel users than a feature gap.

It's not obvious to me that Excel is more easily inspectable, can you say more?

To me, R seems more easily inspectable, as all the logic of a program is visible just by looking at text files, where in Excel it's hidden "under the surface", you have to click on cells, look at what's there, go click on other cells that relate to it, remember what you were looking at in the first one that's now invisible, etc.

It's the visualization of logic that I'm thinking of. You can press F2 on a cell and visually see what data is being pulled into the function, and you can easily manually check that that single calculation is accurate by re-creating that calculation by hand, and then infer that all the other stuff in the column is also accurate.

For anything very complicated though, I'd prefer R, as Excel eventually gets unwieldy. Although, I'm saying that from the perspective of being a reasonably experienced coder. The majority of the population should just use Excel, especially in a work context in non-technical teams. No matter how much you push for R, other people in the team aren't going to see the value and aren't going to go along with it, and your R code will be useless after you've left.

Why isn't R a proper programming language? What would you use over R for data analysis or statistical analysis?

R has a beautiful functional ability based on S-expressions, allowing some clever stuff to be done (i.e. tidyverse), incredibly fast (data.table is faster than Python, Julia, Matlab etc).

Can data.table play well with the tidyverse yet?

And as a side note, I believe it's moving down the rankings in the h2o benchmarks [1].

[1]: https://h2oai.github.io/db-benchmark/

Oh that's great. I'm really glad to see Julia getting faster there.

> Anything important should not be done in Excel.

Hold on a second while I go shut down the global economy for two years so we can teach every finance person how to program.

Do not shut down the global economy. Just maybe move to tools with fewer issues around process governance:

"The 7 Biggest Excel Mistakes of All Time"


"The financial fails and business risks of spreadsheets"


"Nightmare on spreadsheet: take Excel use seriously"


"Excel – The Dirty Secret"


"8 Challenges When Using Excel For Accounting"


As though mistakes wouldn’t happen with non-programmers working in R.

In R, you can use accepted methods of software development and engineering, such as version control, automated testing, CI/CD to catch mistakes and prevent them from causing harm. In Excel? Not really.

Do you think the average finance department employee is going to do that?

Especially those departments should do so and learn to do it. Financial departments are always the ones complaining about the cost of mistakes, compliance risks (of the monetary kind) and doing due diligence. Preventing the production of wrong figures in financial excel sheets ticks all those boxes. They should bloody well be doing it, but I'm also not holding my breath ;)

And there are massive bugs in programs that use these tools like the Linux kernel still.

Would not necessarily be a bad idea.


Not that this is Excel's fault, but researchers should definitely either seriously learn how to use computer stuff, or just don't.

A person can't drive on the highway without a license, the same rigor should be applied here, especially in academic circles.

This. Most finance people who live by Excel have ninja-level skills with Excel. I have had to unpack logic from some Excel sheets into webapps and they had so much sophisticated programming embedded in them. And it is so much easier and more intuitive to use Solver for linear equations in Excel than in any web application.

Regarding graphing ability, R may have more power, but plotting graphs in Excel is so much more WYSIWYG.

finance people who learn to program will outperform those who don't.

I believe that’s true. There still aren’t enough to go around and won’t be for a very long time.

Excel is the closest we’ve come as an industry to building a tool that enables non-programmers to program. If it disappeared tomorrow, as some arrogant posters apparently wish it would, tremendous value would be destroyed. Not just in terms of existing workflows but in terms of new workflows that would not be done in R but instead would be done by hand or not at all.

I’m a hobbyist programmer who has worked in various finance jobs and is now working in investment banking. I would love to use programming in finance but I have to say it’s come in handy very little.

Well, prolly not in IBD. But in Sales and Trading or at a lot of quant funds it is very common.

Actually when you use R functions to import CSV files, they also guess data types and make some conversions by default. The difference is that in R you can turn them off if you want to. But in Excel there are automatic changes you can't turn off.

You can set the data type to "text" when importing CSVs, that turns most things off.

If you try reading an Excel file via SSIS you are out of luck. You can't turn off the auto-guessing feature.

Most, but not all. That is the entire point: Excel is never safe in any configuration.

Not sure about that. It is my understanding that Excel only tries to guess the type of an input if it has not been explicitly specified by the user.

The problem with "things that look like dates being interpreted as dates" comes from not specifying that a column has type "text".

Even if you do specify a column as text, Excel WILL sometimes assume that it contains dates.

Happened to me multiple times, even when I set each cell as text.

It seems that copy-pasting tab-separated values resets the cells to their default state.

If "not all", can you provide an example of what gets converted if you specify text?

Just want to know to avoid future gotchas.

I'm not sure, have to try it out but text fields can be delineated by starting with a single quote ('), that might be a special case. It might not be though, I don't have Excel open to check. Pretty much everything is left alone if you specify "text" at the import stage though. (This is using the text import wizard, or text-to-columns)

As a total R beginner I was hugely put off by the complexities of data cleansing, especially columns with different data types.

"Floating-point arithmetic may give inaccurate results in Excel"


"Align numerical precision Excel 2013 and R"


"Numeric precision in Microsoft Excel"


Almost all of these are issues with floating point in general, I'd say. Some are with naive implementations of Excel functions, fair enough (for example variance as sum x_i^2/n - (sum x_i/n)^2), which are indeed better implemented in R or Julia.
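
A quick sketch of why that naive one-pass variance formula fails (toy numbers, Python for illustration): with a large mean and a tiny spread, the two big terms cancel catastrophically, while the two-pass formula stays accurate.

```python
# Toy demonstration (invented numbers): the "naive" one-pass variance
# formula cancels catastrophically when the mean is large and the
# spread is small; the two-pass formula stays accurate.
xs = [1e8, 1e8 + 1, 1e8 + 2]
n = len(xs)

# One pass: difference of two huge, nearly equal quantities.
naive = sum(x * x for x in xs) / n - (sum(xs) / n) ** 2

# Two passes: subtract the mean first, then square small residuals.
mean = sum(xs) / n
two_pass = sum((x - mean) ** 2 for x in xs) / n

# The exact population variance of {1e8, 1e8+1, 1e8+2} is 2/3;
# two_pass recovers it, naive does not.
```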

IEEE 754 just has unintuitive properties.

Kahan (the "father of IEEE 754") has a rant (among many others) about Excel as well, and how it tries to hide some of the floating point complexities more or less successfully:

Floating-Point Arithmetic Besieged by “Business Decisions”


But you also have issues like 'Catastrophic cancellation':



"OOPS XL Did It Again"


From the Wikipedia article:

"Although Excel can display 30 decimal places, its precision for a specified number is confined to 15 significant figures, and calculations may have an accuracy that is even less due to five issues: round-off, truncation, and binary storage, accumulation of the deviations of the operands in calculations, and worst: cancellation at subtractions resp. 'Catastrophic cancellation' at subtraction of values with similar magnitude."

That's an issue inherent in floating point computations with fixed precision (such as IEEE 754).

  julia> 1e20 + 1000 - 1e20
  0.0

  julia> 1e20 + 10000 - 1e20
  16384.0

  Python 3.9.6 (default, Jun 28 2021, 19:24:41)
  >>> 1e20 + 1000 - 1e20
  0.0
  >>> 1e20 + 10000 - 1e20
  16384.0

We never got the results of our covid tests from early in the pandemic because someone imported the medical data into Excel, which predictably and destructively modified the patient IDs by stripping all leading zeroes.

I've used perl, python, and R for scientific data for a really long time but have always made a concerted effort to avoid Excel. My reasoning feels the same as when people say Java is the best language because it can be run on any device, which is like saying anal sex is the best sex because you can do it with any animal.

Maybe I'm missing out on a great experience, but the notion has always made me uncomfortable.

> My reasoning feels the same as when people say Java is the best language because it can be run on any device, which is like saying anal sex is the best sex because you can do it with any animal.

Please tell me you have said this to someone in a work meeting. This is hilarious.

Even if it wasn't used, I may end up using it at work myself.

There is nothing sneaky about it. Excel imports things it thinks are dates as dates. It's not a silent modification. It's a loading issue. Properly loading data into R is not particularly easy either.

So if I have Zip codes with leading 0’s, use R?

>So if I have Zip codes with leading 0’s, use R?

Several other software tools also mess up leading 0s including R if used in a naive way without specifying extra options. My previous comment about this: https://news.ycombinator.com/item?id=25017116

Like R, MS Excel can also preserve leading 0s -- if you specify the option on import. (Click on Excel 2019 Data tab and import via "From Text/CSV" button on the ribbon menu and a dialog pops up that provides option "Do not detect data types" (Earlier version of Excel has different verbiage to interpret numbers as text))

The problem that I've had with this issue and Excel is that Excel performs overly-clever-automagic type conversions in certain situations where you don't want them. And then especially when you're collaborating on files with non-technical normal office workers, they're not expecting this behavior. So the end result has been that this issue has bitten me many times, despite knowing how to ideally tell Excel to handle it.

If you have zip codes, don't store those in a numeric type (doesn't matter if it's Excel, R, or any other system)

Zip codes aren't numbers; they are strings that happen to contain only numeric characters.
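
The same point in code (toy data): a plain CSV reader keeps every field as a string, so the leading zero survives, and turning a column into numbers is an explicit, opt-in step rather than something done behind your back.

```python
# Toy data: csv.DictReader returns every field as a string,
# so "00501" keeps its leading zeros unless you convert it yourself.
import csv
import io

raw = "zip,count\n00501,3\n10001,7\n"
rows = list(csv.DictReader(io.StringIO(raw)))

zips = [r["zip"] for r in rows]           # zip codes stay strings
counts = [int(r["count"]) for r in rows]  # conversion only where we ask
```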

WAIT. All these words on SO and here have been written because people don't know to change the column type from "General" to "Text"?

They would like to be able to just open a csv file in excel without going through a file import wizard. Changing the column type after the fact is too late.

If you get to the point that you are using R and Python, it should be because you are working with larger data sets, or the work is specifically sophisticated.

If you can learn a language, you can constrain type conversions easily once you have the knowledge.

It’s not like there aren’t painful gotchas in other tools - it’s an issue if you aren’t aware of them and they impact your work.

If it’s big, unusually complex, you probably want a DB before analysis.

If it’s repetitive, or advanced modeling/ml: python/r

I haven't used Excel since I was a child and learnt to program. I didn't realise just how bad it was. Someone asked for a list of ids for some inventory and they imported it into Excel. They were all 4-digit alphanumeric, but the ones that looked like numbers, like "0001", it just treated as a number and showed as 1, with no option to disable this ridiculous behaviour.

Excel doesn't alter data without being asked either. What it does do by default is try to change how the data is displayed. If the column is properly set to the correct data type the data will be shown correctly because Excel doesn't change the actual data. Column types can be set when csv files are imported.

The problem described in the article isn't an Excel issue. It is an issue of the geneticists' failure to learn the basics of how their tools work.

> Excel doesn't alter data without being asked either. What it does do by default is try to change how the data is displayed.

This is easily falsifiable. In a "general" cell, when I enter 0002, it gets changed into 2, not just in display, but in actual content. When I change the cell type to text, it'll still be 2. Only if I enter 0002 after changing the type to text is the content kept.

Similar when I want to have the text 3/17 or SEPT1 in a cell, I have to format it before typing or the data does get altered. If you try setting it to "text" after you typed it, you'll get some number that's not very useful to you.

If you enter it prior to specifying the formatting you are entering it as a date. The number you see when you change to text is the numeric value reflecting the date/data you entered.

Just type ‘0002.

The only way to prevent Excel from converting date-like number values into 'real dates' is to format the number values as text, or to escape each and every number value so this doesn't occur.

This is not just a workaround - it's recommended by Microsoft, because you literally cannot turn this functionality off.

How is that not an "Excel issue"?

That is how spreadsheets work. Try Libreoffice or Google Sheets. You will find they operate the same. Spreadsheets format all cells as a generic type, and the value entered gets assigned a more specific type by the spreadsheet application. If you want to ensure that the value you entered is interpreted as a specific type, you need to identify that to the spreadsheet application prior to entering the value.

I was responding to the claim that Excel "doesn't alter data without being asked either." I guess it does, and it should?

It also interprets and coerces values in formulas.

Empty cells are interpreted as zero, which can be downright catastrophic, if the data is just missing.
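A toy sketch of why "empty means zero" is dangerous (invented numbers): a missing value treated as 0 silently drags the average down.

```python
# Toy data: None stands in for an empty cell.
readings = [10.0, None, 30.0]

# Spreadsheet-style: the blank counts as 0 in the average.
as_zero = sum(v if v is not None else 0.0 for v in readings) / len(readings)

# Treating it as missing: average only what was actually measured.
present = [v for v in readings if v is not None]
mean_present = sum(present) / len(present)

# as_zero is about 13.3, mean_present is 20.0 - same data, different story.
```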

For non-programmers who do any kind of data processing, spreadsheet applications (Excel, Numbers, Google Sheets) are the closest thing to programming they have, and have a lot less overhead - just throw in the data, write a formula, you get results, no compilation or fidgeting with input and output and so on.

The steps of an algorithm are reflected by cells that reference cells that reference cells.

I've always thought that every highschooler should be taught how to use Excel properly, it really is a superpower in many contexts.

> it really is a superpower in many contexts.

It's a trap because once you get comfortable in Excel you have a lot of resistance to try anything more productive than Excel. Seen that numerous times with people who work really fast with Excel yet end up very limited as to what they can actually deal with beyond simple problems.

Can confirm. When I started working in academia, we built very sophisticated Excel models. But every new project would ask for ever more sophisticated sheets and workarounds. Once the calculations took more than a few minutes, I moved on to macros. But at that point I realized that I was using Excel as nothing more than a data container. That finally pushed me to learn Python, and I never felt more liberated. Whenever I supervise PhD students now, I push them hard to learn programming early on. I am pretty sure not one of them has regretted following my advice. It simply adds a few degrees of freedom to their work.

Disagree. Wanting to automate and learning functions within functions pushed me to learn how to program in python and R

I was about to comment the same thing. Learning how the logic in excel worked was a fantastic first step into learning programming (python specifically).

Depends, I worked for a few life science companies and was quite surprised with the VB adoption.

Basically it grew out of the fact that many of these companies are focused on Windows, given the software of the data readers and laboratory robots.

So it is quite common to have Visual Studio licenses around.

A common pattern for the history of many VB packages I found out across the business units, was software that started in Excel, alongside VBA macros, and eventually was ported into VB.

I've always thought that every highschooler should be taught the basics of programming with an easy language like Python instead. That really is a superpower in a lot more contexts.

Yes, but often different contexts.

Let's say my volunteer org wants to keep track of events, who volunteered in them, etc. and wants to give an award to the volunteer who gave the most hours. How do you do that in Python? Why would you?

For me, Excel is a tool mostly for entering data manually. In the case you described, it would be good for that purpose. I suppose you would have a column for each volunteer and then just enter data. Then you can read the file with Python (pandas), sum over the columns, and report the volunteer with the most hours. Of course this case is also easy to do in Excel; the idea is that if you know more powerful tools (programming in general), you will be able to do things that you would otherwise regard as too time-consuming. Maybe you then want to give different awards for age ranges. With too many volunteers it would no longer be so easy in Excel, but in Python you would just know how to do it.
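
For what it's worth, the programming side of that workflow can stay tiny; a toy sketch with invented names and hours:

```python
# Toy version of the volunteer-hours task (names and numbers invented).
hours = {"Ada": 12.5, "Grace": 20.0, "Edsger": 7.0}

top = max(hours, key=hours.get)  # volunteer with the most hours
total = sum(hours.values())      # total effort across the org
```

In practice you'd read the spreadsheet export instead of a literal dict, but the award logic itself stays this short.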

It is also a reasonable output format. I have written more than one Python thing that ended up by generating an Excel which I then emailed to people. It was imperfect but saved me a lot of “can you make a graph blah blah blah” type work.

That's becoming more common in the US. My spouse teaches middle school STEM and programming is part of the curriculum, beginning with Scratch JR and working up to Python and JavaScript. It's not ubiquitous yet, but it's coming along.

They are in the UK though. I feel like Python is a poor choice these days; Typescript would make more sense.

Yeah, type safety alone makes a HUGE difference.

Personally, I'd go for something like Haskell or F# though, because of an even better type system IMO.

Although, for learning it, Typescript is probably easier and more applicable in the real world.

Knime is way more powerful, but it requires more learning too. It has nodes for programming your own transformations (R & Python, and I'm pretty sure something for Java) if you want to integrate complex stuff, but it's very capable out of the box.

Excel has PowerQuery too, which is very nice, but you hit the ceiling pretty easily. Knime eats a lot of data with a modest PC; Excel really struggles with large datasets, no matter how you use or tune PowerQuery.

I know people here on HN talk down visual programming, but I've done pretty heavy and complicated stuff with it. It would be way more complicated with pandas.

This week I got a bit more familiar with R while adding it as a scripting language for a data IDE I'm working on (it already supported other languages).

It is a very hard language and community to get into! The documentation is very sparse. Library documentation is published as PDF (I guess?) and also very sparse. The default `print` behavior is pretty hard to understand. It's 1-indexed and it took me a while to realize every time I think `array[1]` I should write `array[[1]]`. I can't tell the difference between `<-` and `=`.

My guess is that it was probably a great language at some point but is way behind other numeric scripting languages like Julia or Matlab in terms of community attention and language ergonomics.

I know it's highly used but other than legacy reasons I'm not sure why you'd want to learn it over Julia.

I'm also curious to investigate how the aspects I'm critical of differ in Octave.

Edit: totally fair, 1-indexing shouldn't have been a "critique". Lots of languages do that.

I highly recommend anybody getting into R to skip the base language (which indeed is ancient and full of gotchas) and go straight for the Tidyverse[1]. You can always go back in and learn how to do things the old way later.

Over the last decade, the R community has largely standardized around tools like dplyr, ggplot, tibble, purrr, and so on that make doing data science work way easier to reason about. Much more ergonomic. At my company we switched from using Python to using R for most analytical data science work because the Tidyverse tools make it so much easier to avoid bugs and weird join issues than you get in a more imperative programming environment.

[1] https://www.tidyverse.org/

i would recommend getting comfortable with doing stuff with base R, then trying tidyverse. Starting with dplyr might get you results quick, but its "special evaluation" actively confuses your understanding of how the base language actually works (speaking from experience with an R course and subsequently helping other confused folks)

Consider this example:

  # base R
  starwars[starwars$height < 200 & starwars$gender == "male", ]
  # dplyr
  starwars %>% filter(
    height < 200,
    gender == "male"
  )
(Source: https://tidyeval.tidyverse.org/sec-why-how.html)

Where'd `height` and `gender` come from in the dplyr version? They're just columns in a DF, not variables, and yet they act like variables... Well that's the dplyr magic baby!

dplyr (and other tidystuff) achieves this "niceness" by doing a whole bunch of what amounts to gnarly metaprogramming[1] -- that example was taken from a whole big chapter about "Tidy evaluation", describing how it does all this quote()-ing and eval()-ing under the hood to make the "nicer" version work. it's (arguably) more pleasant to read and write, but much harder to actually understand -- "easy, but not simple", to paraphrase a slightly tired phrase.


[1] IIRC it works something like this. the expressions

  height < 200
  gender == "male"
are actually passed to `filter` as unevaluated ASTs (think lisp's `quote`), and then evaluated in a specially constructed environment with added variables like `height` and `gender` corresponding to your dataframe's columns. IIRC this means it can do some cool things like run on an SQL backend (similar to C#'s LINQ), but it's not something i'd expose a beginner to.
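a rough python analogue of that mechanism (toy code, invented names, nothing to do with dplyr's actual implementation): the condition arrives unevaluated and is run in an environment built from the table's columns:

```python
# toy model of "tidy evaluation": the condition is captured unevaluated
# (a string here, standing in for an AST) and evaluated row by row in an
# environment whose variables are the table's columns.
table = {
    "height": [172, 202, 96],
    "gender": ["male", "male", "male"],
}

def filter_rows(table, expr):
    n_rows = len(next(iter(table.values())))
    kept = []
    for i in range(n_rows):
        env = {col: vals[i] for col, vals in table.items()}  # columns act as variables
        if eval(expr, {}, env):
            kept.append(i)
    return kept

print(filter_rows(table, "height < 200 and gender == 'male'"))  # [0, 2]
```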

My experience is that this weird evaluation order stuff is only confusing for students with a lot of programming experience who already expect nice lexical scope. For those coming in from Excel, the tidyverse conventions are no problem and are in fact easier than all the pedantic quoting you have to do in something like Pandas. It only gets confusing when you want to write new tidyverse functions, and even then, base R isn’t any simpler: the confusing evaluation order is built into R itself at the deepest level.

EDIT: i gotta admit, you sound like you've got more experience with teaching R than me. so perhaps my opinions here are a bit strong for what they're based on, i.e. tutoring a couple of non-programmer friends and my own learning process. still...

> My experience is that this weird evaluation order stuff is only confusing for students with a lot of programming experience who already expect nice lexical scope

fair point, but for the most part, R itself does use pretty standard lexical scoping unless you opt into "non-standard evaluation" by using `substitute`[1]. so building a mental model of lexical scoping and "standard evaluation" is a pretty important thing to learn. after that, the student can see how quoting can "break" it, or at least be able to understand a sentence like "you know how evaluation usually works? this is different! but don't worry about it too much for now". and i think dropping someone new straight into tidyverse stuff gets in the way of this process.

> and even then, base R isn’t any simpler: the confusing evaluation order is built into R itself at the deepest level.

i mean, quoting can't really work without being deeply integrated into the language, can it? besides:

- AFAICT base R data manipulation functions don't use it a lot. [2]

- for the most part, R's evaluation order can be ignored (at a certain learning stage) because it's not observable if you stick to pure stuff, which you probably should anyway.


[1] http://adv-r.had.co.nz/Computing-on-the-language.html#captur...

[2] admittedly, stuff with `formula`s is similarly wacky, and if you're doing stats you're going to run into that sooner or later...

It's true that if you write code that is pure and error-free then you will never bump up against R's strangeness.

But try out this bit of base R:

> hello = function(cats, dogs) { return(cats) }

> hello(100, honk) # where honk has never been defined

> hello(100, print("hello!"))

> hello(100)

yeah, the intersection of lazy evaluation and side-effects (incl. errors/exceptions) gets confusing, you definitely have to be there to help the student out of a jam. but i think it's useful to start out pretending R follows strict evaluation (because it's natural[1]) and then, once the student gets their bearings, you can introduce laziness.


[1] well, not "natural", but aligned with how math stuff is usually taught/done. in most cases, when asked to evaluate `f(x+1)`, you first do `x+1` and then take `f(_)` of that.

Wow, now I understand the reasons. As a Python guy I'm having trouble understanding how this is possible in R.


tldr: basically, R passes all function arguments as bundles of `(expr_ast, env)` [called "promises"]. normally, they get evaluated upon first use, but you can also access the AST and mess around with it. AFAIK this is called an "Fexpr" in the LISP world.

(originally i had a nice summary, but my phone died mid-writing and i'm not typing all that again, sorry!)

it's very powerful (at the cost of being slow and, i imagine, impossible to optimize). it enables lots of little DSLs everywhere - e.g. lm() from stats, aes() from ggplot2, any dplyr function - which can be both a blessing and a curse.
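a loose python sketch of that promise idea (invented class, purely illustrative; R's real mechanism is baked into the evaluator):

```python
# toy "promise": an unevaluated expression paired with the environment it
# was written in, forced only on first use -- roughly how R passes arguments.
class Promise:
    def __init__(self, expr, env):
        self.expr, self.env = expr, env
        self.forced = False

    def force(self):
        if not self.forced:          # evaluate lazily, exactly once
            self.value = eval(self.expr, {}, self.env)
            self.forced = True
        return self.value

def hello(cats, dogs):
    return cats.force()              # `dogs` is never forced

# the second argument names an undefined variable, but since it's never
# used it's never evaluated -- no error, as in the R snippet upthread.
print(hello(Promise("100", {}), Promise("honk", {})))  # 100
```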

I would recommend the opposite - pick stuff from tidyverse (mostly dplyr and ggplot2) only if you need them. Knowing base R goes a long way on its own.

The pipe operator is now in base R, so one of the nice features is now standard.



I found the book R for Data Science (which is free http://r4ds.had.co.nz) to be a very good introduction to R with Tidyverse.

I'm on the same page as the other commenter here, except stronger.

Avoid tidyverse like the plague, except when you can't, or when you don't actually care about the sanity of your code and are happy copy/pasting pre-prescribed snippets without needing to understand let alone modify them.

> ggplot

One day I’ll have a whole week free so I can sit down and learn an entire graphical grammar so that I can remove the egregious amounts of chart-junk in the ggplot defaults.

This is the typical "CS-people" reaction to learning R. R is a language written by statisticians for statisticians, which can be a good but also a bad thing. The fact is that R is the lingua franca of statistics: most new methods will be available first in R, rarely in Python, almost never in Julia. Despite Julia's superior design, there are good reasons to still use R; for example, there are many state-of-the-art libraries such as ggplot2 or data.table which beat any alternative from Python or Julia.

And as far as community support, when I was on the R user listserv, Prof Ripley from Oxford would explain why your question was using the wrong analysis for your data, explain what analysis you should use, show the sample code for the right way, and then mockingly show the dumb way you could have your question answered. Like twenty emails each British morning. I learned a lot about stats from reading it. Not so much R syntax but stats and stats reasoning.

Shouldn't Vega be comparable to ggplot2, at least if you're working with HTML output?

The joke about R is that it is great because it was written by statisticians for statisticians and that R is terrible because it was written by statisticians for statisticians.

The two best reasons to use R, IMO, are that many statisticians write up their new methods in R, so it is a window into current statistical research and practice [0], and that R is home to ggplot and the rest of the tidyverse (or Hadley-verse), a systematic approach to common data analysis tasks.

[0] https://cran.r-project.org/web/packages/available_packages_b...

[1] https://www.tidyverse.org/

You highlight four issues:

1. Documentation is accessible via the interpreter. You can type ?funcname to get documentation or ?libname for the entry point for almost every library, or use the Help tab in RStudio, the most common interpreter. Package documentation is typically hyperlinked text and of high quality. You can also see syndicated versions of library documentation online in HTML format. Here is, for instance, the HTML documentation for the stats library (the built-in library which covers most of the statistical functions you want): https://stat.ethz.ch/R-manual/R-patched/library/stats/html/0... or rdocumentation.org or any of dozens of web-syndicated versions. The PDF version you mentioned is linked from CRAN, the package repository, but is by no means the only entry point for documentation.

2. The print function -- actually not a single function, but rather a commonly implemented S3 method -- is easy to understand if you understand how the S3 object system in R works and how function dispatch works. What it does depends on the class of the object and whether an S3 print method has been implemented for the class of the object. This is true in most languages. If you're looking for something closer to a bare-metal print function you should consider cat, but in general I don't find print confusing at all.

3. The subset operators available in R are documented. Because everything is a function in R, you can easily see the documentation by typing ?`[` or ?`[[` -- both have the same documentation page, which describes the essential difference between the two subsetting operators. This is tricky to learn at first but given that the two operators do different things, both desirable in different contexts, it's sort of difficult to argue this is an ergonomics issue and not a user error. If you want a more hands on discussion of the differences, you can try http://adv-r.had.co.nz/Subsetting.html

4. Assignment, similarly, is documented. You can check ?`=` if you have some concerns or read the documentation online here: https://stat.ethz.ch/R-manual/R-patched/library/base/html/as... The short version is that although <- is idiomatically preferred by style guides, there are basically no contexts where = would do anything different. You may want to be aware of -> and <<- as other assignment operators. The former allows right hand assignment, which is a fun bit of syntax, and the latter overrides the default assignment scope and forces a global which I personally find distasteful.

One final note: the inner workings of any R function for which the implementation is in R can be inspected. Simply type the name of the function and press enter to see the source code of the function. A lot of low level stuff is implemented in C, so you'll find a stub function that calls internal things, but for almost anything else, this is a good way to learn how things work. Like, run-length encoding is implemented in the rle function so just type rle and press enter and voila, you see the full implementation.

R has a number of core language issues and things that are annoying but the ones you named read like you puttered around for 10 minutes and didn't do the kind of basic homework you need to do to learn a new language. I wouldn't complain about what a bad language Go is because I don't understand the distinction between := and = as assignment operators.

The syntax issues yeah that was a mistake to include. Although my confusion about = and <- was that I also wasn't getting warnings when using it incorrectly. In Go I cannot mistakenly use = for declaring a variable (without using 'var').

Far more severe is the scant documentation online.

Don't take this as an attack on the language or community. I'm a huge fan of Standard ML and it's arguably in a worse state!

I was interested in supporting R in the first place because I knew of its importance (if only vaguely).

Do you mean a lack of documentation on syntax or lack of documentation on statistical methods?

If it’s statistical methods, then you’ll need to look outside of R because R documentation isn’t trying to teach statistical methods.

You can try [1] the series of books teaching statistical methods using R.

In my stats degree we learned R methods alongside the statistical methods. R, to us at the undergrad level, is a fancy calculator. Yes it has functions and can do some “programming,” but its purpose is to facilitate using statistical methods and writing up reports.

If the audience of your IDE is programmers who want to do data analysis, then I think that’s a different audience than statisticians, who I think are the majority of users of R. R studio is already a decent IDE that statisticians are familiar with, so it might be a hard group to get to switch.

The syntax in R isn’t great. There are multiple ways of subsetting that depend on the data type you're subsetting.

Many of the top stats programs have notes on R or courses designed to teach R that are freely accessible on the internet.

[1] https://www.routledge.com/Chapman--HallCRC-The-R-Series/book...

Really, really confused about your take on documentation. 9/10 you have documentation in the form of a pkgdown site with examples, parameter definitions, change log, articles, etc.

Here’s a good example: https://dplyr.tidyverse.org

Equally confused about your take on the community. R community is sort of a perfect example of an inclusive community actively trying to include everyone with organizations like “rladies” for women in tech.

In the first place, I went looking for information on assignment operators in R. There's no canonical source I can find. There's an ETH edu site that looks pretty old that seems to describe it well, but it's definitely still very minimal: https://stat.ethz.ch/R-manual/R-patched/library/base/html/as.... Is this even the right documentation? I don't know. (I'm not asking you, I'm sharing my thought process.)

Then I went looking for how to interact with JSON. There's no builtin library I guess but rjson seems to be what people use. There's no official documentation I can find on how to install a package but there are many blog posts. rjson's only official documentation seems to be in PDF and again it's pretty minimal: https://cran.r-project.org/web/packages/rjson/rjson.pdf.

Again, to be fair, anyone looking into Common Lisp or Standard ML or OCaml would probably feel the exact same way about their ecosystems.

Does it cause a problem for existing users? Probably not. Is it the friendliest thing for first-timers to get into? Probably not. Is that a problem? Again probably not?

Are you using R or R Studio?

R: https://www.r-project.org/

There are manuals that you can find there.

R Studio (the IDE): https://www.rstudio.com/

R Studio cheat sheets: https://www.rstudio.com/resources/cheatsheets/

There is usually enough information on stack overflow / stack exchange to get through some questions. There is also a stats specific version that can sometimes be helpful. https://stats.stackexchange.com/

You are talking about documentation for user / community contributed packages. There is a minimum amount of standardization that needs to be followed, but yes, I do agree that the documentation could be better.

Hopefully this helps with installing packages: https://jtleek.com/modules/01_DataScientistToolbox/02_09_ins...

Thanks! I did end up figuring out everything I was trying to do this week (basics of syntax, 3rd-party libraries, etc.). I posted the original comment just to document my experience on the path to figuring out these basics.

Ah, okay agreed. Documentation on base R isn’t great. I was more so coming from the perspective of packages which I realize isn’t the full picture.

Regarding first timers, I think that once they’re aware of RStudio and the content they put out, learning becomes much more friendly and modern.

Still, I definitely agree and would appreciate a modern manual on base R.

Everything you're complaining about is just petty bikeshedding issues.

Just learn it, it's more powerful than the alternatives.

P.S. I'm a Python person and not at all an R fanboy, but it's undeniably more powerful and versatile for data science tasks than Python or Julia.

If you've come from an imperative style of programming, you may end up fighting R before figuring out how it works best. It has a beautiful functional style, best demonstrated by tidyverse and map/apply functions. R has the fastest data table / tabular implementation (yes, faster than Julia), which is the standard format for most data and statistical analyses.

1 indexing is standard for numerical languages. The documentation is referenced to papers on the statistical method - I find it usually sufficient, but depends on what package you are talking about.

Use the tidyverse and use the cheatsheets for dplyr and the book R for Data Science. If you're trying to use primarily base R, you'll be limited and hamstrung. Tidyverse is the modern framework of choice.

> I'm also curious to investigate how the aspects I'm critical of differ in Octave.

Personal opinion, but happy to oblige.

> It is a very hard language and community to get into!

Same. Especially where "Matlab isn't Octave; Octave isn't Matlab" is concerned. Having said that, on stackoverflow at least, the matlab community seems more hostile to octave questions than the other way round.

> The documentation is very sparse.

Octave is actually fairly well documented, but unfortunately this is spread out significantly between manuals, helpstrings, and esoteric gems hidden as comments in the actual source code. However, this tends to be less of a problem, since often enough an equivalent function is documented in matlab, which is typically somewhat better in the documentation aspect. (octave is pretty good too though).

As for R, I think R is actually really well documented; you do kinda have to get used to its documentation format, but once you do there is nothing you'd want to do that you'll find yourself lacking documentation for.

(Proper R, that is. Tidyverse is a slightly different issue; but then again Tidyverse isn't R).

> Library documentation is published as PDF

You can have excellent in-terminal documentation using "?" and "??" (or "help" / "help.search" ). I have never needed to look at external manuals, but, yes, they do exist, typically in PDF form on CRAN. Furthermore, R is very good at accompanying documentation with examples / vignettes/ demos etc.

Octave, in theory, also does the same, but in practice I find many functions don't actually provide the demos. Typically they provide an in-doc example though.

> The default print behaviour is hard to understand.

Indeed. In fact, R seems to have some sort of infatuation with bash commands doing things in R-space, when in fact it would probably have been much more reasonable to leave the bash commands to do bash things. E.g. ls to list variables, rm to remove them, etc. And, yes, 'cat' to effectively print strings on the terminal verbatim, without other markings.

Octave is better here, bash-commands are generally identical within octave. 'print' is provided, but basically it's a wrapper to fprintf.

> It's 1-indexed.

Yes. Yes it is. This is not a bug, it's a feature. Same with octave, and same with julia. 0-indexing makes sense when you're working primarily with structures that depend on offsets (like pointers). 1-indexing is far more appropriate for languages that abstract such offset-based-structures away, and require ordinal, 'human-indexing' logic instead.

> it took me a while to realize every time I think `array[1]` I should write `array[[1]]`

Perhaps the chosen syntax is rather unfortunate, but Octave effectively uses the exact same logic here. If you have a cell array, you can either index it with () to obtain another cell array structure, OR you can index it with {} to obtain the 'contents' of that cell element.

> I can't tell the difference between `<-` and `=`

There are two main differences.

1. "<-" is assignment. "=" is "define" and is only valid at 'top level' of a particular scope; as such, its most appropriate use is to define default arguments in a function's signature. You can use it elsewhere, as long as it's toplevel, but you're discouraged from it.

2. Contrary to '=', the '<-' operator can be interpreted in a way that calls an appropriate 'assignment' function (typically denoted as 'functionname<-' when searching for help). E.g. the line "rows(var) <- x" calls the "rows<-" function, which assigns x to var.rows. It does not evaluate rows(var) first, and then assign x to that.

> My guess is that it was probably a great language at some point but is way behind other numeric scripting languages like Julia or Matlab in terms of community attention and language ergonomics

False. Not sure what else to say about that. Once you start looking you'll be very surprised how active and cutting edge the R ecosystem is. It's just that language-preference seems very compartmentalised within different communities. R happens to be thriving in genomics / psychology crowds, whereas it's virtually unheard of in mainstream CS crowds.

> I know it's highly used but other than legacy reasons I'm not sure why you'd want to learn it over Julia.

Because, it has very interesting language designs. In fact, having effectively learned Julia first and R second, it became obvious to me that many of the aspects that I liked in Julia were effectively ideas taken from R. In fact, even though Julia is often compared to Matlab due to its superficially similar syntax, Julia is probably far more similar to R than matlab/octave.

IMO this really raises the question: Why is there not a code view for an excel spreadsheet?

I get that some of the basic operations probably create expressions that are too wordy / very "specific data" intensive. That is, if you took the first step and just did your best to create that code view it would have a lot of stuff conditional on specific things.

But it's the next step that gets interesting. Now that you've got it, in what ways can the visual UI change to have the code view create tighter expressions. Now that you've got this view, how can it become super handy for doing things that today are clunky?

IMO there's an interesting "no-code" path in there somewhere and there's also an interesting "make spreadsheet re-use more powerful." Or maybe not, what do I look like, an Excel engineer? Ha!

>Why is there not a code view for an excel spreadsheet?

The MS Excel grid has declarative formulas (not iterative loops) underneath each cell, so what would the ideal "code view" be?

Because of the architecture based on formulas, Excel does already have "Show Formulas" option (keyboard shortcut Ctrl+`) and "Trace Precedents" and "Trace Dependents".

For Excel iterative code like VBA Macros, it does have a typical "code view" (keyboard shortcut Alt+F11).

An ideal "code view" of Excel could be a declarative programming language that states the relations that hold between cells, for example a logic programming language such as Prolog or Datalog, using constraints to express the relations.

> what would the ideal "code view" be?


You might want to try out Power Query (get & transform).

Step-by-step repeatable transformations where the UI records steps and writes code in the background.

Almost exactly what you're talking about, and comes out-of-the-box in the last few versions.

Cool! Next time I'm opening up excel, I'll give it a look.

Also DAX and power pivot - if you think excel doesn’t have the features to reuse data sets I think you will be pleasantly surprised! This works alongside power query to create a relational data model from everything you import.

I wrote a VBA function that returns the formula of a cell as a string, or "const" if the cell is not a formula, or "empty" if the cell is empty. (The latter is useful for finding bugs). Then I use it to display the formula in the cell immediately adjacent to where the formula actually resides. It's not a panacea, but overcomes the problem of "the code is invisible when reading a spreadsheet."

I don't remember the macro, it was more than a decade ago. Something along the lines of:

  Function Foo(c As Range) As String
    If c.HasFormula Then
      Foo = c.Formula
    Else
      Foo = IIf(IsEmpty(c), "empty", "const")
    End If
  End Function
Then in a cell, I could write something like:


> Why is there not a code view for an excel spreadsheet?

Isn't that VBA?

Well, I'm not a VBA / excel expert -- but my neophyte view is that VBA is used to add code to Excel, which is great -- but it's not the same as every single spreadsheet automatically creating a "this VBA is the equivalent of your spreadsheet" view.

Again, not an expert. I'd expect I've never written one line of VBA. (Yay for me!)

>Why is there not a code view for an excel spreadsheet?

There is, just press Ctrl+`. There's also dependency tracking, sort of debugger.

There is no reason to use R when Excel can do the job. It is when Excel can’t handle the task at hand that R shines.

I am well familiar with R, but for simple data manipulation, a quick chart, or a pivot table, I usually resort to a spreadsheet program. I maintain my options trading journal in a spreadsheet where I record all the options trades I made, positions I hold, P&L, etc. I just can’t imagine doing that in R.

R starts with CSV, Excel ends with CSV.

> There is no reason to use R when Excel can do the jobs. It is when Excel can’t handle the task at hand, R shines.

I think the problem is more nuanced because Excel can do a lot. There is a large set of use cases and tasks that are possible in Excel but where your team would be a lot more efficient and productive using a programming language instead. The trick is to recognise those tradeoffs, and it's not as simple as sticking to Excel as long as things are still possible there.

Most data scientists I work with just sneer at Excel. Using a programming language is often using a sledgehammer to crack an egg though - sometimes you just want to look at the data in the csv, and I still see people doing this in R/Jupyter - just open it in a spreadsheet, trust me - the UI is better.

100% this - I encourage junior data scientists to get more familiar with Excel when they start - R/python are definitely more flexible, but if Excel can sort you out it's always the quickest way.

How do you store your surface?

Whenever I hear a process is run on excel I do a quick mental calculation of the relative risk of process defect severity, the time it takes to manually do the task in excel, and the investment + maintenance costs associated with automating.

When you run any process in Excel enough times, there will eventually be an error. It just depends on whether that error will cost you a lot of money or not.

However, there is also a cost to automating these processes that goes beyond the initial investment. Someone needs to make sure the input data stays clean and coordinate any system-wide changes to the interfaces. This is a new thing that can break whenever system-wide changes get made so data governance becomes a priority.

Unless you need the frontier of statistical methods and are unable to code them yourself, I highly recommend not engaging with R.

While Hadley Wickham has done amazing things to make the R language actually useful, Python and Julia are better for data applications. Also, bindings to the R language exist, obviating the need to be tied into it completely. Thus, even if you need the frontier, you have it available to you.

R is first and foremost an interactive language, and the focus of the libraries, tooling, and community reflects that.

I believe working in pandas/sklearn/jupyter ecosystem is much slower (>50%) if you are doing EDA and statistical modelling (ML) than tidyverse/tidymodels/rstudio. The exception is deep learning and adjacent fields (like computer vision).

Of course Python is better at everything else, so adding an extra language might or might not be worth the hassle.

Just as a followup, when asked I also excoriate Python for its gap in typing for data applications. But that gap isn't as wide as the gulf between Python & Julia vs R

I would say R has a beautiful functional approach which is very hard to replicate in python. It is also faster than Python and Julia for tabular/dataframe data. Ggplot is fantastic - I tried to recreate some relatively simple plots (scatterplot with a LOESS line of best fit coloured by group) and failed to do this in Python easily.

One important aspect is numerical precision. For those in need of correctness, avoid Excel and other spreadsheets. There are several papers in the literature discussing the lack of precision of Excel and the like [1].

[1] https://www.jstatsoft.org/article/view/v034i04

For me, the biggest reason to use a programming language instead of Excel is its capacity for abstraction, and hence ability to organise code and separate concerns.

Attempting to solve a complex analytical problem in Excel is similar to a doctor trying to solve medical problems by reference only to atoms.

Instead doctors use a variety of abstractions: organs, cells, enzymes, etc. to understand and explain a problem.

By using a programming language, we can develop appropriate abstractions to solve our problem in a way which keeps a lid on complexity.

I think this concept does make sense to an advanced Excel user, and can help explain the situations in which they may reach for a different tool. Having made some very complex Excel spreadsheets in the past, I think I was aware it can become very difficult to develop or generalise them further, even before I became a programmer.

The author could have mentioned testing.

The reason I use python scripts instead of spreadsheets for important calculations is that I can unit test the logic extensively and be sure that the code works as expected. This also makes sure I don't break stuff when adding functionalities / refactoring code.

I would never ever use Excel for something important, unless it's completely trivial (e.g., sum/average values of a column).

I assume the same applies to R vs spreadsheets, of course.
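As a trivial illustration (invented function, not from any real project): once the cell formula is a plain function, assertions pin its behaviour down:

```python
# The calculation that would otherwise live in a spreadsheet cell becomes
# a plain function, so edge cases can be covered by tests.
def weighted_average(values, weights):
    if not values or len(values) != len(weights):
        raise ValueError("values and weights must be non-empty and equal length")
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

# Unit tests: a refactor that breaks these fails immediately, whereas a
# broken spreadsheet formula just shows a plausible-looking wrong number.
assert weighted_average([10, 20], [1, 1]) == 15
assert weighted_average([10, 20], [3, 1]) == 12.5
```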

Some compelling reasons - I think most of them can be addressed in Excel, but the reality is most people don't know how Excel really works or how to do anything complex in it. Having said that, as your need to batch process numbers goes up, R and other solutions have more value in my opinion.

I'm particularly interested in "half-way" solutions - something between R and Excel. I've been looking at https://www.causal.app/ - no affiliation but I find their approach similar to a Mac app I like called Numi.

For small datasets and simple charts, I'm not sure there are any compelling reasons to use R instead of Excel (unless you're already more comfortable with it). Plus, Excel is great for beginners - most people wouldn't need a class to draw a basic chart.

The more complex the analysis, the more reasons there are to get out of Excel.

The "spreadsheet" interface is now universal and not limited to Excel. You have completely equivalent things in Gnumeric or LibreOffice's Calc, for example. This interface is really great for kids, e.g. for computing averages and simple formulas before they learn "serious" programming. I was taught how to use spreadsheets in school and it is easily one of the most useful things I learned there.

Fair - I was using "Excel" when to your point I meant "a spreadsheet".

Is there a tool to transpile an Excel workbook to R source code? It seems like a plausible-enough thing to do.

• Put all the static data first in one section, broken into tables with PDF’s word-boundary logic (the one that allows you to highlight text, despite it being a bunch of individually laid-out graphemes)

• reverse-postorder (topological sort) the formula cells’ definitions, grouping them into “stanzas” by which “tables” of static data they’re transitively touching.

The result would read a lot like the definition of an expert system in Prolog. Facts, then predicates.
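A minimal sketch of the topological-sort step, using a hypothetical dependency map from formula cells to the cells they reference (Python's stdlib `graphlib` does the ordering):

```python
from graphlib import TopologicalSorter

# Hypothetical workbook: each formula cell maps to the cells it reads from.
deps = {
    "C1": {"A1", "B1"},   # C1 = A1 + B1
    "D1": {"C1"},         # D1 = C1 * 2
    "B1": set(),          # static data
    "A1": set(),
}

# static_order() yields cells so every cell appears after its dependencies,
# i.e. a valid evaluation order for the transpiled code.
order = list(TopologicalSorter(deps).static_order())
print(order)  # static cells first, then C1, then D1 (ties may come in any order)
```

Grouping the sorted cells into "stanzas" by which tables they transitively touch would then be a pass over this order.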

Wow those reasons don't sound very compelling to me! (I'm a programmer but don't know R.)

Honestly I struggle with it. My partner recently wrote an R package (PhD biology) and I'm a developer. I had done a bit of R but nothing substantial. By the end of the project I couldn't think of a compelling reason not to use any other language. The R tooling is clumsy, the language isn't great... the built-in graphs are OK? It feels like the reason it's popular is mostly people not wanting to update their skills, because they're in a tangential field where this is a solution, so changing isn't a priority.

There are also a lot of niche libraries for various statistical tests and analyses used in scientific studies. Applied scientists especially, who wouldn't necessarily be able to reimplement the math and are not very familiar with programming, rely on it as a workhorse. Things like variants of ANOVA, regressions, etc. are often just a single line in R: specify the data, the interaction terms, and so on, and you get a nicely formatted output with everything calculated for you (goodness of fit, p-values, etc.). There's a long tail of functionality in there that's hard to match even with the SciPy stack or other statistical packages in Python. Plotting is also much less of a hassle than with matplotlib.
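For comparison, the closest Python gets to this one-line formula style is the statsmodels formula API. A hedged sketch with made-up data (the columns `x`, `g`, `y` are hypothetical; the API calls are real):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data: y depends on x, with a group offset for g == "b".
rng = np.random.default_rng(0)
df = pd.DataFrame({"x": rng.normal(size=200),
                   "g": rng.choice(["a", "b"], size=200)})
df["y"] = 2 * df["x"] + (df["g"] == "b") * 1.5 + rng.normal(scale=0.1, size=200)

# R's `lm(y ~ x * g, data)`: main effects plus the interaction term.
fit = smf.ols("y ~ x * g", data=df).fit()
print(fit.summary())  # coefficients, p-values, R-squared, etc.
```

It works, but the long tail of specialised tests the parent mentions is exactly where the Python side thins out.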

R is more of a free and de-bloated SPSS than an alternative to scipy or excel.

I use R for work a lot, maybe to give some reasons to use R:

1. The language itself isn’t great, but the tidyverse packages are amazing. There’s really not a reason to use R if you’re not going to use tidyverse.

2. Its integration for making quick webapps, books, Markdown reports, and blogs is amazing. In a team where we process a lot in Excel, it's so convenient that I can just make a quick webapp that does a task much faster for everyone to use.

>There’s really not a reason to use R if you’re not going to use tidyverse.

I feel the same way, except for the data.table package.

Some years back, before the prevalence of notebooks, R Markdown was great. More recently I have worked as a data engineer alongside some brilliant data scientists, and my understanding is that there are simply some models/algorithms that are not yet implemented in common Python libraries, with the drawback that on sufficiently sized data sets the R machine needed to be absolutely massive, as most packages did not support distributed processing.

In the end we supported an R environment and left its usage up to DS team including any ongoing reports or models and inferences they needed to perform, and encouraged a move to python for any code base that would get thrown back over the fence to us for official support.

I was a C++ developer, and I love R for data analysis. I wouldn't write a compiler in it, but it's just incredibly easy to do data analysis in it. So it's not a question of skills -- it's just well-designed for the task it was designed to do.

The tooling is massively better than Python to say the least. You have a single environment (RStudio) that does most things really well, an excellent package manager, and integrated graphing capabilities and environment variables observation. In comparison Jupyter notebooks feel very limited and clumsy. And Base-R is way more tuned towards data analysis than base Python is, so you end up importing many libraries in Python to do the same thing. Finally, tidyverse has no equivalent in Python as this is a major selling point for R.

The tidyverse packages are pretty great, that is true; in particular dplyr was useful. I don't want to sound like I'm against R just for the sake of it. There were things I liked, but I walked away from the project uncertain what the benefit of using R was. There's a solid chance I'm wrong.

The advantage is you get tables, numbers and plots based on solid maths instead of implementing your own understanding of useful analyses. Depending on your skill and interest in implementing your own designs for stats, this might be a plus or a minus. For me, I don't want to spend my time redoing the tricky and subtle stuff, but I can learn the odd syntax and weirdly functional nature of it pretty reliably. If someone does a Golang version of CRAN I wouldn't think of using R again, but for now R wins the stats battle. And much of the community advice about how to do things, outside of the base-R vs tidyverse flame rift, comes from informed statisticians, not autodidact Python folks who think everything is easy. But if you do solidly understand the maths behind what you want, then R might not be the best tool for your use.

I agree, the tooling can be clumsy. I wish for a language-wide standard to help with replicable R environments. Perhaps using conda makes sense here.

But when you say the language isn't great, what exactly is the problem?

IMO, R is a fascinating language from a programming perspective. The combination of first class environments, lexical scoping, non-standard evaluation and metaprogramming allows for extremely performant and expressive domain specific languages, e.g. data.table, tidyverse and ggplot.

> The R tooling is clumsy

I really like the R tooling. What do you find clumsy about it?

> It feels like the reason it’s popular is mostly just people not wanting to update skills

I have the opposite view. I find it's the people who complain about R because it's a bit different who are inflexible about learning new skills.

Package management is painful; replicating our two environments when I did a peer review always took way more effort than it would have in many other languages.

I found my partner, Jess, to be using some fairly niche packages for stats, which admittedly would be harder to replicate in other languages. I personally think it's fine to use; there's no harm. But as someone comfortable with many languages, most of the people I've found debating the pros of R (since this project, and it coming up in discussion) are often in non-software fields and it's their only language. I personally wouldn't choose it over Python, and I might even consider something old-school like Java for similar projects if I didn't have an audience that only understood the work in R.

renv is way less clumsy than whatever Python keeps trying to reinvent with pyenv, virtualenv, and whatever new solution they come up with every single year.

The reason to use R is that someone else has already made an amazing library for it (usually from some kind of academic research) that doesn't quite exist for Python yet.

There is also the aspect that if you are using a technique you don't really understand, you can do the same thing in Python and R and compare whether the results are the same.

For "R" substitute any modern programming language. Python with numpy, for example.

What about M? Excel’s language for data processing.

A while ago, there was a job offer in my area to automate Excel tasks using Python. The posting was put up time after time by several recruitment agencies, for almost a year. Just thinking about how expensive that process must have been answers the question.

I've done this kind of conversion, not as a job, but on a case by case basis within my own work. An example would be taking a simple computation that I would share with people (mostly engineers) in the form of a spreadsheet, and instead turning it into a Python program, even replete with GUI. Here are some of the pitfalls that I've experienced.

If I share a computation as a spreadsheet, people already know how to work with the GUI, which is actually quite sophisticated, even if it doesn't prevent you from screwing up. People could extend my spreadsheets, e.g., adding columns to do the same computation on multiple input sets, etc. They could easily extract the output in text format and paste it into something else. And so forth.

A Python program with GUI created the expectation that I was writing commercial quality "software," and that if it wasn't 100% intuitive, I would hand-hold each colleague, and make changes on demand, including converters to multiple file formats (often, so they could put the data back into a spreadsheet). And of course I also had to help each person install Python on their computer.

Of course I'm not a full fledged software developer, just a "scientific" programmer, and of course the lesson I learned is common knowledge: Writing and supporting real software is orders of magnitude more costly than just writing a one-off program to solve a problem, to the point where the conversion to "proper code" could be a net liability to the business. You have to assess whether it will actually add value. In one sibling comment, it sounds like the answer was yes, so I acknowledge that.

The difference was not so much the technology, but the cultural expectations associated with Excel versus "software."

Today, I do all of my work in Python, but when I share a simple computation with colleagues, I will often convert it back into Excel for them.

I’m doing this kind of task, and I tend to avoid any GUI and just have input and output files as simple text, CSV, or Excel spreadsheet. It seems to me that it’s the best way to have a satisfactory UX without too much headache.

Or just have a download the source data for the graph/table in Excel button on every page/for every data source. Sometimes it is nice to do your own cool D3 chart but still have a no-work-for-me answer to requests for slightly different graphs.

I spent a big part of my early career automating Excel, initially via VBA, then Interop (.net), then later via the OpenXML libraries.

I actually really enjoyed it, there's something a bit magical about taking a task that takes somebody hours to do, and putting it under a single button press, or even running it overnight and having everything ready for them when they start work the next day.

That's what I'm doing right now.

The company I'm at had a process for doing salespeople's commissions that would take days. It was painful to watch the first time I was trained on it: they were sorting rows manually in Excel and exporting CSVs from the ERP software.

Over a couple of months I worked automation into the project using SQL queries and VBA/Python; I even automated sending out all the personalized reports to each salesperson. I showed it to my boss last week (he had to run the reports this month) and he was blown away by how much time and energy it saved.

It felt so good to reduce a process that took days to do down to a couple of button clicks.

Class :)

I would have been surprised and intrigued if the title read "Why we use R instead of Python". I used Excel myself as a student in 2003-05, when R and Python weren't widely known. And now I sometimes teach Economics students, and they all work either in R or in Jupyter Notebooks. Many of them routinely do what was unthinkable: process hundreds of thousands of records, do geospatial queries and joins, etc.

For me, it comes down to the right tool for the job. Excel (strictly speaking, spreadsheets) is very powerful. It is a grid-based visual programming environment that has an interface simple enough to be grokked by its original target audience: accountants and file clerks.

What that means to me is that I use excel for "toys" (small data set analysis, rough charts) and ephemeral-yet-shared lists where it is easier to say "evening batch tab, line 3" to communicate which job needs to be updated than try to communicate the same thing in a written request or pull a database table report.

I use R where I need more power: automated reports, automated statistics, and more complex analysis where the paradigm of textual code fits better than the paradigm of a 2D grid. I teach R (or Python) when I want to give someone more reproducible tools than a "magic" spreadsheet that, under the covers, is really a contraption held together with sticky tape and positive thinking.

We can, and should, teach both. Ideally side-by-side with a constant stream of "why are you doing task X this way?" that is largely missing from formal education.

Shameless plug: if you want to use RStudio and R on AWS, GCP, or Azure, we provide VMs with pre-installed RStudio and popular R libraries which are far cheaper than the official RStudio offering. Below are the links:

AWS : https://aws.amazon.com/marketplace/pp/prodview-brc4ybuoee6he...

GCP : https://console.cloud.google.com/marketplace/product/techlat...

Azure : https://azuremarketplace.microsoft.com/en-us/marketplace/app...

Support & Documentation : http://www.techlatest.net/support/r-studio-support/

I have some objections, as someone who doesn't use R but does quite a lot of Excel. The author seems to miss PowerQuery, data model, and PowerPivot functions.

"Visibility: How do you see the code inside an Excel document? How do you tell exactly what is going on? You have to go clicking through cells, or reverse engineer what settings a graph has."

PowerQuery is a functional way of transforming data step after step. You can see the code/function of each step, and it's very easy to reason about those functions. You can transform data into the format you want and do calculations on it before it enters the spreadsheet. One of the main advantages is that it's easy! I've taught non-programmers to use this tool reliably.

"Repeatability: [...] With R, you just change read.csv("1.csv") to read.csv("2.csv") and the exact same calculations are run on two different data sets."

Again, with PowerQuery this is doable, and a normal procedure on my day to day. You change the file parameter PowerQuery will use on step 1, and the rest of steps will follow.
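The same swap works in any scripting language once the whole analysis is a function of the file name. A hypothetical Python sketch (the `value` column and file names are made up):

```python
import csv

def summarise(path):
    """Read a CSV with a numeric 'value' column; return row count and mean."""
    with open(path, newline="") as f:
        values = [float(row["value"]) for row in csv.DictReader(f)]
    return len(values), sum(values) / len(values)

# Rerunning the exact same calculations on a different data set
# is just a different argument:
#   summarise("1.csv")
#   summarise("2.csv")
```

The point in both cases is the same: the transformation is defined once, and the data source is a parameter.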

"Batch processing: Related to the above, you can read every CSV in a directory and produce a graph for each of them. You can read data from an API and run the same process on it that you did yesterday."

You can use PowerQuery on folders, it can take a set of files and transform or aggregate them all at once.
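In script form, "every CSV in a directory" is a short loop. A hedged sketch (`process` stands in for whatever per-file analysis or graph you'd run):

```python
import glob
import os

def process_all(directory, process):
    """Apply `process` (any per-file analysis) to every CSV in `directory`."""
    results = {}
    for path in sorted(glob.glob(os.path.join(directory, "*.csv"))):
        results[os.path.basename(path)] = process(path)
    return results

# e.g. counting lines in every CSV under "data/":
#   process_all("data", lambda p: sum(1 for _ in open(p)))
```

Swapping the lambda for a plotting function gives "one graph per file" with no extra machinery.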

In my opinion Excel has gotten pretty powerful since the data model and Power Query were added, I think around Excel 2013. I barely use cell functions anymore; the data model and pivot tables make for robust spreadsheets that are easy to reason about. I know a decent programmer can do most of it in many other ways, but the accessibility of Excel is amazing.

The biggest defect Excel has for me at the moment is change tracking. I wish I could track Excel changes in git.

R isn't interactive or visual like Excel. Excel's learning curve is a lot less daunting. Also, Excel is ingrained in the accounting culture and well supported by peers. Plus Excel is everyone's default database and form software.

None of the article's reasons matter to non-IT people, and no IT department has the juice to override the CFO and the accountants. Unlike every other interactive tool that displeases programmers, Excel survives.

If you want R to replace Excel you need to build an interactive front end that can do everything Excel can, as easily as Excel. Sitting in a class is not an option.

Yeah most of the reasons explained in the article are too abstract for a beginner to understand. Visibility for example. If Excel is doing what the user needs, they might naturally question why they need to see under the hood. It might even be a positive that you get less visibility with it. Excel is exposing a smaller surface for the user to understand. Once you have a more complex task those guardrails are going to feel constraining, but until then Excel works well enough.

Most of what you said only applies when the tasks are accounting related. There's a lot more to Excel, and a lot more than that to R.

And just like, in accounting, you're not going to get the CFO to choose R over Excel, in many circles involving data analysis and statistics you won't get to choose anything but R.

Also, you don't need an interactive front end for R any more than you need one for any other programming language. The scope of use is too large to allow for a single GUI. There are some packages that provide a GUI for specific purposes, like rattle, which provides an interface for common data mining tasks and models. If you want a pretty good spreadsheet GUI for viewing and modifying data, you can use rhandsontable. Just a few examples.

Replacing Excel with R isn't necessary and I think would miss the point & relative strengths of the different tools. Recognizing where each tool is better suited for your needs, and if R is worth the learning curve for your specific needs, is what is important.

No, but it's far easier to see what the hell an R script is doing versus an Excel file. And there's just a ton more that R can do that Excel can't: more powerful plotting options, more tools like random forests or mixed-effects models. And it has fewer issues with floating point.

Excel is good for accounting tasks because that's what it and other spreadsheet software were originally designed for. Data science tasks are tacked on.

> Plus Excel is everyone's default database and form software.

This is one of many, many things that is terribly wrong with Excel. It is arguably well suited for neither of those applications. I will continue to do my statistics in R, and be very glad that I do not work somewhere with either CFOs or powerful accountants ;-).

With millions of people using Excel, I am not convinced that there are no solutions to the problems of repeatability, batch processing, and extensibility mentioned in the article.

Currently watching a health care department in a c.8m person city implement powerBI.

To me, similar reasons surface for why it's a disaster waiting to happen.

Formulae hidden away in non-document DAX columns. It's horrendous. It's the worst thing to happen to "BI" and "data analysis" that I've seen.

Give me a clearly defined, single R Markdown document that's easy to follow. If you can't follow the Markdown document, you shouldn't be running the analysis in the first place.

Saw CSV mentioned in the article but didn't see it mention that R can read Excel files directly. So, you can use the spreadsheet to capture data and process it in R.

Oops. meant this for my comment below.

I have more than once found myself answering that sort of question. I gave similar reasoning.

In 2020, for a grassroots PPE-relief organization, however, I found that I had sometimes been mistaken. What our group managed to achieve by eschewing fancy tools (eventually as a watchword) and building our entire backend around Google Sheets was speed. Moreover, I learned along the way that simple database tools are sometimes more efficient or faster than anything I'd have written myself.

I had been blinded by the GUIs -- the (frequently correct) notion that GUIs are generally inferior in the long run to scripting/programming had blinded me to the very idea that perhaps another tool could be superior.

As my career takes me in new directions, I'm presently reprising that experience, this time with SQL. Physicists rarely use it, so we have no idea what it can do. The syntax looks old/quirky/muddy, but the tools behind it are extremely powerful.

If you're great at R, consider sitting down with someone whom you know is just crushing problems with Excel. I'm pretty sure you'll both learn something useful.

I had to "learn" R during my CS course, and I was really impressed by it.

We used RStudio and Shiny to build this: https://henvic.shinyapps.io/accidents https://github.com/henvic/accidents

Another big one for me is speed. Almost all the popular wrangling libraries in R (tidyverse, data.table) use C++ under the hood and are lightning fast. Whereas Excel can chug away for five minutes operating across a few tens or hundreds of thousands of rows, R can often do the same operations on a few million or tens of millions of rows in a few seconds.
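As a rough illustration (timings vary by machine, and this uses pandas, whose hot paths are likewise compiled code): a grouped aggregation over a million synthetic rows, of the kind that makes Excel chug.

```python
import time
import numpy as np
import pandas as pd

# A million rows of made-up data across 1000 groups.
n = 1_000_000
df = pd.DataFrame({
    "group": np.random.randint(0, 1000, size=n),
    "value": np.random.random(size=n),
})

start = time.perf_counter()
means = df.groupby("group")["value"].mean()  # one mean per group
elapsed = time.perf_counter() - start
print(f"{len(means)} group means over {n:,} rows in {elapsed:.3f}s")
```

On typical hardware this finishes in a fraction of a second, because the loop runs in compiled code rather than cell by cell.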

Best 'ah ha' moment for why R (or python, matlab, or whatever) might be useful is to have someone do tedious work making a bunch of charts and formatting them in Excel. Oh, and histograms. Oh, or confidence intervals on your linear regression.

Pivotcharts (and tables) help quite a bit for removing some of the tedium, but now suddenly it's interactive. Sometimes you want a format where BAM, everything is there laid out. There's nothing to misclick, all the tables and charts are there, you can just tell whoever, "look at the 3rd figure from the top on the 2nd tab".

The other things that people usually laugh at Excel about (aside from silently changing values... that's just baaaaad) are the gong-show of naming files and versioning. R and friends plus CSVs don't fix that problem by themselves. They just make it somewhat easier to solve (as in, they play better with git).

I don't use R, and didn't read this, but I think the biggest problem with excel is that it hides the code. Figuring out which cell does what, and how, usually means clicking a million places trying to remember what is what

It's the same reason I took the doors off my pantry and cabinets

I think a fairer question today would be "Why do we use R rather than M?"


Absolutely! For power-users of Excel, this article reads as "the reason we don't use Excel is because we don't understand how to use it".

For those who aren't as familiar with Excel, all the functionality the author is describing as missing in Excel is actually built into it out of the box - as a feature called Get & Transform (or 'PowerQuery'). You can load in a CSV, manipulate it, do batch processing, and all the steps are visible, editable, resequencable and deletable (showing you the M code behind it!).

You can even handle hundreds of millions of records this way.

For those who want to learn, I would wholeheartedly recommend this book: https://www.amazon.co.uk/Power-Pivot-Bi-Excel-2010-2016/dp/1...

Second using Get & Transform (or 'PowerQuery'). Note it is also Power BI's data integration functionality. (Though was recently disappointed to learn it is not available on Mac.)

While I use wide range of data integration and visualization tools, Excel/Power Query with pivot tables & charts is often my go-to for quick data analysis or exploration.

Recently used it to retrieve Our World in Data Github hosted csv file data into an Excel file and then click refresh to get most recent data: https://009co.com/?p=1491

I think it comes from Excel hate and from thinking Google Sheets is good enough.

Many people to this day aren't even aware cells can be named like proper variable names.

I would add testing, collaboration, and complex use cases.

There's no good way to check if you've made a mistake in the logic of your excel spreadsheet. You can easily test your R code.

If you need to work with multiple people excel is a nightmare. It works fine when one person needs to build something for themselves or just to show to someone else. When you get two or more people working together it's like sharing a keyboard with someone else.

Excel can do a lot of things, but there is a limit to what it can do. R or Python is typically what you would reach for when you need to do something beyond what excel is capable of.

As a counterpoint to his 'Repeatability' point: in Excel you can treat your workbook as a simple output for your data files. Let's say I have my data in a CSV file (like the example); I can just load the data from the CSV file and use Power Query to apply any transformations and calculations to it. I could then simply swap out the CSV file for another CSV file, and all the transformations and calculations would be applied to it just the same.

This is also how it works in Power BI, but I guess it's a bit of a different way of working than how most people are used to work.

I think that part of the answer is that software engineers simply aren't (usually) familiar with Excel. It's a remarkably complex and powerful program, but if you've never used it professionally you will never learn its full potential. Yes, it's easier to do useful stuff as a beginner in Excel vs a real programming language, but getting really good at it takes time. You might think you know Excel because you've used it in some of your classes, but to quote my first boss: if you don't know how to use Excel without ever touching a mouse, you aren't even an intermediate user.

Meh, if you're diving so deep into Excel that you learn all the shortcuts, why not just port the data into a scripting language, which can do far more advanced things more easily and transparently?

When you are learning the basics, you are doing very basic tasks. R can do very basic tasks. MS Excel can do very basic tasks. If you never go beyond the basics, there is no compelling reason to use R (or any other programming language) over MS Excel.

Programming languages come into their own when the tasks stop being basic, but the learning process usually goes through "Hello, World" first.

I mean, if you're just going to print a line of text to the screen, why use Python? Just open an MS Word document. MS Word can even include variables and insert them dynamically!

The reason that we use R or Python or any programming language over applications is integration, capacity, performance, or control of process. When I'm asked to replace a spreadsheet it is usually because the sheet is too large to be performant, or it is very buggy and can't be relied on, i.e. the CFO got crushed in a meeting because of a bad Excel/Google Sheets formula. Lately, I've been thinking about what the "Rust of spreadsheets" would look like... It's really a big problem now.

I don't think there is any excuse for doing analysis in excel at this point, and if any analyst came to me with either a) no coding skills or b) no interest in developing them, I'd laugh them out of the room.

My BiL is this breed. He's a financial analyst. He's got a hilarious mental block about using any kind of coding language in his work, to the point that he actually developed a web-scraping app in Excel. Don't ask me how. But jimmeny-christmas, use the right tool for the damn job.

I think one of the things this is really missing is that Excel fundamentally supports one data structure: the table. What if your data isn't a table? What if your data is many tables? What if your data has three dimensions? It soon becomes incredibly difficult to map your actual problem onto a set of tables.
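A small illustration of data that fits awkwardly in a grid (the dimensions and numbers here are made up): three dimensions are natural as a dict keyed by tuples, and any 2-D table you could put in a sheet is just one projection of it.

```python
from collections import defaultdict

# 3-dimensional data: (region, product, month) -> units sold.
sales = {
    ("north", "widget", "jan"): 10,
    ("north", "widget", "feb"): 12,
    ("north", "gadget", "jan"): 7,
    ("south", "widget", "jan"): 5,
}

# Collapsing the month dimension yields a table Excel could hold;
# keeping the full structure would need a sheet per region or per month.
by_region_product = defaultdict(int)
for (region, product, _month), units in sales.items():
    by_region_product[(region, product)] += units

print(dict(by_region_product))
# {('north', 'widget'): 22, ('north', 'gadget'): 7, ('south', 'widget'): 5}
```

In a language, the full-dimensional structure and all its projections coexist; in a spreadsheet you must pick one layout up front.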

It's actually one of Excel's great features: it so strongly constrains the way you represent data that it's easy to reason about.

Much faster for many data processing workloads too. Like, significantly. And Excel has some stupid limitations, like a maximum of about a million rows.

Because R has a million scientific libraries to enable very specialized research workflows and excel doesn't.

On the accuracy of statistical procedures in Microsoft Excel 2007: https://materias.df.uba.ar/l1aa2017c2/files/2017/08/excel200...

Only slightly tongue in cheek: removing the ability to save workbooks would still let people use Excel where it shines, while not letting people and organizations shoot themselves in the foot by using subpar processes for anything you actually need to be confident about.

The book Beyond Spreadsheets with R, for those who are venturing into R from Excel:


I use R and Excel daily. Excel's graphs are simpler to make than a customized ggplot, but an ice cream sandwich is easier than a sundae with whipped cream and sauce and a cherry.

Different levels of difficulty and customization for different applications.

R and Excel are not the only alternatives. There are also visual, node-based data transformation tools such as Easy Data Transform, Knime and Alteryx. They can do 95% of what R can do, but are much easier to get started with.

R is free. Excel is not. Excel encourages messy behavior, just like Jupyter Notebooks. Sometimes during EDA that might be ok. But anything rigorous shouldn’t be confined to just Excel. Or Jupyter imo.

Most people who use Excel would benefit from Access and creating some repeatable processes. A lot of the sorting and ranking done in Excel could be done in linked tables, improving repeatability.
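The same idea works with any relational store; a sketch with Python's bundled sqlite3, joining two "linked tables" and sorting in SQL rather than by hand in a sheet (the table and column names are made up):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE reps(id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE sales(rep_id INTEGER, amount REAL);
    INSERT INTO reps VALUES (1, 'Ann'), (2, 'Bob');
    INSERT INTO sales VALUES (1, 100), (1, 50), (2, 120);
""")

# One repeatable query instead of manual sorting/ranking in a sheet.
rows = con.execute("""
    SELECT r.name, SUM(s.amount) AS total
    FROM reps r JOIN sales s ON s.rep_id = r.id
    GROUP BY r.id
    ORDER BY total DESC
""").fetchall()
print(rows)  # [('Ann', 150.0), ('Bob', 120.0)]
```

Re-running the query against fresh data repeats the exact same process, which is the point being made about linked tables.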

For a program which combines R with an Excel/SPSS interface I can recommend Jamovi [1]

[1] https://www.jamovi.org/

The article is a bit one sided. I think it ought to acknowledge the good things about excel. For example it lets you more directly interact with the data and see intermediate results.

Also:

- Open source is better for serious work.

- Excel is notoriously buggy and shouldn't be used for anything important, yet it is, and the costs are enormous. Many articles on this.

One problem we have with excel is merging documents. I wish there was a good tool to use that worked well with merging.

This also provides a good transition path in case you'd like to experiment with R but have existing spreadsheets.

Reason I use scripting languages over excel: I don't want to pay for excel, or a machine that will run it.

There are open-source spreadsheet programs, like LibreOffice Calc. I occasionally use it (Calc) to do things like taxes. I find it quite easy to use, personally, but YMMV.

One big thing missing, last I worked with Excel/VBA, is the ability to write test cases.

Mainly because they haven't heard of Python apparently.

The article completely misses the point of R. Go ahead: try to run a mixed-effects model with random slopes in Excel, and plot the random effects and the fixed effects in a spaghetti plot. Let me know how that went.

These are also good points against no-code

Because in Excel 1900 is a leap year.
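This bug (kept for Lotus 1-2-3 compatibility) means Excel's date serial 60 corresponds to the nonexistent 29 February 1900. Any real date library refuses it, as a quick Python check shows:

```python
import calendar
import datetime

# 1900 is divisible by 100 but not by 400, so it is not a leap year.
print(calendar.isleap(1900))  # False
print(calendar.isleap(2000))  # True

# The date Excel happily accepts is rejected outright:
try:
    datetime.date(1900, 2, 29)
    print("accepted")
except ValueError:
    print("no such date: 1900-02-29")
```

This is one reason date arithmetic done in spreadsheets and in code can silently disagree.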

Use R when getting paid by the hour. Excel when you have dinner plans.

Worse Is Better™


Excel doesn't scale.

period. full stop.

Excel files are shareable and everyone's able to inspect them. (drops mic)

These reasons are valid, of course. But Excel is installed on most machines; R and Python aren't. We all know how people like a "binary" that just runs everywhere…

Overall, for serious data analysis work, I'd use R or Python, but Excel is popular enough to have its place; you just need to know the risks.

How about other solutions e.g. deno+typescript?
