Dependently Typing R Vectors, Arrays, and Matrices

blackbear_ · on April 11, 2023

Types in R make absolutely no sense to me, and the whole apply/lapply/sapply/tapply/vapply madness gives a feeling of hacks taped together just enough so that the thing does not fall apart.

> What a programmer might think of as a primitive value is often actually a vector containing precisely one element.

Say what?!

PheonixPharts · on April 11, 2023

> Say what?!

Maybe it's because R is heavily used by non-programmers, there is this weird belief that R should be easy to grok by anyone with programming experience, and if they don't immediately get it, it must be R's fault. Of the programming languages I've used, R is much closer in terms of changing your view of programming to Haskell and Lisp (a good chunk of R essentially is a Lisp) than most other 'blub' languages [0]

R certainly has quirks, but most of them come from being one of the oldest programming languages out there (considering it is ultimately an open implementation of S which came out in 1976) and so has a lot of programming patterns that may not be familiar for modern programmers.

Not understanding that R is fundamentally a vector based language and what a vector based language is is akin to approaching Haskell and not having any understanding of Functional Programming, finding no for-loops and saying it's a silly language.

If you are someone who is very curious about programming languages and different programming paradigms, and likes doing numeric/statistical computing, I highly recommend approaching R with a very open mind, and being willing to explore it more carefully. As just one example: one of it's multiple object systems is using Generic functions in the same way that Common Lisp's CLOS does. The best example is R's `plot` function which is a generic function dispatching based on the class of the argument. Again this is something that less experienced programmers see as confusing and stupid, whereas someone with a better understanding of the development of OOP will likely find this very interesting.

0. http://www.paulgraham.com/avg.html

JHonaker · on April 12, 2023

I was going to mention the (at least) three different kinds of OO-y systems in R, but you already did. I think the reason that it's so hard for people to grok is that it's so old that a lot of code already exists, and a lot of that code is still used today!

Many libraries use different implementations of the same ideas for different things. It can be hard to handle if you're someone who likes to understand the whole system.

R is a fascinating language in its own right though. I'm not aware of many languages that give you first class control over environments as well or as easily as R. It can definitely enable some "lisp curse"-type code though.

R more than any language (even Scheme) lives the "build a DSL for your problem" dream that so many of us like from more standard languages. Consider the various ways to specify formulae for models; the very different ways to specify graphics in the base, GGPlot, and lattice libraries; and the various ways to to describe data transformations via the base functionality, the Tidyverse, the Tidyverse + Purrr, and data.table.

andersentobias · on April 11, 2023

Would you be interested in tutoring a experienced programmer (me) in using R?

JHonaker · on April 12, 2023

What do you want to know?

dan-robertson · on April 11, 2023

This is a bit like other vector/array languages like Matlab or APL: 145 isn’t considered a ‘scalar’ but rather an array containing one element. Though it can make defining how to deal with different shapes for some operation like + tricky. One idea is cycling when one vector/axis is shorter. Another is to special case vectors with one element. Another is for these singletons to be, in some sense, zero dimensional arrays and then you can just have broadcasting with errors if one shape isn’t a prefix of the other.

nequo · on April 11, 2023

Everything being a vector is one of the best things about R. You can write a function that looks sensible for scalars:

  in_circle <- function(x, y) {
    x^2 + y^2 <= 1
  }

It works as you would expect:

  > in_circle(.2, .3)
  [1] TRUE
  > in_circle(1, 1)
  [1] FALSE

But you can also evaluate `in_circle` for multiple points with a single function call. You can pass in vectors as arguments, and you'll get back the results as a vector:

  > in_circle(c(.2, 1), c(.3, 1))
  [1]  TRUE FALSE

Since the columns of a data frame, a tibble, or a data table are all vectors, it becomes very natural to apply simple functions to them. With tibbles:

  > data <- tibble(a = c(.2, 1), b = c(.3, 1))
  > data
  # A tibble: 2 × 2
        a     b
    <dbl> <dbl>
  1   0.2   0.3
  2   1     1  
  > data %>% mutate(inside = in_circle(a, b))
  # A tibble: 2 × 3
        a     b inside
    <dbl> <dbl> <lgl> 
  1   0.2   0.3 TRUE  
  2   1     1   FALSE

mamcx · on April 11, 2023

> Say what?!

This is the sensible design!

Is like "In a OO language, 1 is a object"

But here, this is an array language, so it makes ton of sense that 1 is an array.

And in a relational language, 1 should be a relation.

kuhewa · on April 11, 2023

> /lapply/sapply/tapply/vapply madness gives a feeling of hacks taped together just enough so that the thing does not fall apart

Hard to argue against that when the argument for whether to pass to simplify2Array or leave the result as a list is:

sapply(..., simplify = )

but is

mapply(..., SIMPLIFY = )

It kinda reminds me of English pronunciation as far as consistency of conventions goes

rscho · on April 11, 2023

A 'just-screwed-enough-to-still-be-usable' version of Scheme is apparently a reasonable recipe for success in the PL design world: JavaScript and R have a genesis in common. I have to say that I long for an array language that's closer to original Scheme (for the boilerplate parts of statistical programming) or/and APL (for purely statistical stuff).

Regarding the typing story, I think a contract system (R has libs for that) rather than static types would be far better. Nice as they are, dependent types are not easy to use, and pretty much a nightmare for exploratory statistical work. Since most of the runtime is spent iterating on vectors anyways, contracts would IMO have a much better usability story.

sixbrx · on April 11, 2023

R's pretty serious about it, it's why for example length("abc") returns 1, which is a head-scratcher for people coming from Python.

bachmeier · on April 11, 2023

Well they could read the documentation. nchar("abc") returns 3.

That's a lot easier for new programmers than len("abc"), which requires them to understand Python's string implementation. len("abc\n") returns 4 - confusing as hell for a new programmer working with a very common data analysis problem, because "length" is too generic to know it's the number of characters. nchar("abc\n") being 4 makes sense, because "\n" is a character representing a line feed. Applying the same logic, len(137) returns TypeError: object of type 'int' has no len(), when it should be 3.

At the end of the day, the programmer will always have to make an effort to learn the language they're using.

huijzer · on April 11, 2023

Yes. No function overloading + no object oriented = verbose function names.

What I find even more crazy is that some of the R functions take function names as strings as argument. For example, taking "sum" as an argument will then apply the sum function to the dataframe.

vharuck · on April 11, 2023

>No function overloading + no object oriented = verbose function names.

Forgive me if I misunderstand, but are you saying R doesn't have function overloading? Because R's S3 class type system relies entirely on function overloading to make up for not having class-bound methods. Look at the result of `methods("print")` to see how overloaded that function is.

The real problem with this system is that every function is in the general namespace. It gets hard to avoid name clashes. Some packages have taken to appending prefixes to their function names (e.g., every function from the stringi package starts with "stri_").

>What I find even more crazy is that some of the R functions take function names as strings as argument. For example, taking "sum" as an argument will then apply the sum function to the dataframe.

Think of it like a "naturalized" macro language. When you know you'll always use `sum`, you can pass the function object itself. But character vectors are easier to deal with in logic and control flows. For a toy example, let's say you have a dataset with some columns that contain addresses, some that contain dates, and so on. You could write an overall cleaning function that takes the dataset and a "format" table to be a lookup between column names and functions to use¹. R even has `getFunction` to make this simple.

¹Because functions are regular objects in R, the format table could have a list column with the functions. But I've found my code looks more cluttered when using arbitrary functions in logic. And function names are easier to handle when saving data frames to files or databases.

wodenokoto · on April 11, 2023

In fairness, much newer frameworks such as pandas do the same ie `df.agg(“sum”)`.

What is it about function names that overloading would solve that singular dispatch can’t already do?

0cf8612b2e1e · on April 11, 2023

To carry water for pandas, you can also pass functions into that pattern: ‘df.agg(np.sum)’

kgwgk · on April 11, 2023

Just like in the R example, I think.

programLyrique · on April 11, 2023

Another recent (2020) work on types for R: "Designing types for R, empirically" https://dl.acm.org/doi/abs/10.1145/3428249

hackandthink · on April 11, 2023

"2. A Primer on R Vectors, Arrays, and Matrices"

This should be referenced in every R Tutorial.