
Back to the Future: Lisp as a Base for a Statistical Computing System [pdf] - tosh
https://www.stat.auckland.ac.nz/%7Eihaka/downloads/Compstat-2008.pdf
======
ScottBurson
_R uses a pass-by-value semantic for function calls. This means that when a
function modifies the contents of one of its arguments, it is a local copy of
the value which is changed, not the original value. This has many desirable
properties, including aiding reasoning about and debugging code, and ensuring
precious data is not corrupted. However, it is very expensive as many more
computations need to be done to copy the data, and many computations require
excessive memory due to the large number of copies needed to guarantee these
semantics._

I don't know what implementation R uses for its data frames, but building
efficient collections with functional semantics is now a solved problem, as
demonstrated by the functional collections libraries supplied by or available
for many languages -- Clojure, Scala, ML, Haskell, and no doubt many others
I'm unaware of (along with my own for Common Lisp [0] and Java [1]). It is no
longer necessary for the important benefits cited by the author to be
considered at odds with an efficient implementation.
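
To make that concrete, here's a minimal Clojure sketch (the same idea applies in FSet and the other libraries above): an "update" returns a new collection that shares structure with the old one, so the old version stays valid without copying everything.

    ;; Persistent collections: assoc returns a *new* vector that shares
    ;; structure with the old one; only a small path is copied, not the data.
    (def prices [100 101 102])
    (def updated (assoc prices 0 99))

    prices    ;=> [100 101 102]  -- the original is untouched
    updated   ;=> [99 101 102]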

I think it's unfortunate that the Julia creators didn't realize this, and have
gone back to imperative semantics for their collections. Imperative
collections are harder to reason about, especially for inexperienced
programmers, and are more bug-prone.

[0] [https://github.com/slburson/fset](https://github.com/slburson/fset)

[1] [https://github.com/slburson/fset-java](https://github.com/slburson/fset-java)

~~~
nerdponx
I remember that a while back, someone tried to write an alternative R
implementation.

Can you call R from Julia yet? You can do it from Python, but converting
between R and Python data structures can be painful. I imagine it would be
less so with Julia being "more vectorized" than Python.

There's also the issue that vectorized operations tend to be very _fast_ but
algorithmically inefficient. Even if expressions are evaluated lazily, there
is no compiler to recognize that computing the value of

    any(is.na(x))

can short-circuit as soon as a single NA value is found. Instead all of
`is.na(x)` is computed first, and then `any()` is computed on the result.
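
For contrast, here's a rough sketch of the short-circuiting version in Clojure (with nil standing in for NA): `some` stops at the first hit, so the rest of the vector is never examined.

    ;; Short-circuiting "any missing?": stops at the first nil it sees,
    ;; instead of first materializing an is-na mask for the whole input.
    (defn any-missing? [xs]
      (boolean (some nil? xs)))

    (any-missing? [1 nil 3 4 5])   ;=> true, after inspecting two elements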

I imagine this is a problem in any non-compiled (ahead-of-time or just-in-
time) language. I'm wondering what your thoughts are, since it seems like a
related issue.

~~~
ihnorton
> Can you call R from Julia yet?

Yes. There's even an R REPL mode.

[https://github.com/JuliaInterop/RCall.jl](https://github.com/JuliaInterop/RCall.jl)

------
widdma
For context, Ross Ihaka is one of the original authors of R, and this dates
back to 2008. Also see his 2010 paper:
[https://www.stat.auckland.ac.nz/~ihaka/downloads/JSM-2010.pdf](https://www.stat.auckland.ac.nz/~ihaka/downloads/JSM-2010.pdf)

~~~
simonbyrne
Thanks, I had seen the original link before, but not this. As a Julia
developer, I like that his 5 lessons are ones that Julia has successfully
addressed. In particular:

1\. Julia had the benefit of clever design, as well as not needing to support
legacy interfaces that are difficult to optimise. Python has had similar
problems (it needs to support all the different low-level interfaces),
whereas JavaScript, which provides fewer ways to "mess with the internals",
now has several high-performance engines.

4\. Although Julia does provide type annotations, they are actually rather
rarely necessary, due again to good language design, which makes automatic
type inference feasible.

5\. Julia originally discouraged use of vectorised operations for this reason.
But now we've gone the other way, with special "dot" syntax which avoids
allocating intermediate arrays. See
[https://julialang.org/blog/2017/01/moredots](https://julialang.org/blog/2017/01/moredots)
for more details.
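
For comparison, the same "no intermediate allocations" idea can be sketched with Clojure's transducers (a different mechanism from Julia's dots, but a similar effect):

    ;; One pass over xs: the map and filter stages are fused, so no
    ;; intermediate collection is allocated between them.
    (def xs (range 1000000))

    (transduce (comp (map inc) (filter even?)) + 0 xs)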

~~~
JadeNB
As a Julia developer, and for the benefit of someone (me!) who doesn't know
anything about such matters, are you in a position to respond to ScottBurson's
post
([https://news.ycombinator.com/item?id=14801186](https://news.ycombinator.com/item?id=14801186))
on efficient collections with functional semantics?

~~~
micro2588
Julia has been designed for single-core performance, full stop. Functional
collections may work well with a state-of-the-art GC; with Julia's, not so
much. The fact that Julia can interop seamlessly (and easily) with C code
somewhat bounds the design of the GC.

I think it is a little disingenuous to say that a Julia programmer does not
have to worry about types. Type inference alleviates many burdens, but correct
typing of arguments is essential (and hidden promotions or casting can kill
performance). So while you can write correct programs easily, for efficient
programs you end up worrying about this quite a lot.

~~~
ScottBurson
> Functional collections may work well with a state of the art GC, with
> Julia's not so much.

I don't know the specifics of Julia's GC, but this seems a strange thing to
say in 2017. Douglas Crosher's conservative generational collector for CMUCL
(also used in SBCL AFAIK) supports C interoperation and is entirely adequate
for handling the extra garbage that (admittedly) is generated when using
functional collections. I don't recall exactly when he wrote that collector,
but it must have been 20 years ago at least. It would be strange if Julia
weren't using something at least as sophisticated.

------
usgroup
I'm a big R user, but what the guy says is true. When you step off the happy
path, because the data is an awkward size or because you can't apply,
aggregate, or tidyr your way to a good solution, what results is often slow
and hideous.

I've switched to Clojure for data work. It's fast and data-centric, and you
can usually work out an elegant solution no matter how gnarly the problem. In
truth, despite the huge number of R packages, I only ever use a tiny portion
of them.

For working with matrices and datasets:

[https://github.com/mikera/core.matrix](https://github.com/mikera/core.matrix)

[https://github.com/emiruz/dataset-tools](https://github.com/emiruz/dataset-tools)

[https://github.com/emiruz/sparse-data](https://github.com/emiruz/sparse-data)

[https://github.com/clojure/data.csv](https://github.com/clojure/data.csv)

For optimisation, simulation and modelling:

[https://github.com/probprog/](https://github.com/probprog/)

For complex plots:

[http://www.gnuplot.info/](http://www.gnuplot.info/) (+ clojure dsl)

For reports (clojure outputs csv + images):

latex

orgmode + babel

Gorilla REPL (native Clojure notebook)

I still use R lots and lots but mostly in the way that I use awk, and for
shorter scripts. I don't use Incanter because it's been abandoned for so long.
I do feel that Clojure could definitely use a better notebook story than
Gorilla REPL.

~~~
bhnmmhmd
I've been considering Clojure recently for data science. In your experience,
could you tell me how Clojure stacks up against the likes of R and Python?

I guess the performance should be much better (thanks to the JVM), though the
learning curve is a bit steep. Oh, and one question about Lisp macros: in
your line of work, have you ever encountered a situation where a macro saved
the day?

Thanks.

~~~
dmichulke
I am not the addressee of your question, but I use Clojure for DS as well, so
I thought I'd expand on a few aspects of the sibling post.

\- Performance 1: It's really fast iff you avoid boxing, reflection and so on
(see the sketch after the next point). Even without those optimizations it's
mostly fast enough. However, 1 out of 10 times it gets too slow and you need
to change a few things in the code so it gets faster. That's easy and never
requires a lot of time (after the first time :P)

\- Performance 2: Memory consumption is sometimes quite high; in those cases
you should use arrays and records instead of vectors of maps. The JVM heap
args (e.g. -Xmx8g) are also your friend.
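
To make those two points concrete, a rough, illustrative sketch (the names are made up, not from any library mentioned above):

    ;; Warn whenever a call has to fall back to reflection:
    (set! *warn-on-reflection* true)

    ;; A record instead of a map of keywords: compact fields, fast access.
    (defrecord Obs [^double value category])

    (defn sum-values [obs]
      ;; (.value ^Obs o) is a direct field read -- no reflection, no boxing
      (reduce (fn [acc ^Obs o] (+ acc (.value o))) 0.0 obs))

    (sum-values [(->Obs 1.5 :a) (->Obs 2.5 :b)])   ;=> 4.0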

\- Macros: I never use them.

\- Filtering, aggregating, etc. is a breeze and usually a line of code. E.g.

      (->> a-vector-of-records
           (filter #(> (:id %) 100))            ;; all ids above 100
           (remove (comp #{:a :b} :category))   ;; without categories :a or :b
           (map :value)                         ;; take the value
           (reduce + 0))                        ;; and sum it

\- Tools: I use incanter.stats (for statistical things I haven't implemented
yet), incanter.charts for visualization, and incanter.optimize for linear
models. Other stuff (GLM, FFT, ...) I implemented myself, or I directly use
the libs Incanter uses.

\- For report generation (notebook-like) I typically use clj-pdf; its output
is usually the first deliverable of every DS task I have.

\- For learning: I came from Javaland, and I benefited heavily from the
4clojure koans for the practical stuff and "The Joy of Clojure" for the
"clojure way of thinking".

\- For using it: I had to "learn emacs" (and paredit and so on) at the same
time, which complicated everything and was also a drain on my motivation.
It's good if you have a colleague who already uses the setup; alternatively,
YouTube has many videos on this. Today I'd never switch back, because just
using Eclipse (or something else) makes me feel like I need ten seconds to
execute a thought instead of a keyboard shortcut. Once you're there, you
might also want to switch to i3wm (if you don't use it already).

------
e12e
My initial thought was that this was about a project building atop Clasp,
like CANDO:

[https://youtu.be/5bQhGS8V6dQ?t=322](https://youtu.be/5bQhGS8V6dQ?t=322)

[https://github.com/drmeister/clasp](https://github.com/drmeister/clasp)

[https://github.com/drmeister/cando](https://github.com/drmeister/cando)

But this does seem a bit more like Julia?

"Julia: to Lisp or not to Lisp?"

[https://www.youtube.com/watch?v=dK3zRXhrFZY](https://www.youtube.com/watch?v=dK3zRXhrFZY)

Then there's also the Axiom system, which I guess is primarily for symbolic
computation, but I don't know if it might make sense to use it as a building
block for statistical software?

[http://axiom-developer.org/](http://axiom-developer.org/)

------
souenzzo
Checkout this [https://github.com/pixie-lang/pixie](https://github.com/pixie-
lang/pixie) Clojure-inspired lisp over pypy/jit Performs almost equal
C/compiled languages.

~~~
nerdponx
Last I checked it was unmaintained. No longer true?

------
simonbyrne
Another talk on the difficulty of optimising R is this fantastic one by Jan
Vitek:
[https://www.youtube.com/watch?v=HStF1RJOyxI](https://www.youtube.com/watch?v=HStF1RJOyxI)

Also, my favourite R trick:

    # calling foo() makes foo delete itself from the environment it was defined in
    foo <- function() {rm(foo, envir=environment(foo))}

~~~
curiousgal
Ethan Hunt approves of that trick.

------
blokeley
When discussing Python, the authors missed the elephant in the room: NumPy.

When almost any statistical work is done in Python, we use NumPy arrays,
sometimes via libraries such as pandas or statsmodels. We seldom use native
Python types.

~~~
simonbyrne
This was from 2008: NumPy existed but it was pretty early days (and a pain to
install). I'm not sure if pandas had started at that point.

------
lispm
[https://github.com/blindglobe/common-lisp-stat](https://github.com/blindglobe/common-lisp-stat)

> Common Lisp Statistics -- based on LispStat (Tierney) but updated for Common
> Lisp and incorporating lessons from R.

------
lausiant
This thread is fascinating to me as someone who learned stats as an undergrad
using Lisp, then learned S, then learned about and started using R, and read
all of these forewarnings by Ihaka and Tierney.

The current era is both exciting and dispiriting from this perspective. It
seems like there's a lot of traction in this area with languages like Julia,
OCaml, Nim, to name just a few, which is wonderful. The discussions here are
great in this regard. However, it's somewhat frustrating that warnings like
the linked piece -- that have been around for a long time -- seem to have been
ignored. My personal experience, too, is that many of the claims of recent
languages regarding "C-like" speed are maybe overstated; my sense is that the
slowest toy benchmarks are accurate reflections of the bottlenecks that will
slow down a large program. For small programs, they are C-like; as
program/library length increases, you start to approach Python or R speeds,
which makes me wonder if it's better to just use C to begin with.

Relying on wrapped C, as in NumPy, is also misleading, because eventually you
bump up against a part of the program that can't be pushed down into lower-
level code.

I often wish that XLISP-STAT had taken off rather than R. I love the syntax
of both, but in retrospect it seems like there was a fork in the path of
numerical computing: one branch toward a more "native" approach, the other
toward a model where a high-level language is used to interface with a low-
level language, to abstract away some of the complexity. I understand the
rationale for both, but I kinda feel like the latter approach, which has
become more dominant, isn't really sustainable. Moreover, this is all
occurring while I've watched C++ become impressively abstracted; if things
like Eigen had been around earlier, I'm not sure Python or R would have
ascended as much as they have.

The "new" issue that seems to be arising all the time in these discussions is
parallel GC models and implementations. Not sure where this will all lead. If
it's lisp I'm going to spit out my drink.

