
What's wrong with statistics in Julia? - benhamner
http://www.johnmyleswhite.com/notebook/2014/11/29/whats-wrong-with-statistics-in-julia/
======
vegabook
Is Julia seriously targeting the R user base? Or, if it is honest with itself,
is it going after the Matlab people first? My sense is that engineers (Matlab
users) and to a certain extent scientists (where Python rules) will be drawn in, but
the stats crowd requires subtly different priorities, which seem to be alluded
to here. Graphics are one such priority. The excellent ggplot2 gets all the
glory, but base graphics are mega-fast, extremely robust, and deeply
customizable, and there is the unsung hero of grid graphics which provides an
extraordinary set of abstractions for building things like ggplot2 and/or
Lattice. My point is that so much graphical quality speaks _directly_ to one
of the key requirements of statisticians, where at the end of the analysis,
_communication_ is usually required. This is much less the case for engineers
or financiers (big matlab users) for example, where correct and fast answers
are the endpoint. Where is Julia on graphics? Last time I checked it was still
trying to interface to R and/or Matplotlib.

The other thing that intrigues me is Julia treating scalar computations as "at
least" as important as vector ones. This has the whiff of for loops, an
immediate eye-roller for R people, who are accustomed to vectorization
everywhere and, essentially, exclusively. I am not suggesting that Julia doesn't do vectors
well, just that, like any set of priorities, it is not catering _first_ for
statisticians, whose requirements are often quite different from those of
scientists and engineers who use Matlab and Python.

~~~
yummyfajitas
Julia does vectors well, but scalar computations are very important. The
problem with vector computations:

    z = x*(v+w)

This involves a memory allocation for the temporary v+w, and a second one for
the product of x with that result. One way to avoid this in numpy is careful
use of the out argument:

    from numpy import add, multiply, zeros

    z = zeros(shape=x.shape, dtype=float)
    add(v, w, out=z)       # z = v + w, written in place
    multiply(x, z, out=z)  # z = x * z, no extra allocation

An even faster way, which only requires traversing the arrays once, and is
more readable in my opinion, is a simple for loop:

    z = similar(x)
    for i = 1:length(x)
        z[i] = x[i]*(v[i] + w[i])  # single pass, no temporaries
    end

This also lets you more easily put if statements inside loops, etc. You can
also do cumulative calculations without creating intermediate arrays at all.
For example, one way to create a random walk is:

    from numpy import cumsum
    from scipy.stats import bernoulli

    walk = cumsum(bernoulli(p).rvs(1024))  # materializes all 1024 steps
    end_result = walk[-1]

Another way is:

    end_result = 0
    for i in 1:1024
        if rand() < p          # draw one Bernoulli(p) sample at a time
            end_result += 1
        end
    end

~~~
makeset
This is not a problem with vector computations, but with this particular
implementation. The vector expression in your example is by far the most
readable. It is literally the math you had in mind.

De-vectorization in code is like embedding ASM code: you had to write it out
because the compiler sucks. Good language design should favor lucid and
concise syntax, and a good compiler implementation should make it unnecessary
to circumvent that syntax for performance. In this case, the compiler should
implement those vector expressions without allocating unnecessary memory.
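
As it happens, fused elementwise syntax can deliver exactly this. A minimal
sketch in Julia, assuming the dot-broadcast fusion added in later versions of
the language:

    # One fused loop; no intermediate array is allocated for v .+ w.
    z = similar(x)
    @. z = x * (v + w)  # shorthand for z .= x .* (v .+ w)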

------
nkurz
I found some development discussion about the changes to missing values here:
[https://github.com/JuliaLang/julia/issues/8423](https://github.com/JuliaLang/julia/issues/8423)

I've yet to try out Julia, but seeing productive and intelligent discussion
like this followed quickly by execution certainly inspires confidence that
it's a language with considerable promise.

------
username223
> In a future blog post, I’ll describe how Nullable works in greater detail.

This is the part that interests me. If you're not using sentinel values like
NaN, then it seems like you're left with pointers (terrible) or tags and tag
tests (also pretty bad). If Julia can't use the processor's SIMD instructions
(or the GPU in the near future), it's not suitable for inner loops. Do you
special-case "Nullable{Double}" to use NaN as its "NA" value?

~~~
johnmyleswhite
Thankfully, the solution paths you're enumerating aren't the only viable
solutions to the problem of representing missing values. Because Julia has
value types (called immutables) and because Julia can compile functions based
on their input arguments' types so that the functions don't need to do any
run-time tag checking, a Nullable{Float64} value can be stored using at most
16 bytes -- 8 bytes for a Boolean flag indicating whether a value is present
or not and 8 bytes for the actual 64-bit floating point value. The latter 8
bytes can store arbitrary bits if the Boolean flag is set to indicate that the
Float64 value is missing.
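
Roughly speaking, a Nullable is just an immutable pair of a flag and a
payload. A simplified sketch of the type (constructors omitted):

    # Sketch of Nullable{T} as a value type; with a concrete T, functions
    # compiled for it need no run-time tag checks.
    immutable Nullable{T}
        isnull::Bool  # true when the value is missing
        value::T      # arbitrary bits are fine when isnull is true
    end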

If you're interested in the details of how Nullable{T} works, you can start
with
[http://github.com/JuliaLang/julia/pull/8152](http://github.com/JuliaLang/julia/pull/8152)
and then read through Julia's compiler code to understand how immutables are
represented in memory and how functions that employ them are compiled to
machine code.

UPDATE: Looking back at my answer, I should make clear that, when you want to
work with SIMD or GPU operations, you can (and we will) use a representation
of arrays of Nullable objects that places the values and the Boolean flags
into two separate arrays. The values array is then identical to a normal
64-bit floating point array and allows all of the same operations. Setting the
Boolean flags appropriately requires a bit more work. (How much work is
strongly dependent on the exact computation being performed.)
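
Concretely, a sketch of that two-array layout (the type name here is
illustrative, not a committed API):

    # Struct-of-arrays representation of a vector of nullable Float64s:
    immutable NullableVector
        values::Vector{Float64}  # contiguous, SIMD-friendly data
        isnull::Vector{Bool}     # parallel mask of missingness flags
    end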

~~~
nkurz
Yikes. While this sidesteps some of the issues in the other approaches, it
also doubles your memory usage and halves your potential SIMD performance.
Maybe this is better than the downsides of the alternatives, but speaking as
someone who optimizes large, slow R code with C++ extensions, to someone
trying to replace R with something better: this seems like a choice that may
haunt you in the future. If you can, I'd keep your options
open for a "Structure of Arrays" approach or "buddy bitmap" that keeps your
data contiguous and compact.

~~~
johnmyleswhite
We'll keep re-evaluating this over time, but I'm currently not sure it's that
serious a problem. For many numerically intensive computations, you'll want to
move over to a pure array of 64-bit floating point values before the long-
running computation begins, at which point you can adopt NaN == NA semantics
if appropriate. I believe the solution I outlined in my updated comment above
may already describe the buddy bitmap solution you're referring to. It
certainly describes a solution that I would refer to as a struct-of-arrays.
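
To make that hand-off concrete, a sketch (the helper name and the two-array
representation are illustrative):

    # Collapse the two-array representation into a plain Float64 array,
    # adopting NaN == NA semantics before a long-running computation.
    function to_nan_array(values::Vector{Float64}, isnull::Vector{Bool})
        out = copy(values)
        for i in 1:length(out)
            if isnull[i]
                out[i] = NaN
            end
        end
        return out
    end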

------
peatmoss
I've got a perhaps naïve question related to:

> In particular, I’d like to replace DataFrames with a new type that no longer
> occupies a strange intermediate position between matrices and relational
> tables.

Namely, why is an embedded SQLite database not used for all things tabular in
languages like R/Julia/Foo? I was thinking about this as I was attempting to
reconstruct a visualization in Racket using their (pretty good!) 2d plotting
facilities and lamenting not having a tabular data structure.

SQLite is embeddable. It has fast in-memory database support. It can operate
reasonably quickly on data that is stored to disk. It supports indexing. NULL
values already exist to represent missingness. SQLite allows for callbacks to
user-supplied functions, which I'd imagine could be created relatively easily
in something like Julia.

As a side benefit, it seems like a SQLite-oriented tabular data store could be
extended, like dplyr has done, to support other databases.

When I think about the use cases I've encountered where I've found myself
reaching for DataFrame or data.frame, I am struggling to think how a tightly
integrated SQLite wouldn't work.
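
For illustration, a rough sketch of what that could look like from Julia via
the SQLite.jl package (the exact API calls here are my assumption, not a
worked-out design):

    using SQLite, DBInterface

    db = SQLite.DB()  # fast in-memory database standing in for a data frame
    DBInterface.execute(db, "CREATE TABLE t (x REAL, y REAL)")
    DBInterface.execute(db, "INSERT INTO t VALUES (1.0, NULL)")  # NULL models missingness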

Are there Computer Science reasons why this is a silly idea? I know Pandas
claims nominally better performance than SQLite in some circumstances, but
then again SQLite has also recently seen some substantial performance gains.

~~~
soegaard
Racket comes with support for SQLite (the index of Scribble is built with help
from SQLite, for example). That is, it is possible to build such a tabular
data structure with reasonable effort.

It would be interesting to hear what kind of features such a tabular data
structure should have. A post on the Racket mailing list would be welcome.

~~~
peatmoss
Thank you for the example of SQLite being used with Scribble. I was playing a
little bit with the sqlite interface, but using it rather naively to pass
queries and return values.

Not quite sure I have the Racket chops to implement something like a
data.frame abstraction over an in-memory SQLite database (or even dplyr style
query construction). Maybe a project for after I get off Hacker News and
finish up a couple of articles that have needed finishing...

------
howeman
I'd be interested in seeing the things you think are done correctly as well.
We're working to build a statistics package in Go (github.com/gonum/stat).
It's good that people are taking a fresh look. Our capabilities are clearly
limited at the moment, and the features of Go are quite different from those
of Julia, but there's still a lot we can learn from each other.

~~~
johnmyleswhite
To echo what I said in the post, I think using Nullable as the representation
of missing values is the right strategy. I think it achieves sufficiently
strong performance while also providing a lot of other desirable features.

As for other things that we've done well, I think the Distributions package is
one of the best designed interfaces to probability distributions I've ever
seen in any programming language. I don't know if Go supports value types, but
the existence of value types in Julia allows each distribution to have its own
distinct type without having to pay any cost for maintaining object identity.
When combined with Julia's ability to do multiple dispatch, the Distributions
package allows one to encode analytic knowledge about probability
distributions directly into the type system. And the existence of parametric
types lets you talk about things like location-scale families in a natural way.
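
For a taste of what this looks like in practice:

    using Distributions

    d = Normal(0.0, 1.0)  # an immutable value, no object identity to maintain
    mean(d), var(d)       # closed-form answers dispatched on the type
    rand(d, 10)           # sampling through the same generic interface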

I'd also encourage you to look at the optimization libraries that have been
built for Julia. The language makes it easy to do automatic differentiation,
which is probably the most under-utilized tool in all of scientific computing.
In general, I think there's really high quality work being done on
optimization -- which I see as essential for enabling ad hoc statistical
modeling.
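
As a small illustration of the automatic-differentiation point, using the
ForwardDiff.jl package:

    using ForwardDiff

    f(x) = sin(x) + x^2
    ForwardDiff.derivative(f, 1.0)  # machine-precision derivative, no finite differences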

------
nyir
Isn't R's considerable flexibility to not evaluate eagerly, or to rewrite
function arguments entirely (and apparently to manipulate scope, which is a
new one for me), one of the major points of criticism? It always seemed to me
that having "proper" macros is a selling point rather than only an
approximation/emulation of them.

~~~
jacobolus
> _Isn't R's considerable flexibility to not evaluate eagerly, or to rewrite
> function arguments entirely (and apparently to manipulate scope, which is a
> new one for me), one of the major points of criticism?_

For anyone coming from a programming background, the way R deals with scope
and function arguments is incredibly confusing and frustrating, and causes all
kinds of nasty bugs. It seems pathological.

For several of the academic statisticians I’ve talked to though, R seems to be
their first programming language, and they learn (in a slightly fuzzy way, but
enough to do their work) the behavior of its APIs without thinking too deeply
about issues of scope or eager vs. lazy argument evaluation, etc. However it
works is “just the way it is.” When their expectations about what it should do
are violated, they futz with the code until it starts working again and call
it a day. People are typically using R to solve narrow concrete problems
instead of building systems out of it, so most end-users don’t really have
practice with thinking about API design. (R is similar to Matlab in this way.)

Personally I think many of R’s idioms are net negative for the language
because they make the abstractions much less simple/clean, and make it more
difficult to solve problems in general/adaptable ways. But for someone who
already has substantial amounts of time invested in learning R’s API quirks,
and doesn’t know any other way, it might not seem like a big deal.

~~~
johnmyleswhite
A fun test to see if R users understand how R works is to ask them to predict
the behavior of the following snippet of code:

    foo <- function(a, b) { if (a == 0) { return(1) } else { return(2) } }

    foo(0, stop("Error!"))

    foo(1, stop("Error!"))

~~~
Hansi
Why? Seems too simple. I normally go straight for drop=FALSE, the most evil
default in the language.

~~~
jghn
Agreed. There are many WTF things I can think of in R; this is absolutely not
one of them.

~~~
johnmyleswhite
My experience asking people about that snippet is that very few people can
correctly predict the results of that function without actually executing the
code. What interests me about the question is that many people's mental model
of how R programs are evaluated is often too vague to enable them to produce
sharp predictions about program behavior prior to executing code.

~~~
RA_Fisher
This is funny because I was heavily skeptical yet completely reversed my
opinion within 10 minutes. I correctly predicted the results, but I program R
professionally. Then I asked my partner who's a non-programmer to read the
snippet and her prediction was that the program would stop. That led me to
reflect back on your statement. I think the key word is "sharp". I correctly
predicted the result, but that's only because I trust lazy-eval heavily. Even
then, it's hard to really cement trust in lazy-eval (for me) as it's difficult
to mentally map. That's not exactly sharp after all!

~~~
jghn
"but I program R professionally"

To be fair to JMW, I used to as well - and perhaps more importantly I was
writing end user software in R and not doing my own data wrangling.

------
lottin
One of Julia's main selling points has been speed. They said there was no
reason a dynamic language had to be slow, but now we see that as soon as they
start adding basic features, such as support for missing values, it's
beginning to take a toll on performance. I wonder if Julia 1.0 will be any
faster than R or Python.

~~~
ninjin
> ...now we see that as soon as they start adding basic features, such as
> support for missing values, it's beginning to take a toll on performance.

Of course there is a price to pay for adding dynamic features, but only when
you use them. This has been the case for Julia as long as I can remember (for
example, fields of type `Any`). As far as I can see, isn't the whole point of
Julia that you can go towards the dynamic end of the spectrum if you need to,
while still being able to generate blazingly fast code by adhering to type
constraints wherever necessary? All this within the same language, unlike R
and Python, where you need to write extensions in a lower-level language and
deal with a C API.
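
A minimal sketch of that pay-as-you-go dynamism, using this era's
type-declaration keywords:

    # A concretely typed field lets the compiler emit direct machine code;
    # an Any field opts into boxing and run-time checks, but only here.
    immutable Fast
        x::Float64
    end

    type Flexible
        x::Any
    end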

------
gajomi
I am a big fan of much of the work going on with statistics in Julia.

I'd like to point out another (related) thing, though, that is wrong with the
state of statistics in Julia, in my opinion. Actually my problem has less to
do with statistics per se than with the mathematical foundations of
statistical calculations. There are, of course, many different approaches to
statistical inference (the Bayesian vs. frequentist camps being infamous among
these), but the calculations all come down to reasoning about probabilities,
which is a well-posed but not always easy task. The Julia developers, in their
wisdom, have recognized this, and as such have put together Distributions.jl,
the purpose of which is to provide datatypes and utility functions over
probability distributions (plus some vestigial methods for maximum likelihood
inference, which thankfully stay out of the way). If you haven't seen it,
check it out. It's got a nice design, I think.

But there is presently a severe limitation: the parameterized type hierarchy.
The requirement is that every distribution be Univariate, Multivariate, or
Matrixvariate in form, with support that is either Discrete or Continuous.
This obviously misses the general picture of the kinds of sets from which one
draws random variables, which plays into issues the article mentions (if the
probability distribution datatype can't model your data, you have to do ad hoc
things to account for it). Indeed, a huge chunk of the issues currently open
on the Distributions GitHub page essentially boil down to problems with
representing the sets and/or spaces from which elements of the distribution
are drawn (see the sketch after these links):

[https://github.com/JuliaStats/Distributions.jl/issues/147](https://github.com/JuliaStats/Distributions.jl/issues/147)
[https://github.com/JuliaStats/Distributions.jl/issues/309](https://github.com/JuliaStats/Distributions.jl/issues/309)
[https://github.com/JuliaStats/Distributions.jl/issues/224](https://github.com/JuliaStats/Distributions.jl/issues/224)
[https://github.com/JuliaStats/Distributions.jl/issues/283](https://github.com/JuliaStats/Distributions.jl/issues/283)
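
Roughly, the hierarchy that imposes this constraint looks like this
(simplified from the package source of this era):

    # Every distribution must fit one of the 3 x 2 combinations below:
    abstract VariateForm
    abstract Univariate    <: VariateForm
    abstract Multivariate  <: VariateForm
    abstract Matrixvariate <: VariateForm

    abstract ValueSupport
    abstract Discrete   <: ValueSupport
    abstract Continuous <: ValueSupport

    abstract Distribution{F<:VariateForm, S<:ValueSupport}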

End users have a variety of well-developed ideas about the sets their samples
belong to, and even the spaces they are drawn from, which currently exceed the
representational capacity of the existing types. In my opinion the way to fix
this is to start a separate library that focuses on types and methods for
representing and manipulating sets and spaces (i.e. topological information
attached to sets). This could then be consumed by the Distributions people as
well as by others modeling things outside of probability.

~~~
johnmyleswhite
I think it would be great to develop support for generic sets in a separate
package that could eventually be merged into Distributions. I'm not sure that
I think all of the issues you've linked to are related to that problem,
though. For example, the issue about using Vector{Rational} as an input to the
Categorical distribution type's constructor is relatively minor, since
there are very few cases in which you'd want to evaluate the PDF of a
Categorical variable using rational numbers rather than floating-point
numbers. As such, it's not clear that you'd want to have a
Categorical{Rational} type that's separate from the current Categorical type
(which could be called the Categorical{Float64} type).

~~~
gajomi
> _I'm not sure that I think all of the issues you've linked to are related to
> that problem, though._

It's probably true that many of these issues could be resolved without the
need for this generic-sets fantasy of mine. The resolution you propose for
Categorical{Rational} makes, I think, the right choice, given the tradeoffs
one currently has to make in designing these things.

The particular reason I considered this issue related to the problem of
representing sets is that if both the support of such a distribution and the
values of the pdf were rationals, one could support exact computation of
moments and/or cumulants and/or coefficients of generating functions and the
like. The "exactness" of these procedures becomes important when there is
meaningful information in the factorization of said
moments/coefficients/whatever. In fact, in these cases one might actually care
to know not just that they are Rationals but Rationals with such-and-such
restrictions, which is where more general type parameters would really start
to shine. So perhaps someday somebody will want it...

~~~
johnmyleswhite
That's true if Rational means Rational{BigInt}, but the default in Julia is
for Rational to mean Rational{Int}, which can potentially produce less
reliable answers than using Float64.
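
A quick sketch of the failure mode (exact behavior may vary by Julia version):

    # Rational{Int} has fixed-width integer components, so arithmetic on
    # large denominators overflows where Float64 would merely lose precision.
    r = 1 // typemax(Int)
    r * r                          # OverflowError: typemax(Int)^2 does not fit in Int
    big(1) // big(typemax(Int))^2  # exact, using arbitrary-precision integers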

