
Why Julia's DataFrames Are Still Slow - johnmyleswhite
http://www.johnmyleswhite.com/notebook/2015/11/28/why-julias-dataframes-are-still-slow/
======
jpfr
Relevant video from juliacon on the type system core:
[https://www.youtube.com/watch?v=xUP3cSKb8sI](https://www.youtube.com/watch?v=xUP3cSKb8sI)

Julia's type system is geared towards JIT compilation. Methods are compiled
when they are first called, at which point full type information for the
arguments is available. That's quite enough for Matlab-style code with the
occasional JITted method. But Julia has one glaring disadvantage for
everything beyond that: the return type of a method cannot be specified or
enforced.
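A minimal illustration of the point (example mine, not from the article): since
the return type is not part of a method's signature, it can even depend on a
runtime value, and then the compiler cannot pin it down at all:

```julia
# The signature says nothing about the return type, so it is free to
# depend on the argument's *value* -- the compiler can only infer
# Union{Int, Float64} (or Any, in older Julia versions) for the result.
unstable(x) = x > 0 ? 1 : 1.0

a = unstable(5)     # an Int
b = unstable(-5)    # a Float64
```

Every caller of `unstable` inherits this uncertainty unless it adds its own
type assertion.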

1) With "black box" methods (where the return type cannot be inferred, as in
this DataFrames article) the code becomes horribly slow. And you have to dig
into the compiler's internal method representations (e.g. with
`@code_warntype`) to see the type inference results.
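A sketch of the failure mode, using a plain `Dict` of abstractly typed vectors
as a stand-in for a DataFrame (names invented here):

```julia
# The compiler knows only that cols[:x] is *some* Vector -- the element
# type is hidden -- so the iteration and every `s += v` below go
# through dynamic dispatch instead of compiling to a tight loop.
cols = Dict{Symbol,Vector}(:x => [1.0, 2.0, 3.0])

function colsum(cols)
    s = 0.0
    for v in cols[:x]
        s += v
    end
    return s
end

colsum(cols)   # 6.0, but slowly for large columns

# To see the inference results you have to inspect the method's
# internals, e.g.:  @code_warntype colsum(cols)
```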

2) It hurts the ability of Julia to produce binary executables. When the types
are not 100% inferrable, the entire JIT infrastructure needs to be dragged
along.

3) Types are not only an aid for the compiler but also an aid for the
programmer. With SIUnits [1] and method return types, Julia could even tell
you when the physics represented in the code is flawed!
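A self-contained toy version of that idea (the `Quantity` type below is
invented here as a stand-in for SIUnits.jl, which is far richer): encoding the
physical dimension in a type parameter turns a physics mistake into a
MethodError instead of a silently wrong number.

```julia
# Toy dimension checking: the dimension lives in the type parameter D,
# so `+` is only defined for quantities of the same dimension.
struct Quantity{D}          # D is a dimension tag, e.g. :length
    val::Float64
end
Base.:+(a::Quantity{D}, b::Quantity{D}) where {D} = Quantity{D}(a.val + b.val)

len1 = Quantity{:length}(2.0)
len2 = Quantity{:length}(1.0)
t    = Quantity{:time}(3.0)

len1 + len2     # fine: 3.0 with dimension :length
# len1 + t      # MethodError -- the physics is wrong
```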

If Julia's type system were stronger, it could become a prime platform to
develop Computer Algebra Systems (CAS). That could lead to a great unification
of symbolic and numerical "computation platforms". However, current Julia is
unable to represent the mathematics encoded in the type system of open source
CAS like Axiom [2]. Also note the github issue on Julia and dependent typing
[3].

Imho, there is still great potential in the Julia type system that could be
tapped without breaking existing code.

[1] [https://github.com/Keno/SIUnits.jl](https://github.com/Keno/SIUnits.jl)

[2]
[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.27....](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.27.2331&rep=rep1&type=pdf)

[3]
[https://github.com/JuliaLang/julia/issues/6113](https://github.com/JuliaLang/julia/issues/6113)

~~~
thenoether
> The return type of methods cannot be specified / enforced.

That's true in one sense - typed functions aren't really a thing yet (functor
types are, but they're more clumsy to use in a lot of cases).

With regards to your first point, you can actually specify the types of
returned values by utilizing assertions:

      julia> h{T}(x::T) = f(g(x))::Complex{T}
      h (generic function with 1 method)
    

Type inference should totally pick up on the return type declared here, and an
error is thrown if `f(x)` does not return something of type `Complex{T}`.

Assertions obviously don't address all of the concerns you listed, but I find
they still help a lot to address type inference "failures" in some places, and
seem to be underutilized in a few cases.

This doesn't help any memory boxing/slowness inherent to `f` or `g`, but
downstream methods (e.g. callers of `h`) can definitely benefit from the
explicit information.
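With a concrete (hypothetical) `f` and `g`, invented here just to exercise the
assertion, the idea looks like this:

```julia
# Hypothetical f and g; the point is the ::Complex{Float64} assertion.
g(x) = x + 1.0
f(x) = complex(x, x)                       # returns a Complex
h(x::Float64) = f(g(x))::Complex{Float64}  # return-type assertion

h(1.0)   # 2.0 + 2.0im -- passes the assertion

# If f returned a plain Float64 instead, calling h would throw a
# TypeError at the assertion rather than silently propagating a
# badly inferred type to h's callers.
bad_f(x) = x
bad_h(x::Float64) = bad_f(g(x))::Complex{Float64}
# bad_h(1.0)   # throws TypeError
```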

~~~
thenoether
> if `f(x)` does not return something of type `Complex{T}`

Sorry, meant to say "if `f(g(x))` does not return something of type
`Complex{T}`"

------
sevensor
Is there no way to take advantage of the fact that most columns, most of the
time, are filled with doubles? This is both the expected case and the thing we
want to go faster. I don't know compiler design, which is why I ask.

~~~
johnmyleswhite
I don't see how this could be done without putting information about the
internal representation of DataFrames into the compiler. Such an approach
would seem to require breaking important abstraction barriers and would couple
the language to a specific back-end for data representation that might need to
be replaced in the future to handle things like out-of-core data frames.
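The usual user-level mitigation, rather than a compiler change, is a "function
barrier": pull the column out once and pass it to an inner function, which
then gets compiled for the column's concrete type. A sketch, again with a
`Dict` standing in for a DataFrame (names invented here):

```julia
# One dynamic dispatch happens at the call to inner_sum; inside it,
# the loop runs on a concretely typed Vector{Float64}.
function inner_sum(v)
    s = 0.0
    for x in v
        s += x
    end
    return s
end

function outer_sum(cols)
    v = cols[:x]           # abstractly typed here...
    return inner_sum(v)    # ...but concrete inside inner_sum
end

outer_sum(Dict{Symbol,Vector}(:x => [1.0, 2.0, 3.0]))   # 6.0
```

This keeps the fast path in user code without teaching the compiler anything
about DataFrames' internal representation.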

------
IndianAstronaut
One thing I would really like to see happen is out-of-core data and statistics
in Julia, just like SAS. That isn't possible in either R or Python.

~~~
heydenberk
If I understand you, this is indeed possible and has been recently implemented
in Python. Take a look at dask:
[http://dask.pydata.org/en/latest/](http://dask.pydata.org/en/latest/)

> Users interact with dask either by making graphs directly or through the
> dask collections which provide larger-than-memory counterparts to existing
> popular libraries:

~~~
digitalzombie
Just to clarify: does OP's "out of core" mean concurrent processes?

~~~
chubot
I think he means operating on data that doesn't fit in memory, and as
mentioned I think R and Python both have packages for that.

------
jbssm
Is there any good alternative library to DataFrames?

