
Rust Dataframe: Update 1 - nevi-me
https://github.com/nevi-me/rust-dataframe/blob/master/notes/update-01__04-04-2020.md
======
quotemstr
One of the things that most pisses me off about numpy, pandas, and the whole
scipy ecosystem is that everything is "immediate mode": all operations
evaluate _instantly_ to new arrays or dataframes. There's no opportunity for
any component to evaluate a whole expression tree and optimize it, e.g., by
loop hoisting.

The _right_ way to design a data analysis DSL is to do it the way Python's
Dask does: build an operation graph and execute it as the last step. The
trouble is that Dask doesn't get it right either: as part of graph formation,
it computes the sizes of operands, and computing operand sizes can itself
involve huge amounts of computation, so Dask, _in effect_ , is also an
"immediate mode" system.

What I really want to see is something _lazy_ that can do sane _query
planning_ and that can work within limited system resources. Maybe one day
I'll open source the work I've done in this space. Query languages are
infinitely nicer for analytics than data processing libraries.

~~~
bobbylarrybobby
It really is remarkable that Python's numerical computing libraries have such
poor performance. When doing chains of elementwise operations on large arrays
(such as the toy example `cos(sqrt(sin(pow(array, 2))))` applied
elementwise), Julia appears to outperform Python by a factor of 2! Numpy
cannot avoid computing each
intermediate array, which means it has to allocate a ton of wasteful memory.
Meanwhile Julia does the smart thing and coalesces all operations into one and
applies that single operation elementwise, allocating only a single new array.

Pandas also does not defer computations, which means computing Boolean
functions that include the same data multiple times must make multiple passes
over said data. Absurd.
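
The intermediate-array behavior is easy to see. Here's a minimal sketch of
the toy chain above, where every step materializes a full-size temporary
before the next ufunc runs:

```python
import numpy as np

a = np.linspace(0.0, 1.0, 1_000_000)

# Each step allocates a brand-new full-size array before the
# next ufunc runs: four temporaries for one logical expression,
# and four separate passes over the data.
t1 = np.power(a, 2)
t2 = np.sin(t1)
t3 = np.sqrt(t2)
result = np.cos(t3)
```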

~~~
a_t48
numpy appears to have optional arguments for the storage location of outputs:
[https://docs.scipy.org/doc/numpy/reference/generated/numpy.c...](https://docs.scipy.org/doc/numpy/reference/generated/numpy.cos.html)
- Could you elaborate a little further? The syntax might not be as nice as
another language or framework, but it's not "unavoidable".

Disclaimer: have never used numpy, have used python fairly extensively.

~~~
bobbylarrybobby
You are right about the `out` argument, I'd forgotten about that. But even
avoiding the wasteful memory allocations, numpy _still_ is about 60% slower
than Julia, as it makes multiple passes over the input data. (If there's a way
to get numpy to just make a single pass over the data and remain performant,
I'd love to know.)
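
For reference, here's a sketch of the `out` trick applied to the toy chain:
allocations drop to a single reusable buffer, but numpy still makes one full
pass over the data per ufunc, which is the remaining gap versus a fused loop.

```python
import numpy as np

a = np.linspace(0.0, 1.0, 1_000_000)
buf = np.empty_like(a)

# Every ufunc writes into the same preallocated buffer, so no
# temporaries are allocated -- but each call is still a separate
# pass over all million elements (no fusion).
np.power(a, 2, out=buf)
np.sin(buf, out=buf)
np.sqrt(buf, out=buf)
np.cos(buf, out=buf)
```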

~~~
a_t48
The first thing I'd reach for is to refactor into a list comprehension. Looks
like this is the proper way to do it:

    
    
      import numpy as np

      for x in np.nditer(a, op_flags=['readwrite']):
          x[...] = np.cos(np.sqrt(np.sin(pow(x, 2))))
    

That's some gnarly syntax though. I'd never seen that ellipsis operator before.

Edit: just read about `Ellipsis`. I'm a fan, even if it's sort of nonstandard
across libraries. Those readwrite flags are a travesty, but at least
you can paper over them with a helper function.

Something like:

    
    
      def np_apply(a, f):
          for x in np.nditer(a, op_flags=['readwrite']):
              x[...] = f(x)

      np_apply(a, lambda x: np.cos(np.sqrt(np.sin(pow(x, 2)))))
    

or

    
    
      def np_apply(a, *fs):
          for x in np.nditer(a, op_flags=['readwrite']):
              val = x
              for f in fs:
                  val = f(val)
              x[...] = val

      np_apply(a, lambda x: pow(x, 2), np.sin, np.sqrt, np.cos)
    

Edit3: There's a way to turn this into "pythonic" list comprehension code, but
it would probably only make it look prettier rather than more performant.

------
j88439h84
Some thoughts on dataframe.

- Don't put methods on it like pandas.DataFrame. You won't get the API right
the first several tries, and you'll end up with a million methods.

- Make chaining easy.

- Use pure functions. No mutation.

- Get inspired by R's dplyr and data.table.
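
In miniature, the chaining-plus-pure-functions advice might look like this
(a hypothetical toy API for illustration, not any real library):

```python
class Frame:
    """Toy dataframe: a list of dicts, one per row."""

    def __init__(self, rows):
        self._rows = list(rows)

    def filter(self, pred):
        # Pure: returns a new Frame, never mutates the receiver.
        return Frame(r for r in self._rows if pred(r))

    def select(self, *cols):
        return Frame({c: r[c] for c in cols} for r in self._rows)

    def rows(self):
        return list(self._rows)

df = Frame([{"x": 1, "y": 2}, {"x": 5, "y": 6}])
out = df.filter(lambda r: r["x"] > 1).select("y")
# `df` is untouched; `out` holds [{"y": 6}]
```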

~~~
aldanor
"Pure functions" is usually a bad idea with large numeric data structures if
they are non-lazy, because you'll end up copying the data.

One common pure alternative is to make them lazy - i.e., all those pure
methods don't do anything immediately; they just collect information on what
to do with your data. But then you need to write an execution engine for an
arbitrary computation graph, which ain't easy.
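
The "collect now, execute later" idea in miniature (an illustrative sketch,
not any real library): pure methods just append a step to the plan, and a
single `collect()` runs the whole pipeline in one fused pass.

```python
import math

class LazyCol:
    def __init__(self, ops=()):
        self._ops = tuple(ops)  # the recorded plan; nothing executed yet

    def map(self, f):
        # Pure and cheap: returns a new node with one more step recorded.
        return LazyCol(self._ops + (f,))

    def collect(self, data):
        # The "execution engine" in miniature: one fused pass per
        # element, no intermediate columns materialized.
        out = []
        for x in data:
            for f in self._ops:
                x = f(x)
            out.append(x)
        return out

expr = LazyCol().map(lambda x: x * x).map(math.sin).map(math.sqrt).map(math.cos)
result = expr.collect([0.0, 0.5, 1.0])
```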

~~~
nightcracker
> "Pure functions" is usually a bad idea with large numeric data structures if
> they are non-lazy - because you'll end up copying the data.

This isn't necessarily true, thanks to Rust's strong ownership model. Methods
can take `self` by value (without a reference), which means the method takes
ownership of the value. There is no copy in `double` in the example below,
yet it is pure:

    
    
        #[derive(Clone)]
        struct A {
            v: Vec<i32>,
        }
        
        impl A {
            fn double(mut self) -> Self {
                for x in &mut self.v {
                    *x *= 2;
                }
                self
            }
        }
        
        fn main() {
            let a = A { v: vec![2, 3, 4] };
            let b = a.double();
            dbg!(b.v[1]);
        }
    

If you now tried to access `a` the compiler would error out, saying you're
trying to access a moved-from variable. If you still wanted to keep the
original `a` around you simply write `let b = a.clone().double()`.

------
Icathian
A good Rust equivalent to Pandas is really the main thing keeping me from
switching to Rust for about half of my day job. This is incredibly promising,
and I am following your work with great interest. Thank you for working on
this!

~~~
nevi-me
Out of interest, what are the common tasks that you'd be looking to achieve?
We can't replace Pandas and its ecosystem in the short term, so for me a
Rust backend for a dataframe that is compatible with Pandas would be a win.

~~~
Icathian
My workflow usually reads something like this:

1. Import data using hand-written SQL from SQL Server (read_sql)

2. Perform various filtering, aggregation, math (loc, iterrows/itertuples,
groupby, agg)

3. Push results either out to a delivery file or back to the source server
(to_csv, to_excel, to_sql).

Really bread and butter stuff, but the relative ease and stability of using
pandas to do it is the attraction.

------
lmeyerov
Getting a fast & safe Rust UDF layer that targets SPIR/CUDA/PTX would be quite
interesting wrt enabling RAPIDS.ai (libcudf + python bindings) as well. It'd
enable getting rid of slow & quirky numba etc - I remember Mozilla had GPGPU
Rust codegen experiments here a while back...

~~~
peterhj
rustc does have a working nvptx target today, though it’s not supported nearly
as well as the mainstream cpu targets, and some things you would really want
for gpu programming (e.g. shared memory address space) are not currently
exposed in the rust language. But kernels written in rust can compile to ptx;
you’ll still need to write glue code.

~~~
lmeyerov
Yeah, this would be about extending it to columnar analytics funcs, like
`df['x'].apply(f)` or `df.query("x > 10 && y < 10")`. I realize I may be
wrong about the compiler speed part; not sure if it'd be faster than numba
for codegen nowadays :)

------
forrestthewoods
What is a dataframe? I wish libraries would define their key terminology.
Especially when the term is rather generic.

~~~
macawfish
It's kinda like a spreadsheet with a programmatic interface. Check out pandas
for a nice introduction!
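
For a concrete first taste, a few lines of pandas:

```python
import pandas as pd

# Named, typed columns; spreadsheet-style selection and filtering.
df = pd.DataFrame({
    "name": ["ada", "grace", "edsger"],
    "score": [95, 88, 91],
})
top = df[df["score"] > 90]  # keep only rows with score > 90
```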

~~~
CameronNemo
Am I the only one who finds pandas to be terrible compared to dplyr?

~~~
Icathian
Everyone does. On the other hand I, and many others I assume, aren't willing
to move to R for the sake of the one superior tool over python. Two if you
count ggplot2, I suppose.

~~~
CameronNemo
Yeah but for giving someone an intro to what dataframes are, I think pandas
might leave a sour taste where dplyr would allow someone to learn comfortably.

Although, one might run into a "true level" situation. This is when Morty
feels what true level is like, then says "everything is crooked, reality is
poison" when he has to live in the "fake level" world.

------
nevi-me
TL;DR: I've been writing a Rust dataframe library for a while (on and off when
I have time). This is my first update, to motivate why I'm writing it.

~~~
nestorD
Nice, there are definitely people trying to push more of their data
science/ML stack to Rust, and a good dataframe implementation would be
useful.

As a side note, a small usage example in the readme would be good.

~~~
nevi-me
Thanks for the feedback. I was thinking that Update 2 would be in the form of
examples of what can currently be done. I'll also add that to the README.

------
amelius
What bothers me is: why can't Rust developers just import a C++ library that
does the job? What novelty would a Rust version of the same thing bring
really? Why not focus on real innovation, and use wrappers for things that
were already built in another language a decade ago?

~~~
adev_
I call that the isolated island syndrome.

If you wait long enough, any stable working solution will be reinvented in
every language.

It's completely unproductive. But there are reasons for it: compatibility
with the language toolchain, and understandability.

~~~
nevi-me
It has to be done at some point. I still have vcpkg on my Windows machine
because I needed to install it, and a few GB of other things, just to use
PostgreSQL from Rust a few years ago (with Diesel, I think).

If someone hadn't implemented the libpq protocol in Rust, I likely wouldn't
have been able to add binary copy support to the dataframe in one afternoon.

This reduces the barrier for people to productively use their favourite
languages. It might be unproductive at first, but we see the benefits as time
goes on.

~~~
adev_
> This reduces the barrier for people to productively use their favourite
> languages. It might be unproductive at first, but we see the benefits as
> time goes on.

It is vastly unproductive to reimplement anything useful in 10 different
languages just because our cross-language tooling is awful.

The lack of cross-language, dev-oriented tooling leads us to the ridiculous
situation where every language has its own package manager and is completely
unable to use a package made for another language.

There are reasons for that... but they are much more political than
technical, and that's the sad state we are in.

------
FridgeSeal
Oh this is really cool! I could definitely make use of something like this.

