
Show HN: Vaex - Out of Core Dataframes for Python and Fast Visualization - maartenbreddels
https://medium.com/vaex/vaex-out-of-core-dataframes-for-python-and-fast-visualization-12c102db044a
======
themmes
First of all, great to see more power tools to choose from for my DS workflow!

However, I am surprised to see no mention of Dask in the article. How do these
libraries compare?

~~~
maartenbreddels
Dask and vaex are not 'competing'; they are orthogonal. Vaex could use Dask to
do the computations, but when this part of vaex was built, Dask didn't exist.
I recently tried using Dask instead of vaex's internal computation model, but
it gave a serious performance hit.

There is some overlap with dask.dataframe; I think it is closer to pandas
than vaex is. Vaex has a strong focus on large datasets, statistics on N-d
grids, and visualization. For instance, calculating a 2d histogram over a
billion rows can be done in < 1 second, which can be used for visualization or
exploration. The expression system is really nice: it allows you to store the
computations themselves, calculate gradients, do Just-In-Time compilation, and
it will be the backbone for our automatic machine learning pipelines. So vaex
feels like pandas for the basics, but adds new ideas that are useful for
really large datasets.
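
Roughly, the kind of call I mean looks like this (just a sketch; the file and
column names are made up, but open/mean/count with binby/shape/limits is the
public API):

    import vaex

    df = vaex.open('big.hdf5')             # memory-mapped, nothing is read yet

    # pandas-like basics
    print(df.mean(df.x))

    # a binned statistic on a 2d grid (i.e. a 2d histogram) in one call
    counts = df.count(binby=[df.x, df.y], shape=256, limits='99.7%')
    print(counts.shape)                    # (256, 256) grid, ready for plotting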

~~~
themmes
How could I have missed that you're the author! Thanks for your extensive
answer; I will definitely try the library. And thanks again for Ipyvolume, it
has been very useful so far.

~~~
maartenbreddels
thanks!

------
JPKab
Such phenomenal work.

BTW, for anyone on a Windows machine, getting this to work takes only a trivial workaround.

There is a Unix-only library for locking files (fcntl) which prevents it from
working on Windows. To test it, I mocked that module on the path with a
function that returns 0.

Obviously, adding a check for the OS and switching to a cross-platform file
locker would be a great contribution. I'll see if I can make that happen in
the next week.
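
For the curious, the stopgap looked roughly like this (a sketch of the
workaround, not the proper fix): drop a dummy fcntl.py somewhere early on
sys.path before vaex is imported, with no-op stand-ins for the names the real
Unix module exposes.

    # fcntl.py -- no-op stand-in for the Unix-only module (Windows stopgap only)
    LOCK_SH = LOCK_EX = LOCK_NB = LOCK_UN = 0

    def flock(fd, operation):
        # pretend the lock was always acquired
        return 0

    def lockf(fd, operation, length=0, start=0, whence=0):
        return 0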

~~~
maartenbreddels
There is an issue open for this:
[https://github.com/vaexio/vaex/issues/93](https://github.com/vaexio/vaex/issues/93)
It should already be fixed; a more detailed report (installed version numbers)
would be good to have.

~~~
maartenbreddels
Oh, and thanks for the kind words!

------
rax
It looks quite nice, and I will have to explore the performance comparisons
with Dask more.

I have recently started using Xarray for some projects, and really appreciate
the usability of multidimensional labelled data. Are the memory mapping
techniques used for speedup here only applicable to tabular data?

The support for Apache Arrow is quite nice. Have you considered any other
formats, such as Zarr?

~~~
maartenbreddels
Thank you. Memory mapping could be used for other data as well, and I have
looked into Zarr (I even opened an issue for that:
[https://github.com/zarr-developers/zarr/issues](https://github.com/zarr-developers/zarr/issues) ).
Memory mapping of contiguous data makes life much easier (for the application
as well as the OS); chunked data could be supported, but it requires more
bookkeeping.
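
To illustrate the contiguous case (a minimal sketch with an invented file
name, not vaex internals): a flat binary column maps straight into a numpy
array and the OS pages it in on demand.

    import numpy as np

    col = np.memmap('column.bin', dtype=np.float64, mode='r')
    print(col[:5])       # touches only the first page of the file
    print(col.mean())    # streams through the file without loading it all into RAM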

~~~
ah-
I'll need to have a closer look later, but would vaex work with memory-mapped
files that carry some indexing?

E.g. parquet supports column indexes now:
[https://issues.apache.org/jira/browse/PARQUET-1201](https://issues.apache.org/jira/browse/PARQUET-1201)

------
sevensor
It uses HDF5, which is itself a great file format, well suited for big tables
of numbers. It is good for similar reasons as SQLite3, but for different
applications: it is not a relational database, and columns are more strongly
typed. It is better suited when you have hundreds or thousands of columns, and
worse when you're trying to query a particular row.
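
A toy example of the "big table of numbers" case (file and column names
invented): each column is a typed, contiguous dataset, so scanning one column
never touches the others.

    import numpy as np
    import h5py

    with h5py.File('table.h5', 'w') as f:
        f.create_dataset('price', data=np.random.rand(1_000_000))
        f.create_dataset('volume', data=np.random.randint(0, 100, 1_000_000))

    with h5py.File('table.h5', 'r') as f:
        print(f['price'][:10])   # cheap column-wise read; row lookups need a full scan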

------
angelmass
Very interesting! I will share it with my DS friends.

One thing I have struggled to optimize is the visualization and coordinate
calculation of network graphs with tens of millions of edges and nodes, using
networkX and most visualization tools. Have you looked into this as a use case
for Vaex? Reading your article, it sounds like it would be well suited to it.

~~~
bayesian_horse
The bigger question is what you want to achieve by visualizing so many nodes.
If you want a map that can be zoomed in to view individual nodes, you mainly
need to compute coordinates for every node. Finding the arrangement of the
nodes is probably what gets you into trouble, so you probably need a custom
layout algorithm that scales better (and, probably, produces poorer layouts).

More interesting may be to identify clusters and either group them together or
visualize the clusters as nodes themselves.
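
A rough sketch of that second idea: collapse detected communities into single
nodes before drawing. greedy_modularity_communities is networkx's built-in
detector; at tens of millions of edges you would swap in something faster.

    import networkx as nx
    from networkx.algorithms.community import greedy_modularity_communities

    G = nx.karate_club_graph()                    # stand-in for the real graph
    communities = greedy_modularity_communities(G)
    node_to_comm = {n: i for i, c in enumerate(communities) for n in c}

    # contract each community into a single node
    H = nx.Graph()
    for u, v in G.edges():
        cu, cv = node_to_comm[u], node_to_comm[v]
        if cu != cv:
            H.add_edge(cu, cv)

    print(H.number_of_nodes(), 'cluster-level nodes')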

------
ah-
Great to see that you're supporting Apache Arrow! That makes it so much easier
to gradually switch over.

~~~
wesm
Note: Vaex has its own memory model. If you input Arrow, it converts to the
Vaex data representation. Details here:

[https://github.com/vaexio/vaex/blob/master/packages/vaex-
arr...](https://github.com/vaexio/vaex/blob/master/packages/vaex-
arrow/vaex_arrow/convert.py)

One of the primary objectives of Apache Arrow is to have a common data
representation for computational systems, and avoid serialization /
conversions altogether.

~~~
maartenbreddels
That is not correct; I just refer to the buffers/memory, with zero copying
going on. Vaex is not really opinionated about the memory model, actually. The
only exception is the bitmasks, which are copied for now because of an
incompatibility with numpy. But if I get a 50GB Arrow dataset, vaex leaves the
structure intact. Thanks for your work on Arrow; I hope to support and
contribute more to it in the future.
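
The general idea, outside of vaex (a generic illustration, not the vaex code
under discussion): numpy can wrap an Arrow data buffer without copying it.

    import numpy as np
    import pyarrow as pa

    arr = pa.array(np.arange(10, dtype='float64'))
    validity, data = arr.buffers()                # Arrow validity bitmap + data buffer
    view = np.frombuffer(data, dtype='float64')   # zero-copy view of the same memory
    print(view)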

~~~
wesm
I'm looking at the code I linked, and you are serializing in the general case;
it is not zero copy. Unpacking a bitmap is not free.

~~~
maartenbreddels
The 'convert' name is perhaps misleading; maybe we can agree the proof is in
the execution time:
[https://youtu.be/TlTcQJPUL3M?t=478](https://youtu.be/TlTcQJPUL3M?t=478)
Anyway, let us celebrate wider adoption of Arrow! :)

------
wenc
Nice work. This looks like it could add a lot of value to a DS's toolbox.

Exploratory data analysis of large (but not huge) datasets has always been a
slow and frustrating experience.

In the enterprise, we have plenty of datasets that are hundreds of millions to
a few billion rows (and many columns): big enough to make conventional tools
sluggish, but not quite big enough to warrant distributed computing. It sounds
like vaex can help with EDA of these types of datasets on a single machine.
I'd be interested in exploring the out-of-core functionality, which I hope
means it will keep chugging along without throwing "out of memory" errors.

~~~
maartenbreddels
That is exactly the sweet spot for vaex, and with a familiar DataFrame API
(read: pandas-like) the transition does not hurt so much. It may sound cool to
set up a cluster, but in many cases it is overkill, and vaex can get these
kinds of jobs done.
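
In practice that workflow looks something like this (a sketch; the file and
column names are invented): the data stays on disk, filters and derived
columns are lazy, and only the reductions stream over the full dataset.

    import numpy as np
    import vaex

    df = vaex.open('events.hdf5')             # hundreds of millions of rows, memory-mapped
    sub = df[df.amount > 0]                   # lazy filter, no copy
    sub['log_amount'] = np.log(sub.amount)    # virtual column, evaluated on the fly
    print(sub.mean(sub.log_amount))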

------
aw3c2
> For example, it takes about a second to calculate the mean of a column in
> regular bins even when the dataset contains a billion rows (yes, 1 billion
> rows per second!).

A billion 32-bit floating point numbers are 4 gigabytes. How can that be
processed in one second unless there was some preprocessing?

~~~
fulafel
Desktop PCs have about 35 GB/s of memory bandwidth and can do compute at ~200
Gflops, so this is just ~10% of peak bandwidth and leaves you a budget of 200
flops of computation per float value. If all 4 columns are accessed, there is
still enough bandwidth (no idea whether the data here was in a columnar layout
or not).

The relevance to big data or out-of-core computation is left hazy; wouldn't
that make this I/O bound in most cases? 4 GB fits easily in memory and is just
mmap'ed from the OS disk cache if the data was recently touched. I guess with
4 columns you get to 16 GB, which might be pushing it on a laptop.
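
Back of the envelope, for concreteness (just redoing the arithmetic above):

    n = 1_000_000_000                 # rows
    col_gb = n * 4 / 1024**3          # one float32 column: ~3.7 GiB
    print(col_gb / 35)                # ~0.11 s if purely memory-bandwidth bound at 35 GB/s
    print(200e9 / n)                  # ~200 flops of budget per value at 200 Gflops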

~~~
maartenbreddels
You are right, I'm actually underselling it. 1 second is the typical
performance for doing a 2d histogram (or other binned statistics) since it
involves writing to memory as well.

I just ran a quick benchmark:

    In [7]: %timeit -r3 -n3 df.mean(df.ra)
    330 ms +- 5.46 ms per loop (mean +- std. dev. of 3 runs, 3 loops each)

    In [11]: f'{len(df):,}'
    Out[11]: '1,692,919,135'

    In [12]: 330/len(df)*1e9
    Out[12]: 194.92957057278463

so it is 0.2 seconds for 1.7 billion rows, which is:

    In [15]: (len(df)*8/1024**3)/0.2
    Out[15]: 63.066152296960354

63 GB/s. (This is a high-end machine; on my laptop I get ~12 GB/s.)

We do not use float32 much in science, since you really need to know what you
are doing to avoid losing precision. It does give some extra performance boost
(not much though), and it also saves memory and cache.

~~~
aw3c2
Is this cold data? Or already in RAM? What about a billion rows that are not
in RAM yet?

How does it compare to plain numpy or pandas?

------
stestagg
This is big news.

I've used similar proprietary libraries before, and virtual operations can be
really powerful.

~~~
maartenbreddels
Thank you. Yes, they give much more flexibility: optimization (JIT),
derivatives, checking your calculations afterwards, sending them to a remote
server, etc. Glad you like that :)
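
As a concrete taste (a sketch using the small built-in demo dataset): a
derived column is kept as an expression rather than materialized, which is
what makes those tricks possible.

    import vaex

    df = vaex.example()                     # small built-in demo dataset
    df['r'] = (df.x**2 + df.y**2)**0.5      # virtual column: only the expression is stored
    print(df.mean(df.r))                    # evaluated lazily, in chunks, when needed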

------
colobas
Does it have Python 3 support? I tried installing it in a Python 3.7
environment and it failed.

EDIT: I then tried a Python 3.6 environment and it worked. I guess that
answers my question.

~~~
maartenbreddels
Absolutely, I think nowadays the question should be: 'does it still support
Python 2?' (it does, btw)

My question to you is: would you be so kind as to open an issue describing the
failure at
[https://github.com/vaexio/vaex/issues](https://github.com/vaexio/vaex/issues)
? Please share which OS, which Python distribution (Anaconda maybe), and/or
the installation steps and error message.

