
Show HN: How to analyse 100 GB of data on your laptop with Python - maartenbreddels
https://towardsdatascience.com/how-to-analyse-100s-of-gbs-of-data-on-your-laptop-with-python-f83363dda94
======
darkstar999
> 99.97% of the passengers that pay by cash do not leave tips.

Correction - 99.97% of cash tips don't get recorded. Why pay tax on a tip
nobody saw happen?

Compare this to:

> 3.00% of the passengers that pay by card do not leave tips.

~~~
jovan31
Yes, you are absolutely correct! The intention was just to show the data as it
is. But I agree with your interpretation :)

(The author of the article).

------
isoprophlex
How would this compare to ingesting this in a locally running rdbms, like
postgres, and applying some judicious indexing? Or maybe a local spark
cluster?

I'm tempted to benchmark this against better known tech, maybe anyone has some
insight to share?

~~~
drblah
I am currently doing an internship as part of my masters degree where I am
analyzing ~30 GB of data. I'm using Postgres + Python and it is working quite
well, even on my 2014 MacBook Air.

It would indeed be interesting to see how this approach with Vaex compares to
Postgres. Though I would be quite sad to give up SQL in favour of Pandas
DataFrame indexing and Python looping :)
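For a sense of the Postgres + Python workflow mentioned above, here is a minimal sketch. It uses the stdlib `sqlite3` as a stand-in for Postgres (with `psycopg2` the DB-API calls look essentially the same); the `trips` table and its columns are made-up examples, not from the article:

```python
import sqlite3

# sqlite3 stands in for Postgres here; psycopg2 offers the same
# DB-API interface (connect / execute / fetchone).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trips (fare REAL, tip REAL)")
conn.executemany("INSERT INTO trips VALUES (?, ?)",
                 [(10.0, 2.0), (7.5, 0.0), (20.0, 4.0)])

# "Judicious indexing": an index on a frequently filtered column.
conn.execute("CREATE INDEX idx_fare ON trips (fare)")

# Aggregate in SQL instead of pulling every row into Python.
cur = conn.execute("SELECT AVG(tip / fare) FROM trips WHERE tip > 0")
print(cur.fetchone()[0])
```

The point of the sketch: the filtering and aggregation stay in the database engine, so Python only ever sees the final scalar.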

~~~
maartenbreddels
There is no Python looping in Vaex :) Otherwise we wouldn't get this
performance.

We are also working on GraphQL support, with a Hasura-like API:
[https://docs.vaex.io/en/latest/example_graphql.html](https://docs.vaex.io/en/latest/example_graphql.html)

I think GraphQL is easier to combine with front-end development, and you
can tab-complete your way through. Early days for this sub-project, but I
think it's very promising.

~~~
drblah
Ah, of course. It makes sense you don't loop in Python.

This all seems pretty interesting. I will give it a go.

------
Aardwolf
I found that Python is very slow when analyzing billions of entries if you do
a function call per entry, because the overhead of a function call in Python
is so large (and may cost far more than what's actually inside the function).
Even JavaScript can do this much faster.

Is there any way around that?

~~~
jofer
Use the scientific stack in python for that type of analysis.

Yes, anything with one function call per item is going to be slow. That, along
with memory inefficiency of lists, is the reason why numpy exists.
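A quick illustrative timing of the point above (exact numbers vary by machine; `fare_per_mile` and the arrays are made-up names, not from the article):

```python
import timeit
import numpy as np

def fare_per_mile(fare, distance):
    return fare / distance

fares = np.random.uniform(5, 50, 100_000)
dists = np.random.uniform(1, 10, 100_000)

# One Python function call per entry: call overhead dominates.
loop = lambda: [fare_per_mile(f, d) for f, d in zip(fares, dists)]

# One call over the whole array: the loop runs in C inside numpy.
vec = lambda: fares / dists

print("per-entry calls:", timeit.timeit(loop, number=1))
print("vectorized:     ", timeit.timeit(vec, number=1))
```

On typical machines the vectorized version is orders of magnitude faster, even though both compute exactly the same values.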

~~~
uoaei
For simply-vectorizable analysis, sure, but in my work I often have to apply a
nontrivial transformation to the data en masse, and I'd rather define a single
function that applies the transformation to each row than try to wrestle the
problem into one of matrix multiplications and additions.
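The trade-off described above can be seen directly; this is an illustrative sketch with a made-up `transform`, not code from the article. The row-wise style stays readable but still pays one Python call per row, while the whole-array form runs in C:

```python
import numpy as np

rows = np.random.rand(1000, 3)

# A nontrivial per-row transformation, written row-wise for clarity.
def transform(row):
    x, y, z = row
    return np.sqrt(x * x + y * y) * np.exp(-z)

# apply_along_axis keeps the row-wise style, but still makes one
# Python-level call per row, so it stays at interpreter speed.
out = np.apply_along_axis(transform, 1, rows)

# The same logic as whole-array operations runs in C.
x, y, z = rows.T
out_vec = np.sqrt(x * x + y * y) * np.exp(-z)
print(np.allclose(out, out_vec))
```

Tools like numba (next comment) aim to give you the first style at the speed of the second.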

~~~
Akababa
Does numba help? In the article they use it to calculate distance taking into
account the curvature of the earth (lol).

~~~
maartenbreddels
Indeed, numba, Pythran or cupy can be really useful. In the example of the
article, it is 'simple' vectorized math. But in general any function can be
added in Vaex. Those that go through numba or Pythran usually release the GIL
and can get you an enormous speed boost.
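A minimal sketch of the numba approach on the article's distance example, written so it still runs as plain Python if numba isn't installed (the constants and function name here are my own, not copied from the article):

```python
import numpy as np

try:
    from numba import njit   # JIT-compiles the function to machine code
except ImportError:
    def njit(func):          # fallback: run as ordinary Python
        return func

@njit
def haversine(lat1, lon1, lat2, lon2):
    # Great-circle distance in km, accounting for the curvature of
    # the earth (the haversine formula).
    r = 6371.0
    p1, p2 = np.radians(lat1), np.radians(lat2)
    dp = p2 - p1
    dl = np.radians(lon2 - lon1)
    a = np.sin(dp / 2) ** 2 + np.cos(p1) * np.cos(p2) * np.sin(dl / 2) ** 2
    return 2 * r * np.arcsin(np.sqrt(a))

print(haversine(40.7128, -74.0060, 51.5074, -0.1278))  # NYC -> London, roughly 5570 km
```

With `@njit` the per-call Python overhead disappears, and such compiled functions can release the GIL, which is where the big speed-ups come from.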

------
Jackie1122
Thanks for sharing...

------
deepakkhealani1
Sorry, I don't have any information related to it, but I appreciate your
nice question.

------
jononor
Vaex seems to be very similar to Dask and Xarray. Which one to choose?

~~~
maartenbreddels
It is not similar to Dask, but similar to dask.dataframe. Dask.dataframe is
built on top of Pandas, but that also means it inherits its issues, like
memory usage and performance. (BTW, I'm totally a fan of Pandas.)

Xarray is more about nd-arrays, less about tabular data.

Vaex is built from the ground up on the idea that you can never copy the
dataset (a 1 TB dataset was quite common). We also never needed distributed
computing, because it was always fast enough, and thus never had to use dask
(although we're eager to support it fully).

Also, vaex is lazy but tries to hide it from you. For instance, if you add a
new column to your dataframe, it will only be computed when needed (taking up
0 memory). However, in practice, you're not really aware of that. This means
it feels more like pandas (immediate results) than dask.dataframe (no
.compute()/.persist() needed).
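The virtual-column idea described above can be sketched in a few lines. This is a toy illustration of the concept only, not Vaex's actual internals or API; `LazyFrame` and its methods are invented names:

```python
import numpy as np

class LazyFrame:
    """Toy sketch of a dataframe with virtual (lazy) columns."""
    def __init__(self, **columns):
        self.columns = columns   # real, materialized arrays
        self.virtual = {}        # name -> expression over the frame

    def add_virtual(self, name, expr):
        # Store only the expression: zero memory until evaluated.
        self.virtual[name] = expr

    def __getitem__(self, name):
        if name in self.virtual:
            return self.virtual[name](self)   # computed on demand
        return self.columns[name]

df = LazyFrame(fare=np.array([10.0, 7.5, 20.0]),
               distance=np.array([2.0, 3.0, 4.0]))
df.add_virtual("fare_per_km", lambda d: d["fare"] / d["distance"])
print(df["fare_per_km"])   # only evaluated at this point
```

Because `__getitem__` hides the distinction, using a virtual column feels the same as using a real one, which is the "lazy but hidden" behavior described above.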

I would say they are all complementary, with a small amount of overlap. Small
data: use Pandas. Out-of-memory error: move to vaex. Crazy amount of data
(100TB?) that will never fit onto one computer: dask.dataframe, or help us
implement full dask support.

------
floki999
Dask’s compatibility with Pandas makes it ideal. Very easy to use.

------
slowenough
What sort of person hours went into this?

~~~
maartenbreddels
What do you mean by that?

~~~
make3
stylised way of asking how long it took

~~~
maartenbreddels
Thanks :)

The first commit was in Jan 2014, and it took between 50-80% of my time until
2018 I think, during which I mostly developed it myself. After that, it is
more difficult to say how much time was spent on it, and it wasn't only my
time. Although I do most of the development, ideas and discussions with data
scientists can be more important than pure dev time. But it took some time :)

~~~
Breza
Your commitment and the end result are both impressive

