Show HN: How to analyse 100 GB of data on your laptop with Python (towardsdatascience.com)
230 points by maartenbreddels 5 days ago | 24 comments





> 99.97% of the passengers that pay by cash do not leave tips.

Correction - 99.97% of cash tips go unrecorded. Why pay tax if they never saw it happen?

Compare this to:

> 3.00% of the passengers that pay by card do not leave tips.


Yes, you are absolutely correct! The intention was just to show the data as it is. But I agree with your interpretation :)

(The author of the article).


How would this compare to ingesting the data into a locally running RDBMS, like Postgres, and applying some judicious indexing? Or maybe a local Spark cluster?

I'm tempted to benchmark this against better-known tech. Maybe someone has some insight to share?
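A minimal sketch of what the relational baseline for such a benchmark could look like, using the no-tip share per payment type quoted at the top of the thread as the query. Assumptions: the standard-library `sqlite3` module stands in for Postgres, the table and column names (`trips`, `payment_type`, `tip_amount`) are hypothetical, and the rows are invented toy data, not the taxi dataset.

```python
import sqlite3

# Toy rows, invented for illustration only: (payment_type, tip_amount).
rows = [
    ("cash", 0.0), ("cash", 0.0), ("cash", 1.5),
    ("card", 2.0), ("card", 0.0), ("card", 3.5), ("card", 1.0),
]

conn = sqlite3.connect(":memory:")  # in-memory SQLite as a stand-in for Postgres
conn.execute("CREATE TABLE trips (payment_type TEXT, tip_amount REAL)")
conn.executemany("INSERT INTO trips VALUES (?, ?)", rows)

# An index on the grouping column, as one example of "judicious indexing".
conn.execute("CREATE INDEX idx_trips_payment ON trips (payment_type)")

# Percentage of rides without a tip, per payment type.
query = """
    SELECT payment_type,
           100.0 * SUM(CASE WHEN tip_amount = 0 THEN 1 ELSE 0 END) / COUNT(*)
    FROM trips
    GROUP BY payment_type
    ORDER BY payment_type
"""
no_tip_share = dict(conn.execute(query).fetchall())
conn.close()

for payment_type, pct in no_tip_share.items():
    print(f"{payment_type}: {pct:.2f}% of rides without a tip")
```

For a real comparison, the same query would be timed against the full dataset in each system; the toy table here only shows the shape of the workload.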


Think you'd want to compare with a columnar database since this is an analytic workload, not a transactional one.

And a 4 node Spark cluster with Parquet or Arrow files and a Scala job will compare favorably to this because it's also a lazy evaluator and the benchmark problems here are embarrassingly parallel.

PS this is a very cool project!


I've never benchmarked against Postgres, but would be interested in the results. I once tried MonetDB, and it was orders of magnitude slower for simple calculations, so I stopped looking at RDBMSes after that.

I think they solve a different problem: smaller data, data/relational integrity, data mutations. Dataframes can take shortcuts here.

If you want to do a serious benchmark, feel free to contact us. Github: https://github.com/vaexio/vaex/issues Email: (I'm easy to google).


I am currently doing an internship as part of my master's degree where I am analyzing ~30 GB of data. I'm using Postgres + Python and it is working quite well, even on my 2014 MacBook Air.

It would indeed be interesting to see how this approach with Vaex compares to Postgres. Though I would be quite sad to give up SQL in favour of Pandas DataFrame indexing and Python looping :)


No Python looping happening in Vaex :), otherwise, we wouldn't get this performance.

We are also working on GraphQL support, with a Hasura-like API: https://docs.vaex.io/en/latest/example_graphql.html

I think GraphQL is easier in combination with front-end development, and you can tab-complete your way out. Early days for this sub-project, but I think it is very promising.


Ah, of course. It makes sense that you don't loop in Python.

This all seems pretty interesting. I will give it a go.


I've been switching a reporting system that did its analysis in Postgres to analysis in pandas (mostly business-stats type summaries).

It feels like growing wings and a jetpack. Almost everything is waaay easier and faster.


I found that python is very slow to use when analyzing billions of entries, if you do a function call per entry, because the overhead of a function call is so large in python (and may be much slower than what's actually inside the function). Even JavaScript can do this much faster.

Is there any way around that?


Use the scientific stack in python for that type of analysis.

Yes, anything with one function call per item is going to be slow. That, along with memory inefficiency of lists, is the reason why numpy exists.
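A small sketch of the memory point, under the assumption that the standard-library `array` module is an acceptable stand-in for a NumPy array (both store raw 8-byte doubles in one contiguous buffer, whereas a list stores pointers to separate float objects). The element count is arbitrary.

```python
import sys
from array import array

n = 100_000
as_list = [float(i) for i in range(n)]  # list of pointers to boxed float objects
as_array = array("d", as_list)          # one contiguous buffer of 8-byte doubles

# Memory for the list container plus every float object it points to.
list_bytes = sys.getsizeof(as_list) + sum(sys.getsizeof(x) for x in as_list)
# Memory for the array object, including its packed buffer.
array_bytes = sys.getsizeof(as_array)

ratio = list_bytes / array_bytes
print(f"list: {list_bytes} bytes, packed array: {array_bytes} bytes, ratio: {ratio:.1f}x")
```

The per-item function call overhead is a separate cost on top of this: vectorized libraries avoid it by running the loop in compiled code over the packed buffer.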


For simply-vectorizable analysis, sure, but in my work I often have to apply a nontrivial transformation to the data en masse, and I'd rather write a single function that defines the transformation on each row than try to wrestle the problem into one of matrix multiplications and additions.

Does numba help? In the article they use it to calculate distance taking into account the curvature of the earth (lol).

Indeed, numba, Pythran or cupy can be really useful. In the example of the article, it is 'simple' vectorized math. But in general any function can be added in Vaex. Those that go through numba or Pythran usually release the GIL and can get you an enormous speed boost.
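As an illustration of the kind of per-row function being discussed, here is a sketch of a great-circle distance. Assumptions: the thread does not show the article's actual formula, so the haversine formula and the mean Earth radius below are stand-ins, and `haversine_km` is a hypothetical name. It is written with the standard library only; a function of this shape (scalar math, no Python objects) is the sort of thing one would typically hand to a compiler such as numba.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an assumed constant for this sketch


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    # Clamp to 1.0 to guard against rounding error before taking asin.
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


# A quarter of the way around the equator.
quarter_equator = haversine_km(0.0, 0.0, 0.0, 90.0)
print(f"{quarter_equator:.1f} km")
```

Called once per row from a Python loop, this pays the function call overhead mentioned earlier in the thread; compiled or vectorized, the same arithmetic runs over whole columns.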

Indeed, numpy for numerical calculations. For strings, we have our own data structure based on Apache Arrow, but we plan/hope to move to Apache Arrow (in combination with numpy), since that's kind of the numpy++ for data science work.

Sorry, I don't have any information on this, but I appreciate the nice question.

Vaex seems to be very similar to Dask and Xarray. Which one to choose?

It is not similar to Dask, but similar to dask.dataframe. Dask.dataframe is built on top of Pandas, but that also means it inherits its issues, such as memory usage and performance. (BTW, totally a fan of Pandas).

Xarray is more about nd-arrays, less about ~tabular data.

Vaex is built from the ground up with the idea that you can never copy the dataset (1 TB datasets were quite common). We also never needed distributed computing, because it was always fast enough, and thus never had to use dask (although we're eager to support it fully).

Also, vaex is lazy but tries to hide it from you. For instance, if you add a new column to your dataframe, it will only be computed when needed (taking up 0 memory). However, in practice, you're not really aware of that. This means it feels more like pandas (immediate results) than dask.dataframe (no .compute()/.persist() needed).

I would say they are all complementary, with a small amount of overlap. Small data: use Pandas. Out of memory error: move to vaex. Crazy amount of data (100TB?) that will never fit onto 1 computer: dask.dataframe, or help us implement full dask support.
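The lazy-column idea described above (a new column costs no memory until something needs it) can be sketched with a toy class. This is not Vaex's implementation or API: `LazyFrame`, `add_column`, and the `evaluations` counter are hypothetical names made up for the illustration, and the column values are invented.

```python
class LazyFrame:
    """Toy dataframe: real columns hold data, lazy columns hold only a recipe."""

    def __init__(self, **columns):
        self._data = dict(columns)  # materialised columns
        self._lazy = {}             # name -> function that computes the column
        self.evaluations = 0        # how many times a lazy column was computed

    def add_column(self, name, func):
        # Only the recipe is stored; nothing is computed or allocated here.
        self._lazy[name] = func

    def __getitem__(self, name):
        if name in self._data:
            return self._data[name]
        # Computed on demand from the stored recipe; the result is not cached.
        self.evaluations += 1
        return self._lazy[name](self)


df = LazyFrame(distance_km=[1.0, 2.5, 4.0], duration_h=[0.1, 0.2, 0.25])
df.add_column(
    "speed_kmh",
    lambda f: [d / t for d, t in zip(f["distance_km"], f["duration_h"])],
)

evaluations_before_access = df.evaluations  # still zero: nothing computed yet
speeds = df["speed_kmh"]                    # the computation happens here
print(evaluations_before_access, df.evaluations, speeds)
```

From the caller's side, `df["speed_kmh"]` looks like any other column access, which is the "feels like pandas" effect: the laziness is there, but no explicit compute step is exposed.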


Dask's compatibility with Pandas makes it ideal. Very easy to use.

What sort of person hours went into this?

What do you mean by that?

A stylised way of asking how long it took.

Thanks :)

The first commit was in Jan 2014, and it took between 50-80% of my time until 2018, I think, a period in which I mostly developed it myself. After that, it is more difficult to say how much time was spent on it, and it wasn't only my time. Although I do most of the development, ideas and discussions with data scientists can be more important than just pure dev time. But it took some time :)


number of people * hours taken per person


