Correction: 99.97% of cash tips go unrecorded. Why pay tax on them if nobody saw it happen?
Compare this to:
> 3.00% of the passengers that pay by card do not leave tips.
(The author of the article).
I'm tempted to benchmark this against better-known tech; perhaps someone has some insight to share?
A 4-node Spark cluster with Parquet or Arrow files and a Scala job will compare favorably to this, because Spark is also a lazy evaluator and the benchmark problems here are embarrassingly parallel.
PS this is a very cool project!
I think they solve a different problem: smaller data, data/relational integrity, data mutations. Dataframes can take shortcuts here.
If you want to do a serious benchmark, feel free to contact us.
Email: (I'm easy to google).
It would indeed be interesting to see how this approach with Vaex compares to Postgres. Though, I would be quite sad giving up SQL in favour of Pandas DataFrame indexing and Python looping :)
We are also working on GraphQL support, with a Hasura-like API:
I think GraphQL is easier in combination with front-end development, and you can tab-complete your way out. Early days for this sub-project, but I think it's very promising.
This all seems pretty interesting. I will give it a go.
It feels like growing wings and a jetpack. Almost everything is waaay easier and faster.
Is there any way around that?
Yes, anything with one Python function call per item is going to be slow. That, along with the memory inefficiency of lists, is the reason numpy exists.
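A minimal sketch of the point above: squaring a million numbers with one interpreter dispatch per element versus one vectorized numpy call. The exact timings vary by machine, but the per-item version loses by a wide margin.

```python
import time
import numpy as np

n = 1_000_000
xs = list(range(n))
arr = np.arange(n, dtype=np.float64)

# Per-item: one Python-level operation per element.
t0 = time.perf_counter()
squares_list = [x * x for x in xs]
t_list = time.perf_counter() - t0

# Vectorized: a single call; the loop runs in compiled C inside numpy.
t0 = time.perf_counter()
squares_arr = arr * arr
t_arr = time.perf_counter() - t0

print(f"python loop: {t_list:.4f}s  numpy: {t_arr:.4f}s")
```

Lists also store each number as a full boxed Python object with its own header and refcount, while a numpy array packs the raw values contiguously, which is where the memory-inefficiency part comes in.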
Xarray is more about nd-arrays, less about ~tabular data.
Vaex is built from the ground up with the idea that you can never copy the dataset (1TB dataset was quite common). We also never needed distributed computing, because it was always fast enough, and thus never had to use dask (although we're eager to support it fully).
Also, vaex is lazy but tries to hide it from you. For instance, if you add a new column to your dataframe, it will only be computed when needed (taking up 0 memory). However, in practice, you're not really aware of that. This means it feels more like pandas (immediate results) than dask.dataframe (no .compute()/.persist() needed).
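To illustrate the idea (this is a toy sketch, not vaex's actual internals): a "virtual column" just stores an expression, costs no memory when added, and is only evaluated when its values are actually requested.

```python
import numpy as np

class LazyFrame:
    """Toy dataframe where derived columns are computed on demand."""

    def __init__(self, **columns):
        self.columns = columns   # materialized numpy arrays
        self.virtual = {}        # name -> zero-argument callable

    def add_virtual(self, name, fn):
        # Nothing is computed here; the column takes up no memory yet.
        self.virtual[name] = fn

    def evaluate(self, name):
        if name in self.columns:
            return self.columns[name]
        return self.virtual[name]()  # computed only when needed

df = LazyFrame(x=np.arange(5), y=np.arange(5) * 2)
df.add_virtual("z", lambda: df.evaluate("x") + df.evaluate("y"))
print(df.evaluate("z"))  # the addition happens only now
```

The user-facing difference from dask.dataframe is that the evaluation is triggered implicitly by using the column, so it feels eager even though it isn't.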
I would say they are all complementary, with small amount of overlap. Small data: use Pandas. Out of memory error: move to vaex. Crazy amount of data (100TB?) that will never fit onto 1 computer: dask.dataframe, or help us implement full dask support.
The first commit was Jan 2014; it took between 50-80% of my time until 2018 I think, during which I mostly developed it myself. After that, it is more difficult to say how much time was spent on it, and it wasn't only my time. Although I do most of the development, ideas and discussions with data scientists can be more important than pure dev time. But it took some time :)