
A Billion Rows per Second: Metaprogramming Python for Big Data [video] - BuffaloSweat
https://thenewcircle.com/s/post/1540/a_billion_rows_per_second_metaprogramming_python_for_big_data_ville_tuulos_video
======
pjvds
My takeaway is that the reason this is possible is that they care about
data structures. A language can give you an order of magnitude in performance,
but - according to Ville - you can get a nearly unbounded improvement if you
rethink the algorithms and data structures.

~~~
mathattack
Structure of the database matters a lot. Going column-oriented [1] will
improve performance dramatically before you scale up the processing power.

[1] http://en.wikipedia.org/wiki/Column-oriented_DBMS
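
As a rough illustration of why (made-up field names, nothing to do with
their actual schema): aggregating one field in a columnar layout scans
contiguous memory, while a row layout strides across whole records.

    import numpy as np

    n = 1000000
    # Row-oriented: whole records interleaved in one block of memory.
    rows = np.zeros(n, dtype=[("price", "f8"), ("qty", "i4"), ("region", "i4")])
    # Column-oriented: the field copied out as its own contiguous array.
    price = np.ascontiguousarray(rows["price"])

    rows["price"].sum()  # strided: skips over qty/region at every record
    price.sum()          # sequential scan of one contiguous column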

------
iskander
Cool use of Numba. Does anyone know if there's more information available
about their query language and what kinds of Python expressions they
dynamically generate?

~~~
lcampbell
From my understanding of the video, analysts issue queries as
frontend-generated or hand-written SQL. The SQL query is parsed by
PostgreSQL, which forwards it via FDW[1] to Multicorn[2]. Their custom data
storage and processing backend implements the API expected by Multicorn
(i.e., you implement the multicorn.ForeignDataWrapper interface); this is
where they transform the parsed, serialized SQL into their custom DSL (the
metaprogramming bit), which compiles to LLVM. A rough sketch of the Multicorn
side follows the links below.

--

[1] http://wiki.postgresql.org/wiki/Foreign_data_wrappers

[2] http://multicorn.org/
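
To make the plumbing concrete, here's a minimal wrapper against Multicorn's
documented Python API. The class and the dummy rows are my guess at the
shape, not their actual code; their backend presumably pushes the quals down
into its compiled DSL instead of yielding rows from pure Python:

    from multicorn import ForeignDataWrapper

    class LogTableFDW(ForeignDataWrapper):
        # PostgreSQL instantiates one wrapper per foreign table.
        def __init__(self, options, columns):
            super(LogTableFDW, self).__init__(options, columns)
            self.columns = columns

        # Called once per query. quals holds the WHERE conditions that
        # PostgreSQL pushes down; a backend like theirs would compile
        # them into an LLVM kernel rather than loop here in Python.
        def execute(self, quals, columns):
            for i in range(3):  # dummy rows for illustration
                yield dict((col, i) for col in columns)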

~~~
dialtone
Yeah, that's pretty much how it works. The frontend can be anything that
supports PostgreSQL as a database. Right now we use Tableau, but we also used
to have a custom WebUI on top of this service; it was very
functional-inspired.

Unfortunately Ville is on vacation right now, otherwise he'd be glad to dive
deeper into the details of how that piece works.

~~~
vtuulos
Author here: Fortunately HN works even in the Finnish countryside. I am happy
to answer any questions.

~~~
dev360
Very inspiring talk! Is it possible to deal with continuous data in those
matrices or is it more oriented around discrete values?

~~~
vtuulos
Thanks! Our approach supports both discrete and continuous values. It is
mainly optimized for the use case where you want to aggregate continuous
variables over discrete filters.
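
As a toy sketch of that pattern, and of the metaprogramming angle (simplified,
not our production code; the names and data are invented): generate the inner
loop for "sum a continuous column under a discrete filter" as Python source,
then JIT it with Numba.

    import numpy as np
    from numba import njit

    def make_kernel(filter_expr):
        # Build a loop specialized for this query's filter expression.
        src = ("def kernel(values, labels):\n"
               "    total = 0.0\n"
               "    for i in range(values.shape[0]):\n"
               "        if " + filter_expr + ":\n"
               "            total += values[i]\n"
               "    return total\n")
        ns = {}
        exec(src, ns)
        return njit(ns["kernel"])  # compile the generated loop to native code

    revenue = np.random.rand(1000000)            # continuous variable
    country = np.random.randint(0, 50, 1000000)  # discrete dimension
    kernel = make_kernel("labels[i] == 7")
    print(kernel(revenue, country))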

------
rtkwe
So in essence they do tons of pre-processing on their data. I wonder how long
the pre-processing takes compared to the speed gains it produces for them.

~~~
dialtone
It doesn't take that long, actually: about one hour to process a full day's
worth of data, and each day we take in about 10TB of uncompressed log files.
The result can be stored and reused as many times as you need.

~~~
aborochoff
Interesting, you're processing ~420GB (10TB / 24 hours) of uncompressed
loglines in an hour? Is the setup the same as in the presentation?

