

Metaprogramming Python For Big Data - fixxer
http://tuulos.github.io/sf-python-meetup-sep-2013/#/

======
vtuulos
Author here: i just checked this morning - we have 100B+ rows in our deliroll
matrices, hosted on a single machine.

I'm happy to answer any questions.

~~~
storrgie
What did you use to make these slides?

~~~
vtuulos
Reveal.js
[https://github.com/hakimel/reveal.js/](https://github.com/hakimel/reveal.js/)

------
Fede_V
Awesome to see tech like numba being used like this. Are you guys going to
write a more 'techy' version of how you accomplish all of this?

~~~
vtuulos
Yes, definitely. We might be able to open-source some parts of the system
eventually as well.

~~~
Fede_V
Excellent. For sparse storage, do you use Scipy Sparse, or do you have your
own custom-built solution?

I know Continuum was working on a version of memmapped NumPy arrays (Blaze,
IIRC) which looked really interesting.

~~~
vtuulos
I started with Scipy Sparse. I ditched it for a custom solution so I could
support variable-length integers, run-length encoding etc.
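
Roughly, the idea is to store sorted row ids as varint-encoded deltas and to
run-length encode repeated values. A toy sketch of those two tricks (not our
actual on-disk format):

```python
# Toy sketch of the two encodings, not Deliroll's actual format.

def encode_varint(n, out):
    # LEB128-style variable-length unsigned integer: small numbers take 1 byte
    while True:
        byte = n & 0x7f
        n >>= 7
        if n:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return

def encode_sparse_indices(indices):
    # sorted row ids stored as varint deltas, so small gaps cost ~1 byte each
    out = bytearray()
    prev = 0
    for i in indices:
        encode_varint(i - prev, out)
        prev = i
    return bytes(out)

def rle_encode(values):
    # run-length encoding: [[value, run_length], ...]
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    return runs

print(len(encode_sparse_indices([5, 17, 18, 300, 301, 5000])))  # 8 bytes
print(rle_encode([0, 0, 0, 7, 7, 2]))  # [[0, 3], [7, 2], [2, 1]]
```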

------
3pt14159
This is next level. I've been doing this type of stuff manually for years; I
don't know why it never occurred to me to build out the general solution.

------
eliteraspberrie
Numba is impressive! If you use NumPy regularly, check it out:
[http://numba.pydata.org/](http://numba.pydata.org/)

------
parrotdoxical
This is awesome. While it's great how performant the approach is, I also
really dig how elegant the whole solution is -- using Postgres FDW with Numba
is very pragmatic and clean, while at the same time potentially extensible to
GPGPU. I might try and give this a go for some DSP stuff at some point.
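
For anyone curious what that glue could look like: with Multicorn (the Python
FDW extension for Postgres) the shape of it is roughly this. This is my own
sketch, not code from the slides; the wrapper class, column names and
predicate are made up.

```python
# Sketch only: a Multicorn-based foreign data wrapper that serves rows out of
# an in-memory NumPy column, scanned by a Numba-compiled kernel.
import numpy as np
from numba import jit
from multicorn import ForeignDataWrapper

@jit(nopython=True)
def scan_gt(values, threshold, out):
    # compiled filter kernel: writes matching row ids into 'out',
    # returns the number of matches
    n = 0
    for i in range(values.shape[0]):
        if values[i] > threshold:
            out[n] = i
            n += 1
    return n

class ColumnScanFDW(ForeignDataWrapper):  # hypothetical wrapper class
    def __init__(self, options, columns):
        super(ColumnScanFDW, self).__init__(options, columns)
        self.price = np.load(options['path'])  # one column, loaded once

    def execute(self, quals, columns):
        # Postgres hands the WHERE clause down via 'quals'; here a fixed
        # predicate stands in for compiling quals into a kernel
        hits = np.empty(self.price.shape[0], dtype=np.int64)
        n = scan_gt(self.price, 100.0, hits)
        for i in hits[:n]:
            yield {'row_id': int(i), 'price': float(self.price[i])}
```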

------
mathattack
Thanks for sharing. It's interesting when people walk through their thought
process.

------
wedesoft
I did something similar with Ruby. I used GCC and on-the-fly linking for JIT
compilation: [http://www.wedesoft.de/hornetseye-api/](http://www.wedesoft.de/hornetseye-api/)

------
virmundi
It looks really neat, but isn't 660 GB not all that much, really? I grant that
the slides say they use an optimized binary format for storage, but how does
this compare to pandas?

~~~
vtuulos
660GB was just a small benchmark. The real thing uses more than a petabyte of
raw data.

Pandas uses NumPy internally. You could use Deliroll as a replacement for
NumPy in Pandas to get a nice interactive environment for amounts of data that
can't be easily handled with plain NumPy.

~~~
virmundi
This is interesting. My current project is a fraud detection system. We
currently leverage Cascading/Hadoop. But I wanted to make sure the system is
not Hadoop-centric. So I made a point of having the system be language
agnostic. It looks like there might be a fit for this tool.

I passed the slides along to my team to see what they think. If the slides
just impress upon the team that we need to store something other than
0x0A-delimited text files, I'll consider it a win.

------
CraigJPerry
So is Numba more like a Cython replacement? I know nothing about it; I'd just
assumed it was a NumPy replacement.

~~~
pwang
You can think of it like that. It's a Python -> machine code compiler based on
LLVM, and it uses NumPy types (and Blaze types) to do type inference on
numerical and data-transformation functions.

[http://numba.pydata.org/](http://numba.pydata.org/)
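
Concretely, you can let it infer the types from the NumPy arguments on the
first call, or compile eagerly against an explicit signature; a toy example:

```python
import numpy as np
from numba import jit, float64

# types are inferred from the NumPy arguments on the first call
@jit(nopython=True)
def dot(a, b):
    total = 0.0
    for i in range(a.shape[0]):
        total += a[i] * b[i]
    return total

# or compiled eagerly for a fixed signature
@jit(float64(float64[:], float64[:]), nopython=True)
def dot_f64(a, b):
    total = 0.0
    for i in range(a.shape[0]):
        total += a[i] * b[i]
    return total

x = np.random.rand(1000)
print(dot(x, x), dot_f64(x, x))
```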

------
doug1001
Really impressive work; I like how you guys (apparently) began with a blank
page and set aside at least a few stale assumptions that most consider
inviolate principles in DW design, e.g. denormalization and star/snowflake
schemas.

------
meowface
Very cool. Impressive that they got that kind of efficiency with Python.

------
agumonkey
Beautiful work. Also, I read lisp on the last two slides.

------
cbsmith
Or just use FastBit-Python directly...

