Hacker News
Ibis: Scaling the Python Data Experience (ibis-project.org)
81 points by jkestelyn on July 20, 2015 | 10 comments



"A pandas-like data expression system providing comprehensive coverage of the functionality already provided by Impala. It is composable and semantically complete; if you can write it with SQL, you can write it with Ibis, often with substantially less code."

This sounds really interesting, but my biggest gripe with pandas is that I often know exactly the query I want to run in SQL, yet I have to jump through a ton of hoops and an awkward join syntax to express it in pandas. IMHO, if you want to make a data-processing language as full-featured as SQL, why not just use SQL as the query language?
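To make the gripe concrete, here is a hedged, hypothetical example (the tables and column names are made up) of the kind of join-plus-aggregate that is a single SQL statement but typically takes several chained calls (merge, groupby, rename) in pandas. Using the stdlib sqlite3 module to run the SQL:

```python
import sqlite3

# Hypothetical schema: orders joined to customers, summed per customer.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (id INTEGER, customer_id INTEGER, amount REAL);
CREATE TABLE customers (id INTEGER, name TEXT);
INSERT INTO orders VALUES (1, 1, 10.0), (2, 1, 5.0), (3, 2, 7.5);
INSERT INTO customers VALUES (1, 'alice'), (2, 'bob');
""")

# One declarative statement covers the join, the aggregation, and the sort.
rows = conn.execute("""
    SELECT c.name, SUM(o.amount) AS total
    FROM orders o JOIN customers c ON o.customer_id = c.id
    GROUP BY c.name
    ORDER BY c.name
""").fetchall()
print(rows)  # [('alice', 15.0), ('bob', 7.5)]
```

The pandas equivalent would be roughly `orders.merge(customers, left_on='customer_id', right_on='id').groupby('name')['amount'].sum()`, which is workable but harder to read back as the original query intent.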


Having designed pandas, and having suffered some of its rough edges when using it in a database-like way, I tried to be thoughtful in designing the Ibis API to be semantically complete with respect to SQL (i.e., so you shouldn't have to write any SQL) while being a great deal more productive and reusable than SQL. I suggest you give it a serious try before passing judgment!
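The core idea behind such an expression system can be sketched in a few lines. This is emphatically not the Ibis API, just a toy illustration (all class and method names here are invented) of how immutable, composable expression objects can compile down to SQL:

```python
# Toy sketch of an expression-to-SQL system. Each method returns a new
# immutable expression, so partial queries can be stored and reused.
class Table:
    def __init__(self, name, where=None, limit=None):
        self.name, self.where, self.limit = name, where, limit

    def filter(self, predicate):
        # Returns a NEW expression; the original is untouched.
        return Table(self.name, predicate, self.limit)

    def head(self, n):
        return Table(self.name, self.where, n)

    def compile(self):
        # Only at compile time is the expression tree rendered as SQL.
        sql = f"SELECT * FROM {self.name}"
        if self.where is not None:
            sql += f" WHERE {self.where}"
        if self.limit is not None:
            sql += f" LIMIT {self.limit}"
        return sql

t = Table("functional_alltypes")
expr = t.filter("int_col > 0").head(10)
print(expr.compile())
# SELECT * FROM functional_alltypes WHERE int_col > 0 LIMIT 10
```

Because expressions are values rather than strings, common subqueries can be factored out and reused, which is the productivity/reusability claim above.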


And separate from the issue of whether it's a good or bad idea, it will shortly be possible to send raw SQL queries to the backend.

https://github.com/cloudera/ibis/issues/412


How is this different/better than PySpark? https://spark.apache.org/docs/latest/programming-guide.html#...


That's directly addressed here: http://www.ibis-project.org/faq.html


What are the main differences between this architecture and Apache Spark's? The Python -> LLVM IR path looks like a nice advantage, but I can't see the main advantages over Spark.


It makes the most sense to compare Impala and Spark architecturally. Ibis will eventually integrate with Spark; we've been focusing on Impala integration for the reasons cited here: http://blog.cloudera.com/blog/2015/07/getting-started-with-i...

In particular, we're working on byte-level shared-memory integration with Impala (which is implemented in C++ with LLVM runtime codegen — the project's tech lead, Marcel Kornacker, was the tech lead for Google F1's query engine) to run user-defined logic without data serialization / memory usage overhead. This also opens up Python's HPC / scientific computing stack and existing data libraries to be run in a Hadoop setting without Python-JVM interoperability issues.
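The "no serialization / no copies" principle can be illustrated in pure Python with memoryview. This is only an analogy for the shared-memory integration described above, not its actual mechanism:

```python
# Illustrative only: zero-copy data sharing via memoryview. Slicing a
# view copies no bytes, and writes go straight through to the original
# buffer, which is the principle behind avoiding serialization overhead.
import array

buf = array.array("d", [1.0, 2.0, 3.0, 4.0])
view = memoryview(buf)   # zero-copy view over buf's bytes
half = view[2:]          # slicing a view copies nothing either
half[0] = 99.0           # writes through to the underlying buffer
print(buf.tolist())      # [1.0, 2.0, 99.0, 4.0]
```

The same idea at the Impala boundary means user-defined Python logic can operate directly on the engine's buffers instead of round-tripping through a serialization format.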


Are you planning on leveraging Numba, or will this be a new way to generate LLVM bitcode from Python?


I was wondering this also


Now I get it, thanks for the explanation, Wes. It sounds very interesting indeed. Congratulations on the project.





