
Ibis: Scaling the Python Data Experience - jkestelyn
http://www.ibis-project.org/
======
tdicola
"A pandas-like data expression system providing comprehensive coverage of the
functionality already provided by Impala. It is composable and semantically
complete; if you can write it with SQL, you can write it with Ibis, often with
substantially less code."

This sounds really interesting, but my biggest gripe with pandas is that I
often know exactly the query I want to run in SQL but have to jump through a
ton of hoops and a weird join syntax to figure out how to express it in
pandas. IMHO if you want to make a data processing language as full-featured
as SQL, why not just use SQL as the query language...
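
To make the gripe concrete, here is a minimal sketch of the same join-and-
aggregate written once in SQL (run through the standard-library sqlite3
module) and once in pandas. The tables, column names, and values are
hypothetical and invented purely for illustration; this shows plain pandas,
not the Ibis API.

```python
import sqlite3

import pandas as pd

# Hypothetical toy data, invented for illustration only.
order_rows = [(1, 10.0), (1, 5.0), (2, 7.5), (3, 2.5)]
customer_rows = [(1, "east"), (2, "west"), (3, "east")]

# SQL version: the query reads the way you would say it out loud.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer_id INTEGER, amount REAL)")
conn.execute("CREATE TABLE customers (customer_id INTEGER, region TEXT)")
conn.executemany("INSERT INTO orders VALUES (?, ?)", order_rows)
conn.executemany("INSERT INTO customers VALUES (?, ?)", customer_rows)
sql_result = conn.execute(
    """
    SELECT c.region, SUM(o.amount) AS total
    FROM orders o
    JOIN customers c ON o.customer_id = c.customer_id
    GROUP BY c.region
    ORDER BY c.region
    """
).fetchall()
conn.close()

# pandas version: the same logic becomes a chain of merge / groupby /
# rename / sort steps.
orders = pd.DataFrame(order_rows, columns=["customer_id", "amount"])
customers = pd.DataFrame(customer_rows, columns=["customer_id", "region"])
pandas_frame = (
    orders.merge(customers, on="customer_id", how="inner")
    .groupby("region", as_index=False)["amount"]
    .sum()
    .rename(columns={"amount": "total"})
    .sort_values("region")
    .reset_index(drop=True)
)
pandas_result = list(zip(pandas_frame["region"], pandas_frame["total"]))

print(sql_result)
print(pandas_result)
```

Both versions compute the per-region totals; the difference is only in how
much ceremony it takes to say so.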

~~~
wesm
Having both designed pandas and suffered from some of the rough edges of
pandas when used in a database-like way, I tried to be thoughtful in designing
the Ibis API to be semantically complete with SQL (i.e. so you shouldn't
_have_ to write any SQL) and a great deal more productive and reusable than
SQL. I suggest you give it a serious try before passing judgment!

------
ForHackernews
How is this different/better than PySpark?
[https://spark.apache.org/docs/latest/programming-
guide.html#...](https://spark.apache.org/docs/latest/programming-
guide.html#tab_python_0)

~~~
laserson
That's directly addressed here: [http://www.ibis-
project.org/faq.html](http://www.ibis-project.org/faq.html)

------
perone
What are the main differences between this architecture and Apache Spark's?
One nice advantage I can see is the Python -> LLVM IR path, but I can't see
what the main advantages over Spark are.

~~~
wesm
Makes most sense to compare Impala and Spark architecturally. Ibis will
eventually integrate with Spark. We've been focusing on Impala integration for
reasons cited here: [http://blog.cloudera.com/blog/2015/07/getting-started-
with-i...](http://blog.cloudera.com/blog/2015/07/getting-started-with-ibis-
and-how-to-contribute/)

In particular, we're working on byte-level shared-memory integration with
Impala (which is implemented in C++ with LLVM runtime codegen — the project's
tech lead, Marcel Kornacker, was the tech lead for Google F1's query engine)
to run user-defined logic without data serialization / memory usage overhead.
This also opens up Python's HPC / scientific computing stack and existing data
libraries to be run in a Hadoop setting without Python-JVM interoperability
issues.

~~~
infinite8s
Are you planning on leveraging numba, or will this be a new way to generate
LLVM IR from Python?

~~~
Lofkin
I was wondering this as well.

