Hacker News
Ibis: Scaling the Python Data Experience (ibis-project.org)
81 points by jkestelyn on July 20, 2015 | 10 comments



"A pandas-like data expression system providing comprehensive coverage of the functionality already provided by Impala. It is composable and semantically complete; if you can write it with SQL, you can write it with Ibis, often with substantially less code."

This sounds really interesting, but my biggest gripe with pandas is that I often know exactly the query I want to run in SQL, yet I have to jump through a ton of hoops and an awkward join syntax to express it in pandas. IMHO, if you want to make a data-processing language as full-featured as SQL, why not just use SQL as the query language?
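To make the gripe concrete, here is a hedged, hypothetical example (the tables and column names are made up) of the kind of join-plus-aggregate that is a single SQL statement but typically takes several chained calls (merge, groupby, rename) in pandas. Using the stdlib sqlite3 module to run the SQL:

```python
import sqlite3

# Hypothetical schema: orders joined to customers, summed per customer.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (id INTEGER, customer_id INTEGER, amount REAL);
CREATE TABLE customers (id INTEGER, name TEXT);
INSERT INTO orders VALUES (1, 1, 10.0), (2, 1, 5.0), (3, 2, 7.5);
INSERT INTO customers VALUES (1, 'alice'), (2, 'bob');
""")

# One declarative statement covers the join, the aggregation, and the sort.
rows = conn.execute("""
    SELECT c.name, SUM(o.amount) AS total
    FROM orders o JOIN customers c ON o.customer_id = c.id
    GROUP BY c.name
    ORDER BY c.name
""").fetchall()
print(rows)  # [('alice', 15.0), ('bob', 7.5)]
```

The pandas equivalent would be roughly `orders.merge(customers, left_on='customer_id', right_on='id').groupby('name')['amount'].sum()`, which is workable but harder to read back as the original query intent.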


Having designed pandas, and having suffered some of its rough edges when using it in a database-like way, I tried to be thoughtful in designing the Ibis API to be semantically complete with respect to SQL (i.e., so you shouldn't have to write any SQL) while being a great deal more productive and reusable than SQL. I suggest you give it a serious try before passing judgment!
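The core idea behind such an expression system can be sketched in a few lines. This is emphatically not the Ibis API, just a toy illustration (all class and method names here are invented) of how immutable, composable expression objects can compile down to SQL:

```python
# Toy sketch of an expression-to-SQL system. Each method returns a new
# immutable expression, so partial queries can be stored and reused.
class Table:
    def __init__(self, name, where=None, limit=None):
        self.name, self.where, self.limit = name, where, limit

    def filter(self, predicate):
        # Returns a NEW expression; the original is untouched.
        return Table(self.name, predicate, self.limit)

    def head(self, n):
        return Table(self.name, self.where, n)

    def compile(self):
        # Only at compile time is the expression tree rendered as SQL.
        sql = f"SELECT * FROM {self.name}"
        if self.where is not None:
            sql += f" WHERE {self.where}"
        if self.limit is not None:
            sql += f" LIMIT {self.limit}"
        return sql

t = Table("functional_alltypes")
expr = t.filter("int_col > 0").head(10)
print(expr.compile())
# SELECT * FROM functional_alltypes WHERE int_col > 0 LIMIT 10
```

Because expressions are values rather than strings, common subqueries can be factored out and reused, which is the productivity/reusability claim above.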


And separate from the issue of whether it's a good or bad idea, it will shortly be possible to send raw SQL queries to the backend.

https://github.com/cloudera/ibis/issues/412


How is this different/better than PySpark? https://spark.apache.org/docs/latest/programming-guide.html#...


That's directly addressed here: http://www.ibis-project.org/faq.html


What are the main differences between this architecture and Apache Spark's? The Python -> LLVM IR path looks like a nice advantage, but I can't see the main advantages over Spark.


It makes the most sense to compare Impala and Spark architecturally. Ibis will eventually integrate with Spark; we've been focusing on Impala integration for the reasons cited here: http://blog.cloudera.com/blog/2015/07/getting-started-with-i...

In particular, we're working on byte-level shared-memory integration with Impala (which is implemented in C++ with LLVM runtime codegen — the project's tech lead, Marcel Kornacker, was the tech lead for Google F1's query engine) to run user-defined logic without data serialization / memory usage overhead. This also opens up Python's HPC / scientific computing stack and existing data libraries to be run in a Hadoop setting without Python-JVM interoperability issues.
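The "no serialization / no copies" principle can be illustrated in pure Python with memoryview. This is only an analogy for the shared-memory integration described above, not its actual mechanism:

```python
# Illustrative only: zero-copy data sharing via memoryview. Slicing a
# view copies no bytes, and writes go straight through to the original
# buffer, which is the principle behind avoiding serialization overhead.
import array

buf = array.array("d", [1.0, 2.0, 3.0, 4.0])
view = memoryview(buf)   # zero-copy view over buf's bytes
half = view[2:]          # slicing a view copies nothing either
half[0] = 99.0           # writes through to the underlying buffer
print(buf.tolist())      # [1.0, 2.0, 99.0, 4.0]
```

The same idea at the Impala boundary means user-defined Python logic can operate directly on the engine's buffers instead of round-tripping through a serialization format.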


Are you planning on leveraging Numba, or will this be a new way to generate LLVM bitcode from Python?


I was wondering this also


Now I get it, thanks for the explanation, Wes. It sounds very interesting indeed. Congratulations on the project.





