Hacker News | alamb's comments

It would be amazing if the code for working with arrow on GPUs could be made open source -- I think that would drive a significant amount of adoption


So great to see another project built on DataFusion!


Thanks Andrew! I'm looking forward to contributing back to DataFusion as well.


The Apache Arrow PMC is pleased to announce the donation of the Comet project, a native Spark SQL Accelerator built on Apache Arrow DataFusion.



For completeness, the YouTube playlist is here: https://www.youtube.com/playlist?list=PLSE8ODhjZXjYzlLMbX3cR... The advanced course is wonderful; it explains how everything is implemented.


Perfect, will take a look. Thanks!


BTW you can see a version of what an industrial strength query optimizer / execution engine looks like in Rust https://arrow.apache.org/datafusion/

(can also use it in your own projects)

It is quite similar to what is described in this post


Somewhat similar, see https://substrait.io/

So for example using DuckDB with the Substrait extension, if you create a table

    create table t(a int);

and then query it as in the article, you can see something similar to what the article describes:

    CALL get_substrait_json('select * from t');

    {"relations":[{"root":{"input":{"project":{"input":{"read":{"baseSchema":{"names":["a"],"struct":{"types":[{"i32":{"nullability":"NULLABILITY_NULLABLE"}}],...

The DuckDB extension doesn't seem to cover any DDL operations, though.

https://duckdb.org/docs/extensions/substrait
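
To make the shape of that JSON a bit more concrete, here is a minimal sketch in Python (standard library only) that walks a plan of this form from the root down to the leaf. The plan string is a hand-written, hypothetical fragment that mirrors only the nesting visible above (root, project, read); it is not real DuckDB output, the elided parts are simply left out, and the helper name `walk` is made up for illustration. Real Substrait plans carry many more fields and relation types than this handles.

```python
import json

# Hypothetical, hand-written fragment mirroring the nesting shown above.
# NOT real DuckDB output: everything elided in the original is omitted.
PLAN_JSON = """
{"relations": [{"root": {"input": {"project": {"input": {"read": {
  "baseSchema": {"names": ["a"],
                 "struct": {"types": [{"i32": {"nullability": "NULLABILITY_NULLABLE"}}]}}
}}}}}}]}
"""


def walk(rel):
    """Yield (operator_name, body) pairs from the top of the tree to the leaf.

    Assumes each relation is a single-key object such as {"project": {...}}
    whose body may hold one child under "input" (a simplification).
    """
    while rel:
        name, body = next(iter(rel.items()))
        yield name, body
        rel = body.get("input")


plan = json.loads(PLAN_JSON)
root_input = plan["relations"][0]["root"]["input"]

# Operators in the order they appear, top (output side) to bottom (scan side).
operators = [name for name, _ in walk(root_input)]

# Column names come from the schema attached to the leaf "read" relation.
columns = [body["baseSchema"]["names"]
           for name, body in walk(root_input) if name == "read"][0]

print(operators)  # ['project', 'read']
print(columns)    # ['a']
```

The point is only that the plan is a plain tree of relations, with the scan at the bottom and each operator wrapping its input.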

Some other related discussions and links that I've collected over the years:

https://news.ycombinator.com/item?id=37415494

https://news.ycombinator.com/item?id=34233697

https://news.ycombinator.com/item?id=31981568

https://datastation.multiprocess.io/blog/2022-04-11-sql-pars...

https://tomassetti.me/parsing-sql/


After a quick look, I'm not sure if I would call this “industrial strength”. In particular, the join optimizer (typically the heart of a large-scale SQL optimizer) looks very rudimentary? And the statistics it uses have zero idea about correlation, no histograms beyond min/max…
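
To illustrate the correlation point in general terms, here is a toy sketch in Python. It is not DataFusion's code or its estimator; the table, column values, and helper name are all hypothetical. It shows how multiplying per-column selectivities (the usual independence assumption) misestimates a conjunctive filter when two columns are correlated.

```python
# Hypothetical toy table of (make, model) rows in which `model` is fully
# determined by `make`, i.e. the two columns are perfectly correlated.
rows = [("Honda", "Civic")] * 50 + [("Toyota", "Corolla")] * 50


def selectivity(pred):
    """Fraction of rows that satisfy the predicate."""
    return sum(1 for r in rows if pred(r)) / len(rows)


sel_make = selectivity(lambda r: r[0] == "Honda")   # 0.5
sel_model = selectivity(lambda r: r[1] == "Civic")  # 0.5

# Independence assumption: treat the two predicates as unrelated and multiply.
estimated_rows = sel_make * sel_model * len(rows)

# Ground truth: every Honda row is also a Civic row.
actual_rows = sum(1 for r in rows if r[0] == "Honda" and r[1] == "Civic")

print(estimated_rows)  # 25.0
print(actual_rows)     # 50
```

Here the estimate is off by a factor of two on a single table; errors like this tend to compound across the joins in a larger plan, which is why join ordering is sensitive to the quality of the statistics.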


I was wondering about the same claim. However, I believe that JOINs are a common weakness among OLAP database engines, and DataFusion is built on top of a columnar storage format - Apache Arrow.


By being columnar, I guess you could say DataFusion has a good executor, but no, not a good optimizer.


Not that I was trying to make any of those claims but just trying to correlate the domain with what appears to be a common problem in it.


There's also this pretty detailed article on StarRocks' query optimizer. (StarRocks is open source - focused on OLAP)

https://medium.com/starrocks-engineering/starrocks-inside-sc...



The following paper describes some of the tradeoffs between different formats

Deep Dive into Common Open Formats for Analytical DBMSs https://www.vldb.org/pvldb/vol16/p3044-liu.pdf


I do think it was important for DuckDB to put out a new version of the results, as the earlier version of that benchmark [1] went dormant while still using a very old version of DuckDB with very bad performance, especially against Polars.

[1] https://h2oai.github.io/db-benchmark/


DuckDB is a great piece of software.

If you are looking for a query engine implemented in a safe language (Rust), I definitely suggest checking out DataFusion. It is comparable to DuckDB in performance, has all the standard built-in SQL functionality, and is extensible in pretty much all areas (query language, data formats, catalogs, user-defined functions, etc.).

https://arrow.apache.org/datafusion/

Disclaimer: I am a maintainer of DataFusion.


Here is another blog post that offers some perspective on the growth of Arrow over the intervening years and future directions: https://www.datawill.io/posts/apache-arrow-2022-reflection/


That's really good, thanks. Better than my blog, actually - the author has a much deeper understanding and I learned a lot by reading it. I was coming at it from the perspective of someone very confused by Arrow, and wrote the blog to help myself understand it!


For completeness, FlightSQL [1] (as mentioned elsewhere in this thread) aims to provide such an HTTP-based protocol.

[1] https://arrow.apache.org/blog/2019/10/13/introducing-arrow-f...

