Arrow DataFusion includes Ballista, which does SIMD and GPU vectorized ops (github.com/apache)
2 points by westurner 86 days ago | 2 comments

From the Ballista README:

> How does this compare to Apache Spark? Ballista implements a similar design to Apache Spark, but there are some key differences.

> - The choice of Rust as the main execution language means that memory usage is deterministic and avoids the overhead of GC pauses.

> - Ballista is designed from the ground up to use columnar data, enabling a number of efficiencies such as vectorized processing (SIMD and GPU) and efficient compression. Although Spark does have some columnar support, it is still largely row-based today.

> - The combination of Rust and Arrow provides excellent memory efficiency and memory usage can be 5x - 10x lower than Apache Spark in some cases, which means that more processing can fit on a single node, reducing the overhead of distributed compute.

> - The use of Apache Arrow as the memory model and network protocol means that data can be exchanged between executors in any programming language with minimal serialization overhead.
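To make the columnar point in the README concrete, here is a minimal sketch in plain Rust of a row-based layout next to a column-based one. It does not use the Arrow or DataFusion crates; `TripRow`, `TripColumns`, `to_columns` and the other names are hypothetical and exist only for this illustration. The idea is that an aggregate over one field reads a single contiguous buffer in the columnar layout, which is the kind of access pattern SIMD kernels and compression schemes rely on, while the row layout strides past fields the query never touches.

```rust
// Row-oriented layout: each record keeps all of its fields together.
// (Hypothetical type for illustration, not part of Arrow or Ballista.)
#[derive(Debug, Clone, PartialEq)]
struct TripRow {
    id: u32,
    passengers: u32,
    fare_cents: u64,
}

// Column-oriented layout: one contiguous buffer per field.
#[derive(Debug, Default, PartialEq)]
struct TripColumns {
    id: Vec<u32>,
    passengers: Vec<u32>,
    fare_cents: Vec<u64>,
}

// Pivot a slice of rows into per-field columns.
fn to_columns(rows: &[TripRow]) -> TripColumns {
    let mut cols = TripColumns::default();
    for r in rows {
        cols.id.push(r.id);
        cols.passengers.push(r.passengers);
        cols.fare_cents.push(r.fare_cents);
    }
    cols
}

// Row-based aggregate: walks whole records to read one field from each.
fn total_fare_rows(rows: &[TripRow]) -> u64 {
    rows.iter().map(|r| r.fare_cents).sum()
}

// Columnar aggregate: a tight loop over one dense slice of u64 values.
fn total_fare_columns(cols: &TripColumns) -> u64 {
    cols.fare_cents.iter().sum()
}

// Small fixed data set used in the usage note.
fn sample_rows() -> Vec<TripRow> {
    vec![
        TripRow { id: 1, passengers: 1, fare_cents: 1000 },
        TripRow { id: 2, passengers: 2, fare_cents: 2000 },
        TripRow { id: 3, passengers: 4, fare_cents: 3000 },
    ]
}
```

Calling `total_fare_rows(&sample_rows())` and `total_fare_columns(&to_columns(&sample_rows()))` gives the same total; only the memory layout being scanned differs. Whether the columnar loop is actually vectorized depends on the compiler and target, so treat this as a picture of the layout, not as a benchmark.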

Previous discussion, from when Ballista lived in its own repo, separate from arrow-datafusion: "Ballista: Distributed compute platform implemented in Rust using Apache Arrow" https://news.ycombinator.com/item?id=25824399
