I work on Daft and we’ve been collaborating with the team at Amazon to make this happen for about a year now!
We love Ray, and are excited about the awesome ecosystem of useful + scalable tools that run on it for model training and serving. We hope that Daft can complement the rest of the Ray ecosystem to enable large scale ETL/analytics to also run on your existing Ray clusters. If you have an existing Ray cluster setup, you absolutely should have access to best-in-class ETL/analytics without having to run a separate Spark cluster.
Also, on the nerdier side of things - the primitives that Ray provides give us a real opportunity to build a solid non-JVM-based, vectorized distributed query engine. We’re already seeing very good performance improvements over Spark, and are really excited about some of the upcoming work to get even better performance and memory stability.
This collaboration with Amazon really battle-tested our framework :) happy to answer any questions if folks have them.
Good to see you here! It's been great working with Daft to further improve data processing on Ray, and the early results of incorporating Daft into the compactor have been very impressive. Also agree with the overall sentiment here that Ray clusters should be able to run best-in-class ETL without requiring a separate cluster maintained by another framework (Spark or otherwise). This also creates an opportunity to avoid many inefficient, high-latency cross-cluster data exchange ops often run out of necessity today (e.g., through an intermediate cloud storage layer like S3).
There’s a lot of interesting work happening in this area (see: XTable).
We are building a Python distributed query engine and share a lot of the same frustrations… in fact, until quite recently most of the table formats only had JVM client libraries, so integrating them natively with Daft was really difficult.
We finally managed to get read integrations across Iceberg/DeltaLake/Hudi recently as all 3 now have Python/Rust-facing APIs. Funny enough, the only non-JVM implementation of Hudi was contributed by the Hudi team and currently still lives in our repo :D (https://github.com/Eventual-Inc/Daft/tree/main/daft/hudi/pyh...)
These libraries still lag behind their JVM counterparts though, so it will be a while before we see full support for each table format’s feature set. But we’re definitely seeing a large appetite for working with table formats outside of the JVM ecosystem (e.g. in Python and Rust).
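If it helps to see what the non-JVM path looks like in practice, here’s a minimal sketch of reading each format from Python with Daft. The bucket/table names are placeholders and exact reader options vary by version:

    import daft

    # Delta Lake and Hudi readers take a table URI directly (placeholder paths).
    delta_df = daft.read_deltalake("s3://my-bucket/my_delta_table")
    hudi_df = daft.read_hudi("s3://my-bucket/my_hudi_table")

    # The Iceberg reader accepts a PyIceberg table loaded from a catalog
    # (catalog and table names are placeholders).
    from pyiceberg.catalog import load_catalog
    catalog = load_catalog("my_catalog")
    iceberg_df = daft.read_iceberg(catalog.load_table("db.my_iceberg_table"))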
Interesting. Daft currently does validation on types/names only at runtime. The flow looks like:
1. Construct a dataframe (performs schema inference)
2. Access the (now well-typed) columns and run operations on them, with the associated validations.
Unfortunately, step (1) can only happen at runtime rather than at type-checking time, since it requires running schema inference. Step (2) relies on step (1) because the expressions of computation are "resolved" against those inferred types.
However, if we can fix (1) to happen at type-checking time using user-provided type-hints in place of the schema inference, we can maybe figure out a way to propagate this information through to mypy.
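To make that runtime flow concrete, here’s a rough sketch (the file path and column name are placeholders):

    import daft

    # Step (1): constructing the dataframe infers the schema at runtime,
    # so column names/types are not visible to a static checker like mypy.
    df = daft.read_parquet("s3://my-bucket/data.parquet")  # placeholder path
    print(df.schema())  # inferred column names and dtypes

    # Step (2): expressions are resolved against that inferred schema, so a
    # misspelled column or a type mismatch is only caught here, at runtime.
    df = df.with_column("price_doubled", daft.col("price") * 2)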
Would love to continue the discussion further as an Issue/Discussion on our Github!
2. During a global shuffle stage (e.g. sorts, joins, aggregations), network transfer of data between nodes becomes the bottleneck.
This is why the advice is often to stick with a local solution such as DuckDB, Polars or Pandas if you can keep vertically scaling!
However, horizontally scaling does have some advantages:
- Higher aggregate network bandwidth for performing I/O with storage
- Auto-scaling to your workload’s resource requirements
- Scaling to large workloads that may not fit on a single machine. This is especially common in Daft usage, since we also work with multimodal data for ML, such as images and tensors.
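For a concrete (hypothetical) sketch of that horizontally-scaled setup, running Daft on an existing Ray cluster and triggering a global shuffle looks roughly like this; the cluster address, bucket and column names are placeholders:

    import daft

    # Point Daft at an existing Ray cluster (placeholder address).
    daft.context.set_runner_ray(address="ray://my-cluster:10001")

    # Reads are spread across nodes, so aggregate network bandwidth to S3
    # scales with cluster size.
    df = daft.read_parquet("s3://my-bucket/events/*.parquet")

    # A global sort repartitions data across nodes - this is the shuffle
    # stage where inter-node network transfer becomes the bottleneck.
    df = df.sort("timestamp")
    df.write_parquet("s3://my-bucket/events_sorted/")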
Hello! Daft developer here. The benchmarks we performed aren’t directly comparable to the benchmarks on TPC-H’s own page because of differences in hardware, storage etc.
For hardware, we were using AWS i3.2xlarge machines in a distributed cluster. And on the storage side we are reading Parquet files over the network from AWS S3. This is most representative of how users run query engines like Daft.
The TPC-H benchmarks are usually performed on databases which have pre-ingested the data into a single-node server-grade machine that’s running the database.
Note that Daft isn’t really a “database”, because we don’t have proprietary storage. Part of the appeal of using query engines like Daft and Spark is being able to read data “at rest” (as Parquet, CSV, JSON etc). However, this will definitely be slower than a database which has pre-ingested the data into indexed storage and proprietary formats!
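As a small illustrative sketch of what “at rest” means here (paths are placeholders), each reader scans the files where they already live, with no ingestion step:

    import daft

    parquet_df = daft.read_parquet("s3://my-bucket/data/*.parquet")
    csv_df = daft.read_csv("s3://my-bucket/data/*.csv")
    json_df = daft.read_json("s3://my-bucket/data/*.jsonl")  # line-delimited JSON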
Hello! Daft developer here - we don’t directly use Polars as an execution engine, but parts of the codebase (e.g. the expressions API) are heavily influenced by Polars code and hence you may see references to Polars in those sections.
We do have a dependency on the Arrow2 crate, like Polars does, but it was recently deprecated, so both projects are having to deal with that right now.
Spent some time diving into the Apache Parquet file format, which was surprisingly complicated and nuanced.
There's lots of lore/history in the versioning of the format's various features, and I put together a post to share some of the things I learned by browsing the issues/mailing list and talking to folks from the Parquet community.