Dremio looks very interesting indeed. What would you recommend for interacting with Arrow with more control, as a library? I'm interested in creating new Arrow-based data sources, not using it as an intermediary to other data sources.
On a side note - what other products/projects did you mean?
The Arrow project itself is a set of libraries. One of the things we'll do is try to add more algorithms to it over time, so that if you want, say, a fast Arrow sort or Arrow predicate application, you can get it from the library. Full SQL is always far more complex and I can't see the project itself .
The engine inside of Dremio is something we call Sabot (a shoe for modern arrows, see "sabot round" on Wikipedia). We hope to make it modular enough one day to use as a library, but it isn't there yet.
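To make "Arrow sort" and "Arrow predicate application" a bit more concrete, here is a minimal sketch of what such column-level algorithms do. It is plain Python over the standard-library array module, not the Arrow API; the table contents and the helper names (apply_predicate, sort_indices, take) are made up for illustration.

```python
from array import array

# Hypothetical columnar "table": one typed, contiguous array per column.
columns = {
    "id": array("q", [1, 2, 3, 4, 5]),
    "price": array("d", [9.5, 3.0, 7.25, 1.0, 4.5]),
}

def apply_predicate(cols, name, pred):
    """Return the row indices where pred holds on column `name`."""
    return [i for i, v in enumerate(cols[name]) if pred(v)]

def sort_indices(cols, name, indices):
    """Order the given row indices by the values of column `name`."""
    col = cols[name]
    return sorted(indices, key=lambda i: col[i])

def take(cols, indices):
    """Gather the selected rows from every column, keeping each typecode."""
    return {n: array(c.typecode, (c[i] for i in indices)) for n, c in cols.items()}

# Filter on one column, sort the survivors, then gather all columns once.
selected = apply_predicate(columns, "price", lambda p: p > 3.5)
ordered = sort_indices(columns, "price", selected)
result = take(columns, ordered)
```

The point of the shape is that the predicate and the sort each touch a single column and produce row indices; the other columns are only read once, in the final gather.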
In regards to your other question re projects/products: Arrow contributors are actively trying to get more adoption of Arrow as an interchange format for several systems. We've had discussions around Kudu (no serious work done yet afaik). Parquet-to-Arrow for multiple languages is now available. Arrow committers include committers from several other projects such as HBase, Cassandra, Phoenix, etc. The goal is ultimately to figure out integrations with all of them.
In most cases, these data storage systems are saddled with slow interfaces for data access. (Think row-by-row, cell-by-cell interfaces.) Arrow, among other things, allows them to communicate through a much faster mechanism (shared memory--or at least shared representation if not node local).
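A rough sketch of the difference between a cell-by-cell interface and handing over a whole column in a shared representation. This is plain Python, not Arrow; RowStore and ColumnStore are hypothetical stand-ins for a storage system, and the call counters exist only to make the interface cost visible.

```python
from array import array

class RowStore:
    """Hypothetical storage system exposing a cell-by-cell interface."""
    def __init__(self, rows):
        self._rows = rows
        self.calls = 0

    def num_rows(self):
        return len(self._rows)

    def get_cell(self, row, col):
        self.calls += 1  # one interface call per cell
        return self._rows[row][col]

class ColumnStore:
    """Hypothetical storage system that hands out whole columns as buffers."""
    def __init__(self, rows):
        # Transpose the rows into one contiguous float64 array per column.
        self._cols = [array("d", c) for c in zip(*rows)]
        self.calls = 0

    def get_column(self, col):
        self.calls += 1  # one interface call per column
        # memoryview exposes the underlying buffer without copying it.
        return memoryview(self._cols[col])

rows = [(1.0, 10.0), (2.0, 20.0), (3.0, 30.0)]

row_store = RowStore(rows)
total_row = sum(row_store.get_cell(r, 1) for r in range(row_store.num_rows()))

col_store = ColumnStore(rows)
total_col = sum(col_store.get_column(1))
```

Both paths compute the same sum, but the row interface pays one call per cell while the column interface pays one call and then reads a shared buffer directly.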
How does Dremio differ from PrestoDB? As far as I know, PrestoDB can also virtualize access to many data sources and join data between them. We didn't go deep with PrestoDB because our basic tests for multi-source joins ran very slowly, and it seemed to pull all data from both joined tables into one place. I'm not a PrestoDB expert, so maybe there's a better way to do it (all suggestions welcome).
What's the differentiator? Is Dremio smarter somehow, so that it avoids copying all the data to perform a simple join? Or does it copy the data the same way, but Arrow lets it be faster than Presto? What's on your roadmap?
PrestoDB is similar to Impala, Hive and other SQL engines. Each is designed to do distributed SQL processing. Dremio does embed an OSS distributed SQL processing engine as well (Sabot, built natively on Arrow), but we see that as only a means to an end. Our focus is much more on being a BI & data fabric/service.
At the core of this vision are: very advanced pushdowns (far beyond other OSS systems), a powerful self-service UI for managing, curating and sharing data (designed for analysts, not just engineers) and--most importantly--the first open source implementation of distributed relational caching for all types of data. You can see more details about this last part in a deck I presented at DataEngConf earlier today: https://www.slideshare.net/dremio/using-apache-arrow-calcite...
Thank you very much for a thorough response. I think I would be happy with a library without SQL support, as long as filtering and grouping were supported. Seems like that would be Sabot :) Maybe one day I'll be able to use it.