
Apache Drill 1.0 – Schema-Free SQL Query Engine for Hadoop and NoSQL - hrisp
http://drill.apache.org/
======
deorus
_._

I wish I had Drill a few years ago, but even more nowadays.

Drill and Dremel approach to data querying is really interesting in the sense
that it's the tool that changes the 'lenses' on how it sees the data, rather
than you have to re-shape the data for how would you like to see it. ++glasses
&& \--photoshop;

Thanks Drill team!

~~~
crb
It's been available as a Google cloud service since 2010:
[https://cloud.google.com/bigquery/what-is-
bigquery](https://cloud.google.com/bigquery/what-is-bigquery)

------
eranation
How does it compare with spark SQL? Does the difference lie in where the query
runs? In spark it loads all data into memory first and then uses SQL to query
it (it parses the query and converts it to a set of spark transformations,
kind of like hive). Does Drill actually translate SQL into, for example, mongo
db native queries and run it on the source data store? If so this is a
whooping amazing feat...

~~~
rxin
Actually Spark SQL doesn't load everything into memory. Its data source API
supports pushing predicates down, and if the data sources implement it, it
even supports running aggregations and joins in the data sources!

------
ShirsenduK
This will help save thousands of hours of developer time querying different
sources and generating reports for biz analytics teams of startups who know
only SQL. Kudos to the team and the project! The lonely junior developer
firing queries late at night thanks you. 😊

------
TallGuyShort
I prefer the term "schema-on-read" to "schema-free" or "schema-less". One
needs to correctly understand the implications of the fact that all data has
SOME schema and that if it changes you need to be able to read the old and
new.

~~~
kiyoto
I agree. My employer Treasure Data
([http://www.treasuredata.com/product](http://www.treasuredata.com/product))
has been offering exactly this (SQL on schema-on-read data) for the last three
years as a cloud service. But in reality, people "get" schema-less much better
than "schema-on-read" thanks to all the marketing that MongoDB folks have done
for "schemaless".

~~~
TallGuyShort
Heard a brilliant quote on this topic at a meetup last year: "If you think
your data's schemaless, what that really means is that you don't know what the
schema is, and then you _really_ have problems".

------
ted_dunning
@polskibus,

Heterogeneous can mean different things. Drill attacks problems where there is
no well defined schema. Other heterogeneous data source tools are more about
accessing different data sources, all relational.

To do what Drill does requires that SQL be changed from a statically typed
language to a dynamically typed language that generates code on the fly. Drill
also does this in parallel.

------
polskibus
How does Apache Drill compare to Power Pivot/ Power Query? As far as I
understand, both aim to solve the problem if heterogeneous data sources. What
I don't understand is how they differ.

~~~
IndianAstronaut
I'm not sure power pivot deals well with heterogeneous data sources. It rests
on OLAP services which rely on a fixed star schemas in a data warehouse. Power
pivot is just a front end for MDX(multi-dimensional expression) queries.

------
tdicola
Wow this sounds pretty interesting. Is there any python library to use it in
ipython/jupyter, etc. yet? I love pandas and the python data 'stack' but
sometimes its syntax for joining and querying data is just extremely obtuse
and odd. Many times I finding myself wishing I could query stuff like a SQL
table.

~~~
vicenteg
Check this out:
[https://github.com/cjmatta/drill_ipython_notebook](https://github.com/cjmatta/drill_ipython_notebook)

Uses the Drill ODBC driver via pyodbc and includes some visualization through
Pandas and matplotlib.

~~~
tdicola
Very cool, thanks!

------
djhworld
How does this differ from say, Apache Hive?

~~~
jaltekruse
Disclosure, I am a committer on the Apache Drill project.

Traditionally Hive has been focused on batch processing, it started as a SQL
query engine built on top of map-reduce. Unfortunately many operations in the
SQL model are not well suited for map-reduce. This is a problem that is being
addressed by the Hive team with their vectorization/tez work, which is a
completely different execution engine from map reduce. Around the time that
this work began a number of new projects were trying to solve similar
problems, making queries more interactive, providing more flexibility and
greater coverage of the SQL language (Hive has supported a healthy subset of
the SQL language, but not full support). It is true that this new Hive engine
is solving some similar problems as Drill, as is the case for Impala,
Facebook's Presto engine and a few other projects.

Drill is focused on providing standard SQL support, instead of the SQL-like
HiveQL language. This allows it to be used to hook up to the analytical tools
and ecosystem that has been developing around the SQL language for the past
few decades.

Drill also has no requirement to define schema ahead of time when querying.
This is best illustrated by the JSON read support, you can write a query
against a bare file, and Drill will discover the schema as it reads. This is
the case for simple files stored on a local or distributed file system or even
for document stores like MongoDB.

While there will certainly be cases where schema will need to be known ahead
of time, many users of Hadoop and semi-structured datastores are having a hard
time with the overhead of trying to fit a wide variety of datasets into a
traditional DB modnel, which is what the Hive metastore provides. In cases
where the number of schemas in your data is manageable, this can give
consumers of the data useful meta-data and structutre. For this reason, Drill
has support for reading from Hive tables for these types of use cases.

However while this model has its place, our discussions with current Hadoop
users has validated the design of Drill. Analysts want to be able to gain
insights from data in Hadoop, and there are cases where the overhead of
managing this data like a traditional DB is too resource intensive. Drill
enables these types of users to explore raw semi-structured data as it enters
the Hadoop cluster, instead of waiting for IT driven ETL.

------
uberneo
How about the performance , does anybody have tried this yet .. any metrics ?

~~~
ddispaltro
I'd be interested to know this too. I know they are using the Calcite[1] query
planner which uses statistics to modify a query plan, it's pretty
sophisticated. But I am curious about the actual execution time on this stuff.

------
lerchmo
Any info on the google cloud data store integration?

------
m_mueller
Sounds very interesting. Is there anything like this for CouchDB? I like couch
because it's easily deployed on mobile as well.

~~~
vicenteg
There is a couchbase plugin for Drill: [https://github.com/jacques-n/drill-
couchbase-plugin](https://github.com/jacques-n/drill-couchbase-plugin)

~~~
m_mueller
Thanks, very interesting!

------
mark_l_watson
This is sort of similar to the Dremel tool used internally at Google, right?
They might mention this on their web site.

~~~
ted_dunning
There is an inspiration lineage from Dremel, but there are also very
substantial differences. Dremel handles nested data, but doesn't like variable
schema and tends to generate code early (AFAIK). Drill handles data that
changes shape and type and generates code very late.

The idea that SQL can be applied to data that is pretty different from
traditional relational data is definitely common between the two.

------
ckluis
Wow. That looks pretty amazing.

------
aggress
Congratulations to the Apache Drill team, Drill's awesome.

------
turingbook
How does it compare with Impala？

------
tshiran
Nice

