
Apache Drill's Future - bsg75
http://mail-archives.apache.org/mod_mbox/drill-user/202005.mbox/%3c41E5B6E8-EE0E-4219-AA0D-9C1060F5EBD5@gmail.com%3e
======
agacera
Apache Drill is an interesting project. Of all the MPP engines that appeared
a few years ago, it was the most similar to BigQuery (the first public
version) and the most flexible.

However, the competition was fierce and each Big Data vendor (MapR, Cloudera and
Hortonworks) was pushing its own solution: Drill, Impala and Hive on Tez.
Competition is always a good thing, but it fragmented the user base so much
that no clear winner emerged.

At the same time, Spark SQL got sufficiently better to replace these tools in
most use cases and Presto (from Facebook) got the traction and the user base
that none of these projects had by being vendor agnostic (and its adoption by
AWS in Athena and EMR also helped boost its popularity).

~~~
qeternity
I've not spent much time, but I've never exactly understood what Presto is. Is
it just map reduce across databases?

~~~
chrisjc
"Presto is an open-source distributed SQL query engine optimized for low-
latency, ad-hoc analysis of data. It supports the ANSI SQL standard, including
complex queries, aggregations, joins, and window functions. Presto can process
data from multiple data sources including the Hadoop Distributed File System
(HDFS) and Amazon S3"

TIL that Presto is available in EMR.

~~~
ztjio
Not only that, but AWS Athena is basically serverless Presto. It's an
extremely handy tool, particularly if you've got structured or semi-structured
data being dumped into S3 and you want a near-zero-maintenance way (you only
have to create schemas) to explore it.

------
vhold
Please correct me if I have this wrong, but my vague understanding is that the
data representation heart of Apache Drill lives on in the rather active Apache
Arrow project.

[https://stackoverflow.com/questions/53533506/what-is-the-
dif...](https://stackoverflow.com/questions/53533506/what-is-the-difference-
between-apache-drills-valuevectors-and-apache-arrow)

[https://github.com/apache/arrow/commit/e6905effbb9383afd2423...](https://github.com/apache/arrow/commit/e6905effbb9383afd2423a4f86cf9a33ca680b9d)

And the platform/tools side of Drill now lives on as Dremio, which uses Apache
Arrow.

[https://github.com/dremio/dremio-oss](https://github.com/dremio/dremio-oss)

So the essence of Drill still lives, but it became half Apache project and
half vendor controlled and supported, and the root of that split is now
orphaned.

------
srl
For those interested, the relevant rules seem to be here:
[https://www.apache.org/foundation/voting.html](https://www.apache.org/foundation/voting.html)

As far as I can tell, the implication is that there are now fewer than three
people interested enough to participate in code reviews, and ASF rules require
at least three +1 votes for basically anything to happen.

~~~
rectang
That's basically the idea, though the details are subtly different.

What the ASF won't let you do if you can't muster the votes is actually make a
_release_ — that takes 3 votes from people on the Drill PMC (Project
Management Committee). If you can't get 3 PMC votes, the project cannot even
make security releases and must be retired.

As to who can commit code, from the ASF's standpoint any person with commit
rights can do so at any time. However, the project may impose additional
constraints, such as requiring a code review.

~~~
gopalv
> As to who can commit code, from the ASF's standpoint any person with commit
> rights can do so at any time. However, the project may impose additional
> constraints, such as requiring a code review.

Projects can amend their bylaws around this, but that also goes to the PMC -
lazy consensus can apply, but yeah you can't go through the source release
process without 3 binding votes.

Eventually, you realize that this sort of software is not anybody's hobby
project and takes real dollars to pay for development.

Doesn't help that Dremio would profit from Drill collapsing and leaving that
space open.

------
dmix
Some context

Drill:
[https://en.wikipedia.org/wiki/Apache_Drill](https://en.wikipedia.org/wiki/Apache_Drill)

MapR sold to Hewlett-Packard Enterprise (HPE):
[https://en.wikipedia.org/wiki/MapR](https://en.wikipedia.org/wiki/MapR)

------
cube2222
Sad to see this happen as I really like the idea!

If you're interested in Drill, check out OctoSQL[0]. It shares the same vision
of querying multiple datasources using pure SQL, and pushing down as many
operations as possible to the underlying datasource.

Moreover, a huge rewrite has been under way this past year; it's ready to use
on the master branch, though not yet released (it will be available soon).

It adds Kafka and Parquet support and, most importantly, first-class unbounded
stream support, including temporal SQL extensions for working with event-time
metadata (instead of system time), so you can use things like live-updated
time window aggregations on incoming Kafka streams. It also now uses on-disk
Badger storage as the primary way to store its state, so you can do group-bys
/ joins with lots of keys, and restarts of OctoSQL won't alter the final
result (exactly-once semantics).

Make sure to check it out, it's also very simple to get going locally!

Disclosure: I'm one of the main contributors.

[0]:[https://github.com/cube2222/octosql](https://github.com/cube2222/octosql)

~~~
bsg75
Does OctoSQL support reading columnar compressed formats (ex. Parquet) from
distributed storage (ex. S3) ?

~~~
cube2222
No, for now we only support Excel, JSON, CSV and Parquet datasources as local
files, though reading them from distributed storage is definitely planned and
would be very easy to add.

~~~
cgivre
Drill does ;-) Drill also has a streaming Excel reader that works very well
with large Excel files...

------
cgivre
As the author of the above post, I thought I'd respond to the comments below.
First, I'm glad to see that people noticed... :-)

Second, to paraphrase Monty Python, "Drill's not dead yet." There are some
efforts to get corporate sponsorship, but they will take time. HPE's
withdrawal was expected, but disappointing nonetheless. With that said, we are
still gearing up to release Drill 1.18, which will have a considerable number
of enhancements, including new formats (SPSS, HDF5, possibly SAS) as well as
new storage plugins to enable Drill to connect to Druid and to REST APIs.

Personally, I've always felt that Drill was marketed to the wrong audience. As
everyone notes, there are many competitors in the big data analytics space:
Spark, ES, Splunk, Presto, Impala to name a few. Where I see Drill filling a
rather unique niche in the market is small-to-medium-sized data analytics with
complex data.

For instance, if you have a CSV file, you use Excel. If you have 100 CSVs, you
have to code. Or what if you have an Excel spreadsheet and you want to pull
data from your corporate reference API, which happens to return JSON and uses
OAuth authentication? Drill is the kind of tool that can bridge this gap and
allow analysts to rapidly get value out of these situations.

As an example, here's a demo of me building a COVID dashboard from a REST API
and spreadsheet in about 15 min with zero data prep:
[https://youtu.be/oEOhFWm3D9A](https://youtu.be/oEOhFWm3D9A)

Another example: incident response. Let's say you have a PCAP file: use
Wireshark. What if you have 10GB of PCAP files? Most likely you'll have to
code up some solution. With Drill, however, you can query them without coding,
which means you can get value out of this data faster.

I know there are people using Drill. I know that when I demo Drill to
analysts, they love it. (Ok.. I'm biased here, but I would say that's an
accurate representation of the response to my presentations.) What Drill lost
was an active developer community. I hope that over the next few months,
Drill's users will step up a bit and contribute code reviews and/or actual
code. I've been thinking about creating a security-focused fork of Drill as
well, so we'll see what happens.

If you have ideas/comments/questions, you can email me at cgivre@apache.org.

~~~
swuecho
Your introduction is exactly how I feel about Drill. However, the official
docs are not as good as the book (Learning Apache Drill).

------
PaulHoule
How many Apache projects are there in the "Big Data" space? It seems every
time I look around I see a new one.

~~~
chrisjc
49 according to the project list page, some of which are in the Attic and some
in incubation.

[https://projects.apache.org/projects.html?category#big-
data](https://projects.apache.org/projects.html?category#big-data)

Were you looking for a single project for all your big-data needs?

~~~
catawbasam
Looking for some continuity after those splashy Strata presentations.

~~~
TallGuyShort
Strata is very corporate. If you want what you see in Strata presentations,
buy it from the vendors (Cloudera, Microsoft, etc.)

What you see on apache.org is what gets put in presentations at say, the
Hadoop Summit.

edit: On a more helpful note, Apache Spark is probably as close as you can get
to a single project for all your big data needs if that is what one wants out-
of-the-box from an open-source project. It includes a SQL framework, streaming
framework, either bundles or improves upon more general work done in Hadoop,
etc. It can be pretty vendor-controlled at times, but its birth was in
academia, making it pretty different from the other projects that were mostly
born as components in already established commercial platforms. There are pros
and cons to that, of course.

------
moandcompany
This is unfortunate.

For anyone curious about Apache Drill, it was inspired by Dremel, which was
used internally at Google and was once the engine beneath GCP's BigQuery.

[https://static.googleusercontent.com/media/research.google.c...](https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/36632.pdf)

In the era of Hadoop, it was most closely aligned with the MapR distribution,
while Hortonworks aligned with Hive, and Cloudera offered Impala as their
solution.

History of Apache Drill: [http://radar.oreilly.com/2015/09/apache-drill-
tracking-its-h...](http://radar.oreilly.com/2015/09/apache-drill-tracking-its-
history-as-an-open-source-community.html)

~~~
pulisse
> Dremel which was [...] once the engine beneath GCP's BigQuery

Is there a white paper describing what replaced Dremel in BigQuery's
architecture?

~~~
moandcompany
Clarification -- There's the Dremel paper, the Google-internal implementation,
and BigQuery, a public-facing implementation that evolved from Dremel :)

Here are some interesting reads on BigQuery:
[https://cloud.google.com/files/BigQueryTechnicalWP.pdf](https://cloud.google.com/files/BigQueryTechnicalWP.pdf)

Also, Happy 10th Birthday, BigQuery!
[https://cloud.google.com/blog/products/data-
analytics/bigque...](https://cloud.google.com/blog/products/data-
analytics/bigquery-turns-10)

------
greendisc7
Check out Dremio. The interface and speed were awesome when I tested on some
pretty large datasets.

------
kkwteh
I was thinking of using Apache Drill for converting CSVs to Parquet. What else
do people use for that?

~~~
lurker458
I've also been looking for that. In an ideal world there would be a small,
fast, standalone cli tool that can convert csv to parquet. There is a (sadly,
unfinished) parquet writer Rust library in the Arrow repository that looks
promising. All approaches I've tried so far (spark, pyarrow, drill, ...)
require everything and the kitchen sink. So far I've settled on a java cli
tool that uses jackson + org.apache.parquet internally, but it's cpu bound and
has a huge amount of maven dependencies.

~~~
meritt
pandas + fastparquet is fairly lightweight, but yes, I would love to see a
simple C++/Golang binary that's just a simple csv2parq call.

~~~
MrPowers
Newer versions of Pandas don't even need fastparquet anymore. This code works:

    import pandas as pd

    df = pd.read_csv('data/us_presidents.csv')
    df.to_parquet('tmp/us_presidents.parquet')

~~~
meritt
Nice! Does that work alongside reading in via chunks and writing via
row_groups? If I have a 500GB CSV will it work?

------
timClicks
I'm really impressed with the mission of Drill - write SQL for disparate data
sources - but I've actually never installed it. When I have a bunch of
parquet/csv/... files sitting around, I can normally slurp them in with
pandas.

~~~
cgivre
Try it out! I'd agree that if you have 1 CSV you don't need Drill. If you have
more than one, that's where it starts to get interesting. Also, there is
PyDrill, which enables you to execute a Drill query and put the results
directly into a pandas data frame. If you're an R person, there's sergeant,
which does the same.

~~~
timClicks
Are you Charles Givre? I have your book!

~~~
cgivre
I am!! Glad to hear you liked the book!

------
chrisjc
Sad to hear this since I've been following Drill for a while. From what I
understand, Drill was based on the Google Dremel paper, hence the name.

[https://research.google/pubs/pub36632/](https://research.google/pubs/pub36632/)

Wondering if maybe Spark REPL or Apache Zeppelin might be a decent replacement
for Drill.

~~~
rhacker
One thing I like about Spark is the number of data sources it supports,
including its ability to write out to a table or whatever. If you want to set
up a JDBC or ODBC server, you can also access Spark SQL:

[https://spark.apache.org/docs/latest/sql-distributed-sql-
eng...](https://spark.apache.org/docs/latest/sql-distributed-sql-engine.html)

From an analytics perspective, a lot of people like the ability to connect to
a JDBC or ODBC source.

------
josep2
I really like Apache Drill. I used it about 4 years ago as an MPP SQL-on-
anything engine. Presto took a lot of that momentum away, and MapR's failure
to adapt to the new age kind of meant EOL for the project. I hope someone else
picks it up.

------
bradhe
I wonder how many apache projects actually survive this sort of event?

------
jerdavis
Just use bigquery.

------
SrslyJosh
Still waiting for Apache dril.

~~~
cgivre
For what?

