
Google’s Dremel Makes Big Data Look Small - arunabh
http://www.wired.com/wiredenterprise/2012/08/google-dremel-versus-hadoop/
======
kamaal
A small note: It's great to see so many new tools coming up to solve the kinds
of problems that were previously difficult or impossible to solve.

However, please check your big data use cases many times before using big
data tools, because frankly 'big data' is becoming a cool must-use tool these
days, regardless of the actual use case. I've even seen data sets as small as
10 MB being considered for big data use cases. Often this gets subjected to a
monstrously complex architecture for no good reason.

Most of these cases can be addressed and solved with a tool as simple as
SQLite. All you generally need is something like Perl with SQLite and the
ability to write simple SQL queries.
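
For example, a minimal sketch of this approach in Python (the file, table,
and column names here are all made up):

    import csv
    import sqlite3

    # Hypothetical example: load a small "big data" CSV into SQLite
    # and answer questions with plain SQL. No cluster required.
    conn = sqlite3.connect("logs.db")
    conn.execute("CREATE TABLE IF NOT EXISTS logs (ts TEXT, user TEXT, bytes INTEGER)")

    with open("logs.csv", newline="") as f:
        rows = ((r["ts"], r["user"], int(r["bytes"])) for r in csv.DictReader(f))
        conn.executemany("INSERT INTO logs VALUES (?, ?, ?)", rows)
    conn.commit()

    # Top users by traffic, one GROUP BY away.
    for user, total in conn.execute(
            "SELECT user, SUM(bytes) FROM logs GROUP BY user ORDER BY 2 DESC LIMIT 10"):
        print(user, total)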

People get deceived very easily: when they look at GB-scale XML files, they
think that's what big data is. Yet most of that data goes easily into a
traditional RDBMS, and the performance is generally within acceptable limits.
Markup eats a lot of space. When the data is converted to flat file structures
like CSVs or TSVs and then imported into an RDBMS, it is far smaller; I've
sometimes seen a 10x difference.
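
A sketch of that kind of conversion in Python (the XML layout and tag names
are invented for illustration):

    import csv
    import xml.etree.ElementTree as ET

    # Assumed input shape: <records><record><ts>...</ts><user>...</user>
    # <bytes>...</bytes></record>...</records>. Dropping the repeated
    # tags in favour of a flat TSV is where most of the savings come from.
    tree = ET.parse("data.xml")
    with open("data.tsv", "w", newline="") as out:
        w = csv.writer(out, delimiter="\t")
        w.writerow(["ts", "user", "bytes"])
        for rec in tree.getroot().iter("record"):
            w.writerow([rec.findtext("ts"), rec.findtext("user"), rec.findtext("bytes")])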

Another annoying thing is the abuse of NoSQL databases. Perfectly relational
data is denormalized and force-fed into NoSQL databases, and the data access
interfaces are generally bad, buggy partial implementations of SQL.

It's almost as if people who don't understand SQL are condemned to implement
it badly.

~~~
fkdjs
"Woops you want to query that column? Better wait 12 hours while we add an
index to petabytes of data! Oh shoot, and that column too? Another 12 hours,
and you have to use both columns in the where to use the index. "

If you are storing XML files in your column-oriented DB then you should be
shot. I imagine protocol buffers or something like that would be smarter, IMO.

Your other stuff is way off base as well, but I'm lazy.

~~~
garysieling
He's talking about people misapplying big data tools to things which are
clearly not big data - there are a lot of applications that don't have, and
never will have, petabytes of data.

The point about XML is comparing the on-disk storage size of the same data in
an XML file vs. the equivalent data in a database - not XML stored as-is in a
database.

~~~
fkdjs
Which is a false dichotomy: if you use the right storage mechanism for your
column-oriented DB, the comparable relational DB's storage will actually be
larger, since it has to index every column to provide the equivalent of what
you can do with column-oriented storage. Also, another myth he brings up is
that you can't have complex joins with column-oriented DBs. It's completely
possible.

~~~
garysieling
For practical purposes, you usually don't index every column. The index is a
value-add for performance, which as far as I've seen requires custom
implementation in map-reduce databases; when it is implemented, you would have
the same storage problem. As an example, you can look at Common Crawl, which
has a public index of web page data. They provide a Hadoop database of page
source and a smaller dataset of page text. The page-text database serves a
similar function to an index: using the text dataset instead of the full HTML
would be like an "Index Scan" in database optimizer terms.

I don't think he said you can't do complex joins; he said people tend to
denormalize the data before putting it into NoSQL databases.

~~~
fkdjs
If you compare the two, then you must compare apples with apples. That is, you
must compare against relational DBs where every column is indexed, since with
column-oriented DBs you can search by arbitrary columns without having to
worry about which column is indexed. You can say that in practice you don't
need this, but then you reduce functionality and you're no longer comparing
apples to apples. Besides, it's nice to search by any column; just because
relational DBs limit you doesn't mean it's not useful. Map-reduce databases
are something entirely different, although you can implement joins via
map-reduce if need be.

You denormalize because, among other things, you don't have complex joins.
With column-oriented DBs that can perform joins, denormalization is not
necessary. NoSQL is something entirely different.
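
To make that concrete, here's a toy sketch (schema and data invented) of the
normalized version with a join; without join support you'd have to copy the
country name into every user row instead:

    import sqlite3

    # Normalized schema plus a join - no denormalization needed when
    # the store can join. All names and rows here are made up.
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE countries (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, country_id INTEGER);
        INSERT INTO countries VALUES (1, 'US'), (2, 'IN');
        INSERT INTO users VALUES (1, 'alice', 1), (2, 'bob', 2);
    """)
    for row in conn.execute(
            "SELECT u.name, c.name FROM users u JOIN countries c ON u.country_id = c.id"):
        print(row)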

------
d99kris
Link to the paper describing Dremel [PDF]:
[http://static.googleusercontent.com/external_content/untrust...](http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en/us/pubs/archive/36632.pdf)

~~~
tonfa
And there's already a potentially better column-oriented datastore used at
Google: <http://vldb.org/pvldb/vol5/p1436_alexanderhall_vldb2012.pdf>

------
egillie
There's an Apache version of this in the works: [http://www.itworld.com/big-datahadoop/290026/new-apache-proj...](http://www.itworld.com/big-datahadoop/290026/new-apache-project-will-drill-big-data-near-real-time)

~~~
moondowner
Nice - I see that MapR is an initiator of this project, which can be seen as a
good thing.

Here's the direct link to the project proposal:

<https://wiki.apache.org/incubator/DrillProposal>

"Drill is a distributed system for interactive analysis of large-scale
datasets, inspired by Google's Dremel."

------
iskander
If I remember correctly, BigQuery only lets you import data via local CSV
files, uploaded one at a time. That makes importing data sets of any
meaningful size quite a pain.

~~~
bockris
Oddly enough, I found out about this restriction just yesterday. I wanted to
play around with Google's ngram dataset. Amazon hosts it for free in S3,
making it a no-brainer to run in EC2 or EMR.

Google doesn't have it available in Google Cloud Storage and I would burn a
ton of bandwidth and time just to get it there.

[http://stackoverflow.com/questions/11990839/is-there-a-way-t...](http://stackoverflow.com/questions/11990839/is-there-a-way-to-upload-to-cloudstorage-from-a-url)

This makes it too difficult to kick the tires on BigQuery, IMO. A file that
will upload in a reasonable amount of time is more or less a toy dataset, and
if it's big enough to be a valid test, you will probably spend days trying to
get it uploaded.

~~~
proppy
There is publicdata.samples.trigrams in the public datasets.

~~~
bockris
Yeah, I left a comment on the SO post noting that and my need for bigrams
rather than trigrams.

------
peterwwillis
Every time I see a paper with Web-Scale in the title I throw up in my mouth a
little.

So they're using large numbers of nodes for parallel processing of complex
queries, with specific data segregated to individual nodes. What the fuck does
that have to do with the world-wide web, or with scaling the performance of an
application on the web?

~~~
akldfgj
Because the data is the traffic of one of the most heavily used sites on the
web, and the application is an index of the web. So there's two.

~~~
peterwwillis
So I should call my dinner utensils Cow-Scale because they can be used on
roast beef? Hey, I could call my socket wrenches Navy-Destroyer-Class-Scale
because they can be used on boats!

Dremel does not have anything to do with the web _at all_. It's just data
processing. You can use it for anything.

~~~
akldfgj
Dremel was specifically invented to solve the problem of analyzing website
logs for a web search engine. They didn't build it to study particle
accelerator traces and then throw it at weblogs later.

------
sbierwagen
"Dremel" isn't trademarked by the rotary tool folks?

~~~
nodata
Trademarks apply to domains. Dremel has a trademark on the Dremel drill, not
on anything in the whole world called Dremel.

~~~
sbierwagen
Ah, _that_ is true, but if I recall, a "fanciful mark" that's entirely made
up, like Kodak, Google, or Dremel, has to pass a higher bar. If Dremel
produced and sold a power tool named "the Google", then there might be
grounds.

~~~
aneth4
It's an internal name not used in commerce, so it is not subject to the same
liability. I'm not an expert beyond that, but you can nickname your sister
Kleenex without violating a trademark.

------
jdf
Not sure why Cloudera is part of this article; it seems like all the attention
here should be on Google and the BigQuery team.

Here is an open source project similar to Dremel:

[http://www.itworld.com/big-datahadoop/290026/new-apache-proj...](http://www.itworld.com/big-datahadoop/290026/new-apache-project-will-drill-big-data-near-real-time)

------
majorturd
From TFA "We discuss the core ideas in the context of a read-only system, for
simplicity. Many Dremel queries are one-pass aggregations; there-fore, we
focus on explaining those and use them for experiments in the next section. We
defer the discussion of joins, indexing, up-dates, etc. to future work."
Really, it takes Dremel multiple SECONDS to complete trivial massively
parallelized read queries? It must take hours for an UPDATE or JOIN then. Wake
me up when you move past the trivial, until then, enjoy your hair.

~~~
jrockway
Dremel is a query tool, not a database.

~~~
virmundi
That's true, BUT to query you need the ability to perform joins. That is what
makes raw MapReduce such a pain and even higher-level abstractions slow. I
like the idea that Dremel is showing - I even downloaded the Google paper to
read tonight - but the Apache implementation needs to have joins, otherwise
it's not a "query tool".

~~~
ag3mo
You can join on top of BigQuery with small join tables:
[https://developers.google.com/bigquery/docs/query-reference#...](https://developers.google.com/bigquery/docs/query-reference#joins)
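
Roughly what such a query looked like at the time, as a sketch (table and
column names are hypothetical, and the right-hand side of the JOIN had to be
the small table):

    # Hypothetical BigQuery (legacy SQL) join, shown as a query string.
    # The table on the right side of the JOIN had to be small.
    query = """
    SELECT logs.user_id, lookup.country_name
    FROM [mydataset.request_logs] AS logs
    JOIN [mydataset.country_lookup] AS lookup
    ON logs.country_code = lookup.country_code
    """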

~~~
majorturd
_cough_ Here, "small" means less than 8 MB of compressed data _cough_

