A small note: it's great to see so many good tools appearing to solve problems that were previously difficult or impossible to solve.
However, please examine your big data use case carefully before reaching for big data tools, because frankly 'big data' has become the cool, must-use tool these days regardless of the actual use case. I've even seen data sets as small as 10 MB treated as big data use cases, and they often get subjected to a monstrously complex architecture for no good reason.
Most of these cases can be addressed with a tool as simple as SQLite! All you generally need is something like Perl with SQLite and the ability to write simple SQL queries.
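For instance, here is a quick sketch of the kind of thing I mean, using Python's sqlite3 module instead of Perl; the database file, table, and column names are just made-up examples:

    import sqlite3

    # Load a small "big data" extract into a local SQLite file and query it.
    # "events.db" and the events table are hypothetical placeholders.
    conn = sqlite3.connect("events.db")
    conn.execute("""
        CREATE TABLE IF NOT EXISTS events (
            user_id INTEGER,
            event   TEXT,
            ts      TEXT
        )
    """)

    # One plain SQL aggregation replaces a surprising amount of cluster machinery.
    for row in conn.execute("""
        SELECT event, COUNT(*) AS n
        FROM events
        GROUP BY event
        ORDER BY n DESC
        LIMIT 10
    """):
        print(row)

    conn.close()

That is the entire "architecture" for a lot of these 10 MB cases.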
People get deceived very easily: when they look at GB-scale XML files, they think that's what big data is. Yet most of it goes quite easily into a traditional RDBMS, and the performance is generally within acceptable limits. Markup eats a lot of space; when the same data is converted to flat file structures like CSVs or TSVs and then imported into an RDBMS, the data sizes are much smaller. I've sometimes seen on the order of a 10x difference.
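To make the markup-overhead point concrete, the flattening step is usually just a streaming pass over the XML. A rough sketch in Python, where the file names and the record/field tags are invented for illustration:

    import csv
    import xml.etree.ElementTree as ET

    # Stream a large XML file and flatten each <record> element into one CSV row.
    # "records.xml", "records.csv" and the tag names are hypothetical placeholders.
    with open("records.csv", "w", newline="") as out:
        writer = csv.writer(out)
        writer.writerow(["id", "name", "value"])
        for _, elem in ET.iterparse("records.xml", events=("end",)):
            if elem.tag == "record":
                writer.writerow([
                    elem.findtext("id"),
                    elem.findtext("name"),
                    elem.findtext("value"),
                ])
                elem.clear()  # discard processed elements so memory stays flat

The resulting CSV, with all the tags stripped out, is where that roughly 10x size difference comes from.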
Another annoying thing is the abuse of NoSQL databases. Perfectly relational data gets denormalized and force-fed into NoSQL databases, and the data access interfaces are generally bad, buggy, partial reimplementations of SQL.
It's almost as if people who don't understand SQL are condemned to reimplement it badly.
See "Nobody ever got fired for using Hadoop on a cluster": “A single ‘big memory’ (192 GB) server we are using has the performance capability of approximately 14 standard (12 GB) servers. … for about an eighth of the total cost.” http://research.microsoft.com/pubs/163083/hotcbp12%20final.p...
And "The Seven Deadly Sins of Cloud Computing Research": "When designing a parallel implementation, its performance should always be compared to an optimized serial implementation, even if only for a small input data set, in order to understand the overheads involved. It is also worth considering whether distribution over multiple machines is required." https://www.usenix.org/conference/hotcloud12/seven-deadly-si...
This is one of the most sensible comments I have ever read on Big Data. I sent it to a bunch of clients and people in large organizations, and they nodded their heads vigorously.
"Woops you want to query that column? Better wait 12 hours while we add an index to petabytes of data! Oh shoot, and that column too? Another 12 hours, and you have to use both columns in the where to use the index. "
If you are storing XML files for your column-oriented DB, then you should be shot. I imagine protocol buffers or something like that would be smarter, IMO.
Your other stuff is way off base as well, but I'm lazy.
He's talking about people misapplying big data tools to things which are clearly not big data - there are a lot of applications that don't have, and never will have, petabytes of data.
The point about XML is comparing the on-disk storage size of the same data in XML files vs. the equivalent data in a database - not XML stored as-is in a database.
Which is a false dichotomy: if you use the right storage mechanism for your column-oriented DB, the comparable relational DB storage will actually be larger, since it has to index every column to provide the equivalent of what you can do with column-oriented storage. Also, another myth he brings up is that you can't have complex joins with column-oriented DBs. It's completely possible.
For practical purposes, you usually don't index every column. An index is a value-add for performance, and as far as I've seen it requires a custom implementation in map-reduce databases; when it is implemented, you have the same storage problem. As an example, look at Common Crawl, which has a public index of web page data. They provide a Hadoop database of page source and a smaller data set of page text. The page text data set serves a similar function to an index; using it instead of the full HTML would be like an "Index Scan" in database optimizer terms.
I don't think he said you can't do complex joins; he said people tend to denormalize the data before putting it into NoSQL databases.
If you compare the two, then you must compare apples with apples. That is, you must compare against relational DBs where every column is indexed, since with column-oriented DBs you can search by arbitrary columns without having to worry about which column is indexed. You can say that in practice you don't need this, but then you've reduced functionality and you're no longer comparing apples to apples. Besides, it's nice to search by any column; just because relational DBs limit you doesn't mean it isn't useful. Map-reduce databases are something entirely different, although you can do map-reduce via joins if need be.
You denormalize because, among other things, you don't have complex joins. With column-oriented DBs that can perform joins, denormalization is not necessary. NoSQL is something entirely different.
If I remember correctly, BigQuery only lets you import data via local CSV files, uploaded one at a time. That makes importing data sets of any relevant size quite a pain.
Oddly enough, I found out about this restriction just yesterday. I wanted to play around with Google's ngram dataset. Amazon hosts it for free in S3, making it a no-brainer to run in EC2 or EMR.
Google doesn't have it available in Google Cloud Storage, and I would burn a ton of bandwidth and time just to get it there.
This makes it too difficult to kick the tires on BigQuery, IMO. A file that will upload in a reasonable amount of time is more or less a toy dataset, and if it's big enough to be a valid test, you will probably spend days trying to get it uploaded.
Let me turn the question around. I have about 400 GB of data (spread over ~3000 HDF files, each containing 42 compressed columns). The only way I can imagine using BigQuery on this data is if I convert it all to CSV files (which will take a while) and upload them to Google Storage (which will also take a while). Is there any alternative?
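If it does come down to the CSV route, the conversion step itself is fairly mechanical. A sketch of what I have in mind, assuming the files are HDF5 stores that pandas can read; the glob pattern and the store key "table" are placeholders for the actual layout:

    import glob
    import pandas as pd

    # Flatten each HDF5 file into a CSV for upload.
    # "data/*.h5" and key="table" are placeholders; adjust to the real layout.
    for path in glob.glob("data/*.h5"):
        df = pd.read_hdf(path, key="table")
        df.to_csv(path.replace(".h5", ".csv"), index=False)

It's the upload of the resulting CSVs, not this loop, that I expect to be the slow part.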
Does Google charge for bandwidth from Google App Engine to this service? It might be cheaper to upload the data to GAE in a smaller form, then transform it to CSV and upload from there.
FYI - you can technically upload directly to BigQuery and not go through cloud storage. It just tends to be a bit more error-prone in normal implementations. We're working on documenting resumable upload to handle error cases better.
Every time I see a paper with Web-Scale in the title I throw up in my mouth a little.
So they're using large numbers of nodes for parallel processing of complex queries, with specific data segregated to individual nodes. The fuck does that have to do with the world-wide web or scaling the performance of an application on the web?
So I should call my dinner utensils Cow-Scale because they can be used on roast beef? Hey, I could call my socket wrenches Navy-Destroyer-Class-Scale because they can be used on boats!
Dremel does not have anything to do with the web at all. It's just data processing. You can use it for anything.
Dremel was specifically invented to solve the problem of analyzing website logs for a web search engine. They didn't build it to study particle accelerator traces and then throw it at weblogs later.
Ah, that is true, but if I recall, a "fanciful mark" that's entirely made up, like Kodak, Google, or Dremel, has to pass a higher bar. If Dremel produced and sold a power tool named "the Google", then there might be grounds.
It's an internal name not used in commerce, so it is not subject to the same liability. I'm not an expert beyond that, but you can nickname your sister Kleenex without violating a trademark.
I doubt they registered in a class covering data processing. How many consumers are going to confuse some data processing service from Google with Dremel tools?
A trademark, unless perhaps it's famous, does not cover everything under the sun, right?
Yeah - a common legal test for 'passing off' in trademark cases is exactly along those lines. It's referred to as "a moron in a hurry"[1], i.e. even a moron in a hurry wouldn't confuse the two brands.
“[While] Sagan lost the suit, Apple engineers complied with his demands anyway, renaming the project "BHA" (for Butt-Head Astronomer). Sagan promptly sued Apple for libel over the new name, claiming that it subjected him to contempt and ridicule, but lost this lawsuit as well.”
I suspect the trademark may be genericised; I know what kind of tool a "dremel" is referring to, but didn't realise it was actually a brand name until just now.
From TFA "We discuss the core ideas in the context of a read-only system, for simplicity. Many Dremel queries are one-pass aggregations; there-fore, we focus on explaining those and use them for experiments in the next section. We defer the discussion of joins, indexing, up-dates, etc. to future work." Really, it takes Dremel multiple SECONDS to complete trivial massively parallelized read queries? It must take hours for an UPDATE or JOIN then. Wake me up when you move past the trivial, until then, enjoy your hair.
That's true, BUT to query you need the ability to perform joins. That is what makes raw MapReduce such a pain and even higher-level abstractions slow. I like the idea that Dremel demonstrates (I even downloaded the Google paper to read tonight), but the Apache implementation needs to have joins, otherwise it's not a "query tool".