A small note: it's great to see so many good tools appearing to solve problems that were previously difficult or impossible to solve.
However, please examine your big data use case carefully before reaching for big data tools, because frankly 'big data' has become the cool, must-use tool these days regardless of the actual use case. I've even seen data sets as small as 10 MB treated as big data use cases, and they often get subjected to a monstrously complex architecture for no good reason.
Most of these cases can be addressed with a tool as simple as SQLite! All you generally need is something like Perl with SQLite and the ability to write simple SQL queries.
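For instance, here is a quick sketch of the kind of thing I mean, using Python's sqlite3 module instead of Perl; the database file, table, and column names are just made-up examples:

    import sqlite3

    # Load a small "big data" extract into a local SQLite file and query it.
    # "events.db" and the events table are hypothetical placeholders.
    conn = sqlite3.connect("events.db")
    conn.execute("""
        CREATE TABLE IF NOT EXISTS events (
            user_id INTEGER,
            event   TEXT,
            ts      TEXT
        )
    """)

    # One plain SQL aggregation replaces a surprising amount of cluster machinery.
    for row in conn.execute("""
        SELECT event, COUNT(*) AS n
        FROM events
        GROUP BY event
        ORDER BY n DESC
        LIMIT 10
    """):
        print(row)

    conn.close()

That is the entire "architecture" for a lot of these 10 MB cases.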
People get deceived very easily: when they look at GB-scale XML files, they think that's what big data is. Yet most of it goes quite easily into a traditional RDBMS, and the performance is generally within acceptable limits. Markup eats a lot of space; when the same data is converted to flat file structures like CSVs or TSVs and then imported into an RDBMS, the data sizes are much smaller. I've sometimes seen on the order of a 10x difference.
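To make the markup-overhead point concrete, the flattening step is usually just a streaming pass over the XML. A rough sketch in Python, where the file names and the record/field tags are invented for illustration:

    import csv
    import xml.etree.ElementTree as ET

    # Stream a large XML file and flatten each <record> element into one CSV row.
    # "records.xml", "records.csv" and the tag names are hypothetical placeholders.
    with open("records.csv", "w", newline="") as out:
        writer = csv.writer(out)
        writer.writerow(["id", "name", "value"])
        for _, elem in ET.iterparse("records.xml", events=("end",)):
            if elem.tag == "record":
                writer.writerow([
                    elem.findtext("id"),
                    elem.findtext("name"),
                    elem.findtext("value"),
                ])
                elem.clear()  # discard processed elements so memory stays flat

The resulting CSV, with all the tags stripped out, is where that roughly 10x size difference comes from.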
Another annoying thing is the abuse of NoSQL databases. Perfectly relational data gets denormalized and force-fed into NoSQL databases, and the data access interfaces are generally bad, buggy, partial reimplementations of SQL.
It's almost as if people who don't understand SQL are condemned to reimplement it badly.
See "Nobody ever got fired for using Hadoop on a cluster": “A single ‘big memory’ (192 GB) server we are using has the performance capability of approximately 14 standard (12 GB) servers. … for about an eighth of the total cost.” http://research.microsoft.com/pubs/163083/hotcbp12%20final.p...
And "The Seven Deadly Sins of Cloud Computing Research": "When designing a parallel implementation, its performance should always be compared to an optimized serial implementation, even if only for a small input data set, in order to understand the overheads involved. It is also worth considering whether distribution over multiple machines is required." https://www.usenix.org/conference/hotcloud12/seven-deadly-si...
This is one of the most sensible comments I have ever read on Big Data. I sent it to a bunch of clients and people in large organizations, and they nodded their heads vigorously.
"Woops you want to query that column? Better wait 12 hours while we add an index to petabytes of data! Oh shoot, and that column too? Another 12 hours, and you have to use both columns in the where to use the index. "
If you are storing XML files for your column-oriented DB, then you should be shot. I imagine protocol buffers or something like that would be smarter, IMO.
Your other stuff is way off base as well, but I'm lazy.
He's talking about people misapplying big data tools to things which are clearly not big data - there are a lot of applications that don't have, and never will have, petabytes of data.
The point about XML is comparing the on-disk storage size of the same data in XML files vs. the equivalent data in a database - not XML stored as-is in a database.
Which is a false dichotomy: if you use the right storage mechanism for your column-oriented DB, the comparable relational DB storage will actually be larger, since it has to index every column to provide the equivalent of what you can do with column-oriented storage. Also, another myth he brings up is that you can't have complex joins with column-oriented DBs. It's completely possible.
For practical purposes, you usually don't index every column. An index is a value-add for performance, and as far as I've seen it requires a custom implementation in map-reduce databases; when it is implemented, you have the same storage problem. As an example, look at Common Crawl, which has a public index of web page data. They provide a Hadoop database of page source and a smaller data set of page text. The page text data set serves a similar function to an index; using it instead of the full HTML would be like an "Index Scan" in database optimizer terms.
I don't think he said you can't do complex joins; he said people tend to denormalize the data before putting it into NoSQL databases.
If you compare the two, then you must compare apples with apples. That is, you must compare against relational DBs where every column is indexed, since with column-oriented DBs you can search by arbitrary columns without having to worry about which column is indexed. You can say that in practice you don't need this, but then you've reduced functionality and you're no longer comparing apples to apples. Besides, it's nice to search by any column; just because relational DBs limit you doesn't mean it isn't useful. Map-reduce databases are something entirely different, although you can do map-reduce via joins if need be.
You denormalize because, among other things, you don't have complex joins. With column-oriented DBs that can perform joins, denormalization is not necessary. NoSQL is something entirely different.
If I remember correctly, BigQuery only lets you import data via local CSV files, uploaded one at a time. That makes importing data sets of any relevant size quite a pain.
Oddly enough, I found out about this restriction just yesterday. I wanted to play around with Google's ngram dataset. Amazon hosts it for free in S3, making it a no-brainer to run in EC2 or EMR.
Google doesn't have it available in Google Cloud Storage, and I would burn a ton of bandwidth and time just to get it there.
This makes it too difficult to kick the tires on BigQuery, IMO. A file that will upload in a reasonable amount of time is more or less a toy dataset, and if it's big enough to be a valid test, you will probably spend days trying to get it uploaded.
Let me turn the question around. I have about 400 GB of data (spread over ~3000 HDF files, each containing 42 compressed columns). The only way I can imagine using BigQuery on this data is if I convert it all to CSV files (which will take a while) and upload them to Google Storage (which will also take a while). Is there any alternative?
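If it does come down to the CSV route, the conversion step itself is fairly mechanical. A sketch of what I have in mind, assuming the files are HDF5 stores that pandas can read; the glob pattern and the store key "table" are placeholders for the actual layout:

    import glob
    import pandas as pd

    # Flatten each HDF5 file into a CSV for upload.
    # "data/*.h5" and key="table" are placeholders; adjust to the real layout.
    for path in glob.glob("data/*.h5"):
        df = pd.read_hdf(path, key="table")
        df.to_csv(path.replace(".h5", ".csv"), index=False)

It's the upload of the resulting CSVs, not this loop, that I expect to be the slow part.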
Does Google charge for bandwidth from Google App Engine to this service? It might be cheaper to upload the data to GAE in a smaller form, then transform it to CSV and upload from there.
FYI - you can technically upload directly to BigQuery and not go through cloud storage. It just tends to be a bit more error-prone in normal implementations. We're working on documenting resumable upload to handle error cases better.
Every time I see a paper with Web-Scale in the title I throw up in my mouth a little.
So they're using large numbers of nodes for parallel processing of complex queries, with specific data segregated to individual nodes. The fuck does that have to do with the world-wide web or scaling the performance of an application on the web?
So I should call my dinner utensils Cow-Scale because they can be used on roast beef? Hey, I could call my socket wrenches Navy-Destroyer-Class-Scale because they can be used on boats!
Dremel does not have anything to do with the web at all. It's just data processing. You can use it for anything.
Dremel was specifically invented to solve the problem of analyzing website logs for a web search engine. They didn't build it to study particle accelerator traces and then throw it at weblogs later.
Ah, that is true, but if I recall, a "fanciful mark" that's entirely made up, like Kodak, Google, or Dremel, has to pass a higher bar. If Dremel produced and sold a power tool named "the Google", then there might be grounds.
It's an internal name not used in commerce, so it is not subject to the same liability. I'm not an expert beyond that, but you can nickname your sister Kleenex without violating a trademark.
I doubt they registered in a class covering data processing. How many consumers are going to confuse some data processing service from Google with Dremel tools?
A trademark, unless perhaps it's famous, does not cover everything under the sun, right?
Yeah - a common legal test for 'passing off' in trademark cases is exactly along those lines. It's referred to as "a moron in a hurry"[1], i.e. even a moron in a hurry wouldn't confuse the two brands.
“[While] Sagan lost the suit, Apple engineers complied with his demands anyway, renaming the project "BHA" (for Butt-Head Astronomer). Sagan promptly sued Apple for libel over the new name, claiming that it subjected him to contempt and ridicule, but lost this lawsuit as well.”
I suspect the trademark may be genericised; I know what kind of tool a "dremel" is referring to, but didn't realise it was actually a brand name until just now.
From TFA "We discuss the core ideas in the context of a read-only system, for simplicity. Many Dremel queries are one-pass aggregations; there-fore, we focus on explaining those and use them for experiments in the next section. We defer the discussion of joins, indexing, up-dates, etc. to future work." Really, it takes Dremel multiple SECONDS to complete trivial massively parallelized read queries? It must take hours for an UPDATE or JOIN then. Wake me up when you move past the trivial, until then, enjoy your hair.
That's true, BUT to query you need the ability to perform joins. That is what makes raw MapReduce such a pain and even higher-level abstractions slow. I like the idea that Dremel demonstrates (I even downloaded the Google paper to read tonight), but the Apache implementation needs to have joins, otherwise it's not a "query tool".